Distribution Matching Losses Can Hallucinate Features in Medical Image Translation
This paper discusses how distribution matching losses, such as those used in CycleGAN, can lead to mis-diagnosis of medical conditions when used to synthesize medical images. It seems appealing to use these new image synthesis methods for translating images from a source to a target domain because they can produce high-quality images and some do not even require paired data. However, these image translation models work by matching the translation output to the distribution of the target domain. This can cause a problem when the data provided in the target domain has an over- or under-representation of some classes (e.g. healthy or sick). When the output of an algorithm is a transformed image, there is no guarantee that all known and unknown class labels have been preserved. Therefore, we recommend that these translated images not be used for direct interpretation (e.g. by doctors), because they may lead to misdiagnosis of patients based on image features hallucinated by an algorithm that matches a distribution. However, many recent papers appear to have exactly this goal.
Keywords: distribution matching, image synthesis, domain adaptation
1 Introduction
The introduction of adversarial losses [1] made it possible to train new kinds of models based on implicit distribution matching. Recently, adversarial approaches such as CycleGAN [2], pix2pix [3], UNIT [4], Adversarially Learned Inference (ALI) [5], and GibbsNet [6] have been proposed for unpaired and paired image translation between two domains. These approaches have recently been used in medical imaging research for translating images between domains such as MRI and CT. However, there is a bias when the outputs of these models are used for interpretation. When translating images from a source domain to a target domain, these models are trained to match the target domain distribution, and in doing so they may hallucinate by adding or removing image features. This is a problem when the target distribution during training has an over- or under-representation of known or unknown labels compared to the test-time distribution. Due to this bias, we recommend that, until better solutions are proposed that maintain the vital information, such translated images not be used for medical diagnosis, since they can lead to mis-diagnosis of medical conditions. This issue deserves discussion because several recent papers perform image translation using distribution matching, and the main motivation for many of these approaches is to translate images from a source domain to a target domain so that they can later be used for interpretation (e.g. by doctors). Applications include MR to CT [7, 8], CS-MRI [9, 10], CT to PET [11], and automatic H&E staining [12].
We demonstrate the problem with a caricature example in Figure 1, where we cure cancer (in images) and cause cancer (in images) using a CycleGAN that translates between Flair and T1 MRI samples. In Figure 1(a) the model has been trained only on healthy T1 samples, which causes it to remove the tumor from the image. The model has learned to match the target distribution regardless of the features present in the input image. In the following sections, we demonstrate how these methods introduce a bias in image translation due to matching the target distribution.
We draw attention to this issue in the specific use case where the images are presented for interpretation. However, we do not aim to discourage work using these losses for data augmentation to improve the performance of a classification, segmentation, or other model.
2 Problem Statement
Our argument is that the composition of the source and target domains can bias the image transformation, causing unwanted feature hallucination. We systematically review the objective functions used for image translation in Table 1 and discuss how each exhibits this bias.
Let's first consider a standard GAN model [1] where the generator is a transformation function $G$ which maps samples $x$ from the source domain $X$ to samples in the target domain $Y$. The discriminator $D$ is trained given samples from $Y$, through which the transformation function can match the distribution of $Y$:
$$\min_G \max_D \; \mathbb{E}_{y \sim Y}[\log D(y)] + \mathbb{E}_{x \sim X}[\log(1 - D(G(x)))]$$
In order to minimize this objective, the transformation function $G$ must produce images that match real images from the distribution $Y$. There are no constraints here to force a correct mapping between $x$ and $G(x)$, so for a non-finite $X$ we can consider $x$ equivalent to the Gaussian noise typically used as a GAN input.
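To make the objective concrete, here is a minimal pure-Python sketch of the two GAN loss terms; the function names are ours (not from any library), and discriminator outputs are stood in for by probabilities:

```python
import math

def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator objective (to be maximized):
    E[log D(y)] + E[log(1 - D(G(x)))], averaged over batches of
    discriminator output probabilities."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Generator objective (to be minimized): E[log(1 - D(G(x)))].
    It is low when translated samples look real to the discriminator;
    note that no term ties the output G(x) to its input x."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# A perfectly fooled discriminator outputs 0.5 for every sample:
uninformative = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The generator is rewarded only for producing convincing members of the target distribution; nothing in its loss depends on the source image, which is exactly the property that permits hallucination.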
In order to better enforce the mapping between the domains, CycleGAN [2] extends the generator loss to include cycle consistency terms:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x \sim X}\left[\|F(G(x)) - x\|_1\right] + \mathbb{E}_{y \sim Y}\left[\|G(F(y)) - y\|_1\right]$$
Here $F$ is the inverse transformation (from $Y$ back to $X$), composed with $G$ to create a reconstruction loss that regularizes both transformations not to ignore the source image. However, this process does not guarantee that a correct mapping will be made. In order to match the target distribution, image features can be hallucinated, and the information needed to reconstruct an image in the other domain can be hidden in the output [13]. Moreover, because the source and target data are unpaired, the target distribution that the generator is trained on may even be distinct from the target distribution that corresponds to the data in the source domain (e.g. having only tumor targets while the sources are all healthy). This makes models such as CycleGAN even more prone to hallucinating features, depending on how the data in the target domain is gathered.
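The weakness of the cycle term can be shown with a toy sketch: an invertible but wrong translator pair achieves exactly the same zero cycle loss as a correct one. All names and the 1-D "images" below are illustrative:

```python
def l1(a, b):
    """Mean absolute difference between two images given as flat pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, G, F):
    """CycleGAN cycle term for one sample from each domain:
    ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# Toy 1-D "images" and a pair of invertible toy translators:
x, y = [0.0, 1.0, 2.0], [1.0, 2.0, 3.0]
G = lambda img: [p + 1.0 for p in img]   # hypothetical source -> target map
F = lambda img: [p - 1.0 for p in img]   # its inverse, target -> source
good = cycle_consistency_loss(x, y, G, F)   # perfect reconstruction: 0

# A *wrong* but still invertible mapping (it flips intensities)
# achieves the same zero cycle loss:
G_bad = lambda img: [-p - 1.0 for p in img]
F_bad = lambda img: [-p - 1.0 for p in img]
bad = cycle_consistency_loss(x, y, G_bad, F_bad)   # also 0
```

Cycle consistency only demands that the translation be undoable, not that it be faithful, so features can be added or removed as long as enough information survives to reverse the change.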
Another approach is to use a conditional discriminator [3, 14]. The intuition is that by giving the discriminator the source image $x$ as well as the transformed image $G(x)$, we can model the joint distribution. This approach requires paired examples in order to provide real source and target pairs $(x, y)$ to the discriminator. The dataset still plays a role in determining what the discriminator learns and therefore how the transformation function operates. The discriminator is trained by:
$$\max_D \; \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x \sim X}[\log(1 - D(x, G(x)))]$$
Even in the case of CondGAN, where the source and target domain distributions correspond to each other due to having paired data, the discriminator can assign more or less capacity to a feature (e.g. tumors) due to over- or under-representation of that feature in the target distribution. This can be a source of bias in how those features are translated.
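How the paired data reaches a conditional discriminator can be sketched as follows; `cond_discriminator_batches` is a hypothetical helper name, and images are stood in for by plain values:

```python
def cond_discriminator_batches(sources, targets, G):
    """Build the two batches a conditional discriminator is trained on:
    real pairs (x, y) from the paired dataset, and fake pairs (x, G(x))
    produced by the current translator. Scoring pairs rather than lone
    outputs lets the discriminator model the joint distribution, but its
    capacity is still spent according to how often each feature (e.g. a
    tumor) occurs in the training pairs."""
    real_pairs = list(zip(sources, targets))
    fake_pairs = [(x, G(x)) for x in sources]
    return real_pairs, fake_pairs

# Toy usage with scalar "images" and a hypothetical translator:
real, fake = cond_discriminator_batches([1, 2], [10, 20], lambda x: x + 100)
```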
Finally, we look at training a transformation using only an L1 loss, without any adversarial distribution matching term. With this classic approach we consider transformations based on pixel-wise error:
$$\mathcal{L}_{L1} = \mathbb{E}_{x, y}\left[\|G(x) - y\|_1\right]$$
Unlike GAN models, which match the target distribution over the entire image, L1 predicts each pixel locally given its receptive field, without the need to account for global consistency. As long as some pixels present the category of interest (e.g. tumor) in the training images, L1 can learn a mapping. However, L1 can still suffer from bias when the train and test distributions differ, e.g. when no tumor pixels are provided during training, which can be caused by new known or unknown labels appearing at test time.
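The locality argument can be made concrete with a small sketch: the (sub)gradient of the L1 loss at pixel $i$ depends only on pixel $i$, so no global distribution-matching pressure is exerted across the image. The function name is ours, not from the paper:

```python
def l1_loss_and_pixel_grads(pred, target):
    """Mean absolute error over pixels, plus its (sub)gradient with
    respect to each predicted pixel. Each gradient entry depends only
    on the corresponding pixel pair, never on the rest of the image."""
    n = len(pred)
    loss = sum(abs(p - t) for p, t in zip(pred, target)) / n
    grads = [(1.0 if p > t else -1.0 if p < t else 0.0) / n
             for p, t in zip(pred, target)]
    return loss, grads
```

Because each pixel's training signal is independent, no term rewards making the whole image look like a typical member of the target domain, which is why L1 exhibits bias only through a train/test distribution mismatch.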
With all of these approaches to domain translation, we find potential for bias from the training data, which we study specifically in the experiments below.
Table 1: Objective functions used for image translation.

| Model | Discriminator Loss (max) | Domain Transformer/Generator Loss (min) |
|---|---|---|
| GAN | $\mathbb{E}_y[\log D(y)] + \mathbb{E}_x[\log(1 - D(G(x)))]$ | $\mathbb{E}_x[\log(1 - D(G(x)))]$ |
| CycleGAN | as GAN, one discriminator per domain | GAN loss $+\, \mathbb{E}_x[\|F(G(x)) - x\|_1] + \mathbb{E}_y[\|G(F(y)) - y\|_1]$ |
| CondGAN | $\mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_x[\log(1 - D(x, G(x)))]$ | $\mathbb{E}_x[\log(1 - D(x, G(x)))]$ |
| L1 | (none) | $\mathbb{E}_{x,y}[\|G(x) - y\|_1]$ |
3 Bias Impact
We use the BRATS2013 [15] synthetic MRI dataset because we can visually inspect the presence of a tumor, it is freely available to the public, and it contains paired data for inspecting results. Our task for analysis is to transform Flair MRI images (source domain) into T1-weighted images (target domain). We start with 1700 image slices, of which 50% are healthy and 50% have tumors. We use 1400 to construct training sets for the models and hold out 300 as a test set used to check whether the transformation added or removed tumors.
In this section, we construct two training scenarios: unpaired and paired. For the CycleGAN we use an unpaired training scenario which keeps the distribution fixed in the source domain (50% healthy and 50% tumor samples) and changes the ratio of healthy to tumor samples in the target domain, to simulate how distribution matching behaves when the target distribution is unrelated to the source distribution. For the CondGAN and L1 models we use a paired training scenario where both the source and target domains have the same proportion of healthy to tumor examples, because examples must be presented as pairs to the model.
We train three models under 11 different percentages of tumor examples in the target distribution, varying from 0% to 100%. In place of a doctor to classify the transformed samples, we use an impartial CNN classifier (4 convolutional layers with ReLU and stride-2 convolutions, 1 fully connected layer with no non-linearity, and a two-way softmax output layer) which obtains 80% accuracy on the test set. The results of applying this classifier to the generated T1 samples under different target domain compositions are shown in Figure 2. As we change the composition of the target domain, we can observe the bias impact on the class of the transformed examples from the holdout test set. If matching the target distribution introduced no bias due to the composition of the samples in the target domain, the percentage of images diagnosed with a tumor would not change as we vary the target domain composition in Figure 2. We also compute the mean absolute pixel reconstruction error between the ground truth image in the target domain and the translated image. If a large feature is added or removed, it should produce a large pixel error; if the translation were perfect, the pixel error would be 0 in all cases.
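The two evaluation measures used in this section amount to the following sketch; the function names and the toy threshold classifier are illustrative stand-ins, not the actual CNN:

```python
def tumor_rate(translated_images, classifier):
    """Fraction of translated images the stand-in classifier labels as
    containing a tumor (1 = tumor, 0 = healthy). Absent any bias, this
    rate should stay flat as the training target-domain composition varies."""
    labels = [classifier(img) for img in translated_images]
    return sum(labels) / len(labels)

def mean_abs_pixel_error(translated, ground_truth):
    """Mean absolute pixel reconstruction error between a translated image
    and its paired ground-truth target; adding or removing a large feature
    shows up as a large error."""
    return sum(abs(p - q) for p, q in zip(translated, ground_truth)) / len(translated)

# Toy threshold "classifier" on flat pixel lists:
classify = lambda img: 1 if max(img) > 0.5 else 0
rate = tumor_rate([[0.9, 0.1], [0.0, 0.2], [0.7, 0.7], [0.1, 0.1]], classify)
```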
We draw the reader's attention to CycleGAN, which produces the most dramatic change in class labels: the model learns to map a balanced (tumor to healthy) source domain to an unbalanced composition in the target domain, which encourages it to add or remove features (see samples in Figure S1). This indicates that such models are subject to even more bias when the composition of features in the target domain differs from that in the source domain.
For CondGAN, the pixel error changes as the tumor/healthy composition changes, indicating a bias due to the training data composition. Perceptually, the L1 loss appears the most consistent, producing the least bias. However, it has high error when trained on 0% tumor samples and asked to translate tumor samples at test time (0% for L1 in Figure 2, bottom row, and Figure 3 (a)), which is due to a mismatch between the train and test distributions. This indicates that if images with new known or unknown labels (e.g. a new disease) are presented at test time, the model cannot transform them properly. In Figure 3 we show examples of the translated images for each model. Note how, for the GAN-based models, the tumor gradually appears and grows from left to right. L1 suffers mostly in Figure 3 (a) at 0%. Interestingly, in the 100% tumor case it can translate healthy images even though it was not trained on healthy images. We believe this is because each image contains both healthy and tumor regions, which allows the network to see healthy sub-regions and learn to translate both categories. Further samples are available in the supplementary information in Figures S2 and S3.
4 Conclusion
In this work we discussed concerns about how distribution matching losses, such as those used in CycleGAN, can lead to mis-diagnosis of medical conditions. We have presented experimental evidence that when the output of an algorithm matches a distribution, for unpaired or paired data translation, all known and unknown class labels might not be preserved. Therefore, these translated images should not be used for interpretation (e.g. by doctors) without proper tools to verify the translation process. We illustrate this problem with dramatic examples of tumors being added to and removed from MRI images. We hope that future methods will take steps to ensure that this bias does not influence the outcome of a medical diagnosis.
Acknowledgments
We thank Adriana Romero Soriano, Michal Drozdzal, and Mohammad Havaei for their valuable input and assistance on the project. This work is partially funded by a grant from the U.S. National Science Foundation Graduate Research Fellowship Program (grant number: DGE-1356104) and the Institut de valorisation des données (IVADO). This work utilized the supercomputing facilities managed by the Montreal Institute for Learning Algorithms, NSERC, Compute Canada, and Calcul Québec.
References
1. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks. In: Neural Information Processing Systems (2014)
2. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In: International Conference on Computer Vision (2017)
3. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-Image Translation with Conditional Adversarial Networks. In: Computer Vision and Pattern Recognition (2017)
4. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised Image-to-Image Translation Networks. In: Neural Information Processing Systems (2017)
5. Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A.: Adversarially Learned Inference. In: International Conference on Learning Representations (2017)
6. Lamb, A., Hjelm, D., Ganin, Y., Cohen, J.P., Courville, A., Bengio, Y.: GibbsNet: Iterative Adversarial Inference for Deep Graphical Models. In: Neural Information Processing Systems (2017)
7. Wolterink, J.M., Dinkla, A.M., Savenije, M.H., Seevinck, P.R., van den Berg, C.A., Išgum, I.: Deep MR to CT Synthesis using Unpaired Data. In: Workshop on Simulation and Synthesis in Medical Imaging (2017)
8. Nie, D., Trullo, R., Petitjean, C., Ruan, S., Shen, D.: Medical Image Synthesis with Context-Aware Generative Adversarial Networks. In: Medical Image Computing and Computer-Assisted Intervention (2016)
9. Quan, T.M., Nguyen-Duc, T., Jeong, W.K.: Compressed Sensing MRI Reconstruction using a Generative Adversarial Network with a Cyclic Loss. IEEE Transactions on Medical Imaging (2018)
10. Yang, G., Yu, S., Dong, H., Slabaugh, G., Dragotti, P.L., Ye, X., Liu, F., Arridge, S., Keegan, J., Guo, Y., Firmin, D.: DAGAN: Deep De-Aliasing Generative Adversarial Networks for Fast Compressed Sensing MRI Reconstruction. IEEE Transactions on Medical Imaging (2018)
11. Ben-Cohen, A., Klang, E., Raskin, S.P., Amitai, M.M., Greenspan, H.: Virtual PET Images from CT Data Using Deep Convolutional Networks: Initial Results. In: MICCAI Workshop on Simulation and Synthesis in Medical Imaging (2017)
12. Bayramoğlu, N., Kaakinen, M., Eklund, L.: Towards Virtual H&E Staining of Hyperspectral Lung Histology Images Using Conditional Generative Adversarial Networks. In: International Conference on Computer Vision (2017)
13. Chu, C., Zhmoginov, A., Sandler, M.: CycleGAN, a Master of Steganography. In: Neural Information Processing Systems Workshop on Machine Deception (2017)
14. Mirza, M., Osindero, S.: Conditional Generative Adversarial Nets. arXiv:1411.1784 (2014)
15. Menze, B.H., Jakab, A., Bauer, S., et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging (2015)
(b) Examples with a tumor from the holdout test set