DiamondGAN: Unified Multi-Modal Generative Adversarial Networks for MRI Sequences Synthesis

Hongwei Li (equal contribution), Johannes C. Paetzold, Anjany Sekuboyina, Florian Kofler, Jianguo Zhang, Jan S. Kirschke, Benedikt Wiestler, Bjoern Menze

1. Dept. of Informatics, Technical University of Munich, Germany
2. Dept. of Neuroradiology, Klinikum rechts der Isar, Germany
3. Dept. of Computer Science and Engineering, Southern University of Science and Technology, China
4. Shenzhen Institute of Artificial Intelligence and Robotics for Society, China
5. Institute for Advanced Study, Technical University of Munich, Germany

Email: {hongwei.li, bjoern.menze}@tum.de
Abstract

Synthesizing MR imaging sequences is highly relevant in clinical practice, as single sequences are often missing or are of poor quality (e.g. due to motion). Naturally, the idea arises that a target modality would benefit from multi-modal input, as proprietary information of individual modalities can be synergistic. However, existing methods fail to scale up to multiple non-aligned imaging modalities, facing common drawbacks of complex imaging sequences. We propose a novel, scalable and multi-modal approach called DiamondGAN. Our model is capable of performing flexible non-aligned cross-modality synthesis and data infill, when given multiple modalities or any of their arbitrary subsets, learning structured information in an end-to-end fashion. We synthesize two MRI sequences with clinical relevance (i.e., double inversion recovery (DIR) and contrast-enhanced T1 (T1-c)), reconstructed from three common sequences. In addition, we perform a multi-rater visual evaluation experiment and find that trained radiologists are unable to distinguish synthetic DIR images from real ones.

1 Introduction

In clinical practice, magnetic resonance imaging (MRI) datasets often consist of high-dimensional image volumes with multiple imaging protocols and repeated scans acquired at multiple time points. Given the multiplicity of possible sequence parameters, protocols vary largely depending on the imaging center, hindering their comparability. This often leads to repeated exams or severely limits the clinical information that can be drawn from those MRI studies. Particularly in the case of multiple sclerosis, longitudinal comparisons of MRI studies are the main basis for treatment decisions, and existing lesion quantification tools require complete and identical sets of modalities at multiple time points. Cross-modality image synthesis techniques can potentially resolve those obstacles through efficient data infill and re-synthesis.

Recently, generative adversarial networks (GANs) have been applied to translating MRI sequences, positron emission tomography (PET) and computed tomography (CT) images. Most of these are one-to-one cross-modality synthesis approaches, for example, PET synthesis [12] and MRI sequence translation [3]. A recent multi-modal synthesis method [10] has limited scalability because its input and output modalities are required to be spatially aligned. Although there are several multi-domain translation algorithms [2] in the computer vision community, these approaches address one-to-multiple domain translation but do not model the multiple-to-one domain mapping. Especially in medical image synthesis, the multiple-to-one cross-modality mapping is highly relevant, as proprietary information of individual and non-aligned modalities can be synergistic.

There are three main challenges in the scenario of multi-modal cross-modality medical image synthesis: 1) the input and target modalities are assumed not to be spatially aligned, because registration methods for aligning multiple modalities may fail, which restricts the applicability of conventional regression approaches; 2) input modalities may be missing due to different clinical settings between centers, so a traditional regression-based data infill would be restricted to the smallest uniform subset or would rely on iterative data-infill methods; 3) existing approaches have limited scalability, e.g., in a CycleGAN [14] setting one would have to train individual models for all possible combinations of the input modalities.

Contributions

1) We propose DiamondGAN, a unified and scalable multi-modal generative adversarial network. It learns the multiple-to-one cross-modality mapping among non-aligned modalities using only a pair of generators and a pair of discriminators, optimized with a multi-modal cycle-consistency loss function. 2) We provide both qualitative and quantitative results on two clinically-relevant MRI sequence synthesis tasks, showing DiamondGAN’s superiority over baseline models. 3) We present the results of an extensive visual evaluation, performed by fourteen experienced radiologists, to confirm the quality of the synthetic images.

2 Methodology

2.1 Multi-Modal Cross-Modality Synthesis

Given an input set of n modalities X = {x_i | i = 1, …, n} and a target modality T, our goal is to learn a generator G that maps multiple input modalities to one target modality. We assume that 1) the modalities in X and T are not spatially aligned, because it is rather difficult to obtain strictly spatially-aligned images, as mentioned in Section 1; and 2) the input modalities can be any subset of X, denoted as X', during both the training and inference stages, since some modalities of a subject may be missing in clinical practice.

We enforce G to be capable of translating any subset X' into a target modality T using a condition c which indicates the presence of the input modalities, i.e., G(X', c) → T. This condition handles the missing-modality issue and makes the model scalable in both the training and the inference stages. We further introduce a multi-modal cycle-consistency loss to handle the "non-aligned modalities" issue between the input and the output. Fig. 1 illustrates the main idea of our proposed approach. We regularly generate the condition c and the corresponding multi-modal data X for all possible combinations, so that G learns to flexibly translate arbitrary multi-modal input. As mentioned in the caption of Fig. 1, we use an availability condition to serve as an indicator of the input modalities. It is spatially replicated to the image size and forms part of the two-stream network input. In the case of three input modalities, the condition (1, 1, 1) would indicate that every input modality is given.
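To make the role of the availability condition concrete, here is a minimal NumPy sketch (our illustration, not the released implementation; function and variable names are ours, and representing a missing modality as a zero-filled channel is our assumption):

```python
import numpy as np

def build_condition_input(slices, availability, height=240, width=240):
    """Stack the input modality slices and spatially replicate the binary
    availability vector so that both can enter the two-stream network.

    slices       -- list of 2D arrays, one per modality (None if missing)
    availability -- binary list, 1 if the corresponding modality is given
    """
    # Missing modalities become zero-filled channels (our convention; the
    # paper only specifies the binary condition vector itself).
    image_stack = np.stack(
        [s.astype(np.float32) if s is not None
         else np.zeros((height, width), dtype=np.float32) for s in slices],
        axis=0)                                            # shape (n, h, w)
    condition = np.asarray(availability, dtype=np.float32)
    condition_map = np.tile(condition[:, None, None],
                            (1, height, width))            # shape (n, h, w)
    return image_stack, condition_map

# Example: T1 and Flair available, T2 missing -> condition (1, 0, 1)
t1, t2, flair = np.random.rand(240, 240), None, np.random.rand(240, 240)
x, c = build_condition_input([t1, t2, flair], [1, 0, 1])
```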

Figure 1: Left: The high-level idea behind DiamondGAN, which is capable of learning mappings between any subset of multiple input modalities (X) and a target modality in a single model. This mapping represents a diamond-shaped topology. Right: Overview of DiamondGAN. It consists of two modules, a pair of discriminators D and a pair of generators G. (a) D1 and D2 learn to distinguish between real and synthetic images of the multi-modal input and the target output, respectively. (b) G1 takes both the multi-modal input and the condition as input and generates a target modality. The condition c is a binary vector (c_1, …, c_n), where c_i indicates whether the corresponding input modality is available (1) or not (0). It is spatially replicated and concatenated with the input modalities at the feature level. (c) G2 tries to generate the original modalities from the synthetic target modality, given the original availability condition.

2.1.1 Multi-Modal Reconstruction Loss

We aim to train G to guarantee that a generated target modality preserves the content of its input modalities. As mentioned above, the input modalities are assumed to be not spatially aligned or not even from the same subject. In this situation, the traditional cycle loss [14] as well as the regression loss [5] would fail to tackle the multi-modal and non-alignment issues. To alleviate both problems, we extend the traditional cycle-consistency loss [14] to a multi-modal one. Specifically, we concatenate the source modalities into a multi-channel input, and the reverse generator produces a multi-channel output as its target. We then simultaneously train the two generators G1 and G2 in a cycle-consistency fashion. Note that this output is in multiple channels which correspond to the input modalities. The reconstruction loss of the generators is defined as:

$\mathcal{L}_{recon} = \mathbb{E}_{X',\,c}\left[\left\| G_2(G_1(X', c),\, c) - X' \right\|_1\right]$     (1)
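A minimal PyTorch-style sketch of Eq. 1 (our illustration; the generator interfaces are assumptions based on Fig. 1, not the authors' released code):

```python
import torch

def multimodal_reconstruction_loss(g1, g2, x, cond):
    """Multi-modal cycle-consistency loss of Eq. 1.

    x    -- concatenated source modalities, tensor of shape (batch, n, h, w)
    cond -- availability condition, replicated to (batch, n, h, w)
    g1   -- generator mapping (x, cond) to the synthetic target modality
    g2   -- generator mapping (target, cond) back to the input modalities
    """
    fake_target = g1(x, cond)                  # synthesize the target modality
    cycled = g2(fake_target, cond)             # reconstruct the multi-channel input
    return torch.mean(torch.abs(cycled - x))   # L1 cycle-consistency term
```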

2.1.2 Adversarial Loss

To make the generated images indistinguishable from real images, we adopt an adversarial loss:

$\mathcal{L}_{adv} = \mathbb{E}_{T}\left[\log D_2(T)\right] + \mathbb{E}_{X',\,c}\left[\log\left(1 - D_2(G_1(X', c))\right)\right] + \mathbb{E}_{X'}\left[\log D_1(X')\right] + \mathbb{E}_{T,\,c}\left[\log\left(1 - D_1(G_2(T, c))\right)\right]$     (2)

where G1 generates a target modality G1(X', c) conditioned on the presence of the input modalities X', while D2 tries to distinguish between the real target modality and the generated one. Similarly, G2 generates the original input modalities G2(T, c) conditioned on the presence of the original input modalities, and D1 tries to distinguish between real input modalities and generated ones. The generators try to minimize this objective, while the discriminators try to maximize it.
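The following sketch (again an illustration under our assumptions, not the reference implementation) writes out the generator and discriminator terms in the least-squares form that Section 2.2.2 reports is actually used during training:

```python
import torch

def adversarial_losses(g1, g2, d1, d2, x, cond, target):
    """Generator and discriminator terms of Eq. 2 in least-squares form
    (the variant used in Sec. 2.2.2); d1/d2 return patch-wise scores."""
    fake_target = g1(x, cond)        # X' -> synthetic target modality
    fake_inputs = g2(target, cond)   # T  -> synthetic input modalities

    # Generators try to push the discriminator outputs towards "real" (1).
    loss_g = torch.mean((d2(fake_target) - 1) ** 2) \
           + torch.mean((d1(fake_inputs) - 1) ** 2)

    # Discriminators push real patches towards 1 and synthetic patches towards 0.
    loss_d = torch.mean((d2(target) - 1) ** 2) + torch.mean(d2(fake_target.detach()) ** 2) \
           + torch.mean((d1(x) - 1) ** 2) + torch.mean(d1(fake_inputs.detach()) ** 2)
    return loss_g, loss_d
```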

2.1.3 Full Objective

The objective functions to optimize D and G respectively are

$\mathcal{L}_{D} = -\mathcal{L}_{adv}, \qquad \mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda\,\mathcal{L}_{recon}$     (3)

where λ is the hyper-parameter that balances the reconstruction loss and the adversarial loss.
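Tying Eq. 3 together, a minimal sketch of how the two objectives could be assembled (λ = 10 as given in Section 2.2.2; the function name is ours):

```python
LAMBDA = 10  # lambda in Eq. 3, the value used in Sec. 2.2.2

def full_objectives(loss_g_adv, loss_d_adv, loss_recon, lam=LAMBDA):
    """Combine the terms of Eq. 3: the generators minimize the adversarial
    term plus the weighted reconstruction term; the discriminators minimize
    their own adversarial term."""
    loss_g = loss_g_adv + lam * loss_recon
    loss_d = loss_d_adv
    return loss_g, loss_d
```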

2.2 Implementation

2.2.1 Two-Stream Network Architecture

To leverage the information from both the input modalities and the corresponding availability conditions, we build a two-stream network architecture based on the popular encoder-decoder network [6]. It takes the multi-modal images and the condition as two inputs and merges them at the feature level. This network contains stride-2 convolutions, residual blocks [4] and fractionally-strided convolutions (stride 1/2). We use 6 residual blocks for an input of size n × h × w, where n, h and w are the number of modalities, the height and the width of the images, respectively. The input modalities and the availability conditions pass through two encoders and are merged in the last feature layer before the decoder. A PatchGAN [6] is used for the discriminator network; it classifies patch-level feature maps as real or fake instead of using a fully-connected layer.
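A rough PyTorch sketch of such a two-stream generator is given below. Only the overall structure (two encoders, feature-level merging, a residual bottleneck and fractionally-strided decoding) follows the description; layer widths, normalization layers and activation choices are our assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

def make_encoder(in_ch, base=32):
    # Two stride-2 convolutions downsample the input by a factor of four.
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class TwoStreamGenerator(nn.Module):
    """One encoder for the stacked modalities, one for the replicated
    availability condition; features are merged at the deepest level, passed
    through residual blocks and decoded with transposed convolutions."""
    def __init__(self, n_modalities, out_ch=1, base=32, n_res=6):
        super().__init__()
        self.image_encoder = make_encoder(n_modalities, base)
        self.cond_encoder = make_encoder(n_modalities, base)
        merged = base * 8  # concatenation of the two encoder outputs
        self.bottleneck = nn.Sequential(*[ResidualBlock(merged) for _ in range(n_res)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(merged, base * 2, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, out_ch, 7, padding=3), nn.Tanh())  # outputs in [-1, 1]

    def forward(self, x, cond):
        feats = torch.cat([self.image_encoder(x), self.cond_encoder(cond)], dim=1)
        return self.decoder(self.bottleneck(feats))
```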

2.2.2 Training Details

We apply two recent techniques to stabilize the training of the model. First, for the adversarial loss (Eq. 2), we replace the negative log-likelihood objective by a least-squares loss [9]. Second, to reduce model oscillation, we update the discriminators using a history of generated images rather than the ones produced by the latest generators, as proposed in [11]; thus we keep the 25 previously generated images in an image buffer. We set λ = 10 in Eq. 3 for all experiments. We use the Adam solver [7] with a batch size of 5. All networks were trained from scratch with a learning rate of 0.0002 for 20 epochs. Given n input modalities, in each epoch the parameters of both the generators and the discriminators are updated 2^n − 1 times, once for each of the 2^n − 1 non-empty training subsets of the input modalities. The implementation of our model is available at https://github.com/hongweilibran/DiamondGAN.
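The subset enumeration and the image buffer can be sketched as follows (an illustrative snippet; the buffer replacement policy follows common practice for [11] and is an assumption on our part):

```python
import itertools
import random

def nonempty_subsets(n_modalities):
    """Yield the availability condition of each of the 2^n - 1 non-empty
    input combinations visited in every training epoch."""
    for r in range(1, n_modalities + 1):
        for combo in itertools.combinations(range(n_modalities), r):
            yield [1 if i in combo else 0 for i in range(n_modalities)]

class ImageBuffer:
    """History of previously generated images used to update the
    discriminators, following [11]."""
    def __init__(self, size=25):
        self.size, self.images = size, []

    def query(self, image):
        if len(self.images) < self.size:
            self.images.append(image)
            return image
        if random.random() < 0.5:          # half of the time return an old image
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], image
            return old
        return image

print(list(nonempty_subsets(3)))  # 7 conditions for Flair/T1/T2
```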

2.3 Visual Rating and Evaluation Protocol

Quantitative evaluation of generated images in terms of standard error and correlation scores remains a debated task [1]. Moreover, evaluation with common metrics such as PSNR and MAE [13] would not tell us whether the algorithm captures clinically relevant small substructures. Therefore, we strive to obtain experts' estimates of the image quality and design a multi-rater quality evaluation experiment. Neuro-radiologists rated the images in a browser application. In each trial, they were presented with two images: on the left side, one real source image (a T1 or Flair image); on the other side, a paired image of the target modality, which was either a real or a generated one. The displayed paired images were randomly chosen from the pool of generated and real images. This setup enables the experts to immediately identify even very small inconsistencies or implausibilities between the two images. The experts were asked to rate the plausibility of the image on the right, based on the real image on the left, by assigning a rating from 1 to 6 stars, where 6 stars denoted a perfectly plausible image and 1 star a completely implausible one. The images were presented in 280 trials.
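A small sketch of how such randomized rating trials could be assembled (purely illustrative; data structures and names are ours):

```python
import random

def build_rating_trials(real_pairs, synthetic_pairs, n_trials=280, seed=0):
    """Assemble randomized rating trials: each trial shows a real source image
    (T1 or Flair) next to either the real or a synthetic target image.
    real_pairs / synthetic_pairs are lists of (source, target) tuples."""
    rng = random.Random(seed)
    pool = [(src, tgt, "real") for src, tgt in real_pairs] \
         + [(src, tgt, "synthetic") for src, tgt in synthetic_pairs]
    rng.shuffle(pool)
    return pool[:n_trials]
```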

3 Experiments

3.0.1 Datasets

Dataset 1 consists of 65 scans of patients with MS lesions from a local hospital, acquired with a multi-parametric protocol that includes co-registered Flair, T1, T2, double inversion recovery (DIR) and contrast-enhanced T1 (T1-c) after skull-stripping. The first three are common modalities in most MS lesion exams. DIR is an MRI pulse sequence which suppresses the signal from the cerebrospinal fluid and the white matter, enhancing inflammatory lesions. T1-c is an MRI sequence which requires a paramagnetic contrast agent (usually gadolinium) that reduces the T1 relaxation time and thereby increases the signal intensity. Synthesizing DIR and T1-c is of clinical relevance because it can substantially reduce medical costs. We mainly report our results on Dataset 1. The additional Dataset 2 is used to demonstrate that our approach can work on multiple datasets with incomplete and non-aligned modalities; it is part of the public MICCAI-WMH dataset [8] and includes 40 subjects with two modalities (Flair and T1). 2D axial slices are used for training the network. All slices are cropped or padded to a uniform size of 240 × 240, and intensity values are rescaled to [-1, 1].
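A minimal preprocessing sketch consistent with this description (padding with the background value -1 after rescaling is our assumption):

```python
import numpy as np

def preprocess_slice(img, size=240):
    """Rescale intensities to [-1, 1] and crop or pad an axial slice to size x size."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    img = 2.0 * (img - lo) / (hi - lo + 1e-8) - 1.0      # intensities in [-1, 1]
    out = np.full((size, size), -1.0, dtype=np.float32)  # pad with background value
    h, w = img.shape
    ch, cw = min(h, size), min(w, size)
    oy, ox = (size - ch) // 2, (size - cw) // 2          # centre in the output
    iy, ix = (h - ch) // 2, (w - cw) // 2                # centre crop of the input
    out[oy:oy + ch, ox:ox + cw] = img[iy:iy + ch, ix:ix + cw]
    return out
```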

3.0.2 Reconstructing DIR and T1-c from Common Modalities

We perform two image synthesis tasks on two clinically-relevant MRI sequences (DIR and T1-c), using three common modalities (i.e., Flair, T1 and T2) as input. We separate Dataset 1 into a training set, a validation set and a test set, resulting in 30 scans (2015 slices per modality) for training and 35 scans (2100 slices per modality) for testing. To obtain the optimal hyper-parameters of the model, we use 5 of the 30 training scans as a validation set. A common approach for quantitative evaluation of medical GAN images is to calculate relative errors and the signal-to-noise ratio between the synthetic image and the real image [13]. Table 1 reports the peak signal-to-noise ratio (PSNR) and the mean absolute error (MAE) obtained by comparing the synthetic images with the real T1-c and DIR images. For both synthetic DIR and T1-c images, we obtain the highest PSNR and the lowest MAE with a combined T1+T2+Flair input to our model. In the DIR synthesis experiment, the scores of multi-modal inputs to our GAN are comparable (MAE 0.058-0.065), whereas the scores for single inputs are substantially worse (MAE 0.073-0.084). For the T1-c synthesis task, we find that any combination of multi-modal inputs involving the T1 modality (MAE 0.045-0.048) results in better scores than the other inputs. This indicates that our model successfully extracts the relevant information, as T1-c is a T1 scan with a contrast-enhancing agent. For comparison, we implement CycleGAN [14] to perform one-to-one cross-modality synthesis; its best results are listed in Table 1. For DIR synthesis, using Flair images as the input to CycleGAN achieves the highest PSNR and the lowest MAE, while for T1-c, using T1 as the input gives the best performance. The proposed model outperforms CycleGAN in both tasks. We further replace part of the training Flair and T1 images in Dataset 1 with images from Dataset 2 (794 images per modality in total) and find that the result on the same test set is comparable to that obtained with the original Dataset 1.
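For reference, PSNR and MAE can be computed as below (a generic sketch; the data range used for PSNR is our assumption and depends on the chosen intensity scaling):

```python
import numpy as np

def mae(real, synthetic):
    """Mean absolute error between a real and a synthetic slice."""
    return float(np.mean(np.abs(real - synthetic)))

def psnr(real, synthetic, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the intensity span
    of the images (e.g. 1.0 for images rescaled to [0, 1])."""
    mse = np.mean((real - synthetic) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))
```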

Wilcoxon signed-rank tests are conducted on the PSNR and MAE pairs produced by DiamondGAN (with 3 input modalities) and by CycleGAN. Although the improvements in PSNR and MAE appear small at the whole-image level, they are statistically significant (p-value < 0.0001) in the case of DIR in Table 1. This improvement is highly relevant for biomarker synthesis and for pathological evaluation, especially in the case of MS lesions with small volumes.
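The paired test can be run with SciPy, for example (illustrative; the score arrays are hypothetical):

```python
from scipy.stats import wilcoxon

def compare_models(scores_a, scores_b):
    """Wilcoxon signed-rank test on paired per-image metric values,
    e.g. PSNR of DiamondGAN vs. CycleGAN on the same test slices."""
    statistic, p_value = wilcoxon(scores_a, scores_b)
    return p_value
```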

Figure 2: Samples of synthetic T1-c and DIR images given the combination of T1, T2 and Flair modalities as input. Difference images are generated and visualized as heat maps. The synthetic images preserve the tissue contrast and the anatomical information. However, we find more differences in synthetic DIR images than in synthetic T1-c ones, especially around the brain boundary. This could be due to alignment errors of the registration methods.

Method (input combination)               DIR PSNR ↑   DIR MAE ↓   T1-c PSNR ↑   T1-c MAE ↓
CycleGAN [14] (best single input)            17.34       0.068        20.36        0.045
DiamondGAN (single-modality input)           15.46       0.084        20.21        0.048
DiamondGAN (single-modality input)           15.99       0.073        19.34        0.054
DiamondGAN (single-modality input)           16.16       0.078        17.15        0.068
DiamondGAN (two-modality input)              17.41       0.065        20.75        0.046
DiamondGAN (two-modality input)              18.58       0.059        19.78        0.051
DiamondGAN (two-modality input)              18.02       0.062        20.40        0.047
DiamondGAN (T1+T2+Flair)                     18.63       0.058        20.86        0.045

Table 1: Quantitative evaluation of our generated images against the real DIR and T1-c images, using PSNR and MAE as evaluation metrics. The results show that the generated images benefit from a multi-modal input. ↑ indicates that higher values correspond to better image quality; ↓ indicates that lower values are better.
Figure 3: Box plots of the rating scores for synthetic and real images, for the T1-c modality on the left and the DIR modality on the right. The means are shown as black numbers. DiamondGAN achieves comparable plausibility levels for the DIR modality.

3.0.3 Visual Evaluations by Neuroradiologists

Fourteen neuro-radiologists with a median of more than five years of professional experience participated. Each of them evaluated 210 synthetic images and 70 original images. The 210 synthetic images were generated under 6 different input conditions, with 35 samples per condition. The rating results of the 14 raters were averaged, and box plots of the results are shown in Figure 3. For the synthesis of T1-c images, we found that three input combinations (i.e., T1, T1+Flair and T1+T2+Flair) gave comparable results, while images based solely on a Flair input were consistently rated as implausible. The plausibility of DIR images synthesized with multi-modal input was rated on average 0.83 stars higher than that of images synthesized solely from T1. This is plausible, as DIR is a complex sequence containing proprietary information; its synthesis thus benefits from multiple input sources. For the synthetic images with T1+T2+Flair input, the experts assigned a rating nearly identical to that of the original images (4.54 vs. 4.70 stars).

We conduct Wilcoxon rank-sum tests on the paired rating scores of synthetic and real images from the 14 raters under the 6 conditions, which results in 6 pairs of 14 observations. The results show that the rating scores of synthetic DIR images generated from T1+T2+Flair input and those of real DIR images are not significantly different (p-value = 0.1432), while all other pairs are significantly different (p-values < 0.0001). This demonstrates that trained radiologists are unable to distinguish our synthetic DIR images from real ones. Furthermore, the experts' ratings for the individual input conditions are in agreement with the metric-based evaluation in Table 1: for T1-c synthesis, the PSNR and MAE scores are consistently good whenever the T1 modality is fed to DiamondGAN.

4 Conclusion and Discussion

This work introduces a novel approach for multi-modal medical image synthesis, validated with extensive multi-rater experiments and statistical tests. The multi-modal approach allows us to mine the structured information contained in the existing extensive MRI sequences. Pathological evaluation is the ultimate goal of this work. Our approach is evaluated by the clinical partners who contributed the datasets: we compared the synthetic DIR sequence with the conventional Flair sequence in an MS lesion detection task in a cohort study. The proposed DiamondGAN has the potential to reduce medical costs in clinical practice.

4.0.1 Acknowledgement

This work is supported by the Technische Universität München Institute for Advanced Study, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement No. 291763. HL and BW are supported by funding from the Zentrum Digitalisierung Bayern.

References

  • [1] Borji, A.: Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding 179, 41–65 (2019)
  • [2] Choi, Y., et al.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: CVPR. pp. 8789–8797 (2018)
  • [3] Dar, S.U., et al.: Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Transactions on Medical Imaging (2019)
  • [4] He, K., et al.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [5] Isola, P., et al.: Image-to-image translation with conditional adversarial networks. In: CVPR. pp. 1125–1134 (2017)
  • [6] Johnson, J., et al.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV. pp. 694–711. Springer (2016)
  • [7] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [8] Kuijf, H.J., Biesbroek, J.M., de Bresser, J., Heinen, R., Andermatt, S., Bento, M., Berseth, M., Belyaev, M., Cardoso, M.J., Casamitjana, A., et al.: Standardized assessment of automatic segmentation of white matter hyperintensities; results of the WMH Segmentation Challenge. IEEE Transactions on Medical Imaging (2019)
  • [9] Mao, X., et al.: Least squares generative adversarial networks. In: CVPR. pp. 2794–2802 (2017)
  • [10] Sharma, A., Hamarneh, G.: Missing MRI pulse sequence synthesis using multi-modal generative adversarial network. arXiv preprint arXiv:1904.12200 (2019)
  • [11] Shrivastava, A., et al.: Learning from simulated and unsupervised images through adversarial training. In: CVPR. pp. 2107–2116 (2017)
  • [12] Wang, Y., Yu, B., Wang, L., Zu, C., Lalush, D.S., Lin, W., Wu, X., Zhou, J., Shen, D., Zhou, L.: 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. NeuroImage 174, 550–562 (2018)
  • [13] Welander, P., et al.: Generative adversarial networks for image-to-image translation on multi-contrast MR images: a comparison of CycleGAN and UNIT. arXiv preprint arXiv:1806.07777 (2018)
  • [14] Zhu, J.Y., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: CVPR. pp. 2223–2232 (2017)