Conditional Domain Adaptation GANs for Biomedical Image Segmentation
Due to visual differences in biomedical image datasets acquired using distinct digitization techniques, Transfer Learning is an important step for improving the generalization capabilities of Neural Networks in this area. Despite succeeding in classification tasks, most Domain Adaptation strategies face serious limitations in segmentation. Therefore, improving on previous Image Translation networks, we propose a Domain Adaptation method for biomedical image segmentation based on adversarial networks that can learn from both unlabeled and labeled data. Our experiments compare the proposed method against several baselines, both quantitatively and qualitatively, across multiple domains, datasets and segmentation tasks. The proposed method shows consistently better results than the baselines in scarce label scenarios, often achieving Jaccard values greater than 0.9 and adequate segmentation quality in most tasks and datasets.
keywords: Deep Learning, Domain Adaptation, Biomedical Images, Semantic Segmentation, Image Translation, Semi-Supervised Learning.
Radiology has been a useful tool for assessing health conditions since the last decades of the 19th century, when X-Rays were first used for medical purposes. Since then the field has become an essential tool for detecting, diagnosing and treating medical issues. More recently, algorithms have been coupled with these imaging techniques and other medical information in order to provide second opinions to physicians via Computer-Aided Detection and Computer-Aided Diagnosis (CAD) systems. In recent decades, Machine Learning algorithms were incorporated into the body of knowledge of CAD systems, providing automatic methodologies for finding patterns in big data scenarios and improving the capabilities of human physicians.
During the last half decade traditional Machine Learning pipelines have been losing ground to integrated Deep Neural Networks (DNNs) that can be trained end-to-end Litjens et al. (2017). DNNs can integrate the steps of feature extraction and statistical inference over unstructured data, such as images, text or temporal sequences. Deep Learning models for images are usually built upon some form of trainable convolutional operation Krizhevsky et al. (2012). Convolutional Neural Networks (CNNs) and their variations are the most popular architectures for image classification, detection and segmentation in Computer Vision.
In the medical community, labeled data is often limited and there are large amounts of unlabeled data that can be used for unsupervised learning. To make matters worse, the generalization of DNNs is normally limited by the variability of the training data, which is a major hindrance, as the different digitization techniques and devices used to acquire different datasets tend to produce biomedical images with distinct visual features. Therefore, the study of methods that can use both labeled and unlabeled data is an active research area in both Computer Vision and Biomedical Image Processing.
Domain Adaptation (DA) Zhang et al. (2017) methods are often used to improve the generalization of DNNs over images in a supervised manner. The most popular method for deep DA is Transfer Learning via Fine-Tuning DNNs pre-trained on larger datasets, such as ImageNet Deng et al. (2009). However, Fine-Tuning only learns from labeled data, ignoring the potentially larger amounts of unlabeled data available. In recent years, several approaches have been proposed for Unsupervised Domain Adaptation (UDA), Semi-Supervised Domain Adaptation (SSDA) and Fully Supervised Domain Adaptation (FSDA) Zhang et al. (2017).
This paper describes a DNN for DA that works for the whole spectrum of UDA to FSDA, being able to learn from both labeled and unlabeled data. As can be seen in Figure 1, a major novelty of our method, henceforth known as Conditional Domain Adaptation Generative Adversarial Network (CoDAGAN), is allowing multiple datasets to be used conjointly in the training procedure. Most other methods in the DA literature only allow for one source and one target domain, as shown by Zhang et al. (2017).
The remaining sections of this paper are organized as follows. Section 2 presents the previous works that paved the way for the proposal of CoDAGANs. Section 3 describes the CoDAGAN architecture, training procedure and loss components. Section 4 shows the experimental setup of this work, including datasets, hyperparameters, experimental protocol and baselines. Section 5 discusses the results found during experiments using CoDAGANs for UDA, SSDA and FSDA. Finally, Section 6 concludes this work with our final remarks.
2 Related Work
2.1 Image-to-Image Translation
Image-to-Image Translation Networks are Generative Adversarial Networks (GANs) Goodfellow et al. (2014) capable of transforming samples from one image domain into images from another. Access to paired images from the two domains simplifies the learning process considerably, as comparison losses can be devised using only pixel-level or patch-level comparisons between the original and translated images Isola et al. (2017). Paired Image-to-Image Translation can be achieved, therefore, by Conditional GANs (CGANs) Mirza and Osindero (2014) paired with a simple regression model. Requiring paired samples reduces the applicability of Image Translation to a very limited subset of image domains where there is a possibility of generating paired datasets. This limitation motivated the creation of Unpaired Image-to-Image Translation methods Zhu et al. (2017); Liu et al. (2017); Huang et al. (2018). These networks are based on Cycle-Consistency, which models the translation process between two image domains as an invertible process represented by a cycle, as can be seen in Figure 4. This cyclic structure allows for Cycle-Consistency losses to be coupled with the adversarial losses of traditional GANs.
A Cycle-Consistent loss can be formulated as follows: let $A$ and $B$ be two image domains containing unpaired image samples $X_A$ and $X_B$. Consider then two functions $G_{A \rightarrow B}$ and $G_{B \rightarrow A}$ that perform the translations $X_A \rightarrow X_B$ and $X_B \rightarrow X_A$, respectively. Then a loss can be devised by comparing the pairs of images $\{X_A, X_{A \rightarrow B \rightarrow A}\}$ and $\{X_B, X_{B \rightarrow A \rightarrow B}\}$. In other words, the relations $X_A \approx G_{B \rightarrow A}(G_{A \rightarrow B}(X_A))$ and $X_B \approx G_{A \rightarrow B}(G_{B \rightarrow A}(X_B))$ should be maintained in the translation process. The counterparts of the generative networks in GANs are discriminative networks, which are trained to identify whether an image is a natural sample from its domain or a sample translated from another domain. $D_A$ and $D_B$ are referred to as the discriminative networks for datasets $A$ and $B$, respectively. Discriminative networks are usually traditional supervised networks, such as CNNs Krizhevsky et al. (2012); Simonyan and Zisserman (2014), which are trained to distinguish real images from fake images generated by $G_{A \rightarrow B}$ and $G_{B \rightarrow A}$.
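As a rough illustration of the comparison described above, the following Python sketch computes a Cycle-Consistency loss for two toy translation functions. The L1 distance, the flattened-list image representation and every function name are assumptions made for illustration only; the text does not fix a particular distance, and real translators would be neural networks.

```python
from typing import Callable, Sequence

Image = Sequence[float]  # a flattened image, for illustration only


def l1_distance(x: Image, y: Image) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_consistency_loss(
    x_a: Image,
    x_b: Image,
    g_ab: Callable[[Image], Image],
    g_ba: Callable[[Image], Image],
) -> float:
    """Compare each image with its round-trip translation (assumed L1)."""
    x_aba = g_ba(g_ab(x_a))  # A -> B -> A
    x_bab = g_ab(g_ba(x_b))  # B -> A -> B
    return l1_distance(x_a, x_aba) + l1_distance(x_b, x_bab)


# Toy translators (hypothetical): domain B is domain A shifted by 0.5.
def to_b(x: Image) -> Image:
    return [v + 0.5 for v in x]


def to_a(x: Image) -> Image:
    return [v - 0.5 for v in x]


def identity(x: Image) -> Image:
    return list(x)


invertible_loss = cycle_consistency_loss([0.0, 1.0], [0.5, 1.5], to_b, to_a)
broken_loss = cycle_consistency_loss([0.0, 1.0], [0.5, 1.5], to_b, identity)
```

When the two translators invert each other the loss vanishes; replacing one of them with a function that does not undo the other makes the loss strictly positive.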
Some efforts have been spent in proposing Unpaired Image Translation GANs for multi-domain scenarios, as is the case of StarGANs Choi et al. (2017), but these networks do not explicitly present isomorphic representations of the data, as UNIT and MUNIT architectures do. Another advantage of UNIT and MUNIT over StarGANs is that they also compute reconstruction (Cycle-Consistent) losses on the isomorphic representations, besides the traditional Cycle-Consistency between real and reconstructed images. CoDAGANs were built to be agnostic to the Image Translation network used as basis for the implementation, being able to transform any Image Translation GAN that has an isomorphic representation of the data into a multi-domain architecture with only minor changes to the generator and discriminator networks.
2.2 Visual Domain Adaptation
Domain Adaptation can be done for fully labeled (FSDA), partially labeled (SSDA) and unlabeled datasets (UDA). In UDA scenarios, no labels are available for the target set, while SSDA tasks have both labeled and unlabeled samples on the target domain. FSDA contains only labeled data in the target domain and it is the most common practice nowadays among deep DA methods due to the popularity of Fine-Tuning pretrained DNNs.
Zhang et al. (2017) describe a taxonomy for DA tasks comprising most of the spectrum of deep and shallow knowledge transfer techniques. This taxonomy describes several classes of problems with variations in feature and label spaces between source and target domains, data labeling, balanced/unbalanced data and sequential/non-sequential data. CoDAGANs cannot be put in one single category in the taxonomy of Zhang et al. (2017), as they allow for a dataset to be source and target at the same time and are suitable for UDA, SSDA and FSDA, being able to learn from both unsupervised and supervised data. Other recent surveys in the visual DA literature Patel et al. (2015); Shao et al. (2015); Wang and Deng (2018); Csurka (2017) point to a lack of knowledge transfer methods for tasks other than classification, such as detection, segmentation and retrieval.
CoDAGANs are DNNs that combine unsupervised and supervised learning to perform UDA, SSDA or FSDA between two or more image sets. These architectures are based on adaptations of preexisting Unsupervised Image-to-Image Translation networks Zhu et al. (2017); Liu et al. (2017); Huang et al. (2018), adding supervision to the process in order to perform DA. Generator networks ($G$) in Image Translation GANs are usually implemented using Encoder-Decoder architectures such as U-Nets Ronneberger et al. (2015). At the end of the Encoder ($G_E$) there is a middle-level representation $I$ that can be trained to be isomorphic in these architectures. $I$ serves as input of the Decoder ($G_D$). Isomorphism allows for learning a supervised model $M$ on $I$ that is capable of inferring over several datasets. This semi-supervised learning scheme can be seen in Figure 1.
We employed the Unsupervised Image-to-Image Translation (UNIT) and Multimodal Unsupervised Image-to-Image Translation (MUNIT) networks as basis for the generation of $I$. On top of that, we added the supervised model $M$, which is based on a U-net Ronneberger et al. (2015) for semantic segmentation, and made some considerable changes to the translation approaches, mainly regarding the architecture and conditional distribution modelling of the original GANs, as discussed further in this section. The exact architecture for $G$ depends on the basis translation network chosen for the adaptation. In our case, both UNIT and MUNIT use VAE-like architectures Kingma and Welling (2013) for $G$, containing downsampling ($G_E$), upsampling ($G_D$) and residual layers.
The shape of $I$ depends on the architecture for $G$. UNIT, for example, assumes a single latent space between the image domains, while MUNIT separates the content of an image from its style. CoDAGANs use the single latent space when they are based on UNIT and only the content when they are built upon MUNIT, as the style vector has no spatial resolution.
A training iteration on a CoDAGAN follows the sequence presented in Figure 5. The generator network $G$, like U-nets Ronneberger et al. (2015) and Variational Autoencoders Kingma and Welling (2013), is an Encoder-Decoder architecture. However, instead of mapping the input image into itself or into a semantic map as its Encoder-Decoder counterparts do, it is capable of translating samples from one image dataset into synthetic samples from another dataset. The encoding half of this architecture ($G_E$) receives images from the various datasets and creates an isomorphic representation somewhere between the image domains in a high dimensional space. This code will be henceforth described as $I$ and is expected to correlate important features in the domains in an unsupervised manner Liu and Tuzel (2016). Decoders ($G_D$) in CoDAGAN generators are able to read $I$ and produce synthetic images from the same domain or from other domains used in the learning process. This isomorphic representation is an integral part of both UNIT Liu et al. (2017) and MUNIT Huang et al. (2018) translations, as they also enforce good reconstructions for $I$ in the learning process. It also plays an essential role in CoDAGANs, as all supervised learning is performed on $I$.
As shown in Figure 5, CoDAGANs include five unsupervised subroutines: a) Encode, b) Decode, c) Reencode, d) Redecode and e) Discriminate; and two f) Supervision subroutines, which are the only labeled ones. These subroutines will be detailed further in the following paragraphs.
First, a pair of datasets $A$ (source) and $B$ (target) are randomly selected among the potentially large number of datasets selected for training. A minibatch $X_A$ of images from $A$ is then appended to a code $h_A$ generated by a One Hot Encoding scheme, aiming to inform the encoder of the source dataset for the sample. The 2-tuple $(X_A, h_A)$ is passed to the encoder $G_E$, producing an intermediate isomorphic representation $I_A$ for $X_A$ according to the marginal distributions computed by $G_E$ for dataset $A$.
The information flow is then split into two distinct branches: 1) $I_A$ is fed to the supervised model $M$; 2) $I_A$ is appended to a code $h_B$ and passed through the decoder $G_D$ conditioned to dataset $B$. $G_D$ produces $X_{A \rightarrow B}$, which is a translation of the minibatch $X_A$ to the style of dataset $B$.
The Reencode procedure performs the same operation of generating an isomorphic representation as the Encode subroutine, but receiving as input the synthetic image $X_{A \rightarrow B}$. More specifically, the reencoded isomorphic representation $I_{A \rightarrow B}$ is generated by $G_E(X_{A \rightarrow B}, h_B)$.
Again the architecture splits into two branches: 1) $I_{A \rightarrow B}$ is passed to $M$ in order to produce the prediction $\hat{Y}_{A \rightarrow B}$; 2) the isomorphic representation $I_{A \rightarrow B}$ is decoded as in $G_D(I_{A \rightarrow B}, h_A)$, producing the reconstruction $X_{A \rightarrow B \rightarrow A}$, which can be compared to $X_A$ via a Cycle-Consistency loss.
At the end of Decode, the synthetic image $X_{A \rightarrow B}$ is produced. The original samples $X_B$ and the translated images $X_{A \rightarrow B}$ are merged in a single batch and passed to $D$, which uses a supervised loss in order to classify between real and synthetic samples from the datasets.
At the end of the Encode and Reencode subroutines, for each sample $X_A^{(i)}$ which has a corresponding label $Y_A^{(i)}$, the isomorphisms $I_A^{(i)}$ and $I_{A \rightarrow B}^{(i)}$ are both fed to the same supervised model $M$. The model $M$ performs the desired supervised task, generating the predictions $\hat{Y}_A^{(i)}$ and $\hat{Y}_{A \rightarrow B}^{(i)}$. Both these predictions can be compared in a supervised manner to $Y_A^{(i)}$, if there are labels for the image in this minibatch. As there are always at least some labeled samples in this scenario, $M$ is trained to perform inference on isomorphic encodings of both originally labeled data ($I_A$) and data translated by the CoDAGAN to the style of other data ($I_{A \rightarrow B}$).
If domain shift is computed and adjusted properly during the training procedure, the properties $X_A \approx X_{A \rightarrow B \rightarrow A}$ and $I_A \approx I_{A \rightarrow B}$ are achieved, satisfying Cycle-Consistency and Isomorphism, respectively. After training, it does not matter which input dataset among the training ones is conditionally fed to $G_E$ for the generation of isomorphism $I$, as samples from all datasets should belong to the same joint distribution in $I$-space. Therefore any learning performed on $I_A$ and $I_{A \rightarrow B}$ is universal to all datasets used in the training procedure. Instead of performing only the translation $A \rightarrow B$, all previously mentioned subroutines are run simultaneously for both $A \rightarrow B$ and $B \rightarrow A$, as in UNIT Liu et al. (2017) and MUNIT Huang et al. (2018). Translations $B \rightarrow A$ are analogous to the case described previously.
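The data flow of the Encode, Decode, Reencode, Redecode and Supervision subroutines can be sketched as below. This is a minimal sketch under strong assumptions: images are single floats, the encoder, decoder and supervised model are hypothetical toy functions rather than networks, and the Discriminate subroutine is omitted. It only shows the order of the calls and the two properties (Cycle-Consistency and Isomorphism) that training is expected to enforce.

```python
from typing import Callable, Dict


def codagan_pass(
    x_a: float,
    code_a: int,
    code_b: int,
    encode: Callable[[float, int], float],
    decode: Callable[[float, int], float],
    model: Callable[[float], bool],
) -> Dict[str, object]:
    """One source-to-target pass through the unsupervised subroutines."""
    i_a = encode(x_a, code_a)      # Encode: image + source code -> isomorphism
    x_ab = decode(i_a, code_b)     # Decode: isomorphism + target code -> translation
    i_ab = encode(x_ab, code_b)    # Reencode the translated image
    x_aba = decode(i_ab, code_a)   # Redecode back to the source style
    return {
        "i_a": i_a,
        "x_ab": x_ab,
        "i_ab": i_ab,
        "x_aba": x_aba,
        "y_a": model(i_a),         # Supervision on the original encoding
        "y_ab": model(i_ab),       # Supervision on the reencoded translation
    }


# Hypothetical toy domains: each dataset adds a fixed brightness offset.
OFFSETS = {0: 0.0, 1: 0.5}


def toy_encode(x: float, code: int) -> float:
    return x - OFFSETS[code]


def toy_decode(i: float, code: int) -> float:
    return i + OFFSETS[code]


def toy_model(i: float) -> bool:
    return i > 0.5


out = codagan_pass(1.0, 0, 1, toy_encode, toy_decode, toy_model)
```

In this toy setting the reconstruction equals the input and both encodings coincide, so the supervised model gives the same prediction on either branch.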
One should notice that $G_E$ performs spatial downsampling, while $G_D$ performs upsampling; consequently, the model $M$ should take into account the number of downsampling layers in $G_E$. More specifically, we removed the first two layers of the U-nets Ronneberger et al. (2015) when they were used as the model $M$, as they perform identical functions to the two downsampling layers. The number of input channels of $M$ must also be compatible with the number of output channels in $G_E$. Another constraint for the architecture is that the upsampling performed by $G_D$ should always compensate the downsampling factor of $G_E$, characterizing $G$ as a symmetric Encoder-Decoder architecture.
The discriminator for CoDAGANs is basically the same as the discriminator from the original Cycle-Consistency network, that is, a basic CNN that classifies between real and fake samples from the domains. The only addition to $D$ is conditional training in order for the discriminator to know the domain the sample is supposed to belong to, which allows $D$ to use its marginal distribution over the datasets for determining the likelihood of veracity for the sample. It is important to notice that our model is agnostic to the choice of Unsupervised Image-to-Image Translation architecture, therefore future advances in this area based on Cycle-Consistency should be equally portable to perform DA and further benefit the performance of CoDAGANs.
While CoGANs use a coupled architecture composed of 2 encoders ($G_{E_A}$ and $G_{E_B}$) and 2 decoders ($G_{D_A}$ and $G_{D_B}$) for learning a joint distribution over datasets $A$ and $B$, CoDAGANs use only one generator $G$ composed of one Encoder and one Decoder ($G_E$ and $G_D$). In addition to the data $X_A$ from some dataset $A$, $G_E$ is conditionally fed a One Hot Encoding $h_A$, as in $G_E(X_A, h_A)$. The code $h_A$ forces the generator to encode the data according to the marginal distribution optimized for $A$, conditioning the method to these data, as shown in Figure 5. The code $h_B$ for a second dataset $B$ is received by the decoder, as in $G_D(I_A, h_B)$, in order to produce the translation $X_{A \rightarrow B}$ to dataset $B$.
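One plausible way to condition a convolutional encoder or decoder on a One Hot Encoding is to broadcast each entry of the code into a constant plane and concatenate those planes to the input as extra channels. The sketch below assumes exactly that scheme, with images stored as nested lists of shape channels x height x width; the text does not state how the code is appended, so both the scheme and the function names are illustrative assumptions.

```python
from typing import List

Plane = List[List[float]]   # height x width
Tensor = List[Plane]        # channels x height x width


def one_hot(index: int, n_datasets: int) -> List[float]:
    """One Hot Encoding of a dataset index."""
    if not 0 <= index < n_datasets:
        raise ValueError("dataset index out of range")
    return [1.0 if i == index else 0.0 for i in range(n_datasets)]


def append_code(image: Tensor, code: List[float]) -> Tensor:
    """Concatenate one constant plane per code entry to the image channels."""
    height = len(image[0])
    width = len(image[0][0])
    planes = [[[value] * width for _ in range(height)] for value in code]
    return list(image) + planes


# A 1-channel 2x2 image conditioned on dataset 1 out of 3 datasets.
image = [[[0.2, 0.4], [0.6, 0.8]]]
conditioned = append_code(image, one_hot(1, 3))
```

With three datasets, a single-channel image becomes a four-channel input whose last three channels identify the dataset; the same helper could be applied to an isomorphic representation before decoding.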
3.1 Training Routines in CoDAGANs
In each iteration of a traditional GAN there are two routines for training the networks: 1) freezing the discriminator and updating the generator (Gen Update); and 2) freezing the generator and updating the discriminator (Dis Update). CoDAGANs add a new supervised routine to this scheme in order to perform DA: Model Update. The subroutines described in Section 3 that compose the three routines of CoDAGANs are presented in Table 1.
Since the first proposal of GANs Goodfellow et al. (2014), stability has been considered a major problem in GAN training. Adversarial training is known to be more susceptible to instability than traditional training procedures for DNNs Goodfellow et al. (2014). Therefore, in order to achieve more stable results, we split the training procedure of CoDAGANs into two phases: Full Training and Supervision Tuning.
During the first 75% of the epochs in a CoDAGAN training procedure, Full Training is performed. This training phase is composed of the procedures Dis Update, Gen Update and Model Update. That is, for each iteration in the Full Training phase, at first the discriminator is optimized, followed by an update of $G$ and finishing with the update of the supervised model. During this phase, adversarial training enforces the creation of good isomorphic representations by $G$ and translations between the domains. At the same time, the model $M$ uses the existing labels to improve translations performed by $G$ by adding semantic meaning to the translated visual features.
The last 25% of the network epochs are trained in the Supervision Tuning setting. This phase freezes $G$ and $D$ and performs only the Model Update procedure, effectively tuning the supervised model to a stationary isomorphic representation. Freezing $G$ has the effect of removing the instability generated by adversarial training in the translation process, as it is harder for $M$ to converge properly while the isomorphic input is constantly changing its visual properties due to changes in $G$.
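The two-phase schedule can be expressed as a small helper that returns which routines run at a given epoch. This is a sketch assuming 1-based epochs and an inclusive boundary at 75% of the total; the routine names are illustrative labels, not identifiers from the official implementation.

```python
from typing import Tuple


def routines_for_epoch(
    epoch: int, total_epochs: int, full_fraction: float = 0.75
) -> Tuple[str, ...]:
    """Return the training routines executed at a 1-based epoch."""
    if not 1 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [1, total_epochs]")
    if epoch <= full_fraction * total_epochs:
        # Full Training: discriminator, then generator, then supervised model.
        return ("dis_update", "gen_update", "model_update")
    # Supervision Tuning: generator and discriminator stay frozen.
    return ("model_update",)


full_phase = routines_for_epoch(300, 400)
tuning_phase = routines_for_epoch(301, 400)
```

For a 400-epoch run, epochs 1 to 300 would perform Full Training and epochs 301 to 400 would perform Supervision Tuning under these assumptions.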
3.2 CoDAGAN Loss
Both UNIT Liu et al. (2017) and MUNIT Huang et al. (2018) conjointly optimize GAN-like adversarial loss components and Cycle-Consistency reconstruction losses. Cycle-Consistency losses ($\mathcal{L}_{cyc}$) are used in order to provide unsupervised training capabilities to these translation methods, allowing for the use of unpaired image datasets, as paired samples from distinct domains are often hard or impossible to create. Cycle-Consistency is often achieved via Variational inference, which optimizes a lower bound on the log-likelihood of high dimensional data instead of performing exact Maximum Likelihood Estimation (MLE) Kingma and Welling (2013). Variational losses allow VAEs to generate new samples learnt from an approximation to the original data distribution as well as reconstruct images from these distributions. Optimizing this bound allows VAEs to produce samples with high statistical likelihood, but still with low visual quality.
Adversarial losses ($\mathcal{L}_{adv}$) are often used complementarily with reconstruction losses in order to yield high visual quality and detailed images, as GANs are widely observed to take bigger risks in generating samples than simple regression losses Isola et al. (2017); Zhu et al. (2017). Simpler approaches to image generation tend to average the possible outcomes of new samples, producing low quality images; therefore GANs produce less blurry and more realistic images than non-adversarial approaches in most settings. Unsupervised Image-to-Image Translation architectures normally use a weighted sum of these previously discussed losses as their total loss function ($\mathcal{L}_{tot}$), as in:
$$\mathcal{L}_{tot} = \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{adv}\,\mathcal{L}_{adv},$$
where $\mathcal{L}_{cyc}$ is the Cycle-Consistency loss and $\lambda_{cyc}$ and $\lambda_{adv}$ are the weights of each component.
More details on UNIT and MUNIT loss components can be found in their respective original papers Liu et al. (2017); Huang et al. (2018). One should notice that we only presented the architecture-agnostic routines and loss components for CoDAGANs in the previous subsections, as the choice of Unsupervised Image-to-Image Translation basis network might introduce new objective terms and/or architectural changes. MUNIT, for instance, computes reconstruction losses for both the pair of images $\{X_A, X_{A \rightarrow B \rightarrow A}\}$ and the pair of isomorphic representations $\{I_A, I_{A \rightarrow B}\}$, which are separated into style and content components.
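CoDAGANs add a new supervised component to the completely unsupervised loss of Unsupervised Image-to-Image Translation methods. The supervised component for CoDAGANs is the default cost function for supervised classification/segmentation tasks in DNNs, the Cross-Entropy loss:
$$\mathcal{L}_{sup}(Y, \hat{Y}) = -Y \log(\hat{Y}) - (1 - Y) \log(1 - \hat{Y}),$$
where $Y$ represents the ground truths and $\hat{Y}$ represents the predicted probability for the positive class in a binary classification scenario. The full objective for CoDAGANs is, therefore, defined by:
$$\mathcal{L}_{CoDA} = \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{sup}\,\mathcal{L}_{sup},$$
where $\lambda_{sup}$ weights the supervised term against the Cycle-Consistency and adversarial terms of the translation network.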
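For reference, the binary Cross-Entropy over a set of pixels can be computed as in the sketch below. Averaging over pixels and clamping probabilities with a small `eps` are common conventions assumed here for numerical safety; neither is taken from the text, and the function name is illustrative.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    targets: Sequence[int], probs: Sequence[float], eps: float = 1e-7
) -> float:
    """Mean binary Cross-Entropy between 0/1 targets and predicted probabilities."""
    if len(targets) != len(probs) or not targets:
        raise ValueError("targets and probs must be non-empty and equally sized")
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A prediction of 0.5 for every pixel yields a loss of log 2, and the loss decreases as the predicted probabilities move toward the correct labels.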
4 Experimental Setup
All networks were implemented using the PyTorch framework (https://pytorch.org/). We built upon the MUNIT/UNIT implementation from Huang et al. (2018) (https://github.com/nvlabs/MUNIT) and segmentation architectures from the pytorch-semantic-segmentation repository (https://github.com/zijundeng/pytorch-semantic-segmentation/). Tests were conducted on NVIDIA Titan X Pascal GPUs with 12GB of memory. The official implementation for CoDAGANs is available at our website (http://www.patreo.dcc.ufmg.br/).
Architectural choices and hyperparameters can be further analysed according to the code and configuration files on the project’s website, but the main ones are described in the following paragraphs. CoDAGANs for CXR and Mammographic images were run for 400 epochs, as this was empirically found to be a good stopping point for convergence in these networks. Learning rate was set to with L2 normalization by weight decay with value . $G_E$ is composed of two downsampling layers followed by two residual layers for both UNIT Liu et al. (2017) and MUNIT Huang et al. (2018) based implementations, as these configurations were observed to simultaneously yield satisfactory results and have small GPU memory requirements. The first downsampling layer contains 32 convolutional filters, doubling this number for each subsequent layer. $D$ was implemented using a Least Squares Generative Adversarial Network (LSGAN) Mao et al. (2017) objective with two layers, although, differently from MUNIT, we do not employ multiscale discriminators due to GPU memory constraints. Also distinctly from MUNIT and UNIT, we do not employ the perceptual loss, further detailed by Huang et al. (2018), due to the dissimilarities between the domains used for pretraining and the biomedical images used in our work.
We chose the state-of-the-art Adam solver Kingma and Ba (2014), as it mitigates several optimization problems of the traditional Stochastic Gradient Descent (SGD), helping to counterweight the inherent difficulties in training GANs.
We tested our methodology on a total of 7 datasets, 4 of them being Chest X-Ray (CXR) datasets and 3 of them being mammography datasets. The chosen CXR datasets are the Japanese Society of Radiological Technology (JSRT) Shiraishi et al. (2000) (http://db.jsrt.or.jp/eng.php), the Chest X-Ray 8 Wang et al. (2017) (https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/37178474737) and the Montgomery and Shenzhen sets Jaeger et al. (2014) (https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/). The mammographic image datasets used in this work are the Mammographic Image Analysis Society (MIAS) dataset Suckling et al. (2015) (https://www.repository.cam.ac.uk/handle/1810/250394), INbreast Moreira et al. (2012) (http://medicalresearch.inescporto.pt/breastresearch/index.php/Get_INbreast_Database) and the Digital Database for Screening Mammography (DDSM) Heath et al. (2000) (http://marathon.csee.usf.edu/Mammography/Database.html).
JSRT contains pixel-level ground truths for lungs, clavicles and heart segmentations, while the Montgomery Set contains only annotations for the lungs. The Shenzhen and the Chest X-Ray 8 datasets do not possess any pixel-level annotations. For the MIAS dataset, we used the same annotations for pectoral muscle and breast region segmentation as Rampun et al. (2017), which were provided by the authors. The INbreast dataset also contains annotations for the pectoral muscle in its images, while a simple thresholding can be used in order to remove the background, easily allowing for the construction of breast region ground truths. Finally, DDSM does not contain any pixel-level annotation for any of these tasks. While MIAS and DDSM were digitized from Screen-Film Mammography (SFM) data, INbreast is a more modern Full Field Digital Mammography (FFDM) dataset, thus it is expected that the different digitization methods present a distinct domain shift. The datasets without pixel-level annotations provide only unsupervised information to our method, while we vary the percentages of ground truth use for the labeled datasets, as explained further in Section 5.
4.3 Experimental Protocol
All datasets were sliced into training and test sets according to an 80%/20% division. No validation set was allowed, as most small data scenarios cannot spare labeled data for validation purposes. We considered the epochs 360, 370, 380, 390 and 400 for computing the mean values presented in Section 5 in order to correctly model the statistical variations of CoDAGANs in the DA tasks. Therefore, all objective metric average values and standard deviations were computed according to these five last epochs.
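The aggregation over the five evaluated epochs amounts to a mean and a standard deviation per metric. The sketch below uses the sample standard deviation from the standard library, which is an assumption (the text does not say whether the sample or population estimator was used), and the scores are made-up placeholders rather than results from the paper.

```python
import statistics
from typing import Dict, Tuple

EVAL_EPOCHS = (360, 370, 380, 390, 400)


def summarize(scores_by_epoch: Dict[int, float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over the evaluated epochs."""
    values = [scores_by_epoch[epoch] for epoch in EVAL_EPOCHS]
    return statistics.mean(values), statistics.stdev(values)


# Placeholder scores for illustration only; not values reported in the paper.
toy_scores = {360: 0.90, 370: 0.91, 380: 0.92, 390: 0.93, 400: 0.94}
mean_score, std_score = summarize(toy_scores)
```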
For quantitative assessment we chose the Jaccard (Intersection over Union, IoU) metric, which is a popular choice in segmentation and detection tasks and is widely used in all tested applications, allowing for easy comparisons with previous literature. Jaccard ($\mathcal{J}$) for a binary classification task is given by the following equation:
$$\mathcal{J} = \frac{TP}{TP + FN + FP},$$
where $TP$, $FN$ and $FP$ refer to True Positives, False Negatives and False Positives, respectively. $\mathcal{J}$ values range between 0 and 1, but for aesthetic reasons we present these metrics as percentages in Section 5.
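A direct computation of the Jaccard metric from True Positives, False Negatives and False Positives over flattened binary masks might look as follows. Returning 1.0 when both masks are empty is a convention assumed here, and the function name is illustrative.

```python
from typing import Sequence


def jaccard(pred: Sequence[int], target: Sequence[int]) -> float:
    """Jaccard (IoU) between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denominator = tp + fn + fp
    if denominator == 0:
        return 1.0  # assumed convention: two empty masks match perfectly
    return tp / denominator


partial = jaccard([1, 1, 0, 0], [1, 0, 1, 0])  # TP=1, FP=1, FN=1
perfect = jaccard([1, 0, 1, 0], [1, 0, 1, 0])
```

Multiplying the returned value by 100 gives the percentage form used when reporting results.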
Therefore, Fine-Tuning is our main baseline for SSDA and FSDA, but it still does not work in UDA scenarios, as it requires labeled data. Many UDA shallow and deep methods rely on variations of the Maximum Mean Discrepancy (MMD) Li et al. (2017) metric to perform knowledge transfer by comparing the statistical moments of the two datasets. MMD-based approaches are a kind of Feature Representation Learning that only takes into account the feature space, ignoring the label space, and is therefore fully unsupervised. As there are no MMD methods specifically designed for dense labeling tasks, we adapted a Radial Basis Function (RBF) kernel version of MMD Li et al. (2017) (https://github.com/OctoberChang/MMD-GAN) for dense prediction and compare it to our method in Section 5.
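To make the MMD statistic concrete, the sketch below computes the standard biased estimate of the squared MMD between two sets of feature vectors with an RBF kernel. It illustrates only the basic statistic: the adaptation to dense prediction used as a baseline is not described in the text and is not reproduced here, and the kernel bandwidth `gamma` and all function names are assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector], gamma: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(
    source: Sequence[Vector], target: Sequence[Vector], gamma: float = 1.0
) -> float:
    """Biased estimate of the squared MMD between two feature sets."""
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
shifted = mmd_squared([[0.0]], [[3.0]])
```

The statistic is zero when both sets are identical and grows as the two feature distributions move apart, which is why minimizing it aligns source and target features without using labels.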
5 Results and Discussion
The discussion of the quantitative results presented in this section is divided by application, but all the numerical results are presented in Table 2. Section 5.1 presents the UDA, SSDA and FSDA results regarding the segmentation of anatomical structures in mammographic images, while Section 5.2 discusses the results for CXR images. Qualitative analyses of segmentations in both unlabeled and labeled data are presented in Section 5.3 for all applications, as well as visual assessments of the isomorphic representations generated by CoDAGANs, where supervised learning is performed and Domain Generalization is enforced.
5.1 Results for the Mammographic Datasets
We evaluated six different test scenarios of transfer learning in mammographic images from UDA () to FSDA (), passing through several scenarios of SSDA (, , and ). $\mathcal{J}$ average values and standard deviations for mammographic image tasks are shown in Table 2(b) (pectoral segmentation) and Table 2(a) (breast region segmentation). The first lines in these tables present the label configuration used in each column for all mammographic datasets used in the test set. In these tables, lines beginning with “% Labels” indicate the label configuration used in the tests presented in the corresponding column for the three datasets. Lines with “(X) to (Y)” represent Fine-Tuning or MMD results from dataset (X) to dataset (Y). Lines with only (X) present the results for CoDAGANs or training From Scratch for a single dataset according to the column label configuration. The following paragraphs discuss the DA results for CoDAGANs using MUNIT (), CoDAGANs using UNIT (), using pretrained models from a source dataset to a target one with (, , , and ) and without () Fine-Tuning, UDA transfer using the RBF MMD described in Section 4.4 and training the segmentation networks from scratch on the fully labeled or partially labeled datasets. Bold values in these tables indicate the best results for the corresponding dataset. As there are three datasets being evaluated in Table 2(b), there are three bold values for each experiment. Analogously, Table 2(a) only has two bold values per column because only two datasets are being objectively evaluated.
Even for the completely unlabeled case in pectoral segmentation, CoDAGANs achieved results of 83.75% for the INbreast dataset and 86.94% for the DDSM dataset, while the best baselines achieved 79.40% and 77.75% respectively for these image sets. The best results for all datasets in the FSDA scenario () can be attributed to the capability of CoDAGANs to learn from different data sources, which allowed the method to learn from both MIAS and INbreast labels and transfer this knowledge to DDSM.
The MIAS dataset was used as a source dataset in both Tables 2(b) and 2(a), providing 100% of labels in all test cases. INbreast is a more modern dataset and the amount of labels from these data we used varied according to the experiment from 0% of labels () to 100% of labels (). INbreast serves, therefore, as both a source and a target dataset in our tests.
As DDSM does not possess pixel-level labels, we created ground truths only for a small subset of images from this dataset for the pectoral muscle segmentation task in order to objectively evaluate the UDA. One should notice that these ground truths were used only in the test procedure, but not in training, as all cases presented in Tables 2(b) and 2(a) show DDSM with 0% of labeled data. Thus DDSM is used only as a target dataset in our experiments. Breast region segmentation analysis on DDSM was only performed qualitatively, as there are no ground truths for this task.
One can notice that in Table 2(b) CoDAGANs achieved the best results for most test scenarios in pectoral muscle segmentation, losing only on , and in the INbreast dataset. In both UDA and FSDA CoDAGANs achieved top scores for all datasets, outperforming all baselines. Using a single model, CoDAGANs achieved state-of-the-art results on both source and target data.
Breast region segmentation (Table (a)a) proved to be an easier task, with most methods achieving values higher than 95%. The MMD-based transfer performed in these tests surpass CoDAGANs in all test cases, even though CoDAGANs achieved competitive results in the target INbreast dataset. The relative lower performance of CoDAGANs in this task can be attributed to the high transferability of pretrained models in this task, as can be seen in experiment , where pretrained models with no Fine-Tuning already achieved a value of 95.02%. CoDAGANs were able, though, to aggregate both source and target data in experiment to make better predictions in the MIAS source dataset.
Figure 11 shows the values from Tables (b) and (a) with confidence intervals for . A first noteworthy trait in Figures (a) and (d) is that CoDAGANs maintained their capability to perform inference on the MIAS source dataset for all experiments. It is also noticeable that CoDAGANs built upon both MUNIT and UNIT behaved very similarly in almost all cases.
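For readers unfamiliar with how such intervals are obtained, the sketch below computes a confidence interval for the mean of a set of per-image scores. It is only illustrative: the 95% level, the normal approximation and the score values are assumptions of this sketch, not the procedure or data of this work.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(scores, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation interval for the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate the spread")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return mean(scores), half_width


# Hypothetical per-image segmentation scores for one method on one dataset.
per_image_scores = [0.91, 0.88, 0.93, 0.90, 0.86, 0.92]
center, half_width = confidence_interval(per_image_scores)
print(f"{center:.3f} +/- {half_width:.3f}")
```

With as few samples as in this toy example, a Student's t critical value would give a wider and more appropriate interval; the normal approximation is used here only to keep the sketch within the standard library.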
One can see in Figure (b) that CoDAGANs achieve better results than both pretrained models and MMD in UDA scenarios, while maintaining competitive results in SSDA experiments. Figures (b) and (e) also show that training the segmentation networks from scratch with no DA is by far the worst alternative until a relatively large amount of labeled samples is present in the target set. Training from scratch only presented itself as a viable alternative for these dense labeling tasks in the and scenarios, where there is an abundance of labeled samples.
5.2 Results for the CXR Datasets
CXR results can be seen in Tables (c) and (d) for all three labeled tasks in the JSRT dataset: lung, heart and clavicle segmentation. The JSRT and Montgomery image sets are objectively evaluated in the lung field segmentation task, as shown in Table (c), while Shenzhen and Chest X-Ray 8 do not possess pixel-level ground truths for quantitative evaluation. As there are no ground truths for heart and clavicle segmentation apart from the JSRT ones, Table (d) only shows the segmentation results for these tasks in the source dataset, which is the only one that can be assessed quantitatively. UDA for unlabeled datasets is evaluated qualitatively in Section 5.3. Analogously to Section 5.1, bold values in Tables (c) and (d) represent the best overall results in a given label configuration for each of the datasets that can be quantitatively assessed for the corresponding task.
Figure 15 shows the confidence intervals for in lung segmentation for the JSRT source dataset (Figure (a)), the target image set Montgomery (Figure (b)) and for JSRT in heart and clavicle segmentation (Figure (c)).
One can see from Figures (a) and (c) that segmentation in the source dataset is preserved, while Figure (b) shows the UDA, SSDA and FSDA efficiency of CoDAGANs in a target dataset. Instead of simply preserving the segmentation capabilities of the model on the source dataset in the task of clavicle segmentation (a highly unbalanced and hard task for CXR images), CoDAGANs significantly improved the performance of the segmentation architecture with the use of the unlabeled data from the other datasets, as can be seen in Figure (c). This result shows the potential of CoDAGANs for merging both labeled and unlabeled data in their semi-supervised learning process.
Figure (b) shows that training from scratch, MMD, fixed pretrained models and Fine-Tuning are all suboptimal when labeled data is scarce in the target domain. As in mammogram segmentation, MMD on hardly improves segmentation results compared to simply using the fixed pretrained networks from other datasets, reinforcing that this method does not work properly for dense labeling problems. CoDAGANs are shown to be significantly more robust in these scarce label scenarios, achieving close to 90% of in the target Montgomery dataset even for the completely unlabeled transfer experiment (). This test case shows the inability of MMD and pretrained models to compensate for the domain shift between the two datasets in these dense labeling tasks, with both methods achieving scores of less than 10% in . Training from scratch and Fine-Tuning only achieve competitive results in and , both cases where there is abundant labeled data in the target domain.
5.3 Qualitative Analysis
Figure 22 shows qualitative segmentation results for two tasks of mammographic image segmentation and three tasks of CXR segmentation, as well as some noticeable segmentation errors. Figure (d) shows predictions for pectoral muscle segmentation , while Figure (e) presents breast region segmentation on experiment . Experiment for lung field segmentation is shown in Figure (a), while heart and clavicle segmentation DA experiments () are shown in Figures (b) and (c), respectively.
The first four rows of images in Figure (d) show samples from the labeled datasets MIAS and INbreast, with almost perfect segmentation predictions in images with quite distinct density patterns. The last two rows show the UDA results for samples from the unlabeled dataset DDSM (labels were used only for testing), in which the pectoral muscles were also adequately detected and segmented. Figure (e) shows results for breast region segmentation in the same datasets as Figure (d). Background on the target DDSM dataset is considerably harder to segment than on the source MIAS and INbreast datasets due to outdated digitization procedures, which result in fuzzier breast boundary contours and screen film artifacts. Despite this, CoDAGANs were able to correctly assess the breast region area for most images using the supervised knowledge transferred from the easier source datasets.
Figure (a) shows DA results for lung field segmentation in two labeled (JSRT and Montgomery) and two unlabeled target datasets (Shenzhen and Chest X-Ray 8). Target datasets in this case are considerably harder than the source ones due to poor image contrast, the presence of unforeseen artifacts such as pacemakers, rotation and scale differences and a much wider variety of lung sizes, shapes and health conditions. Yet CoDAGANs were able to perform adequate lung segmentations in the vast majority of images, only presenting errors in distinctly difficult images. Heart and clavicle segmentation (Figures (b) and (c)) are harder tasks than lung segmentation due to heart boundary fuzziness and a high variability of clavicle shapes and positions. In addition, clavicle segmentation is a highly unbalanced task. Those factors, paired with the fact that the well-behaved samples from the JSRT dataset are the only source of labels for this task, contributed to higher segmentation error rates, mainly in clavicle segmentation. Heart segmentation results, even though qualitatively worse than lung segmentations, were still considered adequate for the vast majority of target dataset images. Some wrongly segmented samples from DDSM, Chest X-Ray 8 and Shenzhen can be seen in Figure (f). A full assessment of results and CoDAGAN errors can be seen on this project's webpage (http://www.patreo.dcc.ufmg.br/codagans/).
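The effect of class imbalance on evaluation can be illustrated with a small sketch, assuming an overlap metric such as the Jaccard index mentioned in the abstract. In a toy mask where the structure of interest covers only 2% of the pixels, a prediction that misses the structure entirely still reaches a high pixel accuracy, while its Jaccard index is zero. The mask size and foreground ratio below are hypothetical and are not measured from the clavicle labels.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose binary label matches the ground truth."""
    total = 0
    correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(bool(p) == bool(t))
    return correct / total


def jaccard(pred, truth):
    """Jaccard index between two binary masks given as lists of rows."""
    intersection = sum(1 for pr, tr in zip(pred, truth)
                       for p, t in zip(pr, tr) if p and t)
    union = sum(1 for pr, tr in zip(pred, truth)
                for p, t in zip(pr, tr) if p or t)
    # Assumed convention: two empty masks are considered to agree perfectly.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 10x10 ground truth in which the structure covers 2 of 100 pixels.
truth = [[0] * 10 for _ in range(10)]
truth[4][4] = 1
truth[4][5] = 1

# A prediction that labels every pixel as background.
empty_pred = [[0] * 10 for _ in range(10)]

acc = pixel_accuracy(empty_pred, truth)  # 98 of 100 pixels are correct
jac = jaccard(empty_pred, truth)         # no overlap with the structure
print(acc, jac)
```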
Another important qualitative assessment of CoDAGANs is to show that the same objects in distinct datasets are represented similarly in the isomorphic representation of the algorithm. This is shown in Figure 25 for three different activation channels in four CXR images (Figure (a)) and three mammographic samples (Figure (b)) from distinct datasets.
Despite the clear visual distinctions between the original samples, visual patterns that compose the patient's anatomical structures in Figure (a), such as ribs and lung contours, are visibly similar in the activation channels of samples from all four CXR datasets. In Figure (b), high density tissue patterns and important object contours in the mammograms from distinct sources are encoded similarly by CoDAGANs. Breast boundaries are also visually similar across samples from all mammographic datasets, as CoDAGANs are able to infer that this information is semantically similar despite the differences in the visual patterns of the images. These results show that CoDAGANs successfully create a joint representation for high semantic-level information, which encodes analogous visual patterns across datasets in a similar manner by looking only at the marginal distribution.
This paper proposed a method that covers the whole spectrum of UDA, SSDA and FSDA in dense labeling tasks for multiple source and target biomedical datasets. We performed an extensive quantitative and qualitative experimental evaluation on several distinct domains, datasets and tasks, comparing CoDAGANs with traditional baselines in the DA literature. CoDAGANs were shown to be an effective DA method that could learn a single model that performs satisfactorily for several different datasets, even when the visual patterns of these data were clearly distinct. We also showed that CoDAGANs work when built upon two distinct Unsupervised Image-to-Image Translation methods (UNIT Liu et al. (2017) and MUNIT Huang et al. (2018)), evidencing their agnosticism to the image translation architecture.
CoDAGANs achieved scores in UDA settings comparable to those of other methods in FSDA. These experiments also showed that MMD-based strategies Li et al. (2017) were ineffective in scarce labeling scenarios. Our method presented significantly better values in most experiments where labeled data was scarce in the target datasets, while the baselines only achieved good objective results when labels were abundant. CoDAGANs were observed to perform satisfactory DA even when the labeled source dataset was considerably simpler than the target unlabeled datasets. In experiment for CXR lung, clavicle and heart segmentations, the JSRT source dataset contains images acquired in a much more controlled environment than all other CXR target data, while still working as a viable source of labels for UDA, SSDA and FSDA. Further evidence of the capabilities of CoDAGANs is their good performance in highly imbalanced tasks, as in the case of clavicle segmentation, where the Region of Interest represents only a tiny portion of the pixels.
CoDAGANs, despite being tested only for segmentation tasks in this paper, are not conceptually limited to dense labeling tasks nor to biomedical images. The main future work for CoDAGANs is testing their DA capabilities in other tasks, such as classification, regression and detection. We also intend to test CoDAGANs in other biomedical image domains, such as Magnetic Resonance Imaging and Computerized Tomography.
The authors would like to thank NVIDIA for the donation of the GPUs, and CAPES, CNPq and FAPEMIG (APQ-00449-17) for the financial support that allowed the execution of this work.
- Litjens et al. (2017) G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, C. I. Sánchez, A Survey on Deep Learning in Medical Image Analysis, Medical Image Analysis 42 (2017) 60–88.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in: NIPS, 2012, pp. 1097–1105.
- Zhang et al. (2017) J. Zhang, W. Li, P. Ogunbona, Transfer Learning For Cross-Dataset Recognition: A Survey (2017).
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database, in: CVPR, IEEE, 2009, pp. 248–255.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Nets, in: NIPS, 2014, pp. 2672–2680.
- Isola et al. (2017) P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, in: CVPR, IEEE, 2017, pp. 5967–5976.
- Mirza and Osindero (2014) M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, arXiv preprint arXiv:1411.1784 (2014).
- Zhu et al. (2017) J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, in: ICCV, 2017.
- Liu et al. (2017) M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised Image-to-Image Translation Networks, in: NIPS, 2017, pp. 700–708.
- Huang et al. (2018) X. Huang, M.-Y. Liu, S. Belongie, J. Kautz, Multimodal Unsupervised Image-to-Image Translation, arXiv preprint arXiv:1804.04732 (2018).
- Simonyan and Zisserman (2014) K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556 (2014).
- Choi et al. (2017) Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, J. Choo, StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, arXiv preprint 1711 (2017).
- Patel et al. (2015) V. M. Patel, R. Gopalan, R. Li, R. Chellappa, Visual Domain Adaptation: A Survey of Recent Advances, IEEE Signal Processing Magazine 32 (2015) 53–69.
- Shao et al. (2015) L. Shao, F. Zhu, X. Li, Transfer Learning for Visual Categorization: A Survey, IEEE Transactions on Neural Networks and Learning Systems 26 (2015) 1019–1034.
- Wang and Deng (2018) M. Wang, W. Deng, Deep Visual Domain Adaptation: A Survey, Neurocomputing (2018).
- Csurka (2017) G. Csurka, Domain Adaptation for Visual Applications: A Comprehensive Survey, arXiv preprint arXiv:1702.05374 (2017).
- Ronneberger et al. (2015) O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in: MICCAI, Springer, 2015, pp. 234–241.
- Kingma and Welling (2013) D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
- Liu and Tuzel (2016) M.-Y. Liu, O. Tuzel, Coupled Generative Adversarial Networks, in: NIPS, 2016, pp. 469–477.
- Mao et al. (2017) X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, S. P. Smolley, Least Squares Generative Adversarial Networks, in: ICCV, IEEE, 2017, pp. 2813–2821.
- Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, arXiv preprint arXiv:1412.6980 (2014).
- Shiraishi et al. (2000) J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K.-i. Komatsu, M. Matsui, H. Fujita, Y. Kodera, K. Doi, Development of a Digital Image Database for Chest Radiographs with and Without a Lung Nodule: Receiver Operating Characteristic Analysis of Radiologists’ Detection of Pulmonary Nodules, American Journal of Roentgenology 174 (2000) 71–74.
- Wang et al. (2017) X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers, ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, in: CVPR, IEEE, 2017, pp. 3462–3471.
- Jaeger et al. (2014) S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wáng, P.-X. Lu, G. Thoma, Two Public Chest X-ray Datasets for Computer-Aided Screening of Pulmonary Diseases, Quantitative Imaging in Medicine and Surgery 4 (2014) 475.
- Suckling et al. (2015) J. Suckling, J. Parker, D. Dance, S. Astley, I. Hutt, C. Boggis, I. Ricketts, E. Stamatakis, N. Cerneaz, S. Kok, et al., Mammographic Image Analysis Society (MIAS) Database (2015).
- Moreira et al. (2012) I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, J. S. Cardoso, INbreast: Toward a Full-Field Digital Mammographic Database, Academic Radiology 19 (2012) 236–248.
- Heath et al. (2000) M. Heath, K. Bowyer, D. Kopans, R. Moore, W. P. Kegelmeyer, The Digital Database for Screening Mammography, in: Proceedings of the 5th International Workshop on Digital Mammography, Medical Physics Publishing, 2000, pp. 212–218.
- Rampun et al. (2017) A. Rampun, P. J. Morrow, B. W. Scotney, J. Winder, Fully Automated Breast Boundary and Pectoral Muscle Segmentation in Mammograms, Artificial Intelligence in Medicine 79 (2017) 28–41.
- Li et al. (2017) C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, B. Póczos, MMD GAN: Towards Deeper Understanding of Moment Matching Network, in: NIPS, 2017, pp. 2203–2213.