Revisiting CycleGAN for semi-supervised segmentation
In this work, we study the problem of training deep networks for semantic image segmentation using only a fraction of annotated images, which may significantly reduce human annotation efforts. In particular, we propose a strategy that exploits the unpaired image style transfer capabilities of CycleGAN in semi-supervised segmentation. Unlike recent works using adversarial learning for semi-supervised segmentation, we enforce cycle consistency to learn a bidirectional mapping between unpaired images and segmentation masks. This adds an unsupervised regularization effect that boosts the segmentation performance when annotated data is limited. Experiments on three different public segmentation benchmarks (PASCAL VOC 2012, Cityscapes and ACDC) demonstrate the effectiveness of the proposed method. The proposed model achieves an improvement of 2-4% with respect to the baseline and outperforms recent approaches for this task, particularly in the low labeled data regime.
1 Introduction
Deep learning methods have recently emerged as an efficient solution for semantic image segmentation, achieving outstanding performance in a wide range of applications like analyzing natural scenes, autonomous driving or medical imaging. Despite their success, a main limitation of these methods is the need for large training datasets of pixel-level annotated images. Acquiring such labeled images is a time-consuming process that may require user expertise in various scenarios. This impedes the applicability of deep models to applications where labeled images are scarce.
Semi-supervised learning (SSL) has been proposed to overcome the shortage of labeled data. In this scenario, we assume that a large set of unlabeled images is available during training, in addition to a small set of images with strong annotations. Consider an SSL segmentation setting with two distinct subsets: a labeled set, which contains images and their corresponding ground-truth masks, and an unlabeled set, which contains images only (typically, the labeled set is much smaller than the unlabeled one). In this setting, the objective is often formulated as maximizing a log-likelihood with respect to the learnable parameters of a deep network, through the supervision provided by the labeled set. On the other hand, the images of the unlabeled set can be leveraged in different ways, typically introducing a regularization effect in deep models and therefore improving their generalization capabilities.
Generative adversarial networks (GANs) [8] have been shown to be an efficient solution for unsupervised domain adaptation [10, 9, 28, 29], a problem related to semi-supervised learning. GAN-based methods for domain adaptation use adversarial learning to match the distributions of source and target data, commonly at the input or in feature space. Recently, the CycleGAN model [34] has become a popular choice to transfer image style between domains, as it eliminates the restriction of corresponding image pairs during training [12]. This model finds a mapping between source and target images which preserves key attributes between the input and the transformed image using a cycle consistency loss. While CycleGAN has been widely employed to learn a mapping between different domains, it has not yet been investigated in more traditional semi-supervised scenarios where there is no domain shift between labeled and unlabeled data.
In this work, we leverage the unpaired domain adaptation ability of CycleGAN to learn a bidirectional mapping from unlabeled real images to available ground truth masks. This mapping, learned in conjunction with the standard supervised mapping from labeled images to their corresponding labels, acts as an unsupervised regularization loss which helps train the network when labeled data is limited. The proposed method contrasts with recent work on domain adaptation for segmentation [9, 13], where the CycleGAN is employed to map images across two domains. It also differs significantly from recent work using GAN-generated images for semi-supervised segmentation [27], in which cycle consistency is not enforced. The main contributions of this paper can be summarized as follows:
To our knowledge, this is the first semi-supervised segmentation method using CycleGAN to learn a cycle-consistent mapping between unlabeled real images and ground truth masks. The proposed technique acts as an unsupervised regularization prior which improves segmentation performance when labeled data is limited.
We validate our approach on three challenging segmentation tasks from different applications (i.e., natural scenes, autonomous driving and medical imaging), and show that our method is dataset-independent and effective for a wide range of scenarios.
Additionally, we present an ablation study which analyzes the effect of various components of the proposed unsupervised loss and demonstrates the usefulness of these components for improving performance. We believe this analysis is important for future investigations of CycleGANs applied to semi-supervised segmentation.
The rest of the paper is organized as follows. In Section 2, we give a brief overview of relevant work on semantic segmentation with a focus on semi-supervised learning and adversarial learning. Section 3 then presents our model which is evaluated on three challenging datasets in Section 4. Finally, we conclude with a summary of our main contributions and results.
2 Related work
Supervised methods based on convolutional neural networks (CNNs) are driving progress in semantic segmentation [18, 26, 4]. Despite their success, training these networks requires a large number of densely-annotated images which are expensive to obtain. A solution to address this limitation is weakly-supervised learning, where easier-to-obtain annotations like image-level tags [21, 15, 32], bounding boxes [6, 25] or scribbles [17] are instead used to train segmentation models. However, weakly-supervised methods still require some human interaction, which may be difficult to get in certain scenarios.
Semi-supervision is a special type of weakly-supervised learning where many unlabeled images are also available for training [1, 2, 20, 24, 23, 33]. Instead of relying on weak annotations, semi-supervised learning (SSL) typically uses domain- or task-agnostic properties of the data to regularize learning. Recently, several SSL methods have been proposed for semantic segmentation, for instance based on self-training, distillation, attention learning, manifold embedding, co-training, and temporal ensembling. Like these methods, the proposed approach can also leverage unlabeled images directly, without the need for weak annotations or task-specific priors.
Adversarial learning has also shown great promise for training deep segmentation models with few strongly-annotated images [27, 11, 31]. An interesting approach to include unlabeled images during training is to add a discriminator network in the model, which must determine whether the output of the segmentation network corresponds to a labeled or unlabeled image [31]. This encourages the segmentation network to have a similar distribution of outputs for images with and without annotations, thereby helping generalization. A potential issue with this approach is that the adversarial network can have a reverse effect, where the output for annotated images becomes increasingly similar to incorrect predictions obtained for unlabeled images. A related strategy uses the discriminator to predict a confidence map for the segmentation, enforcing this output to be maximum for annotated images [11]. For unlabeled images, areas of high confidence are used to update the segmentation network in a self-teaching manner. The main limitation of this approach is that a confidence threshold must be provided, the value of which can affect performance.
Until now, only a single work has applied Generative Adversarial Networks (GANs) for semi-supervised segmentation [27]. In this previous work, generated images are used for training in addition to both labeled and unlabeled data. The trained segmentation network must predict the correct labels for real images or a special fake label for generated images. For this method to work, fake images should be generated from outside the distribution of real images so that the segmentation network learns a better representation of the manifold (i.e., fake images constitute negative examples). In contrast, our method uses cycle-consistent GANs to better estimate the distribution of real images and their corresponding segmentation masks.
3.1 CycleGAN for semi-supervised segmentation
The proposed architecture for semi-supervised segmentation, illustrated in Figure 1, is based on the cycle-consistent GAN (CycleGAN) model [34], which has shown outstanding performance for unpaired image-to-image translation. This architecture is composed of four inter-connected networks, two conditional generators and two discriminators, which are trained simultaneously. In the original CycleGAN model, the generators are employed to learn a bidirectional mapping from one image domain to the other. On the other hand, discriminators try to determine whether an image from the corresponding domain is real or generated. By fooling the discriminators through adversarial learning, the model thus learns to generate images from the true distribution without requiring paired images. A cycle-consistency loss is also added to ensure that generators are consistent, i.e. that we recover the same image when going through both generators sequentially.
In our semi-supervised segmentation model, the CycleGAN is instead used to map images to their corresponding segmentation mask and vice-versa. The first generator ($G_{IS}$), corresponding to the segmentation network that we want to obtain, learns a mapping from an image to its segmentation labels. The first discriminator ($D_S$) tries to differentiate these generated labels from real segmentation masks. Note that the combination of $G_{IS}$ and $D_S$ is similar to the semi-supervised segmentation approach presented in [31]. Conversely, the second generator ($G_{SI}$) learns to map a segmentation mask to its image. In our semi-supervised segmentation setting, this generator is only used to improve training. Likewise, the second discriminator ($D_I$) receives an image as input and predicts whether this image is real or generated. To enforce cycle consistency, generators are trained so that feeding the labels generated by $G_{IS}$ for an image into $G_{SI}$ gives that same image, and passing back to $G_{IS}$ the image generated by $G_{SI}$ for a segmentation mask gives that same mask. Figure 2 shows examples of images, ground truth labels, generated images and generated labels obtained for the three datasets used in our experiments.
Figure 2: Examples obtained for the three datasets. Columns: Image, Ground truth labels, Generated image, Generated labels.
3.2 Loss functions
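In this section, we formally define the loss functions employed to train our model in a semi-supervised setting where the data comes from three distributions: labeled images ($\mathcal{X}_L$), ground truth masks of labeled images ($\mathcal{Y}_L$), and unlabeled images ($\mathcal{X}_U$). The first loss function is a standard supervised segmentation loss that encourages the segmentation network ($G_{IS}$) to generate labels matching the ground truth masks:
$$\mathcal{L}^{S}_{gen}(G_{IS}) = \mathbb{E}_{x, y \sim \mathcal{X}_L, \mathcal{Y}_L}\big[\mathcal{H}\big(y, G_{IS}(x)\big)\big] \tag{1}$$
where $\mathcal{H}$ is the pixelwise cross-entropy defined as
$$\mathcal{H}(y, \hat{y}) = -\sum_{j=1}^{N} \sum_{k=1}^{K} y_{j,k} \log \hat{y}_{j,k} \tag{2}$$
In this expression, $y_{j,k}$ and $\hat{y}_{j,k}$ are the ground truth and predicted probabilities that pixel $j \in \{1, \ldots, N\}$ has label $k \in \{1, \ldots, K\}$. Likewise, we employ a pixelwise L2 norm between a labeled image and the image generated from its corresponding ground truth as supervised loss to train the image generator $G_{SI}$:
$$\mathcal{L}^{I}_{gen}(G_{SI}) = \mathbb{E}_{x, y \sim \mathcal{X}_L, \mathcal{Y}_L}\big[\| G_{SI}(y) - x \|_2^2\big] \tag{3}$$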
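The two supervised terms described above can be illustrated with a minimal pure-Python sketch. The function names, the flat list layout (one probability vector per pixel) and the `eps` clamp are assumptions made for illustration, and the losses are summed over pixels here, whereas an actual implementation may average them.

```python
import math


def pixelwise_cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum over pixels j and labels k of -y[j][k] * log(y_hat[j][k]).

    y_true and y_pred are lists with one entry per pixel; each entry is a
    list of K class probabilities. eps (an assumption) avoids log(0).
    """
    total = 0.0
    for true_probs, pred_probs in zip(y_true, y_pred):
        for y, y_hat in zip(true_probs, pred_probs):
            total -= y * math.log(max(y_hat, eps))
    return total


def pixelwise_l2(image, generated):
    """Squared L2 distance between an image and a generated image (flat lists)."""
    return sum((a - b) ** 2 for a, b in zip(image, generated))


# Two pixels, two classes: the ground truth is one-hot.
y_true = [[1.0, 0.0], [0.0, 1.0]]
y_pred = [[0.5, 0.5], [0.25, 0.75]]
ce = pixelwise_cross_entropy(y_true, y_pred)  # -(log 0.5 + log 0.75)

# Three-pixel grayscale image versus its generated counterpart.
l2 = pixelwise_l2([0.0, 0.5, 1.0], [0.0, 0.0, 0.5])  # 0 + 0.25 + 0.25
```

Only the probability assigned to the correct class contributes to the cross-entropy when the ground truth is one-hot, which is why the first value reduces to the sum of two logarithms.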
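To exploit unlabeled images, we incorporate two additional types of losses: adversarial losses and cycle-consistency losses. The adversarial losses are used to train the generators and discriminators in a competing fashion, and help the generators produce realistic images and segmentation masks. To improve the training of the discriminators, we follow the approach presented in [19] and use a least-squares loss instead of the traditional cross-entropy. It was shown in this previous work that this loss function leads to minimizing the Pearson $\chi^2$ divergence. Suppose that $D_S(\hat{y})$ is the predicted probability that segmentation labels $\hat{y}$ correspond to a ground truth mask. We define the adversarial loss for $D_S$ as
$$\mathcal{L}^{S}_{disc}(G_{IS}, D_S) = \mathbb{E}_{y \sim \mathcal{Y}_L}\big[(D_S(y) - 1)^2\big] + \mathbb{E}_{x' \sim \mathcal{X}_U}\big[\big(D_S(G_{IS}(x'))\big)^2\big] \tag{4}$$
where $y \sim \mathcal{Y}_L$ is a ground truth mask, $x' \sim \mathcal{X}_U$ is an unlabeled image and $G_{IS}$ is the segmentation network. Similarly, let $D_I(\hat{x})$ be the predicted probability that an image $\hat{x}$ is real. The adversarial loss for the image discriminator is defined as
$$\mathcal{L}^{I}_{disc}(G_{SI}, D_I) = \mathbb{E}_{x' \sim \mathcal{X}_U}\big[(D_I(x') - 1)^2\big] + \mathbb{E}_{y \sim \mathcal{Y}_L}\big[\big(D_I(G_{SI}(y))\big)^2\big] \tag{5}$$
where $G_{SI}$ is the image generator.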
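As an illustration of the least-squares adversarial loss described above, the following sketch computes the discriminator loss from lists of scalar scores. The function names are hypothetical, and the scores stand in for discriminator outputs on real and generated samples.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def least_squares_discriminator_loss(real_scores, fake_scores):
    """Least-squares adversarial loss for a discriminator.

    Scores on real samples are pushed towards 1 and scores on generated
    samples towards 0, replacing the usual cross-entropy terms.
    """
    real_term = mean((score - 1.0) ** 2 for score in real_scores)
    fake_term = mean(score ** 2 for score in fake_scores)
    return real_term + fake_term


# A discriminator that is mostly right: small but non-zero loss.
loss = least_squares_discriminator_loss([0.9, 0.8], [0.1, 0.3])

# A perfect discriminator reaches the minimum of zero.
perfect = least_squares_discriminator_loss([1.0, 1.0], [0.0, 0.0])
```

Unlike the cross-entropy, the quadratic penalty grows smoothly with the distance between a score and its target, which is the property the least-squares formulation relies on.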
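The first cycle consistency loss measures the difference between an unlabeled image $x' \sim \mathcal{X}_U$ and the regenerated image after passing through generators $G_{IS}$ and $G_{SI}$ sequentially. Here, we use the L1 norm since it leads to sharper images than the L2 norm:
$$\mathcal{L}^{I}_{cycle}(G_{IS}, G_{SI}) = \mathbb{E}_{x' \sim \mathcal{X}_U}\big[\| G_{SI}(G_{IS}(x')) - x' \|_1\big] \tag{6}$$
On the other hand, since the segmentation labels are categorical variables, we use the pixelwise cross-entropy $\mathcal{H}$ to evaluate the difference between a ground-truth segmentation mask $y \sim \mathcal{Y}_L$ and the regenerated labels after passing through generators $G_{SI}$ and $G_{IS}$ in sequence:
$$\mathcal{L}^{S}_{cycle}(G_{IS}, G_{SI}) = \mathbb{E}_{y \sim \mathcal{Y}_L}\big[\mathcal{H}\big(y, G_{IS}(G_{SI}(y))\big)\big] \tag{7}$$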
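The two cycles can be traced on a toy example. In the sketch below, `toy_segment` and `toy_render` are hypothetical stand-ins for the image-to-labels and labels-to-image generators (a threshold and a lookup of class intensities, not neural networks), and the L1 difference is averaged over pixels, which is an assumption.

```python
def toy_segment(image):
    """Stand-in for the image-to-labels generator: threshold each pixel."""
    return [1 if pixel > 0.5 else 0 for pixel in image]


def toy_render(labels):
    """Stand-in for the labels-to-image generator: one intensity per class."""
    return [0.9 if label == 1 else 0.1 for label in labels]


def l1_cycle_loss(image, reconstructed):
    """Mean absolute difference between an image and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(image, reconstructed)) / len(image)


# Image cycle: unlabeled image -> labels -> regenerated image.
image = [0.2, 0.8, 0.6, 0.1]
reconstructed = toy_render(toy_segment(image))
image_cycle = l1_cycle_loss(image, reconstructed)

# Labels cycle: mask -> image -> regenerated labels.
mask = [0, 1, 1, 0]
regenerated_mask = toy_segment(toy_render(mask))
```

In this toy setting the labels cycle is reproduced exactly, while the image cycle loses the within-class intensity variation, so only the image cycle loss is non-zero.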
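Finally, the total loss is obtained by combining all six loss terms:
$$\mathcal{L}_{total} = \mathcal{L}^{S}_{gen} + \lambda_1 \mathcal{L}^{I}_{gen} + \lambda_2 \mathcal{L}^{S}_{cycle} + \lambda_3 \mathcal{L}^{I}_{cycle} - \lambda_4 \mathcal{L}^{S}_{disc} - \lambda_5 \mathcal{L}^{I}_{disc} \tag{8}$$
where $\mathcal{L}^{S}_{gen}$ and $\mathcal{L}^{I}_{gen}$ are the supervised losses of the segmentation and image generators, $\mathcal{L}^{S}_{cycle}$ and $\mathcal{L}^{I}_{cycle}$ are the cycle-consistency losses on labels and images, $\mathcal{L}^{S}_{disc}$ and $\mathcal{L}^{I}_{disc}$ are the adversarial losses of the labels and image discriminators, and $\lambda_1, \ldots, \lambda_5$ are weighting terms. The generators are trained to minimize this objective, while the discriminators are trained to maximize it.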
In practice, learning is performed in an alternating fashion, where the parameters of the generators are optimized while considering those of the discriminators as fixed, and vice-versa.
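A minimal sketch of how six scalar loss terms might be combined and split between the two alternating steps is given below. The dictionary keys, the sign convention (discriminator terms subtracted so that generators minimize and discriminators maximize the same objective) and the numeric values are assumptions made for illustration, not the settings used in the experiments.

```python
def total_objective(losses, weights):
    """Weighted combination of the six loss terms.

    The supervised segmentation loss has unit weight; the discriminator
    terms enter with a negative sign (assumed convention).
    """
    return (
        losses["gen_labels"]
        + weights["gen_image"] * losses["gen_image"]
        + weights["cycle_labels"] * losses["cycle_labels"]
        + weights["cycle_image"] * losses["cycle_image"]
        - weights["disc_labels"] * losses["disc_labels"]
        - weights["disc_image"] * losses["disc_image"]
    )


def discriminator_objective(losses, weights):
    """Quantity minimized in the discriminator step (generators fixed).

    Maximizing the total objective over the discriminators is the same as
    minimizing the weighted sum of the two discriminator losses.
    """
    return (
        weights["disc_labels"] * losses["disc_labels"]
        + weights["disc_image"] * losses["disc_image"]
    )


# Placeholder values, chosen only to make the arithmetic easy to follow.
names = ["gen_labels", "gen_image", "cycle_labels", "cycle_image",
         "disc_labels", "disc_image"]
example_losses = {name: 1.0 for name in names}
example_weights = {name: 0.5 for name in names if name != "gen_labels"}

generator_step_value = total_objective(example_losses, example_weights)
discriminator_step_value = discriminator_objective(example_losses, example_weights)
```

In an alternating scheme, the generator step would update only the generator parameters using the first value, and the discriminator step would update only the discriminator parameters using the second.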
3.3 Implementation details
Following the original implementation of CycleGAN, we adopt the architecture proposed in [14] for our generators, since it has shown impressive results for image-style transfer. This network is composed of two stride-2 convolutions, followed by 9 residual blocks and two fractionally-strided convolutions with stride $\frac{1}{2}$. Similarly, instance normalization [30] was employed and no drop-out was adopted. Furthermore, we used softmax as output function when generating segmentation labels from images, whereas tanh was the selected function when translating from segmentation labels to images, in order to have continuous values. In pre-processing, each channel of an image is normalized to the range $[-1, 1]$ by subtracting its mean value and dividing by the difference between the maximum and minimum value.
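The pre-processing step described above can be sketched for a single channel as follows. The function name and the handling of constant channels (returning zeros) are assumptions made for illustration.

```python
def normalize_channel(values):
    """Subtract the channel mean and divide by (max - min).

    Every value lies between the minimum and the maximum, as does the mean,
    so each normalized value falls within [-1, 1].
    """
    mean_value = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        # Constant channel: nothing to scale (assumed convention).
        return [0.0 for _ in values]
    return [(value - mean_value) / value_range for value in values]


normalized = normalize_channel([0.0, 5.0, 10.0])
```

For the three-pixel channel above, the mean is 5 and the range is 10, so the normalized values are -0.5, 0 and 0.5, which matches the output range of a tanh activation.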
Unlike the original CycleGAN model, we make use of pixel-wise discriminators [11] where the size of the output is the same as the input and the adversarial label (i.e., real / generated) is replicated at each output pixel. We found this model to perform better than having a single discriminator output. Each discriminator contains three convolutional blocks, followed by Leaky ReLU activations with negative slope of . In addition, batch normalization is used in the discriminators after the second convolutional block.
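To make the pixel-wise adversarial label concrete, the sketch below builds a target map in which the real / generated label is copied to every output pixel and compares it with a discriminator score map. The function names, the 1 / 0 encoding of the label and the use of a mean squared difference are assumptions made for illustration.

```python
def adversarial_target_map(height, width, is_real):
    """Target map with the adversarial label copied to every output pixel."""
    value = 1.0 if is_real else 0.0
    return [[value] * width for _ in range(height)]


def pixelwise_least_squares(score_map, target_map):
    """Mean squared difference between per-pixel scores and targets."""
    squared_errors = [
        (score - target) ** 2
        for score_row, target_row in zip(score_map, target_map)
        for score, target in zip(score_row, target_row)
    ]
    return sum(squared_errors) / len(squared_errors)


# A 2x2 score map produced for a real input.
scores = [[0.9, 0.8], [1.0, 0.7]]
real_target = adversarial_target_map(2, 2, is_real=True)
loss_real = pixelwise_least_squares(scores, real_target)
```

Each output pixel then contributes its own error term, instead of the discriminator being judged on a single scalar decision for the whole input.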
Both generators and discriminators were trained using the Adam optimizer [16] with $\beta_1$ and $\beta_2$ parameters equal to 0.5 and 0.999. The learning rate was initially set to $2 \times 10^{-4}$, with a linear decay after every 100 epochs, during 400 epochs. Furthermore, batch size was set to 5 in all experiments. The values of the weighting terms in Eq. (8) were set to , , , and . The code was implemented in PyTorch 3.3 [22] and experiments were run on a server equipped with an NVIDIA Titan V GPU (12 GB). The code is made publicly available at https://github.com/arnab39/Semi-supervised-segmentation-cycleGAN.
4 Experiments
4.1 Datasets
We conduct experiments on three different public semantic segmentation benchmarks: PASCAL VOC 2012 [7], Cityscapes [5] and the Automated Cardiac Diagnosis Challenge (ACDC) from MICCAI 2017 [3].
PASCAL VOC 2012: This dataset contains 21 common object classes, including one background class. In our experiments, we employed the augmented set composed of 10,582 images, which we split into training (8,994 images) and validation (1,588 images) subsets. In addition, due to memory limitations, we resized all images to 200×200 pixels before feeding them into the network.
Cityscapes: This second dataset contains 50 videos from driving scenes where a total of 20 classes (including background) are manually annotated. In our experiments, we split the 3,475 provided images into training (2,953 images) and validation (522 images) subsets. As in the previous case, all images were resized, here to a 128×256 pixel resolution.
ACDC: This medical image set focuses on the segmentation of cardiac structures (the left ventricular endocardium and epicardium and the right ventricular endocardium) and consists of 100 cine magnetic resonance (MR) exams covering normal cases and subjects with well-defined pathologies: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction and abnormal right ventricle. Each exam contains acquisitions at the diastolic and systolic phases. For our experiments, we employed 75 exams for training and the remaining 25 for validation.
It is important to note that, since we aim at isolating the performance of each method and not achieving state-of-the-art results, no data augmentation was performed in any of the datasets for training.
4.2 Evaluation protocol
We use the mean intersection over union (mIOU) metric to evaluate the segmentation results of all the models. For each class, this metric can be defined as $\mathrm{IOU} = \frac{TP}{TP + FP + FN}$, where $TP$, $FP$, and $FN$ are the true positive, false positive, and false negative pixels, respectively, determined over the whole validation set. The mIOU is then the average of this value over classes.
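The metric can be sketched in a few lines of pure Python. Labels are given here as flat lists of class indices pooled over all validation pixels; the function name and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions made for illustration.

```python
def mean_iou(true_labels, pred_labels, num_classes):
    """Mean over classes of TP / (TP + FP + FN), counted over all pixels."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for true_class, pred_class in zip(true_labels, pred_labels):
        if true_class == pred_class:
            tp[true_class] += 1
        else:
            fp[pred_class] += 1  # predicted class gains a false positive
            fn[true_class] += 1  # true class gains a false negative
    ious = []
    for c in range(num_classes):
        denominator = tp[c] + fp[c] + fn[c]
        if denominator > 0:  # skip classes absent from both (assumption)
            ious.append(tp[c] / denominator)
    return sum(ious) / len(ious)


# Four pixels, two classes: one pixel of class 0 is predicted as class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this example, class 0 has an IOU of 1/2 and class 1 an IOU of 2/3, giving a mean of 7/12.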
To have an upper bound on performance, we train a network in a fully-supervised manner, employing all available training images. We also trained the same model from scratch using only 10%, 20%, 30% or 50% of labeled images, and refer to this baseline as Partial. Our semi-supervised method is trained with the same subsets as the Partial baseline; however, it also makes use of unlabeled training images. Last, we compare our method to the approach presented in [11], which has shown state-of-the-art performance for semi-supervised segmentation.
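One way to build the labeled and unlabeled subsets used in this protocol is sketched below. The random shuffling, the fixed seed, the rounding down of the labeled count and the use of all remaining images as the unlabeled set are assumptions made for illustration, since the exact sampling procedure is not specified here.

```python
import random


def split_labeled_unlabeled(image_ids, labeled_fraction, seed=0):
    """Split image identifiers into a labeled and an unlabeled subset."""
    if not 0.0 < labeled_fraction <= 1.0:
        raise ValueError("labeled_fraction must be in (0, 1]")
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_labeled = int(labeled_fraction * len(shuffled))  # rounded down
    return shuffled[:n_labeled], shuffled[n_labeled:]


# PASCAL VOC 2012 training subset size used in the experiments.
train_ids = list(range(8994))
labeled_ids, unlabeled_ids = split_labeled_unlabeled(train_ids, 0.10)
```

With 10% of the 8,994 training images, this yields 899 labeled and 8,095 unlabeled images; the Partial baseline would use only the first subset, while the semi-supervised model would use both.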
In the following section, we report the experimental results of the proposed approach on the three datasets described in Section 4.1.
4.3.1 Comparison on benchmarks
|Method||Labeled (%)||PASCAL VOC 2012||Cityscapes||ACDC|
|Hung et al. [11]||20||0.2032||0.3490||0.8063|
Table 1 reports the results obtained by the tested approaches on the three benchmark datasets. We first observe that, in all cases, the proposed model outperforms the partial supervision baseline when training with a reduced set of labeled images. This difference is particularly significant when pixel-level annotations are scarce (i.e., 10% and 20% of the whole training set), where the proposed model achieves an improvement of 2-4%. As the number of labeled images increases, the gap between the baseline and the proposed model decreases, with a gain close to 1% when training with half of the whole training set. Furthermore, we found that the semi-supervised segmentation approach of Hung et al. [11] obtained poor results for all three datasets, with lower accuracy than the partial supervision baseline (Partial). In the original work [11], the authors used a generator pre-trained on ImageNet. In our experiments, to have an unbiased comparison, we tested methods without such pre-training (i.e., all generators and discriminators were trained from scratch). This could potentially explain the lower results we obtained for the method of Hung et al.
Figures 3-5: Visual comparison of segmentation results. Columns: Image, Ground truth, Full, Partial, Hung et al. [11], Ours.
A visual comparison of results is given in Figures 3, 4 and 5. It can be seen that the proposed method predicts a segmentation closer to that of the network trained with all images (Full) than the partial supervision baseline (Partial) and the model of Hung et al. [11]. While predicted region boundaries are sometimes inaccurate, the global semantic information of the image (i.e., actual class labels) appears to be better learned by our model compared to the partial supervision baseline. In addition, our model seems to better capture details of thin objects (e.g., legs or persons) compared to both the baseline and the method in [11].
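Table 2: Ablation study on PASCAL VOC 2012 with 20% of labeled data.
|Model||mIOU|
|All loss terms||0.2981|
|w/o labels cycle loss||0.2627|
|w/o image cycle loss||0.2733|
|w/o labels discr. loss||0.2614|
|w/o image discr. loss||0.2543|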
4.3.2 Ablation study
To further analyze the effect of the different components of the proposed model, we conduct an ablation study where the model is trained while removing a single loss term of Eq. (8). Specifically, we train the model without the labels cycle-consistency loss $\mathcal{L}^{S}_{cycle}$, the image cycle-consistency loss $\mathcal{L}^{I}_{cycle}$, the labels discriminator loss $\mathcal{L}^{S}_{disc}$, or the image discriminator loss $\mathcal{L}^{I}_{disc}$. Note that these modifications correspond to setting $\lambda_2$, $\lambda_3$, $\lambda_4$ or $\lambda_5$ to 0, respectively. For this experiment, we investigate the performance of the model trained with 20% of labeled data on PASCAL VOC 2012.
The results of our ablation study are summarized in Table 2. The proposed model containing all loss terms reaches a mIOU value of 0.2981. If we remove the cycle consistency loss on the generation of segmentation labels, this value is reduced to 0.2627. In contrast, removing the cycle consistency loss on the image generation leads to a smaller drop, with a mIOU of 0.2733, suggesting that the cycle consistency loss on segmentation masks has a stronger impact on the model. Regarding the significance of the losses in the discriminators, we observe a reverse effect. A lower performance (0.2543 versus 0.2614) is observed if we ignore the loss on the image discriminator, which is responsible for differentiating between unlabeled and generated images.
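The comparison above can be checked with a short worked computation on the reported values. The sketch below only uses the mIOU values stated in the text and in Table 2; the variable names are chosen for illustration.

```python
full_model_miou = 0.2981

# mIOU obtained when a single loss term is removed (Table 2).
ablation_miou = {
    "w/o labels cycle loss": 0.2627,
    "w/o image cycle loss": 0.2733,
    "w/o labels discr. loss": 0.2614,
    "w/o image discr. loss": 0.2543,
}

# Drop in mIOU with respect to the model trained with all loss terms.
drops = {
    name: round(full_model_miou - miou, 4)
    for name, miou in ablation_miou.items()
}

# Loss terms ranked from the largest to the smallest drop.
ranked = sorted(drops, key=drops.get, reverse=True)
```

The drops are 0.0354 and 0.0248 for the labels and image cycle losses, and 0.0367 and 0.0438 for the labels and image discriminator losses, so the ranking starts with the image discriminator loss and ends with the image cycle loss.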
5 Discussion and conclusion
We presented a semi-supervised method for image semantic segmentation, where the key idea is to leverage CycleGAN to learn a cycle-consistent mapping between unlabeled real images and available ground truth masks. Unlike recent work using adversarial learning for semi-supervised segmentation [27, 11, 31], the proposed strategy enforces consistency between unpaired images and segmentation masks, which acts as an unsupervised regularizer. From the reported results, we have shown that this strategy improves segmentation performance, particularly when annotated data is scarce.
Due to the high computational and memory requirements of generating large images, our experiments have employed images with reduced size, in particular for the Cityscapes dataset where the resolution was reduced from 1024×2048 pixels to 128×256. This is in large part responsible for the lower accuracy values obtained in our experiments, compared to those reported in the literature. In a future investigation, we will evaluate the performance of our model on full-sized images. Moreover, in this work, we used the same network for both generators ($G_{IS}$ and $G_{SI}$). This architectural choice was made to achieve a better learning equilibrium during training (i.e., to avoid one generator learning much faster than the other). Employing different networks in future experiments could however improve performance.
References
- [1] (2017) Semi-supervised learning for network-based cardiac MR image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–260.
- [2] (2017) Semi-supervised deep learning for fully convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 311–319.
- [3] (2018) Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Transactions on Medical Imaging.
- [4] (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- [5] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- [6] (2015) BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1635–1643.
- [7] (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
- [8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680.
- [9] (2017) CyCADA: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213.
- [10] (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649.
- [11] (2018) Adversarial learning for semi-supervised semantic segmentation. In Proceedings of the British Machine Vision Conference (BMVC), pp. 1.
- [12] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- [13] (2018) Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 777–785.
- [14] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
- [15] (2019) Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis.
- [16] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [17] (2016) ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167.
- [18] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- [19] (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802.
- [20] (2018) A robust deep attention network to noisy labels in semi-supervised biomedical segmentation. arXiv preprint arXiv:1807.11719.
- [21] (2015) Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV.
- [22] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- [23] (2019) Deep co-training for semi-supervised image segmentation. arXiv preprint arXiv:1903.11233.
- [24] (2018) Unsupervised domain adaptation for medical imaging segmentation with self-ensembling. arXiv preprint arXiv:1811.06042.
- [25] (2017) DeepCut: object segmentation from bounding box annotations using convolutional neural networks. IEEE Transactions on Medical Imaging 36 (2), pp. 674–683.
- [26] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- [27] (2017) Semi supervised semantic segmentation using generative adversarial network. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 5689–5697.
- [28] (2018) Learning to adapt structured output space for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR).
- [29] (2017) Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 4.
- [30] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
- [31] (2017) Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 408–416.
- [32] (2019) Collaborative learning of semi-supervised segmentation and classification for medical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2079–2088.
- [33] (2018) Semi-supervised multi-organ segmentation via multi-planar co-training. arXiv preprint arXiv:1804.02586.
- [34] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.