Constrained-CNN losses for
weakly supervised segmentation
Weak supervision, e.g., in the form of partial labels or image tags, is currently attracting significant attention in CNN segmentation as it can mitigate the lack of full and laborious pixel/voxel annotations. Enforcing high-order (global) inequality constraints on the network output, for instance, on the size of the target region, can leverage unlabeled data, guiding training with domain-specific knowledge. Inequality constraints are very flexible because they do not assume exact prior knowledge. However, constrained Lagrangian dual optimization has been largely avoided in deep networks, mainly for computational tractability reasons. To the best of our knowledge, the method of Pathak et al. pathak2015constrained (2015) is the only prior work that addresses deep CNNs with linear constraints in weakly supervised segmentation. It uses the constraints to synthesize fully-labeled training masks (proposals) from weak labels, mimicking full supervision and facilitating dual optimization.
We propose to introduce a differentiable term, which enforces inequality constraints directly in the loss function, avoiding expensive Lagrangian dual iterates and proposal generation. From a constrained-optimization perspective, our simple approach is not optimal, as there is no guarantee that the constraints are satisfied. However, surprisingly, it yields substantially better results than the proposal-based constrained CNNs in pathak2015constrained (2015), while reducing the computational demand for training. In the context of cardiac images, we reached a segmentation performance close to full supervision using only a fraction (0.1%) of the full ground-truth labels and image-level tags. While our experiments focused on basic linear constraints such as the target-region size and image tags, our framework can be easily extended to other non-linear constraints, e.g., invariant shape moments Klodt2011 (2011) or other region statistics Lim2014 (2014). Therefore, it has the potential to close the gap between weakly and fully supervised learning in semantic medical image segmentation. Our code is publicly available.
Hoel Kervadec ÉTS Montréal firstname.lastname@example.org Jose Dolz ÉTS Montréal Meng Tang University of Waterloo Department of computer science Éric Granger ÉTS Montréal Yuri Boykov University of Waterloo Department of computer science Ismail Ben Ayed ÉTS Montréal
1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.
1 Introduction

In recent years, deep convolutional neural networks (CNNs) have been dominating semantic segmentation problems, both in computer vision and medical imaging, achieving ground-breaking performance when full supervision is available Litjens2017 (2017); FCN (2015). In semantic segmentation, full supervision requires laborious pixel/voxel annotations, which may not be available in a breadth of applications, more so when dealing with volumetric data. Therefore, weak supervision with partial labels, for instance, bounding boxes deepcut (2017), points Bearman2016 (2016), scribbles tang2018regularized (2018); ncloss:cvpr18 (2018); scribblesup (2016), or image tags pathak2015constrained (2015); papandreou2015weakly (2015), is attracting significant research attention. Imposing prior knowledge on the network’s output in the form of unsupervised loss terms is a well-established approach in machine learning weston2012deep (2012); goodfellow2016deep (2016). Such priors can be viewed as regularization terms that leverage unlabeled data, embedding domain-specific knowledge. For instance, the recent studies in tang2018regularized (2018); ncloss:cvpr18 (2018) showed that direct regularization losses, e.g., dense conditional random field (CRF) or pairwise clustering, can yield outstanding results in weakly supervised segmentation, reaching almost full-supervision performance in natural image segmentation; see the results in tang2018regularized (2018). Surprisingly, such a principled direct-loss approach is not common in weakly supervised segmentation. In fact, most of the existing techniques synthesize fully-labeled training masks (proposals) from the available partial labels, mimicking full supervision deepcut (2017); papandreou2015weakly (2015); scribblesup (2016); kolesnikov2016seed (2016).
Typically, such proposal-based techniques iterate two steps: CNN learning and proposal generation facilitated by dense CRFs and fast mean-field inference koltun:NIPS11 (2011), which are now the de-facto choice for pairwise regularization in semantic segmentation algorithms.
This study continues our line of recent work in tang2018regularized (2018), in which we showed the potential of direct pairwise regularization losses (e.g., dense CRF and Potts) in weakly supervised segmentation. Our purpose here is to embed high-order (global) inequality constraints on the network output directly in the loss function, so as to guide learning. For instance, assume that we have some prior knowledge on the size (or volume) of the target region, e.g., in the form of lower and upper bounds on size, a common scenario in medical image segmentation Niethammer2013 (2013); Gorelick2013 (2013). Let I denote a given training image, with Ω a discrete image domain and |Ω| the number of pixels/voxels in the image. Ω_L ⊂ Ω is the set of labeled pixels, i.e., a weak (partial) ground-truth segmentation of the image. It takes the form of a partial annotation of the target region, e.g., a few points (see the examples in Fig. 2) or image-level tags. In this case, one can optimize a cross-entropy loss subject to inequality constraints on the network outputs pathak2015constrained (2015):

    min_θ  H(S) = − Σ_{p ∈ Ω_L} log(S_p)    s.t.    a ≤ Σ_{p ∈ Ω} S_p ≤ b,    (1)
where S = (S_1, …, S_{|Ω|}) ∈ [0, 1]^{|Ω|} is a vector of softmax probabilities¹ generated by the network at each pixel p ∈ Ω, and θ denotes the network parameters. Priors a and b denote the given lower and upper bounds on the size (or cardinality) of the target region. Inequality constraints of the form in (1) are very flexible because they do not assume exact knowledge of the target size, unlike Zhang2017 (2017); Boykov2015 (2015). Also, multiple instance learning (MIL) constraints pathak2015constrained (2015), which enforce image-tag priors, can be handled by constrained model (1). Image tags are a form of weak supervision, which enforce the constraints that a target region is present or absent in a given training image pathak2015constrained (2015). They can be viewed as particular cases of the inequality constraints in (1). For instance, a suppression constraint, which takes the form Σ_{p ∈ Ω} S_p ≤ 0, enforces that the target region is not in the image; Σ_{p ∈ Ω} S_p ≥ 1 enforces the presence of the region.

¹ The softmax probabilities take the form of a normalized exponential of the network output f_p(θ, I), where f_p is a real scalar function representing the output of the network at pixel p. For notation simplicity, we omit the dependence of S_p on θ and I, as this does not result in any ambiguity in the presentation.
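As a concrete illustration, the size term in constraint (1) is simply the sum of the per-pixel probabilities over the image domain. A minimal NumPy sketch (the sigmoid/two-class parameterization and all function names here are illustrative, not the paper's code):

```python
import numpy as np

def predicted_size(logits):
    # S_p: probability of the target class at each pixel p, obtained
    # here from the network's scalar output via a sigmoid (the
    # two-class case of the softmax). This parameterization is an
    # illustrative assumption.
    probs = 1.0 / (1.0 + np.exp(-logits))
    # V_S = sum of the probabilities over the image domain Ω,
    # i.e., the (soft) size of the predicted target region.
    return probs.sum()

def satisfies_bounds(logits, a, b):
    # The inequality constraint of problem (1): a <= V_S <= b.
    v = predicted_size(logits)
    return a <= v <= b
```

Note that the size is a differentiable function of the probabilities, which is what makes the penalty introduced in Section 2 amenable to standard back-propagation.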
Even though constraints of the form (1) are linear (and hence convex) with respect to the network outputs, constrained problem (1) is very challenging due to the non-convexity of CNNs. One possibility would be to minimize the corresponding Lagrangian dual. However, as pointed out in pathak2015constrained (2015); Marquez-Neila2017 (2017), this is computationally intractable for semantic segmentation networks involving millions of parameters; one would have to optimize a CNN within each dual iteration. In fact, constrained optimization has been largely avoided in deep networks Ravi2018 (2018), even though some Lagrangian techniques were applied to neural networks long before the deep learning era Zhang1992 (1992); Platt1988 (1988). These constrained optimization techniques are not applicable to deep CNNs because they solve large linear systems of equations; the numerical solvers underlying them would have to deal with matrices of very large dimensions in the case of deep networks Marquez-Neila2017 (2017).
To the best of our knowledge, the method of Pathak et al. pathak2015constrained (2015) is the only prior work that addresses constrained deep CNNs in weakly supervised segmentation. It uses the constraints to synthesize fully-labeled training masks (proposals) from the available partial labels, mimicking full supervision, which avoids intractable dual optimization of the constraints when minimizing the loss function. The main idea of pathak2015constrained (2015) is to model the proposals as a latent distribution, and then minimize a KL divergence, encouraging the softmax output of the CNN to match the latent distribution as closely as possible. Therefore, they impose constraints on the latent distribution rather than on the network output, which significantly facilitates Lagrangian dual optimization. This decouples stochastic gradient descent learning of the network parameters from constrained optimization: the authors of pathak2015constrained (2015) alternate between optimizing with respect to the latent distribution, which corresponds to proposal generation subject to the constraints², and standard stochastic gradient descent for optimizing with respect to the network parameters.

² This sub-problem is convex when the constraints are convex.
We propose to introduce a differentiable term, which enforces inequality constraints (1) directly in the loss function, avoiding expensive Lagrangian dual iterates and proposal generation. From a constrained-optimization perspective, our simple approach is not optimal, as there is no guarantee that the constraints are satisfied. However, surprisingly, it yields substantially better results than the proposal-based constrained CNNs in pathak2015constrained (2015), while reducing the computational demand for training. In the context of cardiac image segmentation, we reached a performance close to full supervision while using only a fraction (0.1%) of the full ground-truth labels and image-level tags. Our framework can be easily extended to non-linear inequality constraints, e.g., invariant shape moments Klodt2011 (2011) or other region statistics Lim2014 (2014). Therefore, it has the potential to close the gap between weakly and fully supervised learning in semantic medical image segmentation. Our code is publicly available.
2 Proposed loss function
We propose the following loss for weakly supervised segmentation:

    min_θ  H(S) + λ C(V_S),    with    V_S = Σ_{p ∈ Ω} S_p,    (2)
with function C given by (see the illustration in Fig. 1):

    C(V_S) =  (V_S − a)²   if V_S < a,
              (V_S − b)²   if V_S > b,
              0            otherwise.    (3)
Now, our differentiable term C(V_S) accommodates standard stochastic gradient descent. During back-propagation, the term of the gradient-descent update corresponding to C can be written as follows:

    ∂C(V_S)/∂θ =  2 (V_S − a) Σ_{p ∈ Ω} ∂S_p/∂θ   if V_S < a,
                  2 (V_S − b) Σ_{p ∈ Ω} ∂S_p/∂θ   if V_S > b,
                  0                                otherwise,    (4)
where ∂S_p/∂θ denotes the standard derivative of the softmax outputs of the network. The gradient in (4) has a clear interpretation. During back-propagation, when the current constraints are satisfied, i.e., a ≤ V_S ≤ b, we have ∂C(V_S)/∂θ = 0. Therefore, in this case, the gradient stemming from our term has no effect on the current parameter update. Now, suppose without loss of generality that the current set of parameters corresponds to V_S < a, which means the current target region is smaller than its lower bound (constraint violation). In this case, the factor 2(V_S − a) in the first line of (4) is negative and, therefore, the gradient-descent update performs a gradient ascent step on the softmax outputs, increasing S_p. This makes sense because it increases the size of the current region, V_S, so as to satisfy the constraint. The case V_S > b has a similar interpretation.
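The penalty C(V_S) of Eq. (3) and its derivative with respect to V_S can be sketched in a few lines of plain Python (function names are ours; the sign behaviour of the derivative matches the interpretation above):

```python
def size_penalty(v_s, a, b):
    # C(V_S) of Eq. (3): quadratic penalty outside [a, b], zero inside.
    if v_s < a:
        return (v_s - a) ** 2
    if v_s > b:
        return (v_s - b) ** 2
    return 0.0

def size_penalty_grad(v_s, a, b):
    # dC/dV_S: zero when the constraint is satisfied; negative when
    # V_S < a (so a gradient-descent step grows the region); positive
    # when V_S > b (a descent step shrinks it).
    if v_s < a:
        return 2.0 * (v_s - a)
    if v_s > b:
        return 2.0 * (v_s - b)
    return 0.0
```

In an automatic-differentiation framework, only the forward penalty needs to be implemented; the chain rule through Σ_p ∂S_p/∂θ in Eq. (4) is handled by back-propagation.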
In the next sections, we first give details about the dataset, the weakly annotated labels and our implementation. Then, we empirically evaluate our proposed high-order size loss. In particular, we analyze its contribution to the segmentation performance, and compare the results to the weakly supervised setting with no regularization term, to proposal generation and to the fully supervised setting.
3 Experimental set-up
3.1 Dataset

Experiments on the proposed high-order size loss focused on left ventricular endocardium segmentation. For this purpose, we employed the training set of the publicly available dataset from the 2017 ACDC Challenge (https://www.creatis.insa-lyon.fr/Challenge/acdc/). This set consists of 100 cine magnetic resonance (MR) exams covering well-defined pathologies: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, abnormal right ventricle, and patients without cardiac disease. Exams were acquired in breath hold, with retrospective or prospective gating, using an SSFP sequence in 2-chamber, 4-chamber and short-axis orientations. A series of short-axis slices covers the LV from the base to the apex, with a thickness of 5 to 8 mm and often an interslice gap of 5 mm. The spatial resolution ranges from 0.83 to 1.75 mm/pixel.
For all the experiments, we employed the same 75 exams for training and the remaining 25 for validation. To increase the variability of the data, we augmented the dataset by randomly rotating, flipping, mirroring and scaling the images.
3.2 Weakly annotated labels
To generate the weak labels, we applied binary erosion to the fully annotated labels with a kernel of size 10 × 10. If the resulting label disappeared, we repeated the operation with a smaller kernel (i.e., 7 × 7) until we obtained a small contour. Thus, the total number of annotated pixels represented 0.1% of the labeled pixels in the fully supervised scenario. Figure 2 depicts some examples of fully annotated images and their corresponding weak labels.
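The erosion procedure above can be sketched as follows. This is an illustrative, hand-rolled version (a real implementation would typically use scipy.ndimage.binary_erosion), and the fallback when every kernel empties the label is our assumption:

```python
import numpy as np

def binary_erode(mask, k):
    # Erosion with a k x k square structuring element: a pixel stays
    # foreground only if its whole k x k window is foreground.
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), k // 2, constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

def weak_label(full_mask, kernels=(10, 7)):
    # Erode the full annotation; if the label vanishes, retry with a
    # smaller kernel until a small contour survives.
    for k in kernels:
        eroded = binary_erode(full_mask, k)
        if eroded.any():
            return eroded
    return full_mask  # fallback (assumption: keep the full label)
```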
To compute the lower and upper bounds of the proposed size loss, manual segmentations from only one subject were employed. Specifically, we computed the minimum and maximum size of the left ventricular endocardium over the slices, and then multiplied the minimum and maximum values by factors of 0.9 and 1.1, respectively, to account for size variations across exams.
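A small sketch of this bound computation (the function and its interface are illustrative):

```python
def size_bounds(slice_sizes, slack=0.1):
    # slice_sizes: target-region sizes (in pixels) of the annotated
    # subject's slices. The observed min/max are relaxed by -/+ `slack`
    # to account for size variations across exams.
    a = (1.0 - slack) * min(slice_sizes)  # lower bound
    b = (1.0 + slack) * max(slice_sizes)  # upper bound
    return a, b
```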
3.3 Training and implementation details
For all the proposed experiments we employed the ENet network paszke2016enet (2016), as it has shown a good trade-off between accuracy and inference time. Nevertheless, the proposed high-order size loss is general and can be applied to any CNN. The network is trained from scratch with the Adam optimizer and a batch size of 1. The initial learning rate was set to 5e and divided by 2 after 100 epochs. The weight λ of the size loss in (2) was empirically set to 1e; its influence on performance has not been investigated in this work. Input images have a size of 256 × 256 pixels.
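The overall training objective, partial cross-entropy on the annotated pixels plus the weighted size penalty of Eq. (2), can be sketched as follows (the value of lam and all names are illustrative only; the actual weight and optimization details are those given above):

```python
import numpy as np

def partial_cross_entropy(probs, annotated):
    # Cross-entropy restricted to the few annotated (foreground)
    # pixels; unlabeled pixels contribute nothing to this term.
    eps = 1e-8
    return float(-np.log(probs[annotated] + eps).sum())

def weak_supervised_loss(probs, annotated, a, b, lam=0.01):
    # Joint objective of Eq. (2): partial CE + lam * C(V_S),
    # with V_S the sum of the predicted probabilities.
    v_s = probs.sum()
    if v_s < a:
        penalty = (v_s - a) ** 2
    elif v_s > b:
        penalty = (v_s - b) ** 2
    else:
        penalty = 0.0
    return partial_cross_entropy(probs, annotated) + lam * penalty
```

Since both terms are differentiable in the network outputs, this loss plugs into any standard stochastic gradient descent training loop.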
The proposals method pathak2015constrained (2015) reuses the same network and loss function as the fully supervised setting. At each iteration, a synthetic ground truth is generated using projected gradient ascent (PGA), before computing the cross-entropy between this synthetic ground truth and the network prediction. We found that limiting the number of PGA iterations, compared to the original implementation, saved time without significantly impacting the results. We employ a single constraint (an upper bound), computed following the same strategy as in Section 3.2; a multiplicative constant had to be added to this bound in order to make the method work. This is discussed in more depth in Section 4.2.
For evaluation purposes, since each method has a different loss function, we resort to the common Dice similarity coefficient (DSC) to compare them.
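For reference, the DSC between a predicted mask P and a ground-truth mask G is 2|P ∩ G| / (|P| + |G|). A minimal implementation (the empty-mask convention is our assumption):

```python
import numpy as np

def dice(pred, gt):
    # DSC = 2|P ∩ G| / (|P| + |G|); returns 1.0 when both masks are
    # empty (convention assumed here).
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```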
4 Results

This section presents the experimental results. First, in Sec. 4.1 we evaluate the impact of including the additional size loss during training in a weakly supervised setting. We show that incorporating this size prior directly into the loss yields a highly significant boost in performance. Then, in Sec. 4.2 we compare the effect of using the direct loss against generating proposals during training in an iterative scheme. Compared to this strategy, our proposed method achieved state-of-the-art performance in weakly supervised segmentation with small blobby regions as labels. We also provide the results for the fully supervised setting in Sec. 4.3, followed by a qualitative evaluation in Sec. 4.4. In addition, we compare the different learning strategies in terms of efficiency (Sec. 4.5), showing that the direct introduction of the proposed loss during learning does not affect training times.
4.1 Weakly supervised segmentation with size loss
We trained a segmentation network from weakly annotated images with no additional information, which served as our baseline. Training this model relies on computing the cross-entropy on labeled pixels only. Then, we trained another network with the same weakly annotated images, considering slightly more supervision than simply a few annotated pixels. First, we included size prior information via the upper bound when the target is present in the image; in this case, the lower bound a in Eqs. (3) and (4) is replaced by 0. Thus, the CNN is constrained to generate segmentations whose sizes must be below the upper bound b. If the target is not present in the image, we instead replace the upper bound by 0, similar to MIL scenarios with suppression constraints; this drives the predicted size to 0. To investigate the effect of bounding the CNN output on both sides, we trained a third network from weakly annotated images and included size prior information as both lower and upper bounds. The forward and backward passes are defined by Eqs. (3) and (4), where a and b are computed from one training subject following the steps described in Sec. 3.2. The tight lower and upper size bounds have values of 98 and 1723 pixels, respectively. Cross-entropy on annotated pixels and the proposed size loss are jointly minimized.
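The per-image choice of bounds described above, with the image-level tag deciding between suppression, a single upper bound, or both bounds, can be sketched as (hypothetical helper, not the paper's code):

```python
def bounds_for_image(tag_present, a, b, upper_only=False):
    # Image-level tag -> per-image size constraint:
    #  - target absent: upper bound 0 (suppression constraint, MIL-style),
    #  - target present, upper bound only: the lower bound a is set to 0,
    #  - target present, two bounds: use the size prior (a, b).
    if not tag_present:
        return 0.0, 0.0
    if upper_only:
        return 0.0, b
    return a, b
```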
Table 1 reports results on the validation set for all the methods. The baseline is the network with the cross-entropy loss on labeled pixels only. This baseline achieves poor results, since no information other than the annotated pixels is given; there is no regularization term providing hints on the desired segmentation. Evaluation of this baseline reported mean DSC values below 0.10, which are considered unsatisfactory. Indeed, after approximately 100 epochs (Fig. 3), we observed that the training got stuck and the network did not learn afterwards. On the other hand, constraining the CNN prediction with the proposed size loss significantly increased the performance of the network, as can be seen in both Table 1 and Fig. 3. If we employ only an upper bound b, we achieve a mean DSC value of 0.8189. If a lower bound is also considered, the mean DSC increases up to 0.8415.
Table 1: Mean DSC on the validation set, for a loose and a tight bound.

| Method | Loose bound | Tight bound |
| --- | --- | --- |
| Proposals (one bound) | 0.6124 | 0.0659 |
| Cross-Entropy + Size loss (one bound) | 0.8107 | 0.8189 |
| Cross-Entropy + Size loss (two bounds) | n/a | 0.8415 |
4.2 Implicit loss against segmentation proposals
In this section we compare the direct inclusion of the proposed high-order size loss during training against segmentation-proposal generation pathak2015constrained (2015). First, we imposed the same tight bound employed with our proposed size loss. However, the performance of this method was poor: the network quickly produced only empty segmentations and could not recover from them, which explains the flat line in the left plot of Fig. 3. We observed that, with a much looser upper bound (one sixth of the image size), this approach achieved better results, as reported in Table 1 and Fig. 3 (right). Its learning remained very unstable, while slowly overfitting. To compare fairly with the proposed size loss, we also evaluated our loss with a loose bound equal to 10000. As shown in Table 1, constraining the learning with our size loss and a high bound yields an increase in performance with respect to the proposal-generation approach in pathak2015constrained (2015) under the same conditions.
Since we were not able to use a tight bound with pathak2015constrained (2015), we did not investigate the case with two bounds. Indeed, a single upper bound is equivalent to setting the lower bound a to 0, so adding a lower bound slightly above 0, as in Sec. 4.1, would be effectively the same. As the results show, compared to our direct size loss, generating proposals gives inferior segmentation performance in the different settings.
4.3 Pixel-level annotations
Finally, we compared the weakly supervised strategies to a network trained on fully labeled images. Training the network with strong pixel-level annotations achieved a mean DSC of 0.9284 on the validation set. This represents an increase of roughly 10% with respect to the best proposed method. Nevertheless, it is noteworthy that the performance achieved by the proposed weakly supervised learning approach with size-loss constraints approaches the fully supervised setting with only 0.1% of the annotated pixels. This indicates that the current work shortens the gap between fully and weakly supervised learning for semantic segmentation, pushing the bounds of the latter.
4.4 Qualitative results
To get some intuition about the different learning strategies and their effects on the segmentation, we visualize some results in Fig. 4. For the direct size loss and the proposal-generation approach, we selected the best performing models to display these images. We can observe that, in all cases, the baseline gives unacceptable segmentations, in line with the findings in Table 1 and Fig. 3. Generating proposals during training does improve segmentation performance compared to the baseline; nevertheless, looking at the examples in Fig. 4 (fourth column), these segmentations are far from satisfactory. Integrating the proposed size loss directly in the back-propagation, however, substantially increases the accuracy of the network, as can be seen in the last column of Fig. 4. An interesting observation is that, in some cases (last row), weakly supervised learning produces more reliable segmentations than training with full supervision.
4.5 Efficiency

In this section we compare the several learning approaches in terms of efficiency (Table 2). The weakly supervised baseline and the fully supervised model need to compute only one loss per pass, which is reflected in the lowest training times reported in the table. Including the size loss does not affect the computation time, as can be seen in these results. As expected, the iterative process introduced by pathak2015constrained (2015) at each forward pass causes a non-negligible overhead during training. To generate their synthetic ground truth, they need to optimize the Lagrangian function of their constrained loss with respect to its dual variables, which requires alternating between computing softmax probabilities and gradients of those probabilities. Even in a simple optimization case (with only one constraint), where these dual variables converge quickly, this adds a few milliseconds at each iteration.
Table 2: Training times for the different learning strategies.

| Method | Training time (ms/image) |
| --- | --- |
| Weakly supervised (CE only) | 10 |
| Weakly supervised (CE + Size loss) | 10 |
| Weakly supervised (Proposals) | 15 |
5 Conclusion

We presented a novel loss function for weakly supervised image segmentation which, despite its simplicity, reaches a performance close to full supervision. It performs significantly better than the proposal-based method for this task, while incurring negligible computational overhead.
Acknowledgments

This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), Discovery Grant program, and by the ÉTS Research Chair on Artificial Intelligence in Medical Imaging.
References

- Bearman2016  Amy L. Bearman, Olga Russakovsky, Vittorio Ferrari, and Fei-Fei Li. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565, 2016.
- Boykov2015  Yuri Boykov, Hossam N. Isack, Carl Olsson, and Ismail Ben Ayed. Volumetric bias in segmentation and reconstruction: Secrets and solutions. In IEEE International Conference on Computer Vision (ICCV), pages 1769–1777, 2015.
- goodfellow2016deep  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- Gorelick2013  Lena Gorelick, Frank R. Schmidt, and Yuri Boykov. Fast trust region for segmentation. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1714–1721, 2013.
- Klodt2011  Maria Klodt and Daniel Cremers. A convex framework for image segmentation with moment constraints. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2236–2243, 2011.
- kolesnikov2016seed  Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
- koltun:NIPS11  Philipp Krahenbuhl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
- Lim2014  Yongsub Lim, Kyomin Jung, and Pushmeet Kohli. Efficient energy minimization for enforcing label statistics. IEEE Trans. Pattern Anal. Mach. Intell., 36(9):1893–1899, 2014.
- scribblesup  Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
- Litjens2017  Geert J. S. Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
- FCN  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- Marquez-Neila2017  Pablo Márquez-Neila, Mathieu Salzmann, and Pascal Fua. Imposing hard constraints on deep networks: Promises and limitations. CoRR, abs/1706.02025, 2017.
- Niethammer2013  Marc Niethammer and Christopher Zach. Segmentation with area constraints. Medical Image Analysis, 17(1):101–112, 2013.
- papandreou2015weakly  George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.
- paszke2016enet  Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
- paszke2017automatic  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- pathak2015constrained  Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1796–1804, 2015.
- Platt1988  J. C. Platt and A. H. Barr. Constrained differential optimization. Technical report, California Institute of Technology, 1988.
- deepcut  Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A Rutherford, Joseph V Hajnal, Bernhard Kainz, et al. Deepcut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE transactions on medical imaging, 36(2):674–683, 2017.
- Ravi2018  Sathya N. Ravi, Tuan Dinh, Vishnu Sai Rao Lokhande, and Vikas Singh. Constrained deep learning using conditional gradient and applications in computer vision. arXiv:1803.0645, 2018.
- ncloss:cvpr18  Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized Cut Loss for Weakly-supervised CNN Segmentation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, June 2018.
- tang2018regularized  Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weakly-supervised cnn segmentation. arXiv preprint arXiv:1803.09569, 2018.
- weston2012deep  Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
- Zhang1992  S. Zhang and A. Constantinides. Lagrange programming neural networks. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(7):441–452, 1992.
- Zhang2017  Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In IEEE International Conference on Computer Vision (ICCV), pages 2039–2049, 2017.