Constrained-CNN losses for weakly supervised segmentation
Abstract
Weak supervision, e.g., in the form of partial labels or image tags, is currently attracting significant attention in CNN segmentation as it can mitigate the lack of full and laborious pixel/voxel annotations. Enforcing high-order (global) inequality constraints on the network output, for instance on the size of the target region, can leverage unlabeled data, guiding training with domain-specific knowledge. Inequality constraints are very flexible because they do not assume exact prior knowledge. However, constrained Lagrangian dual optimization has been largely avoided in deep networks, mainly for computational-tractability reasons. To the best of our knowledge, the method of Pathak et al. pathak2015constrained (2015) is the only prior work that addresses deep CNNs with linear constraints in weakly supervised segmentation. It uses the constraints to synthesize fully-labeled training masks (proposals) from weak labels, mimicking full supervision and facilitating dual optimization.
We propose to introduce a differentiable term, which enforces inequality constraints directly in the loss function, avoiding expensive Lagrangian dual iterates and proposal generation. From a constrained-optimization perspective, our simple approach is not optimal as there is no guarantee that the constraints are satisfied. However, surprisingly, it yields substantially better results than the proposal-based constrained CNNs in pathak2015constrained (2015), while reducing the computational demand for training. In the context of cardiac images, we reached a segmentation performance close to full supervision using only a fraction (0.1%) of the full ground-truth labels, along with image-level tags. While our experiments focused on basic linear constraints such as the target-region size and image tags, our framework can be easily extended to other non-linear constraints, e.g., invariant shape moments Klodt2011 (2011) or other region statistics Lim2014 (2014). Therefore, it has the potential to close the gap between weakly and fully supervised learning in semantic medical image segmentation. Our code is publicly available.
Hoel Kervadec, ÉTS Montréal (hoel.kervadec.1@etsmtl.net); Jose Dolz, ÉTS Montréal; Meng Tang, University of Waterloo, Department of Computer Science; Éric Granger, ÉTS Montréal; Yuri Boykov, University of Waterloo, Department of Computer Science; Ismail Ben Ayed, ÉTS Montréal
1st Conference on Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, The Netherlands.
1 Introduction
In recent years, deep convolutional neural networks (CNNs) have been dominating semantic segmentation problems, both in computer vision and medical imaging, achieving groundbreaking performances when full supervision is available Litjens2017 (2017); FCN (2015). In semantic segmentation, full supervision requires laborious pixel/voxel annotations, which may not be available in a breadth of applications, more so when dealing with volumetric data. Therefore, weak supervision with partial labels, for instance bounding boxes deepcut (2017), points Bearman2016 (2016), scribbles tang2018regularized (2018); ncloss:cvpr18 (2018); scribblesup (2016), or image tags pathak2015constrained (2015); papandreou2015weakly (2015), is attracting significant research attention. Imposing prior knowledge on the network's output in the form of unsupervised loss terms is a well-established approach in machine learning weston2012deep (2012); goodfellow2016deep (2016). Such priors can be viewed as regularization terms that leverage unlabeled data, embedding domain-specific knowledge. For instance, the recent studies in tang2018regularized (2018); ncloss:cvpr18 (2018) showed that direct regularization losses, e.g., dense conditional random field (CRF) or pairwise clustering, can yield outstanding results in weakly supervised segmentation, reaching almost full-supervision performances in natural image segmentation; see the results in tang2018regularized (2018). Surprisingly, such a principled direct-loss approach is not common in weakly supervised segmentation. In fact, most of the existing techniques synthesize fully-labeled training masks (proposals) from the available partial labels, mimicking full supervision deepcut (2017); papandreou2015weakly (2015); scribblesup (2016); kolesnikov2016seed (2016).
Typically, such proposal-based techniques iterate two steps: CNN learning, and proposal generation facilitated by dense CRFs and fast mean-field inference koltun:NIPS11 (2011), which are now the de facto choice for pairwise regularization in semantic segmentation algorithms.
This study continues our line of recent work in tang2018regularized (2018), in which we showed the potential of direct pairwise regularization losses (e.g., dense CRF and Potts) in weakly supervised segmentation. Our purpose here is to embed high-order (global) inequality constraints on the network output directly in the loss function, so as to guide learning. For instance, assume that we have some prior knowledge on the size (or volume) of the target region, e.g., in the form of lower and upper bounds on size, a common scenario in medical image segmentation Niethammer2013 (2013); Gorelick2013 (2013). Let $I: \Omega \to \mathbb{R}$ denote a given training image, with $\Omega$ a discrete image domain and $|\Omega|$ the number of pixels/voxels in the image. $\Omega_L \subseteq \Omega$ is a weak (partial) ground-truth segmentation of the image, taking the form of a partial annotation of the target region, e.g., a few points (see the examples in Fig. 2) or image-level tags. In this case, one can optimize a cross-entropy loss subject to inequality constraints on the network outputs pathak2015constrained (2015):
$$\min_{\theta}\; \mathcal{H}(S) = -\sum_{p \in \Omega_L} \log(S_p) \quad \text{s.t.} \quad a \,\le\, \sum_{p \in \Omega} S_p \,\le\, b \qquad (1)$$
where $S = (S_1, \dots, S_{|\Omega|})$ is a vector of softmax probabilities generated by the network at each pixel $p \in \Omega$, and $\theta$ denotes the network parameters. (The softmax probabilities take the form $S_p \propto \exp f_p(\theta, I)$, where $f_p$ is a real scalar function representing the output of the network at pixel $p$; for notational simplicity, we omit the dependence of $S_p$ on $\theta$ and $I$, as this does not result in any ambiguity in the presentation.) Priors $a$ and $b$ denote the given lower and upper bounds on the size (or cardinality) of the target region. Inequality constraints of the form in (1) are very flexible because they do not assume exact knowledge of the target size, unlike Zhang2017 (2017); Boykov2015 (2015). Also, multiple-instance learning (MIL) constraints pathak2015constrained (2015), which enforce image-tag priors, can be handled by constrained model (1). Image tags are a form of weak supervision, which enforce the constraints that a target region is present or absent in a given training image pathak2015constrained (2015). They can be viewed as particular cases of the inequality constraints in (1). For instance, a suppression constraint, which takes the form $\sum_{p \in \Omega} S_p \le 0$, enforces that the target region is not in the image; $\sum_{p \in \Omega} S_p \ge 1$ enforces the presence of the region.
Even though constraints of the form (1) are linear (and hence convex) with respect to the network outputs, constrained problem (1) is very challenging due to the non-convexity of CNNs. One possibility would be to minimize the corresponding Lagrangian dual. However, as pointed out in pathak2015constrained (2015); MarquezNeila2017 (2017), this is computationally intractable for semantic segmentation networks involving millions of parameters; one would have to optimize a CNN within each dual iteration. In fact, constrained optimization has been largely avoided in deep networks Ravi2018 (2018), even though some Lagrangian techniques were applied to neural networks long before the deep learning era Zhang1992 (1992); Platt1988 (1988). These constrained optimization techniques are not applicable to deep CNNs as they solve large linear systems of equations; the numerical solvers underlying them would have to deal with matrices of very large dimensions in the case of deep networks MarquezNeila2017 (2017).
To the best of our knowledge, the method of Pathak et al. pathak2015constrained (2015) is the only prior work that addresses constrained deep CNNs in weakly supervised segmentation. It uses the constraints to synthesize fully-labeled training masks (proposals) from the available partial labels, mimicking full supervision, which avoids intractable dual optimization of the constraints when minimizing the loss function. The main idea of pathak2015constrained (2015) is to model the proposals as a latent distribution. Then, they minimize a KL divergence, encouraging the softmax output of the CNN to match the latent distribution as closely as possible. Therefore, they impose constraints on the latent distribution rather than on the network output, which significantly facilitates Lagrangian dual optimization and decouples stochastic gradient descent learning of the network parameters from constrained optimization: the authors of pathak2015constrained (2015) alternate between optimizing w.r.t. the latent distribution, which corresponds to proposal generation subject to the constraints (a sub-problem that is convex when the constraints are convex), and standard stochastic gradient descent for optimizing w.r.t. the network parameters.
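To make the alternation concrete, consider its simplest instance: a binary problem with a single upper-bound constraint $\sum_p P_p \le b$ on the latent foreground distribution $P$. Minimizing the KL divergence to the network output $Q$ then has a per-pixel closed form, $P_p/(1-P_p) = e^{-\lambda}\, Q_p/(1-Q_p)$, with the dual variable $\lambda$ found by a one-dimensional search. The sketch below is our own simplified illustration of this projection step (not the implementation of pathak2015constrained), using bisection on $\lambda$:

```python
# Simplified proposal (projection) step for one binary upper-bound size
# constraint: project foreground probabilities q_fg onto {sum_p P_p <= b}
# under the KL divergence. Everything beyond this single-constraint binary
# case is a simplifying assumption.
import numpy as np

def project_proposal(q_fg: np.ndarray, b: float, iters: int = 50) -> np.ndarray:
    """KL-project per-pixel foreground probabilities q_fg onto {sum P <= b}."""
    def fg(lmbda):
        # Closed-form primal solution for a given dual variable lambda:
        # P_p proportional to q_p * exp(-lambda) vs. (1 - q_p), per pixel.
        w = q_fg * np.exp(-lmbda)
        return w / (w + (1.0 - q_fg))
    if q_fg.sum() <= b:
        return q_fg                      # already feasible: no projection needed
    lo, hi = 0.0, 50.0                   # bisection bracket for the dual variable
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fg(mid).sum() > b:            # still infeasible: increase lambda
            lo = mid
        else:
            hi = mid
    return fg(hi)                        # feasible solution with sum close to b
```

The CNN is then trained with a standard cross-entropy towards this projected proposal, before the next proposal step.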
We propose to introduce a differentiable term, which enforces inequality constraints (1) directly in the loss function, avoiding expensive Lagrangian dual iterates and proposal generation. From a constrained-optimization perspective, our simple approach is not optimal as there is no guarantee that the constraints are satisfied. However, surprisingly, it yields substantially better results than the proposal-based constrained CNNs in pathak2015constrained (2015), while reducing the computational demand for training. In the context of cardiac image segmentation, we reached a performance close to full supervision while using only a fraction (0.1%) of the full ground-truth labels and image-level tags. Our framework can be easily extended to non-linear inequality constraints, e.g., invariant shape moments Klodt2011 (2011) or other region statistics Lim2014 (2014). Therefore, it has the potential to close the gap between weakly and fully supervised learning in semantic medical image segmentation. Our code is publicly available at https://github.com/LIVIAETS/SizeLoss_WSS.
2 Proposed loss function
We propose the following loss for weakly supervised segmentation:
$$\min_{\theta}\; \mathcal{H}(S) \,+\, \lambda\, \mathcal{C}(V_S), \qquad V_S = \sum_{p \in \Omega} S_p \qquad (2)$$
with function $\mathcal{C}$ given by (see the illustration in Fig. 1):
$$\mathcal{C}(V_S) = \begin{cases} (V_S - a)^2, & \text{if } V_S < a \\ (V_S - b)^2, & \text{if } V_S > b \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
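In a framework with automatic differentiation, Eq. (3) amounts to a few lines. The following PyTorch sketch is illustrative only (the tensor shape and the way the penalty is combined with the partial cross-entropy are our assumptions, not a transcription of the released code):

```python
# Minimal sketch of the size penalty in Eq. (3). `probs` holds the
# foreground softmax probabilities S_p; a and b are the size bounds.
import torch

def size_penalty(probs: torch.Tensor, a: float, b: float) -> torch.Tensor:
    """Quadratic penalty on the predicted size V_S = sum_p S_p."""
    v_s = probs.sum()                    # predicted region size V_S
    if v_s < a:                          # region too small: penalize (V_S - a)^2
        return (v_s - a) ** 2
    if v_s > b:                          # region too large: penalize (V_S - b)^2
        return (v_s - b) ** 2
    return torch.zeros((), device=probs.device)   # bounds satisfied: no penalty
```

The total loss in (2) is then the partial cross-entropy on the annotated pixels plus $\lambda$ times this penalty.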
Now, our differentiable term $\mathcal{C}(V_S)$ accommodates standard stochastic gradient descent. During backpropagation, the term of the gradient-descent update corresponding to $\mathcal{C}$ can be written as follows:
$$\frac{\partial \mathcal{C}(V_S)}{\partial S_p} = \begin{cases} 2\,(V_S - a), & \text{if } V_S < a \\ 2\,(V_S - b), & \text{if } V_S > b \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
where the full gradient with respect to the network parameters follows from the chain rule through $\partial S_p / \partial \theta$, the standard derivative of the softmax outputs of the network. The gradient in (4) has a clear interpretation. During backpropagation, when the current constraints are satisfied, i.e., $a \le V_S \le b$, observe that $\partial \mathcal{C}(V_S) / \partial S_p = 0$. Therefore, in this case, the gradient stemming from our term has no effect on the current parameter update. Now, suppose without loss of generality that the current set of parameters corresponds to $V_S < a$, which means the current target region is smaller than its lower bound (constraint violation). In this case, the term $2(V_S - a)$ is negative and, therefore, the first line of (4) performs a gradient-ascent step on the softmax outputs, increasing $S_p$. This makes sense because it increases the size of the current region, $V_S$, so as to satisfy the constraint. The case $V_S > b$ has a similar interpretation.
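This behaviour can be checked numerically with autograd. The snippet below is a self-contained toy example (the tensor size and the bounds are hypothetical), reproducing the $V_S < a$ case:

```python
# Toy autograd check of the gradient behaviour described above:
# when V_S < a, the gradient of the penalty w.r.t. each S_p is
# 2 (V_S - a) < 0, so a descent step increases every S_p.
import torch

def penalty(probs, a, b):
    v = probs.sum()
    if v < a:
        return (v - a) ** 2
    if v > b:
        return (v - b) ** 2
    return v * 0.0                        # zero penalty, but keeps the graph

probs = torch.full((3, 3), 0.1, requires_grad=True)   # V_S = 0.9
loss = penalty(probs, a=5.0, b=8.0)                   # violated: V_S < a
loss.backward()
# probs.grad is uniform: dC/dS_p = 2 * (0.9 - 5.0) = -8.2, matching Eq. (4)
```

With bounds satisfied (e.g., `a=0.5, b=2.0`), the same check gives a zero gradient, so the term leaves the parameter update untouched.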
In the next sections, we first give details about the dataset, the weakly annotated labels, and our implementation. Then, we empirically evaluate our proposed high-order size loss. In particular, we analyze its contribution to segmentation performance, and compare the results to the weakly supervised case with no regularization term, to proposal generation, and to the fully supervised setting.
3 Experimental setup
3.1 Dataset
Experiments on the proposed high-order size loss focused on left ventricular (LV) endocardium segmentation. For this purpose we employed the training set of the publicly available dataset from the 2017 ACDC Challenge (https://www.creatis.insa-lyon.fr/Challenge/acdc/). This set consists of 100 cine magnetic resonance (MR) exams covering well-defined pathologies: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, abnormal right ventricle, and patients without cardiac disease. Exams were acquired in breath-hold with retrospective or prospective gating and an SSFP sequence in 2-chamber, 4-chamber, and short-axis orientations. A series of short-axis slices covers the LV from the base to the apex, with a thickness of 5 to 8 mm and often an inter-slice gap of 5 mm. The spatial resolution ranges from 0.83 to 1.75 mm/pixel.
For all the experiments, we employed the same 75 exams for training and the remaining 25 for validation purposes. To increase the variability of the data, we augment the dataset by randomly rotating, flipping, mirroring and scaling the images.
3.2 Weakly annotated labels
To generate the weak labels, we employed binary erosion on the fully annotated labels with a kernel of size 10×10. If the resulting label disappeared, we repeated the operation with a smaller kernel (i.e., 7×7) until we obtained a small contour. Thus, the total number of annotated pixels represented 0.1% of the labeled pixels in the fully supervised scenario. Figure 2 depicts some examples of fully annotated images and their corresponding weak labels.
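The erosion step above can be sketched with scipy.ndimage; the fallback logic when the 10×10 kernel erases the label is our simplified reading of the procedure:

```python
# Sketch of the weak-label generation described above: erode the full
# ground-truth mask to a small central blob, retrying with a smaller
# kernel (7x7) if the 10x10 erosion removes the label entirely.
import numpy as np
from scipy.ndimage import binary_erosion

def weak_label(full_mask: np.ndarray, kernel: int = 10, fallback: int = 7) -> np.ndarray:
    """Return a small blob of annotated pixels from a full binary mask."""
    eroded = binary_erosion(full_mask, structure=np.ones((kernel, kernel)))
    if not eroded.any():                 # label vanished: retry with smaller kernel
        eroded = binary_erosion(full_mask, structure=np.ones((fallback, fallback)))
    return eroded

# Example: a 16x16 ground-truth square shrinks to a small central region.
full = np.zeros((32, 32), dtype=bool)
full[8:24, 8:24] = True
weak = weak_label(full)
```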
To compute the lower and upper bounds for the proposed size loss, manual segmentations from only one subject were employed. Specifically, we computed the minimum and maximum size of the left ventricular endocardium over the slices, and then multiplied the minimum by 0.9 and the maximum by 1.1, to account for size variations across exams.
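A minimal sketch of this bound computation, assuming the margins are meant to widen the range observed on the single annotated subject (0.9× the observed minimum, 1.1× the observed maximum); the per-slice mask inputs are hypothetical:

```python
# Sketch of computing size bounds (a, b) from one annotated subject:
# per-slice target sizes, widened by the 0.9 / 1.1 safety margins.
import numpy as np

def size_bounds(slice_masks):
    """slice_masks: list of binary masks, one per short-axis slice."""
    sizes = [int(m.sum()) for m in slice_masks if m.any()]  # skip empty slices
    a = 0.9 * min(sizes)        # lower bound: shrink the observed minimum
    b = 1.1 * max(sizes)        # upper bound: inflate the observed maximum
    return a, b

# Example with two non-empty slices of 100 and 200 target pixels.
m1 = np.zeros((256,), dtype=bool); m1[:100] = True
m2 = np.zeros((256,), dtype=bool); m2[:200] = True
a, b = size_bounds([m1, m2, np.zeros((256,), dtype=bool)])
```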
3.3 Training and implementation details
For all the proposed experiments we employ the ENet network paszke2016enet (2016), as it has shown a good trade-off between accuracy and inference time. Nevertheless, the proposed high-order size loss is general and can be applied to any CNN. The network is trained from scratch with the Adam optimizer and a batch size of 1. The initial learning rate was set to $5 \times 10^{-4}$ and was halved after 100 epochs. The weight $\lambda$ of the size loss in (2) was empirically set to $1 \times 10^{-2}$; its influence on performance has not been investigated in this work. Input images had a size of 256×256 pixels.
The proposals method pathak2015constrained (2015) reuses the same network and loss function as the fully supervised setting. At each iteration, a synthetic ground truth is generated using projected gradient ascent (PGA), before computing the cross-entropy between this synthetic ground truth and the network output. We found that limiting the number of PGA iterations, relative to the original setting, saved time without significantly impacting the results. We employ a single upper-bound constraint, with the bound obtained by the same strategy as in Section 3.2 and scaled by a multiplicative constant; we had to add this constant to make the method work. This is discussed in more depth in Section 4.2.
For evaluation purposes, since each method has a different loss function, we resort to the common Dice similarity coefficient (DSC) to compare them.
We used a combination of PyTorch paszke2017automatic (2017) and NumPy for our implementation, and ran the experiments on a machine equipped with an NVIDIA GTX 1080 Ti GPU (11 GB of memory). The code is available at https://github.com/LIVIAETS/SizeLoss_WSS.
4 Results
This section presents the experimental results of this paper. First, in Sec. 4.1 we evaluate the impact of including an additional size loss during training in a weakly supervised setting, showing that incorporating this size prior directly into the loss yields a highly significant boost in performance. Then, in Sec. 4.2 we compare the effect of using the direct loss to generating proposals iteratively during training; compared to this strategy, our proposed method achieved state-of-the-art performance for weakly supervised segmentation with small blobby regions as labels. We also provide results for the fully supervised setting in Sec. 4.3. After that, qualitative results are presented in Sec. 4.4. In addition, we compare the different learning strategies in terms of efficiency (Sec. 4.5), showing that the direct introduction of the proposed loss during learning does not affect training times.
4.1 Weakly supervised segmentation with size loss
We trained a segmentation network from weakly annotated images with no additional information, which served as a baseline. Training this model relies on computing the cross-entropy on labeled pixels only. Then, we trained another network with the same weakly annotated images but slightly more supervision than just the few annotated pixels. First, we included size prior information via the upper bound only, when the target is present in the image; in this case, the lower bound $a$ in Eqs. (3) and (4) is replaced by 0. Thus, the CNN is constrained to generate segmentations whose sizes must be below the upper bound $b$. If the target is not present in the image, we instead replace the upper bound $b$ by 0, similarly to MIL scenarios with suppression constraints; this drives the predicted size to 0. To investigate the effect of bounding the CNN output on both sides, we trained a third network from weakly annotated images with size prior information as both lower and upper bounds. The forward and backward passes are defined by Eqs. (3) and (4), where $a$ and $b$ are computed from one training subject following the steps described in Sec. 3.2. The tight lower and upper size bounds have values of 98 and 1723 pixels, respectively. The cross-entropy on annotated pixels and the proposed size loss are jointly minimized.
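The three settings above differ only in which bounds are active for a given training image. The helper below is a hypothetical illustration of that selection logic (the names and signature are ours, not from the released code):

```python
# Sketch of per-image bound selection for the supervision settings above:
# tag-absent images get a suppression constraint (b = 0); tag-present
# images keep either the upper bound only, or both tight bounds.
def select_bounds(target_present: bool, a: float, b: float, upper_only: bool):
    if not target_present:
        return 0.0, 0.0          # suppression: predicted size driven to 0
    if upper_only:
        return 0.0, b            # one-sided: lower bound a replaced by 0
    return a, b                  # two-sided: tight lower and upper bounds
```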
Table 1 reports results on the validation set for all the methods. The baseline is the network trained with cross-entropy only on the labeled pixels. It achieves poor results, since no information other than the annotated pixels is given, and no regularization term provides hints about the desired segmentation. This baseline obtained mean DSC values below 0.10, which is clearly unsatisfactory. Indeed, after approximately 100 epochs (Fig. 3), we observed that training got stuck and the network did not learn afterwards. On the other hand, constraining the CNN prediction with the proposed size loss significantly increased the performance of the network, as can be seen in both Table 1 and Fig. 3. Employing only an upper bound $b$ yields a mean DSC of 0.8189; if a lower bound is also considered, the mean DSC increases to 0.8415.
Table 1: Mean DSC on the validation set.

Method  DSC (Val)
Cross-entropy (partial labels only)  0.0721
Proposals (one loose upper bound)  0.6124
Proposals (one tight upper bound)  0.0659
Cross-entropy + size loss (one loose upper bound)  0.8107
Cross-entropy + size loss (one tight upper bound)  0.8189
Cross-entropy + size loss (two tight bounds)  0.8415
Fully supervised  0.9284
4.2 Implicit loss against segmentation proposals
In this section we compare the direct inclusion of the proposed high-order size loss during training to the segmentation-proposal generation of pathak2015constrained (2015). First, we imposed the same tight bound employed with our proposed size loss. However, the performance of this method was deficient: the network quickly produced only empty segmentations and could not recover from them. This explains the flat line in the left plot of Fig. 3. We observed that, with a much looser upper bound (one sixth of the image size), this approach achieved better results, as reported in Table 1 and Fig. 3 (right). Learning with this method nonetheless remained very unstable, while slowly overfitting. To compare fairly with the proposed size loss, we also evaluated our loss with a loose bound equal to 10000 pixels. As shown in Table 1, constraining the learning with our size loss and this loose bound still yields a clear performance increase over the proposal-generation approach in pathak2015constrained (2015) under the same conditions.
Since we were not able to use a tight bound with pathak2015constrained (2015), we did not investigate the case with two bounds. Indeed, a single upper bound $b$ is equivalent to the pair of bounds $(0, b)$, so adding a lower bound slightly above 0, as in Sec. 4.1, would be effectively the same. As the results show, compared to our direct size loss, generating proposals gives inferior segmentation performance in all tested settings.
4.3 Pixellevel annotations
Finally, we compared the weakly supervised strategies to a network trained on fully labeled images. Training the network with strong pixel-level annotations achieved a mean DSC of 0.9284 on the validation set, an increase of about 10% with respect to the best proposed method. Nevertheless, it is noteworthy that the proposed weakly supervised learning approach with size-loss constraints approaches the fully supervised setting while using only 0.1% of the annotated pixels. This indicates that the current work shortens the gap between fully and weakly supervised learning for semantic segmentation, pushing the bounds of the latter.
4.4 Qualitative results
To get some intuition about the different learning strategies and their effects on the segmentation, we visualize some results in Fig. 4. For the direct size loss and the proposal-generation approach, we selected the best-performing model to display these images. We can observe that, in all cases, the baseline gives unacceptable segmentations, in line with the findings in Table 1 and Fig. 3. Generating proposals during training can improve segmentation performance compared to the baseline; nevertheless, the examples in Fig. 4 (fourth column) show that these segmentations remain far from satisfactory. In contrast, integrating the proposed size loss directly in the backpropagation substantially increases the accuracy of the network, as can be seen in the last column of Fig. 4. An interesting observation is that, in some cases (last row), weakly supervised learning produces more reliable segmentations than training with full supervision.
4.5 Efficiency
In this section we compare the different learning approaches in terms of efficiency (Table 2). The weakly supervised baseline and the fully supervised model need to compute only one loss per pass, which is reflected in the lowest training times reported in the table. Including the size loss does not affect the computation time, as can be seen in these results. As expected, the iterative process introduced by pathak2015constrained (2015) at each forward pass causes a non-negligible overhead during training. To generate their synthetic ground truth, they need to optimize the Lagrangian function of their constrained loss with respect to its dual variables, which requires alternating between computing softmax probabilities and gradients of those probabilities. Even in a simple optimization case with only one constraint, where this variable converges quickly, it adds a few milliseconds at each iteration.
Table 2: Training times per method.

Method  Training time (ms/image)
Weakly supervised (CE only)  10
Weakly supervised (CE + size loss)  10
Weakly supervised (proposals)  15
Fully supervised  10
5 Conclusion
We presented a novel loss function for weakly supervised image segmentation which, despite its simplicity, reaches performance close to full supervision. It performs significantly better than the other methods evaluated for this task, while adding negligible computational overhead.
Acknowledgments
This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), discovery grant program, and by the ÉTS Research Chair on Artificial Intelligence in Medical Imaging.
References
 Bearman2016 [2016] Amy L. Bearman, Olga Russakovsky, Vittorio Ferrari, and FeiFei Li. What’s the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565, 2016.
 Boykov2015 [2015] Yuri Boykov, Hossam N. Isack, Carl Olsson, and Ismail Ben Ayed. Volumetric bias in segmentation and reconstruction: Secrets and solutions. In IEEE International Conference on Computer Vision (ICCV), pages 1769–1777, 2015.
 goodfellow2016deep [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
 Gorelick2013 [2013] Lena Gorelick, Frank R. Schmidt, and Yuri Boykov. Fast trust region for segmentation. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1714–1721, 2013.
 Klodt2011 [2011] Maria Klodt and Daniel Cremers. A convex framework for image segmentation with moment constraints. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2236–2243, 2011.
 kolesnikov2016seed [2016] Alexander Kolesnikov and Christoph H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages 695–711. Springer, 2016.
 koltun:NIPS11 [2011] Philipp Krahenbuhl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
 Lim2014 [2014] Yongsub Lim, Kyomin Jung, and Pushmeet Kohli. Efficient energy minimization for enforcing label statistics. IEEE Trans. Pattern Anal. Mach. Intell., 36(9):1893–1899, 2014.
 scribblesup [2016] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribblesupervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.
 Litjens2017 [2017] Geert J. S. Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
 FCN [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 MarquezNeila2017 [2017] Pablo MárquezNeila, Mathieu Salzmann, and Pascal Fua. Imposing hard constraints on deep networks: Promises and limitations. CoRR, abs/1706.02025, 2017.
 Niethammer2013 [2013] Marc Niethammer and Christopher Zach. Segmentation with area constraints. Medical Image Analysis, 17(1):101–112, 2013.
 papandreou2015weakly [2015] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.
 paszke2016enet [2016] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for realtime semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
 paszke2017automatic [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 pathak2015constrained [2015] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1796–1804, 2015.
 Platt1988 [1988] J. C. Platt and A. H. Barr. Constrained differential optimization. Technical report, California Institute of Technology, 1988.
 deepcut [2017] Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan PasseratPalmbach, Wenjia Bai, Mellisa Damodaram, Mary A Rutherford, Joseph V Hajnal, Bernhard Kainz, et al. Deepcut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE transactions on medical imaging, 36(2):674–683, 2017.
 Ravi2018 [2018] Sathya N. Ravi, Tuan Dinh, Vishnu Sai Rao Lokhande, and Vikas Singh. Constrained deep learning using conditional gradient and applications in computer vision. arXiv:1803.0645, 2018.
 ncloss:cvpr18 [2018] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized Cut Loss for Weaklysupervised CNN Segmentation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, June 2018.
 tang2018regularized [2018] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weaklysupervised cnn segmentation. arXiv preprint arXiv:1803.09569, 2018.
 weston2012deep [2012] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
 Zhang1992 [1992] S. Zhang and A. Constantinides. Lagrange programming neural networks. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(7):441–452, 1992.
 Zhang2017 [2017] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In IEEE International Conference on Computer Vision (ICCV), pages 2039–2049, 2017.