# End-to-End Deep Convolutional Active Contours for Image Segmentation

###### Abstract

The Active Contour Model (ACM) is a standard image analysis technique whose numerous variants have attracted an enormous amount of research attention across multiple fields. Incorrectly, however, the ACM’s differential-equation-based formulation and prototypical dependence on user initialization have been regarded as being largely incompatible with the recently popular deep learning approaches to image segmentation. This paper introduces the first tight unification of these two paradigms. In particular, we devise Deep Convolutional Active Contours (DCAC), a truly end-to-end trainable image segmentation framework comprising a Convolutional Neural Network (CNN) and an ACM with learnable parameters. The ACM’s Eulerian energy functional includes per-pixel parameter maps predicted by the backbone CNN, which also initializes the ACM. Importantly, both the CNN and ACM components are fully implemented in TensorFlow, and the entire DCAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. As a challenging test case, we tackle the problem of building instance segmentation in aerial images and evaluate DCAC on two publicly available datasets, Vaihingen and Bing Huts. Our results demonstrate that, for building segmentation, the DCAC establishes a new state-of-the-art performance by a wide margin.

## 1 Introduction

The ACM [12] is one of the most influential computer vision techniques. It has been successfully employed in various image analysis tasks, including object segmentation and tracking. In most ACM variants, the deformable curves of interest evolve dynamically through an iterative procedure that minimizes a corresponding energy functional. Since the ACM is a model-based formulation founded on geometric and physical principles, the segmentation process relies mainly on the content of the image itself, not on large annotated image datasets, extensive computational resources, and hours or days of training. However, the classic ACM relies on some degree of user interaction to specify the initial contour and tune the parameters of the energy functional, which undermines its applicability to the automated analysis of large quantities of images.

In recent years, Deep Neural Networks (DNNs) have become popular in many areas. In computer vision and medical image analysis, CNNs have been successfully exploited for different segmentation tasks [11, 8, 17]. Despite their tremendous success, the performance of CNNs is still very dependent on their training datasets. In essence, CNNs rely on a filter-based learning scheme in which the weights of the network are usually tuned using a back-propagation error gradient descent approach. Since CNN architectures often include millions of trainable parameters, the training process relies on the sheer size of the dataset. In addition, CNNs usually generalize poorly to images that differ from those in the training datasets and they are vulnerable to adversarial examples [23]. For image segmentation, capturing the details of object boundaries and delineating them remains a challenging task even for the most promising of CNN architectures that have achieved state-of-the-art performance on relevant benchmark datasets [4, 9, 24]. The recently proposed Deeplabv3+ [5] has mitigated this problem to some extent by leveraging the power of dilated convolutions, but such improvements were made possible by extensive pre-training and vast computational resources; 50 GPUs were reportedly used to train this model.

In this paper, we aim to bridge the gap between CNNs and ACMs by introducing a truly end-to-end framework. Our framework leverages an automatically differentiable ACM with trainable parameters that allows for back-propagation of gradients. This ACM can be trained along with a backbone CNN from scratch and without any pre-training. Moreover, our ACM utilizes a locally-penalized energy functional that is directly predicted by its backbone CNN, in the form of 2D feature maps, and it is initialized directly by the CNN. Thus, our work removes one of the biggest obstacles to exploiting the power of ACMs, namely the need for any type of user supervision or intervention.

As a challenging test case for our DCAC framework, we tackle the problem of building instance segmentation in aerial images. Our DCAC sets new state-of-the-art benchmarks on the Vaihingen and Bing Huts datasets for building instance segmentation, outperforming its closest competitor by a wide margin.

## 2 Related Work

#### Eulerian active contours:

Eulerian active contours evolve the segmentation curve by dynamically propagating an implicit function so as to minimize its associated energy functional [18]. The most notable approaches that utilize this formulation are the active contours without edges by Chan and Vese [3] and the geodesic active contours by Caselles et al. [2]. The Caselles-Kimmel-Sapiro model is mainly dependent on the location of the level-set, whereas the Chan-Vese model mainly relies on the content difference between the interior and exterior of the level-set. In addition, the work by [14] proposes a reformulation of the Chan-Vese model in which the energy functional incorporates image properties in local regions around the level-set, and it was shown to more accurately segment objects with heterogeneous features.

#### “End-to-End” CNNs with ACMs:

Several efforts have attempted to integrate CNNs with ACMs
in an end-to-end manner as opposed to utilizing the ACM merely as a
post-processor of the CNN output. Le et al.
[15] implemented level-set ACMs as Recurrent
Neural Networks (RNNs) for the task of semantic segmentation of
natural images. There exist three key differences between our proposed
DCAC and this effort: (1) DCAC does not reformulate ACMs as RNNs and
as a result is more computationally efficient. (2) DCAC benefits from
a novel locally-penalized energy functional, whereas
[15] has constant weighted parameters. (3) DCAC
has an entirely different pipeline—we employ a single CNN that is
trained from scratch along with the ACM, whereas
[15] requires two *pre-trained* CNN
backbones (one for object localization, the other for classification).
The dependence of [15] on pre-trained CNNs has
limited its applicability. The other attempt, the DSAC model by Marcos
et al. [16], is an integration of ACMs
with CNNs in a structured prediction framework for building instance
segmentation in aerial images. There are 3 key differences between
DCAC and this work: (1) [16] heavily depends on
the *manual initialization* of contours, whereas our DCAC is
fully automated and runs without any external supervision. (2) The ACM
used in [16] has a parametric formulation that
can handle only a single building at a time, whereas our DCAC
leverages the Eulerian ACM which can naturally handle multiple
building instances simultaneously. (3) [16]
requires the user to *explicitly* calculate the gradients,
whereas our approach fully automates the direct back-propagation of
gradients through the entire DCAC framework due to its automatically
differentiable ACM.

#### Building instance segmentation:

Modern CNN-based methods have been used with different approaches to the problem of building segmentation. Some efforts have treated this problem as a semantic segmentation problem [20, 22] and utilized post-processing steps to extract the building boundaries. Other efforts have utilized instance segmentation networks [10] to directly predict the location of buildings.

## 3 Level Set Active Contours

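First proposed by Osher and Sethian [19] to evolve wavefronts in CFD simulations, a level-set is an implicit representation of a hypersurface that is dynamically evolved according to the nonlinear Hamilton-Jacobi equation. In 2D, let $C(t) = \{(x,y) \mid \phi(x,y,t) = 0\}$ be a closed time-varying contour represented in $\Omega \subset \mathbb{R}^2$ by the zero level set of the signed distance map $\phi(x,y,t)$. The function $\phi(x,y,t)$ evolves according to

$$
\begin{cases}
\dfrac{\partial \phi}{\partial t} = |\nabla \phi| \, \operatorname{div}\!\left( \dfrac{\nabla \phi}{|\nabla \phi|} \right), \\[2ex]
\phi(x,y,0) = \phi_0(x,y),
\end{cases}
\tag{1}
$$

where $\phi(x,y,0)$ represents the initial level set.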

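We introduce a generalization of the level-set ACM proposed by Chan and Vese [3]. Their model assumes that the image of interest $I(x,y)$ consists of two areas of distinct intensities. The interior of $C$ is represented by the smoothed Heaviside function

$$
H_\epsilon(\phi) = \frac{1}{2} + \frac{1}{\pi} \arctan\!\left( \frac{\phi}{\epsilon} \right)
\tag{2}
$$

and $1 - H_\epsilon(\phi)$ represents its exterior. The derivative of (2) is the smoothed Dirac delta function

$$
\delta_\epsilon(\phi) = \frac{\partial H_\epsilon(\phi)}{\partial \phi} = \frac{1}{\pi} \frac{\epsilon}{\epsilon^2 + \phi^2}.
\tag{3}
$$

The energy functional associated with $C$ is written as

$$
\begin{aligned}
E(\phi) = {} & \int_\Omega \mu \, \delta_\epsilon(\phi) \, |\nabla \phi| + \nu \, H_\epsilon(\phi) \; dx \, dy \\
& + \int_\Omega \lambda_1(x,y) \, \bigl( I(x,y) - m_1 \bigr)^2 H_\epsilon(\phi) \; dx \, dy \\
& + \int_\Omega \lambda_2(x,y) \, \bigl( I(x,y) - m_2 \bigr)^2 \bigl( 1 - H_\epsilon(\phi) \bigr) \; dx \, dy,
\end{aligned}
\tag{4}
$$

where $\mu$ penalizes the length of $C$ and $\nu$ penalizes its enclosed area (both are set to fixed constants), and where $m_1$ and $m_2$ are the mean image intensities inside and outside $C$. We follow Lankton et al. [14] and define $m_1$ and $m_2$ as the mean image intensities inside and outside $C$ within a local window around $C$.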
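As an illustration of the quantities defined in this section, the following is a minimal NumPy sketch of the smoothed Heaviside function, the smoothed Dirac delta, and a pixel-wise discretization of the energy functional. It is not the TensorFlow implementation described in this paper; the function names, the finite-difference gradient, and the default values of `mu`, `nu`, and `eps` are assumptions chosen for the example.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smoothed Heaviside: near 1 where phi > 0 (interior), near 0 where phi < 0."""
    return 0.5 + np.arctan(phi / eps) / np.pi

def dirac(phi, eps=1.0):
    """Smoothed Dirac delta, the derivative of heaviside() with respect to phi."""
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def energy(phi, image, lam1, lam2, m1, m2, mu=0.2, nu=0.0, eps=1.0):
    """Discretized energy: length and area penalties plus weighted region terms.

    lam1, lam2 may be scalars (constant weights) or per-pixel maps;
    m1, m2 may be scalars (global means) or per-pixel local means.
    mu, nu, eps are illustrative defaults, not values taken from the text.
    """
    h = heaviside(phi, eps)
    gy, gx = np.gradient(phi)                 # central differences along rows, columns
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    length_term = mu * dirac(phi, eps) * grad_mag
    area_term = nu * h
    inside_term = lam1 * (image - m1) ** 2 * h
    outside_term = lam2 * (image - m2) ** 2 * (1.0 - h)
    return float(np.sum(length_term + area_term + inside_term + outside_term))

# Toy image: a bright square on a dark background.
image = np.zeros((16, 16))
image[4:12, 4:12] = 1.0
weights = np.ones_like(image)

# A level set that is positive on the square should have lower energy
# than the same level set with its sign flipped.
phi_good = np.where(image > 0.5, 3.0, -3.0)
e_good = energy(phi_good, image, weights, weights, m1=1.0, m2=0.0)
e_bad = energy(-phi_good, image, weights, weights, m1=1.0, m2=0.0)
```

Passing arrays for `lam1` and `lam2` corresponds to the per-pixel parameter maps, while passing scalars recovers constant weights.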

Note that to afford greater control over $C$, we have generalized the
constants $\lambda_1$ and $\lambda_2$ used in [3] to
parameter *functions* $\lambda_1(x,y)$ and $\lambda_2(x,y)$ in
(4). The contour expands or shrinks at a certain location $(x,y)$
if $\lambda_2(x,y) > \lambda_1(x,y)$ or
$\lambda_2(x,y) < \lambda_1(x,y)$, respectively [6]. In DCAC, these
parameter functions are trainable and learned directly by the backbone
CNN. Fig. 2 illustrates an example of the
maps learned by the CNN.
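For reference, the contour evolution implied by the energy functional can be written out explicitly. The flow below is not stated in the text above; it is the standard Chan-Vese gradient-descent flow, obtained by taking the first variation of the energy with respect to $\phi$ while holding the means $m_1$ and $m_2$ fixed, with the constant weights replaced by the per-pixel maps:

$$
\frac{\partial \phi}{\partial t} = \delta_\epsilon(\phi) \left[ \mu \, \operatorname{div}\!\left( \frac{\nabla \phi}{|\nabla \phi|} \right) - \nu - \lambda_1(x,y) \bigl( I(x,y) - m_1 \bigr)^2 + \lambda_2(x,y) \bigl( I(x,y) - m_2 \bigr)^2 \right].
$$

Since the interior corresponds to $\phi > 0$, a positive bracket raises $\phi$ and pushes the contour outward. For comparable intensity residuals, a larger $\lambda_2(x,y)$ therefore favors local expansion and a larger $\lambda_1(x,y)$ favors local shrinkage, which matches the behavior described above.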

## 4 CNN Backbone

As our CNN backbone, we follow [7] and utilize a fully convolutional encoder-decoder architecture with dilated residual blocks (Fig. 3). Each convolutional layer is followed by a Rectified Linear Unit (ReLU) as the activation layer and by batch normalization. The dilated residual block consists of 2 consecutive dilated convolutional layers whose outputs are fused with its input and fed into the ReLU activation layer. In the encoder, each path consists of 2 consecutive convolutional layers, followed by a dilated residual unit with a dilation rate of 2. Before being fed into the dilated residual unit, the outputs of these convolutional layers are added to the output feature maps of another 2 consecutive convolutional layers that learn additional multi-scale information from the resized input image in that resolution. To recover the content lost in the learned feature maps during the encoding process, we utilize a series of consecutive dilated residual blocks with dilation rates of 1, 2, and 4 and feed the output to a dilated spatial pyramid pooling layer with 4 different dilation rates of 1, 6, 12, and 18. The decoder is connected to the dilated residual units at each resolution via skip connections, and in each path we up-sample the image and employ 2 consecutive convolutional layers before proceeding to the next resolution. The output of the decoder is fed into another series of 2 consecutive convolutional layers and then passed into 3 separate convolutional layers for predicting the output maps of $\lambda_1(x,y)$ and $\lambda_2(x,y)$ as well as the distance transform.
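To give a rough sense of why dilated convolutions are useful in such a backbone, the sketch below computes the one-dimensional receptive field of a stack of stride-1 convolutions. The helper `receptive_field` is hypothetical, and the 3x3 kernel size is an assumption, since the kernel size is not stated in the text.

```python
def receptive_field(layers):
    """Receptive field along one axis of stacked stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation_rate) pairs; each layer
    widens the receptive field by (kernel_size - 1) * dilation_rate pixels.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Three dilated residual blocks with rates 1, 2, 4, two convolutions per block.
# The 3x3 kernel size is assumed for illustration.
dilated_stack = [(3, rate) for rate in (1, 2, 4) for _ in range(2)]

# The same number of layers without dilation, for comparison.
plain_stack = [(3, 1)] * 6

rf_dilated = receptive_field(dilated_stack)
rf_plain = receptive_field(plain_stack)
```

Under these assumptions the dilated stack covers more than twice the context of the undilated one with the same number of weights.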

## 5 DCAC Architecture and Implementation

In our DCAC framework (Fig. 1), the CNN backbone serves to directly initialize the zero level-set contour as well as the weighted local parameters. We initialize the zero level-set by a learned distance transform that is directly predicted by the CNN along with additional convolutional layers that learn the parameter maps. Figure 2 illustrates an example of what the backbone CNN learns in the DCAC on one input image from the Vaihingen dataset. These learned parameters are then passed to the ACM, which unfolds for a certain number of timesteps in a differentiable manner. The final zero level-set is then converted to logits and compared with the label, and the resulting error is back-propagated through the entire framework in order to tune the weights of the CNN backbone. Algorithm 1 presents the details of the DCAC training algorithm.
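The unrolled ACM can be pictured with a forward-only NumPy sketch such as the one below. It is a simplified illustration under several assumptions, not the DCAC implementation: it uses global rather than local-window means, an explicit Euler update of the standard Chan-Vese flow, illustrative values for `mu`, `nu`, `eps`, and `dt`, and a circle as a stand-in for the initial level set that the CNN would predict. In DCAC the same computation is expressed in TensorFlow so that gradients can flow back through every step.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    return 0.5 + np.arctan(phi / eps) / np.pi

def dirac(phi, eps=1.0):
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def curvature(phi, tiny=1e-8):
    """div(grad(phi) / |grad(phi)|) using central differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    dny_dy = np.gradient(gy / norm, axis=0)
    dnx_dx = np.gradient(gx / norm, axis=1)
    return dnx_dx + dny_dy

def evolve(phi0, image, lam1, lam2, steps=100, dt=1.0, mu=0.2, nu=0.0, eps=1.0):
    """Unroll `steps` explicit updates of the level set phi (positive inside)."""
    phi = np.asarray(phi0, dtype=float).copy()
    for _ in range(steps):
        h = heaviside(phi, eps)
        # Global means inside and outside the contour (a simplification).
        m1 = np.sum(image * h) / (np.sum(h) + 1e-8)
        m2 = np.sum(image * (1.0 - h)) / (np.sum(1.0 - h) + 1e-8)
        force = (mu * curvature(phi) - nu
                 - lam1 * (image - m1) ** 2
                 + lam2 * (image - m2) ** 2)
        phi = phi + dt * dirac(phi, eps) * force
    return phi

# Toy image: a bright square; the initial contour is a small circle inside it.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
yy, xx = np.mgrid[0:32, 0:32]
phi0 = 6.0 - np.sqrt((yy - 15.5) ** 2 + (xx - 15.5) ** 2)

lam = np.ones_like(image)          # stand-in for the learned parameter maps
phi = evolve(phi0, image, lam, lam, steps=400)
mask = phi > 0                     # final segmentation
```

Replacing `lam` with spatially varying arrays changes the local balance between the inward and outward forces, which is the role played by the maps predicted by the backbone CNN.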

### 5.1 Implementation Details

All components of DCAC, including the ACM, have been implemented entirely in TensorFlow [1] and are compatible with both TensorFlow 1.x and 2.0 versions. The ACM implementation benefits from the automatic differentiation utility of TensorFlow and has been designed to enable the back-propagation of the error gradient through the layers of the ACM.

In each ACM layer, each point along the zero level-set contour is probed by a local window, and the mean intensities of the inside and outside regions, i.e., $m_1$ and $m_2$ in (4), are extracted. In our implementation, $m_1$ and $m_2$ are extracted by using a differentiable global average pooling layer with appropriate padding so as not to lose any information on the edges.
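One way to picture this local-window extraction is as a ratio of two box-filtered images, as in the NumPy sketch below. The functions `box_sum` and `local_means`, the window radius, and the zero-padding choice are assumptions for illustration; the actual implementation uses a TensorFlow pooling layer.

```python
import numpy as np

def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1) x (2*radius+1) window around each pixel.

    Zero padding keeps the output the same size as the input, so pixels on
    the image border are still covered.
    """
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def local_means(image, h, radius=2, tiny=1e-8):
    """Local mean intensities inside (m1) and outside (m2) the contour.

    `h` is the (smoothed) interior indicator. Dividing two box sums averages
    only over pixels that belong to each region, so the zero padding does not
    bias the means near the image border.
    """
    m1 = box_sum(image * h, radius) / (box_sum(h, radius) + tiny)
    m2 = box_sum(image * (1.0 - h), radius) / (box_sum(1.0 - h, radius) + tiny)
    return m1, m2

# Toy example: dark left half (outside), bright right half (inside).
image = np.full((6, 8), 0.2)
image[:, 4:] = 0.8
h = np.zeros_like(image)
h[:, 4:] = 1.0
m1, m2 = local_means(image, h)
```

Because both the numerator and the denominator are padded in the same way, a window that overlaps the border simply averages over fewer pixels.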

All training was performed on an Nvidia Titan XP GPU and an Intel® Core™ i7-7700K CPU @ 4.20GHz. The minibatch sizes for training on the Vaihingen and Bing Huts datasets were 3 and 20, respectively. All the training sessions employ the Adam optimization algorithm [13] with a learning rate of 0.001 that decays by a factor of 10 every 10 epochs.
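The learning-rate schedule can be sketched as a simple step function. The helper below is hypothetical and assumes a staircase decay, which is one natural reading of the description; a smooth exponential decay with the same rate would be another.

```python
def learning_rate(epoch, base_lr=1e-3, factor=10.0, every=10):
    """Step decay: divide the base rate by `factor` once every `every` epochs."""
    return base_lr / factor ** (epoch // every)

# Learning rates for the first 30 epochs (0-indexed).
schedule = [learning_rate(epoch) for epoch in range(30)]
```

The resulting value would be passed to the optimizer at the start of each epoch.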

Table 1: Model evaluations on the Vaihingen and Bing Huts datasets.

| Model | Vaihingen Dice | Vaihingen mIoU | Vaihingen WCov | Vaihingen BoundF | Bing Huts Dice | Bing Huts mIoU | Bing Huts WCov | Bing Huts BoundF |
|---|---|---|---|---|---|---|---|---|
| DSAC | – | 0.840 | – | – | – | 0.650 | – | – |
| UNet | 0.810 | 0.797 | 0.843 | 0.622 | 0.710 | 0.740 | 0.852 | 0.421 |
| ResNet | 0.801 | 0.791 | 0.841 | 0.770 | 0.81 | 0.797 | 0.864 | 0.434 |
| Backbone CNN | 0.837 | 0.825 | 0.865 | 0.680 | 0.737 | 0.764 | 0.809 | 0.431 |
| DCAC: Single Inst | 0.928 | 0.929 | 0.943 | 0.819 | 0.855 | 0.860 | 0.894 | 0.534 |
| DCAC: Multi Inst | 0.908 | 0.893 | 0.910 | 0.797 | 0.797 | 0.809 | 0.839 | 0.491 |
| DCAC: Single Inst, Const | 0.877 | 0.888 | 0.936 | 0.801 | 0.792 | 0.813 | 0.889 | 0.513 |
| DCAC: Multi Inst, Const | 0.857 | 0.842 | 0.876 | 0.707 | 0.757 | 0.777 | 0.891 | 0.486 |

## 6 Experiments

### 6.1 Datasets

#### Vaihingen:

The Vaihingen buildings
dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html)
consists of 168 building images of size pixels. The
labels for each image are generated by using a semi-automated
approach. We used 100 images for training and 68 for testing,
following the same data split as in [16]. In this
dataset, almost all images consist of multiple instances of buildings,
some of which are located at the edges of the image.

#### Bing Huts:

The Bing Huts
dataset (https://www.openstreetmap.org/#map=4/38.00/-95.80)
consists of 605 images of size . We followed the same data
split that is used in [16] and used 335 images
for training and 270 images for testing. This dataset is especially
challenging due to the low spatial resolution and contrast that are
exhibited in the images.

### 6.2 Evaluation Metrics and Loss Function

To evaluate our model’s performance, we utilized five different metrics—Dice, mean Intersection over Union (mIoU), Weighted Coverage (WCov), Boundary F (BoundF), and Root Mean Square Error (RMSE). The original DSAC paper only reported on mIoU for both Vaihingen and Bing Huts and only RMSE for the Bing Huts dataset. However, since the delineation of boundaries is one of the important goals of our framework, we employ the BoundF metric [21] to precisely measure the similarity between the specific boundary pixels in our predictions and the corresponding image labels. Furthermore, we used a soft Dice loss function in training our model.
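For concreteness, the overlap metrics and the training loss can be sketched in NumPy as follows. These are common textbook forms and are assumptions rather than the exact definitions used in the experiments; in particular, several variants of the soft Dice loss exist, and the smoothing constant `tiny` is illustrative.

```python
import numpy as np

def dice_score(pred, target, tiny=1e-8):
    """Dice coefficient between two binary masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + tiny) / (pred.sum() + target.sum() + tiny)

def iou_score(pred, target, tiny=1e-8):
    """Intersection over Union between two binary masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + tiny) / (union + tiny)

def soft_dice_loss(prob, target, tiny=1e-8):
    """Soft Dice loss on per-pixel probabilities in [0, 1]; differentiable in prob."""
    prob, target = np.asarray(prob, dtype=float), np.asarray(target, dtype=float)
    inter = np.sum(prob * target)
    return 1.0 - (2.0 * inter + tiny) / (np.sum(prob) + np.sum(target) + tiny)

# Tiny example: the prediction covers the target plus one extra pixel.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
dice = dice_score(pred, target)
iou = iou_score(pred, target)
loss_perfect = soft_dice_loss(target, target)
```

The mIoU reported in the tables would then be the mean of such per-image IoU values over the test set.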

## 7 Results and Discussion

### 7.1 Local and Fixed Weighted Parameters

To validate the contribution of the local weighted parameters in the level-set ACM, we also trained our DCAC on both the Vaihingen and Bing Huts datasets by only allowing one trainable scalar parameter for each of $\lambda_1$ and $\lambda_2$, which is constant over the entire image. As presented in Table 1, in both the Vaihingen and Bing Huts datasets, this constant-$\lambda$ formulation still outperforms the baseline CNN in all evaluation metrics for both single-instance and multi-instance buildings, thus showing the effectiveness of the end-to-end training of the DCAC. However, the DCAC with the full $\lambda_1(x,y)$ and $\lambda_2(x,y)$ maps outperforms this constant formulation by a wide margin in all experiments and metrics.

A key metric of interest in this comparison is the BoundF score, which demonstrates how our local formulation captures the details of the boundaries more effectively by adjusting the inward and outward forces on the contour locally. As illustrated in Figure 4, DCAC has perfectly delineated the boundaries of the building instances. However, DCAC with constant formulation has over-segmented these instances.

### 7.2 Buildings on the Edges of the Image

Our DCAC is capable of properly segmenting the instances of buildings located on the edges of some of the images present in the Vaihingen dataset. This is mainly due to the proper padding scheme that we have utilized in our global average pooling layer used to extract the local intensities of pixels while avoiding the loss of information on the boundaries.

### 7.3 Initialization and Number of ACM Iterations

In all cases, we performed our experiments with the goal of leveraging the CNN to fully automate the ACM and eliminate the need for any human supervision. Our scheme for learning a generalized distance transform directly helped us to localize all the building instances simultaneously and initialize the zero level-sets appropriately while avoiding a computationally expensive and non-differentiable distance transform operation. In addition, initializing the zero level-sets in this manner, instead of the common practice of initializing from a circle, helped the contour to converge significantly faster and avoid undesirable local minima.

### 7.4 Comparison Against the DSAC Model

Although most of the images in the Vaihingen dataset consist of multiple instances of buildings, the DSAC model [16] can deal with only a single building at a time. For a fair comparison between the two approaches, we report separate metrics for a single building, as reported in [16] for the DSAC, as well as for all the instances of buildings (which the DSAC cannot handle). As presented in Table 1, our single-instance DCAC outperforms DSAC in mIoU on both datasets: 0.929 versus 0.840 on Vaihingen and 0.860 versus 0.650 on Bing Huts. Furthermore, the multiple-instance metrics of our DCAC outperform the single-instance DSAC results. As demonstrated in Fig. 5, in the Vaihingen dataset, DSAC struggles in coping with the topological changes of the buildings and fails to appropriately capture sharp edges, while our framework in most cases handles these challenges. In the Bing Huts dataset, the DSAC is able to localize the buildings, but in many cases it over-segments them. This may be due to DSAC’s inability to distinguish the building from the surrounding soil because of the low contrast and small size of the image. By comparison, our DCAC handles this low-contrast dataset well and produces more accurate boundaries, as seen when comparing the segmentation output of DSAC (b) and our DCAC (c) in Fig. 5.

## 8 Conclusions and Future Work

We have introduced a novel image segmentation framework, called DCAC, which is a truly end-to-end integration of ACMs and CNNs. We proposed a novel locally-penalized Eulerian energy model that allows for pixel-wise learnable parameters that can adjust the contour to precisely capture and delineate the boundaries of objects of interest in the image. We have tackled the problem of building instance segmentation on two very challenging datasets, Vaihingen and Bing Huts, as a test case, and our model outperforms the current state-of-the-art method, DSAC. Unlike DSAC, which relies on the manual initialization of its ACM contour, our model requires minimal human supervision and is initialized and guided by its CNN backbone. Moreover, DSAC can only segment a single building at a time whereas our DCAC can segment multiple buildings simultaneously. We also showed that, unlike DSAC, our DCAC is effective in handling various topological changes in the image. Given the level of success that DCAC has achieved in this application and the fact that it features a general Eulerian formulation, it is readily applicable to other segmentation tasks in various domains where purely CNN filter-based approaches can benefit from the versatility and precision of ACMs in delineating object boundaries in images.

## References

- [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- [2] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61–79, 1997.
- [3] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
- [4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
- [5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
- [6] A. Hatamizadeh, A. Hoogi, D. Sengupta, W. Lu, B. Wilcox, D. Rubin, and D. Terzopoulos. Deep active lesion segmentation. arXiv preprint arXiv:1908.06933, 2019.
- [7] A. Hatamizadeh, H. Hosseini, Z. Liu, S. D. Schwartz, and D. Terzopoulos. Deep dilated convolutional nets for the automatic segmentation of retinal vessels. arXiv preprint arXiv:1905.12120, 2019.
- [8] A. Hatamizadeh, D. Terzopoulos, and A. Myronenko. End-to-end boundary aware networks for medical image segmentation. arXiv preprint arXiv:1908.08071, 2019.
- [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
- [10] V. Iglovikov, S. Seferbekov, A. Buslaev, and A. Shvets. Ternausnetv2: Fully convolutional network for instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
- [11] A.-A.-Z. Imran, A. Hatamizadeh, S. P. Ananth, X. Ding, D. Terzopoulos, and N. Tajbakhsh. Automatic segmentation of pulmonary lobes using a progressive dense V-network. In Deep Learning in Medical Image Analysis, volume 11045 of Lecture Notes in Computer Science, pages 282–290. Springer, 2018.
- [12] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.
- [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [14] S. Lankton and A. Tannenbaum. Localizing region-based active contours. IEEE Transactions on Image Processing, 17(11):2029–2039, 2008.
- [15] T. H. N. Le, K. G. Quach, K. Luu, C. N. Duong, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing, 27(5):2393–2407, 2018.
- [16] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun. Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8885, 2018.
- [17] A. Myronenko and A. Hatamizadeh. 3d kidneys and kidney tumor semantic segmentation using boundary-aware networks. arXiv preprint arXiv:1909.06684, 2019.
- [18] S. Osher and R. P. Fedkiw. Level set methods: An overview and some recent results. Journal of Computational Physics, 169(2):463–502, 2001.
- [19] S. Osher and J. A. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi formulations. Journal of computational physics, 79(1):12–49, 1988.
- [20] S. Paisitkriangkrai, J. Sherrah, P. Janney, V.-D. Hengel, et al. Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 36–43, 2015.
- [21] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
- [22] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun. Torontocity: Seeing the world with a million eyes. arXiv preprint arXiv:1612.00423, 2016.
- [23] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
- [24] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.