End-to-End Deep Convolutional Active Contours for Image Segmentation

Ali Hatamizadeh Computer Science Department
University of California, Los Angeles, CA, USA
Debleena Sengupta Computer Science Department
University of California, Los Angeles, CA, USA
Demetri Terzopoulos Computer Science Department
University of California, Los Angeles, CA, USA

The Active Contour Model (ACM) is a standard image analysis technique whose numerous variants have attracted an enormous amount of research attention across multiple fields. However, the ACM’s differential-equation-based formulation and prototypical dependence on user initialization have been regarded, incorrectly, as being largely incompatible with the recently popular deep learning approaches to image segmentation. This paper introduces the first tight unification of these two paradigms. In particular, we devise Deep Convolutional Active Contours (DCAC), a truly end-to-end trainable image segmentation framework comprising a Convolutional Neural Network (CNN) and an ACM with learnable parameters. The ACM’s Eulerian energy functional includes per-pixel parameter maps predicted by the backbone CNN, which also initializes the ACM. Importantly, both the CNN and ACM components are fully implemented in TensorFlow, and the entire DCAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. As a challenging test case, we tackle the problem of building instance segmentation in aerial images and evaluate DCAC on two publicly available datasets, Vaihingen and Bing Huts. Our results demonstrate that, for building segmentation, DCAC establishes a new state of the art by a wide margin.

1 Introduction

Figure 1: DCAC is a framework for the end-to-end training of an automatically differentiable ACM and backbone CNN without user intervention, implemented entirely in TensorFlow. The CNN learns to properly initialize the ACM, via a generalized distance transform, as well as the per-pixel parameter maps in the ACM’s energy functional.

The ACM [12] is one of the most influential computer vision techniques. It has been successfully employed in various image analysis tasks, including object segmentation and tracking. In most ACM variants the deformable curve(s) of interest dynamically evolves through an iterative procedure that minimizes a corresponding energy functional. Since the ACM is a model-based formulation founded on geometric and physical principles, the segmentation process relies mainly on the content of the image itself, not on large annotated image datasets, extensive computational resources, and hours or days of training. However, the classic ACM relies on some degree of user interaction to specify the initial contour and tune the parameters of the energy functional, which undermines its applicability to the automated analysis of large quantities of images.

In recent years, Deep Neural Networks (DNNs) have become popular in many areas. In computer vision and medical image analysis, CNNs have been successfully exploited for different segmentation tasks [11, 8, 17]. Despite their tremendous success, the performance of CNNs is still very dependent on their training datasets. In essence, CNNs rely on a filter-based learning scheme in which the weights of the network are usually tuned by back-propagating the error with gradient descent. Since CNN architectures often include millions of trainable parameters, the training process relies on the sheer size of the dataset. In addition, CNNs usually generalize poorly to images that differ from those in the training datasets and they are vulnerable to adversarial examples [23]. For image segmentation, capturing the details of object boundaries and delineating them remains a challenging task even for the most promising of CNN architectures that have achieved state-of-the-art performance on relevant benchmark datasets [4, 9, 24]. The recently proposed Deeplabv3+ [5] has mitigated this problem to some extent by leveraging the power of dilated convolutions, but such improvements were made possible by extensive pre-training and vast computational resources; 50 GPUs were reportedly used to train this model.

In this paper, we aim to bridge the gap between CNNs and ACMs by introducing a truly end-to-end framework. Our framework leverages an automatically differentiable ACM with trainable parameters that allows for back-propagation of gradients. This ACM can be trained along with a backbone CNN from scratch and without any pre-training. Moreover, our ACM utilizes a locally-penalized energy functional that is directly predicted by its backbone CNN, in the form of 2D feature maps, and it is initialized directly by the CNN. Thus, our work removes one of the biggest obstacles to exploiting the power of ACMs: the need for user supervision or intervention.

As a challenging test case for our DCAC framework, we tackle the problem of building instance segmentation in aerial images. Our DCAC sets new state-of-the-art benchmarks on the Vaihingen and Bing Huts datasets for building instance segmentation, outperforming its closest competitor by a wide margin.

2 Related Work

Eulerian active contours:

Eulerian active contours evolve the segmentation curve by dynamically propagating an implicit function so as to minimize its associated energy functional [18]. The most notable approaches that utilize this formulation are the active contours without edges by Chan and Vese [3] and the geodesic active contours by Caselles et al. [2]. The Caselles-Kimmel-Sapiro model is mainly dependent on the location of the level-set, whereas the Chan-Vese model mainly relies on the content difference between the interior and exterior of the level-set. In addition, Lankton et al. [14] propose a reformulation of the Chan-Vese model in which the energy functional incorporates image properties in local regions around the level-set, and it was shown to more accurately segment objects with heterogeneous features.

“End-to-End” CNNs with ACMs:

Several efforts have attempted to integrate CNNs with ACMs in an end-to-end manner as opposed to utilizing the ACM merely as a post-processor of the CNN output. Le et al. [15] implemented level-set ACMs as Recurrent Neural Networks (RNNs) for the task of semantic segmentation of natural images. There are three key differences between our proposed DCAC and this effort: (1) DCAC does not reformulate ACMs as RNNs and as a result is more computationally efficient. (2) DCAC benefits from a novel locally-penalized energy functional, whereas [15] has constant weighted parameters. (3) DCAC has an entirely different pipeline: we employ a single CNN that is trained from scratch along with the ACM, whereas [15] requires two pre-trained CNN backbones (one for object localization, the other for classification). The dependence of [15] on pre-trained CNNs has limited its applicability. The other attempt, the DSAC model by Marcos et al. [16], is an integration of ACMs with CNNs in a structured prediction framework for building instance segmentation in aerial images. There are three key differences between DCAC and this work: (1) [16] heavily depends on the manual initialization of contours, whereas our DCAC is fully automated and runs without any external supervision. (2) The ACM used in [16] has a parametric formulation that can handle only a single building at a time, whereas our DCAC leverages the Eulerian ACM, which can naturally handle multiple building instances simultaneously. (3) [16] requires the user to explicitly calculate the gradients, whereas our approach fully automates the direct back-propagation of gradients through the entire DCAC framework due to its automatically differentiable ACM.

Building instance segmentation:

Modern CNN-based methods have been used with different approaches to the problem of building segmentation. Some efforts have treated this problem as a semantic segmentation problem [20, 22] and utilized post-processing steps to extract the building boundaries. Other efforts have utilized instance segmentation networks [10] to directly predict the location of buildings.

Figure 2: Examples of the learned distance transform and the $\lambda_1$ and $\lambda_2$ maps for a given input image: (a) input image, (b) learned distance transform.

3 Level Set Active Contours

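First proposed by Osher and Sethian [19] to evolve wavefronts in CFD simulations, a level-set is an implicit representation of a hypersurface that is dynamically evolved according to the nonlinear Hamilton-Jacobi equation. In 2D, let $C(t) = \{(x, y) \mid \phi(x, y, t) = 0\}$ be a closed time-varying contour represented in $\Omega \subset \mathbb{R}^2$ by the zero level set of the signed distance map $\phi(x, y, t)$. Function $\phi(x, y, t)$ evolves according to

$$\frac{\partial \phi}{\partial t} = |\nabla \phi| \, \mathrm{div}\left(\frac{\nabla \phi}{|\nabla \phi|}\right), \qquad \phi(x, y, 0) = \phi_0(x, y), \tag{1}$$

where $\phi(x, y, 0)$ represents the initial level set.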

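We introduce a generalization of the level-set ACM proposed by Chan and Vese [3]. Their model assumes that the image of interest $I(x, y)$ consists of two areas of distinct intensities. The interior of $C$ is represented by the smoothed Heaviside function

$$H(\phi) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{\phi}{\epsilon}\right) \tag{2}$$

and $1 - H(\phi)$ represents its exterior. The derivative of (2) is the smoothed Dirac delta function

$$\delta(\phi) = \frac{\partial H(\phi)}{\partial \phi} = \frac{1}{\pi} \frac{\epsilon}{\epsilon^2 + \phi^2}. \tag{3}$$
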
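The energy functional associated with $C$ is written as

$$E(\phi) = \int_\Omega \left[ \mu \, \delta(\phi) |\nabla \phi| + \nu H(\phi) \right] dx \, dy + \int_\Omega \lambda_1(x, y) \left( I(x, y) - m_1 \right)^2 H(\phi) \, dx \, dy + \int_\Omega \lambda_2(x, y) \left( I(x, y) - m_2 \right)^2 \left( 1 - H(\phi) \right) dx \, dy, \tag{4}$$

where $\mu$ penalizes the length of $C$ and $\nu$ penalizes its enclosed area (both are set to fixed constants), and where $m_1$ and $m_2$ are the mean image intensities inside and outside $C$. We follow Lankton et al. [14] and define $m_1$ and $m_2$ as the mean image intensities inside and outside $C$ within a local window around $C$.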
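The smoothed Heaviside and Dirac delta functions used above can be sketched in a few lines. This is a minimal illustration, assuming the arctangent regularization of Chan and Vese [3]; the smoothing width `eps=1.0` and the function names are assumptions made for this example, not values taken from the paper.

```python
import math

def smoothed_heaviside(phi, eps=1.0):
    """Soft interior indicator: near 1 for phi >> eps, near 0 for phi << -eps.

    eps (assumed value) controls how sharp the transition is.
    """
    return 0.5 + math.atan(phi / eps) / math.pi

def smoothed_dirac(phi, eps=1.0):
    """Derivative of smoothed_heaviside with respect to phi."""
    return (eps / math.pi) / (eps ** 2 + phi ** 2)

# Finite-difference check that the delta is the derivative of the Heaviside.
phi, h = 0.7, 1e-6
numeric = (smoothed_heaviside(phi + h) - smoothed_heaviside(phi - h)) / (2 * h)
```

Because the arctangent form never reaches exactly 0 or 1, the delta function is nonzero everywhere, so every pixel receives some gradient signal and not only those on the contour.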

Note that to afford greater control over $C$, we have generalized the constants $\lambda_1$ and $\lambda_2$ used in [3] to parameter functions $\lambda_1(x, y)$ and $\lambda_2(x, y)$ in (4). The contour expands or shrinks at a certain location $(x, y)$ if $\lambda_2(x, y) > \lambda_1(x, y)$ or $\lambda_2(x, y) < \lambda_1(x, y)$, respectively [6]. In DCAC, these parameter functions are trainable and learned directly by the backbone CNN. Fig. 2 illustrates an example of these maps as learned by the CNN.

Given an initial distance map $\phi(x, y, 0)$ and parameter maps $\lambda_1(x, y)$ and $\lambda_2(x, y)$, the ACM is evolved by numerically time-integrating, within a narrow band around $C$ for computational efficiency, the finite difference discretized Euler-Lagrange PDE for $\phi(x, y, t)$; refer to [3] and [14] for the details.
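To make the time integration concrete, the following NumPy sketch performs explicit Euler steps of a Chan-Vese-style level-set update with per-pixel weight maps. It is an illustration under stated assumptions, not the authors' TensorFlow implementation: it uses global interior and exterior means instead of local windows, updates the whole grid instead of a narrow band, and the values of `mu`, `nu`, `dt`, and `eps` are hypothetical.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    return 0.5 + np.arctan(phi / eps) / np.pi

def dirac(phi, eps=1.0):
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def curvature(phi, tiny=1e-8):
    """div(grad(phi) / |grad(phi)|) by central differences."""
    gy, gx = np.gradient(phi)  # axis 0 is y (rows), axis 1 is x (columns)
    norm = np.sqrt(gx ** 2 + gy ** 2) + tiny
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def acm_step(phi, image, lam1, lam2, mu=0.2, nu=0.0, dt=0.5):
    """One explicit Euler step; phi > 0 marks the interior of the contour."""
    h = heaviside(phi)
    m1 = (image * h).sum() / (h.sum() + 1e-8)              # mean inside
    m2 = (image * (1 - h)).sum() / ((1 - h).sum() + 1e-8)  # mean outside
    force = (mu * curvature(phi) - nu
             - lam1 * (image - m1) ** 2 + lam2 * (image - m2) ** 2)
    return phi + dt * dirac(phi) * force

# Toy example: a bright square, initialized from a small circle inside it.
yy, xx = np.mgrid[:32, :32]
image = np.zeros((32, 32))
image[10:22, 10:22] = 1.0
phi0 = 4.0 - np.sqrt((xx - 16.0) ** 2 + (yy - 16.0) ** 2)
lam = np.ones_like(image)  # constant maps here; DCAC predicts them per pixel

phi1 = acm_step(phi0, image, lam, lam)
phi = phi0
for _ in range(300):
    phi = acm_step(phi, image, lam, lam)
mask = phi > 0
```

Raising `lam2` relative to `lam1` at a pixel strengthens the outward force there, which is the local control that the learned parameter maps provide.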

4 CNN Backbone

Figure 3: Architecture of the CNN backbone.

As our CNN backbone, we follow [7] and utilize a fully convolutional encoder-decoder architecture with dilated residual blocks (Fig. 3). Each convolutional layer is followed by a Rectified Linear Unit (ReLU) as the activation layer and a batch normalization. The dilated residual block consists of 2 consecutive dilated convolutional layers whose outputs are fused with its input and fed into the ReLU activation layer. In the encoder, each path consists of 2 consecutive convolutional layers, followed by a dilated residual unit with a dilation rate of 2. Before being fed into the dilated residual unit, the output of these convolutional layers is added to the output feature maps of another 2 consecutive convolutional layers that learn additional multi-scale information from the input image resized to that resolution. To recover the content lost in the learned feature maps during the encoding process, we utilize a series of consecutive dilated residual blocks with dilation rates of 1, 2, and 4 and feed the output to a dilated spatial pyramid pooling layer with 4 different dilation rates of 1, 6, 12, and 18. The decoder is connected to the dilated residual units at each resolution via skip connections, and in each path we up-sample the image and employ 2 consecutive convolutional layers before proceeding to the next resolution. The output of the decoder is fed into another series of 2 consecutive convolutional layers and then passed into 3 separate convolutional layers for predicting the output maps of $\lambda_1$ and $\lambda_2$ as well as the distance transform.
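As a rough worked example of why dilation is useful here, the receptive field of a stack of stride-1 dilated convolutions can be computed directly. The sketch below assumes 3x3 kernels, which the text does not specify, and the helper name `receptive_field` is hypothetical.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (pixels per axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k - 1) * d
    return rf

# Three dilated residual blocks with rates 1, 2, 4 and two convolutions each.
bottleneck_rf = receptive_field([1, 1, 2, 2, 4, 4])

# Widest branch of the pyramid pooling layer (rate 18) taken on its own.
widest_branch_rf = receptive_field([18])
```

Under these assumptions the three bottleneck blocks see a 29-pixel window and a single rate-18 convolution sees 37 pixels, without any additional down-sampling.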

5 DCAC Architecture and Implementation

In our DCAC framework (Fig. 1), the CNN backbone serves to directly initialize the zero level-set contour as well as the weighted local parameters. We initialize the zero level-set by a learned distance transform that is directly predicted by the CNN along with additional convolutional layers that learn the parameter maps. Figure 2 illustrates an example of what the backbone CNN learns in the DCAC on one input image from the Vaihingen dataset. These learned parameters are then passed to the ACM, which unfolds for a certain number of timesteps in a differentiable manner. The final zero level-set is then converted to logits and compared with the label, and the resulting error is back-propagated through the entire framework in order to tune the weights of the CNN backbone. Algorithm 1 presents the details of the DCAC training algorithm.

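Data: $X$, $Y_{gt}$: Paired image and label; $f$: CNN with parameters $\omega$; $g$: ACM with parameters $\lambda_1$, $\lambda_2$; $\mathcal{L}$: Loss function; $N$: Number of ACM iterations; $\eta$: learning rate
Result: $Y_{out}$: Final segmentation
while not converged do
       $\lambda_1, \lambda_2, \phi_0 = f(X)$
       for $t = 1$ to $N$ do
              $\phi_t = g(\phi_{t-1}; \lambda_1, \lambda_2, X)$
       end for
       Convert $\phi_N$ to logits $Y_{out}$
       Compute $\mathcal{L}(Y_{out}, Y_{gt})$ and back-propagate the error
       Update the weights of $f$: $\omega \leftarrow \omega - \eta \, \partial \mathcal{L} / \partial \omega$
end while
Algorithm 1 DCAC Training Algorithm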
Figure 4: (a) Labeled image, (b) DCAC output with constant weighted parameters, (c) DCAC output, (d), (e) learned parameter maps $\lambda_1(x, y)$ and $\lambda_2(x, y)$.

5.1 Implementation Details

All components of DCAC, including the ACM, have been implemented entirely in TensorFlow [1] and are compatible with both TensorFlow 1.x and 2.0. The ACM implementation benefits from the automatic differentiation utility of TensorFlow and has been designed to enable the back-propagation of the error gradient through the layers of the ACM.

In each ACM layer, each point along the zero level-set contour is probed by a local window, and the mean intensities of the inside and outside regions, i.e., $m_1$ and $m_2$ in (4), are extracted. In our implementation, $m_1$ and $m_2$ are extracted by using a differentiable global average pooling layer with appropriate padding so as not to lose any information on the edges.
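The local-window means can be sketched with a padded box filter. This NumPy version is only an illustration of the idea: the window size `k`, the use of edge-replicating padding, and the function names are assumptions, and the paper's own implementation uses a TensorFlow average pooling layer.

```python
import numpy as np

def box_mean(x, k=5):
    """Mean of x over a k-by-k window around every pixel (k odd).

    Edge-replicating padding keeps the output the same size as the input,
    so pixels on the image border still receive a full window.
    """
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros(x.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def local_means(image, inside, k=5, tiny=1e-8):
    """Local mean intensity inside (m1) and outside (m2) the contour.

    `inside` is a soft mask in [0, 1], e.g. a smoothed Heaviside of phi.
    """
    m1 = box_mean(image * inside, k) / (box_mean(inside, k) + tiny)
    m2 = box_mean(image * (1 - inside), k) / (box_mean(1 - inside, k) + tiny)
    return m1, m2

# Toy check: intensity 3 inside a square region, intensity 1 outside it.
inside = np.zeros((16, 16))
inside[4:12, 4:12] = 1.0
image = np.where(inside > 0, 3.0, 1.0)
m1, m2 = local_means(image, inside)
```

Every operation here is a sum, product, or division, so the same computation remains differentiable when written with tensor operations.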

All training was performed on an Nvidia Titan XP GPU and an Intel® Core™ i7-7700K CPU @ 4.20GHz. The minibatch sizes for training on the Vaihingen and Bing Huts datasets were 3 and 20, respectively. All training sessions employ the Adam optimization algorithm [13] with a learning rate of 0.001 that decays by a factor of 10 every 10 epochs.
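The learning-rate schedule described above amounts to a step decay. A minimal sketch follows, assuming the decay is applied as a staircase and that epochs are counted from zero; neither detail is stated in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay_factor=10.0, decay_every=10):
    """Step schedule: divide base_lr by decay_factor once per decay_every epochs."""
    return base_lr / (decay_factor ** (epoch // decay_every))

# Learning rate for the first 30 epochs (zero-based epoch index, assumed).
schedule = [learning_rate(e) for e in range(30)]
```

With these defaults the rate stays at 0.001 for epochs 0 to 9, drops to 0.0001 for epochs 10 to 19, and so on.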

Dataset:                  | Vaihingen                     | Bing Huts
Model                     | Dice   mIoU   WCov   BoundF   | Dice   mIoU   WCov   BoundF
DSAC                      | -      0.840  -      -        | -      0.650  -      -
UNet                      | 0.810  0.797  0.843  0.622    | 0.710  0.740  0.852  0.421
ResNet                    | 0.801  0.791  0.841  0.770    | 0.81   0.797  0.864  0.434
Backbone CNN              | 0.837  0.825  0.865  0.680    | 0.737  0.764  0.809  0.431
DCAC: Single Inst         | 0.928  0.929  0.943  0.819    | 0.855  0.860  0.894  0.534
DCAC: Multi Inst          | 0.908  0.893  0.910  0.797    | 0.797  0.809  0.839  0.491
DCAC: Single Inst, Const  | 0.877  0.888  0.936  0.801    | 0.792  0.813  0.889  0.513
DCAC: Multi Inst, Const   | 0.857  0.842  0.876  0.707    | 0.757  0.777  0.891  0.486
Table 1: Model Evaluations.

6 Experiments

6.1 Datasets


Vaihingen:

The Vaihingen buildings dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html) consists of 168 building images. The labels for each image are generated by using a semi-automated approach. We used 100 images for training and 68 for testing, following the same data split as in [16]. In this dataset, almost all images contain multiple instances of buildings, some of which are located at the edges of the image.

Bing Huts:

The Bing Huts dataset (https://www.openstreetmap.org/#map=4/38.00/-95.80) consists of 605 images. We followed the same data split that is used in [16] and used 335 images for training and 270 images for testing. This dataset is especially challenging due to the low spatial resolution and contrast of the images.

6.2 Evaluation Metrics and Loss Function

To evaluate our model’s performance, we utilized five different metrics—Dice, mean Intersection over Union (mIoU), Weighted Coverage (WCov), Boundary F (BoundF), and Root Mean Square Error (RMSE). The original DSAC paper only reported on mIoU for both Vaihingen and Bing Huts and only RMSE for the Bing Huts dataset. However, since the delineation of boundaries is one of the important goals of our framework, we employ the BoundF metric [21] to precisely measure the similarity between the specific boundary pixels in our predictions and the corresponding image labels. Furthermore, we used a soft Dice loss function in training our model.
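The text does not give the exact form of the soft Dice loss, so the following NumPy sketch shows one common formulation. The smoothing constant `smooth` and the function name are assumptions made for this example.

```python
import numpy as np

def soft_dice_loss(prob, target, smooth=1.0):
    """1 - soft Dice between predicted probabilities and a binary label.

    prob   : predicted foreground probabilities in [0, 1]
    target : ground-truth mask with values 0 or 1
    smooth : assumed constant that avoids division by zero on empty masks
    """
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = (prob * target).sum()
    dice = (2.0 * intersection + smooth) / (prob.sum() + target.sum() + smooth)
    return 1.0 - dice

target = np.array([1, 1, 0, 0])
perfect = soft_dice_loss([1, 1, 0, 0], target)
inverted = soft_dice_loss([0, 0, 1, 1], target)
uncertain = soft_dice_loss([0.5, 0.5, 0.5, 0.5], target)
```

Because the loss is built from sums and products of probabilities instead of thresholded masks, it stays differentiable, which is what allows the error to be propagated back through the ACM to the CNN.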

7 Results and Discussion

Figure 5: Comparative visualization of the labeled image, the output of DSAC, and the output of our DCAC, for the Vaihingen (top) and Bing Huts (bottom) datasets: (a) Image with label (green), (b) DSAC output, (c) our DCAC output, (d) DCAC learned distance transform, (e) $\lambda_1$ and (f) $\lambda_2$ for the DCAC.

7.1 Local and Fixed Weighted Parameters

To validate the contribution of the local weighted parameters in the level-set ACM, we also trained our DCAC on both the Vaihingen and Bing Huts datasets while allowing only one trainable scalar parameter for each of $\lambda_1$ and $\lambda_2$, which is constant over the entire image. As presented in Table 1, in both the Vaihingen and Bing Huts datasets, this constant-$\lambda$ formulation still outperforms the baseline CNN in all evaluation metrics for both single-instance and multi-instance buildings, thus showing the effectiveness of the end-to-end training of the DCAC. However, the DCAC with the full $\lambda_1(x, y)$ and $\lambda_2(x, y)$ maps outperforms this constant formulation by a wide margin in all experiments and metrics.

A key metric of interest in this comparison is the BoundF score, which demonstrates how our local formulation captures the details of the boundaries more effectively by adjusting the inward and outward forces on the contour locally. As illustrated in Figure 4, DCAC has perfectly delineated the boundaries of the building instances. However, DCAC with constant formulation has over-segmented these instances.

7.2 Buildings on the Edges of the Image

Our DCAC is capable of properly segmenting the instances of buildings located on the edges of some of the images present in the Vaihingen dataset. This is mainly due to the proper padding scheme that we have utilized in our global average pooling layer used to extract the local intensities of pixels while avoiding the loss of information on the boundaries.

7.3 Initialization and Number of ACM Iterations

In all cases, we performed our experiments with the goal of leveraging the CNN to fully automate the ACM and eliminate the need for any human supervision. Our scheme for learning a generalized distance transform directly helped us to localize all the building instances simultaneously and initialize the zero level-sets appropriately while avoiding a computationally expensive and non-differentiable distance transform operation. In addition, initializing the zero level-sets in this manner, instead of the common practice of initializing from a circle, helped the contour to converge significantly faster and avoid undesirable local minima.

7.4 Comparison Against the DSAC Model

Although most of the images in the Vaihingen dataset contain multiple instances of buildings, the DSAC model [16] can deal with only a single building at a time. For a fair comparison between the two approaches, we report separate metrics for a single building, as reported in [16] for the DSAC, as well as for all the instances of buildings (which the DSAC cannot handle). As presented in Table 1, our DCAC outperforms DSAC in mIoU on both the Vaihingen and Bing Huts datasets. Furthermore, the multiple-instance metrics of our DCAC outperform the single-instance DSAC results. As demonstrated in Fig. 5, in the Vaihingen dataset, DSAC struggles to cope with the topological changes of the buildings and fails to appropriately capture sharp edges, while our framework in most cases handles these challenges. In the Bing Huts dataset, the DSAC is able to localize the buildings, but it over-segments them in many cases. This may be due to DSAC’s inability to distinguish the building from the surrounding soil because of the low contrast and small size of the image. By comparison, our DCAC handles this low-contrast dataset well and produces more accurate boundaries, as seen by comparing the segmentation output of DSAC (b) and our DCAC (c) in Fig. 5.

8 Conclusions and Future Work

We have introduced a novel image segmentation framework, called DCAC, which is a truly end-to-end integration of ACMs and CNNs. We proposed a novel locally-penalized Eulerian energy model that allows for pixel-wise learnable parameters that can adjust the contour to precisely capture and delineate the boundaries of objects of interest in the image. As a test case, we have tackled the problem of building instance segmentation on two very challenging datasets, Vaihingen and Bing Huts, where our model outperforms the current state-of-the-art method, DSAC. Unlike DSAC, which relies on the manual initialization of its ACM contour, our model requires minimal human supervision and is initialized and guided by its CNN backbone. Moreover, DSAC can only segment a single building at a time, whereas our DCAC can segment multiple buildings simultaneously. We also showed that, unlike DSAC, our DCAC is effective in handling various topological changes in the image. Given the level of success that DCAC has achieved in this application and the fact that it features a general Eulerian formulation, it is readily applicable to other segmentation tasks in various domains where purely CNN filter-based approaches can benefit from the versatility and precision of ACMs in delineating object boundaries in images.

