Semantic Segmentation with Boundary Neural Fields
The state-of-the-art in semantic segmentation is currently represented by fully convolutional networks (FCNs). However, FCNs use large receptive fields and many pooling layers, both of which cause blurring and low spatial resolution in the deep layers. As a result FCNs tend to produce segmentations that are poorly localized around object boundaries. Prior work has attempted to address this issue in post-processing steps, for example using a color-based CRF on top of the FCN predictions. However, these approaches require additional parameters and low-level features that are difficult to tune and integrate into the original network architecture. Additionally, most CRFs use color-based pixel affinities, which are not well suited for semantic segmentation and lead to spatially disjoint predictions.
To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues. The boundary information is used to enhance semantic segment coherence and to improve object localization. Specifically, we first show that the convolutional filters of semantic FCNs provide good features for boundary detection. We then employ the predicted boundaries to define pairwise potentials in our energy. Finally, we show that our energy decomposes semantic segmentation into multiple binary problems, which can be relaxed for efficient global optimization. We report extensive experiments demonstrating that minimization of our global boundary-based energy yields results superior to prior globalization methods, both quantitatively and qualitatively.
The recent introduction of fully convolutional networks (FCNs) [?] has led to significant quantitative improvements on the task of semantic segmentation. However, despite their empirical success, FCNs suffer from some limitations. Large receptive fields in the convolutional layers and the presence of pooling layers lead to blurring and segmentation predictions at a significantly lower resolution than the original image. As a result, their predicted segments tend to be blobby and lack fine object boundary details. We report in Figure 6 some examples illustrating typical poor localization of objects in the outputs of FCNs.
Recently, Chen et al. [?] addressed this issue by applying a Dense-CRF post-processing step [?] on top of coarse FCN segmentations. However, such an approach introduces several problems of its own. First, the Dense-CRF adds new parameters that are difficult to tune and integrate into the original network architecture. Additionally, most methods based on CRFs or MRFs use low-level pixel affinity functions, such as those based on color. These low-level affinities often fail to capture semantic relationships between objects and lead to poor segmentation results (see last column in Figure 6).
We propose to address these shortcomings by means of a Boundary Neural Field (BNF), an architecture that employs a single semantic segmentation FCN to predict semantic boundaries and then uses them to produce semantic segmentation maps via a global optimization. We demonstrate that even though the semantic segmentation FCN has not been optimized to detect boundaries, it provides good features for boundary detection. Specifically, the contributions of our work are as follows:
We show that semantic boundaries can be expressed as a linear combination of interpolated convolutional feature maps inside an FCN. We introduce a boundary detection method that exploits this intuition to predict object boundaries with accuracy superior to the state-of-the-art.
We demonstrate that boundary-based pixel affinities are better suited for semantic segmentation than the commonly used color affinity functions.
Finally, we introduce a new global energy that decomposes semantic segmentation into multiple binary problems and relaxes the integrality constraint. We show that minimizing our proposed energy yields better qualitative and quantitative results relative to traditional globalization models such as MRFs or CRFs.
Spectral methods comprise one of the most prominent categories for boundary detection. In a typical spectral framework, one formulates a generalized eigenvalue system to solve a low-level pixel grouping problem. The resulting eigenvectors are then used to predict the boundaries. Some of the most notable approaches in this genre are MCG [?], gPb [?], PMI [?], and Normalized Cuts [?]. A weakness of spectral approaches is that they tend to be slow as they perform a global inference over the entire image.
To address this issue, recent approaches cast boundary detection as a classification problem and predict the boundaries in a local manner with high efficiency. The most notable examples in this genre include sketch tokens (ST) [?] and structured edges (SE) [?], which employ fast random forests. However, many of these methods are based on hand-constructed features, which are difficult to tune.
The issue of hand-constructed features has been recently addressed by several approaches based on deep learning, such as fields [?], DeepNet [?], DeepContour [?], DeepEdge [?], HfL [?] and HED [?]. All of these methods use CNNs in some way to predict the boundaries. Whereas DeepNet and DeepContour optimize ordinary CNNs to a boundary-based optimization criterion from scratch, DeepEdge and HfL employ pretrained models to compute boundaries. The most recent of these methods is HED [?], which shows the benefit of deeply supervised learning for boundary detection.
In comparison to prior deep learning approaches, our method offers several contributions. First, we exploit the inherent relationship between boundary detection and semantic segmentation to predict semantic boundaries. Specifically, we show that even though the semantic FCN has not been explicitly trained to predict boundaries, the convolutional filters inside the FCN provide good features for boundary detection. Additionally, unlike DeepEdge [?] and HfL [?], our method does not require a pre-processing step to select candidate contour points, as we predict boundaries on all pixels in the image. We demonstrate that our approach allows us to achieve state-of-the-art boundary detection results according to both F-score and Average Precision metrics. Additionally, due to the semantic nature of our boundaries, we can successfully use them as pairwise potentials for semantic segmentation in order to improve object localization and recover fine structural details, typically lost by pure FCN-based approaches.
We can group most semantic segmentation methods into three broad categories. The first category can be described as “two-stage” approaches, where an image is first segmented and then each segment is classified as belonging to a certain object class. Some of the most notable methods that belong to this genre include [?].
The primary weakness of the above methods is that they are unable to recover from errors made by the segmentation algorithm. Several recent papers [?] address this issue by proposing to use deep per-pixel CNN features and then classify each pixel as belonging to a certain class. While these approaches partially address the incorrect segmentation problem, they perform predictions independently on each pixel. This leads to extremely local predictions, where the relationships between pixels are not exploited in any way, and thus the resulting segmentations may be spatially disjoint.
The third and final group of semantic segmentation methods can be viewed as front-to-end schemes where segmentation maps are predicted directly from raw pixels without any intermediate steps. One of the earliest examples of such methods is the FCN introduced in [?]. This approach gave rise to a number of subsequent related approaches which have improved various aspects of the original semantic segmentation [?]. There have also been attempts at integrating the CRF mechanism into the network architecture [?]. Finally, it has been shown that semantic segmentation can also be improved using additional training data in the form of bounding boxes [?].
Our BNF offers several contributions over prior work. To the best of our knowledge, we are the first to present a model that exploits the relationship between boundary detection and semantic segmentation within an FCN framework. We introduce pairwise pixel affinities computed from semantic boundaries inside an FCN, and use these boundaries to predict the segmentations in a global fashion. Unlike [?], which requires a large number of additional parameters to learn for the pairwise potentials, our global model only needs extra parameters, which is about orders of magnitudes less than the number of parameters in a typical deep convolutional network (e.g. VGG [?]). We empirically show that our proposed boundary-based affinities are better suited for semantic segmentation than color-based affinities. Additionally, unlike in [?], the solution to our proposed global energy can be obtained in closed-form, which makes global inference easier. Finally, we demonstrate that our method produces better results than traditional globalization models such as CRFs or MRFs.
3 Boundary Neural Fields
In this section, we describe Boundary Neural Fields. Similarly to traditional globalization methods, Boundary Neural Fields are defined by an energy including unary and pairwise potentials. Minimization of the global energy yields the semantic segmentation. BNFs build both unary and pairwise potentials from the input RGB image and then combine them in a global manner. More precisely, the coarse segmentations predicted by a semantic FCN are used to define the unary potentials of our BNF. Next, we show that the convolutional feature maps of the FCN can be used to accurately predict semantic boundaries. These boundaries are then used to build pairwise pixel affinities, which are used as pairwise potentials by the BNF. Finally, we introduce a global energy function, which minimizes the energy corresponding to the unary and pairwise terms and improves the initial FCN segmentation. The detailed illustration of our architecture is presented in Figure 7. We now explain each of these steps in more detail.
3.1 FCN Unary Potentials
To predict semantic unary potentials we employ the DeepLab model [?], which is a fully convolutional adaptation of the VGG network [?]. The FCN consists of convolutional layers and fully convolutional layers. There are more recent FCN-based methods that have demonstrated even better semantic segmentation results [?]. Although these more advanced architectures could be integrated into our framework to improve our unary potentials, in this work we focus on two aspects orthogonal to this prior work: 1) demonstrating that our boundary-based affinity function is better suited for semantic segmentation than the common color-based affinities and 2) showing that our proposed global energy achieves better qualitative and quantitative semantic segmentation results in comparison to prior globalization models.
3.2 Boundary Pairwise Potentials
In this section, we describe our approach for building pairwise pixel affinities using semantic boundaries. The basic idea behind our boundary detection approach is to express semantic boundaries as a function of convolutional feature maps inside the FCN. Due to the close relationship between the tasks of semantic segmentation and boundary detection, we hypothesize that convolutional feature maps from the semantic segmentation FCN can be employed as features for boundary detection.
Learning to Predict Semantic Boundaries.
We propose to express semantic boundaries as a linear combination of interpolated FCN feature maps with a non-linear function applied on top of this sum. We note that interpolation of feature maps has been successfully used in prior work (see e.g. [?]) in order to obtain dense pixel-level features from the low-resolution outputs of deep convolutional layers. Here we adopt interpolation to produce pixel-level boundary predictions. There are several advantages to our proposed formulation. First, because we express boundaries as a linear combination of feature maps, we only need to learn a small number of parameters, corresponding to the individual weight values of each feature map in the FCN. This amounts to learning parameters, which is much smaller than the number of parameters in the entire network (). In comparison, DeepEdge [?] and HfL [?] need 17M and 6M additional parameters to predict boundaries.
Furthermore, expressing semantic boundaries as a linear combination of FCN feature maps allows us to efficiently predict boundary probabilities for all pixels in the image (we resize the FCN feature maps to the original image dimensions). This eliminates the need to select candidate boundary points in a pre-processing stage, which was instead required in prior boundary detection work [?].
Our boundary prediction pipeline can be described as follows. First, we use SBD segmentations [?] to optimize our FCN for the semantic segmentation task. We then treat the FCN convolutional maps as features for the boundary detection task and use the boundary annotations from the BSDS500 dataset [?] to learn the weights for each feature map. The BSDS500 dataset contains training, validation, and testing images, as well as ground truth annotations by human labelers for each of these images.
To learn the weights corresponding to each convolutional feature map we first sample points from the dataset. We define the target labels for each point as the fraction of human annotators agreeing on that point being a boundary. To fix the issue of label imbalance (there are many more non-boundaries than boundaries), we divide the label space into four quartiles, and select an equal number of samples for each quartile to balance the training dataset. Given these sampled points, we then define our features as the values in the interpolated convolutional feature maps corresponding to these points. To predict semantic boundaries we weigh each convolutional feature map by its weight, sum them up and apply a sigmoid function on top of it. We obtain the weights corresponding to each convolutional feature map by minimizing the cross-entropy loss using a stochastic batch gradient descent for epochs. To obtain crisper boundaries at test-time we post-process the boundary probabilities using non-maximum suppression.
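The training procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the bias term `b`, the learning rate, batch size, epoch count, and the reading of "four quartiles" as four equal-width bins of the label range [0, 1] are all hypothetical choices made here. Features are assumed to be already interpolated to image resolution and sampled at the chosen points.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def balanced_indices(labels, per_bin, rng):
    """Pick up to `per_bin` samples from each of four equal-width label bins.

    Assumption: the "quartiles" of the label space are read as the bins
    [0, .25), [.25, .5), [.5, .75), [.75, 1].
    """
    bins = np.minimum((labels * 4).astype(int), 3)
    chosen = []
    for b in range(4):
        idx = np.flatnonzero(bins == b)
        if idx.size:
            chosen.append(rng.choice(idx, size=min(per_bin, idx.size), replace=False))
    return np.concatenate(chosen)

def train_boundary_weights(features, labels, epochs=50, batch_size=64, lr=0.1, seed=0):
    """Fit one weight per feature map by mini-batch SGD on the cross-entropy loss.

    features: (n_points, n_maps) feature values at the sampled points.
    labels:   (n_points,) fraction of annotators marking each point as boundary.
    """
    rng = np.random.default_rng(seed)
    n, k = features.shape
    w, b = np.zeros(k), 0.0  # the bias b is an addition assumed for this sketch
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            p = sigmoid(features[batch] @ w + b)
            err = p - labels[batch]  # gradient of cross-entropy w.r.t. the logit
            w -= lr * features[batch].T @ err / batch.size
            b -= lr * err.mean()
    return w, b

def predict_boundaries(feature_maps, w, b):
    """Weighted sum of (n_maps, H, W) feature maps followed by a sigmoid."""
    return sigmoid(np.tensordot(w, feature_maps, axes=1) + b)
```

Because the model is a single linear layer over existing feature maps, prediction for a whole image is one weighted sum, which is why no candidate-point selection is needed. Non-maximum suppression is omitted from the sketch.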
To give some intuition on how FCN feature maps contribute to boundary detection, in Figure 13 we visualize the feature maps corresponding to the highest weight magnitudes. It is clear that many of these maps contain highly localized boundary information.
Boundary Detection Results
Before discussing how boundary information is integrated in our energy for semantic segmentation, here we present experimental results assessing the accuracy of our boundary detection scheme. We tested our boundary detector on the BSDS500 dataset [?], which is the standard benchmark for boundary detection. The quality of the predicted boundaries is evaluated using three standard measures: fixed contour threshold (ODS), per-image best threshold (OIS), and average precision (AP).
In Table 1 we show that our algorithm outperforms all prior methods according to both F-score measures and the Average Precision metric. In Figure 22, we also visualize our predicted boundaries. The second column shows the pixel-level softmax output computed from the linear combination of feature maps, while the third column depicts our final boundaries after applying a non-maximum suppression post-processing step.
We note that our predicted boundaries achieve high-confidence predictions around objects. This is important as we employ these boundaries to improve semantic segmentation results, as discussed in the next subsection.
Constructing Pairwise Pixel Affinities.
We can use the predicted boundaries to build pairwise pixel affinities. Intuitively, we declare two pixels as similar (i.e., likely to belong to the same segment) if there is no boundary crossing the straight path between these two pixels. Conversely, two pixels are dissimilar if there is a boundary crossing their connecting path. The larger the boundary magnitude of the crossed path, the more dissimilar the two pixels should be, since a strong boundary is likely to mark the separation of two distinct segments. Similarly to [?], we encode this intuition with the following formulation:
where denotes the maximum boundary value that crosses the straight line path between pixels and , depicts the smoothing parameter and denotes the semantic boundary-based affinity between pixels and .
Similarly, we want to exploit high-level object information in the network to define another type of pixel similarity. Specifically, we use object class probabilities from the softmax (SM) layer to achieve this goal. Intuitively, if pixels and have different hard segmentation labels from the softmax layer, we set their similarity ( ) to . Otherwise, we compute their similarity using the following equation:
where denotes the difference in softmax output values corresponding to the most likely object class for pixels and , and is a smoothing parameter. Then we can write the final affinity measure as:
We exponentiate the term corresponding to the object-level affinity because our boundary-based affinity may be too aggressive in declaring two pixels as dissimilar. To address this issue, we increase the importance of the object-level affinity in using the exponential function. However, in the experimental results section, we demonstrate that most of the benefit from modeling pairwise potentials comes from rather than .
We then use this pairwise pixel affinity measure to build a global affinity matrix that encodes relationships between pixels in the entire image. For a given pixel, we sample of points in the neighborhood of radius around that pixel, and store the resulting affinities into .
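The affinity construction described in this subsection can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the exponential form `exp(-M / sigma)`, the product used to combine the two affinities, the default smoothing values, and all function and parameter names are hypothetical choices consistent with the description (maximum boundary value on the straight path, a smoothing parameter, zero softmax affinity for differing hard labels, and an exponentiated object-level term).

```python
import numpy as np

def max_boundary_on_path(boundary, p, q):
    """Largest boundary value met on the straight segment from pixel p to pixel q."""
    (r0, c0), (r1, c1) = p, q
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1  # one sample per pixel step
    rows = np.rint(np.linspace(r0, r1, n)).astype(int)
    cols = np.rint(np.linspace(c0, c1, n)).astype(int)
    return boundary[rows, cols].max()

def boundary_affinity(boundary, p, q, sigma_sb=0.1):
    """Close to 1 when no boundary is crossed, decaying with boundary strength."""
    return np.exp(-max_boundary_on_path(boundary, p, q) / sigma_sb)

def softmax_affinity(probs, p, q, sigma_sm=0.5):
    """probs: (C, H, W) class probabilities from the softmax layer."""
    lp = probs[:, p[0], p[1]].argmax()
    lq = probs[:, q[0], q[1]].argmax()
    if lp != lq:
        return 0.0  # different hard labels: no object-level similarity
    diff = abs(probs[lp, p[0], p[1]] - probs[lq, q[0], q[1]])
    return np.exp(-diff / sigma_sm)

def combined_affinity(boundary, probs, p, q):
    """Exponentiated object-level term times the boundary-based term (assumed form)."""
    return np.exp(softmax_affinity(probs, p, q)) * boundary_affinity(boundary, p, q)
```

In a full pipeline these values would be computed only for a sampled subset of neighbors around each pixel and stored in a sparse matrix; the sketch evaluates one pixel pair at a time for clarity.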
The last step in our proposed method is to combine semantic boundary information with the coarse segmentation from the FCN softmax layer to produce an improved segmentation. We do this by introducing a global energy function that utilizes the affinity matrix constructed in the previous section along with the segmentation from the FCN softmax layer. Using this energy, we perform a global inference to get segmentations that are well localized around the object boundaries and that are also spatially smooth.
Typical globalization models such as MRFs [?], CRFs [?] or Graph Cuts [?] produce a discrete label assignment for the segmentation problem by jointly modeling a multi-label distribution and solving a non-convex optimization. The common problem in doing so is that the optimization procedure may get stuck in local optima.
We introduce a new global energy function, which overcomes this issue and achieves better segmentation in comparison to prior globalization models. Similarly to prior globalization approaches, our goal is to minimize the energy corresponding to the sum of unary and pairwise potentials. However, the key difference in our approach comes from the relaxation of some of the constraints. Specifically, instead of modeling our problem as a joint multi-label distribution, we propose to decompose it into multiple binary problems, which can be solved concurrently. This decomposition can be viewed as assigning pixels to foreground and background labels for each of the different object classes. Additionally, we relax the integrality constraint. Both of these relaxations make our problem more manageable and allow us to formulate a global energy function that is differentiable, and has a closed form solution.
In [?], the authors introduce the idea of learning with global and local consistency in the context of semi-supervised problems. Inspired by this work, we incorporate some of these ideas in the context of semantic segmentation. Before defining our proposed global energy function, we introduce some relevant notation.
For the purpose of illustration, suppose that we only have two classes: foreground and background. Then we can denote an optimal continuous solution to such a segmentation problem with variable . To denote similarity between pixels and we use . Then, indicates the degree of a pixel . In graph theory, the degree of a node denotes the number of edges incident to that node. Thus, we set the degree of a pixel to for all except . Finally, with we denote an initial segmentation probability, which in our case is obtained from the FCN softmax layer.
Using this notation, we can then formulate our global inference objective as:
This energy consists of two different terms. Similar to the general globalization framework, our first term encodes the unary energy while the second term includes the pairwise energy. We now explain the intuition behind each of these terms. The unary term attempts to find a segmentation assignment () that deviates little from the initial candidate segmentation computed from the softmax layer (denoted by ). The in the unary term is weighted by the degree of the pixel in order to produce larger unary costs for pixels that have many similar pixels within the neighborhood. Instead, the pairwise term ensures that pixels that are similar should be assigned similar values. To balance the energies of the two terms we introduce a parameter and set it to throughout all our experiments.
We can also express the same global energy function in matrix notation:
where is a vector containing an optimal continuous assignment for all pixels, is a diagonal degree matrix, and is the pixel affinity matrix. Finally, denotes a vector containing the probabilities from the softmax layer corresponding to a particular object class.
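For reference, one way to write the energy that is consistent with the description above is given below. The notation is assumed for illustration: $z_i$ is the relaxed label of pixel $i$, $f_i$ its softmax probability, $w_{ij}$ the affinity between pixels $i$ and $j$, $d_i = \sum_{j \neq i} w_{ij}$ the degree, $D$ and $W$ the degree and affinity matrices, and $\mu > 0$ the balancing parameter.

```latex
\begin{align*}
E(z) &= \frac{\mu}{2} \sum_{i} d_i \,(z_i - f_i)^2
        + \frac{1}{2} \sum_{i<j} w_{ij}\,(z_i - z_j)^2 \\
     &= \frac{\mu}{2}\,(z - f)^{\top} D\,(z - f)
        + \frac{1}{2}\, z^{\top} (D - W)\, z
\end{align*}
```

The sum over unordered pairs $i<j$ is what makes the two lines agree when $W$ is symmetric with zero diagonal. The squared penalties are an assumption, chosen because they yield an energy that is differentiable and has a closed-form minimizer, as the text requires.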
An advantage of our energy is that it is differentiable. If we denote the above energy as then the derivative of this energy can be written as follows:
With simple algebraic manipulations we can then obtain a closed form solution to this optimization:
where and . In the general case where we have object classes we can write the solution as:
where now depicts a matrix containing assignments for all object classes, while denotes a matrix with object class probabilities from the softmax layer. Due to the large size of it is impractical to invert it. However, if we consider an image as a graph where each pixel denotes a vertex in the graph, we can observe that the term in our optimization is equivalent to a Laplacian matrix of such a graph. Since we know that a Laplacian matrix is positive semi-definite, we can use the preconditioned conjugate gradient method [?] to solve the system in Eq. . Alternatively, because our defined global energy in Eq. is differentiable, we can efficiently solve this optimization problem using stochastic gradient descent. We choose the former option and solve the following system:
To obtain the final discrete segmentation, for each pixel we assign the object class that corresponds to the largest column value in the row of (note that each row in represents a single pixel in the image, and each column in represents one of the object classes). In the experimental section, we show that this solution produces better quantitative and qualitative results in comparison to commonly used globalization techniques.
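The inference step can be sketched on a toy graph as follows. The sketch assumes a quadratic energy $\frac{\mu}{2}(z-f)^{\top}D(z-f) + \frac{1}{2}z^{\top}(D-W)z$; setting its gradient $\mu D(z-f) + (D-W)z$ to zero gives the linear system $(D - \alpha W)\,z = \beta D f$ with $\alpha = 1/(1+\mu)$ and $\beta = \mu/(1+\mu)$. The symbol names, the function names, the Jacobi (diagonal) preconditioner, and the dense matrices are assumptions made here for illustration; a real image would need sparse storage.

```python
import numpy as np

def conjugate_gradient(A, b, precond_diag, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG for a symmetric positive-definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = r / precond_diag
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        if np.sqrt(r @ r) < tol:
            break
        Ap = A @ p
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        z = r / precond_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def bnf_inference(W, F, mu):
    """Solve one relaxed binary problem per class, then take the per-pixel argmax.

    W:  (n, n) symmetric non-negative affinities with zero diagonal.
    F:  (n, C) softmax probabilities, one column per object class.
    mu: balancing parameter (its value is left to the caller).
    """
    d = W.sum(axis=1)  # pixel degrees, assumed strictly positive
    alpha, beta = 1.0 / (1.0 + mu), mu / (1.0 + mu)
    A = np.diag(d) - alpha * W
    columns = [conjugate_gradient(A, beta * d * F[:, c], d) for c in range(F.shape[1])]
    Z = np.column_stack(columns)
    return Z, Z.argmax(axis=1)
```

With positive degrees and $\alpha < 1$, the matrix $D - \alpha W$ is strictly diagonally dominant and therefore positive definite, so conjugate gradient applies. Each class column is an independent system, which is what allows the binary problems to be solved concurrently.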
[Table 2. Columns: Metric | Inference Method | RGB Affinity | BNF Affinity]
In this section we present quantitative and qualitative results for semantic segmentation on the SBD [?] dataset, which contains objects and their per-pixel annotations for Pascal VOC classes. We evaluate semantic segmentation results using two evaluation metrics. The first metric measures accuracy based on pixel intersection-over-union averaged per pixel (PP-IOU) across the 20 classes. According to this metric, the accuracy is computed on a per-pixel basis. As a result, the images that contain large object regions are given more importance. However, for certain applications we may need to accurately segment small objects. Therefore, similar to [?] we also consider the PI-IOU metric (pixel intersection-over-union averaged per image across the 20 classes), which gives equal weight to each of the images.
We compare Boundary Neural Fields with other commonly used global inference methods. These methods include Belief Propagation [?], Iterated Conditional Modes (ICM), Graph Cuts [?], and Dense-CRF [?]. Note that in all of our evaluations we use the same FCN unary potentials for every model.
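The difference between the two metrics can be made concrete with a small sketch. This is one plausible reading of the definitions above, not the official evaluation code: PP-IOU pools intersections and unions over all pixels of the dataset before dividing, while PI-IOU computes a per-image score first and then averages over images. The function names and the choice to skip classes that are absent from both prediction and ground truth are assumptions made here.

```python
import numpy as np

def class_counts(pred, gt, num_classes):
    """Per-class intersection and union pixel counts for one image."""
    inter = np.array([np.sum((pred == c) & (gt == c)) for c in range(num_classes)], dtype=float)
    union = np.array([np.sum((pred == c) | (gt == c)) for c in range(num_classes)], dtype=float)
    return inter, union

def mean_iou(inter, union):
    """Average IoU over the classes that occur (assumption: skip empty unions)."""
    present = union > 0
    return float(np.mean(inter[present] / union[present]))

def pp_iou(preds, gts, num_classes):
    """Pool counts over every pixel in the dataset, then average across classes."""
    totals = [class_counts(p, g, num_classes) for p, g in zip(preds, gts)]
    inter = sum(t[0] for t in totals)
    union = sum(t[1] for t in totals)
    return mean_iou(inter, union)

def pi_iou(preds, gts, num_classes):
    """Score each image separately, then give every image equal weight."""
    scores = [mean_iou(*class_counts(p, g, num_classes)) for p, g in zip(preds, gts)]
    return float(np.mean(scores))
```

Under this reading, a large image that is segmented perfectly dominates PP-IOU, whereas a small, poorly segmented image pulls PI-IOU down just as much as any other image would.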
Our evaluations provide evidence for three conclusions:
In Section 4.1, we show that our boundary-based pixel affinities are better suited for semantic segmentation than the traditional color-based affinities.
In Section 4.2, we demonstrate that our global minimization leads to better results than those achieved by other inference schemes.
In Figure 32, we qualitatively compare the outputs of FCN and Dense-CRF to our predicted segmentations. This comparison shows that the BNF segments are better localized around the object boundaries and that they are also spatially smooth.
4.1 Comparing Affinity Functions for Semantic Segmentation
In Table 2, we consider two global models. Both models use the same unary potentials obtained from the FCN softmax layer. However, the first model uses the popular color-based pairwise affinities, while the second employs our boundary-based affinities. Each of these two models is optimized using several inference strategies. The table shows that using our boundary-based affinity function improves the results of all global inference methods according to both evaluation metrics. Note that we cannot include Dense-CRF [?] in this comparison because it employs an efficient message-passing technique and integrating our affinities into this technique is a non-trivial task. However, we compare our method with Dense-CRF in Section 4.2.
The results in Table 2 suggest that our semantic boundary-based pixel affinity function yields better semantic segmentation results compared to the commonly used color-based affinities. We note that we also compared the results of our inference technique using other edge detectors, notably UCM [?] and HfL [?]. In comparison to UCM edges, we observed that our boundaries provide and according to both evaluation metrics respectively. When comparing our boundaries with the HfL method, we observed similar segmentation performance, which suggests that our method works best with high-quality semantic boundaries.
4.2 Comparing Inference Methods for Semantic Segmentation
We also present semantic segmentation results for both metrics (PP-IOU and PI-IOU) in Table 3. In this comparison, all the techniques use the same FCN unary potentials. All inference methods except Dense-CRF use our affinity measure (since the previous analysis suggested that our affinities yield better performance). We use BNF-SB to denote the variant of our method that uses only semantic boundary-based affinities, and BNF-SB-SM to indicate the version that uses both boundary and softmax-based affinities (see Eq. ).
Based on these results, we observe that our proposed technique outperforms all the other globalization methods according to both metrics, by and respectively. Additionally, these results indicate that most benefit comes from the semantic boundary affinity term rather than the softmax affinity term.
In Figure 32, we also present qualitative semantic segmentation results. Note that, compared to the segmentation output from the softmax layer, our segmentation is much better localized around the object boundaries. Additionally, in comparison to Dense-CRF predictions, our method produces segmentations that are spatially much smoother.
4.3 Semantic Boundary Classification
We can also label our boundaries with a specific object class, using the same classification strategy as in the HfL system [?]. Since the SBD dataset provides annotations for semantic boundary classification, we can test our results against the state-of-the-art HfL [?] method for this task. Due to space limitations, we do not include full results for each category. However, we observe that our produced results achieve mean Max F-Score of (averaged across all classes) whereas the HfL method obtains .
In this work we introduced a Boundary Neural Field (BNF), an architecture that employs a semantic segmentation FCN to predict semantic boundaries and then uses the predicted boundaries and the FCN output to produce improved semantic segmentation maps via a global optimization. We showed that our predicted boundaries are better suited for semantic segmentation than the commonly used low-level color-based affinities. Additionally, we introduced a global energy function that decomposes semantic segmentation into multiple binary problems and relaxes an integrality constraint. We demonstrated that the minimization of this global energy allows us to predict segmentations that are better localized around the object boundaries and that are spatially smoother compared to the segmentations achieved by prior methods. We made the code of our globalization technique available at http://www.seas.upenn.edu/~gberta/publications.html.
The main goal of this work was to show the effectiveness of boundary-based affinities for semantic segmentation. However, due to differentiability of our global energy, it may be possible to add more parameters inside the BNFs and learn them in a front-to-end fashion. We believe that optimizing the entire architecture jointly could capture the inherent relationship between semantic segmentation and boundary detection even better and further improve the performance of BNFs. We will investigate this possibility in our future work.
This research was funded in part by NSF award CNS-1205521.