Semantic Edge Detection with Diverse Deep Supervision
Semantic edge detection (SED), which aims at jointly extracting edges as well as their category information, has far-reaching applications in domains such as semantic segmentation, object proposal generation, and object recognition. SED naturally requires achieving two distinct supervision targets: locating fine detailed edges and identifying high-level semantics. We shed light on how such distracted supervision targets prevent state-of-the-art SED methods from effectively using deep supervision to improve results. In this paper, we propose a novel fully convolutional neural network architecture using diverse deep supervision (DDS) within a multi-task framework, where the lower layers aim at generating category-agnostic edges while the higher layers are responsible for detecting category-aware semantic edges. To overcome the distracted supervision challenge, a novel information converter unit is introduced, whose effectiveness has been extensively evaluated on several popular benchmark datasets, including SBD, Cityscapes, and PASCAL VOC2012. Source code will be released upon paper acceptance.
1. Introduction

Classical edge detection aims to detect edges and objects’ boundaries. It is category-agnostic in the sense that recognizing object categories is not necessary. It can be viewed as a pixel-wise binary classification problem whose objective is to classify each pixel as belonging to one class, indicating edge, or to the other class, indicating non-edge. In this paper we consider the more practical scenario of semantic edge detection, in which the detection of edges and the recognition of their categories within an image are jointly achieved. Semantic edge detection (SED) (Hariharan et al., 2011; Yu et al., 2017; Maninis et al., 2017; Bertasius et al., 2015b) is an active research topic in computer vision due to its wide-ranging applications in problems such as object proposal generation (Bertasius et al., 2015b), occlusion and depth reasoning (Hoiem et al., 2007; Amer et al., 2015), 3D reconstruction (Shan et al., 2014), object detection (Ferrari et al., 2008, 2010), image-based localization (Ramalingam et al., 2010) and so on.
Recently, deep convolutional neural networks (DCNNs) reign undisputed as the new de-facto method for category-agnostic edge detection (Xie and Tu, 2015; Liu et al., 2017), where near human-level performance has been achieved. Deep learning for category-aware SED, which jointly detects visually salient edges and recognizes their categories, however, has yet to witness such vast popularity. Hariharan et al. (Hariharan et al., 2011) first combined generic object detectors with bottom-up edges to recognize semantic edges. A fully convolutional encoder-decoder network was proposed in (Yang et al., 2016) to detect object contours, but without recognizing specific categories. Recently, CASENet (Yu et al., 2017) introduced a skip-layer structure to enrich category-wise edge activations with bottom layer features, improving on previous state-of-the-art methods by a significant margin.
Distracted supervision paradox in SED. SED naturally requires achieving two distinct supervision targets: i) locating fine detailed edges by capturing discontinuity among image regions, mainly using low-level features; and ii) identifying abstracted high-level semantics by summarizing different appearance variations of the target categories. Such a distracted supervision paradox prevents the state-of-the-art SED method, i.e. CASENet (Yu et al., 2017), from successfully applying deep supervision, whose effectiveness has been demonstrated in a wide range of other computer vision tasks, e.g. image categorization (Szegedy et al., 2015), object detection (Lin et al., 2017), visual tracking (Wang et al., 2015), and category-agnostic edge detection (Xie and Tu, 2015; Liu et al., 2017).
In this paper, we propose a diverse deep supervision (DDS) method, which employs deep supervision with different loss functions for high-level and low-level feature learning, as shown in Fig. 11(b). While it is intuitive and straightforward to mainly use high-level (conv5) features for semantic classification and low-level (conv1-4) features for non-semantic edge details, doing this directly as in CASENet (Yu et al., 2017) results in even worse performance than learning semantic edges without deep supervision or category-agnostic edge guidance. In (Yu et al., 2017), Yu et al. claimed that deep supervision for the lower layers of the network is not necessary, after unsuccessfully trying various ways of adding deep supervision. As illustrated in Fig. 11(b), we propose an information converter unit that changes the backbone DCNN features into different representations for training category-agnostic and semantic edges, respectively. Without such information converters, the low-level (conv layer 1-4) and high-level (conv layer 5) DCNN features would be optimized towards category-agnostic and semantic edges respectively, which cannot be easily reconciled by simple convolutions between the conv4 and conv5 features. By introducing the information converter units, a single backbone representation can be effectively learned end-to-end towards the different targets. An example of our DDS is shown in Fig. 10. The bottom sides of the neural network help Side-5 to find fine details, and thus the final fused semantic edges (Fig. 10 (i)) are smoother than those from Side-5 alone (Fig. 10 (h)).
In summary, our main contributions are:
Proposing a state-of-the-art SED method, namely diverse deep supervision (DDS), which uses information converters to avoid the difficulties of learning powerful backbone features with distracted supervision (Sec. 4).
Providing detailed ablation studies to further understand the proposed method (Sec. 5.1).
We extensively evaluate our method on the SBD (Hariharan et al., 2011), Cityscapes (Cordts et al., 2016), and PASCAL VOC2012 (Everingham et al., [n. d.]) datasets. Our method achieves new state-of-the-art performance. On the SBD dataset, the average maximum F-measure of our proposed DDS algorithm at optimal dataset scale (ODS) is 73.3%, compared with the previous state-of-the-art performance of 71.4%.
2. Related Work
The wealth of research in this area is such that we cannot give an exhaustive review. Instead, we first describe the most important threads of research to solve the problem of classical category-agnostic edge detection, followed by deep learning based approaches, semantic edge detection (SED), and the technique of deep supervision.
Classical category-agnostic edge detection. Edge detection has been conventionally solved by designing various filters such as Sobel (Sobel, 1970) and Canny (Canny, 1986) to detect pixels with the highest gradients in their local neighborhoods. To the best of our knowledge, Konishi et al. (Konishi et al., 2003) proposed the first data-driven edge detector in which, unlike previous model-based approaches, edge detection was posed as statistical inference. Pb features, which consist of brightness, color and texture cues, are used in (Martin et al., 2004) to obtain the posterior probability of each boundary point. Pb is further extended to gPb (Arbeláez et al., 2011) by computing local cues at multiple scales and globalizing them through spectral clustering. Sketch Tokens are learned from hand-drawn sketches for contour detection (Lim et al., 2013). Random decision forests are employed in (Dollár and Zitnick, 2015) to learn the local structure of edge patches, showing competitive results among non-deep-learning approaches.
Deep category-agnostic edge detection. The number of success stories of machine learning has seen an all-time rise across many computer vision tasks recently. The unifying idea is deep learning, which utilizes neural networks aimed at learning complex feature representations from raw data. Motivated by this, deep learning based methods have made vast inroads into edge detection as well. Ganin et al. (Ganin and Lempitsky, 2014) applied deep neural networks to edge detection using dictionary learning and a nearest neighbor algorithm. DeepEdge (Bertasius et al., 2015a) extracts candidate contour points first and then classifies these candidates. HFL (Bertasius et al., 2015b) uses SE (Dollár and Zitnick, 2015) to generate candidate edge points instead of the Canny detector (Canny, 1986) used in DeepEdge. Compared with DeepEdge, which needs to process input patches for every candidate point, HFL turns out to be more computationally feasible as the input image is fed into the network only once. DeepContour (Shen et al., 2015) partitions edge data into subclasses and fits each subclass with different model parameters. Xie et al. (Xie and Tu, 2015) leveraged deeply-supervised nets to build a fully convolutional network that performs image-to-image prediction. Their deep model, known as HED, fuses the information from bottom and top conv layers. Liu et al. (Liu et al., 2017) introduced the first real-time edge detector that achieves higher F-measure scores than average human annotators on the popular BSDS500 dataset (Arbeláez et al., 2011).
Semantic edge detection. Owing to its strong ability for semantic representation learning, deep learning based edge detectors tend to generate high responses at object boundary locations, e.g. Fig. 10 (d)-(g). This inspires research on simultaneously detecting edge pixels and classifying them based on association to one or more of the object categories. The so-called category-aware edge detection is highly beneficial to various vision tasks including object recognition, stereo, semantic segmentation, and object proposal generation.
Hariharan et al. (Hariharan et al., 2011) first provided a principled way of combining generic object detectors with bottom-up contours to detect semantic edges. Yang et al. (Yang et al., 2016) proposed a fully convolutional encoder-decoder network for object contour detection. HFL produces category-agnostic binary edges and assigns class labels to all boundary points using deep semantic segmentation networks. Maninis et al. (Maninis et al., 2017) coupled their convolutional oriented boundaries (COB) with semantic segmentation results obtained by dilated convolutions (Yu and Koltun, 2016). A weakly supervised learning strategy is introduced in (Khoreva et al., 2016), in which bounding box annotations alone are sufficient to reach high-quality object boundaries without any object-specific annotations. Yu et al. (Yu et al., 2017) proposed a novel network, CASENet, which has pushed SED performance to the state of the art. In their architecture, low-level features are only used to augment top classifications. After several failed experiments, they claimed that deep supervision on the bottom sides of lower layers is not necessary for SED.
Deep supervision. Deep supervision has been demonstrated to be effective in many vision or learning tasks such as image classification (Lee et al., 2015; Szegedy et al., 2015), object detection (Lin et al., 2017; Lin et al., 2016; Liu et al., 2016), visual tracking (Wang et al., 2015), category-agnostic edge detection (Xie and Tu, 2015; Liu et al., 2017) and so on. Theoretically speaking, lower layers in deep networks learn discriminative features so that classification/regression at higher layers is easier. In practice, one can explicitly influence the hidden layer weight/filter update process to favor highly discriminative feature maps using deep supervision. However, it may not be optimal to directly add deep supervision of category-agnostic edges on the bottom sides because of the distracted supervision discussed above. We will introduce a new semantic edge detector with successful diverse deep supervision in the following sections.
3. Discussion of CASENet
3.1. CASENet Model
CASENet (Yu et al., 2017) is built on the well-known backbone network of ResNet (He et al., 2016). It connects a 1×1 conv layer after each of Side-1-3 to produce a single-channel feature map F^{(s)} (s = 1, 2, 3). The top Side-5 is connected to a 1×1 conv layer to output a K-channel class activation map A = {A_1, A_2, ..., A_K}, where K is the number of categories. The shared concatenation replicates the bottom features F = {F^{(1)}, F^{(2)}, F^{(3)}} to separately concatenate them with each channel of the class activation map:

A_f = {F, A_1, F, A_2, ..., F, A_K}.
Then, a K-grouped 1×1 conv is performed on A_f to generate the semantic edge map Y with K channels, where the k-th channel Y_k represents the edge map for the k-th category.
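To make the shared-concatenation and grouped-convolution operations concrete, here is a minimal NumPy sketch under assumed tensor shapes; the function names and the flat weight layout are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def shared_concat(side_feats, class_act):
    # side_feats: list of 3 single-channel maps, each (H, W), from Side-1-3
    # class_act: (K, H, W) class activation map from Side-5
    K = class_act.shape[0]
    # for each category k, stack [F1, F2, F3, A_k] -> (4, H, W)
    groups = [np.stack(side_feats + [class_act[k]]) for k in range(K)]
    return np.concatenate(groups, axis=0)  # (4K, H, W)

def grouped_1x1_conv(x, weights):
    # x: (4K, H, W); weights: (K, 4) -- one 1x1 filter per group of 4 channels
    K = weights.shape[0]
    x = x.reshape(K, 4, *x.shape[1:])
    # mix the 4 channels of each group into one output channel
    return np.einsum('kchw,kc->khw', x, weights)  # (K, H, W)
```

A grouped 1×1 convolution keeps the K categories independent: each output channel only sees its own copy of the bottom features plus one class activation channel.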
CASENet (Yu et al., 2017) imposes supervision only on Side-5 and the final fused activation. In (Yu et al., 2017), the authors tried several deeply supervised architectures. They first used all of Side-1-5 for SED separately, with each side connected to a classification loss. The evaluation results are even worse than for the basic architecture that directly applies convolution on Side-5 to obtain semantic edges. It is widely accepted that the lower layers of neural networks contain low-level, less-semantic features such as local edges, which are not suitable for semantic classification, because the recognition of semantic categories needs the abstracted high-level features that appear in the top layers of neural networks. Thus the bottom sides yield poor classification results, and fusing these poor results with the top results naturally brings no improvement in SED accuracy.
They also attempted to impose deep supervision of binary edges on Side-1-3 in CASENet but observed a divergence from the semantic classification at Side-5. With the top supervision of semantic edges, the top layers of the network are supervised to learn abstracted high-level semantics that summarize different appearance variations of the target categories. The bottom layers are forced to serve the top layers for this goal through back propagation, because the bottom layers are the bases of the top layers in the representation hierarchy of DCNNs. On the other hand, with the bottom supervision of category-agnostic edges, the bottom layers are taught that all semantic categories are the same here and that only the difference between edge and non-edge matters. This causes conflicts at the bottom layers and therefore fails to provide discriminative gradient signals for parameter updating.
Note that Side-4 is not used in CASENet. We believe this is a naive way to alleviate the information conflicts, by regarding the whole res4 block as a buffer unit between the bottom and top sides. When adding Side-4 to CASENet (see Sec. 5.1), the new model (CASENet Side4) achieves 70.9% mean F-measure, compared with the 71.4% of the original CASENet. This supports our hypothesis about the buffer function of the res4 block. Moreover, the classical conv layer after each side (Xie and Tu, 2015; Yu et al., 2017) is too weak to buffer the conflicts. In this paper, we propose an information converter unit to resolve the conflicts of the distracted supervision.
4. Our Approach
Intuitively, by employing different but “appropriate” ground truth for the bottom and top sides, the learned intermediate representations of different levels may contain complementary information. But directly imposing deep supervision seems not to be beneficial here. In this section, we propose a new network architecture for the complementary learning of the bottom and top sides for SED.
4.1. The Proposed DDS Algorithm
Based on the above discussion, we hypothesize that the bottom sides of neural networks may not be directly beneficial to SED. However, we believe that the bottom sides still encode fine details complementary to the top side (Side-5). With appropriate architecture re-design, we believe they can be used for category-agnostic edge detection to improve the localization accuracy of the semantic edges generated by the top side. To this end, we design a novel information converter to assist low-level feature learning and to generate gradient signals consistent with those coming from higher layers. This is essential, as it enables directly influencing the hidden layer weight/filter update process to favor highly discriminative feature maps towards correct SED.
We show our proposed network architecture in Fig. 11(b). We follow CASENet in using ResNet (He et al., 2016) as our backbone network. After each of the information converters (Sec. 4.2) of Side-1-4, we connect a conv layer with a single output channel to produce an edge response map. These predicted maps are then upsampled to the original image size using bilinear interpolation. These side outputs are supervised by binary category-agnostic edges. We perform a K-channel convolution on Side-5 to obtain semantic edges, where K is the number of categories and each channel represents the binary edge map of one category. We adopt the same upsampling operation as for Side-1-4. Semantic edges are used to supervise the training of Side-5.
Here, we denote the binary edge maps produced by Side-1-4 as E = {E^{(1)}, E^{(2)}, E^{(3)}, E^{(4)}}. The semantic edge map from Side-5 is still represented by A = {A_1, A_2, ..., A_K}. A shared concatenation is then performed to obtain the stacked edge activation map:

Ã_f = {E, A_1, E, A_2, ..., E, A_K}.
Note that Ã_f is a stacked edge activation map, while its counterpart in CASENet is a stacked feature map. Finally, we apply a K-grouped 1×1 convolution on Ã_f to generate the fused semantic edges. The fused edges are supervised by the ground truth of semantic edges. As demonstrated in HED (Xie and Tu, 2015), the 1×1 convolution can fuse the edges from the bottom and top sides well.
4.2. Information Converter
Recently, residual networks have been shown to be easier to optimize than plain networks (He et al., 2016). The residual learning operation is usually embodied by a shortcut connection and element-wise addition. We describe a residual conv block in Fig. 12. This block consists of four alternately connected ReLU and conv layers, and the output of the first ReLU layer is added to the output of the last conv layer. Our proposed information converter combines two such residual modules and is connected to each side of the DDS network to transform the learned representation into an appropriate form. This operation is expected to avoid the conflicts caused by the discrepancy between the different loss functions.
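As a deliberately simplified illustration of this structure (ReLU, conv, ReLU, conv, with the first ReLU's output added to the last conv's output, stacked twice), the NumPy sketch below uses 1×1 convolutions for brevity; the function names, kernel size, and weight shapes are our own assumptions, not the paper's implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in) -- a 1x1 conv is a per-pixel channel mix
    return np.einsum('ij,jhw->ihw', w, x)

def residual_block(x, w1, w2):
    # four alternating layers: ReLU -> conv -> ReLU -> conv,
    # with the first ReLU output added to the last conv output
    h0 = relu(x)
    h = conv1x1(h0, w1)
    h = conv1x1(relu(h), w2)
    return h0 + h

def information_converter(x, params):
    # two stacked residual modules, as described for the converter unit
    (w1, w2), (w3, w4) = params
    return residual_block(residual_block(x, w1, w2), w3, w4)
```

With all weights at zero the converter reduces to the identity on non-negative inputs (only the shortcut path survives), which shows why such blocks are easy to optimize from a near-identity initialization.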
The top supervision of semantic edges produces gradient signals for the learning of semantic features, while the bottom supervision of category-agnostic edges produces category-agnostic gradients. If the distracted supervision is imposed directly, these conflicting gradient signals confuse the backbone network through back propagation. Our information converters play a buffering role by converting these conflicting signals into appropriate representations. In this way, the backbone network receives consistent update signals and is optimized towards the same target; the different tasks at the bottom and top sides are instead carried out by the information converters.
Our proposed network can successfully combine the fine details from the bottom sides with the semantic classification from the top side. The experimental results demonstrate that our algorithm resolves the conflicts caused by diverse deep supervision. Unlike CASENet, our semantic classification at Side-5 can be well optimized without any divergence. The binary edges produced by the bottom sides help Side-5 to recover fine details. Thus the final fused semantic edges achieve better localization quality.
We use binary edges of single-pixel width for the supervision of Side-1-4 and thick semantic boundaries for the supervision of Side-5 and the final fused edges. A pixel is viewed as a binary edge if it belongs to the semantic boundaries of any category. We obtain thick semantic boundaries by seeking the difference between a pixel and its neighbors, as in CASENet (Yu et al., 2017): a pixel with label k is regarded as a boundary of class k if at least one neighbor with a different label k′ (k′ ≠ k) exists.
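A minimal NumPy sketch of this neighbor-difference labeling rule follows; the use of 8-connected neighbors and edge-replicated padding (so image borders do not count as boundaries) are our assumptions for illustration:

```python
import numpy as np

def semantic_boundaries(label, num_classes):
    # label: (H, W) integer class map; returns (K, H, W) binary boundary maps.
    # A pixel with label k is a boundary of class k if any of its 8 neighbors
    # carries a different label.
    H, W = label.shape
    edges = np.zeros((num_classes, H, W), dtype=bool)
    pad = np.pad(label, 1, mode='edge')  # replicate borders: no spurious edges
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neigh = pad[1 + dy:H + 1 + dy, 1 + dx:W + 1 + dx]
            diff = neigh != label
            for k in range(num_classes):
                edges[k] |= diff & (label == k)
    return edges
```

Note that each pixel can belong to boundaries of several classes simultaneously when three or more regions meet, which is why the supervision is multi-label rather than single-label.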
4.3. Multi-task Loss
Two different loss functions, which stand for the category-agnostic and category-aware edge detection losses respectively, are employed in our multi-task learning framework. We denote all the layer parameters in the network as W. Suppose an image I has a corresponding binary edge map Y. The class-balanced sigmoid cross-entropy loss function for Side-1-4 can be formulated as

\mathcal{L}_{side}(W) = -\beta \sum_{j \in Y_+} \log \sigma(a_j) - (1 - \beta) \sum_{j \in Y_-} \log(1 - \sigma(a_j)),

where \beta = |Y_-| / |Y| and 1 - \beta = |Y_+| / |Y|. Y_+ and Y_- represent the edge and non-edge ground-truth label sets, respectively. a_j is the produced activation value at pixel j, and \sigma(\cdot) is the standard sigmoid function.
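Assuming the class-balanced formulation above, a small NumPy sketch of the side loss might read as follows (variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def balanced_bce(activations, edge_gt):
    # activations: (H, W) raw side outputs; edge_gt: (H, W) binary {0, 1}.
    # beta = |Y-| / |Y| down-weights the (far more numerous) non-edge pixels.
    pos = edge_gt == 1
    neg = ~pos
    beta = neg.sum() / edge_gt.size
    p = sigmoid(activations)
    loss = -(beta * np.log(p[pos])).sum() \
           - ((1.0 - beta) * np.log(1.0 - p[neg])).sum()
    return loss
```

The balancing weight matters because edge pixels typically occupy only a few percent of an image; without it, the network would be pushed towards predicting "non-edge" everywhere.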
For an image I, suppose the semantic ground-truth label is \bar{Y} = \{\bar{Y}^{(1)}, ..., \bar{Y}^{(K)}\}, in which \bar{Y}^{(k)} is the binary edge map for the k-th category. Note that each pixel may belong to the boundaries of multiple categories. We use the multi-label loss as in CASENet:

\mathcal{L}_{sem}(W) = -\sum_{k=1}^{K} \sum_{j} \big[ \beta \, \bar{y}^{(k)}_j \log \sigma(a^{(k)}_j) + (1 - \beta)(1 - \bar{y}^{(k)}_j) \log(1 - \sigma(a^{(k)}_j)) \big],

in which a^{(k)}_j is the activation value for the k-th category at pixel j. The loss of the fused semantic activation map is denoted as \mathcal{L}_{fuse}(W), whose computation is similar to that of \mathcal{L}_{sem}(W). The total loss is formulated as

\mathcal{L}(W) = \sum_{s=1}^{4} \mathcal{L}^{(s)}_{side}(W) + \mathcal{L}_{sem}(W) + \mathcal{L}_{fuse}(W).
With the total loss function, we can use stochastic gradient descent to optimize all parameters.
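Putting the pieces together, a self-contained schematic of the total loss (a sketch under the class-balanced assumptions above, not the authors' code) is:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def balanced_bce(act, gt):
    # class-balanced sigmoid cross-entropy on one (H, W) map
    pos = gt == 1
    neg = ~pos
    beta = neg.sum() / gt.size
    p = sigmoid(act)
    return -(beta * np.log(p[pos])).sum() \
           - ((1.0 - beta) * np.log(1.0 - p[neg])).sum()

def multilabel_loss(act, gt):
    # act, gt: (K, H, W); sum of per-category balanced BCE terms
    return sum(balanced_bce(act[k], gt[k]) for k in range(act.shape[0]))

def total_loss(side_acts, edge_gt, side5_act, fused_act, sem_gt):
    # side_acts: 4 binary-edge maps (Side-1-4), supervised by edge_gt;
    # side5_act and fused_act: (K, H, W) semantic maps, supervised by sem_gt.
    l_sides = sum(balanced_bce(a, edge_gt) for a in side_acts)
    return l_sides + multilabel_loss(side5_act, sem_gt) \
                   + multilabel_loss(fused_act, sem_gt)
```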
4.4. Implementation Details
We implement our algorithm using the well-known deep learning framework Caffe (Jia et al., 2014). The proposed network is built on ResNet (He et al., 2016). We follow CASENet (Yu et al., 2017) in changing the strides of the first and fifth convolution blocks from 2 to 1. The atrous algorithm is used to keep the receptive field sizes the same as in the original ResNet. We also follow CASENet in pre-training the convolution blocks on the COCO dataset (Lin et al., 2014). The network is optimized with stochastic gradient descent (SGD). Each SGD iteration chooses 10 images uniformly at random and crops a patch from each of them. The weight decay and momentum are set to 0.0005 and 0.9, respectively. The base learning rate is set to 5e-7/2.5e-7 for the SBD (Hariharan et al., 2011) and Cityscapes (Cordts et al., 2016) datasets, respectively. We use the “poly” learning rate policy, in which the current learning rate equals the base one multiplied by (1 - iter/max_iter)^{power}. The parameter power is set to 0.9. We run 25k/80k iterations of SGD for the SBD (Hariharan et al., 2011) and Cityscapes (Cordts et al., 2016) datasets, respectively. We use the model trained on SBD to test on PASCAL VOC2012 without retraining. The side upsampling operation is implemented with deconvolution layers whose parameters are fixed to perform bilinear interpolation. All experiments are performed on an NVIDIA TITAN Xp GPU.
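The "poly" schedule can be sketched in a few lines (parameter names are illustrative):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    # "poly" policy: lr = base_lr * (1 - iter / max_iter) ** power
    # Decays smoothly from base_lr at iteration 0 to 0 at max_iter.
    return base_lr * (1.0 - iteration / max_iter) ** power
```

For example, with the SBD settings above one would call `poly_lr(5e-7, it, 25000)` at each iteration `it`.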
Table 2. Per-class and mean maximum F-measure (%) at ODS on the SBD validation set. Columns follow the standard PASCAL VOC class order.

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| InvDet (Hariharan et al., 2011) | 41.5 | 46.7 | 15.6 | 17.1 | 36.5 | 42.6 | 40.3 | 22.7 | 18.9 | 26.9 | 12.5 | 18.2 | 35.4 | 29.4 | 48.2 | 13.9 | 26.9 | 11.1 | 21.9 | 31.4 | 27.9 |
| HFL-FC8 (Bertasius et al., 2015b) | 71.6 | 59.6 | 68.0 | 54.1 | 57.2 | 68.0 | 58.8 | 69.3 | 43.3 | 65.8 | 33.3 | 67.9 | 67.5 | 62.2 | 69.0 | 43.8 | 68.5 | 33.9 | 57.7 | 54.8 | 58.7 |
| HFL-CRF (Bertasius et al., 2015b) | 73.9 | 61.4 | 74.6 | 57.2 | 58.8 | 70.4 | 61.6 | 71.9 | 46.5 | 72.3 | 36.2 | 71.1 | 73.0 | 68.1 | 70.3 | 44.4 | 73.2 | 42.6 | 62.4 | 60.1 | 62.5 |
| BNF (Bertasius et al., 2016) | 76.7 | 60.5 | 75.9 | 60.7 | 63.1 | 68.4 | 62.0 | 74.3 | 54.1 | 76.0 | 42.9 | 71.9 | 76.1 | 68.3 | 70.5 | 53.7 | 79.6 | 51.9 | 60.7 | 60.9 | 65.4 |
| WS (Khoreva et al., 2016) | 65.9 | 54.1 | 63.6 | 47.9 | 47.0 | 60.4 | 50.9 | 56.5 | 40.4 | 56.0 | 30.0 | 57.5 | 58.0 | 57.4 | 59.5 | 39.0 | 64.2 | 35.4 | 51.0 | 42.4 | 51.9 |
| DilConv (Yu and Koltun, 2016) | 83.7 | 71.8 | 78.8 | 65.5 | 66.3 | 82.6 | 73.0 | 77.3 | 47.3 | 76.8 | 37.2 | 78.4 | 79.4 | 75.2 | 73.8 | 46.2 | 79.5 | 46.6 | 76.4 | 63.8 | 69.0 |
| DSN (Yu et al., 2017) | 81.6 | 75.6 | 78.4 | 61.3 | 67.6 | 82.3 | 74.6 | 82.6 | 52.4 | 71.9 | 45.9 | 79.2 | 78.3 | 76.2 | 80.1 | 51.9 | 74.9 | 48.0 | 76.5 | 66.8 | 70.3 |
| COB (Maninis et al., 2017) | 84.2 | 72.3 | 81.0 | 64.2 | 68.8 | 81.7 | 71.5 | 79.4 | 55.2 | 79.1 | 40.8 | 79.9 | 80.4 | 75.6 | 77.3 | 54.4 | 82.8 | 51.7 | 72.1 | 62.4 | 70.7 |
| CASENet (Yu et al., 2017) | 83.3 | 76.0 | 80.7 | 63.4 | 69.2 | 81.3 | 74.9 | 83.2 | 54.3 | 74.8 | 46.4 | 80.3 | 80.2 | 76.6 | 80.8 | 53.3 | 77.2 | 50.1 | 75.9 | 66.8 | 71.4 |
5. Experiments

We evaluate our method on several datasets, including SBD (Hariharan et al., 2011), Cityscapes (Cordts et al., 2016), and PASCAL VOC2012 (Everingham et al., [n. d.]). SBD (Hariharan et al., 2011) comprises 11355 images with labeled semantic edge maps for the 20 PASCAL VOC classes. It is divided into 8498 training and 2857 validation images. We use the training set to train our network and the validation set for evaluation. The Cityscapes dataset (Cordts et al., 2016) is a large-scale semantic segmentation dataset with stereo video sequences recorded in street scenes from 50 different cities. It consists of 5000 images divided into 2975 training, 500 validation and 1525 test images. The ground truth of the test set has not been published because it is an online competition, so we use the training set for training and the validation set for testing, as on SBD. PASCAL VOC2012 (Everingham et al., [n. d.]) contains the same 20 classes as SBD. For the same reason, the semantic labels of its test set have not been published. We generate a new validation set that excludes the training images of SBD, resulting in 904 validation images. We test on this new set using the models trained on the SBD training set to assess the generalization ability of different methods.
For the evaluation protocol, we use the standard benchmark published in (Hariharan et al., 2011). The maximum F-measure at optimal dataset scale (ODS) for each class and the mean maximum F-measure across all classes are reported. For the Cityscapes and VOC2012 datasets, we follow (Hariharan et al., 2011) to generate semantic edges of single-pixel width, so that the produced boundaries are exactly the boundaries of semantic objects or stuff in semantic segmentation. We also follow (Yu et al., 2017) in downsampling the ground truth and predicted edge maps of Cityscapes to half of their original dimensions to speed up evaluation.
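As a rough illustration of the ODS idea (a single threshold shared across the whole dataset), the simplified NumPy sketch below computes a dataset-level maximum F-measure; the actual benchmark additionally matches predicted and ground-truth edge pixels with a small distance tolerance, which is omitted here:

```python
import numpy as np

def ods_f_measure(pred_probs, gts, thresholds=np.linspace(0.01, 0.99, 99)):
    # pred_probs: list of (H, W) edge probability maps
    # gts: list of (H, W) boolean ground-truth edge maps
    best = 0.0
    for t in thresholds:          # one fixed threshold for the whole dataset
        tp = fp = fn = 0
        for p, g in zip(pred_probs, gts):
            b = p >= t
            tp += np.logical_and(b, g).sum()
            fp += np.logical_and(b, ~g).sum()
            fn += np.logical_and(~b, g).sum()
        prec = tp / (tp + fp + 1e-12)
        rec = tp / (tp + fn + 1e-12)
        best = max(best, 2 * prec * rec / (prec + rec + 1e-12))
    return best
```

The ODS score rewards a detector whose confidence scale is consistent across images, since the same threshold must work everywhere.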
5.1. Ablation Study
In this part, we perform some ablation experiments on SBD to first investigate our proposed DDS algorithm in various aspects before comparing with existing state-of-the-art methods. To this end, we propose six DDS variants:
Softmax that only adopts the top side (Side-5) with 21-class softmax loss function, so that the ground truth edges of each category are not overlapping and thus each pixel has one specific class label.
Basic that employs the top side (Side-5) for multi-label classification, which means we directly connect the multi-label loss to res5c to train the detector.
DSN that directly applies the deeply supervised network architecture, in which each side of the backbone network is connected to a conv layer with K channels for SED, and the resulting activation maps from all sides are fused to generate the final semantic edges.
CASENet Side4, which is similar to CASENet but takes Side-4 into account by connecting a conv layer that produces a single-channel feature map, while CASENet only uses Side-1-3 and Side-5.
DDS − Converter that removes the information converters from DDS, so that deep supervision is directly imposed after each side.
DDS − DeepSup that removes the deep supervision from Side-1-4 of DDS but retains the information converters.
We evaluate these variants, the original DDS, and CASENet (Yu et al., 2017) on the SBD validation set. The evaluation results are shown in Tab. 1. We can see that Softmax suffers from significant performance degradation. Because the semantic edges predicted by neural networks are usually thick and overlap across classes, it is improper to assign a single label to each pixel; thus we apply the multi-label loss in Equ. (4). The Basic network achieves an ODS F-measure of 70.6%, which is 0.3% higher than DSN. This further verifies our speculation in Sec. 3 that features from the bottom layers are not discriminative enough for semantic classification. Besides, CASENet Side4 performs better than DSN, demonstrating that bottom convolution features are more suitable for binary edge detection. Moreover, the F-measure of CASENet Side4 is lower than that of the original CASENet. In addition, the improvement from DDS − DeepSup to DDS shows that the success of DDS does not come from more parameters (conv layers) but from the coordination of deep supervision and information converters; adding more conv layers without deep supervision may make the network harder to converge. Comparing DDS − Converter with CASENet, our conclusion is in line with (Yu et al., 2017): it is useless to directly add binary edge supervision on the bottom sides.
Intuitively, employing different but “appropriate” ground truth for the bottom and top sides may enhance feature learning in different layers, so that the learned intermediate representations of different levels contain complementary information. However, it may not help to directly add deep supervision of category-agnostic edges on the bottom sides, because the discrepancy in the loss function of Equ. 5 is likely to cause less discriminative gradient signals. Alternatively, we show that with a proper architecture re-design, we are able to employ deep supervision for a significant performance boost. The information converters adopted in the proposed method play a central role in guiding the lower layers towards category-agnostic edge detection. In this way, low-level edges from the bottom layers encode more details, which assist the top layers towards better-localized semantic edges. They also help generate consistent gradient signals from the higher layers. This is essential, as it directly influences the hidden layer weight/filter update process to favor highly discriminative feature maps for correct SED.
The significant performance improvement of our proposed DDS over CASENet Side4 and DDS − Converter demonstrates the importance of our design, in which different sides use different supervision after the information format conversion. Having explored DDS with several variants and established the effectiveness of the approach, we now summarize the results obtained by our method and compare it with several state-of-the-art methods.
5.2. Evaluation on SBD
We compare DDS on SBD dataset with several state-of-the-art methods including InvDet (Hariharan et al., 2011), HFL-FC8 (Bertasius et al., 2015b), HFL-CRF (Bertasius et al., 2015b), BNF (Bertasius et al., 2016), WS (Khoreva et al., 2016), DilConv (Yu and Koltun, 2016), DSN, COB (Maninis et al., 2017), and CASENet (Yu et al., 2017). Evaluation results are summarized in Tab. 2.
InvDet is a non-deep-learning approach that shows competitive results among conventional approaches. COB is a state-of-the-art category-agnostic edge detection method; combining it with the semantic segmentation results of DilConv yields a competitive semantic edge detector (Maninis et al., 2017). The improvement of COB over DilConv reflects the effectiveness of the fusion algorithm in (Maninis et al., 2017). The fact that both CASENet and DDS outperform COB illustrates that directly learning semantic edges is crucial, because the combination of binary edges and semantic segmentation is not enough for SED. DDS achieves state-of-the-art performance across all competitors and outperforms the other methods on 15 of the 20 classes. The ODS F-measure of our DDS algorithm is 1.9% higher than CASENet and 2.6% higher than COB. Thus DDS pushes SED to a new state of the art. The average runtime of DSN, CASENet, and DDS is displayed in Tab. 5; DDS generates state-of-the-art semantic edges at only slightly slower speed.
To better view the edge prediction results, we show an example in Fig. 13. We also show the normalized images of side activations in Fig. 14. All activations are obtained before the sigmoid nonlinearity. To keep the figure compact, we do not display the Side-4 activation of our DDS. From Side-1 to Side-3, one can see that the feature maps of DDS are significantly clearer than those of DSN and CASENet. We can find clear category-agnostic edges for DDS, while DSN and CASENet suffer from noisy activations; for example, in CASENet, which imposes no deep supervision on Side-1-3, we can hardly find any edge activations. For the category classification activations, DDS can separate the horse and the person clearly, while DSN and CASENet cannot. Thus the information converters also help Side-5 to be better optimized for category-specific classification. This further verifies the feasibility of DDS.
5.3. Evaluation on Cityscapes
The Cityscapes dataset (Cordts et al., 2016) is more challenging than SBD (Hariharan et al., 2011). Its images are captured in more complicated scenes, typically urban streets of different cities, with more objects, and especially more overlapping objects, per image. Cityscapes is therefore a more demanding testbed for semantic edge detectors. We report the evaluation results in Tab. 3. DDS achieves better results than CASENet on all categories, and its mean ODS F-measure is 6.7% higher. Some qualitative comparisons are shown in Fig. 15: DDS produces smoother and cleaner edges.
5.4. Evaluation on PASCAL VOC2012
VOC2012 (Everingham et al., [n. d.]) contains the same object categories as SBD (Hariharan et al., 2011). From the VOC2012 validation set, we exclude the images that appear in the SBD training set, resulting in a new validation set of 904 images. We use this new validation set to test the competitors with models trained on SBD, which evaluates the generalization ability of the various methods. However, the original VOC2012 annotations leave a thin unlabeled area near the boundary of each object, which would make the evaluation unreliable. We therefore follow the methodology in (Yang et al., 2016) and employ a dense CRF model (Krähenbühl and Koltun, 2011) to fill the uncertain area with the neighboring object labels. We further follow (Hariharan et al., 2011) to generate single-pixel-wide semantic edges. The evaluation results are summarized in Tab. 4. DDS achieves the best performance, indicating good generalization ability.
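The construction of the 904-image validation split described above amounts to a set difference over image identifiers. The sketch below illustrates that filtering step; the ID-list format is an assumption for illustration, not the actual dataset tooling.

```python
def filter_val_ids(voc_val_ids, sbd_train_ids):
    """Keep the VOC2012 validation image IDs that do not appear in the
    SBD training set, preserving the original order."""
    sbd = set(sbd_train_ids)  # set lookup makes the filter O(n)
    return [i for i in voc_val_ids if i not in sbd]

# Toy example with hypothetical image IDs:
print(len(filter_val_ids(["a", "b", "c"], ["b"])))  # -> 2
```

Applying the same filter to the real VOC2012 validation list against the SBD training list yields the 904-image split used for the cross-dataset evaluation.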
6. Conclusion
In this paper, we have studied the problem of SED. Previous methods claim that deep supervision is unnecessary in this area. We show that this is not true: with a proper architectural re-design, the network can be deeply supervised for better detection results. The core of our approach is the introduction of novel information converters, which play a central role in guiding lower layers towards category-agnostic edge detection while generating gradient signals consistent with those coming from category-aware edge detection in the higher layers. DDS is essential because it directly influences the hidden-layer weight/filter update process to favor highly discriminative feature maps for correct SED. DDS achieves state-of-the-art performance on several datasets, including SBD (Hariharan et al., 2011), Cityscapes (Cordts et al., 2016), and PASCAL VOC2012 (Everingham et al., [n. d.]). Our idea of leveraging deep supervision to train a deep network opens a path towards better utilizing the rich feature hierarchies of deep networks for SED as well as other high-level tasks such as semantic segmentation (Maninis et al., 2017; Chen et al., 2016), object detection (Ferrari et al., 2008; Maninis et al., 2017), and instance segmentation (Kirillov et al., 2017; Hayder et al., 2017).
References
- Amer et al. (2015) Mohamed R Amer, Siavash Yousefi, Raviv Raich, and Sinisa Todorovic. 2015. Monocular extraction of 2.1D sketch using constrained convex optimization. IJCV 112, 1 (2015), 23–42.
- Arbeláez et al. (2011) Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. 2011. Contour detection and hierarchical image segmentation. IEEE TPAMI 33, 5 (2011), 898–916.
- Bertasius et al. (2015a) Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. 2015a. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In IEEE CVPR. 4380–4389.
- Bertasius et al. (2015b) Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. 2015b. High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In IEEE ICCV. 504–512.
- Bertasius et al. (2016) Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. 2016. Semantic segmentation with boundary neural fields. In IEEE CVPR. 3602–3610.
- Canny (1986) John Canny. 1986. A computational approach to edge detection. IEEE TPAMI 6 (1986), 679–698.
- Chen et al. (2016) Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and Alan L Yuille. 2016. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In IEEE CVPR. 4545–4554.
- Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In IEEE CVPR. 3213–3223.
- Dollár and Zitnick (2015) Piotr Dollár and C Lawrence Zitnick. 2015. Fast edge detection using structured forests. IEEE TPAMI 37, 8 (2015), 1558–1570.
- Everingham et al. ([n. d.]) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. [n. d.]. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. ([n. d.]).
- Ferrari et al. (2008) Vittorio Ferrari, Loic Fevrier, Frederic Jurie, and Cordelia Schmid. 2008. Groups of adjacent contour segments for object detection. IEEE TPAMI 30, 1 (2008), 36–51.
- Ferrari et al. (2010) Vittorio Ferrari, Frederic Jurie, and Cordelia Schmid. 2010. From images to shape models for object detection. IJCV 87, 3 (2010), 284–303.
- Ganin and Lempitsky (2014) Yaroslav Ganin and Victor Lempitsky. 2014. N-Fields: Neural network nearest neighbor fields for image transforms. In ACCV. Springer, 536–551.
- Hariharan et al. (2011) Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. 2011. Semantic contours from inverse detectors. In IEEE ICCV. IEEE, 991–998.
- Hayder et al. (2017) Zeeshan Hayder, Xuming He, and Mathieu Salzmann. 2017. Boundary-aware Instance Segmentation. In IEEE CVPR. 5696–5704.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE CVPR. 770–778.
- Hoiem et al. (2007) Derek Hoiem, Andrew N Stein, Alexei A Efros, and Martial Hebert. 2007. Recovering occlusion boundaries from a single image. In IEEE ICCV. IEEE, 1–8.
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM MM. ACM, 675–678.
- Khoreva et al. (2016) Anna Khoreva, Rodrigo Benenson, Mohamed Omran, Matthias Hein, and Bernt Schiele. 2016. Weakly supervised object boundaries. In IEEE CVPR. 183–192.
- Kirillov et al. (2017) Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. 2017. InstanceCut: From edges to instances with multicut. In IEEE CVPR. 5008–5017.
- Konishi et al. (2003) Scott Konishi, Alan L. Yuille, James M. Coughlan, and Song Chun Zhu. 2003. Statistical edge detection: Learning and evaluating edge cues. IEEE TPAMI 25, 1 (2003), 57–74.
- Krähenbühl and Koltun (2011) Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS. 109–117.
- Lee et al. (2015) Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. 2015. Deeply-supervised nets. In Artificial Intelligence and Statistics. 562–570.
- Lim et al. (2013) Joseph J Lim, C Lawrence Zitnick, and Piotr Dollár. 2013. Sketch tokens: A learned mid-level representation for contour and object detection. In IEEE CVPR. 3158–3165.
- Lin et al. (2016) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2016. Feature pyramid networks for object detection. In IEEE CVPR.
- Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017).
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer, 740–755.
- Liu et al. (2016) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In ECCV. Springer, 21–37.
- Liu et al. (2017) Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. 2017. Richer Convolutional Features for Edge Detection. In IEEE CVPR. 3000–3009.
- Maninis et al. (2017) Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Pablo Arbelaez, and Luc Van Gool. 2017. Convolutional Oriented Boundaries: From Image Segmentation to High-Level Tasks. IEEE TPAMI (2017).
- Martin et al. (2004) David R Martin, Charless C Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE TPAMI 26, 5 (2004), 530–549.
- Ramalingam et al. (2010) Srikumar Ramalingam, Sofien Bouaziz, Peter Sturm, and Matthew Brand. 2010. Skyline2gps: Localization in urban canyons using omni-skylines. In IEEE/RSJ IROS. IEEE, 3816–3823.
- Shan et al. (2014) Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M Seitz. 2014. Occluding contours for multi-view stereo. In IEEE CVPR. 4002–4009.
- Shen et al. (2015) Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. 2015. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In IEEE CVPR. 3982–3991.
- Sobel (1970) Irwin Sobel. 1970. Camera models and machine perception. Technical Report. Stanford Univ Calif Dept of Computer Science.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In IEEE CVPR. 1–9.
- Wang et al. (2015) Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. 2015. Visual tracking with fully convolutional networks. In IEEE ICCV. 3119–3127.
- Xie and Tu (2015) Saining Xie and Zhuowen Tu. 2015. Holistically-nested edge detection. In IEEE ICCV. 1395–1403.
- Yang et al. (2016) Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. 2016. Object contour detection with a fully convolutional encoder-decoder network. In IEEE CVPR. 193–202.
- Yu and Koltun (2016) Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In ICLR.
- Yu et al. (2017) Zhiding Yu, Chen Feng, Ming-Yu Liu, and Srikumar Ramalingam. 2017. CASENet: Deep Category-Aware Semantic Edge Detection. In IEEE CVPR. 5964–5973.