Dense Extreme Inception Network: Towards a Robust CNN Model for Edge Detection
This paper proposes a Deep Learning based edge detector inspired by both HED (Holistically-Nested Edge Detection) and Xception networks. The proposed approach generates thin edge-maps that are plausible to the human eye; it can be used in any edge detection task without prior training or fine-tuning. As a second contribution, a large dataset with carefully annotated edges has been generated. This dataset has been used to train the proposed approach as well as the state-of-the-art algorithms used for comparison. Quantitative and qualitative evaluations have been performed on different benchmarks, showing improvements with the proposed method when the F-measures of ODS and OIS are considered.
Edge detection is a recurrent task required by several classical computer vision processes (e.g., segmentation [zhang2016segmentation], image recognition [yang2002detectFace, shotton2008objectRec]), and even by modern tasks such as image-to-image translation [zhu2017cyclegan] and photo sketching [lips2019photo-sketch]. Moreover, in fields such as medical image analysis [pourreza2017medImg] or remote sensing [isikdogan2017remotesens], most of the core activities require edge detectors. In spite of the large amount of work on edge detection, it remains an open problem with room for new contributions.
Since the Sobel operator was published in [sobel1972sobelmethod], a large number of edge detectors have been proposed [oskoei2010surveyedge], and most of these techniques, such as Canny [canny1987cannymethod], are still in use today. Recently, in the era of Deep Learning (DL), Convolutional Neural Network (CNN) based models such as DeepContour [shen2015deepcontour], DeepEdge [bertasius2015deepedge], HED [xie2017hed], RCF [liu2017rcf] and BDCN [he2019edgeBDCN], among others, have been proposed. These models are capable of predicting an edge map from a given image, just like the low-level methods [ziou1998edgeOverview], but with better performance. The success of these methods is mainly due to CNNs applied at different scales to a large set of images, together with training regularization techniques.
Most of the aforementioned DL based approaches are trained on existing boundary detection or object segmentation datasets [martin2001bsds300, silberman2012NYUD, mottaghi2014PASCALcontext] to detect edges. Even though most of the images in those datasets are well annotated, a few of them contain missing edges, which makes training difficult; as a result, the predicted edge-maps miss some edges in the images (see Fig. 1). In the current work, those datasets are used just for qualitative comparisons, because the objective of the current work is edge detection (not object boundary/contour detection). The boundary/contour detection tasks, although related and sometimes treated as synonymous with edge detection, are different, since only object boundaries/contours need to be detected rather than all edges present in the given image.
This manuscript aims to demonstrate the generalization of edge detection by a DL model. In other words, the model can be evaluated on other edge detection datasets without being trained on them. To the best of our knowledge, the only edge detection dataset shared with the community is the Multicue Boundary Detection Dataset (MBDD) [mely2016multicue], which, although mainly generated for the study of boundary detection, contains a subset of images devoted to edge detection. Therefore, a new dataset has been collected to train the proposed edge detector. The main contributions of the paper are summarized as follows:
A dataset with carefully annotated edges has been generated and released to the community: BIPED (Barcelona Images for Perceptual Edge Detection). (Footnote 1: Dataset and code will be available soon.)
A robust CNN architecture for edge detection is proposed, referred to as DexiNed: Dense Extreme Inception Network for Edge Detection. The model has been trained from scratch, without pre-trained weights.
The rest of the paper is organized as follows. Section 2 summarizes the most relevant and recent work on edge detection. Then, the proposed approach is described in Section 3. The experimental setup is presented in Section 4. Experimental results are then summarized in Section 5; finally, conclusions and future work are given in Section 6.
2 Related Work
There is a large body of work in the edge detection literature; for a detailed review see [ziou1998edgeOverview, Gong2018contourOverview]. According to the technique used to process the given image, the proposed approaches can be categorized as: low-level feature; brain-biologically inspired; classical learning algorithms; deep learning algorithms.
Low-level feature: Most of the algorithms in this category follow a smoothing process, which can be performed by convolving the image with a Gaussian filter or with manually designed kernels. Examples of such methods are [canny1987cannymethod, schunck1987mMultiScaleF, perona1991mComEdges]. Since Canny [canny1987cannymethod], most current methods, even DL based models, use non-maximum suppression [canny1983non-maximum] as the last step of edge detection.
Brain-biologically inspired: Research on this kind of method started in the 1960s with the analysis of edge and contour formation in the vision systems of monkeys and cats [daugman1985gaborFilter]. Inspired by that work, the authors of [grigorescu2003cid] proposed a method based on simple cells and Gabor filters. Another study focused on boundary detection is presented in [mely2016multicue]; it proposes to use Gabor and derivative of Gaussian filters, considering three different filter sizes and machine learning classifiers. More recently, an orientation-selective neuron based on the first derivative of a Gaussian function was presented in [yang2015SCO]. That work has been extended in [Akbarinia2018SEDext] by modeling the retina, simple cells and even cells from V2.
Classical learning algorithms: These techniques are usually based on sparse representation learning [mairal2008sparceModel], dictionary learning [xiaofeng2012Diclearn], gPb (globalized probability of boundary) [arbelaez2011bsds500] and structured forests [dollar2015forests] (decision trees). When they were proposed, these approaches outperformed state-of-the-art low-level techniques, reaching the best F-measure values on the BSDS segmentation dataset [arbelaez2011bsds500]. Although the results obtained were acceptable in most cases, these techniques still have limitations in challenging scenarios.
Deep learning algorithms: With the success of CNNs, principally because of the results in [krizhevsky2012alexnet], many methods have been proposed [ganin2014firstDLedge, bertasius2015deepedge, xie2017hed, liu2017rcf, wang2017ced]. In HED [xie2017hed], for example, an architecture based on VGG16 [simonyan2014vgg] and pre-trained on the ImageNet dataset is proposed. The network generates edges from each convolutional block, constructing a multi-scale learning architecture. The training process uses a modified cross-entropy loss function for each predicted edge-map. Following the same process, [liu2017rcf] and [wang2017ced] have proposed improvements. While in [liu2017rcf] every output is fed from each convolution of every block, in [wang2017ced] a set of backward fusion processes is performed with the data of each output. In general, most current DL based models use the convolutional blocks of the VGG16 architecture as their backbone.
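To make the low-level feature category described above more concrete, the following is a minimal NumPy sketch (not taken from any of the cited works) of Gaussian smoothing followed by Sobel gradients. The helper names (`correlate2d_same`, `gaussian_kernel`, `gradient_magnitude`), the 5x5 kernel size and `sigma=1.0` are illustrative assumptions; non-maximum suppression and thresholding, which a full detector such as Canny would apply afterwards, are omitted.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Naive 'same'-size 2D correlation with edge-replicated borders."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    h, w = image.shape
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel (size and sigma are assumed values)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

# Sobel kernels: horizontal and vertical derivative estimates.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def gradient_magnitude(image, sigma=1.0):
    """Smooth a grayscale image, then return its Sobel gradient magnitude."""
    smoothed = correlate2d_same(image.astype(np.float64), gaussian_kernel(5, sigma))
    gx = correlate2d_same(smoothed, SOBEL_X)
    gy = correlate2d_same(smoothed, SOBEL_Y)
    return np.hypot(gx, gy)

# Usage: a synthetic vertical step edge between columns 3 and 4.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
mag = gradient_magnitude(img)
```

In this toy input the response is concentrated around the step and vanishes in the flat regions, which is the behaviour the smoothing-plus-derivative family of detectors relies on.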
3 Dense Extreme Inception Network for Edge Detection
This section presents the architecture proposed for edge detection, termed DexiNed, which consists of a stack of learned filters that receive an image as input and predict an edge-map with the same resolution. DexiNed can be seen as two sub-networks (see Figs. 3 and 4): the Dense extreme inception network (Dexi) and the upsampling block (UB). While Dexi is fed with the RGB image, UB is fed with feature maps from each block of Dexi. The resulting network (DexiNed) generates thin edge-maps, avoiding missed edges in the deep layers. Note that, even without pre-trained weights, the edges predicted by DexiNed are in most cases better than state-of-the-art results; see Fig. 1.
3.1 DexiNed Architecture
The architecture is depicted in Figure 3. It consists of an encoder with 6 main blocks inspired by the Xception network [chollet2017xception]. The network outputs feature maps at each of the main blocks to produce intermediate edge-maps using the upsampling block defined in Section 3.2. All the edge-maps resulting from the upsampling blocks are concatenated to feed the stack of learned filters at the very end of the network and produce a fused edge-map. The six upsampling blocks do not share weights.
The blocks in blue consist of a stack of two convolutional layers with kernel size , followed by batch normalization and ReLU as the activation function (only the last convolutions in the last sub-blocks lack this activation). The max-pool is set by kernel and stride . As the architecture follows multi-scale learning, as in HED, it is followed by an upsampling process (horizontal blocks in gray, Fig. 3; see details in Section 3.2).
Even though DexiNed is inspired by Xception, the similarity lies only in the structure of the main blocks and connections. The major differences are detailed below:
While Xception uses separable convolutions, DexiNed uses standard convolutions.
As the output is a 2D edge-map, there is no "exit flow"; instead, another block has been added at the end of block 5. This block has 256 filters and, as in block 5, has no max-pooling operator.
In block 4 and block 5, 512 filters have been set instead of 728. The main blocks are separated by the block connections (rectangles in green) drawn on the top side of Fig. 3.
Concerning skip connections, Xception has one kind of connection, while DexiNed has two types of connections; see rectangles in green on the top and bottom of Fig. 3.
Since many convolutions are performed, every deep block loses important edge features and just one main-connection is not sufficient; as highlighted in DeepEdge [bertasius2015deepedge], from the fourth convolutional layer onward the edge feature loss is more chaotic. Therefore, since block , the output of each sub-block is averaged with the edge-connection (orange squares in Fig. 3). These processes are inspired by ResNet [he2016resnet] and RDN [zhang2018densenet], with the following notes: as shown in Fig. 3, after the max-pooling operation and before summation with the main-connection, the edge-connection is set to average each sub-block's output (see rectangles in green, bottom side); from the max-pool, block , edge-connections feed sub-blocks in block , and ; however, the sub-blocks in are fed just from the block 5 output.
3.2 Upsampling Block
DexiNed has been designed to produce thin edges in order to enhance the visualization of predicted edge-maps. One of the key components of DexiNed for edge thinning is the upsampling block (UB); as can be seen in Fig. 3, each output from the Dexi blocks feeds the UB. The UB consists of conditionally stacked sub-blocks. Each sub-block has 2 layers, one convolutional and the other deconvolutional; there are two types of sub-blocks. The first sub-block (sub-block1) is fed from Dexi or from sub-block2; it is only used when the scale difference between the feature map and the ground truth (GT) is equal to 2. The other sub-block (sub-block2) is used when the difference is greater than 2. This sub-block is iterated until the feature map scale reaches 2 with respect to the GT. Sub-block1 is set as follows: kernel size of the conv layer ; followed by a ReLU activation function; kernel size of the deconv layer (transpose convolution) , where is the input feature map scale level; both layers return one filter, and the last one gives a feature map with the same size as the GT. The last conv layer does not have an activation function. Sub-block2 is set similarly to sub-block1, with just one difference in the number of filters, which is 16 instead of 1. For example, the output feature maps from block 6 in Dexi have the scale of , so there will be three iterations of sub-block2 before feeding sub-block1. The upsampling process of the second layer of the sub-blocks can be performed by bi-linear interpolation, sub-pixel convolution or transpose convolution; see Sec. 5 for details.
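The conditional stacking of sub-blocks can be illustrated with a short pure-Python sketch. It only computes which sub-blocks would be applied for a given scale difference; it does not build any layers. The function name `upsampling_plan` is hypothetical, and the sketch assumes that every sub-block doubles the spatial resolution and that scales are powers of two. The scale of 16 in the usage line is likewise an assumption, chosen because it yields the three sub-block2 iterations mentioned in the text.

```python
def upsampling_plan(scale):
    """List the UB sub-blocks for a feature map `scale` times smaller than the GT.

    Each entry is (sub-block name, number of output filters).
    Assumption: every sub-block upsamples by a factor of 2.
    """
    if scale < 2 or scale & (scale - 1):
        raise ValueError("scale must be a power of two, at least 2")
    plan = []
    while scale > 2:
        # Scale difference greater than 2: iterate sub-block2 (16 filters).
        plan.append(("sub-block2", 16))
        scale //= 2
    # Scale difference equal to 2: finish with sub-block1 (1 filter).
    plan.append(("sub-block1", 1))
    return plan

# Usage: an assumed scale of 16 gives three sub-block2 steps, then sub-block1.
plan_16 = upsampling_plan(16)
```

Under these assumptions, an output at scale 2 goes straight to sub-block1, while deeper outputs pass through progressively more sub-block2 iterations first.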
3.3 Loss Functions
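DexiNed can be summarized as a regression function $\eth$, that is, $\hat{Y} = \eth(X, Y)$, where $X$ is an input image, $Y$ is its respective ground truth, and $\hat{Y}$ is a set of predicted edge maps. $\hat{Y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N]$, where $\hat{y}_i$ has the same size as $Y$, and $N$ is the number of outputs from each upsampling block (horizontal rectangles in gray, Fig. 3); $\hat{y}_N$ is the result from the last fusion layer. Then, as the model is deeply supervised, it uses the same loss as [xie2017hed] (weighted cross-entropy), which is defined as follows:
$$l^{n}(W, w^{n}) = -\beta \sum_{j \in Y^{+}} \log \sigma(y_j = 1 \mid X; W, w^{n}) - (1 - \beta) \sum_{j \in Y^{-}} \log \sigma(y_j = 0 \mid X; W, w^{n}),$$
$$\mathcal{L}(W, w) = \sum_{n=1}^{N} \delta^{n} \, l^{n}(W, w^{n}),$$
where $W$ is the collection of all network parameters and $w^{n}$ is the corresponding parameter of the $n$-th output, and $\delta^{n}$ is a weight for each scale level. $\beta = |Y^{-}| / |Y^{+} + Y^{-}|$ and $(1 - \beta) = |Y^{+}| / |Y^{+} + Y^{-}|$ ($Y^{+}$ and $Y^{-}$ denote the edge and non-edge pixels in the ground truth). See Section 4.4 for hyper-parameters and optimizer details for the regularization in the training process.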
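The class-balanced cross-entropy described above can be sketched in NumPy as follows. This is an illustrative re-implementation under stated assumptions, not the authors' TensorFlow code: the function names, the `eps` clipping constant and the use of already-sigmoided probabilities as input are all choices made for this example. The edge term is weighted by the fraction of non-edge pixels and the non-edge term by the fraction of edge pixels, and the per-output losses are combined with one weight per scale level.

```python
import numpy as np

def weighted_cross_entropy(pred, gt, eps=1e-7):
    """Class-balanced cross-entropy for one predicted edge-map.

    pred: sigmoid probabilities in (0, 1); gt: binary map, 1 = edge pixel.
    """
    pred = np.clip(pred, eps, 1.0 - eps)      # avoid log(0)
    n_pos = gt.sum()                          # number of edge pixels
    n_neg = gt.size - n_pos                   # number of non-edge pixels
    beta = n_neg / (n_pos + n_neg)            # weight applied to edge pixels
    edge_term = -beta * np.sum(gt * np.log(pred))
    non_edge_term = -(1.0 - beta) * np.sum((1.0 - gt) * np.log(1.0 - pred))
    return edge_term + non_edge_term

def deep_supervised_loss(preds, gt, deltas):
    """Sum of the per-output losses, each scaled by its scale-level weight."""
    return sum(d * weighted_cross_entropy(p, gt) for p, d in zip(preds, deltas))

# Usage: one edge pixel out of four, with a good and a poor prediction.
gt = np.array([[1.0, 0.0, 0.0, 0.0]])
good = np.array([[0.9, 0.1, 0.1, 0.1]])
poor = np.array([[0.1, 0.9, 0.9, 0.9]])
total = deep_supervised_loss([good, poor], gt, deltas=[1.0, 1.0])
```

Because edge pixels are rare, this weighting keeps the few positive pixels from being swamped by the many negatives.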
4 Experimental Setup
This section presents details on the datasets used for evaluating the proposed model, in particular the dataset and annotations (BIPED) generated for an accurate training of the proposed DexiNed. Additionally, details on the evaluation metrics and network’s parameters are provided.
4.1 Barcelona Images for Perceptual Edge Detection (BIPED)
The other contribution of the paper is a carefully annotated edge dataset. It contains 250 outdoor images of 1280×720 pixels each. These images have been carefully annotated by experts in the computer vision field, hence no redundancy in the annotations has been considered. In spite of that, all results have been cross-checked in order to correct possible mistakes or wrong edges. This dataset is publicly available as a benchmark for evaluating edge detection algorithms. The generation of a new dataset is motivated by the lack of annotated edges in most available datasets. These missing edges affect both the training of the algorithms and the performance measurements. Some examples of these missing or wrong edges can be seen in the ground truths presented in Fig. 8; edge detection algorithms that find these missing edges are therefore penalized during evaluation. The level of detail of the dataset annotated in the current work can be appreciated by looking at the GT; see Figs. 5 and 7. In order to make a fair comparison between the different state-of-the-art approaches proposed in the literature, the BIPED dataset has been used for training those approaches, which have later been evaluated in terms of ODS, OIS, and AP. From the BIPED dataset, 50 images have been randomly selected for testing and the remaining 200 for training and validation. In order to increase the number of training images, a data augmentation process has been performed as follows: i) as BIPED images are in high resolution, they are split at half of the image width; ii) similarly to HED, each of the resulting images is rotated by 15 different angles and cropped by the inner oriented rectangle; iii) the images are horizontally flipped; and finally iv) two gamma corrections have been applied (0.3030, 0.6060). This augmentation process resulted in 288 images for each of the 200 training images.
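Three of the augmentation steps listed above (width split, horizontal flip and gamma correction) can be sketched in NumPy as shown below. This is a simplified illustration, not the authors' pipeline: the rotation-and-crop step is omitted, the function names are hypothetical, and the power-law form `image ** gamma` on images scaled to [0, 1] is an assumption, since the text only gives the two gamma values. The sketch therefore does not reproduce the per-image count reported above.

```python
import numpy as np

GAMMAS = (0.3030, 0.6060)  # gamma values quoted in the text

def split_halves(image):
    """Step i): split an H x W (x C) image at half of its width."""
    half = image.shape[1] // 2
    return [image[:, :half], image[:, half:]]

def hflip(image):
    """Step iii): horizontal flip."""
    return image[:, ::-1]

def gamma_correct(image, gamma):
    """Step iv): gamma correction, assumed here to be out = in ** gamma."""
    return np.clip(image, 0.0, 1.0) ** gamma

def augment(image):
    """Apply steps i), iii) and iv); step ii) (rotation) is left out."""
    out = []
    for half in split_halves(image):
        for view in (half, hflip(half)):
            out.append(view)
            out.extend(gamma_correct(view, g) for g in GAMMAS)
    return out

# Usage on a small random stand-in for an RGB image scaled to [0, 1].
rng = np.random.default_rng(0)
samples = augment(rng.random((4, 8, 3)))
```

In a real pipeline the geometric steps (split, rotation, flip) would also have to be applied identically to the ground-truth edge-map, whereas a photometric step such as gamma correction would presumably affect the input image only.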
4.2 Test Datasets
The datasets used to evaluate the performance of DexiNed are summarized below. There is just one dataset intended for edge detection, MBDD [mely2016multicue], while the remaining ones are intended for object contour/boundary extraction/segmentation: CID [grigorescu2003cid], BSDS [martin2001bsds300, arbelaez2011bsds500], NYUD [silberman2012NYUD] and PASCAL [mottaghi2014PASCALcontext]. The last two datasets are generally included in edge detection publications.
MBDD: The Multicue Boundary Detection Dataset was created for psychophysical studies on object boundary detection in natural scenes, from the early vision system. The dataset is composed of short binocular video sequences of natural scenes [mely2016multicue], containing 100 scenes in high definition. Each scene has 5 boundary annotations and 6 edge annotations. From the given dataset, 80 images are used for training and the remaining 20 for testing [mely2016multicue]. In the current work, DexiNed has been evaluated using the first 20 images from the dataset.
CID: This dataset was presented in [grigorescu2003cid], together with a brain-biologically inspired edge detection technique. The main limitation of this dataset is that it contains just 40 images with their respective ground truth edges. It should be highlighted that, in addition to edges, the ground truth maps contain object contours. In this case, DexiNed has been evaluated on the whole CID data.
BSDS: The Berkeley Segmentation Dataset consists of 200 new test images [arbelaez2011bsds500] in addition to the 300 images contained in BSDS300 [martin2001bsds300]. In previous publications, BSDS300 is split into 200 images for training and 100 images for testing. Currently, the 300 images from BSDS300 are used for training and validation, while the remaining 200 images are used for testing. Every image in BSDS is annotated by multiple annotators; this dataset is mainly intended for image segmentation and boundary detection. In the current work, both datasets are evaluated: BSDS500 (200 test images) and BSDS300 (100 test images).
NYUD: The New York University Dataset is a set of 1449 RGBD images covering 464 indoor scenarios, intended for segmentation purposes. This dataset is split by [Gupta_2013NYUDsplit] into three subsets, i.e., training, validation and testing sets. The testing set contains 654 images, while the remaining images are used for training and validation purposes. In the current work, although the proposed model was not trained on this dataset, the testing set has been selected for evaluating DexiNed.
PASCAL: Pascal-Context [mottaghi2014PASCALcontext] is a popular segmentation dataset; currently most major DL methods for edge detection use this dataset for training and testing, both for edge and boundary detection purposes. This dataset contains 11530 annotated images, of which 505 have been considered for testing DexiNed.
4.3 Evaluation Metrics
The evaluation of an edge detector has been well defined since the pioneering work presented in [ziou1998edgeOverview]. Since BIPED has annotated edge-maps as GT, three evaluation metrics widely used in the community are considered: fixed contour threshold (ODS), per-image best threshold (OIS), and average precision (AP). The F-measure (F) [arbelaez2011bsds500] of ODS and OIS is considered, where $F = \frac{2 \times Precision \times Recall}{Precision + Recall}$.
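The difference between the two thresholding protocols can be illustrated with a small NumPy sketch. It assumes that, for every image and every candidate threshold, the counts of true positives, false positives and false negatives are already available; the tolerance-based matching between predicted and annotated edge pixels used by the standard benchmark is not reproduced here, and AP is not computed. The function names are hypothetical.

```python
import numpy as np

def f_measure(tp, fp, fn):
    """F = 2PR / (P + R), written with match counts: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = (np.asarray(a, dtype=np.float64) for a in (tp, fp, fn))
    denom = 2.0 * tp + fp + fn
    return 2.0 * tp / np.maximum(denom, 1e-12)  # yields 0 when nothing matches

def ods_ois(tp, fp, fn):
    """tp, fp, fn: arrays of shape (num_images, num_thresholds)."""
    tp, fp, fn = (np.asarray(a, dtype=np.float64) for a in (tp, fp, fn))
    # ODS: one threshold for the whole dataset; pool the counts per threshold.
    ods = float(f_measure(tp.sum(axis=0), fp.sum(axis=0), fn.sum(axis=0)).max())
    # OIS: the best threshold is chosen per image, then the counts are pooled.
    best = f_measure(tp, fp, fn).argmax(axis=1)
    rows = np.arange(tp.shape[0])
    ois = float(f_measure(tp[rows, best].sum(),
                          fp[rows, best].sum(),
                          fn[rows, best].sum()))
    return ods, ois

# Usage: two images evaluated at two thresholds.
tp = [[8, 5], [4, 4]]
fp = [[2, 0], [6, 1]]
fn = [[2, 5], [1, 1]]
ods, ois = ods_ois(tp, fp, fn)
```

In this toy example the per-image choice of threshold gives OIS = 0.80 against ODS = 0.72, which reflects the general pattern that OIS is at least as high as ODS.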
4.4 Implementation Notes
The implementation is performed in TensorFlow [abadi2016tensorflow]. The model converges after 150k iterations with a batch size of 8 using the Adam optimizer and a learning rate of . The training process takes around 2 days on a TITAN X GPU with color images of size 400x400 as input. 10% of the augmented BIPED images were used for validation. The weights for the fusion layer are initialized as: (see Sec. 3.3 for ). After a hyperparameter search to reduce the number of parameters, the best performance was obtained using kernel sizes of , and on the different convolutional layers of Dexi and UB.
5 Experimental Results
This section presents quantitative and qualitative evaluations conducted with the metrics presented in Sec. 4. Since the proposed DL architecture demands several experiments to be validated, DexiNed has been carefully tuned until reaching its final version.
Table 2 column groups: Edge detection dataset | Contour/boundary detection/segmentation datasets
5.1 Quantitative Results
First, in order to select the upsampling process that achieves the best result, an empirical evaluation has been performed; see Fig. 6(a). The evaluation consists of conducting the same experiments with the three upsampling methods: DexiNed-bdc refers to upsampling performed by a transpose convolution initialized with a bi-linear kernel; DexiNed-dc uses transpose convolution with trainable kernels; and DexiNed-sp uses sub-pixel convolution. According to the F-measure, the three versions of DexiNed get similar results; however, when analyzing the curves in Fig. 6(a), it can be seen that although all of them have the same behaviour (shape), a small difference in the performance of DexiNed-dc appears. As a conclusion, the DexiNed-dc upsampling strategy is selected; from now on, all the evaluations in this section are obtained using DexiNed-dc upsampling, and for simplicity of notation the term DexiNed is used instead of DexiNed-dc.
Figure 6(b) and Table 1(a) present the quantitative results obtained from each DexiNed edge-map prediction. The results from the eight predicted edge-maps are depicted; the best quantitative results, corresponding to the fused (DexiNed-f) and averaged (DexiNed-a) edge-maps, are selected for the comparisons. Similarly to [xie2017hed], the average of all predictions (DexiNed-a) gets the best results in the three evaluation metrics, followed by the prediction generated in the fusion layer. Note that the edge-maps predicted from block 2 to block 6 get results similar to DexiNed-f; this is due to the proposed skip-connections. For a qualitative illustration, Fig. 5 presents all edge-maps predicted by the proposed architecture. Qualitatively, the result from DexiNed-f is considerably better than the one from DexiNed-a (see illustration in Fig. 5). However, according to Table 1(a), DexiNed-a produces slightly better quantitative results than DexiNed-f. As a conclusion, both approaches (fused and averaged) reach similar results; throughout this manuscript, whenever the term DexiNed is used it corresponds to DexiNed-f.
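The relation between the fused and the averaged edge-maps can be sketched in NumPy as follows. Two assumptions are made for illustration only: the fusion layer is modelled as a bias-free 1x1 convolution over the stacked upsampled outputs (a per-pixel weighted sum followed by a sigmoid), and the averaged map is taken as the mean of all side predictions together with the fused one. The function names and the uniform fusion weights in the usage lines are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_prediction(logits, weights):
    """Fused map: per-pixel weighted sum over the N outputs, then sigmoid.

    logits: (N, H, W) upsampled block outputs; weights: (N,) fusion weights.
    """
    return sigmoid(np.tensordot(weights, logits, axes=1))

def averaged_prediction(logits, weights):
    """Averaged map: mean of the N side predictions and the fused prediction."""
    fused = fused_prediction(logits, weights)
    all_maps = np.concatenate([sigmoid(logits), fused[None]], axis=0)
    return all_maps.mean(axis=0)

# Usage with six block outputs of size 2 x 2 and assumed uniform weights.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 2, 2))
weights = np.full(6, 1.0 / 6.0)
y_fused = fused_prediction(logits, weights)
y_avg = averaged_prediction(logits, weights)
```

Averaging tends to smooth the response across scales, which is consistent with the observation above that the averaged map scores slightly higher while the fused map looks sharper.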
Table 1(b) presents a comparison between DexiNed and the state-of-the-art techniques on edge and boundary detection. In all cases, the BIPED dataset has been considered both for training and evaluating the DL based models (i.e., HED [xie2017hed], RCF [liu2017rcf], CED [wang2017ced] and BDCN [he2019edgeBDCN]); the training process for each model took about two days. As can be seen from Table 1(b), DexiNed-a reaches the best results in all evaluation metrics. Actually, both DexiNed-a and DexiNed-f obtain the best results in almost all evaluation metrics. The F-measure obtained by comparing these approaches is presented in Fig. 6(c); it can be seen that for Recall above 75% DexiNed gets the best results. Illustrations of the edges obtained with DexiNed and the state-of-the-art techniques are depicted in Figure 7, for just four images from the BIPED dataset. As can be seen, although RCF and BDCN, the second best ranked algorithms in Table 1(b), obtain quantitative results similar to those of DexiNed, DexiNed predicts qualitatively better results. Note that the proposed approach was trained from scratch without pre-trained weights.
The main objective of DexiNed is to get a precise edge-map from every dataset (RGB or grayscale). Therefore, in order to evaluate this capability, all the datasets presented in Sec. 4.2 have been considered, split into two categories for a fair analysis: one for edge detection and the other for contour/boundary detection/segmentation. Results of edge-maps obtained with state-of-the-art methods are presented in Table 2. It should be noted that, for each dataset, the methods compared with DexiNed have been trained using images from that dataset, while DexiNed is trained just once on BIPED. The values presented in Table 2 are provided for comparison and come from the corresponding publications. It can be seen that DexiNed obtains the best performance on the MBDD dataset. It should be noted that DexiNed is evaluated on CID and BSDS300 even though these datasets contain only a few images, which are not enough for training other approaches (e.g., HED, RCF, CED). Regarding BSDS500, NYUD and PASCAL, DexiNed does not reach the best results, since these datasets are not intended for edge detection; hence the evaluation metrics penalize edges detected by DexiNed. To highlight this situation, Fig. 8 depicts results from Table 2. Two samples from each dataset are considered, selected according to the best and worst F-measure. As shown in Fig. 8, when the image is fully annotated the score reaches around 100%; otherwise it reaches less than 50%.
5.2 Qualitative Results
As highlighted in the previous section, when deep learning based edge detection approaches are evaluated on datasets intended for object boundary detection or object segmentation, the results are penalized. To support this claim, we present in Fig. 8 two predictions (the best and the worst results according to F-measure) from each dataset used for evaluating the proposed approach (except BIPED, which has been used for training). The F-measure obtained on the three most used datasets (i.e., BSDS500, BSDS300 and NYUD) reaches over 80 in those cases where images are fully annotated; otherwise, the F-measure reaches about 30. However, when the edge dataset (MBDD [mely2016multicue]) is considered, the worst F-measure reaches over 75. As a conclusion, edge detection and contour/boundary detection are different problems that need to be tackled separately when a DL based model is considered.
6 Conclusions
A deeply supervised and structured model (DexiNed) for image edge detection is proposed. To the best of our knowledge, it is the first DL based approach able to generate thin edge-maps. A large set of experimental results and comparisons with state-of-the-art approaches is provided, showing the validity of DexiNed. It should be noted that, although it is trained only on the proposed dataset (BIPED), its results when evaluated on other edge oriented datasets outperform the state of the art. A second contribution of the current work is a carefully annotated dataset for edge detection. Future work will focus on tackling the contour and boundary detection problems using the proposed architecture.
This work has been partially supported by: the Spanish Government under Project TIN2017-89723-P; the “CERCA Programme / Generalitat de Catalunya” and the ESPOL project PRAIM (FIEC-09-2015). The authors gratefully acknowledge the support of the CYTED Network: “Ibero-American Thematic Network on ICT Applications for Smart Cities” (REF-518RT0559) and the NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Xavier Soria has been supported by Ecuador government institution SENESCYT under a scholarship contract 2015-AR3R7694.