Contour Detection Using Cost-Sensitive Convolutional Neural Networks
We address the problem of contour detection via per-pixel classification of edge points. To facilitate the process, the proposed approach leverages DenseNet, an efficient implementation of multiscale convolutional neural networks (CNNs), to extract an informative feature vector for each pixel and uses an SVM classifier to accomplish contour detection. The main challenge lies in adapting a pre-trained per-image CNN model to yield per-pixel image features. We build on the DenseNet architecture to achieve pixelwise fine-tuning and then consider a cost-sensitive strategy to further improve the learning with a small dataset of edge and non-edge image patches. In the contour detection experiments, we examine the effectiveness of combining per-pixel features from different CNN layers and obtain performance comparable to the state of the art on BSDS500.
|Jyh-Jing Hwang & Tyng-Luh Liu|
|Institute of Information Science|
1 Introduction
Contour detection is fundamental to a wide range of computer vision applications, including image segmentation (Malik et al., 2001; Arbelaez et al., 2011), object detection (Zitnick & Dollár, 2014) and recognition (Shotton et al., 2008). The task is often carried out by exploring local image cues, such as intensity, color gradients, texture or local structures (Canny, 1986; Martin et al., 2004; Mairal et al., 2008; Arbelaez et al., 2011; Dollár & Zitnick, 2014). For example, Dollár & Zitnick (2014) use structured random forests to learn local edge patterns, and report current state-of-the-art results with impressive computational efficiency. More recently, object cues are also considered in Kivinen et al. (2014) and Ganin & Lempitsky (2014) to further boost the performance. Despite the constant evolution of relevant techniques, seeking an appropriate feature representation remains the cornerstone of such efforts. We are thus motivated to propose a new learning formulation that can generate suitable per-pixel features for more effective contour detection.
We consider deep neural networks to construct a desired per-pixel feature learner. In particular, since the underlying task is essentially a classification problem, we adopt deep convolutional neural networks (CNNs) to establish a discriminative approach. However, one subtle deviation from typical applications of CNNs should be emphasized. In our method, we intend to use a CNN architecture, e.g., AlexNet (Krizhevsky et al., 2012), to generate features for each image pixel, not just a single feature vector for the whole input image. Such a distinction calls for a different perspective on parameter fine-tuning so that a per-image CNN pre-trained on ImageNet (Deng et al., 2009) can be adapted into a new model for per-pixel edge classification. To further investigate the properties of the features from different convolutional layers and from various ensembles, we carry out a number of experiments to evaluate their effectiveness in performing contour detection on the benchmark Berkeley Segmentation Dataset (Martin et al., 2001).
The organization of the paper is as follows. Section 2 reviews related work on contour detection and deep convolutional neural networks. In Section 3, we describe the overall model for learning per-pixel features and useful techniques for fine-tuning the parameters. Section 4 provides detailed experimental results and comparisons to demonstrate the advantages of our method. In Section 5 we discuss the key ideas of the proposed techniques and possible future research efforts.
2 Related Work
As stated, we focus on using a deep convolutional neural network to achieve feature learning for improving contour detection. We therefore survey recent progress in each of these two areas.
2.1 Contour Detection
Early techniques for contour detection (Fram & Deutsch, 1975; Canny, 1986; Perona & Malik, 1990) mainly concern local image cues, such as intensity and color gradients. Among them, the Canny detector (Canny, 1986) stands out for its simplicity and accuracy, owing to its use of the peak gradient magnitude orthogonal to the contour direction. Detailed discussions about these approaches can be found in, e.g., Bowyer et al. (1999). Subsequent work along this line (Martin et al., 2004; Mairal et al., 2008; Arbelaez et al., 2011) also identifies textures as useful local cues for increasing the detection accuracy.
Apart from detecting local cues, learning-based techniques form a notable group in addressing this task (Dollár et al., 2006; Mairal et al., 2008; Zheng et al., 2010; Xiaofeng & Bo, 2012; Lim et al., 2013; Dollár & Zitnick, 2014; Kivinen et al., 2014; Ganin & Lempitsky, 2014). Dollár et al. (2006) adopt a boosted classifier to independently label each pixel by learning from its surrounding image patch. Zheng et al. (2010) analyze the combination of low-, mid-, and high-level information to detect object-specific contours. In addition, Xiaofeng & Bo (2012) propose to compute sparse code gradients and improve on the results of Arbelaez et al. (2011).
Lim et al. (2013) classify edge patches into sketch tokens using random forest classifiers, which can capture local edge structures. Isola et al. (2014) consider pointwise mutual information to extract global object contours. Their results display crisp and clean contours. As in Lim et al. (2013), Dollár & Zitnick (2014) use structured random forests to learn edge patches, and achieve current state-of-the-art results in both accuracy and efficiency.
More relevant to our approach, Kivinen et al. (2014) and Ganin & Lempitsky (2014) learn contour information with deep architectures. Kivinen et al. (2014) encode and decode contours using multi-layer mean-and-covariance restricted Boltzmann machines. Ganin & Lempitsky (2014) establish a deep architecture, composed of convolutional neural networks and nearest neighbor search, and obtain convincing results. Different from Ganin & Lempitsky (2014), we focus on designing fine-tuning mechanisms that use a small dataset to adapt an ImageNet pre-trained convolutional neural network for producing per-pixel image features. As we will see later, this effort leads to state-of-the-art contour detection results on the benchmark.
2.2 Convolutional Neural Networks
CNNs were popularized by LeCun and colleagues, who first applied them to digit recognition (LeCun et al., 1989), OCR (LeCun et al., 1998) and generic object recognition (Jarrett et al., 2009). In contrast to hand-crafted features, CNNs learn discriminative features and exhibit hierarchical semantic information along their deep architecture.
The AlexNet by Krizhevsky et al. (2012) is perhaps the most popular implementation of CNNs for generic object classification. The model has been shown to outperform competing approaches based on traditional features in solving a number of mainstream computer vision problems. In Turaga et al. (2010) and Briggman et al. (2009), CNNs are used for image segmentation, and Farabet et al. (2013) utilize CNNs for semantic segmentation. To extend CNNs to object detection, Sermanet et al. (2013a) use CNNs to predict object locations via sliding windows, while learning multi-stage features of CNNs for pedestrian detection is proposed in Sermanet et al. (2013b). Girshick et al. (2013) also consider features from a deep CNN in a region proposal framework to achieve state-of-the-art object detection results on the PASCAL VOC dataset.
While CNNs thrive in generic object recognition and detection, less attention has been paid to applications demanding per-pixel processing, such as contour detection and segmentation. Our method exploits the AlexNet model for contour detection and explores its per-pixel fine-tuning with a small dataset. Recently and independently of our work, per-pixel features based on CNNs have also been generated in Hariharan et al. (2014) and Long et al. (2014).
3 Per-Pixel CNN Features
Learning features with a deep neural network architecture has been shown to be effective, but most of the existing techniques focus on yielding a single feature vector for an input image (or image patch). Such a design may not be appropriate for vision applications that require investigating image characteristics at the pixel level. In the problem of contour detection, the central task is to decide whether an underlying pixel is an edge point or not. Thus, it would be convenient if the deep network could yield per-pixel features.
We propose to construct a multiscale CNN model for contour detection. To this end, we extract per-pixel CNN features in AlexNet (Krizhevsky et al., 2012) using DenseNet (Iandola et al., 2014), and concatenate them pixelwise to feed into a support vector machine (SVM) classifier. In particular, DenseNet provides fast multiscale feature pyramid extraction for any Caffe convolutional neural network (Jia et al., 2014) and the convenience of working with images of arbitrary size. To extract per-pixel features, we upsample the feature maps from the first convolutional layer (Conv1) through the fifth convolutional layer (Conv5) to the original size of the input image. We then stack the features from different convolutional layers pixelwise to constitute the per-pixel features. Depending on the selection of the convolutional layers, the resulting feature vector at each pixel encodes different levels of information about the underlying pixel. Figure 1 illustrates the case of concatenating features from all five convolutional layers to form a 1376-D feature vector at each pixel.
To decide whether a given pixel is a contour point, one can now readily feed its corresponding feature vector to an SVM classifier. In practice, it is useful to include information from neighboring pixels so that local contour structures can be better distinguished. We consider eight neighboring pixels at a fixed offset from the pixel and append their respective feature vectors to that of the pixel in clockwise order. In our implementation, we have tested different offsets, each of which corresponds to an image patch of a different size.
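The neighbor concatenation can be sketched as follows. This is only an illustration under stated assumptions: the function name `with_neighbors`, the offset parameter `d`, the choice of the top-left neighbor as the starting point, and the clamping of out-of-image coordinates to the border are all hypothetical and are not specified in the text.

```python
# Clockwise offsets (row, col), starting at the top-left neighbor (an assumed start).
CLOCKWISE = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def with_neighbors(grid, r, c, d=1):
    """Return the center feature vector followed by its eight neighbors' vectors.

    `grid[r][c]` is the feature vector (a list) at pixel (r, c); `d` is the
    neighbor offset in pixels. Out-of-image coordinates are clamped to the
    border, which is an assumption made for this sketch.
    """
    h, w = len(grid), len(grid[0])
    out = list(grid[r][c])
    for dr, dc in CLOCKWISE:
        rr = min(max(r + dr * d, 0), h - 1)
        cc = min(max(c + dc * d, 0), w - 1)
        out.extend(grid[rr][cc])
    return out


# 3x3 toy grid whose 1-D feature at each pixel is the pixel index 0..8.
toy = [[[3 * r + c] for c in range(3)] for r in range(3)]
center = with_neighbors(toy, 1, 1)
corner = with_neighbors(toy, 0, 0)
```

With a 1376-D vector at every pixel, the concatenated vector would have 9 × 1376 = 12384 dimensions.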
3.1 DenseNet Feature Pyramids
We use DenseNet for CNN feature extraction because of its efficiency, flexibility, and availability. DenseNet is an open source system that computes dense and multiscale features from the convolutional layers of a Caffe CNN-based object classifier. The feature extraction proceeds as follows. Given an input image, DenseNet computes its multiscale versions and stitches them onto a large plane. After the whole plane is processed by the CNN, DenseNet unstitches the descriptor planes to obtain multiresolution CNN descriptors.
The spatial dimensions of the convolutional features are fractions of the image size, e.g., one-fourth for Conv1, and one-eighth for Conv2. We rescale the feature maps of all the convolutional layers to the image size, so that there is a feature vector at every pixel. As illustrated in Figure 1, the dimension of the resulting feature vector is 1376, obtained by concatenating Conv1 (96), Conv2 (256), Conv3 (384), Conv4 (384), and Conv5 (256).
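The rescale-and-stack step can be sketched in plain Python as below. This is a minimal illustration, not the DenseNet implementation: nearest-neighbor interpolation is an assumption (the text does not name the interpolation), and the helper names `upsample_nearest` and `stack_per_pixel` are hypothetical.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Resize one 2-D feature map (list of rows) to out_h x out_w by nearest neighbor."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def stack_per_pixel(layers, out_h, out_w):
    """Concatenate upsampled channels from every layer into one vector per pixel.

    `layers` is a list of layers; each layer is a list of 2-D channel maps.
    Returns an out_h x out_w grid whose cells are feature vectors (lists).
    """
    grid = [[[] for _ in range(out_w)] for _ in range(out_h)]
    for channels in layers:
        for fmap in channels:
            up = upsample_nearest(fmap, out_h, out_w)
            for r in range(out_h):
                for c in range(out_w):
                    grid[r][c].append(up[r][c])
    return grid


# AlexNet channel counts for Conv1..Conv5; they sum to the 1376-D vector.
ALEXNET_CHANNELS = [96, 256, 384, 384, 256]
total_dim = sum(ALEXNET_CHANNELS)

# Toy example: a 4x4 image, one 1x1 map ("coarse") and one 2x2 map ("fine").
coarse = [[[7]]]
fine = [[[1, 2], [3, 4]]]
features = stack_per_pixel([coarse, fine], 4, 4)
```

In the toy example every pixel receives a 2-D vector: the single coarse value followed by the value of the fine map cell that covers it.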
For classification, we first concatenate features from the surrounding eight pixels to incorporate information about the local contour structure, and then use the combined per-pixel feature vectors to train a binary linear SVM. Specifically, in our multiscale setting, we train the SVM on the original resolution only. At test time, we classify test images at both the original and the double resolution, and average the two resulting edge maps to form the final output of contour detection.
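The test-time averaging of the two resolutions can be sketched as follows. The text does not say how the double-resolution edge map is brought back to the original size, so the 2×2 block averaging used here is an assumption, and the function names are hypothetical.

```python
def downsample_2x(edge_map):
    """Average each 2x2 block, halving both dimensions (assumes even sizes)."""
    h, w = len(edge_map) // 2, len(edge_map[0]) // 2
    return [
        [
            (edge_map[2 * r][2 * c] + edge_map[2 * r][2 * c + 1]
             + edge_map[2 * r + 1][2 * c] + edge_map[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def average_scales(map_1x, map_2x):
    """Average the original-resolution map with the downsampled double-resolution map."""
    small = downsample_2x(map_2x)
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_1x, small)
    ]


# Toy edge maps: a 1x2 map at the original scale and a 2x4 map at double scale.
map_1x = [[0.5, 0.25]]
map_2x = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]
fused = average_scales(map_1x, map_2x)
```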
3.2 Per-pixel Fine-tuning
To fine-tune parameters for per-pixel contour detection, we exclude the two fully-connected layers of the ImageNet pre-trained CNN model, because they restrict the input image size and consequently the overall architecture. We keep only the five convolutional layers, and on top of Conv5 we add a new 2-way softmax layer for edge classification.
Specifically, the ImageNet pre-trained CNN model expects a fixed-size input image, which is not suitable for our per-pixel design because each map in the Conv5 layer would still span multiple spatial locations. In addition, we need to remove padding in the CNN to conform to DenseNet, which does not use padding (except on the input plane). To carry out per-pixel fine-tuning, we first generate a set of edge and non-edge patches. The image (patch) size is set so that each map reduces to a single location in Conv5, at which the 2-way softmax layer can now properly compute the per-pixel probability of being a contour point. Note that the loss for back-propagation is computed from the label prediction and the ground truth of the center pixel of the input patch.
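The relation between patch size and Conv5 map size can be checked with a short calculation. The sketch below assumes the standard AlexNet kernel sizes and strides up to Conv5, no padding, and floor rounding; these layer parameters are assumptions and are not listed in the text. Under them, a 163 × 163 patch is the smallest input that reduces to 1 × 1 at Conv5.

```python
# (name, kernel, stride) for the assumed AlexNet layers up to Conv5, without padding.
LAYERS = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
]


def conv5_size(n):
    """Side length of a Conv5 map for an n x n input with no padding."""
    for name, k, s in LAYERS:
        if n < k:
            raise ValueError("input too small for layer " + name)
        n = (n - k) // s + 1
    return n


def smallest_input_for(target=1):
    """Smallest input side length that yields a target x target Conv5 map."""
    n = target
    for _, k, s in reversed(LAYERS):
        n = (n - 1) * s + k
    return n


patch_side = smallest_input_for(1)
```

The same function shows that a larger input still produces a grid of Conv5 locations rather than a single one, which is why a dedicated patch size is needed for the per-pixel loss.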
3.3 Cost-sensitive Fine-tuning
Compared with the number of parameters in DenseNet, the size of the training set of edge and non-edge patches is relatively small. Using the aforementioned per-pixel fine-tuning alone is usually insufficient to achieve good performance. Still, when addressing edges, it is evident that certain underlying features are especially crucial for distinguishing edges from non-edges. To further learn these subtle features from a small database, we adopt the concept of cost-sensitive learning. The original 2-way softmax training cost is the negative log-likelihood cost:
$$\ell(x, \theta) = -\sum_{j=0}^{1} \mathbb{1}\{y = j\} \log p(y = j \mid x; \theta), \tag{1}$$
where $x$ is the input image patch, $\theta$ denotes the parameters of the CNN, and $y$ is the binary ($0$ or $1$) edge label. This cost is computed above the 2-way softmax layer, and is back-propagated to train all convolutional layers. To apply cost-sensitive fine-tuning, we consider a biased negative log-likelihood cost:
$$\ell(x, \theta) = -\bigl(\alpha\, \mathbb{1}\{y = 1\} \log p(y = 1 \mid x; \theta) + \beta\, \mathbb{1}\{y = 0\} \log p(y = 0 \mid x; \theta)\bigr), \tag{2}$$
where $\alpha$ and $\beta$ are respectively the biases for positive (edge) and negative (non-edge) training data. If $\alpha = 1$ and $\beta = 1$, (2) reduces to the original negative log-likelihood cost in (1). In our approach, we set $\alpha = 2\beta$ for positive cost-sensitive fine-tuning, and $\beta = 2\alpha$ for negative cost-sensitive fine-tuning. Notice that, rather than directly back-propagating with (2), a convenient alternative strategy is to use biased sampling for fine-tuning with (1). That is, for positive cost-sensitive fine-tuning, we sample twice as many edge patches as non-edge ones, and vice versa for negative cost-sensitive fine-tuning.
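The biased cost and its relation to biased sampling can be illustrated with a small sketch. The names `alpha` and `beta` for the edge and non-edge biases, the score names `z0` and `z1`, and the helper functions are chosen here for illustration and are assumptions. The example checks that weighting edge samples by a factor of two gives the same total cost as duplicating each edge sample under the unbiased cost.

```python
import math


def softmax2(z0, z1):
    """Numerically stable 2-way softmax over the (non-edge, edge) scores."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


def biased_nll(z0, z1, y, alpha=1.0, beta=1.0):
    """Negative log-likelihood weighted by alpha (edge, y=1) or beta (non-edge, y=0)."""
    p0, p1 = softmax2(z0, z1)
    return -alpha * math.log(p1) if y == 1 else -beta * math.log(p0)


def total_cost(batch, alpha=1.0, beta=1.0):
    """Sum the biased cost over (z0, z1, y) triples."""
    return sum(biased_nll(z0, z1, y, alpha, beta) for z0, z1, y in batch)


# Weighting edge samples by 2 matches duplicating them under the unbiased cost.
batch = [(0.3, 1.2, 1), (2.0, -0.5, 0), (0.1, 0.4, 1)]
duplicated = batch + [s for s in batch if s[2] == 1]
weighted = total_cost(batch, alpha=2.0, beta=1.0)
resampled = total_cost(duplicated)
```

With `alpha = beta = 1` the function reduces to the ordinary negative log-likelihood of the 2-way softmax.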
|(a)|ODS|OIS|AP|
|---|---|---|---|
|Pre-trained (baseline)|.604|.620|.546|
|Traditional fine-tune|.573|.585|.524|
|Per-pixel fine-tune|.620|.632|.561|
|Positive fine-tune|.612|.624|.542|
|Negative fine-tune|.624|.639|.566|
|Pos + Neg fine-tune|.638|.650|.579|
3.4 Final Fusion Model
The overall framework is an ensemble model. We combine an ImageNet pre-trained model, a per-pixel fine-tuned model, a positive cost-sensitive fine-tuned model, and a negative cost-sensitive fine-tuned model. We use a heuristic branch-and-bound scheme to decide the fusion coefficients. The idea of fusing differently trained models is to capture different aspects of the features. It is worth mentioning that the improvements owing to the model fusion indicate that the various fine-tuning schemes each have their own merits for feature learning and are all useful in this respect.
4 Experimental Results
We test our method on the Berkeley Segmentation Dataset and Benchmark (BSDS500) (Martin et al., 2001; Arbelaez et al., 2011). To better assess the effects of the various fine-tuning techniques, we report their respective performance of contour detection. Comparisons with other competitive methods are also included to demonstrate the effectiveness of the proposed model.
The BSDS500 dataset is the current de facto standard image collection for contour detection. The dataset contains 200 training, 100 validation, and 200 testing images. Boundaries in each image are labeled by several workers and are averaged to form the ground truth. The accuracy of contour detection is evaluated by three measures: the best F-measure on the dataset for a fixed threshold (ODS), the aggregate F-measure on the dataset for the best threshold in each image (OIS), and the average precision (AP) on the full recall range (Arbelaez et al., 2011). Prior to evaluation, we apply a standard non-maximal suppression technique to edge maps to obtain thinned edges (Canny, 1986).
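The difference between ODS and OIS can be made concrete with a small sketch. It assumes that, for every image and every threshold, the counts of true positives, false positives, and false negatives are already available; the actual benchmark obtains such counts by matching detected edge pixels to the ground truth, which is not reproduced here. The function names and toy counts are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ods(counts):
    """Best dataset-level F-measure over a single threshold shared by all images.

    counts[i][t] = (tp, fp, fn) for image i at threshold index t.
    """
    best = 0.0
    for t in range(len(counts[0])):
        tp = sum(img[t][0] for img in counts)
        fp = sum(img[t][1] for img in counts)
        fn = sum(img[t][2] for img in counts)
        best = max(best, f_measure(tp, fp, fn))
    return best


def ois(counts):
    """Dataset-level F-measure when each image uses its own best threshold."""
    tp = fp = fn = 0
    for img in counts:
        b = max(img, key=lambda c: f_measure(*c))
        tp, fp, fn = tp + b[0], fp + b[1], fn + b[2]
    return f_measure(tp, fp, fn)


# Two toy images evaluated at two thresholds.
toy_counts = [
    [(8, 2, 2), (5, 0, 5)],
    [(4, 6, 0), (3, 1, 1)],
]
ods_score = ods(toy_counts)
ois_score = ois(toy_counts)
```

In the toy data the two images prefer different thresholds, so the per-image choice in OIS yields a higher score than the shared threshold in ODS.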
4.1 On Fine-tuning
The parameter fine-tuning is done on a server with a GeForce GTX Titan Black GPU card. We set the overall learning rate to one tenth of the original ImageNet pre-training learning rate, and the softmax learning rate to ten times the overall learning rate. The modification introduced by the proposed per-pixel fine-tuning speeds up the parameter fine-tuning process: per-pixel fine-tuning finishes in fewer days than traditional fine-tuning requires. For both traditional fine-tuning and per-pixel fine-tuning, we sample a fixed number of boundary (edge) and non-boundary (non-edge) patches per training image. For positive cost-sensitive fine-tuning, we sample more boundary patches than non-boundary patches per training image, and the reverse for negative cost-sensitive fine-tuning.
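The per-image patch sampling can be sketched as below. The patch counts, the seed, and the function name `sample_patch_centers` are hypothetical and only illustrate the positive-biased and negative-biased settings; they are not the numbers used in the experiments.

```python
import random


def sample_patch_centers(edge_pixels, non_edge_pixels, n_edge, n_non_edge, seed=0):
    """Draw patch centers for one training image without replacement.

    Returns a shuffled list of ((row, col), label) pairs, with label 1 for edge
    centers and 0 for non-edge centers.
    """
    rng = random.Random(seed)
    pos = rng.sample(edge_pixels, min(n_edge, len(edge_pixels)))
    neg = rng.sample(non_edge_pixels, min(n_non_edge, len(non_edge_pixels)))
    samples = [(p, 1) for p in pos] + [(p, 0) for p in neg]
    rng.shuffle(samples)
    return samples


# Toy image: 6 edge pixels on the diagonal, 30 non-edge pixels elsewhere.
edge = [(i, i) for i in range(6)]
non_edge = [(r, c) for r in range(6) for c in range(6) if r != c]
positive_biased = sample_patch_centers(edge, non_edge, n_edge=4, n_non_edge=2)
negative_biased = sample_patch_centers(edge, non_edge, n_edge=2, n_non_edge=4)
```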
We report the results of the various fine-tuning techniques in Table 1(a). The experiments use only Conv5 features, and are carried out with SVM classification. Since this setting is most similar to a softmax fine-tuning architecture, we can directly observe the effectiveness of fine-tuning. The results show that, compared with the baseline (the pre-trained model), traditional fine-tuning, which is the original fine-tuning architecture with padding in every layer, degrades performance by .022 (AP) to .035 (OIS). This implies that traditional per-image fine-tuning is not appropriate for learning per-pixel features for per-pixel applications. On the other hand, per-pixel fine-tuning improves the performance by about .012 to .016 in all measurements. As for cost-sensitive fine-tuning, when compared with per-pixel fine-tuning, positive fine-tuning slightly degrades the performance and negative fine-tuning slightly improves it. One possible explanation is that there are relatively more non-boundary regions than boundary points, so features learned for non-boundary regions improve the overall performance. However, if we combine the features of positive and negative fine-tuning, the performance is boosted again, by .018 on every measure relative to per-pixel fine-tuning. The performance gain signifies the complementary property of positive and negative fine-tuning, as expected.
In conclusion, per-pixel fine-tuning raises the performance of per-pixel applications. Also, the combination of positive and negative cost-sensitive fine-tuning improves the classification performance the most, which supports the advantage of using an ensemble fine-tuning model.
|Method|ODS|OIS|AP|
|---|---|---|---|
|gPb-owt-ucm (Arbelaez et al., 2011)|.73|.76|.73|
|Sketch tokens (Lim et al., 2013)|.73|.75|.78|
|SCG (Xiaofeng & Bo, 2012)|.74|.76|.77|
|DeepNet (Kivinen et al., 2014)|.74|.76|.76|
|PMI+sPb, MS (Isola et al., 2014)|.74|.77|.78|
|$N^4$-fields (Ganin & Lempitsky, 2014)|.75|.77|.78|
|SE-MS-HS (Dollár & Zitnick, 2014)|.75|.77|.80|
|5-stream pre-trained model|.75|.77|.77|
|Final fusion model (CSCNN)|.76|.78|.80|
4.2 On Features in Different Layers
We next conduct experiments to show how features from different convolutional layers contribute to the performance. In Table 1(b), we see that features in the second convolutional layer contribute the most, followed by the third and the fourth layers. This suggests that low- to mid-level features are most useful for contour detection, while the lowest- and higher-level features provide an additional boost. Although features in the first and the fifth convolutional layers are less effective when employed alone, we achieve the best results by combining all five streams. This indicates that the local edge information in low-level features and the object contour information in higher-level features are both necessary for achieving high performance in contour detection tasks.
4.3 Contour Detection Results and Comparisons
Finally, we show the experimental results of our pre-trained model and final fusion model. In Table 2, we report the contour detection performance on BSDS500 of our methods and seven competitive techniques, including gPb (Arbelaez et al., 2011), Sketch Tokens (Lim et al., 2013), Sparse Code Gradients (Xiaofeng & Bo, 2012), DeepNet (Kivinen et al., 2014), Pointwise Mutual Information (Isola et al., 2014), $N^4$-fields (Ganin & Lempitsky, 2014) and Structured Edges (Dollár & Zitnick, 2014). While our 5-stream ImageNet pre-trained model (using features from all five convolutional layers) already achieves impressive results for contour detection on the ODS and OIS measurements, the proposed fine-tuning techniques further improve the performance. In particular, the final ensemble model improves from .75 to .76 on the ODS measurement, and from .77 to .78 on the OIS measurement. It also achieves state-of-the-art performance on the AP measurement (.80). In Figure 2, we include a number of contour detection examples for qualitative visualization.
5 Conclusion
In this work, we describe how to tailor the DenseNet architecture to per-pixel computer vision problems, such as contour detection. We propose fine-tuning techniques to more effectively carry out parameter learning with a per-pixel cost function and to overcome the limitation of using a small training set. The resulting cost-sensitive model appears promising for generating useful per-pixel feature vectors and should be useful for computer vision applications that require analyzing local image properties. An interesting future research direction is to establish a proper dimensionality reduction framework for the resulting high-dimensional per-pixel feature vectors and to examine its effects on the performance of contour detection.
References
- Arbelaez et al. (2011) Arbelaez, Pablo, Maire, Michael, Fowlkes, Charless, and Malik, Jitendra. Contour detection and hierarchical image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):898–916, 2011.
- Bowyer et al. (1999) Bowyer, Kevin, Kranenburg, Christine, and Dougherty, Sean. Edge detector evaluation using empirical roc curves. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 1. IEEE, 1999.
- Briggman et al. (2009) Briggman, Kevin, Denk, Winfried, Seung, Sebastian, Helmstaedter, Moritz N, and Turaga, Srinivas C. Maximin affinity learning of image segmentation. In Advances in Neural Information Processing Systems, pp. 1865–1873, 2009.
- Canny (1986) Canny, John. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 8(6):679–698, 1986.
- Deng et al. (2009) Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009.
- Dollár & Zitnick (2014) Dollár, Piotr and Zitnick, C Lawrence. Fast edge detection using structured forests. arXiv preprint arXiv:1406.5549, 2014.
- Dollár et al. (2006) Dollár, Piotr, Tu, Zhuowen, and Belongie, Serge. Supervised learning of edges and object boundaries. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pp. 1964–1971. IEEE, 2006.
- Farabet et al. (2013) Farabet, Clement, Couprie, Camille, Najman, Laurent, and LeCun, Yann. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1915–1929, 2013.
- Fram & Deutsch (1975) Fram, Jerry R. and Deutsch, Edward S. On the quantitative evaluation of edge detection schemes and their comparison with human performance. Computers, IEEE Transactions on, C-24(6):616–628, 1975.
- Ganin & Lempitsky (2014) Ganin, Yaroslav and Lempitsky, Victor. $N^4$-fields: Neural network nearest neighbor fields for image transforms. arXiv preprint arXiv:1406.6558, 2014.
- Girshick et al. (2013) Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
- Hariharan et al. (2014) Hariharan, Bharath, Arbeláez, Pablo, Girshick, Ross, and Malik, Jitendra. Hypercolumns for object segmentation and fine-grained localization. arXiv preprint arXiv:1411.5752, 2014.
- Iandola et al. (2014) Iandola, Forrest, Moskewicz, Matt, Karayev, Sergey, Girshick, Ross, Darrell, Trevor, and Keutzer, Kurt. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014.
- Isola et al. (2014) Isola, Phillip, Zoran, Daniel, Krishnan, Dilip, and Adelson, Edward H. Crisp boundary detection using pointwise mutual information. In Computer Vision–ECCV 2014, pp. 799–814. Springer, 2014.
- Jarrett et al. (2009) Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, M, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146–2153. IEEE, 2009.
- Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- Kivinen et al. (2014) Kivinen, Jyri J, Williams, Christopher KI, and Heess, Nicolas. Visual boundary prediction: A deep neural prediction network and quality dissection. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 512–521, 2014.
- Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- LeCun et al. (1989) LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E, Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
- LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lim et al. (2013) Lim, Joseph J, Zitnick, C Lawrence, and Dollár, Piotr. Sketch tokens: A learned mid-level representation for contour and object detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3158–3165. IEEE, 2013.
- Long et al. (2014) Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
- Mairal et al. (2008) Mairal, Julien, Leordeanu, Marius, Bach, Francis, Hebert, Martial, and Ponce, Jean. Discriminative sparse image models for class-specific edge detection and image interpretation. In Computer Vision–ECCV 2008, pp. 43–56. Springer, 2008.
- Malik et al. (2001) Malik, Jitendra, Belongie, Serge, Leung, Thomas, and Shi, Jianbo. Contour and texture analysis for image segmentation. International journal of computer vision, 43(1):7–27, 2001.
- Martin et al. (2001) Martin, David, Fowlkes, Charless, Tal, Doron, and Malik, Jitendra. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pp. 416–423. IEEE, 2001.
- Martin et al. (2004) Martin, David R, Fowlkes, Charless C, and Malik, Jitendra. Learning to detect natural image boundaries using local brightness, color, and texture cues. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(5):530–549, 2004.
- Perona & Malik (1990) Perona, Pietro and Malik, Jitendra. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(7):629–639, 1990.
- Sermanet et al. (2013a) Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013a.
- Sermanet et al. (2013b) Sermanet, Pierre, Kavukcuoglu, Koray, Chintala, Soumith, and LeCun, Yann. Pedestrian detection with unsupervised multi-stage feature learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3626–3633. IEEE, 2013b.
- Shotton et al. (2008) Shotton, Jamie, Blake, Andrew, and Cipolla, Roberto. Multiscale categorical object recognition using contour fragments. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(7):1270–1281, 2008.
- Turaga et al. (2010) Turaga, Srinivas C, Murray, Joseph F, Jain, Viren, Roth, Fabian, Helmstaedter, Moritz, Briggman, Kevin, Denk, Winfried, and Seung, H Sebastian. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.
- Xiaofeng & Bo (2012) Xiaofeng, Ren and Bo, Liefeng. Discriminatively trained sparse code gradients for contour detection. In Advances in neural information processing systems, pp. 584–592, 2012.
- Zheng et al. (2010) Zheng, Songfeng, Yuille, Alan, and Tu, Zhuowen. Detecting object boundaries using low-, mid-, and high-level information. Computer Vision and Image Understanding, 114(10):1055–1067, 2010.
- Zitnick & Dollár (2014) Zitnick, C Lawrence and Dollár, Piotr. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pp. 391–405. Springer, 2014.