Reconstructive Sparse Code Transfer for Contour Detection and Semantic Labeling
We frame the task of predicting a semantic labeling as a sparse reconstruction procedure that applies a target-specific learned transfer function to a generic deep sparse code representation of an image. This strategy partitions training into two distinct stages. First, in an unsupervised manner, we learn a set of dictionaries optimized for sparse coding of image patches. These generic dictionaries minimize error with respect to representing image appearance and are independent of any particular target task. We train a multilayer representation via recursive sparse dictionary learning on pooled codes output by earlier layers. Second, we encode all training images with the generic dictionaries and learn a transfer function that optimizes reconstruction of patches extracted from annotated ground-truth given the sparse codes of their corresponding image patches. At test time, we encode a novel image using the generic dictionaries and then reconstruct using the transfer function. The output reconstruction is a semantic labeling of the test image.
Applying this strategy to the task of contour detection, we demonstrate performance competitive with state-of-the-art systems. Unlike almost all prior work, our approach obviates the need for any form of hand-designed features or filters. Our model is entirely learned from image and ground-truth patches, with only patch sizes, dictionary sizes and sparsity levels, and depth of the network as chosen parameters. To illustrate the general applicability of our approach, we also show initial results on the task of semantic part labeling of human faces.
The effectiveness of our data-driven approach opens new avenues for research on deep sparse representations. Our classifiers utilize this representation in a novel manner. Rather than acting on nodes in the deepest layer, they attach to nodes along a slice through multiple layers of the network in order to make predictions about local patches. Our flexible combination of a generatively learned sparse representation with discriminatively trained transfer classifiers extends the notion of sparse reconstruction to encompass arbitrary semantic labeling tasks.
1 Introduction
A multitude of recent work establishes the power of learning hierarchical representations for visual recognition tasks. Noteworthy examples include deep autoencoders [1], deep convolutional networks [2, 3], deconvolutional networks [4], hierarchical sparse coding [5], and multipath sparse coding [6]. Though modeling choices and learning techniques vary, these architectures share the overall strategy of concatenating coding (or convolution) operations followed by pooling operations in a repeating series of layers. Typically, the representation at the topmost layer (or pooled codes from multiple layers) serves as an input feature vector for an auxiliary classifier, such as a support vector machine (SVM), tasked with assigning a category label to the image.
Our work is motivated by exploration of the information content of the representation constructed by the rest of the network. While the topmost or pooled features robustly encode object category, what semantics can be extracted from the spatially distributed activations in the earlier network layers? Previous work attacks this question through development of tools for visualizing and probing network behavior [7]. We provide a direct result: a multilayer slice above a particular spatial location contains sufficient information for semantic labeling of a local patch. Combining predicted labels across overlapping patches yields a semantic segmentation of the entire image.
In the case of contour detection (regarded as a binary labeling problem), we show that a single layer sparse representation (albeit over multiple image scales and patch sizes) suffices to recover most edges, while a second layer adds the ability to differentiate (and suppress) texture edges. This suggests that contour detection (and its dual problem, image segmentation) emerges implicitly as a byproduct of deep representations.
Moreover, our reconstruction algorithm is not specific to contours. It is a recipe for transforming a generic sparse representation into a task-specific semantic labeling. We are able to reuse the same multilayer network structure for contours in order to train a system for semantic segmentation of human faces.
We make these claims in the specific context of the multipath sparse coding architecture of Bo et al. [6]. We learn sparse codes for different patch resolutions on image input, and, for deeper layers, on pooled and subsampled sparse representations of earlier layers. However, instead of a final step that pools codes into a single feature vector for the entire image, we use the distributed encoding in the setting of sparse reconstruction. This encoding associates a high-dimensional sparse feature vector with each pixel. For the traditional image denoising reconstruction task, convolving these vectors with the patch dictionary from the encoding stage and averaging overlapping areas yields a denoised version of the original image [9].
Our strategy is to instead swap in an entirely different dictionary for use in reconstruction. Here we generalize the notion of “dictionary” to include any function which takes a sparse feature vector as input and outputs predicted labels for a patch. Throughout the paper, these transfer dictionaries take the form of a set of logistic regression functions: one function for predicting the label of each pixel in the output patch. For a simplified toy example, Figure 1 illustrates the reconstruction obtained with such a dictionary learned for the contour detection task. Figure 2 diagrams the much larger multipath sparse coding network that our actual system uses to generate high-dimensional sparse representations. The structural similarity to the multipath network of Bo et al. [6] is by design. They tap part of such a network for object recognition; we tap a different part of the network for semantic segmentation. This suggests that it may be possible to use an underlying shared representation for both tasks.
In addition to being an implicit aspect of deep representations used for object recognition, our approach to contour detection is entirely free of reliance on hand-crafted features. As Section 2 reviews, this characteristic is unique amongst competing contour detection algorithms. Sections 3, 4, and 5 describe the technical details behind our two-stage approach of sparse coding and reconstructive transfer. Section 6 visualizes and benchmarks results for our primary application of contour detection on the Berkeley segmentation dataset (BSDS) [12]. We also show results for a secondary application of semantic part labeling on the Labeled Faces in the Wild (LFW) dataset [13, 14]. Section 7 concludes.
2 Related Work
Contour detection has long been a major research focus in computer vision. Arbeláez et al. [8] catalogue a vast set of historical and modern algorithms. Three different approaches [8, 15, 16] appear competitive for state-of-the-art accuracy. Arbeláez et al. [8] derive pairwise pixel affinities from local color and texture gradients [17] and apply spectral clustering [18] followed by morphological operations to obtain a global boundary map.
Ren and Bo [15] adopt the same pipeline, but use gradients of sparse codes instead of the color and texture gradients developed by Martin et al. [17]. Note that this is completely different from the manner in which we propose to use sparse coding for contour detection. In [15], sparse codes from a dictionary of small patches serve as replacement for the textons [19] used in previous work [17, 8]. Borrowing the hand-designed filtering scheme of [17], half-discs at multiple orientations act as regions over which codes are pooled into feature vectors and then classified using an SVM. In contrast, we use a range of patch resolutions, without hand-designed gradient operations, in a reconstructive setting through application of a learned transfer dictionary. Our sparse codes assume a role different than that of serving as glorified textons.
Dollár and Zitnick [16] learn a random decision forest on feature channels consisting of image color, gradient magnitude at multiple orientations, and pairwise patch differences. They cluster ground-truth edge patches by similarity and train the random forest to predict structured output. The emphasis on describing local edge structure in both [16] and previous work [20, 21] matches our intuition. However, sparse coding offers a more flexible methodology for achieving this goal. Unlike [16], we learn directly from image data (not predefined features), in an unsupervised manner, a generic (not contour-specific) representation, which can then be ported to many tasks via a second stage of supervised transfer learning.
Mairal et al. [22] use sparse models as the foundation for developing an edge detector. However, they focus on discriminative dictionary training and per-pixel labeling using a linear classifier on feature vectors derived from error residuals during sparse coding of patches. This scheme does not benefit from the spatial averaging of overlapping predictions that occurs in structured output paradigms such as [16] and our proposed algorithm. It also does not incorporate deeper layers of coding, an aspect we find to be crucial for capturing texture characteristics in the sparse representation.
Yang et al. [23] study the problem of learning dictionaries for coupled feature spaces with image super-resolution as an application. We share their motivation of utilizing sparse coding in a transfer learning context. As the following sections detail, we differ in our choice of a modular training procedure split into distinct unsupervised (generic) and supervised (transfer) phases. We are unique in targeting contour detection and face part labeling as applications.
3 Sparse Representation
Given image $I$ consisting of $c$ channels ($c = 3$ for an RGB color image) defined over a 2-dimensional grid, our sparse coding problem is to represent each $m \times m \times c$ patch $x \in I$ as a sparse linear combination $z$ of elements from a dictionary $D = [d_1, d_2, \ldots, d_L] \in \mathbb{R}^{(m \cdot m \cdot c) \times L}$. From a collection of patches $X = [x_1, x_2, \ldots]$ randomly sampled from a set of training images, we learn the corresponding sparse representations $Z = [z_1, z_2, \ldots]$ as well as the dictionary $D$ using the MI-KSVD algorithm proposed by Bo et al. [6]. MI-KSVD finds an approximate solution to the following optimization problem:
$$\operatorname*{argmin}_{D, Z} \; \|X - DZ\|_F^2 + \lambda \sum_{i=1}^{L} \sum_{j \neq i} |d_i^\top d_j| \quad \text{s.t.} \quad \forall i,\ \|d_i\|_2 = 1 \ \text{ and } \ \forall n,\ \|z_n\|_0 \leq K$$
where $\|\cdot\|_F$ denotes Frobenius norm, $K$ is the desired sparsity level, and $\lambda$ weights the mutual incoherence penalty. MI-KSVD adapts KSVD [24] by balancing reconstruction error with mutual incoherence of the dictionary. This unsupervised training stage is blind to any task-specific uses of the sparse representation.
Once the dictionary $D$ is fixed, the desired encoding $z$ of a novel patch $x$ is:
$$\operatorname*{argmin}_{z} \; \|x - Dz\|_2^2 \quad \text{s.t.} \quad \|z\|_0 \leq K$$
Obtaining the exact optimal $z$ is NP-hard, but the orthogonal matching pursuit (OMP) algorithm [10] is a greedy iterative routine that works well in practice. Over each of $K$ rounds, it selects the dictionary atom (codeword) best correlated with the residual after orthogonal projection onto the span of previously selected codewords. Batch orthogonal matching pursuit [11] precomputes correlations between codewords to significantly speed the process of coding many signals against the same dictionary. We extract the patch surrounding each pixel in an image and encode all patches using batch orthogonal matching pursuit.
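The greedy selection loop described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes unit-norm dictionary columns; the function name `omp` and its arguments are hypothetical, and it is plain OMP for a single signal rather than the batch variant, which additionally precomputes correlations between codewords.

```python
import numpy as np

def omp(D, x, K):
    """Greedy orthogonal matching pursuit: encode x with at most K atoms of D.

    D: (d, L) dictionary with unit-norm columns; x: (d,) signal.
    Returns a length-L coefficient vector with at most K nonzeros.
    """
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(K):
        # Pick the atom most correlated with the current residual.
        correlations = D.T @ residual
        if support:
            correlations[support] = 0.0  # never reselect an atom
        j = int(np.argmax(np.abs(correlations)))
        if abs(correlations[j]) < 1e-12:
            break  # residual is already (numerically) explained
        support.append(j)
        # Orthogonal projection: least-squares fit on the selected atoms.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    z = np.zeros(D.shape[1])
    if support:
        z[support] = coeffs
    return z
```

Refitting all selected coefficients by least squares at every round is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every atom chosen so far.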
4 Dictionary Transfer
Coding an image as described in the previous section produces a sparse matrix $Z \in \mathbb{R}^{L \times N}$, where $N$ is the number of pixels in the image and each column of $Z$ has at most $K$ nonzeros. Reshaping each of the $L$ rows of $Z$ into a 2-dimensional grid matching the image size, convolving with the corresponding codeword from $D$, and summing the results approximately reconstructs the original image. Figure 1 (middle) shows an example with the caveat that we drop patch means from the sparse representation and hence also from the reconstruction. Equivalently, one can view $D$ as defining a function that maps a sparse vector $z$ associated with a pixel to a predicted patch which is superimposed on the surrounding image grid and added to overlapping predictions.
We want to replace $D$ with a function $F(z)$ such that applying this procedure with $F$ produces overlapping patch predictions that, when averaged, reconstruct signal $\hat{G}$ which closely approximates some desired ground-truth labeling $G$. $G$ lives on the same 2-dimensional grid as $I$, but may differ in number of channels. For contour detection, $G$ is a single-channel binary image indicating presence or absence of an edge at each pixel. For semantic labeling, $G$ may have as many channels as categories with each channel serving as an indicator function for category presence at every location.
We regard choice of $F$ as a transfer learning problem given examples of sparse representations and corresponding ground-truth, $\{(Z_0, G_0), (Z_1, G_1), \ldots\}$. To further simplify the problem, we consider only patch-wise correspondence. Viewing $Z$ and $G$ as living on the image grid, we sample a collection of $m \times m \times h$ patches $\{g_1, g_2, \ldots\}$ from $\{G_0, G_1, \ldots\}$ along with the length $L$ sparse coefficient vectors located at the center of each sampled patch, $\{z_1, z_2, \ldots\}$. We rectify each of these sparse vectors and append a constant term:
$$\hat{z}_i = [\max(z_i^\top, 0),\ \max(-z_i^\top, 0),\ 1]^\top$$
Our patch-level transfer learning problem is now to find $F$ such that:
$$F(\hat{z}_i) \approx g_i \quad \forall i$$
where $\hat{z}_i \in \mathbb{R}^{2L+1}$ is a vector of sparse coefficients and $g_i \in \mathbb{R}^{m^2 h}$ is a target ground-truth patch. Here, $h$ denotes the number of channels in the ground-truth (and its predicted reconstruction).
While one could still choose any method for modeling $F$, we make an extremely simple and efficient choice, with the expectation that the sparse representation will be rich enough that simple transfer functions will work well. Specifically, we split $F$ into a set $[f_1, f_2, \ldots, f_{m^2 h}]$ of independently trained predictors, one for each of the $m^2 h$ elements of the output patch. Our transfer learning problem is now:
$$f_j(\hat{z}_i) \approx g_i[j] \quad \forall i, j$$
As all experiments in this paper deal with ground-truth in the form of binary indicator vectors, we set each $f_j$ to be a logistic classifier and train its coefficients using L2-regularized logistic regression.
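The transfer stage can be illustrated with a short NumPy sketch. It rests on several assumptions that are not taken from the text: rectification is taken to split each code into positive and negative parts before appending the constant term, the solver is plain gradient descent, and the names `rectify`, `train_transfer`, `predict_patch` and the hyperparameters `l2`, `lr`, `steps` are hypothetical. Any L2-regularized logistic regression solver could stand in for the training loop.

```python
import numpy as np

def rectify(Z):
    """Map codes Z (n, L) to features [max(z, 0), max(-z, 0), 1] of size 2L + 1."""
    ones = np.ones((Z.shape[0], 1))
    return np.hstack([np.maximum(Z, 0), np.maximum(-Z, 0), ones])

def train_transfer(Z, G, l2=1e-3, lr=0.5, steps=500):
    """Fit one independent logistic predictor per output patch element.

    Z: (n, L) sparse codes sampled at patch centers.
    G: (n, p) binary ground-truth patches, flattened to p elements each.
    Returns W of shape (2L + 1, p); column j holds the weights of predictor j.
    """
    X = rectify(Z)
    W = np.zeros((X.shape[1], G.shape[1]))
    for _ in range(steps):
        P = 1.0 / (1.0 + np.exp(-(X @ W)))       # predicted probabilities
        grad = X.T @ (P - G) / X.shape[0] + l2 * W  # logistic loss + L2 penalty
        W -= lr * grad
    return W

def predict_patch(W, z):
    """Predict the flattened label patch for a single code vector z of length L."""
    logits = rectify(z[None, :]) @ W
    return (1.0 / (1.0 + np.exp(-logits)))[0]
```

Because the columns of `W` never interact in the loss, fitting them jointly in one matrix update is equivalent to training each predictor separately.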
Predicting $m \times m$ patches means that each element of the output reconstruction is an average of outputs from $m^2$ different classifiers. Moreover, one would expect (and we observe in practice) the accuracy of the classifiers to be spatially varying. Predicted labels of pixels more distant from the patch center are less reliable than those nearby. To correct for this, we weight predicted patches with a Gaussian kernel when spatially averaging them during reconstruction.
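A sketch of this weighted averaging step, assuming a single output channel, an odd patch size, and one predicted patch centered on every pixel. The function names and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(m, sigma):
    """m x m Gaussian weights centered on the patch (m odd)."""
    r = np.arange(m) - m // 2
    return np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2.0 * sigma ** 2))

def reconstruct(patches, sigma):
    """Gaussian-weighted average of overlapping patch predictions.

    patches: (H, W, m, m) array; patches[y, x] is the m x m label patch
    predicted from the code at pixel (y, x) and centered on that pixel.
    Returns an (H, W) reconstruction.
    """
    H, W, m, _ = patches.shape
    k = gaussian_kernel(m, sigma)
    pad = m // 2
    acc = np.zeros((H + 2 * pad, W + 2 * pad))   # weighted sum of predictions
    norm = np.zeros_like(acc)                    # sum of weights per pixel
    for y in range(H):
        for x in range(W):
            acc[y:y + m, x:x + m] += k * patches[y, x]
            norm[y:y + m, x:x + m] += k
    out = acc / norm
    return out[pad:pad + H, pad:pad + W]         # crop the padding
```

Dividing by the accumulated weights rather than a fixed constant keeps the average correct near image borders, where fewer patches overlap.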
Additionally, we would like the computation time for prediction to grow more slowly than $O(m^2)$ as patch size increases. Because predictions originating from similar spatial locations are likely to be correlated and a Gaussian kernel gives distant neighbors small weight, we construct an adaptive kernel $A$, which approximates the Gaussian, taking fewer samples with increasing distance, but upweighting them to compensate for decreased sample density. Specifically:
$$A(x, y) = \begin{cases} \mathcal{N}_\sigma(x, y) / \rho(x, y) & \text{if } (x, y) \in S \\ 0 & \text{otherwise} \end{cases}$$
where $\mathcal{N}_\sigma$ is a 2D Gaussian, $S$ is a set of sample points, and $\rho(x, y)$ measures the local density of sample points. Figure 3 provides an illustration of $A$ for a fixed Gaussian and sampling patterns which repeatedly halve density at various radii.
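One way to build such a kernel is sketched below. The sampling pattern is hypothetical: density is halved each time the distance from the patch center passes a value in `radii`, alternating between a checkerboard and a doubled grid stride. The radii, the use of Chebyshev distance, and the function names are illustrative assumptions, not the pattern used in the experiments.

```python
import math

def in_sample_set(dx, dy, level):
    """True if offset (dx, dy) is kept at a density of 2 ** -level."""
    s = 2 ** (level // 2)
    if dx % s or dy % s:
        return False          # off the coarse grid with stride s
    if level % 2 == 1:
        # Checkerboard on the coarse grid halves the density once more.
        return (dx // s + dy // s) % 2 == 0
    return True

def adaptive_kernel(m, sigma, radii=(2, 4)):
    """Sparse approximation of an m x m Gaussian (m odd) as nested lists.

    Kept samples carry the Gaussian weight divided by the local sample
    density, so sparser regions are upweighted; dropped samples are zero.
    """
    c = m // 2
    A = [[0.0] * m for _ in range(m)]
    for y in range(m):
        for x in range(m):
            dx, dy = x - c, y - c
            dist = max(abs(dx), abs(dy))              # Chebyshev distance
            level = sum(dist > r for r in radii)      # number of halvings
            if in_sample_set(dx, dy, level):
                rho = 2.0 ** -level
                gauss = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))
                A[y][x] = gauss / rho
    return A
```

Classifiers for output positions with zero weight never need to be trained or evaluated, which is where the savings come from.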
We report all experimental results using the adaptively sampled approximate Gaussian kernel during reconstruction. We found it to perform equivalently to the full Gaussian kernel and better than uniform patch weighting. The adaptive weight kernel not only reduces runtime, but also reduces training time as we neither run nor train the classifiers that the kernel assigns zero weight.
5 Multipath Network
Sections 3 and 4 describe our system for reconstructive sparse code transfer in the context of a single generatively learned patch dictionary and the resulting sparse representation. In practice, we must offer the system a richer view of the input than can be obtained from coding against a single dictionary. To accomplish this, we borrow the multipath sparse coding framework of Bo et al. [6] which combines two strategies for building richer representations.
First, the image is rescaled and all scales are coded against multiple dictionaries for patches of varying size. Second, the output sparse representation is pooled, subsampled, and then treated as a new input signal for another layer of sparse coding. Figure 2 describes the network architecture we have chosen in order to implement this strategy. We use rectification followed by hybrid average-max pooling (the average of nonzero coefficients) between layers 1 and 2. For patches, we pool over windows and subsample by a factor of , while for patches, we pool over windows and subsample by a factor of .
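The pooling step between layers can be sketched as follows, assuming the input code map has already been rectified to nonnegative values. The function name, the border handling by window clipping, and the `window` and `stride` parameters are illustrative; the values used in the test below are not the paper's settings.

```python
import numpy as np

def avg_nonzero_pool(code_map, window, stride):
    """Hybrid average-max pooling of a rectified code map of shape (H, W, L).

    Each output entry is the mean of the nonzero coefficients inside a
    window x window neighborhood (0 if there are none), evaluated every
    `stride` pixels. Windows are clipped at the image border.
    """
    H, W, L = code_map.shape
    half = window // 2
    ys = range(0, H, stride)
    xs = range(0, W, stride)
    out = np.zeros((len(ys), len(xs), L))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            block = code_map[max(0, y - half):y + half + 1,
                             max(0, x - half):x + half + 1].reshape(-1, L)
            counts = np.count_nonzero(block, axis=0)
            sums = block.sum(axis=0)
            # Divide only where a channel has at least one nonzero entry.
            out[i, j] = np.divide(sums, counts, out=np.zeros(L), where=counts > 0)
    return out
```

Averaging only over nonzero entries sits between the two classical choices: unlike average pooling, it is not diluted by the many zeros of a sparse code, and unlike max pooling, it still reflects every active coefficient in the window.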
We concatenate all representations generated by the -atom dictionaries, rectify, and upsample them so that they live on the original image grid. This results in a -dimensional sparse vector representation for each image pixel. Despite the high dimensionality, there are only a few hundred nonzero entries per pixel, so total computational work is quite reasonable.
The dictionary transfer stage described in Section 4 now operates on these high-dimensional concatenated sparse vectors instead of the output of a single dictionary. Training is more expensive, but classification and reconstruction are still cheap. The cost of evaluating the logistic classifiers scales with the number of nonzero coefficients in the sparse representations rather than the dimensionality. As a speedup for training, we drop a different random portion of the representation for each of the logistic classifiers.
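The cost argument can be made concrete with a small sketch that stores only the nonzero coefficients of a feature vector. The dictionary-based sparse format and the function name are illustrative assumptions.

```python
import math

def sparse_logistic(nonzeros, weights, bias):
    """Evaluate one logistic classifier on a sparse feature vector.

    nonzeros: dict mapping feature index to its nonzero coefficient.
    weights: sequence of per-feature weights; bias: scalar offset.
    The work is proportional to len(nonzeros), not len(weights).
    """
    activation = bias + sum(weights[i] * v for i, v in nonzeros.items())
    return 1.0 / (1.0 + math.exp(-activation))
```

With only a few hundred active entries per pixel, the inner sum touches a small fraction of the weight vector regardless of how many dictionaries are concatenated.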
6 Experiments
We apply multipath reconstructive sparse code transfer to two pixel labeling tasks: contour detection on the Berkeley segmentation dataset (BSDS) [12], and semantic labeling of human faces (into skin, hair, and background) on the part subset [14] of the Labeled Faces in the Wild (LFW) dataset [13]. We use the network structure in Figure 2 in both sets of experiments, with the only difference being that we apply a zero-mean transform to patch channels prior to encoding in the BSDS experiments. This choice was simply made to increase dictionary efficiency in the case of contour detection, where absolute color is likely less important. For experiments on the LFW dataset, we directly encode raw patches.
6.1 Contour Detection
Figure 4 shows contour detection results on example images from the test set of the 500 image version [8] of the BSDS [12]. Figure 5 shows the precision-recall curve for our contour detector as benchmarked against human-drawn ground-truth. Performance is comparable to the heavily-engineered state-of-the-art global Pb (gPb) detector [8].
Note that both gPb and SCG [15] apply a spectral clustering procedure on top of their detector output in order to generate a cleaner globally consistent result. In both cases, this extra step provides a performance boost. Table 1 displays a more nuanced comparison of our contour detection performance with that of SCG before globalization. Our detector performs comparably to (local) SCG. We expect that inclusion of a sophisticated spectral integration step [25] will further boost our contour detection performance, but leave the proof to future work.
It is also worth emphasizing that our system is the only method in Table 1 that relies on neither hand-crafted filters (global Pb, SCG) nor hand-crafted features (global Pb, Structured Edges). Our system is learned entirely from data and even relies on a generatively trained representation as a critical component.
Additional analysis of our results yields the interesting observation that the second layer of our multipath network appears crucial to texture understanding. Figure 6 shows a comparison of contour detection results when our system is restricted to use only layer 1 versus results when the system uses the sparse representation from both layers 1 and 2. Inclusion of the second layer (deep sparse coding) essentially allows the classification stage to learn an off switch for texture edges.
| Method | ODS F | OIS F | AP | Hand-crafted features? | Hand-crafted filters? | Globalization? |
| Structured Edges | | | | yes | no | no |
| local SCG (color) | | | | no | yes | no |
| Sparse Code Transfer Layers 1+2 | | | | no | no | no |
| Sparse Code Transfer Layer 1 | | | | no | no | no |
| local SCG (gray) | | | | no | yes | no |
| multiscale Pb | | | | yes | yes | no |
| Canny Edge Detector | | | | yes | yes | no |
| global SCG (color) | | | | yes | yes | yes |
| global Pb + UCM | | | | yes | yes | yes + UCM |
| global Pb | | | | yes | yes | yes |
6.2 Semantic Labeling of Faces
Figure 7 shows example results for semantic segmentation of skin, hair, and background classes on the LFW parts dataset using reconstructive sparse code transfer. All results are for our two-layer multipath network. As the default split of the LFW parts dataset allowed images of the same individual to appear in both training and test sets, we randomly re-split the dataset with the constraint that images of a particular individual were either all in the training set or all in the test set, with no overlap. All examples in Figure 7 are from our test set after this more stringent split.
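An identity-disjoint split of this kind can be sketched as below. The `(image_id, person)` input format, the `test_fraction` value, and the function name are hypothetical, since the text does not state the split proportions.

```python
import random
from collections import defaultdict

def split_by_identity(samples, test_fraction=0.3, seed=0):
    """Split (image_id, person) pairs so that no person spans both sets.

    People, not images, are shuffled and partitioned; every image then
    follows its person into either the training or the test set.
    """
    by_person = defaultdict(list)
    for image_id, person in samples:
        by_person[person].append(image_id)
    people = sorted(by_person)                 # fixed order before shuffling
    random.Random(seed).shuffle(people)
    n_test = max(1, round(test_fraction * len(people)))
    test_people = set(people[:n_test])
    train = [i for p in people if p not in test_people for i in by_person[p]]
    test = [i for p in people if p in test_people for i in by_person[p]]
    return train, test
```

Partitioning at the level of individuals means the realized image ratio only approximates `test_fraction`, since people contribute different numbers of images.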
Note that while faces are centered in the LFW part dataset images, we directly apply our algorithm and make no attempt to take advantage of this additional information. Hence, for several examples in Figure 7 our learned skin and hair detectors fire on both primary and secondary subjects appearing in the photograph.
7 Conclusion
We demonstrate that sparse coding, combined with a reconstructive transfer learning framework, produces results competitive with the state-of-the-art for contour detection. Varying the target of the transfer learning stage allows one to port a common sparse representation to multiple end tasks. We highlight semantic labeling of faces as an additional example. Our approach is entirely data-driven and relies on no hand-crafted features. Sparse representations similar to the one we consider also arise naturally in the context of deep networks for image recognition. We conjecture that multipath sparse networks [6] can produce shared representations useful for many vision tasks and view this as a promising direction for future research.
Acknowledgments. ARO/JPL-NASA Stennis NAS7.03001 supported Michael Maire’s work.
References
[1] Le, Q.V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. ICML (2012)
[2] LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision. ISCAS (2010)
[3] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NIPS (2012)
[4] Zeiler, M.D., Taylor, G.W., Fergus, R.: Adaptive deconvolutional networks for mid and high level feature learning. ICCV (2011)
[5] Yu, K., Lin, Y., Lafferty, J.: Learning image representations from the pixel level via hierarchical sparse coding. CVPR (2011)
[6] Bo, L., Ren, X., Fox, D.: Multipath sparse coding using hierarchical matching pursuit. CVPR (2013)
[7] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. ECCV (2014)
[8] Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. PAMI (2011)
[9] Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing (2006)
[10] Pati, Y.C., Rezaiifar, R., Krishnaprasad, P.S.: Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. Asilomar Conference on Signals, Systems and Computers (1993)
[11] Rubinstein, R., Zibulevsky, M., Elad, M.: Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. (2008)
[12] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV (2001)
[13] Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. (2007)
[14] Kae, A., Sohn, K., Lee, H., Learned-Miller, E.: Augmenting CRFs with Boltzmann machine shape priors for image labeling. CVPR (2013)
[15] Ren, X., Bo, L.: Discriminatively trained sparse code gradients for contour detection. NIPS (2012)
[16] Dollár, P., Zitnick, C.L.: Structured forests for fast edge detection. ICCV (2013)
[17] Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color and texture cues. PAMI (2004)
[18] Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
[19] Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and texture analysis for image segmentation. IJCV (2001)
[20] Ren, X., Fowlkes, C., Malik, J.: Figure/ground assignment in natural images. ECCV (2006)
[21] Lim, J., Zitnick, C.L., Dollár, P.: Sketch tokens: A learned mid-level representation for contour and object detection. CVPR (2013)
[22] Mairal, J., Leordeanu, M., Bach, F., Hebert, M., Ponce, J.: Discriminative sparse image models for class-specific edge detection and image interpretation. ECCV (2008)
[23] Yang, J., Wang, Z., Lin, Z., Shu, X., Huang, T.: Bilevel sparse coding for coupled feature spaces. CVPR (2012)
[24] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing (2006)
[25] Maire, M., Yu, S.X., Perona, P.: Progressive multigrid eigensolvers for multiscale spectral segmentation. ICCV (2013)
[26] Arbeláez, P.: Boundary extraction in natural images using ultrametric contour maps. POCV (2006)
[27] Canny, J.: A computational approach to edge detection. PAMI (1986)