Training Deep Networks to be Spatially Sensitive
In many computer vision tasks, for example saliency prediction or semantic segmentation, the desired output is a foreground map that predicts pixels where some criterion is satisfied. Despite the inherently spatial nature of this task, commonly used learning objectives do not incorporate the spatial relationships between misclassified pixels and the underlying ground truth. The Weighted F-measure, a recently proposed evaluation metric, does reweight errors spatially, and has been shown to closely correlate with human evaluation of quality and to stably rank predictions with respect to noisy ground truths (such as a sloppy human annotator might generate). However, its computational complexity makes it intractable as an optimization objective for gradient descent, where the objective must be evaluated thousands or millions of times while learning a model’s parameters. We propose a differentiable and efficient approximation of this metric. By incorporating spatial information into the objective we can use a simpler model than competing methods without sacrificing accuracy, resulting in faster inference speeds and alleviating the need for pre/post-processing. We match (or improve) performance on several tasks compared to prior state of the art by traditional metrics, and in many cases significantly improve performance by the weighted F-measure.
When optimizing a predictive model it is important that the objective function not only encode the ideal solution (zero mistakes), but also quantify the relative severity of mistakes. A common dimension of preference is the desired tradeoff between precision and recall. One can capture this tradeoff with the $F_\beta$ metric, where $\beta$ reflects the relative importance of recall compared to precision. While this metric can quantify the relative importance of false positives and false negatives, it cannot capture differing severity between two false positives, or two false negatives. One domain where differentiating between such errors becomes important is the prediction of foreground maps, where the output has many desired properties not captured by notions of precision or recall, such as smoothness, accuracy of boundaries, contiguity of the predicted mask, etc. As a result, two predictions with the same number of mistakes, or with the same score on a measure which treats false positives and false negatives equally (e.g. intersection over union, IoU), may differ substantially in their perceived spatial quality. Loss functions derived from per-pixel classification-based surrogates, such as log-loss, are almost universally used in existing work, but fail to capture both the precision-recall tradeoff and spatial sensibilities of this kind.
Margolin et al. [?] proposed a method to quantify these distinctions when predicting foreground maps. Their measure formalizes two notions. First, false detections are less severe when close to the object’s true boundary; second, missing an entire section of an object is worse than missing the same number of pixels scattered across the entire object. These alterations closely match human intuition and perceptual judgements, and have the additional benefit of being robust to small annotation errors (such as minor differences between multiple human annotators). The measure is also able to reliably rank generic foreground maps, such as centered geometric shapes, lower than state of the art predictions (they show that traditional metrics, such as AUC, lack this property). Despite these positive traits, their formulation has memory and computational requirements of $O(N^2)$, where $N$ is the number of pixels in the image.
This computational cost poses a particular problem if the metric were used as the training objective for deep neural networks (DNNs). Normally trained with stochastic gradient descent over large training sets, DNNs require computing the gradient of the loss many times. This means that the loss function must be both differentiable and efficient, two criteria which the weighted F-measure does not meet.
Our primary contribution is a differentiable and computationally efficient approximation of the weighted F-measure, which can be used directly as the loss function of a convolutional neural network (CNN). As a secondary contribution, we propose a memory-efficient CNN architecture which is capable of producing high resolution pixel-wise predictions, taking full advantage of the spatial information provided by our proposed loss. By combining these two components we are able to produce high-fidelity, spatially cohesive predictions, without relying on complex, often expensive pre-processing (such as super-pixels) or post-processing (such as CRF inference), resulting in inference speeds an order of magnitude faster than state of the art in multiple domains. We do not sacrifice accuracy, achieving competitive or state of the art accuracy on benchmarks for salient object detection, portrait segmentation, and visual distractor masking.
In this section we discuss prior work on incorporating spatial considerations into learning objectives. While multiple objectives have been proposed to capture spatial properties of prediction maps [?], these have been limited to structured prediction methods using random fields, and add significant complexity when incorporated into a feed-forward prediction framework like that of CNNs. We focus on the weighted F-measure, which is decoupled from the prediction framework and upon which we directly build our approach. We review it below, and also survey the related work on the segmentation tasks on which we evaluate our contributions: salient object detection, distractor detection, and portrait segmentation.
2.1 The $F_\beta$ metric family
In a binary classification scenario, when the predicted label is a mistake, it is either a false positive (FP, a negative example predicted as positive) or a false negative (FN, a positive example predicted as negative). Performance of any classifier on an evaluation set can be characterized by its precision, $TP/(TP+FP)$, and its recall, $TP/(TP+FN)$.
While precision and recall each only tell part of the story, one can summarize a classification algorithm’s performance in a single number, using the $F_\beta$ metric

$$F_\beta = (1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}},$$

where $\beta$ captures the relative importance of recall compared to precision (e.g. if recall is twice as important as precision, we use $\beta = 2$). The well known $F_1$ metric is a special case corresponding to equal importance between precision and recall. The $F_\beta$ metric is a common benchmark in ’information extraction’ tasks, and in [?] Jansche outlines a procedure to directly optimize it. This formulation applies to any scenario in which $F_\beta$ is meaningful, but it cannot encode differences within the categories of false positive and false negative, which are quite meaningful in the highly structured domain of natural images.
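As a concrete illustration of how $\beta$ shifts the balance between the two quantities, the following Python sketch computes $F_\beta$ from raw counts using the standard definition $(1+\beta^2)PR/(\beta^2 P + R)$. The function name and the example counts are illustrative assumptions, not taken from the paper.

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 1.0) -> float:
    """Standard F_beta score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0  # convention used here: no true positives gives a score of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts: a cautious detector with precision 0.9 and recall 0.5.
scores = {beta: f_beta(tp=45, fp=5, fn=45, beta=beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in scores.items():
    print(f"beta={beta}: F_beta={score:.3f}")
```

Because recall is the weaker of the two quantities in this example, the score falls as $\beta$ grows and recall is given more importance.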
The standard $F_\beta$ is extended in [?] in two ways. First, it is generalized to handle continuous predictions (the ground truth remains binary). The adjusted definitions of the true positive, false positive, true negative, and false negative counts are as follows:
This holds in the case of predicting a set of values; $y$ is the vector of ground truth labels, $\hat{y}$ is the vector of predictions, and $\cdot$ denotes the dot product.
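A minimal sketch of such continuous counts is given below. It assumes the error-based form commonly attributed to Margolin et al., in which the per-pixel absolute error $|y - \hat{y}|$ is split between the four counts according to the binary label; the names `soft_counts`, `y`, and `y_hat` are illustrative, and the exact notation of Equation 2 may differ.

```python
def soft_counts(y, y_hat):
    """Soft TP, FP, TN, FN for binary labels y and predictions y_hat in [0, 1]."""
    err = [abs(t - p) for t, p in zip(y, y_hat)]  # per-pixel absolute error
    tp = sum((1 - e) * t for e, t in zip(err, y))
    fp = sum(e * (1 - t) for e, t in zip(err, y))
    tn = sum((1 - e) * (1 - t) for e, t in zip(err, y))
    fn = sum(e * t for e, t in zip(err, y))
    return tp, fp, tn, fn


y = [1, 1, 0, 0]
hard = soft_counts(y, [1, 0, 1, 0])          # binary predictions
soft = soft_counts(y, [0.9, 0.6, 0.2, 0.1])  # continuous predictions
```

With binary predictions the soft counts reduce to the usual integer counts, and in both cases the four counts sum to the number of pixels.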
The second modification proposed in [?] addresses the unequal nature of mistakes in binary segmentation (1 implying foreground, 0 background), as determined by the spatial configuration of predictions vs. ground truth. The authors of [?] suggest a number of criteria for evaluating foreground maps.
First consider false negatives, missed detections of foreground pixels. If random foreground pixels across an object are undetected, leaving small holes in the foreground, this is easily corrected via post-processing. However, concentrating the same number of errors in one part of the object is much more perceptually severe and difficult to correct. See the top row of Figure ?. This is captured by re-weighting the errors with a matrix $A$:
This definition of $A$ means that the FN error at any given pixel is calculated by summing over all FN errors in the image, weighted by a Gaussian centered at the pixel of interest. Intuitively, if there are many spatially co-occurring FN predictions, they will all contribute to each other’s loss, heavily penalizing larger sections of missed foreground.
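The effect of this re-weighting can be illustrated with a small brute-force Python sketch. It is a simplified illustration under stated assumptions: the variance `sigma2 = 5.0`, the Gaussian normalization, and the clamp of each re-weighted error to its original value follow our understanding of Margolin et al.'s formulation and are not confirmed by the text here; all names are hypothetical.

```python
import math


def gaussian_weighted_fn(errors, sigma2=5.0):
    """errors maps each foreground pixel (row, col) to its FN error in [0, 1]."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    weighted = {}
    for (ri, ci), e_i in errors.items():
        total = 0.0
        for (rj, cj), e_j in errors.items():
            d2 = (ri - rj) ** 2 + (ci - cj) ** 2
            total += norm * math.exp(-d2 / (2.0 * sigma2)) * e_j
        weighted[(ri, ci)] = min(e_i, total)  # assumed clamp to the original error
    return weighted


def make_errors(size, missed):
    """All pixels of a size x size foreground square, with error 1 at the missed ones."""
    return {(r, c): (1.0 if (r, c) in missed else 0.0)
            for r in range(size) for c in range(size)}


clustered = make_errors(6, {(2, 2), (2, 3), (3, 2), (3, 3)})
scattered = make_errors(6, {(0, 0), (0, 5), (5, 0), (5, 5)})
clustered_total = sum(gaussian_weighted_fn(clustered).values())
scattered_total = sum(gaussian_weighted_fn(scattered).values())
print(clustered_total, scattered_total)
```

Both maps miss four pixels, but the 2x2 hole accumulates a much larger total error than the four isolated misses, because neighbouring errors reinforce one another.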
False positives, or erroneous foreground detections, are treated differently. A false positive near the true boundary of the object is more acceptable than a distant one. Even human annotators often do not precisely agree on the boundaries of an object. See the bottom row of Figure ?. Margolin et al. [?] quantify this as follows:
Where , and . Intuitively this gives false positives a weight between 1 and 2: false positives spatially distant from any true positive approach weight 2, and false positives next to true positives have weight approximately 1. This penalizes distant spurious detections more heavily.
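The behaviour described above can be sketched in Python as follows. The functional form $2 - e^{\alpha\Delta}$, with $\Delta$ the distance to the nearest ground-truth foreground pixel and $\alpha = \ln(0.5)/5$, is an assumption based on our recollection of Margolin et al.'s definition, chosen because it reproduces the weights of roughly 1 near the object and 2 far from it; the names are hypothetical.

```python
import math

ALPHA = math.log(0.5) / 5.0  # assumed decay rate: the weight is 1.5 at a distance of 5 pixels


def fp_weight(pixel, foreground, alpha=ALPHA):
    """Weight in [1, 2) for a false positive at `pixel`, given ground-truth foreground pixels."""
    # Brute-force distance to the nearest foreground pixel.
    delta = min(math.hypot(pixel[0] - r, pixel[1] - c) for r, c in foreground)
    return 2.0 - math.exp(alpha * delta)


foreground = [(r, c) for r in range(10, 20) for c in range(10, 20)]  # a 10x10 object
near = fp_weight((10, 9), foreground)   # one pixel outside the boundary
far = fp_weight((10, 60), foreground)   # far from the object
print(near, far)
```

Under these assumptions a false positive one pixel from the boundary is weighted only slightly above 1, while one 41 pixels away is weighted almost exactly 2.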
The weighted counts are then defined by substituting the re-weighted error in place of the unweighted error in Equation 2, and these terms are used to define weighted precision, weighted recall, and the weighted F-measure.
2.3 Salient Object Detection
Traditionally salient object detection models have been constructed by applying expert domain knowledge. Some methods rely on feature engineering combined with center-surround contrast concepts motivated by human perception, where the features are based on color, intensity and texture [?]. A more advanced perception model was used in [?] to generate object detections from attended points. Another approach is using high-level object detectors to determine local ’objectness’ [?]. Many methods combine both approaches [?]. Other techniques make predictions hierarchically [?], or based on graphical models [?]. Other expert knowledge includes re-weighting the model predictions based on the image center or boundaries [?].
Deep networks were used in [?] to learn local patch features to predict the saliency score at the center of the patch. However, lack of global information might lead to failure to detect the interior of large objects. In [?] Kokkinos combines the task of salient object detection with several other vision tasks, demonstrating a general multi-task CNN architecture.
CNNs were used to extract features around super-pixels [?], and to combine them with hand-crafted ones [?]. Li et al. [?] propose a two-stream method that makes coarse pixel-level predictions, based on concatenated multi-layer features similar to [?], and then fuses these with super-pixel predictions (reminiscent of [?]). The results of [?] and [?] also rely on post-processing with a CRF.
Our method differs from [?] in two important ways. Instead of relying upon spatial supervision provided by super-pixel algorithms, our architecture directly produces a high resolution prediction. Our proposed spatially sensitive loss function encourages the learned network to make predictions that snap to object boundaries and avoid “holes” in the interior of objects, without any post-processing (e.g., CRF). Our model achieves competitive or state-of-the-art results on all benchmarks. Additionally, training our model is three times faster than competitive saliency methods, making it much easier to scale to larger training sets. Performing inference with our model is almost an order of magnitude faster than any competitive method, and can be used in a real-time application.
Another task where it is vital to predict accurate and high resolution foreground maps is distractor detection as proposed by [?]. Distractors are defined as visually salient parts of an image which are not the photographer’s intended focus. This task is somewhat similar to salient object detection, but successful algorithms must go beyond simply detecting all salient objects, and model the image at a global level to discriminate between the intended focus of the image and the distractors. In [?] Fried et al. propose an SVM based approach, trained on a relatively small dataset, which classifies super-pixels extracted by Multiscale Combinatorial Grouping (MCG) based on a set of hand-crafted features. To test the robustness of their approach we gathered a larger dataset with crowdsourced labels. While their method is able to detect large and well defined distractors, it struggles to detect non-object distractors such as shadows, lights and reflections as well as select small objects.
Portraits are a highly popular art form in both photography and painting. In most instances, artists seek to make the subject stand out from its surroundings, for instance, by making it brighter or sharper or by applying photographic or painterly filters that adapt to the semantics of the image. Shen et al. [?] presented a new high quality automatic portrait segmentation algorithm by adapting the FCN-8s framework [?]. They also introduced a portrait image segmentation dataset and benchmark for training and testing.
3 Our Approximate loss
There are three issues that prevent the weighted F-measure as defined in Section ? from being directly optimized as a loss function. The first is that while the metric is differentiable almost everywhere, it is not differentiable where a prediction exactly matches its label, because the derivative of the absolute error is undefined there.
In practice, we also observed difficulties optimizing the absolute ($\ell_1$) error using SGD due to the constant magnitude of its gradient (intuitively, because the gradient doesn’t decrease as the error decreases). We solve this by replacing the $\ell_1$ norm with the squared error, which we find to be much easier to optimize.
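To make the argument concrete, write the per-pixel error as $e = \hat{y} - y$ and assume the replacement is the squared error (both the notation and the exact form of the replacement are assumptions). The two gradients then behave as follows:

```latex
\frac{\partial |e|}{\partial e} = \operatorname{sign}(e)
  \quad (e \neq 0,\ \text{undefined at } e = 0),
\qquad
\frac{\partial e^{2}}{\partial e} = 2e \;\to\; 0 \ \text{ as } e \to 0 .
```

The absolute error pushes with unit magnitude no matter how small the mistake is, whereas the squared error is differentiable everywhere and its gradient shrinks smoothly as the prediction approaches the label.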
The second problem is that constructing $A$ (not to mention multiplying by it) has $O(N^2)$ time and space complexity, with $N$ the number of pixels. However, we can overcome this problem by leveraging convolutions. When we unpack the definition of matrix multiplication in Equation 3, we can write the re-weighted error at pixel $i$ as:
Note that if then . We then define and as the ground truth and error respectively at pixel , and can approximate as:
If we let , we can define a Gaussian convolutional kernel , yielding
Where is element-wise multiplication. Now we don’t need to store any entries of , only a kernel and we skip most of the original summation over pixels . So the time and space complexity is reduced to . In practice, we use .
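The idea can be checked numerically with a small pure-Python sketch that compares the full pairwise sum against a windowed (convolution-style) sum. The variance `SIGMA2 = 5.0`, the window radii, and the 15x15 kernel used for the term count are hypothetical values chosen for illustration, not the settings used in the paper.

```python
import math

SIGMA2 = 5.0  # assumed Gaussian variance (pixels squared)
NORM = 1.0 / math.sqrt(2.0 * math.pi * SIGMA2)


def gauss(dr, dc):
    return NORM * math.exp(-(dr * dr + dc * dc) / (2.0 * SIGMA2))


def weighted_full(g, e):
    """O(N^2) reference: each foreground pixel sums over every pixel in the image."""
    h, w = len(g), len(g[0])
    return [[sum(gauss(r - rr, c - cc) * g[rr][cc] * e[rr][cc]
                 for rr in range(h) for cc in range(w)) if g[r][c] else 0.0
             for c in range(w)] for r in range(h)]


def weighted_conv(g, e, radius):
    """Windowed version: a (2*radius+1)^2 Gaussian kernel applied to g*e, masked by g."""
    h, w = len(g), len(g[0])
    kernel = [(dr, dc, gauss(dr, dc))
              for dr in range(-radius, radius + 1)
              for dc in range(-radius, radius + 1)]
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if g[r][c]:
                out[r][c] = sum(k * g[r + dr][c + dc] * e[r + dr][c + dc]
                                for dr, dc, k in kernel
                                if 0 <= r + dr < h and 0 <= c + dc < w)
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy 12x12 image containing a 10x10 object with a fixed pattern of missed pixels.
g = [[1 if 1 <= r <= 10 and 1 <= c <= 10 else 0 for c in range(12)] for r in range(12)]
e = [[1.0 if g[r][c] and (7 * r + 3 * c) % 5 == 0 else 0.0 for c in range(12)]
     for r in range(12)]

full = weighted_full(g, e)
approx_diff = max_abs_diff(full, weighted_conv(g, e, radius=7))
exact_diff = max_abs_diff(full, weighted_conv(g, e, radius=11))

# Term counts for a 224x224 image with a hypothetical 15x15 kernel.
n = 224 * 224
full_terms, conv_terms = n * n, n * 15 * 15
print(approx_diff, exact_diff, full_terms // conv_terms)
```

On this toy image a window of radius 7 already agrees with the full sum to within a small tolerance, because the Gaussian weights of more distant pixels are negligible; for a 224x224 image the windowed sum touches a few hundred times fewer terms than the full pairwise sum.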
The final problem is that constructing $B$ has complexity $O(N^2)$, because computing it requires finding, for each pixel, the minimum distance over all foreground pixels. However, if the nearest foreground pixel is more than 25 pixels away, then the weight in Equation 4 is already approximately 2. Our intuition is that $B$ is modeling the region of uncertainty about an object’s boundary, for which we believe 25 pixels to be too generous. So we redefine $B$ as:
Squaring the distance so that the region of uncertainty is assumed to be approximately 5 pixels instead of 25. Now the time complexity is where . We can approximate using a convolution with the kernel , a tensor of size , defined at each index (zero indexed) as:
and we can rewrite as:
By reformulating the local search for the true object boundary as a convolution followed by an argmin, we can leverage the efficient implementations of these operations already available in many packages. While our current architecture does not suffer from speed or memory issues, more complicated architectures might benefit from a more optimized implementation, namely a custom ’minimization convolution’ that would not store the intermediary result of , and would take advantage of the sparsity of .
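The local search can be sketched as a plain-Python window scan that stands in for the convolution-then-minimum formulation described above. It computes, for every background pixel, the squared distance to the nearest foreground pixel inside a bounded window; the window radius and the clipping value used when the window contains no foreground are hypothetical choices, and the paper's tensor kernel is not reproduced here.

```python
def local_min_dist_sq(g, radius):
    """Squared distance from each background pixel to the nearest foreground pixel,
    searched only within a (2*radius+1)^2 window and clipped at (radius+1)^2."""
    h, w = len(g), len(g[0])
    cap = (radius + 1) ** 2  # hypothetical "far away" value for empty windows
    offsets = [(dr, dc)
               for dr in range(-radius, radius + 1)
               for dc in range(-radius, radius + 1)]
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if g[r][c]:
                continue  # foreground pixels are at distance zero
            best = cap
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and g[rr][cc]:
                    best = min(best, dr * dr + dc * dc)
            out[r][c] = best
    return out


# A 7x7 image whose only foreground pixel is the centre.
g = [[1 if (r, c) == (3, 3) else 0 for c in range(7)] for r in range(7)]
d2 = local_min_dist_sq(g, radius=2)
print(d2[3][4], d2[1][1], d2[0][0])
```

The cost is proportional to the number of pixels times the window area rather than to the square of the number of pixels, and pixels whose window contains no foreground simply receive the clipping value.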
These changes yield a spatially informed loss function that can easily be implemented in an existing DNN framework such as TensorFlow. It fully utilizes the GPU, does not increase training wall clock time noticeably, and yields better results than more commonly used loss functions for foreground maps. Compared to an unoptimized Python implementation of the original formulation, our approximation takes two orders of magnitude less time on the CPU, and three orders of magnitude less time on the GPU, to compute the loss on a 224x224 pixel image (see Section 5.5).
In order to produce accurate foreground maps, each pixel must have a rich feature representation. To achieve this we utilize Zoomout features [?], which have been effectively utilized in the semantic segmentation community. Zoomout features are extracted from a CNN by upsampling and downsampling the features computed by each convolutional filter to be the same spatial resolution, then concatenating the features computed at all layers of the CNN. In this way, each spatial location is richly described by the weakly localized semantic features computed at higher layers, the strongly localized edge and color detectors computed in the first layers, and everything in between.
Zoomout features are expressive but have a large memory footprint, limiting the spatial resolution of predictions that can be made using them. In tasks like distractor detection, where the end goal is to precisely localize distractors and remove them, a low resolution prediction leads to spatial ambiguity and lower precision. To remedy this problem we adapt the insights of [?], introducing what we call Squeeze Modules to our network. A Squeeze Module consists of convolutional filters, of which are convolutions and of which are convolutions. Applying Squeeze Modules to each convolutional layer acts as a dimensionality reduction with learned parameters, allowing us to make predictions at essentially arbitrary resolutions by setting to be sufficiently small. In practice we produce predictions, and set . We refer to our full architecture as a Squeezed Zoomout Network ().
We report on experiments with three tasks where we can expect spatial sensitivity to be important for quality of the output: salient object detection, portrait segmentation, and distractor detection. See Section 2 for background.
In all experiments we train our Squeezed Zoomout Network using a CNN (from which the squeezed zoomout features are derived) pre-trained on ImageNet. As the base CNN we use VGG-16 [?] for saliency, portraits, and distractors. We train the architecture using Adam [?], in 3 stages. In the first stage we set the learning rate to 3e-4 for 8 epochs. In the second we set the learning rate to 1e-4 for 4 epochs. The base CNN is kept fixed (not fine-tuned) in the first two stages. In the third stage we set the learning rate to 1e-5 for 14 epochs, fine-tuning the weights of the base CNN as well. We augment the training images by randomly permuting standard data transformations as described in [?]: image flips, random noise, changing contrast levels, and global color shifts.
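The three-stage schedule can be written down as a small framework-agnostic configuration. This is only a sketch of how the stated learning rates, epoch counts, and fine-tuning flags might be encoded; the `Stage` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    learning_rate: float
    epochs: int
    finetune_base: bool  # whether the pre-trained base CNN weights are updated


SCHEDULE = (
    Stage(learning_rate=3e-4, epochs=8, finetune_base=False),
    Stage(learning_rate=1e-4, epochs=4, finetune_base=False),
    Stage(learning_rate=1e-5, epochs=14, finetune_base=True),
)

total_epochs = sum(stage.epochs for stage in SCHEDULE)
for i, stage in enumerate(SCHEDULE, start=1):
    print(f"stage {i}: lr={stage.learning_rate:g}, epochs={stage.epochs}, "
          f"fine-tune base={stage.finetune_base}")
```

A training loop would iterate over `SCHEDULE`, resetting the optimizer's learning rate and freezing or unfreezing the base network at the start of each stage.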
5.2 Salient Object Detection
We consider four standard data sets for this task:
- MSRA-B: 5000 images with pixel level annotations provided by [?]. Widely used for salient object detection. Most images contain a single object on a high contrast background.
- HKU-IS: 4447 images with pixel level annotations provided by [?]. All images have at least one of the following attributes: multiple salient objects, salient objects touching the boundary, low color contrast, complex background.
- ECSSD: 1000 challenging images with pixel level annotations provided by [?].
- PASCAL-S: 850 images from the PASCAL VOC 2010 segmentation challenge with pixel level annotations provided by [?]. Following the convention of [?] we threshold the soft labels at 0.5.
Figure ?: Qualitative comparison. Columns, left to right: MC [?], MDF [?], DCL [?], our two models, ground truth (GT), input image.
Following the convention of [?] [?] we report the $F_\beta$ measure (with oracle access to the optimal threshold for the soft predictions), area under the receiver-operator characteristic curve (AUROC), and mean absolute error (MAE). While the first two metrics evaluate whether we rank pixels correctly, MAE captures absolute classification error. We report the mean of each metric on the test set. We also report the weighted F-measure almost exactly as formulated by [?], except that for tractability we drop terms tied to spatially distant pixels, which are very expensive to compute and have a negligible effect on the loss. While other measures give all errors equal weight, so that a small percentage of pixels predicted differently barely affects their value, those mislabeled pixels can be perceptually vital. This is captured by the weighted F-measure; we provide examples of this phenomenon in Figure ?.
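For reference, two of these evaluation quantities can be sketched in a few lines of Python: mean absolute error, and the F-measure with oracle access to the best threshold. The default `beta = 1.0`, the set of candidate thresholds, and the toy data are assumptions for illustration; the benchmark protocol may fix these differently.

```python
def mae(y, y_hat):
    """Mean absolute error between binary labels and soft predictions."""
    return sum(abs(t - p) for t, p in zip(y, y_hat)) / len(y)


def f_beta_at(y, y_hat, threshold, beta=1.0):
    """F_beta after binarizing the soft predictions at `threshold`."""
    tp = sum(1 for t, p in zip(y, y_hat) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y, y_hat) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y, y_hat) if p < threshold and t == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def oracle_f_beta(y, y_hat, beta=1.0):
    """Best F_beta over all thresholds that change the binarized prediction."""
    return max(f_beta_at(y, y_hat, thr, beta) for thr in sorted(set(y_hat)))


y = [1, 1, 1, 0, 0, 0]
y_hat = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
best_f = oracle_f_beta(y, y_hat)
error = mae(y, y_hat)
print(best_f, error)
```

The oracle score depends only on how the pixels are ranked, whereas MAE also penalizes predictions that are on the correct side of the threshold but far from 0 or 1.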
Results and Comparison Following the convention of [?] [?] we train on 2500 images from MSRA-B, validate on 500, and test on the remaining 2000. We then use the same model trained on MSRA-B to generate predictions for all other datasets. To evaluate our proposed loss function we compare the performance of our Squeezed Zoomout Network trained with the commonly used cross-entropy loss function against the same architecture trained with the exact same training procedure, but replacing the cross entropy with our loss function. The latter is our proposed method. We also compare both these models against other competitive techniques: MC [?], HDHF [?], and DCL [?]. These results are summarized in Table 1, and we provide a qualitative comparison in Figure ?. While [?], [?], and [?] report competitive results on some of the same test sets, they train on 10,000 images, while we only train on 2500, making the results not directly comparable, so we omit those methods from Table 1.
We also use saliency to explore the effectiveness of the proposed objective function compared to other reweighting schemes. These include: dropping the reweighting by either $A$ or $B$; using a weighted cross-entropy loss, with double weight given to correctly classifying either the foreground or the background; and standard cross entropy that ignores the labels of all pixels in a 3-pixel band around the borders of the foreground. Each of these reweighting schemes reduces AUROC by close to 1%, but the effect on the weighted F-measure varies. Most interesting is the large drop in performance caused by ignoring a 3-pixel border during training, which seems to indicate that these border pixels contain extremely important information for learning a higher quality model.
All inference timing results were gathered using a Titan X GPU and a 3.5GHz Intel Processor. For training MC [?] used a Titan GPU and a 3.6GHz Intel Processor, HDHF [?] and DCL [?] both use a Titan Black GPU and a 3.4GHz Intel Processor.
|Method||AUROC||Weighted F-measure|
|Proposed, no A||0.976||0.836|
|Proposed, no B||0.975||0.835|
|Cross-Entropy, 2x foreground weight||0.976||0.834|
|Cross-Entropy, 2x background weight||0.974||0.797|
|Cross-Entropy, 3pix ’DNC’ band||0.973||0.807|
Dataset We use the dataset from [?], consisting of 1800 human portrait images gathered from Flickr. A face detector is run on each image, producing a centered crop scaled to 800x600. The crop is manually segmented using Photoshop’s “quick select”. This dataset focuses on portraits captured using a front-facing mobile camera (through the choice of Flickr queries), but includes other portrait types as well. The dataset is split into 1500 training images and 300 test images. There is a wide variety in the subjects’ age, clothing, accessories, hair-style, and background.
Results Table ? shows that by MIoU both our models significantly outperform (PortraitFCN), which uses only RGB input; and , which requires substantial preprocessing (fiducial point detection, computing an average segmentation mask and aligning it to the input face location) and additional input channels. While our model achieves significantly higher scores than and , they are only slightly better than . We believe this is due to the spatial guidance used by .
- MTurk: A dataset of 403 images, with accurate, pixel level annotations averaged over many (on average 27.8 [?]) humans through Mechanical Turk.
- Dist9: A dataset of 4019 images, gathered via a free app which removed regions highlighted by users. Because the ground truth was gathered based on thumb swipes it is often inexact, and has only weak correspondence with object boundaries. To rectify this we used ground truth with scores averaged over super-pixels generated with MCG [?], where the boundary threshold is set to 0.1. In this dataset each pixel is labeled either with one of 9 foreground classes corresponding to different types of distractors (light, object, person, clutter, pole, trash, sign, shadow, and reflection) or as background.
Evaluation We evaluate our performance on the MTurk dataset through 10-fold cross validation, and compare against the performance of [?] using leave-one-out cross validation. Note that this disadvantages our method, because while each model they use for validation is trained on 402 images, each model we use is trained on 362 or 363 images. We also compare against [?] on the Dist9 dataset, training 10 separate models: one on the entire dataset, and one on each of the 9 smaller datasets corresponding to one of the foreground classes. We split the dataset randomly, using 90% to train and 10% to test. Following the convention of [?], we measure AUROC on all datasets. The results are summarized in Table ? and Figure ?. Note that the final column in Figure ? averages across categories, while Table ? averages over the entire dataset.
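The training-set sizes quoted for the two validation protocols follow directly from the dataset size, as the short Python check below shows. The helper name is hypothetical, and folds are assumed to be as equal in size as possible.

```python
def train_sizes(n_images, n_folds):
    """Distinct training-set sizes when n_images are split into n_folds near-equal folds."""
    base, extra = divmod(n_images, n_folds)
    test_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return sorted({n_images - t for t in test_sizes})


ten_fold = train_sizes(403, 10)        # 10-fold cross validation
leave_one_out = train_sizes(403, 403)  # leave-one-out cross validation
print(ten_fold, leave_one_out)
```

With 403 images, three folds hold out 41 images and seven hold out 40, which gives the 362 and 363 training images stated above, against 402 for leave-one-out.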
To evaluate the relative speed of our approximation we measure the wall clock time of computing the original metric and our approximation, averaged over fifteen random images from ECSSD. While the original takes 37 minutes, our approximation takes 8.7 seconds on a CPU, and 0.33 seconds on a GPU.
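The speed-up implied by these timings can be worked out directly; the arithmetic below uses only the three numbers reported above, and the variable names are illustrative.

```python
import math

original_s = 37 * 60  # 37 minutes, in seconds
cpu_s, gpu_s = 8.7, 0.33

cpu_speedup = original_s / cpu_s
gpu_speedup = original_s / gpu_s
cpu_orders = math.floor(math.log10(cpu_speedup))
gpu_orders = math.floor(math.log10(gpu_speedup))
print(f"CPU: {cpu_speedup:.0f}x ({cpu_orders} orders of magnitude)")
print(f"GPU: {gpu_speedup:.0f}x ({gpu_orders} orders of magnitude)")
```

The ratios come out at roughly 255x on the CPU and roughly 6700x on the GPU, consistent with the two and three orders of magnitude quoted earlier.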
6 Discussion and Future Work
We propose a differentiable and efficient objective function which directly encodes multiple widely desirable spatial properties of a foreground mask. We use this objective to learn the parameters of a novel “squeezed zoomout” architecture, resulting in high fidelity foreground maps which match or surpass state of the art results for a range of binary segmentation tasks. Notably, we achieve these results without relying on any pre-processing (e.g., super-pixel segmentation) or post-processing (e.g., CRF). An interesting direction for future work is to generalize our loss function to a multi-class setting, for instance semantic segmentation.