Local Blur Mapping: Exploiting
High-Level Semantics by Deep Neural Networks
The human visual system excels at detecting local blur in visual images, but the underlying mechanism is not well understood. Traditional views of blur, such as a reduction in local or global high-frequency energy and a loss of local phase coherence, have fundamental limitations; for example, they cannot reliably discriminate flat regions from blurred ones. Here we argue that high-level semantic information is critical to successfully detecting local blur. We therefore resort to deep neural networks, which are proficient at learning high-level features, and propose the first end-to-end local blur mapping algorithm based on a fully convolutional network (FCN). We empirically show that high-level features from deeper layers play a more important role than low-level features from shallower layers in resolving challenging ambiguities in this task. We test the proposed method on a standard blur detection benchmark and demonstrate that it significantly advances the state-of-the-art (ODS F-score of ). In addition, we explore the use of the generated blur map in three applications: blur region segmentation, blur degree estimation, and blur magnification.
Blur is one of the most common image degradations and arises from a number of sources, including atmospheric scatter, camera shake, defocus, and object motion. It is also manipulated by photographers to create visually pleasing effects that draw attention to humans or objects of interest. Given a natural photographic image, the goal of “local blur mapping” is to label every pixel as either blurred or not, resulting in a blur map. Local blur mapping is an important component in many image processing and computer vision systems. For image quality assessment, blur is an indispensable factor that contributes to perceptual image quality [wang2006modern]. For example, the worst-quality images scored by human subjects in the LIVE Challenge database [ghadiyaram2016massive] mainly suffer from motion and/or out-of-focus blur. For object detection, the identified blurred regions can be excluded for efficient region proposal and robust object localization [liu2015box, ren2015faster]. Other applications that may benefit from local blur mapping include image restoration [dai2009removing, efrat2013accurate], photo editing [bae2007defocus], depth recovery [mather1997use, shi2015just], and image segmentation [favaro2004variational, qi2015semantic].
The human visual system (HVS) identifies which parts of an image appear blurred with amazing speed [webster2002neural], but the underlying mechanism is not well understood. A traditional view of blur is that it reduces energy (either globally or locally) at high frequencies. Several low-level features have been hand-crafted to exploit this observation; among them, power spectral slopes [liu2008image, shi2014discriminative] and image gradient distributions [levin2006blind, liu2008image, shi2014discriminative] are representative. Another view is that blur disrupts local phase coherence at precisely localized features (e.g., step edges), so a coarse-to-fine phase prediction may serve as an indication of blur [wang2003local]. Nearly all previous local blur mapping operators [chakrabarti2010analyzing, hassen2013image, levin2006blind, liu2008image, shi2014discriminative] rely on these two assumptions, either explicitly or implicitly, but achieve limited success. In particular, they cannot resolve the ambiguity between flat and blurred regions, and they often confuse structures with and without blurring. A visual example and the associated statistical analysis are shown in Fig. 1 and Fig. 2, respectively.
In this regard, we argue that the fundamental problem with existing approaches is that they ignore high-level semantic information in natural images, which is crucial for successfully detecting local blur. We therefore resort to deep convolutional neural networks (CNNs), which have advanced the state-of-the-art in many high-level vision tasks such as image classification [He2015], object detection [ren2015faster], and semantic segmentation [long2015fully]. Specifically, we develop the first fully convolutional network (FCN) [long2015fully] for end-to-end, image-to-image blur mapping [shi2014discriminative]. By fully convolutional, we mean that all learnable filters in the network are convolutional and no fully connected layers are involved. As a result, the proposed blur mapper accepts input of arbitrary size, encodes spatial information thoroughly for better prediction, and maintains a relatively low computational cost. More specifically, we trim the 16-layer VGGNet [Simonyan2015Very] for image classification up to the last convolutional layer, similarly to [xie2015holistically], as our architecture. We then fine-tune the network with weights pre-trained on the semantic segmentation task [long2015fully], which contain rich high-level information about what an input image contains. Among the transferred hierarchical representations, we empirically show that high-level features are more important in resolving challenging ambiguities in local blur mapping, which is consistent with our claim about blur perception. The proposed algorithm is tested on a standard blur detection benchmark [shi2014discriminative] and outperforms state-of-the-art methods by a large margin.
Our contribution is three-fold. First, we provide a new perspective on blur perception, where high-level semantic information plays a critical role. Second, we show that it is possible to learn an end-to-end and image-to-image local blur mapper based on FCN [long2015fully], which addresses challenging ambiguities such as differentiating flat and blurred regions, and structures with and without blurring. Third, we explore three potential applications of the generated blur map: (1) blur region segmentation, (2) blur degree estimation, and (3) blur magnification.
2 Related work
Computational blur analysis is a central and long-standing problem in vision research, with early work dating back to the 1960s [slepian1967restoration]. Most researchers in this field focus on the image deblurring problem, which aims to restore a sharp image from a blurred one [cannon1976blind, xu2010two]; blur mapping itself has received little attention. Early works on blur mapping quantify the overall blur degree of an image and cannot perform dense prediction. For example, Marziliano et al. analyzed the spread of edges [marziliano2002no]. A similar approach was proposed in [elder1998local], which estimates the thickness of object contours. Zhang and Bergholm [zhang1997multi] designed a Gaussian-difference signature to model the diffuseness caused by out-of-focus blur. All these methods were designed for Gaussian-blurred images and do not generalize to the non-Gaussian and non-uniform blur encountered in the real world.
Only recently has local blur mapping become an active research topic. Rugna and Konik [da2003automatic] identified blurry regions by exploiting the observation that they are more invariant to low-pass filtering. Blind deconvolution-based methods have also been investigated to segment motion-blurred [levin2006blind] and defocus-blurred [kovacs2007focus] regions. Su et al. [su2011blurred] examined the singular value information of blurry and non-blurry regions. Chakrabarti et al. [chakrabarti2010analyzing] adopted the local Fourier transform to analyze directional blur. Liu et al. [liu2008image] manually designed three local features based on spectrum, gradient, and color information for blurry region extraction, plus an additional local autocorrelation congruency feature for blur type classification. These features were later improved by Shi et al. [shi2014discriminative] and combined with the responses of learned local filters to jointly analyze blurry regions in a multi-scale fashion. All of the above methods are based on hand-crafted low-level features, which, although successful at extracting sharp regions, cannot tell whether a given region is truly blurred or simply flat by nature.
A closely related area is image sharpness measurement [ferzli2009no, hassen2013image], which aims to extract sharp regions from an image. The results may be combined into an overall sharpness score (global assessment) or refined into a sharpness map (local assessment). There are subtle differences between local blur mapping and sharpness assessment. For example, in sharpness assessment, flat and blurry regions may both be regarded as non-sharp, whereas in blur mapping, a successful method must discriminate between them.
3 The proposed method

At a high level, we feed an image of arbitrary size into an FCN, and the network outputs a blur map of the same size, where the size mismatch introduced by pooling is resolved by in-network upsampling. Through a standard stochastic gradient descent (SGD) training procedure, our network learns a complex mapping from raw image pixels to blur perception.
3.1 Training and testing
Training phase. Given a training image set $\mathcal{D} = \{(X_n, G_n)\}_{n=1}^{N}$, where $X_n$ is the $n$-th raw input image and $G_n$ is the corresponding ground-truth binary blur map, our goal is to learn an FCN that produces a blur map with high accuracy. It is convenient to drop the subscript $n$ without ambiguity due to the image-wise operation. We denote all layer parameters in the network by $W$. The loss function is a sum over per-pixel losses between the prediction $P = \{P(j)\}$ and the ground truth $G = \{G(j)\}$, where $j$ indicates the spatial coordinate. We consider the cross entropy loss
$$\ell(W) = -\sum_{j}\big[G(j)\log P(j) + (1 - G(j))\log(1 - P(j))\big],\qquad (1)$$
where $P(j)$ is implemented by the sigmoid function on the $j$-th activation. Eq. (1) can easily be extended to account for class imbalance by weighting the loss according to the proportion of positive and negative labels. Although the labels in the blur detection database [shi2014discriminative] are mildly unbalanced (around of the pixels are blurred), we find the class-balanced cross-entropy loss unnecessary. Moreover, many probability distribution measures can be adopted as alternatives to the cross entropy loss, such as the fidelity loss from quantum physics [nielsen2010quantum]. We find in our experiments that the fidelity loss gives very similar performance, so we stick with the cross entropy loss throughout the paper.
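As a sanity check, the per-pixel cross entropy between the sigmoid-activated prediction and the binary ground truth can be sketched in a few lines of NumPy; this is our illustration, not the paper's implementation:

```python
import numpy as np

def cross_entropy_loss(activations, gt):
    """Sum of per-pixel cross-entropy losses between the predicted blur
    probabilities P = sigmoid(activations) and a binary ground-truth map."""
    p = 1.0 / (1.0 + np.exp(-activations))  # sigmoid on each activation
    eps = 1e-12                             # guard against log(0)
    return -np.sum(gt * np.log(p + eps) + (1 - gt) * np.log(1 - p + eps))
```

In practice the same quantity is computed inside the training framework; this standalone version is only meant to make the loss concrete.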
Testing phase. After training, the layer parameters $W$ are fixed. Given a test image $X$, we perform a standard forward pass to obtain the predicted blur map $\hat{P} = f(X; W)$.
3.2 Network architecture and its alternatives
Trimmed VGGNet. Inspired by recent works [bertasius2015deepedge, xie2015holistically] that successfully fine-tune deep neural networks pre-trained on the general image classification task for edge detection, we adopt the 16-layer VGGNet architecture [Simonyan2015Very] with two modifications [xie2015holistically]: (1) to make the network fully convolutional, we trim it at the last convolutional layer (conv5_3), discarding both the pool5 layer and all fully connected layers; and (2) we connect an in-network upsampling (deconvolution) layer that interpolates the activations produced by the last convolutional layer to match the spatial size of the ground-truth blur map. Fig. 3 shows the architecture. The reasons for this trimming are as follows. First, the pool5 layer produces spatial information that is too coarse, which would pose difficulty for later interpolation. Second, instead of convolutionalizing the fully connected layers as in [long2015fully], removing them significantly reduces the computational complexity with only a mild loss of representational power. As a result, we speed up computation and reduce memory usage at both training and test stages.
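The fixed bilinear initialization of the upsampling (deconvolution) layer can be sketched as follows. This mirrors the kernel construction commonly used in FCN-style networks; it is a sketch under that assumption, not code released with the paper:

```python
import numpy as np

def bilinear_kernel(factor):
    """2-D bilinear interpolation kernel for a transposed convolution
    that upsamples its input by `factor`, following the common FCN recipe:
    kernel size is 2 * factor - factor % 2."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))
```

A deconvolution layer initialized with this kernel (one copy per channel) reproduces bilinear interpolation exactly, so keeping it fixed, as described above, simply removes the upsampling weights from learning.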
We continue by discussing several more sophisticated architecture designs that better combine low-level and high-level features and that have been successfully applied to other areas of computer vision. As will be clear in Section 4.2, incorporating low-level features through these more involved architectures often impairs performance compared with the default trimmed VGGNet.
FCNs with skip layers. The original FCNs adapt classification nets for dense semantic segmentation [long2015fully] by converting fully connected layers into convolutional ones. To combat the coarse spatial information in deeper layers, which limits the scale of detail in the upsampled output, Long et al. [long2015fully] introduced skip layers that combine the responses of the final prediction layer with those of shallower layers carrying finer spatial information. It is straightforward to adapt this architecture to the blur mapping task by simply replacing the loss function with the cross entropy loss. We include FCN-8s, a top-performing architecture with reasonable complexity, in our experiments.
Deeply supervised nets. To make the learning process of hidden layers direct and transparent, Lee et al. [lee2015deeply] proposed deeply supervised nets (DSN), which add side output layers to the convolutional layers in each stage. In the case of the 16-layer VGGNet adopted for edge detection [xie2015holistically], five side outputs are produced right after the conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3 layers, respectively. All side outputs are fused into a final output with learnable fusion weights. The final output together with all side outputs contributes to the loss function, which can be minimized using a standard SGD method. We include two variants of DSN: training with weighted fusion only, and training with weighted fusion and deep supervision.
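The weighted-fusion stage can be sketched as below: each side-output activation map (already upsampled to a common size) is combined with a learnable scalar weight before a sigmoid. The weights here are illustrative placeholders, not the values learned in our experiments:

```python
import numpy as np

def fuse_side_outputs(side_activations, fusion_weights):
    """DSN-style fusion: a weighted sum of side-output activation maps,
    squashed by a sigmoid to give the fused blur probability map."""
    fused = np.zeros_like(side_activations[0], dtype=float)
    for w, a in zip(fusion_weights, side_activations):
        fused += w * a
    return 1.0 / (1.0 + np.exp(-fused))
```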
4 Experiments

In this section, we first provide thorough implementation details on training and testing the proposed blur mapper. We then describe the experimental protocol and compare our mapper with four state-of-the-art methods. Finally, we analyze various aspects of our algorithm, with an emphasis on the role of high-level features. All models are trained and tested with Caffe [jia2014caffe]. The code will be made publicly available.
Data preparation. To the best of our knowledge, the blur detection benchmark built by Shi et al. [shi2014discriminative] is the only publicly available database for this task. It contains images with human-labelled blur regions, among which are partially motion-blurred and are defocus-blurred. Since we are limited by the number of training samples in the existing benchmark, we simply divide it into a training set and a test set. Specifically, the training set contains the images with odd indices, denoted by , and the test set contains the images with even indices, denoted by . Our FCN-based mapper allows input of arbitrary size, so we try various input sizes and find that the proposed algorithm is insensitive to the input image size. We take advantage of this and resize all images to in order to reduce GPU memory cost and speed up training and testing. We also try augmenting the training samples by randomly mirroring, rotating, and scaling the images followed by center cropping, but this yields no noticeable improvement. Therefore, the results reported in the paper are without data augmentation.
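The odd/even split described above is straightforward to reproduce; a minimal sketch, assuming 1-based image indices:

```python
def split_odd_even(num_images):
    """Split a benchmark of `num_images` images (1-based indices) into
    a training set (odd indices) and a test set (even indices)."""
    indices = range(1, num_images + 1)
    train = [i for i in indices if i % 2 == 1]
    test = [i for i in indices if i % 2 == 0]
    return train, test
```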
Optimization. We initialize the layer parameters with weights from a full 16-layer VGGNet pre-trained on the semantic segmentation task [long2015fully] and fine-tune them by SGD with momentum. The training is regularized by weight decay (with the penalty multiplier set to ). The learning rate is initially set to and follows a polynomial decay with a power of . The learning rates for biases are doubled. The batch size is set to images, and the momentum to . The in-network upsampling layer is initialized with, and fixed to, bilinear interpolation; although these interpolation weights are learnable, the additional performance gain is marginal. Learning stops when the maximum number of iterations is reached, and the final weights are used for testing.
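The polynomial learning-rate schedule can be sketched as follows; `base_lr` and `power` are placeholders, since the paper's exact values are not reproduced here:

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomial decay: lr(t) = base_lr * (1 - t / max_iter) ** power."""
    return base_lr * (1.0 - float(iteration) / max_iter) ** power
```

The rate thus starts at `base_lr` and decays smoothly to zero at the maximum iteration, which matches the stopping rule above.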
Comparison with the state-of-the-art. We compare our algorithm with four state-of-the-art methods: Liu08 [liu2008image], Chakrabarti10 [chakrabarti2010analyzing], Su11 [su2011blurred], and Shi14 [shi2014discriminative]. The quantitative performance is evaluated using the precision-recall curve. (We draw the precision-recall curve by concatenating all test images into one vector rather than averaging the curves of individual test images.) We also summarize the performance into an overall score using three standard criteria: (1) the optimal dataset scale (ODS) F-score, obtained by finding an optimal threshold for all images in the dataset; (2) the optimal image scale (OIS) F-score, obtained by averaging the best F-scores over all images; and (3) the average precision (AP), obtained by averaging precision over all recall levels [arbelaez2011contour].
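For concreteness, the two F-score summaries can be sketched as follows. This simplified version averages per-image F-scores (whereas, as noted above, the dataset-level curve is computed over concatenated test images), so it is illustrative only:

```python
import numpy as np

def f_score(pred, gt, threshold):
    """F-score of a thresholded prediction map against a binary ground truth."""
    binary = pred >= threshold
    tp = np.sum(binary & (gt > 0))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(binary)
    recall = tp / np.sum(gt > 0)
    return 2 * precision * recall / (precision + recall)

def ods_ois(preds, gts, thresholds):
    """ODS: one best threshold shared by the whole dataset.
       OIS: the best threshold chosen separately for each image."""
    ods = max(np.mean([f_score(p, g, t) for p, g in zip(preds, gts)])
              for t in thresholds)
    ois = np.mean([max(f_score(p, g, t) for t in thresholds)
                   for p, g in zip(preds, gts)])
    return ods, ois
```

By construction OIS is never below ODS, since the per-image optimum can only improve on a shared threshold.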
The precision-recall curves are shown in Fig. 4. The proposed algorithm achieves the highest precision at all recall levels, where the improvement can be as large as . It is interesting to note that previous methods suffer precision drops at low recall levels. This is no surprise, because traditional methods tend to assign flat regions high blur confidence and misclassify them as blurry even at relatively large thresholds. By contrast, our mapper automatically learns rich discriminative features, especially high-level semantics, which help accurately discriminate flat regions from blurred ones, resulting in nearly perfect precision at low recall levels. Moreover, our method exhibits a less steep decline over the middle recall range . This may result from the accurate classification of structures with and without blurring. Table 1 lists the ODS, OIS, and AP results, from which we observe that our algorithm advances the state-of-the-art by a large margin, with an ODS F-score of .
| Architecture | ODS | OIS | AP |
| --- | --- | --- | --- |
| Training from scratch | 0.833 | 0.876 | 0.856 |
| Fusion (w/o deep supervision) | 0.844 | 0.877 | 0.865 |
| Fusion (with deep supervision) | 0.854 | 0.889 | 0.876 |
To better investigate the effectiveness of the proposed algorithm at detecting local blur, we show some blur maps generated by our algorithm and compare them with those of the most competitive methods, Su11 [su2011blurred] and Shi14 [shi2014discriminative], in Fig. 5. Our algorithm robustly detects local blur in complex foregrounds and backgrounds. First, it handles blurred regions well across different scales, from the small motorcycle (in the -th row) to the big girl (in the -th row). Second, it is capable of identifying flat regions, such as the car body (in the first row), clothes (in the -nd and -th rows), floors (in the -rd row), and the road sign (in the -th row), as non-blurry. Third, it is barely affected by strong structures that remain after blurring and labels those regions correctly. All of this stands in stark contrast with previous methods, which mix up flat and blurry regions with high probability and are severely biased by strong structures after blurring. Moreover, our algorithm labels images with high confidence: nearly of the pixels in the test images have predicted values either larger than (blurry) or smaller than (non-blurry).
The role of high-level features. We conduct a series of experiments to show that the learned high-level features play a crucial role in our blur mapper. We first train our FCN from scratch, without initialization from a pre-trained net that already contains rich high-level semantic information. The results, shown in the first row of Table 2, are unsatisfactory. This is expected, since bad initialization can stall learning due to the instability of gradients in deep nets. By contrast, an initialization that is more informative with respect to the blur mapping task (in this case, from semantic segmentation) is likely to guide SGD toward better local minima and result in a more meaningful blur map (Fig. 6).
We then investigate more advanced network architectures that make greater use of low-level features at shallower layers, including FCN with skip layers (FCN-8s) [long2015fully], weighted fusion of side outputs [xie2015holistically], and weighted fusion of side outputs with deep supervision (DSN) [xie2015holistically]. The results are shown in Table 2 and Fig. 6. We observe that although incorporating low-level features produces blur maps with somewhat finer spatial information, it voids the benefits of high-level features and results in erroneous and non-uniform blur assignment. This is expected, because low-level features mainly encode edge information of an input image and do not help blur detection much. FCN-8s [long2015fully], which treats low-level and high-level features with equal importance, impairs performance the most. The weighted fusion scheme without deep supervision learns to assign importance weights to the side outputs; it turns out that the side outputs generated by deeper convolutional layers are weighted more heavily than those generated by shallower layers. (The learned fusion weights of the five side outputs, from shallow to deep layers, are , respectively.) We observe a slight performance improvement over FCN-8s, as expected. The weighted fusion scheme with deep supervision directly regularizes low-level features using the ground truth and delivers slightly better performance than the proposed approach in terms of ODS and OIS. In summary, the proposed default architecture, which solely interpolates from high-level feature activations, achieves performance comparable to its most complicated variant, DSN, and even ranks best in terms of AP. This demonstrates the central role of high-level features in local blur mapping.
Independence of training sets. We show that the performance of our mapper does not depend on the specific training set by swapping the roles of the training and test sets; in other words, we train the net on and test on this time. We observe in Fig. 7 and Table 3 that similarly superior performance is achieved in terms of the precision-recall curve, ODS, OIS, and AP. This verifies that our mapper does not rely on any specific training set, as long as it is diverse enough to cover natural scenes and the various causes of blur.
More training data. Deep learning algorithms have dominated many computer vision tasks, at least in part due to the availability of large amounts of labelled data for training. However, in local blur mapping, we are limited by the number of training images available in the existing benchmark. Here we want to explore whether more training data further benefit our algorithm. To do this, we randomly sample images from , incorporate them into , and evaluate the result on the remaining images. The result averaged over such trials is reported. We observe that by adding more training images, performance improves from to . This indicates that we may further boost the performance and enhance the robustness of the proposed algorithm by training it with a larger dataset.
In this subsection, we explore three potential applications that benefit from the blur map generated by our mapper: (1) blur region segmentation, (2) blur degree estimation, and (3) blur magnification.
Blur region segmentation. Many interactive image segmentation tools require users to manually create a mask that roughly indicates which parts belong to the foreground and background. The blur map produced by our algorithm provides a useful mask for initializing segmentation without human intervention. Here we adopt GrabCut [rother2004grabcut] and set pixels with blur confidence , , , and as foreground, probable foreground, probable background, and background, respectively. We compare our results with Shi14 [shi2014discriminative] in Fig. 8 and observe that our method does a better job of segmenting images into blurred and clear regions thanks to more accurate blur maps. By contrast, Shi14 [shi2014discriminative] is biased by flat regions in the foreground and blurred structures in the background, and segments images into disconnected parts.
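Mapping blur confidence to a GrabCut initialization mask can be sketched as follows. The numeric label values match OpenCV's `GC_*` constants, while the thresholds are hypothetical placeholders, since the paper's values are not reproduced here:

```python
import numpy as np

# OpenCV GrabCut mask labels (numeric values of the cv2.GC_* constants)
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

def grabcut_mask_from_blur(blur_map, t_fgd=0.9, t_pr_fgd=0.5, t_pr_bgd=0.1):
    """Convert a [0, 1] blur map into a GrabCut initialization mask.
    Thresholds are illustrative placeholders, not the paper's values."""
    mask = np.full(blur_map.shape, GC_BGD, dtype=np.uint8)
    mask[blur_map >= t_pr_bgd] = GC_PR_BGD
    mask[blur_map >= t_pr_fgd] = GC_PR_FGD
    mask[blur_map >= t_fgd] = GC_FGD
    return mask
```

The resulting mask can then be passed to `cv2.grabCut` with the `GC_INIT_WITH_MASK` flag, removing the need for a user-drawn initialization.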
Blur degree estimation. Our blur map can also serve as an estimate of the overall blur degree of an image. To this end, we define a simple blur degree measure as the average value of the corresponding blur map. Fig. 9 shows a set of dog pictures ranked from left to right in order of increasing , from which we can see that our mapper robustly extracts blurred regions with high confidence and that the ranking is in close agreement with human perception.
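The blur degree measure and the resulting ranking amount to a few lines; a minimal sketch:

```python
import numpy as np

def blur_degree(blur_map):
    """Overall blur degree of an image: the mean of its blur map."""
    return float(np.mean(blur_map))

def rank_by_blur_degree(blur_maps):
    """Indices of images sorted from least to most blurred."""
    return sorted(range(len(blur_maps)),
                  key=lambda i: blur_degree(blur_maps[i]))
```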
Blur magnification. With the extracted blurred regions, it is easy to increase defocus for blur magnification [bae2007defocus]. We compare the result obtained using the blur map generated by the proposed algorithm with that of Shi14 [shi2014discriminative] in Fig. 10. It is clear that our method is barely affected by structures that survive blurring and delivers a more perceptually consistent result, with smooth transitions from clear to blurred regions.
5 Conclusion

In this paper, we shed some light on visual blur perception of natural scenes, emphasizing the importance of high-level semantic information. We opt for CNNs as a proper tool to explore high-level features, and develop the first end-to-end, image-to-image blur mapping operator based on FCNs. The proposed algorithm significantly outperforms previous methods and successfully resolves challenging ambiguities, such as distinguishing flat regions from blurred ones. In the future, it remains to be seen how low-level features and high-level semantics interact in the visual system and how they can be used to predict visual blur perception.
The authors would like to thank Dr. Wangmeng Zuo and Dongwei Ren for their insightful comments, and Kai Zhang and Faqiang Wang for sharing their expertise on CNNs.