Dirty Pixels: Optimizing Image Classification Architectures for Raw Sensor Data
Real-world sensors suffer from noise, blur, and other imperfections that make high-level computer vision tasks like scene segmentation, tracking, and scene understanding difficult. Making high-level computer vision networks robust is imperative for real-world applications like autonomous driving, robotics, and surveillance. We propose a novel end-to-end differentiable architecture for joint denoising, deblurring, and classification that makes classification robust to realistic noise and blur. The proposed architecture dramatically improves the accuracy of a classification network in low light and other challenging conditions, outperforming alternative approaches such as retraining the network on noisy and blurry images and preprocessing raw sensor inputs with conventional denoising and deblurring algorithms. The architecture learns denoising and deblurring pipelines optimized for classification whose outputs differ markedly from those of state-of-the-art denoising and deblurring methods, preserving fine detail at the cost of more noise and artifacts. Our results suggest that the best low-level image processing for computer vision is different from existing algorithms designed to produce visually pleasing images. The principles used to design the proposed architecture easily extend to other high-level computer vision tasks and image formation models, providing a general framework for integrating low-level and high-level image processing.
Recent progress in deep learning has made it possible for computers to perform high-level tasks on images, such as classification, segmentation, and scene understanding. High-level computer vision is useful for many real-world applications, including autonomous driving, robotics, and surveillance. Applying deep networks trained for high-level computer vision tasks to the outputs of real-world imaging systems can be difficult, however, because raw sensor data is often corrupted by noise, blur, and other imperfections.
What is the correct way to apply high-level networks to raw sensor data? Do effects such as noise and blur degrade network performance? If so, can the lost performance be regained by cleaning up the raw data with traditional image processing algorithms or by retraining the high-level network on raw data? Or is an entirely new approach to combining low-level and high-level image processing necessary to make deep networks robust?
We examine these questions in the context of image classification under realistic camera noise and blur. We show that realistic noise and blur can substantially reduce the performance of a classification architecture, even after retraining on noisy and blurry images or preprocessing the images with standard denoising and deblurring algorithms. We introduce a new architecture for combined denoising, deblurring, and classification that improves classification performance in difficult scenarios. The proposed architecture is end-to-end differentiable and based on a principled and modular approach to combining low-level image processing with deep architectures. The architecture could be modified to handle a different image formation model or high-level computer vision task. We obtain superior performance by training the low-level image processing pipeline together with the classification network. The images output by the low-level image processing pipeline optimized for classification are qualitatively different from the images output by conventional denoising and deblurring algorithms, scoring worse on traditional reconstruction metrics such as peak signal-to-noise ratio (PSNR).
The proposed architecture for joint denoising, deblurring, and classification makes classification robust and effective in real-world applications. The principles used to design the proposed architecture can be applied to make other high-level computer vision tasks robust to noise and blur, as well as to handle raw sensor data with more complex image formation models, such as RGB-D cameras and general sensor fusion. More broadly, the idea of combining low-level and high-level image processing within a jointly trained architecture opens up new possibilities for all of computational imaging.
Our contributions in this paper are the following:
We introduce a dataset of realistic noise and blur models calibrated from real-world cameras.
We evaluate a classification architecture on images with realistic noise and blur and show substantial loss in performance.
We propose a new end-to-end differentiable architecture that combines denoising and deblurring with classification, based on a principled and modular design inspired by formal optimization that can be applied to other image formation models and high-level tasks.
We demonstrate that the proposed architecture, tuned on noisy and blurry images, substantially improves on the classification accuracy of the original network. The joint architecture outperforms alternative approaches such as fine-tuning the classification architecture alone and preprocessing images with a conventional denoiser or deblurrer.
We highlight substantial qualitative differences between the denoised and deblurred images output by the proposed architecture and those output by conventional denoisers and deblurrers, which suggest that the low-level image processing that is best for high-level computer vision tasks like classification is different than that which is best for producing visually pleasing images.
We evaluate the performance of the proposed architecture primarily in low-light conditions. We focus on classification in low light both because it is important for real-world applications, such as autonomous driving and surveillance at night, and because, out of the broad range of light levels for which we evaluated the classification network, we found the largest drop in accuracy in low light (both with and without blur). If we can mitigate the effects of noise and blur under the most challenging conditions, then we can certainly do so in easier scenarios.
2 Related Work
Effects of noise and blur on high-level networks
A small body of work has explored the effects of noise and blur on deep networks trained for high-level computer vision tasks. Dodge and Karam evaluated a variety of state-of-the-art classification networks under noise and blur and found a substantial drop in performance \shortcitedodge2016understanding. Vasiljevic et al. similarly showed that blur decreased classification and segmentation performance for deep networks, though much of the lost performance was regained by fine-tuning on blurry images \shortcitevasiljevic2016examining. Several authors demonstrated that preprocessing noisy images with trained or classical denoisers improves the performance of trained classifiers [\citenameTang and Eliasmith 2010, \citenameTang et al. 2012, \citenameAgostinelli et al. 2013, \citenameJalalvand et al. 2016, \citenameda Costa et al. 2016]. Chen et al. showed that training a single model for both denoising and classification can improve performance on both tasks \shortcitechen2016joint. To the best of our knowledge we are the first to jointly train a denoiser or deblurrer combined with a high-level computer vision network in a pipeline architecture.
Unrolled optimization algorithms
The low-level image processing in the proposed joint architecture is based on unrolled optimization algorithms. Unrolled algorithms take classical iterative optimization methods, such as forward-backward splitting [\citenameBruck 1975], ISTA and FISTA [\citenameBeck and Teboulle 2009], Cremers-Chambolle-Pock [\citenamePock et al. 2009, \citenameChambolle and Pock 2011], the alternating direction method of multipliers [\citenameGlowinski and Marroco 1975, \citenameBoyd et al. 2001], and half-quadratic splitting [\citenameGeman and Yang 1995], and fix the number of iterations. If each iteration is differentiable in its output with respect to its parameters, the parameters of the unrolled algorithm can be optimized for a given loss through gradient based methods. Ochs et al. developed an unrolled primal-dual algorithm with Bregman distances and showed an application to segmentation \shortciteochs2015bilevel,ochs2016techniques. Schmidt and Roth trained image denoising and deblurring models using unrolled half-quadratic splitting \shortciteschmidt2014shrinkage. Similarly, Chen et al. trained models for denoising and other tasks using an unrolled forward-backward algorithm \shortcitechen2015learning,chen2015trainable. Both Schmidt and Roth and Chen et al. parameterized their models using a field-of-experts prior [\citenameRoth and Black 2005].
Structured neural networks
Unrolled optimization algorithms can be interpreted as structured neural networks, in which the network architecture encodes domain knowledge for a particular task [\citenameWang et al. 2016]. Structured neural networks have been proposed for deblurring [\citenameXu et al. 2014, \citenameSchuler et al. 2014, \citenameChakrabarti 2016, \citenameZhang et al. 2016a], denoising [\citenameZhang et al. 2016b], and demosaicking [\citenameGharbi et al. 2016]. Conventional fully-connected or convolutional neural networks have also been successfully applied to low-level image processing tasks (see, e.g., [\citenameJain and Seung 2009, \citenameXie et al. 2012, \citenameBurger et al. 2012, \citenameDong et al. 2014, \citenameKim et al. 2016]). Another approach to linking traditional optimization methods and neural networks is to train a network for image reconstruction on data preprocessed with an iterative reconstruction algorithm [\citenameSchuler et al. 2013, \citenameJin et al. 2016].
Camera image processing pipelines
Most digital cameras perform low-level image processing such as denoising and demosaicking in a hardware image signal processor (ISP) pipeline based on efficient heuristics [\citenameRamanath et al. 2005, \citenameZhang et al. 2011, \citenameShao et al. 2014]. Heide et al. showed in FlexISP that an approach based on formal optimization outperforms conventional ISPs on denoising, deblurring, and other tasks \shortciteheide2014flexisp. Heide et al. later organized the principles of algorithm design in FlexISP into ProxImaL, a domain specific language for optimization based image reconstruction \shortciteproximal.
3 Realistic image formation model
3.1 Image formation
We consider the image formation for each color channel as

\[ y = \Pi_{[0,1]}\left( a\,\mathcal{P}\!\left(\tfrac{1}{a}(c \circledast x)\right) + \mathcal{N}(0,\, b) \right), \]

where x is the target scene, y is the measured image, a and b are parameters in a Poisson and Gaussian distribution, respectively, c represents the lens point spread function (PSF), \circledast denotes 2D convolution, and \Pi_{[0,1]} denotes projection onto the interval [0, 1]. The measured image thus follows the simple but physically accurate Poisson-Gaussian noise model with clipping described by Foi et al. \shortcitefoi08,foi2009clipped.
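As a concrete illustration, this clipped Poisson-Gaussian model can be simulated in a few lines of NumPy. The sketch below is not our calibration code; the scaling convention for the parameters `a` (Poisson) and `b` (Gaussian variance) is one common choice and is an assumption here, as is the circular boundary handling of the convolution.

```python
import numpy as np

def simulate_measurement(x, psf, a, b, rng=None):
    """Sketch of the clipped Poisson-Gaussian image formation (one channel).

    x: latent scene in [0, 1]; psf: lens PSF; a, b: assumed Poisson
    scaling and Gaussian variance parameters.
    """
    rng = np.random.default_rng() if rng is None else rng
    # 2D convolution with the PSF via the FFT (circular boundary for simplicity)
    blurred = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psf, s=x.shape)))
    # Scaled Poisson component models shot noise; Gaussian models read noise
    shot = a * rng.poisson(np.clip(blurred, 0.0, None) / a)
    read = rng.normal(0.0, np.sqrt(b), size=x.shape)
    # Clip to the sensor's valid range [0, 1]
    return np.clip(shot + read, 0.0, 1.0)
```

With a delta-function PSF the blur step is a no-op, which makes the noise components easy to inspect in isolation.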
For simplicity we did not include subsampling of color channels, as in a Bayer pattern, in the image formation model. Subsampling amplifies the effects of noise and blur, so whatever negative impact noise and blur have on classification accuracy would only be greater if subsampling were taken into account. Nonetheless, we intend to expand the proposed joint denoising, deblurring, and classification architecture to include demosaicking in future work.
We calibrated the parameters a, b, and c of the image formation model from Sec. 3.1 for a variety of real-world cameras. Specifically, the PSFs are estimated using a Bernoulli noise chart with checkerboard features, following Mosleh et al. \shortciteMosleh_2015_CVPR. The lens PSF varies spatially in the camera space, so we divided the field-of-view of the camera into non-overlapping blocks and carried out the PSF estimation for each individual block. Fig. 2(a) shows our PSF calibration setup. Fig. 2(c) shows PSFs for the entire field-of-view of a Nexus 5 rear camera.
To estimate the noise parameters a and b, we took calibration pictures of a chart containing patches of different shades of gray (e.g., [\citenameISO 2014]) at various gains and applied Foi's estimation method \shortcitefoi2009clipped. Fig. 2(b) shows our noise calibration setup. Fig. 2(d) shows plots of a versus ISO and b versus ISO for different ISO levels on a Nexus 6P rear camera. The parameters a and b at a given light level are computed from these plots.
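The core of the noise calibration reduces to fitting the affine variance model var ≈ a·mean + b to per-patch sample statistics. The sketch below is a simplified least-squares stand-in for Foi's full clipped-model estimator; the inputs `means` and `variances` are hypothetical per-patch statistics.

```python
import numpy as np

def fit_noise_params(means, variances):
    """Fit the affine noise model var = a*mean + b by least squares.

    `means`/`variances` are per-patch sample statistics from pictures of
    a gray-patch chart. The slope estimates the Poisson parameter a and
    the intercept the Gaussian variance b. (A simplified stand-in for
    the full clipped-model estimator of Foi et al.)
    """
    # Design matrix [mean, 1] for the affine fit
    X = np.stack([means, np.ones_like(means)], axis=1)
    (a, b), *_ = np.linalg.lstsq(X, variances, rcond=None)
    return a, b
```

In practice the clipped model requires corrections near the saturation points, which the full estimator handles and this sketch does not.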
The noise under our calibrated image formation model can be quite high, especially at low light levels. The noisy image in the teaser figure is an example. Fig. 1 shows a typical capture from a Nexus 5 rear camera in low light, acquired at ISO 3000 with a 30 ms exposure time. The only image processing performed on this image was demosaicking. The severe noise present in the image demonstrates that low and medium light conditions represent a major challenge for imaging and computer vision systems. Note that particularly inexpensive low-end sensors will exhibit drastically worse performance than higher-end smartphone camera modules.
An in-depth description of our calibration procedure is provided in the supplement. Upon acceptance, we will publicly release our dataset of camera PSFs and noise curves.
4 Image Classification under Noise and Blur
We evaluated classification performance under the image formation model from Sec. 3.1, calibrated for a Nexus 5 rear camera. We used PSFs from the center, offaxis, and periphery regions of the camera space. The three PSFs are highlighted in Fig. 2(c). We used noise parameters for a variety of lux levels, ranging from moonlight to standard indoor lighting, derived from the ISO noise curves in Fig. 2(d).
We simulated the image formation model for the chosen PSFs and lux levels on the ImageNet validation set [\citenameDeng et al. 2009]. We then applied the Inception-v4 classification network, one of the state-of-the-art models, to each noised and blurred version of the validation set [\citenameSzegedy et al. 2016]. Table 1 shows Inception-v4's Top-1 and Top-5 classification accuracy for each combination of light level and PSF. The drop in performance for low light levels and for the periphery blur is dramatic. Relative to its performance on the original validation set, the network scores almost 60% worse in both Top-1 and Top-5 accuracy on the combination of the lowest light level and the periphery blur.
The results in Table 1 clearly show that the Inception-v4 network is not robust to realistic noise and blur under low-light conditions. We consider three approaches to improving the classification network's performance in difficult scenarios:
We fine-tune the network on training data passed through the image formation model.
We denoise and deblur images using standard algorithms before feeding them into the network.
We train a novel architecture that combines denoising, deblurring, and classification, which we describe in Sec. 5.
We evaluate all three approaches in Sec. 7.
5 Differentiable Denoising, Deblurring, and Classification Architecture
In this section, we describe the proposed architecture for joint denoising, deblurring, and classification, illustrated in Fig. 3. The architecture combines low-level and high-level image processing units in a pipeline that takes raw sensor data as input and outputs image labels. Our primary contribution is to make the architecture end-to-end differentiable through a principled approach based on formal optimization, allowing us to jointly train low-level and high-level image processing using efficient algorithms such as stochastic gradient descent (SGD). Existing pipeline approaches, such as processing the raw sensor data with a camera ISP before applying a classification network, are not differentiable: the pipeline output cannot be differentiated with respect to the free parameters of the low-level image processing unit.
We base the low-level image processing unit on the shrinkage fields model, a differentiable architecture for Gaussian denoising and deblurring that achieves near state-of-the-art reconstruction quality [\citenameSchmidt and Roth 2014]. We modify the shrinkage fields model using ideas from convolutional neural networks (CNNs) in order to increase the model capacity and make it better suited for training with SGD. We also show how the model can be adapted to handle Poisson-Gaussian noise while preserving differentiability using the generalized Anscombe transform [\citenameFoi and Makitalo 2013].
Any differentiable classification network can be used in the proposed pipeline architecture. We use the Inception-v4 convolutional neural network (CNN) evaluated in Sec. 4 [\citenameSzegedy et al. 2016]. The proposed architecture can be adapted to other high-level computer vision tasks such as segmentation, object detection, tracking, and scene understanding by replacing the classification network with a network for the given task.
The outline of the section is as follows. In Sec. 5.1, we motivate the shrinkage fields algorithm through a connection to Bayesian models and formal optimization. In Sec. 5.2, we review the previously proposed shrinkage fields algorithm. In Sec. 5.3, we explain how we modify shrinkage fields to incorporate ideas from CNNs. In Sec. 5.4 and 5.5, we present the low-level image processing units for Poisson-Gaussian denoising and joint denoising and deconvolution. In Sec. 5.6, we explore the connections between the proposed low-level image processing units and structured neural networks.
5.1 Background and motivation
The proposed low-level image processing unit and the shrinkage fields model are inspired by the extensive literature on solving inverse problems in imaging via maximum-a-posteriori (MAP) estimation under a Bayesian model. In the Bayesian model, an unknown image x is drawn from a prior distribution \Omega(\theta) with parameters \theta. The sensor applies a linear operator A to the image x, and then measures an image y drawn from a noise distribution \nu(Ax).
Let p(x) be the probability of sampling x from \Omega(\theta) and p(y \mid Ax) be the probability of sampling y from \nu(Ax). Then the probability of an unknown image x yielding an observation y is proportional to p(y \mid Ax)\,p(x).
The MAP estimate of x is given by

\[ \hat{x} = \operatorname*{arg\,min}_{x} \; f(Ax, y) + r(x), \tag{1} \]

where the data term f(Ax, y) = -\log p(y \mid Ax) and the prior r(x) = -\log p(x) are negative log-likelihoods. Computing \hat{x} thus involves solving an optimization problem [\citenameBoyd and Vandenberghe 2004, Chap. 7].
Many algorithms have been developed for solving problem (1) efficiently for different data terms f and priors r (e.g., FISTA [\citenameBeck and Teboulle 2009], Cremers-Chambolle-Pock [\citenameChambolle and Pock 2011], ADMM [\citenameBoyd et al. 2001]). The majority of these algorithms are iterative methods, in which a mapping is applied repeatedly to generate a series of iterates x^1, x^2, \ldots that converge to the solution x^\star, starting with an initial point x^0.
Iterative methods are usually terminated based on a stopping condition that ensures theoretical convergence properties. An alternative approach, known as unrolled optimization, is to execute a pre-determined number of iterations N. Fixing the number of iterations allows us to view the iterative method as an explicit function of the initial point x^0. Parameters such as the step size may be fixed across all iterations or vary by iteration. One can interpret varying parameters as adaptive step sizes or as applying a single iteration of N different algorithms.
If each iteration of the unrolled optimization is differentiable, the gradient of a loss function on x^N with respect to \theta and the other parameters can be computed efficiently through backpropagation. We can thereby optimize the algorithm for a reconstruction metric such as PSNR, or even for the loss of a high-level network that operates on x^N (such as Inception-v4).
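To make unrolling concrete, the sketch below runs gradient descent on a quadratic data term for a fixed number of iterations, so the final iterate is an explicit function of the initial point and the per-iteration step sizes; in a framework with automatic differentiation, gradients with respect to those step sizes come for free. The function name and signature are illustrative, not from the paper.

```python
import numpy as np

def unrolled_gradient_descent(A, y, x0, step_sizes):
    """Run a fixed number of gradient steps on f(x) = 0.5*||A x - y||^2.

    Unlike a convergence-based loop, the iteration count is fixed by
    len(step_sizes), so the output is an explicit (differentiable)
    function of x0 and the per-iteration step sizes.
    """
    x = x0
    for t in step_sizes:          # one unrolled iteration per step size
        grad = A.T @ (A @ x - y)  # gradient of the quadratic data term
        x = x - t * grad
    return x
```

Varying `step_sizes` across iterations corresponds to the per-iteration parameters discussed above.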
The choice of data term f is based on the physical characteristics of the sensor, which determine the image formation and noise model. The choice of prior r is far less clear and has been the subject of extensive research. Classical priors are based on sparsity in a particular (dual) basis, i.e., r(x) = \|Dx\| for some linear operator D and some norm (or pseudo-norm) \|\cdot\|. For example, when D is the discrete gradient operator and \|\cdot\| is the \ell_1 norm, r is an anisotropic total-variation prior [\citenameRudin et al. 1992]. Other widely used hand-crafted bases include the discrete cosine transform (DCT) and wavelets [\citenameAhmed et al. 1974, \citenameDaubechies 1992].
Hand-crafted bases have few if any parameters. We need a richer parameterization in order to learn D. The most flexible parameterization for images assumes that D can be partitioned as

\[ Dx = (D_1 x, \ldots, D_k x), \]

where the operators D_i are linear and translation invariant. It follows that each D_i is given by convolution with some filter f_i. Learning D from data means learning the filters f_1, \ldots, f_k.
The norm \|\cdot\|, and hence r, can also be learned from data. Many iterative methods do not evaluate r directly, but instead access r via its (sub)gradient or proximal operator. The proximal operator of r is defined as

\[ \operatorname{prox}_r(v) = \operatorname*{arg\,min}_{x} \; r(x) + \frac{1}{2}\|x - v\|_2^2. \]

It can thus be simpler to learn the gradient or proximal operator of r directly and define r implicitly. For ease of exposition, we assume from now on that we learn \operatorname{prox}_r.
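For intuition, a classical example is the \ell_1 norm, whose proximal operator has the closed-form soft-thresholding solution:

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of r(x) = lam * ||x||_1 (soft thresholding).

    Solves argmin_x lam*||x||_1 + 0.5*||x - v||^2 in closed form; a
    standard example of accessing a prior through its proximal operator
    rather than evaluating it directly.
    """
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
```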
A common assumption in the literature is that \operatorname{prox}_r is fully separable, meaning that given a multi-channel image x, the output (\operatorname{prox}_r(x))_{k,i,j} is a function only of x_{k,i,j} [\citenameRoth and Black 2005]. Under this assumption, \operatorname{prox}_r can be parameterized using radial basis functions (RBFs) or any other basis for univariate functions. It is also common to assume that \operatorname{prox}_r is uniform across pixels. In other words, for a given channel k, the function is the same for all pixels (i, j). We then only need one parameterization per channel, and the parameterization does not depend on the height and width of the image. The parameterization of D and \operatorname{prox}_r described above is known as the field-of-experts [\citenameRoth and Black 2005].
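A minimal sketch of such a fully separable, per-pixel parameterization, assuming Gaussian RBFs with fixed centers and learned weights (the centers, weights, and bandwidth here are placeholders, not the paper's trained values):

```python
import numpy as np

def rbf_shrinkage(v, centers, weights, gamma=1.0):
    """Univariate shrinkage function as a weighted sum of Gaussian RBFs.

    A sketch of the fully separable field-of-experts parameterization:
    the same learned scalar function (weights over fixed RBF centers)
    is applied independently to every pixel of a channel.
    """
    # Shape (..., n_centers): evaluate each basis function at each pixel
    phi = np.exp(-gamma * (v[..., None] - centers) ** 2)
    return phi @ weights
```

Because the function is scalar-to-scalar, a modest number of centers already covers a fixed input range, which is exactly the capacity limitation discussed in Sec. 5.3.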
5.2 Shrinkage fields
The shrinkage fields model is an unrolled version of the half-quadratic splitting (HQS) algorithm with the field-of-experts parameterization of D and \operatorname{prox}_r described in Sec. 5.1 [\citenameSchmidt and Roth 2014, \citenameGeman and Yang 1995]. Fig. 4 illustrates the model. HQS is an iterative method for solving problem (1) when f(Ax, y) = \frac{1}{2}\|Ax - y\|_2^2, i.e., the noise model is Gaussian. HQS is ideal for unrolled optimization because it can converge in far fewer iterations than other iterative methods (fewer than 10). HQS lacks the robustness and broad asymptotic convergence guarantees of other iterative methods, but these deficiencies are irrelevant for unrolled optimization with learned parameters.
The HQS algorithm as applied to the optimization problem

\[ \operatorname*{minimize}_{x,\,z} \; f(Ax, y) + r(z) + \frac{\rho}{2}\|x - z\|_2^2, \]

with optimization variables x and z, alternately minimizes over x and z at each iteration while increasing the penalty parameter \rho. Minimizing over z amounts to computing the proximal operator of r (scaled by 1/\rho). Minimizing over x is a least-squares problem, whose solution for f(Ax, y) = \frac{1}{2}\|Ax - y\|_2^2 is given by

\[ x = (A^{\top} A + \rho I)^{-1}(A^{\top} y + \rho z). \]
When A is a convolution, x can be computed efficiently through inverse filtering in the Fourier domain.
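A sketch of the HQS least-squares x-update for a convolutional A, assuming circular boundary conditions so the solve diagonalizes in the Fourier domain (the function name and normalization conventions are illustrative):

```python
import numpy as np

def hqs_x_update(z, y, psf, rho):
    """Least-squares x-update of HQS when A is convolution with `psf`.

    Solves (A^T A + rho I) x = A^T y + rho z in closed form via the FFT
    ("inverse filtering"), assuming circular boundary conditions.
    """
    C = np.fft.fft2(psf, s=y.shape)                   # frequency response of A
    numer = np.conj(C) * np.fft.fft2(y) + rho * np.fft.fft2(z)
    denom = np.abs(C) ** 2 + rho                      # diagonalized A^T A + rho I
    return np.real(np.fft.ifft2(numer / denom))
```

With a delta-function PSF (A = I) the update reduces to the weighted average (y + rho*z) / (1 + rho), a quick sanity check.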
5.3 CNN proximal operator
Schmidt and Roth achieve near state-of-the-art denoising and deblurring results by optimizing the shrinkage fields model for average reconstruction PSNR using the L-BFGS algorithm. Their RBF parameterization of \operatorname{prox}_r, however, suffers from several deficiencies. The most significant is its low representational capacity. Since the RBF parameterization is fully separable, it cannot exploit cross-channel correlations. Moreover, a small number of basis functions suffices to represent arbitrary univariate functions over a fixed range with reasonable precision, so increasing the number of basis functions or adding additional RBF layers has minimal benefit; we therefore cannot trade off computational complexity against representational capacity. A more subtle issue is that storing the gradient of \operatorname{prox}_r with respect to the RBF parameters is memory intensive, which makes optimization via backpropagation challenging on GPUs. We discuss the details of memory usage in the supplement.
In order to correct the deficiencies of the RBF parameterization, we propose a CNN parameterization of the proximal operator. In particular, we parameterize \operatorname{prox}_r as a CNN with 1×1 kernels, stride 1, and ReLU nonlinearities. This is the same as applying a single fully connected network across the channels, independently at every pixel. The proposed representation is separable and uniform across pixels, but not separable across channels. In other words, the output at a pixel is a function of all channels of the input at that pixel.
A CNN parameterization of the proximal operator has far greater representational capacity than an RBF representation. A CNN can exploit cross-channel correlations, and we can trade off computational complexity and representational capacity by adding more layers or more units to the hidden layers. We could also increase the representational power of the CNN by breaking our assumption that the proximal operator is separable across pixels and expanding the CNN kernel size.
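The following sketch illustrates a 1×1-convolution proximal operator in plain NumPy (the actual unit is built in TensorFlow; the layer count, widths, and parameter values here are placeholders):

```python
import numpy as np

def cnn_prox(x, weights, biases):
    """Sketch of a CNN proximal operator built from 1x1 convolutions.

    x has shape (channels, H, W). Each layer is a 1x1 convolution --
    i.e., the same fully connected map applied independently at every
    pixel -- followed by a ReLU, so the operator is uniform across
    pixels but mixes channels.
    """
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        # 1x1 conv: matrix multiply over the channel dimension only
        h = np.einsum('oc,chw->ohw', W, h) + b[:, None, None]
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU on all but the last layer
    return h
```

Replacing the 1×1 kernels with larger spatial kernels is exactly the capacity extension mentioned above, at the cost of breaking per-pixel separability.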
A further advantage of the CNN representation is that we benefit from the prolific research on optimizing CNNs. Questions such as how to initialize the kernel weights, how to regularize the weights, what optimization algorithm to use, and how to design the network so gradient information flows through it effectively have been explored in depth.
Merging color channels
Applying the shrinkage fields architecture with either an RBF or a CNN proximal operator to grayscale images is straightforward. If the image has multiple color channels, then multiple approaches are possible. One approach is to apply the architecture to each color channel separately with the same parameters. One could also use different parameters for different channels.
A more sophisticated approach is to merge the color channels within the proximal operator by summing the channels' filter responses before applying the learned pointwise nonlinearity. The output of the proximal operator is then copied onto each color channel, and the adjoint filter response is computed for that channel. Figure 5 shows the proximal operator with merged color channels. Merging the color channels is equivalent to representing the per-channel filters as a single filter with multiple channels. We experiment in Sec. 7 with both keeping the color channels separate and merging the color channels.
5.4 Poisson-Gaussian denoising
For denoising, the image formation model has A = I (no blur), and the noise model is the Poisson-Gaussian noise discussed in Sec. 3.1. To specialize the shrinkage fields architecture to denoising, we take a three-step approach. First, we apply the generalized Anscombe transform to the measured image to convert the Poisson-Gaussian noise into IID Gaussian noise [\citenameFoi and Makitalo 2013]. Then, we apply the shrinkage fields architecture with A = I and a CNN representation of the proximal operator. The operations inside the low-level image processing unit in Fig. 3 illustrate an iteration of shrinkage fields for denoising. Lastly, we apply the inverse generalized Anscombe transform to convert the image back into its original domain. The full denoising unit is depicted in Fig. 6. The generalized Anscombe transform and its inverse are differentiable functions, so the Poisson-Gaussian denoising unit is differentiable. Note that the least-squares solve from Fig. 4 is still computed through inverse filtering, since the identity is a (trivial) convolution.
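For reference, one common form of the forward generalized Anscombe transform and a naive algebraic inverse are sketched below. The parameter convention (`a` Poisson scaling, `b` Gaussian variance) is an assumption, and the exact inverse used in practice is the more involved unbiased inverse of Foi and Makitalo rather than this simple algebraic one.

```python
import numpy as np

def gat(z, a, b):
    """Forward generalized Anscombe transform (one common form).

    Approximately maps Poisson-Gaussian noise with Poisson scaling `a`
    and Gaussian variance `b` to unit-variance Gaussian noise, so a
    Gaussian denoiser can be applied in the transformed domain.
    """
    return (2.0 / a) * np.sqrt(np.maximum(a * z + 0.375 * a**2 + b, 0.0))

def gat_inverse_algebraic(t, a, b):
    """Naive algebraic inverse of the forward transform (biased)."""
    return ((a * t / 2.0) ** 2 - 0.375 * a**2 - b) / a
```

Both directions are smooth almost everywhere, which is what makes the denoising unit end-to-end differentiable.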
5.5 Joint deblurring and denoising
For joint denoising and deblurring, the image formation model is Ax = c \circledast x for a known PSF c, and the noise model is the Poisson-Gaussian noise discussed in Sec. 3.1. We cannot apply the generalized Anscombe transform because it would make the image formation model nonlinear. Instead, we approximate the Poisson-Gaussian noise as Gaussian noise and apply the shrinkage fields architecture with A the convolution with c and a CNN representation of the proximal operator. The joint denoising and deblurring architecture is the low-level image processing unit depicted in Fig. 3. Inverse filtering is used to solve the HQS least-squares subproblem, as in the denoising unit.
5.6 Structured neural network interpretation
The proposed denoising and joint denoising and deblurring units can be viewed either as unrolled optimization algorithms or as structured neural networks, in which the network architecture encodes domain knowledge for a particular task [\citenameWang et al. 2016]. The advantage of structured neural networks is that they can exploit domain knowledge, in our case the image formation model. However, there is no standard template for encoding domain knowledge as network structure.
The proposed low-level image processing unit offers just such a template for image reconstruction tasks. We designed our architecture as an unrolled HQS algorithm, which assumes a quadratic data term, but one could combine the CNN parameterization of the prior’s proximal operator (or gradient) with other unrolled optimization algorithms that make fewer assumptions. Potential applications exist across all of computational imaging, including depth estimation, sensor fusion, scientific imaging, and medical imaging, to name a few.
For many image reconstruction and image understanding tasks general purpose neural network architectures are sufficient. We believe there are some cases, however, where encoding domain knowledge in the network at a minimum accelerates training and reduces the risk of overfitting and in the best case improves overall performance.
6 Training
We built all models in the TensorFlow framework [\citenameAbadi et al. 2015]. For the low-level image processing unit, we used only 1 layer of unrolled HQS due to computational constraints. We used filters with stride 1 for the linear operator and 3 convolutional layers with 24 channels for the CNN proximal operator. We kept the color channels separate for the denoising unit but merged them in the CNN proximal operator for the joint deblurring and denoising unit.
We initialized the joint architecture with pretrained parameters for both the low-level image processing unit and Inception-v4. We pretrained the denoising units by optimizing for average PSNR on 2000 grayscale image patches from the BSDS300 training dataset [\citenameMartin et al. 2001]. Similarly, we pretrained the joint denoising and deblurring units by optimizing for average PSNR on 200 color image patches from BSDS300. We used 400 iterations of the L-BFGS optimization algorithm in all cases. We discuss the L-BFGS initialization for the low-level image processing units and further details of the units’ parameterization in the supplement.
The Inception-v4 and joint architectures were fine-tuned on the full ImageNet training set passed through the image formation model in Sec. 3.1. We used RMSProp [\citenameTieleman and Hinton 2012] with a decay of 0.9 and a learning rate exponentially decayed by a factor of 0.94 every epoch. We trained all models for 2 epochs. Fine-tuning took 1 day for the Inception-v4 models, 5 days for the joint denoising and classification models, and 3 days for the joint denoising, deblurring, and classification models. We used 4 NVIDIA Tesla K80 GPUs per model.
| Method | 3 lux | 6 lux | 6 lux + Offaxis PSF | 6 lux + Periphery PSF |
|---|---|---|---|---|
| NLM + Pretrained | 32.19% / 52.12% | 56.04% / 77.85% | – | – |
| BM3D + Pretrained | 47.43% / 70.35% | 67.00% / 87.47% | – | – |
| HQS + Pretrained | – | – | 48.64% / 70.98% | 36.21% / 57.04% |

Entries report Top-1 / Top-5 classification accuracy.
7 Evaluation
We selected four combinations of light level and camera region from Table 1 for which the classification accuracy of the pretrained Inception-v4 network on the ImageNet validation set dropped substantially relative to its performance without noise or blur. We evaluated three methods of improving the classification accuracy:
We applied conventional denoising and deblurring algorithms to the noised and blurred validation set images, then evaluated the pretrained network on the processed images.
We fine-tuned Inception-v4 on the full set of ImageNet training images passed through the image formation model for the given light level and PSF.
We fine-tuned the joint denoising, deblurring, and classification architecture described in Sec. 5 on the ImageNet training images passed through the image formation model.
Table 2 summarizes the results. For the two cases with noise but no blur, denoising the images with non-local means (NLM) [\citenameBuades et al. 2005] decreased Top-1 and Top-5 classification accuracy by over 10%, while denoising with BM3D [\citenameDanielyan et al. 2012] increased Top-1 and Top-5 accuracy by a few percent. For the combination of 6 lux illumination and offaxis blur, denoising and deblurring the images with HQS [\citenameGeman and Yang 1995] decreased Top-1 and Top-5 classification accuracy by over 10%. For the combination of 6 lux illumination and periphery blur, preprocessing the images with HQS decreased accuracy by a few percent. Overall, denoising and deblurring the images with conventional algorithms at best marginally increased classification accuracy and in many cases substantially decreased performance.
For all four cases, fine-tuning Inception-v4 on images passed through the image formation model improved classification accuracy substantially, by tens of percentage points. The Top-1 and Top-5 classification accuracies were still lower than in the noise- and blur-free case, but the gap was reduced dramatically.
The highest classification accuracy, however, was obtained by the joint architecture. Top-1 accuracy was up to 5.1% higher than the fine-tuned classifier, and Top-5 accuracy was up to 3.7% higher. The benefit of tuning the denoiser and deblurrer with the classification network was far greater than that of combining a conventional denoiser or deblurrer with the pretrained network.
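The Top-1 and Top-5 metrics reported throughout this comparison count a prediction as correct if the true label appears among the k highest-scoring classes. A minimal reference implementation (hypothetical helper, not from the paper's codebase):

```python
def topk_accuracy(logits_batch, labels, k=1):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    correct = 0
    for logits, label in zip(logits_batch, labels):
        topk = sorted(range(len(logits)), key=lambda c: logits[c], reverse=True)[:k]
        if label in topk:
            correct += 1
    return correct / len(labels)
```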
The results in Table 2 raise the question of why the jointly tuned denoising and deblurring algorithms were so much more helpful to the classifier than conventional algorithms. The images in Figure 7 suggest an answer. Figure 7 shows an image for each of the four combinations of noise and blur that was incorrectly classified by the pretrained Inception-v4 network but correctly classified by the joint architecture. The noised and blurred image is shown, as well as the noised and blurred image denoised and deblurred by a conventional algorithm and by the jointly tuned denoising and deblurring unit. The label assigned by the classifier is given in each instance, as well as the PSNR relative to the original image.
The images output by the conventional denoising and deblurring algorithms contain far less noise than the original noisy and blurry image. Fine details in many regions are blurred out, however. The image output by the jointly tuned denoising and deblurring unit, by contrast, contains more noise but preserves more fine detail.
By conventional metrics of restoration quality such as PSNR and visual sharpness, the jointly tuned denoising and deblurring unit is worse than conventional algorithms. We can see though that it preserves fine detail that is useful to the classification network, while still denoising and deblurring the image to an extent. The qualitative results suggest that the reconstruction algorithms and metrics used to make images visually pleasing to humans are not appropriate for high-level computer vision architectures.
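The PSNR values reported in Figure 7 follow the standard definition, 10 log10(peak² / MSE), relative to the clean original. A compact version for images with values in [0, 1] (an illustrative helper, not the paper's evaluation code):

```python
import math

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-size images (nested lists)."""
    n, se = 0, 0.0
    for row_r, row_t in zip(reference, test):
        for r, t in zip(row_r, row_t):
            se += (r - t) ** 2
            n += 1
    mse = se / n
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)
```

Note that a higher PSNR means the restoration is closer to the original in a mean-squared sense, which, as the qualitative results show, does not imply it is more useful to a classifier.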
In summary, we showed that the performance of a classification architecture degrades significantly under realistic noise and blur. We made classification robust to noise and blur by introducing a novel, fully differentiable architecture that combines denoising, deblurring, and classification. The architecture is based on unrolled optimization algorithms and can easily be adapted to a different high-level task or image formation model.
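To make the unrolling idea concrete, here is a heavily simplified half-quadratic-splitting (HQS) loop for the denoising problem min_x ||x − y||² + prior(x), run for a fixed number of iterations so every step is differentiable. The soft-thresholding proximal operator below is a stand-in; in the proposed architecture the proximal step is a learned CNN, and the iteration count, penalty, and operator parameters are trained end to end:

```python
def soft_shrink(v, t):
    """Soft-thresholding: placeholder for the learned proximal operator."""
    return [max(abs(u) - t, 0.0) * (1 if u >= 0 else -1) for u in v]

def unrolled_hqs(y, iterations=3, rho=1.0, thresh=0.05):
    """Fixed number of HQS iterations, alternating a closed-form quadratic
    data-fidelity step with a proximal step on the prior."""
    z = list(y)
    for _ in range(iterations):
        # Quadratic step: minimize ||x - y||^2 + rho * ||x - z||^2 in closed form.
        x = [(yi + rho * zi) / (1.0 + rho) for yi, zi in zip(y, z)]
        # Proximal step on the prior (a CNN in the joint architecture).
        z = soft_shrink(x, thresh)
    return z
```

Because the loop is a fixed computation graph rather than an iterate-to-convergence solver, gradients from a downstream classification loss flow back through every iteration.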
We demonstrate that the proposed architecture dramatically improves classification accuracy under noise and blur, surpassing other approaches such as fine-tuning the classification network on blurry and noisy images and preprocessing images with a conventional denoising or deblurring algorithm. We highlight major qualitative differences between the denoised and deblurred images produced as intermediate representations in the proposed architecture and the output of conventional denoising and deblurring algorithms. Our results suggest that the image processing most helpful to a deep network for classification or some other high-level task is different from the traditional image processing designed to produce visually pleasing images.
As discussed in Sec. 3, our image formation model is an accurate representation of real-world cameras. The only major simplification in our model is that we do not handle demosaicking. We intend to add demosaicking to our low-level image processing unit in future work. Another obstacle to real-world deployment of the proposed architecture is that the noise level and PSF must be known at runtime, both because they are hard-coded into the low-level image processing unit and because the architecture is trained on a specific noise level and PSF.
A simple solution to the dependence on the noise level and PSF is to train an ensemble of models for different noise levels and PSFs, run them all when classifying an image, and assign the label with the highest confidence. Another possible approach, which we will explore in future work, is to parameterize the model with the noise level so that retraining for different noise levels is not required.
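The ensemble strategy can be sketched as follows. The function name and the callable `models` interface are hypothetical, chosen only to illustrate the selection rule: run one model per (noise level, PSF) hypothesis and keep the most confident prediction.

```python
def classify_with_ensemble(raw_image, models):
    """Run one model per (noise level, PSF) hypothesis and return the
    label of the single most confident prediction across the ensemble."""
    best_label, best_conf = None, -1.0
    for model in models:
        probs = model(raw_image)  # list of class probabilities
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > best_conf:
            best_label, best_conf = label, probs[label]
    return best_label, best_conf
```

The obvious cost is inference time linear in the number of hypotheses, which is what motivates the noise-parameterized alternative mentioned above.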
In the future, we will apply the principles outlined in this paper for designing joint architectures that combine low-level and high-level image processing to many other problems in computational photography and computational imaging. We believe unrolled optimization algorithms with CNN proximal operators (or gradients) can achieve state-of-the-art results for many generative and discriminative tasks.
We will also expand the proposed architecture to model the camera lens and sensors. Just as we optimized the denoiser and deblurrer for classification, we aim to optimize the lens, color subsampling pattern, and other elements of the imaging system for the given high-level vision task. We plan on investigating a similar approach to optimizing the optical system for other imaging modalities as well.
In the future most images taken by cameras and other imaging systems will be consumed by high-level computer vision architectures, not by humans. We must reexamine the foundational assumptions of image processing in light of this momentous change. Image reconstruction algorithms designed to produce visually pleasing images for humans are not necessarily appropriate for computer vision pipelines. We have proposed one approach to redesigning low-level image processing to better serve high-level imaging tasks, in a way that incorporates and benefits from knowledge of the physical image formation model. But ours is only the first, not the final word in a promising new area of research.
We thank Emmanuel Onzon for calibrating the camera PSFs and noise curves and explaining the calibration procedure. We thank Lei Xiao for helpful discussions.
- [\citenameAbadi et al. 2015] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- [\citenameAgostinelli et al. 2013] Agostinelli, F., Anderson, M. R., and Lee, H. 2013. Adaptive multi-column deep neural networks with application to robust image denoising. In Advances in Neural Information Processing Systems, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds., 1493–1501.
- [\citenameAhmed et al. 1974] Ahmed, N., Natarajan, T., and Rao, K. 1974. Discrete cosine transform. IEEE Transactions on Computers C-23, 1, 90–93.
- [\citenameBeck and Teboulle 2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2, 1, 183–202.
- [\citenameBoyd and Vandenberghe 2004] Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
- [\citenameBoyd et al. 2011] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1, 1–122.
- [\citenameBruck 1975] Bruck, R. 1975. An iterative solution of a variational inequality for certain monotone operators in Hilbert space. Bulletin of the American Mathematical Society 81, 5 (Sept.), 890–892.
- [\citenameBuades et al. 2005] Buades, A., Coll, B., and Morel, J.-M. 2005. A non-local algorithm for image denoising. In Proc. IEEE CVPR, vol. 2, 60–65.
- [\citenameBurger et al. 2012] Burger, H., Schuler, C., and Harmeling, S. 2012. Image denoising: Can plain neural networks compete with BM3D? In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2392–2399.
- [\citenameChakrabarti 2016] Chakrabarti, A. 2016. A neural approach to blind motion deblurring. In Proceedings of the European Conference on Computer Vision.
- [\citenameChambolle and Pock 2011] Chambolle, A., and Pock, T. 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40, 1, 120–145.
- [\citenameChen and Pock 2015] Chen, Y., and Pock, T. 2015. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. arXiv preprint arXiv:1508.02848.
- [\citenameChen et al. 2015] Chen, Y., Yu, W., and Pock, T. 2015. On learning optimized reaction diffusion processes for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5261–5269.
- [\citenameChen et al. 2016] Chen, G., Li, Y., and Srihari, S. 2016. Joint visual denoising and classification using deep learning. In Proceedings of the IEEE International Conference on Image Processing, 3673–3677.
- [\citenameda Costa et al. 2016] da Costa, G. B. P., Contato, W. A., Nazare, T. S., Neto, J. E., and Ponti, M. 2016. An empirical study on the effects of different types of noise in image classification tasks. arXiv preprint arXiv:1609.02781.
- [\citenameDanielyan et al. 2012] Danielyan, A., Katkovnik, V., and Egiazarian, K. 2012. BM3D frames and variational image deblurring. IEEE Trans. Image Processing 21, 4, 1715–1728.
- [\citenameDaubechies 1992] Daubechies, I. 1992. Ten lectures on wavelets, vol. 61. SIAM.
- [\citenameDeng et al. 2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 248–255.
- [\citenameDodge and Karam 2016] Dodge, S., and Karam, L. 2016. Understanding how image quality affects deep neural networks. arXiv preprint arXiv:1604.04004.
- [\citenameDong et al. 2014] Dong, C., Loy, C. C., He, K., and Tang, X. 2014. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, 184–199.
- [\citenameFoi and Makitalo 2013] Foi, A., and Makitalo, M. 2013. Optimal inversion of the generalized Anscombe transformation for Poisson-Gaussian noise. IEEE Trans. Image Process. 22, 1, 91–103.
- [\citenameFoi et al. 2008] Foi, A., Trimeche, M., Katkovnik, V., and Egiazarian, K. 2008. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Trans. Image Process. 17, 10, 1737–1754.
- [\citenameFoi 2009] Foi, A. 2009. Clipped noisy images: Heteroskedastic modeling and practical denoising. Signal Processing 89, 12, 2609–2629.
- [\citenameGeman and Yang 1995] Geman, D., and Yang, C. 1995. Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Processing 4, 7, 932–946.
- [\citenameGharbi et al. 2016] Gharbi, M., Chaurasia, G., Paris, S., and Durand, F. 2016. Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG) 35, 6, 191.
- [\citenameGlowinski and Marroco 1975] Glowinski, R., and Marroco, A. 1975. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique 9, 2, 41–76.
- [\citenameHeide et al. 2014] Heide, F., Steinberger, M., Tsai, Y.-T., Rouf, M., Pajak, D., Reddy, D., Gallo, O., Liu, J., Heidrich, W., Egiazarian, K., Kautz, J., and Pulli, K. 2014. FlexISP: A flexible camera image processing framework. ACM Trans. Graph. (SIGGRAPH Asia) 33, 6.
- [\citenameHeide et al. 2016] Heide, F., Diamond, S., Niessner, M., Ragan-Kelley, J., Heidrich, W., and Wetzstein, G. 2016. ProxImaL: Efficient image optimization using proximal algorithms. ACM Trans. Graph. 35, 4.
- [\citenameISO 2014] 2014. ISO 12233:2014 Photography – Electronic still picture imaging – Resolution and spatial frequency responses.
- [\citenameJain and Seung 2009] Jain, V., and Seung, S. 2009. Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, 769–776.
- [\citenameJalalvand et al. 2016] Jalalvand, A., Neve, W. D., de Walle, R. V., and Martens, J. 2016. Towards using reservoir computing networks for noise-robust image recognition. In Proceedings of the International Joint Conference on Neural Networks, 1666–1672.
- [\citenameJin et al. 2016] Jin, K. H., McCann, M. T., Froustey, E., and Unser, M. 2016. Deep convolutional neural network for inverse problems in imaging. arXiv preprint arXiv:1611.03679.
- [\citenameKim et al. 2016] Kim, J., Lee, J., and Lee, K. 2016. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1646–1654.
- [\citenameMartin et al. 2001] Martin, D., Fowlkes, C., Tal, D., and Malik, J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, vol. 2, 416–423.
- [\citenameMosleh et al. 2015] Mosleh, A., Green, P., Onzon, E., Begin, I., and Pierre Langlois, J. 2015. Camera intrinsic blur kernel estimation: A reliable framework. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [\citenameOchs et al. 2015] Ochs, P., Ranftl, R., Brox, T., and Pock, T. 2015. Bilevel optimization with nonsmooth lower level problems. In International Conference on Scale Space and Variational Methods in Computer Vision, Springer, 654–665.
- [\citenameOchs et al. 2016] Ochs, P., Ranftl, R., Brox, T., and Pock, T. 2016. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 1–20.
- [\citenamePock et al. 2009] Pock, T., Cremers, D., Bischof, H., and Chambolle, A. 2009. An algorithm for minimizing the Mumford-Shah functional. In Proceedings of the IEEE International Conference on Computer Vision, 1133–1140.
- [\citenameRamanath et al. 2005] Ramanath, R., Snyder, W., Yoo, Y., and Drew, M. 2005. Color image processing pipeline in digital still cameras. IEEE Signal Processing Magazine 22, 1, 34–43.
- [\citenameRoth and Black 2005] Roth, S., and Black, M. J. 2005. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, IEEE, 860–867.
- [\citenameRudin et al. 1992] Rudin, L., Osher, S., and Fatemi, E. 1992. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, 1–4, 259–268.
- [\citenameSchmidt and Roth 2014] Schmidt, U., and Roth, S. 2014. Shrinkage fields for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2774–2781.
- [\citenameSchuler et al. 2013] Schuler, C. J., Christopher Burger, H., Harmeling, S., and Scholkopf, B. 2013. A machine learning approach for non-blind image deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [\citenameSchuler et al. 2014] Schuler, C., Hirsch, M., Harmeling, S., and Schölkopf, B. 2014. Learning to deblur. In NIPS 2014 Deep Learning and Representation Learning Workshop.
- [\citenameShao et al. 2014] Shao, L., Yan, R., Li, X., and Liu, Y. 2014. From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms. IEEE Transactions on Cybernetics 44, 7, 1001–1013.
- [\citenameSzegedy et al. 2016] Szegedy, C., Ioffe, S., and Vanhoucke, V. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.
- [\citenameTang and Eliasmith 2010] Tang, Y., and Eliasmith, C. 2010. Deep networks for robust visual recognition. In Proceedings of the International Conference on Machine Learning, 1055–1062.
- [\citenameTang et al. 2012] Tang, Y., Salakhutdinov, R., and Hinton, G. 2012. Robust boltzmann machines for recognition and denoising. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2264–2271.
- [\citenameTieleman and Hinton 2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2.
- [\citenameVasiljevic et al. 2016] Vasiljevic, I., Chakrabarti, A., and Shakhnarovich, G. 2016. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760.
- [\citenameWang et al. 2016] Wang, S., Fidler, S., and Urtasun, R. 2016. Proximal deep structured models. In Advances in Neural Information Processing Systems 29, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds. 865–873.
- [\citenameXie et al. 2012] Xie, J., Xu, L., and Chen, E. 2012. Image denoising and inpainting with deep neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, 341–349.
- [\citenameXu et al. 2014] Xu, L., Ren, J. S., Liu, C., and Jia, J. 2014. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, 1790–1798.
- [\citenameZhang et al. 2011] Zhang, L., Wu, X., Buades, A., and Li, X. 2011. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic Imaging 20, 2, 023016–023016.
- [\citenameZhang et al. 2016a] Zhang, J., Pan, J., Lai, W.-S., Lau, R., and Yang, M.-H. 2016. Learning fully convolutional networks for iterative non-blind deconvolution. arXiv preprint arXiv:1611.06495.
- [\citenameZhang et al. 2016b] Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L. 2016. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. arXiv preprint arXiv:1608.03981.