Deep Burst Denoising
Noise is an inherent issue of low-light image capture, one which is exacerbated on mobile devices due to their narrow apertures and small sensors. One strategy for mitigating noise in a low-light situation is to increase the shutter time of the camera, thus allowing each photosite to integrate more light and decrease noise variance. However, there are two downsides of long exposures: (a) bright regions can exceed the sensor range, and (b) camera and scene motion will result in blurred images. Another way of gathering more light is to capture multiple short (thus noisy) frames in a “burst” and intelligently integrate the content, thus avoiding the above downsides. In this paper, we use the burst-capture strategy and implement the intelligent integration via a recurrent fully convolutional deep neural net (CNN). We build our novel, multi-frame architecture to be a simple addition to any single-frame denoising model, and design it to handle an arbitrary number of noisy input frames. We show that it achieves state of the art denoising results on our burst dataset, improving on the best published multi-frame techniques, such as VBM4D [maggioni2012video] and FlexISP [heide2014flexisp]. Finally, we explore other applications of image enhancement by integrating content from multiple frames and demonstrate that our DNN architecture generalizes well to image super-resolution.
Noise reduction is one of the most important problems to solve in the design of an imaging pipeline. The most straightforward solution is to collect as much light as possible when taking a photograph. This can be addressed in camera hardware through the use of a large-aperture lens, sensors with large photosites, and high-quality A/D conversion. However, relative to larger standalone cameras, e.g. a DSLR, modern smartphone cameras have compromised on each of these hardware elements. This makes noise much more of a problem in smartphone capture.
Another way to collect more light is to use a longer shutter time, allowing each photosite on the sensor to integrate light over a longer period of time. This is commonly done by placing the camera on a tripod. The tripod is necessary as any motion of the camera will cause the collected light to blur across multiple photosites. This technique has two limitations, though. First, any moving objects in the scene and residual camera motion will cause blur in the resulting photo. Second, the shutter time can only be set for as long as the brightest objects in the scene do not saturate the electron-collecting capacity of a photosite. This means that for high dynamic range scenes, the darkest regions of the image may still exhibit significant noise while the brightest ones might saturate.
In our method we also collect light over a longer period of time, by capturing a burst of photos. Burst photography addresses many of the issues above: (a) it is available on inexpensive hardware, (b) it can capture moving subjects, and (c) it is less likely to suffer from blown-out highlights. In using a burst, we make the design choice of leveraging a computational process to integrate light instead of a hardware process, such as in [liu2014fast] and [hasinoff2016burst]. In other words, we turn to computational photography.
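The benefit of integrating several short exposures can be illustrated with the simplest possible merge, a per-pixel average of a perfectly aligned, static burst. The sketch below is not the method of this paper: it assumes zero-mean Gaussian noise and no motion, and the names `simulate_burst`, `average_frames` and `residual_std` are hypothetical.

```python
import random
import statistics


def simulate_burst(clean, num_frames, sigma, rng):
    """Return num_frames noisy copies of a static signal (a list of floats)."""
    return [[p + rng.gauss(0.0, sigma) for p in clean] for _ in range(num_frames)]


def average_frames(frames):
    """Per-pixel mean over a perfectly aligned burst."""
    return [sum(pixels) / len(frames) for pixels in zip(*frames)]


def residual_std(estimate, clean):
    """Standard deviation of the noise left in an estimate."""
    return statistics.pstdev([e - c for e, c in zip(estimate, clean)])


rng = random.Random(0)   # fixed seed so the run is repeatable
clean = [0.5] * 4096     # flat gray signal, hypothetical test data
burst = simulate_burst(clean, num_frames=8, sigma=0.1, rng=rng)

single_std = residual_std(burst[0], clean)
merged_std = residual_std(average_frames(burst), clean)
print(f"single frame: {single_std:.3f}, 8-frame average: {merged_std:.3f}")
```

Under these idealized assumptions the residual noise shrinks by roughly the square root of the number of frames. Real bursts violate the assumptions through motion and clipping, which is what motivates a learned merge.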
Our computational process runs in several steps. First, the burst is stabilized by finding a homography for each frame that geometrically registers it to a common reference. Second, we employ a fully convolutional deep neural network (CNN) to denoise each frame individually. Third, we extend the CNN with a parallel recurrent network that integrates the information of all frames in the burst.
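To make the registration step concrete, the sketch below shows how a 3×3 homography maps pixel coordinates and how a frame can be resampled onto the reference grid by inverse mapping. It is an illustration only: estimating the homography is not shown, nearest-neighbour sampling is a simplification, and `apply_homography` and `warp_to_reference` are hypothetical helper names.

```python
def apply_homography(h, x, y):
    """Map pixel (x, y) through a 3x3 homography given as nested lists."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w


def warp_to_reference(frame, h_ref_to_frame, fill=0.0):
    """Resample a frame onto the reference grid (nearest-neighbour lookup).

    h_ref_to_frame maps reference coordinates to frame coordinates, so each
    output pixel looks up where it comes from in the input frame.
    """
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx, sy = apply_homography(h_ref_to_frame, x, y)
            ix, iy = round(sx), round(sy)
            if 0 <= ix < width and 0 <= iy < height:
                out[y][x] = frame[iy][ix]
    return out


# A pure translation: reference pixel x corresponds to frame pixel x + 1.
shift_right = [[1, 0, 1],
               [0, 1, 0],
               [0, 0, 1]]
frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
aligned = warp_to_reference(frame, shift_right)
```

Pixels whose source falls outside the frame receive the fill value, which is one reason why stabilized bursts have invalid borders.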
The paper presents our work as follows. In Section 2 we review previous single-frame and multi-frame denoising techniques. We also look at super-resolution, which can leverage multi-frame information. In Section 3 we describe our recurrent network in detail and discuss training. In order to compare against previous work, the network is trained on simulated Gaussian noise. We also show that our solution works well when trained on Poisson-distributed noise, which is typical of a real-world imaging pipeline [conf/cvpr/HasinoffDF10]. In Section 4, we show a significant increase in reconstruction quality on burst sequences in comparison to state of the art single-frame denoising, and performance on par with or better than recent state of the art multi-frame denoising methods. In addition, we demonstrate that burst capture coupled with our recurrent network architecture generalizes well to super-resolution.
In summary, our main contributions are:
We introduce a recurrent “feature accumulator” network as a simple yet effective extension to single-frame denoising models,
We demonstrate that bursts provide a large improvement over the best deep learning based single-frame denoising techniques,
We show that our model reaches performance on par with or better than recent state of the art multi-frame denoising methods, and
We demonstrate that our recurrent architecture generalizes well to the related task of super-resolution.
2 Related work
This work addresses a variety of inverse problems, all of which can be formulated as consisting of (1) a target “restored” image, (2) a temporally-ordered set or “burst” of images, each of which is a corrupted observation of the target image, and (3) a function mapping the burst of images to the restored target. Such tasks include denoising and super-resolution. Our goal is to craft this function, either through domain knowledge or through a data-driven approach, to solve these multi-image restoration problems.
Data-driven single-image denoising research dates back to work that leverages block-level statistics within a single image. One of the earliest works of this nature is Non-Local Means [buades2005non], a method for taking a weighted average of blocks within an image based on similarity to a reference block. Dabov et al. [dabov2009bm3d] extend this concept of block-level filtering with a novel 3D filtering formulation. This algorithm, BM3D, is the de facto baseline against which all other single-image methods are compared today.
Learning-based methods have proliferated in the last few years. These methods often make use of neural networks that are purely feed-forward [NIPS2012_4686, burger2012image, Zhang2017BeyondAG, NIPS2008_3506, gharbi2016deep, NIPS2013_5030, Zhang_2017_CVPR], recurrent [ChenSPIE2016], or a hybrid of the two [chen2017trainable]. Methods such as Field of Experts [roth2005fields] have been shown to be successful in modeling natural image statistics for tasks such as denoising and inpainting with contrastive divergence. Moreover, related tasks such as demosaicing and denoising have been shown to benefit from joint formulations when posed in a learning framework [gharbi2016deep]. Finally, the recent work of [chaitanya2017interactive] applied a recurrent architecture in the context of denoising ray-traced sequences.
Multi-image variants of denoising methods exist and often focus on the best ways to align and combine images. Tico [tico2008multi] returns to a block-based paradigm, but this time, blocks “within” and “across” images in a burst can be used to produce a denoised estimate. VBM3D [dabov2007image] and VBM4D [maggioni2011video, maggioni2012video] provide extensions on top of the existing BM3D framework. Liu et al. [liu2014fast] showed how similar denoising performance in terms of PSNR could be obtained in one tenth the time of VBM3D and one one-hundredth the time of VBM4D using a novel “homography flow” alignment scheme along with a “consistent pixel” compositing operator. Systems such as FlexISP [heide2014flexisp] and ProxImaL [heide2016proximal] offer end-to-end formulations of the entire image processing pipeline, including demosaicing, alignment, deblurring, etc., which can be solved jointly through efficient optimization.
We in turn also make use of a deep model and base our CNN architecture on current state of the art single-frame methods [remez2017deep, Zhang2017BeyondAG, Ledig_2017_CVPR].
Super-resolution is the task of taking one or more images of a fixed resolution as input and producing a fused or hallucinated image of higher resolution as output.
Nasrollahi et al. [nasrollahi2014super] offer a comprehensive survey of single-image super-resolution methods and Yang et al. [yang2014single] offer a benchmark and evaluation of several methods. Glasner et al. [Glasner2009] show that single images can be super-resolved without any need of an external database or prior by exploiting block-level statistics “within” the single image. Other methods make use of sparse image statistics [yang2010image]. Borman et al. offer a survey of multi-image methods [borman1998super]. Farsiu et al. [farsiu2004fast] offer a fast and robust method for solving the multi-image super-resolution problem. More recently, convolutional networks have shown very good results in single-image super-resolution with the works of Dong et al. [dong2016image] and the state of the art Ledig et al. [Ledig_2017_CVPR].
Our single-frame architecture takes inspiration from recent deep super-resolution models such as [Ledig_2017_CVPR].
2.1 Neural Architectures
It is worth noting that, while image restoration approaches have increasingly been learning-based in recent years, there is also great diversity in how those learning problems are modeled. In particular, neural network-based approaches have experienced a gradual progression in architectural sophistication over time.
In the work of Dong et al. [dong2016image], a single, feed-forward CNN is used to super-resolve an input image. This is a natural design, as it leveraged what were then new advancements in discriminatively trained neural networks designed for classification and applied them to a regression task. The next step in architecture evolution was to use Recurrent Neural Networks, or RNNs, in place of the convolutional layers of the previous design. One or more RNNs in a network design can be used either to increase the effective depth, and thus the receptive field, of a single-image network [ChenSPIE2016] or to integrate observations across many frames in a multi-image network. Our work makes use of this latter principle.
While the introduction of RNNs led to network architectures with more effective depth and thus a larger receptive field with more context, the success of skip connections in classification networks and segmentation networks [7478072, Ronneberger2015] motivated their use in restoration networks. The work of Remez et al. [remez2017deep] illustrates this principle by computing additive noise predictions from each level of the network, which then sum to form the final noise prediction.
We also make use of this concept, but rather than use skip connections directly, we extract activations from each level of our network which are then fed into corresponding RNNs for integration across all frames of a burst sequence.
In this section we first identify a number of interesting goals we would like a multi-frame architecture to meet and then describe our method and how it achieves such goals.
Our goal is to derive a method which, given a sequence of noisy images, produces a denoised sequence. We identified the following desirable properties that a multi-frame denoising technique should satisfy:
Generalize to any number of frames. A single model should produce competitive results for any number of frames that it is given.
Work for single-frame denoising. A corollary to the first criterion is that our method should be competitive for the single-frame case.
Be robust to motion. Most real-world burst capture scenarios will exhibit both camera and scene motion.
Denoise the entire sequence. Rather than simply denoise a single reference frame, as is the goal in most prior work, we aim to denoise the entire sequence, putting our goal closer to video denoising.
Be temporally coherent. Denoising the entire sequence requires that we do not introduce flickering in the result.
Generalize to a variety of image restoration tasks. As discussed in Section 2, tasks such as super-resolution can benefit from image denoising methods, albeit trained on different data.
In the remainder of this section we will first describe a single-frame denoising model that produces competitive results with current state of the art models. Then we will discuss how we extend this model to accommodate an arbitrary number of frames for multi-frame denoising and how it meets each of our goals.
3.2 Single frame denoising
We treat image denoising as a structured prediction problem, where the network, with its model parameters, is tasked with regressing a pixel-aligned denoised image from a noisy input image. Following [zhao2017loss], we train the network by minimizing the L1 distance between the predicted output and the ground-truth target image.
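As a small illustration of this training criterion, the sketch below computes an L1 distance between a predicted image and its target. Averaging over pixels (rather than summing) is an assumption made here for readability, and `l1_loss` is a hypothetical helper name, not part of any framework.

```python
def l1_loss(predicted, target):
    """Mean absolute difference between two images given as nested lists."""
    total, count = 0.0, 0
    for row_p, row_t in zip(predicted, target):
        for p, t in zip(row_p, row_t):
            total += abs(p - t)
            count += 1
    return total / count


predicted = [[0.0, 1.0],
             [0.5, 0.5]]
target = [[0.0, 0.0],
          [0.5, 1.0]]
loss = l1_loss(predicted, target)  # (0 + 1 + 0 + 0.5) / 4
```

Because the penalty grows linearly with the error, a single badly predicted pixel weighs less than it would under a squared (L2) penalty.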
To be competitive in the single-frame denoising scenario, and to meet our second goal, we take inspiration from the state of the art to derive an initial network architecture. Several existing architectures [Zhang2017BeyondAG, remez2017deep, Ledig_2017_CVPR] share the same base design: a fully convolutional architecture consisting of a stack of layers with a fixed number of channels each.
We therefore follow suit by choosing this simple architecture as our single frame denoising (SFD) base, with convolutions followed by ReLU [maas2013rectifier] activation functions, except on the last layer, as can be seen in Figure 1.
3.3 Multi-frame denoising
Following goals 2 and 4, we want our model to be competitive in the single-frame case while being able to denoise the entire input sequence. Hence, given the set of all noisy images forming the sequence, we task the network with regressing a denoised version of each noisy frame. Our complete training objective thus sums the L1 loss over all frames of the sequence.
A natural approach, which is already popular in the natural language and audio processing literature [yin2017comparative], is to process temporal data with recurrent neural network (RNN) modules [hopfield1982neural]. RNNs operate on sequences and maintain an internal state which is combined with the input at each time step. In our model, we make use of recurrent connections to aggregate activations produced by our SFD network for each frame, as we show in Figure 1. This allows for an arbitrary input sequence length, our first goal. Unlike [chaitanya2017interactive] and [wieschollek2017learning], which utilize a single-track network design, we use a two-track network architecture with the top track dedicated to SFD and the bottom track dedicated to fusing those results into a final prediction for MFD.
By decoupling per-frame feature extraction from multi-frame aggregation, we make it possible to pre-train a network rapidly using only single-frame data. In practice, we found that this pre-training not only accelerates the learning process, but also produces significantly better results in terms of PSNR than when we train the entire MFD from scratch. The core intuition is that by first learning good features for SFD, we put the network in a good state for learning how to aggregate those features across observations, but still grant it the freedom to update those features by not freezing the SFD weights during training.
It is also important to note that the RNNs are connected in such a way as to permit the aggregation of observation features in several different ways. Temporal connections within the RNNs help aggregate information “across” frames, but lateral connections “within” the MFD track permit the aggregation of information at different physical scales and at different levels of abstraction.
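The control flow of the two-track design can be sketched without any learned components. In the toy code below, the single-frame track is replaced by the identity and the recurrent update by a hand-written running mean; both are stand-ins chosen for illustration and are not the learned layers of the model. All function names are hypothetical.

```python
def sfd_features(frame):
    """Stand-in for the single-frame track: here simply the identity."""
    return list(frame)


def recurrent_step(state, features, t):
    """Combine the previous state with new features (running mean)."""
    if state is None:
        return list(features)
    return [s + (f - s) / (t + 1) for s, f in zip(state, features)]


def denoise_burst(frames):
    """Return one estimate per input frame, for a burst of any length."""
    state, outputs = None, []
    for t, frame in enumerate(frames):
        state = recurrent_step(state, sfd_features(frame), t)
        outputs.append(list(state))
    return outputs


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]  # three tiny two-pixel frames
outputs = denoise_burst(frames)
```

The loop makes two of the stated goals visible: the state has a fixed size regardless of how many frames are fed, and an output is emitted for every frame rather than for a single reference frame.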
4 Implementation and Results
To show that our method fulfills the goals set in Section 3, we evaluate it in multiple scenarios: single-image denoising, multi-frame denoising, and single-image super-resolution.
We trained all the networks in our evaluation using a dataset consisting of Apple Live Photos. Live Photos are burst sequences captured by Apple iPhone 6S and above.
We implemented burst sequence stabilization using OpenCV.
4.3 Training details
We implemented the neural network from Section 3 using the Caffe2 framework.
We used Adam [Adam] with a learning rate that decays to zero following a square root law. We trained on crops with random flips. Finally, we trained the multi-frame model using back-propagation through time [werbos1988generalization].
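The crop-and-flip augmentation can be sketched as follows. This is an illustration only: the crop size of 4, the flip probability of 0.5 and the helper name `random_crop_and_flip` are assumptions, and for a burst the same crop and flips would have to be applied to every frame.

```python
import random


def random_crop_and_flip(image, crop_size, rng):
    """Cut a random square crop and flip it horizontally/vertically at random."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    crop = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:                      # horizontal flip
        crop = [row[::-1] for row in crop]
    if rng.random() < 0.5:                      # vertical flip
        crop = crop[::-1]
    return crop


rng = random.Random(0)
image = [[10 * r + c for c in range(8)] for r in range(6)]  # toy 6x8 image
patch = random_crop_and_flip(image, crop_size=4, rng=rng)
```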
4.4 Noise modelling
We first evaluate our architecture using additive white Gaussian noise at fixed noise levels, in order to make comparison possible with previous methods, such as VBM4D. To be able to denoise real burst sequences, we modeled sensor noise following [foi2009clipped] and trained separate models by adding Poisson noise, labelled a in [foi2009clipped], with intensity ranging from 0.001 to 0.01 in linear space before converting back to sRGB and clipping. We also simulate Bayer filtering and reconstruct an RGB image using bilinear interpolation. Unless otherwise mentioned, we add synthetic noise before stabilization.
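A minimal sketch of the two synthetic noise types is given below. It rests on several assumptions that are not confirmed by the text: the Poisson component is approximated by Gaussian noise whose variance is `a` times the linear intensity, the standard sRGB transfer function is used, and Bayer filtering is omitted. All function names are hypothetical.

```python
import random


def srgb_to_linear(v):
    """Standard sRGB decoding for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v):
    """Standard sRGB encoding for a value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def add_gaussian_noise(pixels, sigma, rng):
    """Additive white Gaussian noise with a fixed standard deviation."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def add_signal_dependent_noise(srgb_pixels, a, rng):
    """Noise with variance a * x in linear space, then sRGB encoding."""
    noisy = []
    for p in srgb_pixels:
        x = srgb_to_linear(p)
        x_noisy = x + rng.gauss(0.0, (a * x) ** 0.5)
        # Clipping before encoding gives the same result as clipping after,
        # and avoids raising a negative number to a fractional power.
        x_noisy = min(1.0, max(0.0, x_noisy))
        noisy.append(linear_to_srgb(x_noisy))
    return noisy


rng = random.Random(0)
clean = [0.0, 0.25, 0.5, 0.75, 1.0]  # sRGB values, hypothetical test data
noisy = add_signal_dependent_noise(clean, a=0.01, rng=rng)
noiseless = add_signal_dependent_noise(clean, a=0.0, rng=rng)
```

With `a = 0` the function reduces to an sRGB round trip, which is a convenient sanity check. In this sketch a pure black pixel receives no noise, since the variance is proportional to the signal.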
4.5 Single frame denoising
Here we compare our single frame denoiser with current state of the art methods on additive white Gaussian noise. We compare our own SFD, which is composed of 8 layers, with the two 20 layer networks of DenoiseNet (2017) [remez2017deep] and DnCNN (2017) [Zhang2017BeyondAG]. For the sake of comparison, we also include a 20 layer version of our SFD as well as reimplementations of both DnCNN and DenoiseNet. All models were trained for 2000 epochs on 8000 images from the PASCAL VOC2010 [everingham2010pascal] using the training split from [remez2017deep]. We also include in the comparison BM3D (2009) [dabov2009bm3d] and TNRD (2015) [chen2017trainable].
All models were tested on BSD68 [roth2005fields], a set of 68 natural images from the Berkeley Segmentation Dataset [martin2001database]. In Figure 1, we can see diminishing returns in single frame denoising PSNR over the years, which confirms what Levin et al. describe in [levin2011natural], despite the use of deep neural networks. We can see that our simpler 20-layer SFD model only slightly underperforms both DenoiseNet and DnCNN. However, as we show in the following section, the PSNR gains brought by multi-frame processing vastly outshine fractional single frame PSNR improvements.
|DnCNN (reimpl w/o BN)||31.42||28.86||25.99||24.30|
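Since all comparisons in this section are reported in PSNR, a short reference computation may help. The sketch assumes pixel values in [0, 1] (peak of 1.0) and flat lists of pixels; `ideal_burst_gain_db` gives the gain of plain averaging under independent, zero-mean noise and perfect alignment, which is an idealized yardstick rather than a result of this paper.

```python
import math


def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixels."""
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def ideal_burst_gain_db(num_frames):
    """PSNR gain of averaging num_frames frames with independent noise."""
    return 10.0 * math.log10(num_frames)


reference = [0.5, 0.5, 0.5, 0.5]
estimate = [0.6, 0.4, 0.6, 0.4]      # every pixel is off by 0.1
score = psnr(estimate, reference)    # MSE of 0.01 corresponds to 20 dB
gain = ideal_burst_gain_db(8)        # about 9 dB for eight frames
```

The second function puts single-frame improvements of a fraction of a decibel into perspective: even naive averaging of eight ideal frames is worth about 9 dB.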
4.6 Burst denoising
We evaluate our method on a held-out test set of Live Photos with synthetic additive white Gaussian noise added. In Table 3, we compare our architecture with single frame models as well as the multi-frame method VBM4D [maggioni2011video, maggioni2012video]. We show qualitative results in Figure 5. In the teaser figure and Figure 8, we demonstrate that our method is capable of denoising real sequences. This evaluation was performed on real noisy bursts from HDR+ [hasinoff2016burst]. Please see our supplementary material for more results.
|C2F||C4F||C8F||Ours 4L||Ours 8L||Ours 12L||Ours 16L||Ours 20L||Ours nostab|
We now evaluate our architecture choices by comparing our full model, which has 8 layers and is trained on sequences of 8 frames, with other variants.
Concat. We first compare our method with a naive multi-frame denoising approach, dubbed Concat, where the input to a single-pass denoiser consists of concatenated frames. We evaluated several configurations of this architecture (C2F, C4F and C8F in Table 2). As we can see in Table 2, this model performs significantly worse than our model.
Number of layers. We also evaluate the impact of the depth of the network by experimenting with several depths (the 4L to 20L variants in Table 2). As can be seen in Figure 2, the 16- and 20-layer networks fail to surpass both the 8- and 12-layer networks after 125 epochs of training, likely due to the increased depth and parameter count. While the 12-layer network shows a marginal 0.18 dB increase over the 8-layer model, we decided to go with the latter, as we did not think that the modest increase in PSNR was worth the increase in both memory and computation time.
Length of training sequences. Perhaps the most surprising result we encountered while training our recurrent model was the importance of the number of frames in the training sequences. In Figure 3, we show that models trained on sequences of both 2 and 4 frames fail to generalize beyond their training sequence length. Only models trained with 8 frames were able to generalize to longer sequences at test time and, as we can see, still denoise beyond 8 frames.
Pre-training. One of the main advantages of using a two-track network is that we can train the SFD track independently first. As mentioned above, a sequence length of 8 is required to ensure generalization to longer sequences, which makes the training of the full model much slower than training the single-frame pass. As we show in Figure 2, pre-training makes training the MFD significantly faster.
Frame ordering. Due to its recurrent nature, our network exhibits a period of burn-in, where the first frames are denoised to a lesser extent than the later ones. In order to denoise an entire sequence to a high quality level, we explored different options for frame ordering. As we show in Figure 4, by feeding the sequence twice to the network, we are able to obtain a higher average PSNR. We propose two variants: either repeat the sequence in the same order, or reverse it the second time (named forward-backward). As we show in Figure 4, the forward-backward schedule suffers from neither burn-in nor flickering, thus meeting our fifth goal. We thus use forward-backward for all our experiments.
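The burn-in effect and the forward-backward schedule can be reproduced with a toy accumulator. In the sketch below each frame is reduced to a single pixel value and the recurrent network is replaced by a running mean; keeping only the second-pass outputs, restored to the original order, is an assumption made for illustration. The function names are hypothetical.

```python
def running_mean_pass(frames, state=None, count=0):
    """Feed frames through a running-mean accumulator.

    Returns the per-frame outputs, the final state and the frame count, so
    that a second pass can continue from where the first one stopped.
    """
    outputs = []
    for frame in frames:
        count += 1
        state = frame if state is None else state + (frame - state) / count
        outputs.append(state)
    return outputs, state, count


def forward_backward(frames):
    """Run a forward pass, then a reversed pass that reuses the state."""
    _, state, count = running_mean_pass(frames)
    outputs, _, _ = running_mean_pass(frames[::-1], state, count)
    return outputs[::-1]  # restore the original frame order


frames = [4.0, 0.0, 2.0, 2.0]  # noisy observations of a true value of 2.0
forward_only, _, _ = running_mean_pass(frames)
both = forward_backward(frames)
```

In the forward-only pass, the estimate for the first frame has seen a single observation and keeps its full error. In the forward-backward schedule, every retained output has already integrated the whole burst at least once.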
Figure panels (denoising comparison): Input, Average, VBM4D [maggioni2011video], SFD (Ours), MFD (Ours), Ground truth.
Figure panels (super-resolution comparison): Bicubic, SFSR (Ours), MFSR (Ours), Ground truth.
We now compare our method with other denoising approaches on the FlexISP dataset and show our results in Figure 7. Each sequence was denoised using the first 8 frames only. The synthetic sequences flickr doll and kodak fence were generated by randomly warping an input image and adding, respectively, additive and multiplicative white Gaussian noise, and additive white Gaussian noise, as well as simulating a Bayer filter. We thus trained two models by replicating these conditions on our Live Photos dataset. On flickr doll, our method achieves a PSNR of 29.39 dB, matching FlexISP but falling short of ProxImaL (30.23 dB), not shown here. On kodak fence, our recurrent model achieves a 0.5 dB advantage over FlexISP (34.44 dB) with a PSNR of 34.976 dB. Despite reaching a higher PSNR than FlexISP, our method does not mitigate the demosaicing artifacts on the fence, likely due to the absence of high-frequency demosaicing artifacts in our training data.
4.8 Super resolution
To illustrate that our approach generalizes to tasks beyond denoising, and thus meets our sixth goal, we trained our model to perform super-resolution, while keeping the rest of the training procedure identical to that of the denoising pipeline. Each input patch has been downsampled using pixel area resampling and then resized to its original size using bilinear sampling. Figure 6 shows a couple of our results. Please refer to the supplemental material for more results.
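The degradation used to build super-resolution inputs can be sketched as an area downsampling followed by a bilinear resize back to the original dimensions. The code below is a plain-Python illustration, not the resampling code used for the experiments: it supports integer factors only, the factor of 2 in the example is a placeholder, and all function names are hypothetical.

```python
def downsample_area(image, factor):
    """Average each factor x factor block (area resampling, integer factor)."""
    height, width = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(image[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(width)
        ]
        for y in range(height)
    ]


def _source_coord(out_index, factor, size):
    """Centre-aligned source coordinate, clamped to the valid range."""
    return min(max((out_index + 0.5) / factor - 0.5, 0.0), size - 1.0)


def upsample_bilinear(image, factor):
    """Resize by an integer factor with bilinear interpolation."""
    height, width = len(image), len(image[0])
    out = []
    for oy in range(height * factor):
        sy = _source_coord(oy, factor, height)
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        fy = sy - y0
        row = []
        for ox in range(width * factor):
            sx = _source_coord(ox, factor, width)
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def make_low_quality_input(image, factor):
    """Degrade an image while keeping its original pixel dimensions."""
    return upsample_bilinear(downsample_area(image, factor), factor)


patch = [[0.0, 0.0, 4.0, 4.0],
         [0.0, 0.0, 4.0, 4.0]]
degraded = make_low_quality_input(patch, factor=2)  # sharp edge becomes a ramp
```

Keeping the degraded input at the original resolution means the network sees pixel-aligned input and target pairs, exactly as in the denoising setup.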
Our single-frame architecture, based on [remez2017deep, Zhang2017BeyondAG, Ledig_2017_CVPR], makes use of stride-1 convolutions, enabling full-resolution processing across the entire network. Such convolutions are, however, both memory and computationally expensive, and have a small receptive field for a given network depth. Using multiscale architectures, such as U-Nets [ronneberger2015u], could help alleviate both issues by reducing the computational and memory load while increasing the receptive field. Finally, while we trained our network on pre-stabilized sequences, we observed a significant drop in accuracy on unstabilized sequences, as can be seen in Table 2, as well as instability on longer sequences. It would be interesting to train the network to stabilize the sequence by warping inside the network, such as in [jaderberg2015spatial, Godard_2017_CVPR].
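The receptive field limitation can be quantified with a one-line formula for stacked stride-1 convolutions. In the sketch below the kernel size of 3 is an assumption made for the example; the depths of 8 and 20 layers are the ones compared in the evaluation. `receptive_field` is a hypothetical helper name.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field, in pixels per side, of stacked stride-1 convolutions.

    Each layer extends the field by (kernel_size - 1) pixels.
    """
    return 1 + num_layers * (kernel_size - 1)


shallow = receptive_field(8, 3)   # 8 layers of assumed 3x3 kernels
deep = receptive_field(20, 3)     # 20 layers of assumed 3x3 kernels
```

Under this assumption the field grows only linearly with depth, from 17 pixels at 8 layers to 41 pixels at 20 layers, whereas a multiscale design enlarges it with every downsampling step.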
We have presented a novel deep neural architecture to process bursts of images. We improve on a simple single frame architecture by making use of recurrent connections and show that, while single-frame models are reaching performance limits for denoising, our recurrent architecture vastly outperforms such models on multi-frame data. We carefully designed our method to align with the goals we stated in Section 3.1. As a result, our approach achieves state-of-the-art performance on our Live Photos dataset, and matches existing multi-frame denoisers on challenging existing datasets with real camera noise.