Image Denoising using Attention-Residual Convolutional Neural Networks
During the image acquisition process, noise is usually added to the data, mainly due to physical limitations of the acquisition sensor and to imprecisions during data transmission and manipulation. The resultant image therefore needs to be processed to attenuate its noise without losing details. Non-learning-based strategies such as filter-based and noise prior modeling approaches have been adopted to solve the image denoising problem. Nowadays, learning-based denoising techniques, such as Residual Convolutional Neural Networks, have proven to be much more effective and flexible. Here, we propose a new learning-based non-blind denoising technique named Attention Residual Convolutional Neural Network (ARCNN), and its extension to blind denoising named Flexible Attention Residual Convolutional Neural Network (FARCNN). The proposed methods try to learn the underlying noise expectation using an Attention-Residual mechanism. Experiments on public datasets corrupted by different levels of Gaussian and Poisson noise support the effectiveness of the proposed approaches against some state-of-the-art image denoising methods. ARCNN achieved overall average PSNR results of around dB and dB for Gaussian and Poisson denoising, respectively. FARCNN presented very consistent results, even with slightly worse performance compared to ARCNN.
Noise is usually defined as a random variation of brightness or color information, as shown by Figure 1, and it is often caused by the physical limitations of the image acquisition sensor or by unsuitable environmental conditions. These issues are often unavoidable in practical situations, which makes noise in images a prevalent problem that needs to be addressed by appropriate denoising techniques.
Denoising an image is a challenging task mainly because the noise is related to the image's high-frequency content, that is, its details. The goal, therefore, is to find a compromise between suppressing noise as much as possible and not losing too many details. The most commonly used techniques for image denoising are filter-based ones such as the Inverse, Median, Kuan, and Richardson-Lucy filters, as well as the Wiener Filter. Besides filter-based techniques, there exist non-learning noise modeling approaches, such as EPLL, Krishnan, KSVD, BM3D, Markov Random Fields, and Total Variation. Such techniques are based on noise prior modeling, and they exhibit some drawbacks, such as the computational burden and the need to fine-tune parameters. Their effectiveness is highly dependent on prior knowledge about the type of noise (e.g., Gaussian, salt-and-pepper, speckle) and its statistical properties (e.g., mean and variance).
In the opposite direction, deep learning-based techniques have become the most effective methods in many real-world problems involving digital image processing, and they have likewise been used as a natural replacement for the non-learning filter-based and prior-knowledge-based denoising approaches. Such learning-based techniques tend to be less affected by the non-linear characteristics of the noise-generating mechanisms.
Among such approaches, Multilayer Perceptrons (MLPs) were, for a long time, one of the most explored machine learning-based techniques for image denoising [3, 26, 22]. With the recent advances in computer graphics processing capacity, MLPs have been replaced by Convolutional Neural Networks (CNNs), especially concerning image processing tasks (e.g.,[23, 31, 29, 30, 32]).
State-of-the-art denoising CNNs employ a training strategy called residual learning, where the network is trained to assimilate the noise prior distribution. In that manner, it can almost replicate only the image noise, which can then be removed from the image by a simple point-wise operation (e.g., [30, 29, 23, 32]). One main problem with such an approach regards the assumption that the noise is equally distributed over the image, even though the noise tends to be more concentrated in certain specific parts of the corrupted image, usually related to high-frequency regions.
Another very interesting deep learning training strategy, not yet widely explored for image denoising, is attention learning. Such a mechanism makes the deep neural network concentrate its learning effort on the more informative components of the input data. The benefits of such a mechanism have brought many advances in the areas of natural language processing, recommendation systems, health care analysis, speech recognition, and image classification, among others.
In this paper, we propose a robust deep learning denoising technique that consists of a CNN model incorporating residual and attention learning strategies. Indeed, we demonstrate that the attention mechanism is capable of supporting the residual learning strategy, thus enhancing the neural network's denoising capacity without the need to increase the number of parameters or the complexity of the network architecture. Experiments on public datasets corrupted by different levels of Gaussian and Poisson noise support the effectiveness of the proposed approach against some state-of-the-art image denoising methods.
The paper is structured into Sections II to V, presenting, respectively, a brief discussion about the image denoising problem using learning techniques, such as MLPs and CNNs, and non-learning-based ones, the proposed approaches, their training and evaluation methodology, quantitative and qualitative results, and the conclusions, also pointing out future directions of investigations.
II Proposed Approach
In this work, we propose a novel image denoising technique named "Attention-Residual Convolutional Neural Network" (ARCNN), as shown in Figure 2.
Influenced by the works of Remez et al., concerning non-blind residual image denoising using CNNs, and Wang et al., regarding the usage of attention mechanisms for image classification, our proposal consists of a novel Attention-Residual mechanism for image denoising, represented by the dashed rectangle in Figure 2. Such a mechanism is divided into two steps: (a) the Attention weights calculation, described in detail in Subsection II-A, and (b) the Noise estimation process, described in detail in Subsection II-B.
As shown by Figure 2, once the Attention-Residual mechanism has estimated the noise, the estimate is removed from the corrupted input image to produce the denoised output.
II-A Attention Weights
The Attention weights calculation is summarized by Figure 3, represented by the yellow module. The calculation procedure consists in: (a) grouping each of the 64 linearly activated feature maps into $Z$, (b) applying a sigmoid activation function to $Z$, which generates $S$, and (c) normalizing the content of $S$ using a softmax activation function.
The softmax activation procedure that generates the Attention weights is given by:

$$w_d = \frac{e^{s_d}}{\sum_{j=1}^{D} e^{s_j}},$$

where $D = 64$ and $s_d$ represents each element of $S$ in the $d$-th depth position.
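For concreteness, the depth-wise sigmoid-plus-softmax normalization can be sketched in NumPy; the shapes, the depth value, and the function name are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def attention_weights(feature_maps):
    # feature_maps: the D grouped, linearly activated maps, shape (D, H, W)
    s = 1.0 / (1.0 + np.exp(-feature_maps))       # sigmoid activation
    e = np.exp(s - s.max(axis=0, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=0, keepdims=True)       # normalize across depth
```

Note that at every spatial position the $D$ weights sum to one, so they act as a per-pixel convex combination over the depth dimension.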
II-B Noise Estimation
The noise estimation process consists in calculating the noise estimates expectation, summarized by:

$$\mathbb{E}[N] = \sum_{d=1}^{D} w_d \odot r_d,$$

where $r_d$ is the $d$-th residual map and $\odot$ stands for the point-wise multiplication computed between the Attention weight $w_d$ and the Residual map $r_d$.
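A minimal NumPy sketch of this attention-weighted expectation (names and shapes are illustrative):

```python
import numpy as np

def estimate_noise(weights, residual_maps):
    # Point-wise product of each attention weight map with its residual
    # map, summed over the depth axis: the noise expectation.
    # Both inputs have shape (D, H, W); the result has shape (H, W).
    return (weights * residual_maps).sum(axis=0)
```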
II-C Loss Function
The network training follows the standard backpropagation optimization procedure with the following loss function:

$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|_F^2,$$

where $N$ stands for the number of training samples and $\|\cdot\|_F$ denotes the Frobenius norm. Notice that we employed a patch-based methodology, where $x_i$ and $\hat{x}_i$ denote the $i$-th patch extracted from the clean and denoised images, respectively. Such a loss function was also used by Remez et al.
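The patch-based loss can be sketched as follows; the exact scaling constant is an illustrative assumption (it does not change the optimum):

```python
import numpy as np

def denoising_loss(clean_patches, denoised_patches):
    # Squared Frobenius norm between each clean patch and its denoised
    # counterpart, averaged over the N training patches.
    n = len(clean_patches)
    total = sum(np.linalg.norm(c - d, 'fro') ** 2
                for c, d in zip(clean_patches, denoised_patches))
    return total / (2 * n)
```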
II-D Denoising Process
After the network was properly trained, the denoising process can be described as follows:

$$\hat{x} = y - \mathbb{E}[N],$$

where the expected noise value $\mathbb{E}[N]$, learned by the proposed approach, is removed from the corrupted input image $y$.
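In code, the point-wise removal is a single subtraction followed by clipping back to the valid intensity range (the [0, 1] range assumes normalized images):

```python
import numpy as np

def denoise(noisy, noise_expectation):
    # Subtract the learned noise expectation from the corrupted input
    # and clip the result to valid intensities.
    return np.clip(noisy - noise_expectation, 0.0, 1.0)
```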
III Experimental Design
In this section, we present the methodology used to train and evaluate the proposed ARCNN and FARCNN models. For the sake of clarity, we divided the section into two parts: Subsection III-A presents all the relevant information about the training and test datasets used in this work, and Subsection III-B discusses the training and evaluation procedures applied to the proposed approaches.
III-A Datasets
In this subsection, we provide details about the datasets used for training and evaluating the robustness of the proposed approach:
Berkeley Segmentation Dataset (BSD500): dataset created by  to provide an empirical basis for research in image segmentation and boundary detection. The public dataset consists of natural color and grayscale images. From the dataset, we used patches of sizes extracted from its images for training purposes. The remaining images were used to evaluate the model.
Common Objects in Context (COCO2017): a large-scale object detection, segmentation, and captioning dataset, composed of color images and their corresponding foreground object annotations. From the COCO2017 dataset, we used patches of sizes extracted from all images.
DIVerse 2K high quality resolution images (DIV2K): created by , it is composed of images split into subsets of , and , respectively, for training, validation, and test purposes. From the downscaled DIV2K version of the dataset, we used patches of sizes extracted from its images.
Set12: composed of images such as "Airplane", "Barbara", "Boat", "Butterfly", "Cameraman", "Couple", "House", "Lena", "Man", "Parrot", "Peppers", and "Starfish".
KODAK24: dataset consisting of natural images made publicly available by the Eastman Kodak Company .
Urban100: composed of real-world indoor and outdoor high-resolution construction images, such as buildings and metro stations.
III-B Evaluation and Training Procedures
We trained the proposed non-blind approaches considering two types of corruption processes, i.e., Gaussian and Poisson. The training was conducted over four different noise intensities for each corruption process. For the Gaussian process, we trained ARCNNs considering , and for the Poisson corruption process we considered . Note that ARCNN was trained individually for each noise type and intensity. For the optimization process, we used mini-batches.
During the training process, we used a learning rate.
To train the blind version of our proposal, named FARCNN, which stands for "Flexible Attention-Residual Convolutional Neural Network", we followed the same non-blind training protocol. The main difference regards the adopted single-training strategy, where single Gaussian and Poisson denoisers were trained to jointly learn noise prior distributions, ranging from and from , respectively.
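The two corruption processes used for training can be sketched as follows; the `peak` parameterization of the Poisson process is a common convention and an assumption here, not necessarily the authors' exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_gaussian(image, sigma):
    # Additive zero-mean Gaussian noise; image assumed in [0, 1].
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def corrupt_poisson(image, peak):
    # Signal-dependent Poisson noise: sample photon counts at the given
    # peak intensity and rescale back toward [0, 1].
    return rng.poisson(image * peak).astype(float) / peak
```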
After training the ARCNN and FARCNN models, we quantitatively evaluated their effectiveness using the PSNR (Peak Signal-to-Noise Ratio) in terms of average improvement.
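The PSNR metric used in the evaluation can be computed as follows (assuming intensities normalized to [0, 1]):

```python
import numpy as np

def psnr(reference, estimate, max_val=1.0):
    # Peak signal-to-noise ratio in dB between a clean reference image
    # and a denoised estimate; higher is better.
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

The average PSNR improvement of a method is then the mean, over the test images, of the difference between the PSNR of its output and that of the noisy input.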
IV Experimental Results
In this section, we present and discuss in detail the quantitative and qualitative results obtained by the proposed ARCNN and FARCNN models following the methodology presented in Section III. For the sake of readability, we divided the discussion into Subsections IV-A and IV-B.
IV-A Quantitative Results
According to Table I, one can note at first glance that our proposed non-blind Gaussian denoising technique, ARCNN, ranked in second place among the six compared techniques. The overall improvements in PSNR obtained over the BM3D, TNRD, DnCNN, IRCNN, and FFDNet techniques were, respectively, about dB, dB, dB, dB, and dB. Apart from that, Table I also shows that ARCNN performs worse than RDN+ by an overall average of dB. However, we highlight that our technique is around more compact in terms of number of parameters, since RDN+ and ARCNN have, respectively, about and parameters each.
Looking more carefully at Table I, it can also be noticed that, regarding the training-based denoising techniques TNRD, DnCNN, IRCNN, and FFDNet, ARCNN maintains a sustained improvement of around dB when considering the Set12, Kodak24, and BSD68 datasets and intensities. Considering only the Urban100 dataset, ARCNN performs even better, since under the same considerations and for the same set of intensities, the improvement was on average dB. In this last case, the improvement can also be explained by a greater similarity between the training dataset and Urban100 test dataset distributions.
Analyzing the FARCNN results presented in Table I, it can be noticed that, as expected, our blind denoising technique performs worse than the non-blind ARCNN model. Nevertheless, under the same overall improvement analysis, it almost tied with DnCNN and IRCNN, being only dB and dB worse, respectively. In comparison against FFDNet, FARCNN was clearly worse, with an overall decrease of about dB. The same ARCNN statements apply to the comparison against RDN+.
According to Table II, considering the BSD68 dataset, our non-blind Poisson denoising technique ARCNN performs better than every other compared technique.
The overall noise improvement in PSNR was, in the best-case scenario, about dB in comparison against the NLSPCA bin technique and, in the worst-case scenario, about dB in comparison against the Class-Aware method.
Analyzing the FARCNN results, one can see that the blind denoising version of our proposal performs better than most of the other techniques, since Table II shows that FARCNN had, in general, the same performance as the Class-Aware technique in all the experiments considering .
IV-B Qualitative Results
Beginning with Figure 4, one can observe that the Gaussian denoising results of ARCNN outperform all compared techniques except RDN+. Differently from the others, ARCNN was capable of restoring the severely corrupted straight lines without causing too much blurring in the surrounding content. Regarding FARCNN, some of the image's straight lines were not totally restored, but even so, the quality of the resultant denoised image resembles those of DnCNN and IRCNN.
Figure 5 shows even better ARCNN performance in the Poisson denoising task. In comparison against the second-best Poisson denoising technique according to the analysis of Subsection IV-A, ARCNN was capable of restoring facial regions with more fidelity than the Class-Aware technique. Such a statement can be verified especially in the regions between the eyebrows and around the right eye of the man's face. In those regions, one can see that the Class-Aware denoising technique generates a denoised image with a kind of cartoonization effect. The FARCNN technique presented decent results, especially because it also recovered the high-frequency regions of the face. Like ARCNN, the blind version did not generate cartoonization effects, but even so, it generated some distortions in the face image, like the ones above the left eyebrow and on the right side of the chin.
To better analyze the behavior of the attention mechanism of the trained ARCNN and FARCNN models, we generated heat map graphical representations for the attention weights taken from layers , , and , as shown in Figure 6. In this figure, one can note that going deeper into the network increases the level of attention on the high-frequency regions of the input image, such as the butterfly's wings and antennae contours. Such behavior is evident in both the blind and non-blind cases, with the contour regions being more pronounced in the latter.
V Discussion and Conclusions
In this work, we demonstrated that the attention-residual mechanism enhances the denoising capacity of Convolutional Neural Networks regarding Gaussian and Poisson noise corruption processes. The proposed ARCNN method achieves state-of-the-art results in comparison against six Gaussian and eight Poisson denoising techniques. The quantitative overall improvements of our Gaussian and Poisson non-blind learning-based denoisers, apart from the RDN+ technique, were respectively around dB and dB on average.
The qualitative results also evidenced ARCNN's capacity to recover the corrupted high-frequency regions of the image. Besides that, again apart from RDN+ in the Gaussian denoising case, we also showed that the blind Gaussian and Poisson FARCNN denoisers presented results sufficiently close to those of their non-blind counterparts. As a matter of fact, the great advantage of the FARCNN denoiser regards its mechanism of assimilating knowledge about many different noise intensities at the same time, achieving almost the non-blind denoisers' effectiveness.
Regarding the RDN+ comparisons, we verified that, even though RDN+ was capable of producing the best Gaussian denoising results, it fails in terms of compactness. Its number of parameters is at least larger than that of the proposed approaches, which could generate efficiency shortcomings or even make it impossible to use the technique on small-memory devices, such as smartphones and tablets.
In future works, we intend to explore the proposed approach's capacity to work with different types of degradation, such as JPEG compression noise, speckle, and blur, also considering color images. Besides that, we intend to investigate the performance of the attention-residual mechanism in the context of classification problems.
The authors are grateful to CNPq grants 307066/2017-7 and 427968/2018-6, FAPESP grants 2013/07375-0 and 2014/12236-1, Petrobras grant 2017/00285-6, as well as to NVIDIA for supporting this work by kindly providing a Titan V GPU through the NVIDIA Data Science GPU Grant.
- In this work, we used Gaussian and Poisson noise distributions to corrupt the clean images.
- Our experiments demonstrated that, even though the noise was applied in a multiplicative manner, considering for example the Poisson corruption process, such an additive noise removal strategy worked very well.
- In Figure 2, such noise removal is likewise denoted by a plus sign, which represents the residual mechanism itself.
- Combination of train images subset with validation subset.
- Each mini-batch contains grayscale image patches of size.
- Depending on the convergence of the training process, the maximum epoch value can be less than .
- The initial value is reduced by a factor of every time the loss function hits a plateau.
- Distributed over the network with a ratio of layers of distance between them.
- Results obtained by subtracting the PSNR values, as presented in .
- (2010) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: 1st item.
- (2016) Variance stabilization for noisy+ estimate combination in iterative poisson denoising. IEEE signal processing letters 23 (8), pp. 1086–1090. Cited by: §III-B, TABLE II.
- (2012) Image denoising: can plain neural networks compete with BM3D?. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2392–2399. Cited by: §I, footnote 13.
- (2017) Attentive collaborative filtering: multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 335–344. Cited by: §I.
- (2016) Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1256–1272. Cited by: §III-B, TABLE I.
- (2016) Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pp. 3504–3512. Cited by: §I.
- (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing 16 (8), pp. 2080–2095. Cited by: §I, §III-B, TABLE I.
- (2015) Fast and accurate poisson denoising with optimized nonlinear diffusion. arXiv preprint arXiv:1510.02930. Cited by: §III-B, TABLE II.
- (1999) Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak. Cited by: 5th item.
- (2006) Digital image processing. 3rd edition, Prentice-Hall, Inc., Upper Saddle River, NJ, USA. External Links: Cited by: §I.
- (2004) Digital image processing using matlab. Vol. 624, Pearson-Prentice-Hall Upper Saddle River. Cited by: §I.
- (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197–5206. Cited by: 6th item.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §III-B.
- (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839. Cited by: §I.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-B.
- (2009) Fast image deconvolution using hyper-laplacian priors. In Advances in Neural Information Processing Systems, pp. 1033–1041. Cited by: §I.
- (2016) Ask me anything: dynamic memory networks for natural language processing. In International conference on machine learning, pp. 1378–1387. Cited by: §I.
- (2006) Efficient belief propagation with learned higher-order markov random fields. In European conference on computer vision, pp. 269–282. Cited by: §I.
- (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: 2nd item.
- (2009) Non-local sparse models for image restoration. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2272–2279. Cited by: §I.
- (2010) Optimal inversion of the anscombe transformation in low-count poisson image denoising. IEEE transactions on Image Processing 20 (1), pp. 99–109. Cited by: §III-B, TABLE II.
- (2017) A robust restricted boltzmann machine for binary image denoising. In 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 390–396. Cited by: §I.
- (2018) Class-aware fully convolutional gaussian and poisson denoising. IEEE Transactions on Image Processing 27 (11), pp. 5707–5722. Cited by: §I, §I, §II-C, §II-D, §II, §III-B, §III-B, TABLE II.
- (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60 (1-4), pp. 259–268. Cited by: §I.
- (2014) Poisson noise reduction with non-local pca. Journal of mathematical imaging and vision 48 (2), pp. 279–294. Cited by: §III-B, TABLE II.
- (2013) A machine learning approach for non-blind image deconvolution. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 1067–1074. Cited by: §I.
- (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 114–125. Cited by: 3rd item.
- (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §I, §II.
- (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §I, §I, §III-B, §III-B, TABLE I.
- (2017) Learning deep cnn denoiser prior for image restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3929–3938. Cited by: §I, §I, §III-B, TABLE I, TABLE II.
- (2018) FFDNet: toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing 27 (9), pp. 4608–4622. Cited by: §I, §III-B, TABLE I.
- (2020) Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §I, §III-B, TABLE I.
- (2011) From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 479–486. Cited by: §I.