# An approach to image denoising using manifold approximation without clean images

###### Abstract

Image restoration has been an extensively researched topic in numerous fields. With the advent of deep learning, a lot of the current algorithms were replaced by algorithms that are more flexible and robust. Deep networks have demonstrated impressive performance in a variety of tasks like blind denoising, image enhancement, deblurring, super-resolution, inpainting, among others. Most of these learning-based algorithms use a large amount of clean data during the training process. However, in certain applications in medical image processing, one may not have access to a large amount of clean data. In this paper, we propose a method for denoising that attempts to learn the denoising process by pushing the noisy data close to the clean data manifold, using only noisy images during training. Furthermore, we use perceptual loss terms and an iterative refinement step to further refine the clean images without losing important features.

## 1 Introduction

Image denoising has been researched for decades and remains a very active research topic because of its many applications, notably in medical image processing. Noise can creep into an image during acquisition or during post-acquisition processing. Different noise models capture how the noisy image is generated at different stages of acquisition and processing. Over the years, a myriad of denoising algorithms have been proposed that differ in the noise model they assume, the constraints they place on images, and whether the denoising is blind or non-blind, i.e. whether the type and parameters of the noise model are known. In recent years, deep learning has proven very effective at many tasks, even surpassing humans in some of them [21]. A variety of deep learning techniques have also been applied to image denoising under various constraints and assumptions. Some well-known image denoising methods are discussed in the next section.

## 2 Related Work

Image restoration is a long-standing problem with many applications and use cases. Popular denoising algorithms include [9], which deals with Gaussian data, [5] for sparse reconstruction of signals, and [6], which is used for Poisson and binomial data. Earlier work on manifold denoising was done by [13], where a graph-based diffusion process is applied to the point sample. With the advent of deep learning, algorithms have been developed for supervised, semi-supervised and unsupervised image restoration. [30] unrolls the computational pipeline of the BM3D algorithm into a convolutional neural network structure. [25] proposes using autoencoders to denoise corrupted versions of their inputs. [8] goes a step further by using adversarial training to shape the latent space in addition to learning to produce denoised outputs. Denoising using manifolds is also explored in [23], which denoises an image by finding the closest point on the manifold of a GAN learned on clean images. [28] uses deep networks pretrained as denoising autoencoders for image inpainting and denoising. [29] uses a more complex trilateral weighted sparse coding scheme for real-world image denoising. There has been a lot of work in medical settings as well. [27] uses a dictionary learning based algorithm for denoising, and [26] proposes a deep learning approach to reconstruct MR data from zero-filled MR images. [7] uses a residual convolutional network for low-dose CT imaging. [12] also tackles the medical image denoising problem using convolutional autoencoders.

However, all these algorithms need a large amount of training data, which may not be feasible to obtain in medical settings because of the high cost of acquiring high-quality data. Certain modalities like X-ray may also be harmful to patients because of the high radiation dose. This calls for unsupervised learning algorithms. [24] showed that neural networks can be used as very strong priors, and that image restoration tasks like blind denoising, inpainting and super-resolution can be performed on a single image without training a network on a dataset, using just a single noisy instance. Recently, [18] proposed to learn the statistical properties of noise using a different training strategy: training a network on noisy pairs. The input is a noisy image, and the target is another noisy image. In their experiments, the noise parameters of both noisy images are the same. Although they show impressive results, acquiring pairs of noisy images with the same noise parameter may not be feasible in certain applications. In contrast, our method uses a single noisy image and has a “directional” learning scheme, i.e. the network has to learn a mapping from more noisy to less noisy examples and not the other way around, since the training pairs in this scheme are not symmetrical, unlike in [18].

## 3 Method

Before describing the manifold mapping technique, we define some terminology. Let $N$ be the size of the training set, which consists only of noisy instances $\{y_i\}$, $i = 1, \dots, N$. Each instance $y_i$ corresponds to a clean image $x_i$, from which it is generated via some random process during acquisition, post-processing, etc. More formally,

$$y_i = D(x_i; \theta_i) \tag{1}$$

where $D$ is the degradation model parameterized by noise parameters $\theta_i$. The training set does not contain the clean images $x_i$; it contains only one noisy instance $y_i$ for each $x_i$, contrary to the work of [18], which used 2 realizations of each corrupted image. With this, we come to the description of our manifold.

A manifold is a lower-dimensional representation for describing high-dimensional data. Images can have a complicated and highly non-linear manifold [14], [11], [31]. We attempt to approximately learn this manifold using the noisy instances of the images, since noisy instances waver around this manifold, with the more noisy instances straying farther from it. We use this intuition to learn a function that tries to push the noisy instances $y_i$ in the direction of the manifold, so as to recover the clean images $x_i$. Since the noise is random, we can assume the noisy instances to be scattered around the manifold, as shown in Figure 1. One way to learn the manifold is to form training pairs of clean images and synthetically degraded versions of them, and use a supervised learning approach to learn a mapping from noisy to clean examples. Although that approach works very well in practice, in certain applications, especially medical imaging, one may not have access to a huge dataset of clean training images. In such a scenario, we try to learn the direction in which to move within the lower-dimensional space without actually knowing the manifold itself, only the normal direction towards it. For each noisy example $y_i$ lying somewhere in the lower-dimensional space, we apply the degradation model $D$ to the image, yielding a more noisy instance $z_i = D(y_i; \theta'_i)$, where the choice of $\theta'_i$ will be made clear in later sections. The image $y_i$ is closer to the manifold than $z_i$ because the function $D$ cannot recover the original pixel information in the image. For example, for a Gaussian degradation model

$$D(x; \sigma) = x + n, \quad n \sim \mathcal{N}(0, \sigma^2 I) \tag{2}$$

and $y_i = D(x_i; \sigma_1)$, $z_i = D(y_i; \sigma_2)$, which yields $y_i \sim \mathcal{N}(x_i, \sigma_1^2 I)$ and $z_i \sim \mathcal{N}(x_i, (\sigma_1^2 + \sigma_2^2) I)$. Thus, $z_i$ has a higher variance and is more corrupted than $y_i$. Similarly, for multiplicative Bernoulli noise, the degradation model is

$$D(x; p)_j = m_j \, x_j, \quad m_j \sim \mathrm{Bernoulli}(p) \tag{3}$$

where $m$ is a random binary mask, $j$ is the pixel index and $p$ is the probability of a mask entry being 1. In this case, $y_i = D(x_i; p_1)$ and $z_i = D(y_i; p_2)$, so each pixel of $x_i$ survives in $z_i$ with probability $p_1 p_2$. Hence, $z_i$ can only contain as much pixel information as $y_i$, but cannot recover any of the pixels already lost, and hence is farther from the manifold. With this intuition, we attempt to learn a mapping from $z_i$ to $y_i$, where the $z_i$ are formed by a random degradation of the already corrupted image $y_i$. We learn this mapping using a Fully Convolutional Network [19], which was introduced for semantic segmentation and later became popular in the literature for image-to-image mapping tasks. Learning this mapping encourages the network to invert the degradation model by making use of the context around each pixel. To make this context available, we choose FCN architectures with sufficiently large receptive fields so that the network can use as much information as possible. This leads to a training scheme where no clean examples are ever seen by the network. During inference, we feed the noisy images $y_i$ forward through the network and obtain approximations of the clean images $x_i$. There are 2 methods for performing this inference, as mentioned in later sections.
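The pair construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`degrade_gaussian`, `degrade_bernoulli`, `make_training_pair`), the constant stand-in image and the noise levels are all hypothetical choices made for the example.

```python
import numpy as np

def degrade_gaussian(img, sigma, rng):
    """Additive i.i.d. Gaussian noise with standard deviation sigma."""
    return img + rng.normal(0.0, sigma, size=img.shape)

def degrade_bernoulli(img, p, rng):
    """Keep each pixel independently with probability p, zero it otherwise."""
    mask = rng.random(img.shape) < p
    return img * mask

def make_training_pair(y, degrade, param, rng):
    """Return (z, y): the network input z is a further-degraded copy of
    the already noisy image y, and y itself is the training target."""
    z = degrade(y, param, rng)
    return z, y

rng = np.random.default_rng(0)
x = np.full((64, 64), 0.5)         # stand-in clean image, never shown to the network
y = degrade_gaussian(x, 0.1, rng)  # the only image assumed available for training
z, target = make_training_pair(y, degrade_gaussian, 0.05, rng)
```

Calling `make_training_pair` again with a fresh random draw gives a different input for the same target, which matches the idea of degrading the noisy image randomly each time.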

### 3.1 Loss functions

One of the factors that comes into play is the choice of loss function. Both [18] and [24] show the effectiveness of choosing a loss function suited to the particular noise removal task.

#### 3.1.1 Additive Gaussian noise

For additive Gaussian noise, the most commonly chosen loss term is the Mean Squared Error (MSE) loss

$$\mathcal{L}_{\text{MSE}}(w) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_w(z_i) - y_i \right\|_2^2 \tag{4}$$

where $w$ are the learnable parameters of the neural network $f_w$, $z_i$ is the more noisy input and $y_i$ is the noisy target. We also use the MSE loss as the starting loss function for additive Gaussian noise. A perceptual loss term (details ahead) is also added to preserve more patch-level details and avoid blurriness in the outputs.

#### 3.1.2 Multiplicative Bernoulli Noise

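For multiplicative Bernoulli noise, the loss function is modified so that missing pixels do not count towards the loss. The loss term is modified from the MSE loss function as:

$$\mathcal{L}_{\text{masked}}(w) = \frac{1}{N} \sum_{i=1}^{N} \left\| m_i \odot \left( f_w(z_i) - y_i \right) \right\|_2^2 \tag{5}$$

where $m_i$ is a binary mask indicating whether each pixel of the target $y_i$ is present or missing, and $\odot$ denotes element-wise multiplication. The mask can be obtained simply by checking which pixels are zero-valued.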
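As a rough per-image sketch of the two losses above, the NumPy functions below compute a plain MSE and a masked MSE. The function names are hypothetical, and dividing by the number of valid pixels is an assumption made for this example; the text does not specify the normalization.

```python
import numpy as np

def mse_loss(pred, target):
    """Plain mean squared error over all pixels."""
    return np.mean((pred - target) ** 2)

def masked_mse_loss(pred, target):
    """MSE restricted to pixels that are present in the target.

    The mask is derived by checking which target pixels are non-zero.
    Normalizing by the count of valid pixels is an assumption.
    """
    mask = (target != 0).astype(pred.dtype)
    n_valid = max(mask.sum(), 1.0)  # guard against an all-zero target
    return np.sum(mask * (pred - target) ** 2) / n_valid

target = np.array([[0.0, 1.0], [2.0, 0.0]])  # zeros mark missing pixels
pred = np.array([[5.0, 1.0], [3.0, 7.0]])
plain = mse_loss(pred, target)           # penalizes the missing pixels too
masked = masked_mse_loss(pred, target)   # ignores them
```

A genuinely dark pixel with value exactly zero would also be treated as missing by this check, which is a known limitation of deriving the mask from the image itself.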

#### 3.1.3 Poisson Noise

For Poisson noise, we use the MSE loss function as well, since the value of each pixel is the same in expectation. Although $z_i$ does not follow a Poisson distribution when it is formed from $y_i$, we hypothesize that the output of the neural network $f_w(z_i)$ and $y_i$ would belong to the same distribution (since the network is encouraged to learn it). Hence, the network is expected to invert the degradation model. If there existed a deterministic function $g$ such that $y_i = g(x_i)$ and $z_i = g(y_i)$, and the network were a good approximation of $g^{-1}$, then for an input $y_i$ the network would output $x_i$. The degradation model is not deterministic, but over a large number of training iterations, the model should learn to output the average of the targets associated with a particular input. Therefore, an MSE loss term seems like a good baseline loss function to train on. The MSE loss is meant to ensure that $f_w(z_i)$ follows the same distribution as $y_i$ as the model learns to invert the Poisson degradation. To preserve higher-level information, we also add a perceptual loss term.
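A short worked check makes the two claims in this paragraph concrete: the pixel value is preserved in expectation, yet the re-corrupted image is no longer Poisson distributed. The derivation assumes, as a simplification not stated in the text, that noise is applied per pixel with unit gain, so the noisy pixel is $y \mid x \sim \mathrm{Poisson}(x)$ and the re-corrupted pixel is $z \mid y \sim \mathrm{Poisson}(y)$. Using the laws of total expectation and total variance:

```latex
\begin{aligned}
\mathbb{E}[z \mid x] &= \mathbb{E}\big[\,\mathbb{E}[z \mid y] \,\big|\, x\big] = \mathbb{E}[y \mid x] = x, \\
\operatorname{Var}[z \mid x] &= \mathbb{E}\big[\operatorname{Var}[z \mid y] \,\big|\, x\big]
  + \operatorname{Var}\big[\,\mathbb{E}[z \mid y] \,\big|\, x\big]
  = \mathbb{E}[y \mid x] + \operatorname{Var}[y \mid x] = 2x.
\end{aligned}
```

Under this assumption the mean of the re-corrupted pixel equals the clean value, which is consistent with using an MSE loss, while its variance is twice its mean, so it cannot be Poisson (a Poisson variable has variance equal to its mean) and is noisier than the once-corrupted pixel.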

### 3.2 Perceptual Loss

To compensate for the blurriness caused by the averaging effect of the MSE loss and the resulting loss of perceptual features, [15], [17] propose an additional perceptual loss term which also captures higher order relationships between the 2 images. In our case, the perceptual loss consists only of a content loss given by:

$$\mathcal{L}_{\text{content}}(w) = \frac{1}{N} \sum_{i=1}^{N} \left\| \phi(f_w(z_i)) - \phi(y_i) \right\|_2^2 \tag{6}$$

where $\phi$ is a mapping from the image to a feature map, usually the output of an intermediate layer of a network pretrained on a large image dataset. In our case, $\phi$ is the output of the relu2_2 layer of a VGGNet [22] trained on the ImageNet [10] dataset. Since we train on medical datasets and use a network pretrained on a very different dataset, we use the outputs of an early layer, which usually only encode low-level information like gradients, texture, etc. In our experiments, we see that adding a perceptual loss term helps in removing the blurriness of the outputs.
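The structure of the combined objective can be illustrated with a small NumPy sketch. To keep the example self-contained, the VGG relu2_2 feature map is replaced by a hypothetical stand-in, `gradient_features`, which computes simple finite differences; this swap, the function names and the use of a mean rather than a sum are assumptions for illustration only.

```python
import numpy as np

def gradient_features(img):
    """Hypothetical stand-in for the feature map phi.

    Returns horizontal and vertical finite differences, cropped to a
    common shape. The paper uses VGG relu2_2 features instead.
    """
    dx = np.diff(img, axis=1)[:-1, :]
    dy = np.diff(img, axis=0)[:, :-1]
    return np.stack([dx, dy])

def content_loss(phi, pred, target):
    """Squared distance between feature maps of prediction and target."""
    return np.mean((phi(pred) - phi(target)) ** 2)

def total_loss(phi, pred, target, weight):
    """Pixel-wise MSE plus a weighted content (perceptual) term."""
    pixel_term = np.mean((pred - target) ** 2)
    return pixel_term + weight * content_loss(phi, pred, target)

rng = np.random.default_rng(0)
target = rng.random((16, 16))
pred = target + 0.3  # constant offset: pixel error without structural error
loss = total_loss(gradient_features, pred, target, weight=0.5)
```

With this stand-in, a constant brightness offset changes the pixel term but leaves the feature term at zero, which shows how the two terms respond to different kinds of error.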

### 3.3 Iterative refinement

Given the training scheme, inference can be done simply by passing the noisy image $y_i$ through the network and taking the output. However, during training, the network sees a lot of variation between inputs and targets, as multiple instances of $z_i$ are created for a particular instance of $y_i$. A particular value of $z_i$ may be close to one or more instances of $y_i$ in the manifold. Hence, the network may take sub-optimal steps in order to minimize the net error across all training examples. To tackle such a case, we take the output of the network, form a weighted average of the current output and the previous output, and pass it again as input for the next iteration. Although this makes inference slower than the single-pass baseline, we observe subtle improvements in the validation scores.

The averaging weight and the number of iterations can be chosen to fit a validation dataset. We observed that the optimal values of these parameters depend on the noise levels as well as the type of noise. The update can also be seen as a residual update whose learning rate is the averaging weight. We describe some implementation details in the next section.
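One possible reading of the refinement loop is sketched below. The exact update rule is not recoverable from the text, so the form used here, where the current estimate moves a fraction `alpha` of the way towards the network output, is an assumption, as are the names `iterative_refinement`, `alpha`, `num_iters` and the toy denoiser.

```python
import numpy as np

def iterative_refinement(denoise, noisy, alpha, num_iters):
    """Repeatedly blend the current estimate with the network output.

    Assumed update: current <- current + alpha * (denoise(current) - current),
    i.e. a weighted average (1 - alpha) * current + alpha * denoise(current).
    """
    current = noisy
    for _ in range(num_iters):
        current = current + alpha * (denoise(current) - current)
    return current

def toy_denoiser(img):
    """Hypothetical stand-in for the trained network: replaces every
    pixel with the image mean."""
    return np.full_like(img, img.mean())

rng = np.random.default_rng(0)
noisy = 0.5 + rng.normal(0.0, 0.1, size=(32, 32))
refined = iterative_refinement(toy_denoiser, noisy, alpha=0.01, num_iters=10)
```

Each iteration costs one forward pass, so the inference time grows linearly with `num_iters`.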

## 4 Implementation

For all the experiments, we use the commonly used U-Net architecture [20]. All networks are trained with the Adam optimizer [16], with a learning rate of and weight decay of . All the inputs are scaled to the range of . We use 3 medical datasets for different tasks to show the effectiveness of the method over a variety of datasets.

For the experiment with Bernoulli noise, we use 2D slices of T1-weighted images from a subset of the IXI Guys dataset [2] containing 322 3-dimensional T1-weighted images. We divide the dataset into 289 images for training and 33 images for validation. For training, we modify the dataset as follows: all the noisy images $y_i$ are generated by retaining every pixel of the clean slice $x_i$ with a probability $p_1$. The value of $p_1$ is chosen from the range of . For each $y_i$, an instance $z_i$ is created by further retaining the pixels of $y_i$ with a probability $p_2$. The value of $p_2$ is chosen randomly between . During validation, all the images are corrupted with Bernoulli noise of probability , and metrics are computed against the ground truth slices. Perceptual loss was not used in this case, since perceptual loss would also capture patch-level similarity, and incorporating the binary mask in this loss would be non-trivial. Finally, the iterative refinement is applied for 10 iterations with .

For the experiment with Gaussian noise, we use the Camelyon 16 dataset [1], which contains whole-slide images of hematoxylin and eosin (H&E) stained lymph node sections. Specifically, we use patches from whole-slide images for the training and validation sets: 6100 patches chosen uniformly from the training set are used for training, and 1200 patches chosen uniformly from the validation set are used for validation. The clean images are corrupted with iid Gaussian noise of standard deviation $\sigma_1$. The value of $\sigma_1$ is randomly chosen from 3 to 50 intensity levels on a scale of 255. The image intensities are scaled to the range , and this forms a particular instance of the noisy image $y_i$. From this instance, a more noisy version $z_i$ is created by adding iid Gaussian noise to $y_i$ with standard deviation ranging from 0 to 25 intensity levels. During validation, the standard deviation of the noise is always fixed to 20 intensity levels. The perceptual loss term is used along with the MSE term. A scaling factor of 0.5 for the perceptual term suffices in this case. The iterative refinement is applied for 10 iterations with the averaging weight set to 0.01.

For the experiment with Poisson noise, we chose the JSRT dataset [3]. The dataset contains sized high-resolution chest X-ray images. We follow the 50-50 train-val split as prescribed by [3]. However, in the interest of faster computation, the images are resized to a size of pixels. For a given clean image $x_i$, the noisy image $y_i$ is generated by applying Poisson noise to the clean image. Furthermore, $z_i$ is generated by applying Poisson noise to $y_i$. Note that $z_i$ does not follow the same type of distribution as $y_i$ (unlike in the previous cases). However, $z_i$ was found to be more noisy than $y_i$ in our experiments, and we decided to use MSE + perceptual loss as the loss function as a starting baseline. The coefficient of the perceptual loss was set to 0.2 for our experiments with this dataset. The training is done for 30 epochs. Iterative refinement was applied during testing for 100 iterations with the averaging weight set to 0.01.

Note that in all these cases, only a single noisy image $y_i$ is formed from a given clean image $x_i$, so that the model does not see multiple noisy instances corresponding to a single clean image and learn to predict it by minimizing the average loss. This is in contrast to the code released by [18], where different noisy instances are computed every iteration for a given clean image. Since the more noisy inputs $z_i$ are always different, overfitting was not a concern in our experiments.

## 5 Results and Discussion

We use the 2 commonly used metrics for denoising: (i) PSNR (peak signal-to-noise ratio) and (ii) SSIM (structural similarity index). PSNR is generally used as a quality measurement between an original image and a compressed image [4], and is also commonly used to measure the quality of a denoised image against a good-quality reference. PSNR for 2 images $u$ and $v$ is defined as

$$\mathrm{PSNR}(u, v) = 10 \log_{10} \left( \frac{R^2}{\mathrm{MSE}(u, v)} \right) \tag{7}$$

where $\mathrm{MSE}(u, v)$ is the mean squared error between the predicted clean image and the ground truth image, and $R$ is the maximum fluctuation in the input image data type. In this case, since we scale the image from 0 to 1, $R = 1$. The second metric, SSIM, measures structural similarity between 2 images based on local means and variances. SSIM for 2 images $u$ and $v$ is defined as

$$\mathrm{SSIM}(u, v) = \frac{(2 \mu_u \mu_v + C_1)(2 \sigma_{uv} + C_2)}{(\mu_u^2 + \mu_v^2 + C_1)(\sigma_u^2 + \sigma_v^2 + C_2)} \tag{8}$$

where $\mu_u, \mu_v$ are the averages of $u, v$, $\sigma_u, \sigma_v$ are the standard deviations, and $\sigma_{uv}$ is the covariance of $u$ and $v$. $C_1$ and $C_2$ are constants set to their default values.
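Both metrics can be sketched directly from their definitions. The version below is a simplified illustration: `ssim_global` evaluates the SSIM formula once over the whole image instead of over local windows, and the constants follow the commonly used choice $C_1 = (0.01 R)^2$, $C_2 = (0.03 R)^2$, which is an assumption here since the text only says that default values were used.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB (undefined if the images are identical)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim_global(u, v, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula applied to whole-image statistics.

    Simplification: standard SSIM averages this quantity over local
    windows; here a single window covering the image is used.
    """
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_u, mu_v = u.mean(), v.mean()
    var_u, var_v = u.var(), v.var()
    cov = np.mean((u - mu_u) * (v - mu_v))
    numerator = (2 * mu_u * mu_v + c1) * (2 * cov + c2)
    denominator = (mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
clean = rng.random((32, 32))
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)
scores = (psnr(noisy, clean), ssim_global(noisy, clean))
```

For reported numbers, a windowed implementation from an established image-processing library would be the safer choice, since the single-window variant generally gives different values.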

### 5.1 Bernoulli noise

Figure 2 shows some input images, their reconstructions, and the ground truth image. Unlike the experiments done in [18], our training pairs do not contain any extra information beyond the noisy image $y_i$. This makes the problem harder for the neural network, since it has to fill in the missing pixels itself without mutual sharing of information between noisy pairs. However, we see that the network learns to reconstruct the entire image effectively and with high accuracy.

| Method | SSIM | PSNR (dB) |
|---|---|---|
| Noisy | 0.5029 | 14.2522 |
| Ours | 0.7363 | 32.3628 |

Training was done for 15 epochs on an NVIDIA GTX 1080 GPU. Relative to the noisy input, SSIM improves from 0.5029 to 0.7363 and PSNR from 14.25 dB to 32.36 dB.

### 5.2 Gaussian Noise

Figure 3 shows some input images, their reconstructions, and the ground truth image, similar to the previous experiment. Here, the network is encouraged to learn less noisy versions of the images, and the network performs well on the validation dataset, owing to the variation in the noise that the model gets to see throughout the training process.

Here, the model can make use of the perceptual loss term as well, to exploit more patch-level similarities and correspondences. Training the network with and without the perceptual loss term for the same number of epochs shows a difference in the validation scores: adding the perceptual term raises SSIM from 0.8522 to 0.8931 and PSNR from 29.24 dB to 30.16 dB. In particular, the perceptual loss reduces blurriness of the output and maintains crisper edges than the MSE-only variant. Table 2 highlights the difference. Some qualitative results can also be seen in Figure 3.

| Method | SSIM | PSNR (dB) |
|---|---|---|
| Noisy | 0.7102 | 26.1487 |
| BM3D | 0.8518 | 29.2043 |
| Ours (MSE) | 0.8522 | 29.2364 |
| Ours (MSE + Perceptual) | 0.8931 | 30.1610 |

The network can remove most of the noise without distorting or blurring the features in the image. Random images are picked from the validation set to demonstrate the effectiveness of this training method.

### 5.3 Poisson Noise

Figure 4 shows some of the results of denoising on X-ray images corrupted by Poisson noise. This dataset was particularly hard because the images do not have very high contrast, and the ribs and other structures are barely visible. Also, the noisy images $y_i$ and the more noisy images $z_i$ do not follow the same type of distribution, unlike in the other cases, so the task is expected to be harder than the previous tasks. However, the MSE loss seems like a good baseline and removes a substantial amount of noise from the corrupted image. Table 3 shows the difference between noisy and denoised images.

| Method | SSIM | PSNR (dB) |
|---|---|---|
| Noisy | 0.5253 | 24.2515 |
| Ours | 0.6692 | 27.4641 |

There is an improvement over the baseline noisy images: SSIM rises from 0.5253 to 0.6692 and PSNR from 24.25 dB to 27.46 dB. Note that training is done by corrupting the noisy instances from the training set to generate training pairs $(z_i, y_i)$, and no information about clean images is known. Although the approach isn’t as good as training with clean samples, there are non-trivial improvements and it is still comparable to fully supervised alternatives.

## 6 Conclusion

We proposed a training strategy that uses a dataset containing only noisy examples and makes the model infer the dynamics of the particular noise model. The denoising is blind in the sense that the parameter of the noise model is not known and can differ between images. We show significant denoising on 3 different datasets, and performance comparable with supervised algorithms. Owing to the power of deep neural networks, we show that neural networks can be made into good approximators of stochastic degradation models by trying to push images towards the manifold of clean images. Interpreting denoising as learning to move towards this manifold is a useful technique when clean data is unavailable, which is common in medical settings.

## References

- [1] Camelyon 16 dataset. https://camelyon16.grand-challenge.org/.
- [2] IXI Guys dataset. http://brain-development.org/ixi-dataset/.
- [3] JSRT dataset. http://db.jsrt.or.jp/eng.php.
- [4] PSNR. https://in.mathworks.com/help/vision/ref/psnr.html.
- [5] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process., 54(11):4311–4322, Nov. 2006.
- [6] F. J. Anscombe. The transformation of poisson, binomial and negative-binomial data. Biometrika, 35(3/4):246–254, 1948.
- [7] H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang. Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE transactions on medical imaging, 36(12):2524–2535, 2017.
- [8] A. Creswell and A. A. Bharath. Denoising adversarial autoencoders. IEEE Transactions on Neural Networks and Learning Systems, pages 1–17, 2018.
- [9] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising with block-matching and 3D filtering.
- [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- [11] A. Faaeq, H. Gürüler, and M. Peker. Image classification using manifold learning based non-linear dimensionality reduction. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pages 1–4. IEEE, 2018.
- [12] L. Gondara. Medical image denoising using convolutional denoising autoencoders. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 241–246. IEEE, 2016.
- [13] M. Hein and M. Maier. Manifold denoising. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 561–568. MIT Press, 2007.
- [14] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE TRANSACTIONS ON NEURAL NETWORKS, 8(1):65–74, 1997.
- [15] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
- [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [17] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
- [18] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. Mar. 2018.
- [19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [20] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- [22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [23] S. Tripathi, Z. C. Lipton, and T. Q. Nguyen. Correction by projection: Denoising images with generative adversarial networks, 2018.
- [24] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
- [25] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.
- [26] S. Wang, Z. Su, L. Ying, X. Peng, S. Zhu, F. Liang, D. Feng, and D. Liang. Accelerating magnetic resonance imaging via deep learning. In ISBI, 2016.
- [27] Y. Wang and H. Zhou. Total variation wavelet-based medical image denoising. International Journal of Biomedical Imaging, 2006, 2006.
- [28] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 341–349. Curran Associates, Inc., 2012.
- [29] J. Xu, L. Zhang, and D. Zhang. A trilateral weighted sparse coding scheme for real-world image denoising. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
- [30] D. Yang and J. Sun. BM3D-Net: A convolutional neural network for transform-domain collaborative filtering. IEEE Signal Processing Letters, 25:55–59, 2018.
- [31] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.