Deep Residual Autoencoder for quality independent JPEG restoration

Simone Zini, Simone Bianco and Raimondo Schettini S. Zini, S. Bianco, and R. Schettini are with the Department of Informatics, Systems and Communication, University of Milano-Bicocca, 20126 Milan, Italy (email: s.zini1@campus.unimib.it; simone.bianco@unimib.it; schettini@disco.unimib.it).
Abstract

In this paper we propose a deep residual autoencoder exploiting Residual-in-Residual Dense Blocks (RRDB) to remove artifacts in JPEG compressed images that is independent of the Quality Factor (QF) used. The proposed approach leverages both the learning capacity of deep residual networks and prior knowledge of the JPEG compression pipeline. The proposed model operates in the YCbCr color space and performs JPEG artifact restoration in two phases using two different autoencoders: the first one restores the luma channel exploiting 2D convolutions; the second one, using the restored luma channel as a guide, restores the chroma channels exploiting 3D convolutions.

Extensive experimental results on three widely used benchmark datasets (i.e. LIVE1, BSD500, and CLASSIC-5) show that our model is able to outperform the state of the art with respect to all the evaluation metrics considered (i.e. PSNR, PSNR-B, and SSIM). This result is remarkable since the approaches in the state of the art use a different set of weights for each compression quality, while the proposed model uses the same weights for all of them, making it applicable to images in the wild where the QF used for compression is unknown. Furthermore, the proposed model shows a greater robustness than state-of-the-art methods when applied to compression qualities not seen during training.

JPEG restoration, deep learning, residual network, autoencoder.

I Introduction

Fig. 1: PSNR-SSIM comparison of the state-of-the-art models and our proposed method. For both metrics a higher value means better visual results.

Image compression is a very active research topic due to the impact of image data in a wide range of fields, from image sharing on the web to specialized applications in which images are acquired and transferred to processing nodes.

Specifically, image compression refers to the task of representing images using the smallest storage space possible.

Compression algorithms play a key role in saving space and bandwidth when storing and transferring large amounts of images. Two different compression paradigms exist: the former is lossless image compression, where the compression rate is limited by the requirement that the original image must be perfectly recovered; the latter, more widespread, is lossy image compression, where higher compression rates are possible at the cost of some distortion in the recovered image. Among the lossy compression algorithms, the most widespread is the JPEG compression algorithm.

The JPEG compression algorithm first converts the original RGB image into the YCbCr color space and processes the luma and chroma channels separately. It divides the luma channel of the input image into non-overlapping blocks and performs the Discrete Cosine Transform (DCT) on each block separately, while downsampling the chroma components with a bilinear filter. The DCT coefficients obtained from the luma channel are then quantized based on quantization tables and adjusted using the user-selected quality factor. The image is then reconstructed from the quantized DCT coefficients by using the inverse DCT. The described JPEG encoding operation introduces three kinds of artifacts in the recovered images, related to the quality factor used for the compression: i) blocketization artifacts, which come from the recombination of the blocks, which are compressed independently without considering adjacent blocks; ii) ringing artifacts, which are most visible along edges and are related to the coarse quantization of the high-frequency components; iii) blurred low-frequency areas, also related to the compression of the high frequencies in the DCT domain.
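
To make the origin of these artifacts concrete, the following minimal Python sketch (an illustration, not the actual JPEG codec) applies a block-wise DCT followed by a coarse uniform quantization to a luma channel; the block size and the single quantization step are illustrative assumptions, whereas the real JPEG standard uses frequency-dependent quantization tables scaled by the QF.

```python
import numpy as np
from scipy.fftpack import dct, idct

def blockwise_dct_quantize(y, block=8, step=40.0):
    """Toy JPEG-like degradation of a luma channel y (uint8, HxW)."""
    h, w = y.shape[0] // block * block, y.shape[1] // block * block
    out = np.empty((h, w), dtype=np.float64)
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = y[i:i + block, j:j + block].astype(np.float64) - 128.0
            # 2D DCT of the block, as in the JPEG encoder
            coeffs = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            # coarse quantization: the source of blocking and ringing artifacts
            coeffs = np.round(coeffs / step) * step
            # inverse DCT, as in the JPEG decoder
            rec = idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')
            out[i:i + block, j:j + block] = rec + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)
```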

The presence of these kinds of artifacts represents a problem since the general quality of the images is degraded, which is unpleasant for end users in generic applications (e.g. projection, print, etc.), or can even make the images useless for computer vision applications where the loss of information can be critical for the task [1, 2].

With the purpose of reducing these artifacts, in recent years many JPEG artifact reduction algorithms have been proposed. These methods include both traditional image processing pipelines [3, 4, 5, 6, 7, 8] and machine learning approaches [9, 10, 11, 12, 13, 14, 15, 16], both of which have made great steps in the restoration of corrupted images. However, these methods suffer from two main limits: the first is that they need to train a different model for each possible quality factor (QF), making them not generally applicable to images downloaded from the web unless the QF used for compression is known; the second is that the great majority of methods in the state of the art restore just the luma channel or do not fully exploit knowledge about the JPEG compression pipeline.

To address these problems we propose a new method for the restoration of JPEG compressed images in the YCbCr color space, based on machine learning, specifically on convolutional autoencoders. The proposed approach consists of two deep autoencoders used for luma and chroma restoration respectively, which are able to restore images independently of the quality factor used for compression. The main contributions are the following:

  • the design of a method for the restoration of JPEG compression artifacts that is independent of the QF used;

  • the design of a model trainable end-to-end that fully exploits knowledge about the JPEG compression pipeline;

  • a thorough comparison with the state of the art on three standard datasets at fixed QFs;

  • an analysis of robustness of restoration results at QFs not used for training.

II Related Works

The task of JPEG compression artifact removal has been addressed in different ways over the past years. Existing methods can be broadly classified into two groups: traditional image processing methods and learning-based methods.

To the first group belong methods based on traditional image processing techniques working both in the spatial and in the frequency domain. For spatial domain processing, different kinds of filters have been proposed with the intent of restoring specific areas of the images such as edges [3], textures [4], smooth regions [5], etc. For frequency domain processing, algorithms usually rely on information obtained by the application of the Discrete Cosine Transform (DCT) [6]. SA-DCT, proposed by Foi et al. [7], attempts to reconstruct an estimate of the signal using the DCT of the original image together with the spatial information contained in the image itself. However, SA-DCT is not capable of reproducing details like sharp edges or complex textures. To overcome this limit, different restoration-oriented methods have been proposed, like the Regression Tree Fields based method (RTF) [8]. RTF uses the results of SA-DCT to restore images, taking advantage of a regression tree field model.

Following the success of Deep Convolutional Neural Networks (Deep-CNNs) in image processing tasks such as image denoising [10] and single-image super-resolution [17], Deep-CNNs have been successfully applied to the JPEG compression artifact removal task. The basic idea behind Deep-CNNs is to learn a function that maps a set of images from an input distribution to the desired output distribution. In the artifact removal case the objective is to map degraded images into a distribution free of compression noise. The trained neural network obtained at the end of the training process represents an approximation of the desired function for translating images from one distribution to the other.

The first attempt with this kind of models was made by Dong et al. [9], who proposed ARCNN, a model inspired by SRCNN [17], a neural network for super-resolution. This first attempt was followed by DnCNN [10], a CNN for the general denoising task that has also been used on JPEG compressed images, and CAS-CNN [11], a model proposed by Cavigelli et al., who presented a much deeper model capable of producing higher quality images. Wang et al. proposed D3 [12], a deep neural network that adopts JPEG-related priors to improve reconstruction quality, obtaining an improvement in speed and performance with respect to the previous models. In 2017, Galteri et al. [13] developed a generative adversarial network (GAN) [18] for artifact removal and texture reconstruction.

In 2018 several new models for JPEG artifact removal were presented, showing interesting improvements in result quality. Liu et al. [14] proposed the Multi-level Wavelet CNN (MWCNN), a model based on the U-Net architecture [19], trained and used for multiple tasks: compression artifact removal, denoising and super-resolution. Zhang et al. [15] developed DMCNN, a Dual-Domain Multi-Scale CNN, which achieves higher result quality than previous works by using both pixel and frequency (i.e. DCT) domain information. Lastly, S-Net, the most recent method by Zheng et al. [16], proposed a "greedy loss architecture" to train deeper models capable of outperforming the previous state of the art.

III Proposed Method

The methods in the state of the art mainly suffer from two limits: the first is that each machine learning model needs to know the JPEG compression Quality Factor (QF) of each input image in order to properly restore it; the second is that the great majority of them are capable of restoring only the luma channel without considering the chroma components, and the only one that recovers all three channels [16] does not fully exploit theoretical knowledge of the JPEG compression pipeline.

In this work we propose a method able to overcome both these problems. The first problem has to do with the way the models are trained: all of the previously existing methods make the implicit assumption that the compression quality factor QF used to compress the input images is known. In fact, most of the previous models present networks trained on datasets compressed at specific quality factors. This way of training the models leads to two limits:

  • the models are capable of correctly restoring only images at a specific QF, with the consequence that a specific training for each quality factor is needed;

  • the QF used for the compression of the images must be known in order to train a model and correctly restore the images: this information is usually not available for images coming from unknown sources (e.g. downloaded from the web), thus largely limiting the usability of the model.

In order to overcome the necessity of knowing the compression quality factor, we train our model on a dataset containing images compressed at different QFs: this makes the model more generic and able to restore images taken in the wild, i.e. without knowing the actual QF used. This objective poses a challenge, since the training of such a quality-independent model is much harder than training on a single quality factor.

The second problem concerns the way the previous models restore the images: all of the previous state-of-the-art methods are trained on the luma channel (Y channel of the YCbCr space) of the images. This approach is based on the fact that the JPEG compression algorithm applies the DCT to the Y channel, introducing ringing and blocketization artifacts on the luma channel, while the Cb and Cr channels are just sub-sampled with bicubic interpolation. The design and training of a model for the specific restoration of the luma component and its subsequent application to the restoration of the chroma components (as done for example by ARCNN [9]) introduces chromatic aberrations and artifacts in the final result. S-Net [16] is the only method considering this problem: instead of training a model for the restoration of just the luma component, it takes a full RGB image as input and recovers a full RGB image as output.

To overcome this second limit and obtain better results, we exploit the knowledge of how the JPEG compression pipeline works and propose the use of two models for image restoration in the YCbCr space: the first model restores the Y channel; the second model then uses the result as a Structure Map (i.e. a guide) for the restoration of the chroma components. A schematic representation of the proposed method is depicted in Figure 2.

III-A Luma and Chroma Restoration Model

The vast majority of learning-based methods for JPEG compression artifact removal in the state of the art [9, 10, 11, 12, 14, 15] focus exclusively on the luma component of the images. Generally these methods perform compression artifact removal by working on the Y channel of the images, after converting them to the YCbCr color space. The learned model in some cases is then applied as-is also on the Cb and Cr channels (e.g. [9]). These approaches do not take into consideration the chroma aspects of the images, generating results with aberrations in RGB space and low perceptual quality.

Fig. 2: Schematic representation of the proposed method: the input image is first converted to YCbCr color space. The Y channel is restored with the Y-net and the result Y’ is concatenated with the original CbCr channels to restore Cb’Cr’ with the CbCr-net. Restored Y’Cb’Cr’ channels are then converted back to RGB color space.

Moreover, the JPEG compression algorithm, when operating with very low compression quality factors, tends to change the colors of the input images in two different ways: hue change and spatial location change. As can be seen in Figure 3, in the compressed version of the Cb and Cr channels, as expected the color resolution is reduced and, for some elements, the color position does not correspond to the one in the original uncompressed image.

Keeping the above considerations in mind, we propose a method for restoring both the luma and chroma components of the compressed images (see Figure 2). The method consists of two steps: the first step, after the conversion of the input image into the YCbCr color space, involves the restoration of the Y channel alone, using a first model named LumiNet, and produces Y' as output. The second step concatenates Y'CbCr along the channel dimension and uses a second model, named ChromaNet, to restore the CbCr channels. This second step uses Y' as a map of the structures present in the image (i.e. a sort of guide) to condition the second network to recover the color hue and contours, and produces Cb'Cr' as output. The final output is obtained by concatenating Y'Cb'Cr' and converting them back to RGB. LumiNet and ChromaNet are two different deep CNN autoencoders, both exploiting a revisited version of the Residual Blocks [20].
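
The two-step restoration can be summarized by the following sketch, where the color-space conversions and the two trained networks are passed in as callables; the function names and signatures are hypothetical and only illustrate the data flow of Figure 2.

```python
import torch

def restore_jpeg(rgb, lumi_net, chroma_net, rgb_to_ycbcr, ycbcr_to_rgb):
    """rgb: (N, 3, H, W) tensor; conversions and networks are callables (assumed)."""
    ycbcr = rgb_to_ycbcr(rgb)
    y, cb, cr = ycbcr[:, 0:1], ycbcr[:, 1:2], ycbcr[:, 2:3]
    y_restored = lumi_net(y)                          # step 1: restore the luma channel
    guided = torch.cat((y_restored, cb, cr), dim=1)   # Y' acts as a structure map
    cbcr_restored = chroma_net(guided)                # step 2: restore the chroma channels
    restored = torch.cat((y_restored, cbcr_restored), dim=1)
    return ycbcr_to_rgb(restored)
```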

Fig. 3: Visual example of how the JPEG compression algorithm, when operating with very low compression quality factors, changes the colors of the input images in two different ways: hue change and spatial location change. Panels: (a) Original, (b) Compressed, (c) Original Cb, (d) Compressed Cb, (e) Original Cr, (f) Compressed Cr.

III-B Deep Residual Autoencoder Architecture

Fig. 4: Graphical representation of the architecture of the autoencoders used for both the luma and chroma restoration.

Autoencoder architectures have been widely used in image processing tasks like image-to-image translation [21], super-resolution [22], image inpainting [23] and rain removal [24]. Autoencoders generally present a structure made of three parts: the encoder, which extracts features from the input (usually with 1 or 3 channels); a central part, which performs feature processing; and the final decoder, which decodes the processed features into the output image with the desired dimensions. Figure 4 shows a schematic representation of the proposed model, while a more detailed description of its architecture is reported in Table I.

The encoder, which consists of two convolutions followed by Leaky ReLU activations, is followed by a central part for feature enhancement consisting of a sequence of Residual-in-Residual Dense Blocks (RRDB) [25], a modified version of the well-known residual blocks originally introduced in the ResNet architecture [20], which have been shown to perform well in other image processing tasks, e.g. image super-resolution [26, 25]. The RRDB blocks combine multi-level residual learning and a dense connection architecture: they are designed without Batch Normalization and apply residual learning at different levels. The RRDBs are shown in Figure 5: each RRDB is made of five Dense Blocks, which use only convolutions with Leaky ReLU activations and dense skip connection structures, combined together with other skip connections. Finally, the decoder is designed in a symmetrical way with respect to the encoder.
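
A minimal PyTorch sketch of a dense block and of the resulting RRDB is reported below; the number of base channels, the growth channels and the residual scaling value follow common ESRGAN-style choices and are assumptions, since the exact values are not specified here.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Dense block with five 3x3 convolutions, Leaky ReLU activations, dense skip
    # connections and no batch normalization. nf (base channels), gc (growth
    # channels) and the residual scaling value are assumptions.
    def __init__(self, nf=64, gc=32, res_scale=0.2):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, gc, 3, 1, 1)
        self.conv2 = nn.Conv2d(nf + gc, gc, 3, 1, 1)
        self.conv3 = nn.Conv2d(nf + 2 * gc, gc, 3, 1, 1)
        self.conv4 = nn.Conv2d(nf + 3 * gc, gc, 3, 1, 1)
        self.conv5 = nn.Conv2d(nf + 4 * gc, nf, 3, 1, 1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)
        self.res_scale = res_scale

    def forward(self, x):
        c1 = self.lrelu(self.conv1(x))
        c2 = self.lrelu(self.conv2(torch.cat((x, c1), 1)))
        c3 = self.lrelu(self.conv3(torch.cat((x, c1, c2), 1)))
        c4 = self.lrelu(self.conv4(torch.cat((x, c1, c2, c3), 1)))
        c5 = self.conv5(torch.cat((x, c1, c2, c3, c4), 1))
        return x + self.res_scale * c5   # residual scaling at the dense-block output

class RRDB(nn.Module):
    # Residual-in-Residual Dense Block: a cascade of dense blocks wrapped by an
    # outer residual connection, again scaled before the sum.
    def __init__(self, nf=64, n_dense=5, res_scale=0.2):
        super().__init__()
        self.dense_blocks = nn.Sequential(*[DenseBlock(nf) for _ in range(n_dense)])
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.dense_blocks(x)
```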

The same architecture has been used for both the luma and the chroma restoration networks, with some differences:

  • different depth in terms of number of RRDBs used in the central part;

  • different feature extraction from the input in the encoder part.

For the restoration of the luma (Y channel) the number of central RRDBs is set to five, while for the CbCr restoration the number of RRDBs is decreased to three. The second and more important difference is in the first layer of the CbCr version of the network, which is a 3-dimensional convolutional layer. Considering that the input of the CbCr-Net is the concatenation (along the channel dimension) of the restored Y' channel with the Cb and Cr channels, we decided to use a 3D convolution to make the model capable of correlating information about color and structure using the same kernels for all the information coming from the three input channels. The output of this second network is the two restored Cb and Cr channels, which are then concatenated with the restored Y' channel in order to obtain the complete restored image.
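
The following sketch illustrates how the concatenated Y'CbCr planes can be fed to a single 3D convolution by treating the three planes as the depth dimension; the layer parameters and the final reduction of the depth dimension are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

# Hypothetical first layer of the CbCr-Net: the three stacked planes (Y', Cb, Cr)
# become the depth dimension of a 3D convolution, so the same kernels correlate
# structure and color information across the planes.
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=3, padding=1)

ycbcr = torch.rand(1, 3, 128, 128)      # (N, planes, H, W): Y', Cb, Cr
x = ycbcr.unsqueeze(1)                  # (N, 1, 3, H, W): planes treated as depth
features = conv3d(x)                    # (N, 64, 3, H, W)
# One possible way to continue with 2D layers is to fold depth into channels;
# the exact reduction used in the paper is not detailed here.
features_2d = features.flatten(1, 2)    # (N, 192, H, W)
```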

Layer              Filter size, Stride, Padding   Output channels
Conv2D             3x3, 1, 1                      64
Encoder: Conv2D    5x5, 1, 2                      128
         LReLU     -                              128
         Conv2D    3x3, 1, 1                      64
         LReLU     -                              64
RRDB x B
Decoder: Conv2D    3x3, 1, 1                      128
         LReLU     -                              128
         Conv2D    5x5, 1, 2                      64
         LReLU     -                              64
         Conv2D    3x3, 1, 1                      1
         Tanh      -                              1
TABLE I: Detailed architecture of the autoencoders used for both the luma and chroma restoration. The number B of RRDBs is 5 for the Y-Net and 3 for the CbCr-Net.
Fig. 5: Schematic representation of the architecture of the Residual-in-Residual Dense Block (RRDB) [25].

In order to improve the quality of the generated results, as well as to make the training process more stable, the proposed architecture includes the following design choices:

  • removal of Batch Normalization (BN) layers from the Residual Blocks;

  • use of a residual scaling parameter in each Residual Block;

  • initialization of the model weights using a scaled version of the Kaiming initialization[27].

The removal of the batch normalization layers has been shown, in image super-resolution [26] and image deblurring [28] tasks, to increase the performance in the generation of images in terms of quality indexes (PSNR and SSIM [29]). The removal of the BN layers improves the appearance of the generated images but, since these layers help stabilize the training, it also makes the training of deep networks more difficult. To address this issue, two solutions have been shown to work well: the so-called residual scaling, which scales each residual in order not to wrongly magnify the input image, and a small weight initialization, obtained by scaling the Kaiming initialization presented by He et al. [27] by a constant factor. As can be seen in Figure 5, the residual scaling is applied at the higher levels of the residual learning architecture, i.e. on the output of each dense block and at the end of the RRDBs.
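
A possible implementation of the scaled Kaiming initialization is sketched below; the scaling factor is not reported here, and the value used in the sketch is only an assumption in line with common ESRGAN-style practice.

```python
import torch.nn as nn

def scaled_kaiming_init(module, scale=0.1):
    """Kaiming (He) initialization scaled down by a constant factor (assumed 0.1)."""
    if isinstance(module, (nn.Conv2d, nn.Conv3d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, a=0.2, mode='fan_in',
                                nonlinearity='leaky_relu')
        module.weight.data.mul_(scale)   # small weights ease deep residual training
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(scaled_kaiming_init)  # applied recursively to every layer
```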

IV Experimental Setup

The training of the proposed method leads to two different Deep-CNNs, respectively for the restoration of the luminance and chroma components of JPEG compressed images at a generic quality (i.e. independently of the QF used). In order to evaluate the results, our models have been compared with the state of the art in four different experimental setups:

  1. known QF luminance restoration: comparison with the state-of-the-art methods which work only on the Y channel of the input images;

  2. unknown QF luminance restoration: comparison to test the ability of the models to restore images at intermediate QFs never seen during training;

  3. high and low detail density area restoration: evaluation of the performance of the state-of-the-art methods and the proposed one over specific areas of the images, obtained by dividing the images into patches classified from high to low frequency (DCT domain) and from high to low detail density;

  4. color restoration: evaluation of the color restoration capability of the model on the images converted back to RGB space after processing.

IV-A Dataset

The dataset used for training is the DIV2K dataset, a collection of high-quality images (2K resolution) presented during the NTIRE2017 challenge [30] for image restoration tasks. This dataset is made of a total of 900 images: 800 are used for training while the remaining 100 are used for validation. The complete dataset also contains 100 images for testing; the ground truths of this last part have not been released after the challenge, and therefore are not used in this paper.

With the purpose of increasing the amount of different textures and patterns shown to the model during training, we combined the DIV2K dataset with the Flickr2K dataset [31], a collection of 2650 high-quality images (same resolution as DIV2K) collected from the Flickr website.

In order to train the models on different quality factors, for each image in the dataset we applied 10 different compression levels, corresponding to the quality factors between 10 and 100, with step 10. The images have been compressed in RGB space with the MATLAB standard library function, and the compressed images have then been converted to YCbCr space using the Python Scikit-Image library during the training phase. The compressed version of the training dataset contains 8000 images. The same operation has been applied to the Flickr2K dataset, for a total amount of 34k training images.
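
A Python equivalent of this dataset preparation step could look like the sketch below; the paper uses the MATLAB standard library for compression, so this is only an illustrative alternative, and the directory names and file pattern are assumptions.

```python
from pathlib import Path
from PIL import Image

src_dir = Path("DIV2K_train_HR")     # assumed location of the uncompressed images
dst_dir = Path("DIV2K_train_JPEG")   # assumed output folder
dst_dir.mkdir(exist_ok=True)

# Save every training image at the ten JPEG quality factors (10, 20, ..., 100)
# so that a single model sees all the compression levels during training.
for img_path in sorted(src_dir.glob("*.png")):
    img = Image.open(img_path).convert("RGB")
    for qf in range(10, 101, 10):
        img.save(dst_dir / f"{img_path.stem}_qf{qf:03d}.jpg", format="JPEG", quality=qf)
```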

The evaluation of our model has been done on LIVE1 [29], Classic-5 and BSD500 [32], three benchmark datasets widely used for the evaluation of JPEG artifact removal algorithms. For the evaluation of the behaviour of the models with an unknown compression quality factor we adopted SDIVL [33], a dataset proposed for the Image Quality Assessment task.

IV-B Evaluation metrics

The metrics commonly adopted for the evaluation of image quality in artifact removal tasks are the PSNR, PSNR-B [34] (which focuses the evaluation on the blocketization in the image) and SSIM [29] indexes. The PSNR and PSNR-B indexes give information about the quality of the images in terms of noise and perceived quality, with PSNR-B also taking into consideration the blocketization artifacts; the SSIM index is an indicator of the quality of the edges and structures contained in the image. For all three indexes a higher value means that the content and the structures in the reconstructed image are more similar to the ones in the target image.
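
For reference, PSNR and SSIM on a restored luma channel can be computed as in the sketch below; the paper uses the MATLAB implementations, and PSNR-B, which also penalizes blocking, is not available in scikit-image and is therefore omitted here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_y(restored_y, target_y):
    """Both inputs are uint8 luma channels of the same size (data_range assumed 255)."""
    psnr = peak_signal_noise_ratio(target_y, restored_y, data_range=255)
    ssim = structural_similarity(target_y, restored_y, data_range=255)
    return psnr, ssim
```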

IV-C Training Details

All the training has been done on an NVIDIA GTX 1070 GPU using the PyTorch framework at version 0.4.1. During the experiments we trained the network with different mini-batch and input patch sizes, observing that training deeper networks with a bigger patch size gives a boost in performance on both the PSNR and SSIM indexes.

We also explored the use of different numbers of RRDBs in the model: we observed that, with deeper models using this specific kind of residual block, the results kept improving, increasing the PSNR and SSIM values on the validation set. The final structure uses five RRDBs for the Y channel restoration model and three RRDBs for the CbCr model. We found this configuration to be the best one with respect to the patch size, the number of RRDBs, the number of filters and the memory limits of our GPU.

We trained the model using the Adam optimizer [35], with the learning rate decreased by a constant factor after a fixed number of training epochs. The training has been performed using the L1 loss, since it allows us to achieve better PSNR results and makes the training more stable.
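
The training loop described above can be summarized by the following skeleton; the learning rate, the decay step and the decay factor are placeholders, since the exact values are not reproduced here.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, loader, epochs, lr=1e-4, step_size=50, gamma=0.5, device="cuda"):
    model = model.to(device)
    criterion = nn.L1Loss()                      # L1 loss, as described above
    optimizer = Adam(model.parameters(), lr=lr)  # Adam optimizer [35]
    scheduler = StepLR(optimizer, step_size=step_size, gamma=gamma)
    for _ in range(epochs):
        for compressed, target in loader:
            compressed, target = compressed.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(compressed), target)
            loss.backward()
            optimizer.step()
        scheduler.step()                         # decay the learning rate
```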

V Experimental Results

Fig. 6: PSNR-SSIM comparison of the state-of-the-art models and our proposed method. For both metrics a higher value means better visual results.

V-A Restoration with known compression Quality Factor

We compared our model with the state-of-the-art models ARCNN[9], CAS-CNN[11], D3[12], and the more recent DMCNN[15], MWCNN[14], ARGAN[13] and S-Net[16].

Since the state-of-the-art methods operate only on the Y channel of the images, in order to make a fair comparison the metrics are evaluated between the Y channel recovered by the first network and the corresponding target images, using the MATLAB standard libraries, over five different compression qualities: 10, 20, 40, 60 and 80. For each method, on all the datasets considered, we report the results taken from the corresponding publication, except for ARCNN and MWCNN, which provide the source code that we used for the evaluation. Since the training of the proposed method leads to a single model that can be used for all the quality factors, we used the same model for the evaluation at all the qualities previously mentioned. All the compared state-of-the-art methods, instead, have a different trained model for each QF considered.

Tables II, III and IV report the comparison on the LIVE1, BSD500 and Classic-5 datasets respectively, for all three metrics considered. As can be seen, our model outperforms the state of the art on all the metrics. With the proposed model we obtained improvements with respect to the state-of-the-art methods in both general perceptual quality (PSNR/PSNR-B) and structure reconstruction (SSIM). Since each index focuses on different aspects of the restoration quality, no single index is capable of summarizing all the aspects of a good reconstruction. Therefore, we also compare the methods in a graph-style view, reported in Figures 1 and 6, to correlate the two indexes. In order for a method to obtain a more pleasing perceived quality, both metrics need to reach high values. From this kind of view it is easy to see how the proposed method outperforms the current state-of-the-art models even if a single model is used for all the QFs.

Metric  QF  ARCNN[9]  DnCNN[10]  CAS-CNN[11]  D3[12]  DMCNN[15]  MWCNN[14]  S-NET[16]  ARGAN-MSE[13]  ARGAN[13]  OUR
PSNR    10  29.13     29.19      29.44        29.96   29.73      29.37      29.87      29.45          27.29      29.97
PSNR    20  31.40     31.59      31.70        32.21   32.09      31.58      32.26      31.77          28.35      32.34
PSNR    40  33.63     33.96      34.10        -       -          34.17      34.61      34.09          28.99      34.78
PSNR    60  -         -          35.78        -       -          -          -          -              -          36.47
PSNR    80  -         -          38.55        -       -          -          -          -              -          39.31
PSNR-B  10  28.74     -          29.19        29.45   29.55      28.85      -          29.10          26.69      29.60
PSNR-B  20  30.69     -          30.88        31.35   31.32      30.83      -          31.26          28.10      31.76
PSNR-B  40  33.12     -          33.68        -       -          33.33      -          33.40          28.84      33.96
PSNR-B  60  -         -          35.10        -       -          -          -          -              -          35.51
PSNR-B  80  -         -          37.73        -       -          -          -          -              -          38.26
SSIM    10  0.823     0.812      0.833        0.823   0.842      0.832      0.847      0.834          0.773      0.850
SSIM    20  0.886     0.880      0.895        0.890   0.905      0.891      0.907      0.896          0.817      0.908
SSIM    40  0.931     0.924      0.937        -       -          0.936      0.942      0.922          0.837      0.944
SSIM    60  -         -          0.954        -       -          -          -          -              -          0.960
SSIM    80  -         -          0.973        -       -          -          -          -              -          0.976
TABLE II: Comparison on test set LIVE1: for the methods in the state of the art a different model is trained for each QF considered. The proposed method uses the same model for all the QFs.
Metric  QF  ARCNN[9]  DnCNN[10]  CAS-CNN[11]  D3[12]  DMCNN[15]  MWCNN[14]  S-NET[16]  ARGAN-MSE[13]  ARGAN[13]  OUR
PSNR    10  29.10     -          -            -       29.67      29.50      29.82      29.03          27.01      29.92
PSNR    20  31.25     -          -            -       31.98      31.34      32.15      31.20          28.07      32.23
PSNR    40  33.55     -          -            -       -          33.23      34.45      33.30          28.61      34.61
PSNR-B  10  28.75     -          -            -       -          28.60      -          28.61          26.30      29.41
PSNR-B  20  30.60     -          -            -       -          29.84      -          30.48          27.76      31.39
PSNR-B  40  32.80     -          -            -       -          31.04      -          32.18          28.20      33.34
SSIM    10  0.819     -          -            -       0.840      0.835      0.844      0.807          0.746      0.847
SSIM    20  0.885     -          -            -       0.904      0.889      0.905      0.876          0.794      0.906
SSIM    40  0.929     -          -            -       -          0.928      0.941      0.921          0.815      0.943
TABLE III: Comparison on test set BSD500: for the methods in the state of the art a different model is trained for each QF considered. The proposed method uses the same model for all the QFs.
Metric  QF  ARCNN[9]  DnCNN[10]  CAS-CNN[11]  D3[12]  DMCNN[15]  MWCNN[14]  S-NET[16]  ARGAN-MSE[13]  ARGAN[13]  OUR
PSNR    10  29.04     29.40      -            -       -          29.68      -          -              -          29.67
PSNR    20  31.16     31.63      -            -       -          31.78      -          -              -          31.89
PSNR    40  33.34     33.77      -            -       -          34.05      -          -              -          34.04
PSNR-B  10  28.75     -          -            -       -          29.06      -          -              -          29.35
PSNR-B  20  30.60     -          -            -       -          30.95      -          -              -          31.43
PSNR-B  40  32.80     -          -            -       -          33.20      -          -              -          33.33
SSIM    10  0.811     0.803      -            -       -          0.828      -          -              -          0.829
SSIM    20  0.869     0.861      -            -       -          0.878      -          -              -          0.882
SSIM    40  0.910     0.900      -            -       -          0.916      -          -              -          0.917
TABLE IV: Comparison on test set Classic-5: for the methods in the state of the art a different model is trained for each QF considered. The proposed method uses the same model for all the QFs.

V-B Restoration with unknown compression Quality Factor

Another evaluation concerns the capability of the models to recover images compressed at quality factors never seen during training. In most real use-cases, the JPEG compression quality factor previously applied to an image is not known: it is then important that a model is able to recover the images without this prior information. On the other hand, even if we are able to at least estimate the compression quality factor of the input compressed image, following the previous approaches we would have to train a new model for each specific quality factor needed, or use the model trained for the QF closest to the desired one.

We compare our model with the two state-of-the-art models for which the code is available (i.e. ARCNN and MWCNN) on a specific selection of cases. Since the previous models have been trained on specific quality factors, and our model has instead been trained over quality factors from 10 to 100 in steps of 10, without the use of images with QFs in between, we decided to test the model robustness on "never seen" artifacts. In order to perform the evaluation in a coherent way, for the state-of-the-art algorithms we used the pretrained model for the quality factor nearest to the one of the input image. For this evaluation we adopted the SDIVL dataset: each image of the testset has been compressed with all of the quality factors in the interval [5, 25]. The evaluation is done in the same way as in the previous section, by extracting the Y channel and measuring the PSNR, PSNR-B and SSIM indexes.

Figure 7 shows the results of the models on SDIVL at all the compression quality factors considered. As can be seen in those graphs, our model shows a more stable behaviour: it is capable of restoring images at different QFs with a more coherent and smooth behaviour as the QF increases, in comparison with the other methods. Moreover, the previous state-of-the-art models have difficulties in restoring images at quality factors distant from the one used for training. It is particularly interesting to see how the other models struggle to restore images at qualities higher than the QF used in training, in terms of structures in the images (Figure 7c), due to the more complex textures never seen by the models during the training phase.

(a) PSNR
(b) PSNR-B
(c) SSIM
Fig. 7: Comparison on QFs not seen during training. For ARCNN and MWCNN the models trained for QF=10 and QF=20 are tested on QF in the range [5, 25]. The proposed model is trained for QF in the range [10, 100] with steps of 10, and is tested on the same intermediate QFs not seen in training.

V-C High and low frequency areas restoration

In order to better understand whether the proposed method performs better than the approaches in the state of the art only on certain image types, we conduct a further experiment: we divide the images from the LIVE1 testset, compressed at a low quality factor, into patches and classify each of them into five categories. The categories are obtained by equally dividing the patches into five bins with respect to both frequency and detail density. Patch frequency is computed as the weighted average of the 2D Fourier Transform normalized magnitude. Patch detail density is computed as the 2D average of the result of the Canny edge detection. The results for the considered evaluation metrics over the five categories of frequency and detail density are reported in Tables V and VI respectively. From the reported results it is possible to notice that the proposed method consistently outperforms the state of the art on all the frequency and detail density categories.
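
The two patch scores can be computed as in the sketch below; the exact weighting and the binning thresholds used to form the five classes are not reported here, so the implementation details are assumptions.

```python
import numpy as np
from skimage.feature import canny

def patch_scores(patch):
    """patch: 2D uint8 luma patch; returns (frequency score, detail-density score)."""
    f = np.fft.fftshift(np.fft.fft2(patch.astype(np.float64)))
    mag = np.abs(f)
    mag /= mag.sum() + 1e-12                     # normalized magnitude
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)    # spatial frequency of each bin
    freq_score = (radius * mag).sum()            # magnitude-weighted mean frequency
    detail_score = canny(patch / 255.0).mean()   # fraction of Canny edge pixels
    return freq_score, detail_score
```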

             ARCNN                    MWCNN                    OUR
Frequency    PSNR   PSNR-B  SSIM      PSNR   PSNR-B  SSIM      PSNR   PSNR-B  SSIM
high         27.53  27.26   0.782     27.61  27.24   0.792     28.18  27.88   0.807
medium-high  25.00  24.66   0.685     25.24  24.67   0.700     25.64  25.18   0.729
medium       24.61  24.27   0.734     24.50  23.82   0.740     25.37  24.91   0.773
medium-low   25.92  25.49   0.794     25.91  25.24   0.803     26.73  26.21   0.827
low          27.08  25.93   0.840     26.72  25.27   0.849     27.81  26.52   0.864
TABLE V: Comparison by subdividing the image patches on the basis of the frequency content in five classes from high to low.
                ARCNN                    MWCNN                    OUR
Edges frequency PSNR   PSNR-B  SSIM      PSNR   PSNR-B  SSIM      PSNR   PSNR-B  SSIM
high            23.20  22.94   0.667     23.42  22.82   0.683     23.87  23.45   0.716
medium-high     24.69  24.39   0.721     24.91  24.31   0.735     25.42  25.02   0.763
medium          25.68  25.22   0.758     25.94  25.26   0.772     26.41  25.86   0.794
medium-low      26.83  26.12   0.805     27.01  25.95   0.817     27.47  26.61   0.832
low             29.17  28.22   0.884     27.45  26.28   0.888     29.97  28.99   0.897
TABLE VI: Comparison by subdividing the image patches on the basis of the detail density in five classes from low to high.

V-D Color Restoration

The final evaluation is focused on the color restoration capability of the models. The comparison, as in the previous evaluations, has been carried out among ARCNN [9], MWCNN [14] and our proposed model.

We restored the images from the LIVE1 testset compressed with the lowest quality factors considered. For this specific evaluation we restored both the luma and chroma components. In the case of the ARCNN and MWCNN methods, we adopted the same model for all three channels (Y, Cb and Cr), while our method uses the two different networks to first restore the luminance channel and then the chrominance channels.

For this comparison we computed the PSNR, PSNR-B and SSIM indexes over the restored images in RGB space, instead of evaluating only the luminance information, using the MATLAB standard library: numerical results can be seen in Table VII, and visual results are summarized in some patches from the LIVE1 images in Figure 8. As can be seen, the proposed model obtains better results than the other methods in terms of the PSNR, PSNR-B and SSIM indexes, and the difference is also evident in the final images. The blocketization and the color aberrations coming from the compression are blurred but retained by the other models, while they are removed by our model, which reshapes the color information with respect to the structures in the images. The results are much more pleasing and realistic than those of the other methods.

(a) Input (b) ARCNN (c) MWCNN (d) OUR (e) Target
Fig. 8: Visual comparison of the full color JPEG restoration.
LIVE1
Metric   QF  ARCNN  MWCNN  OUR
PSNR     10  28.97  29.84  29.98
PSNR     20  31.31  32.00  32.35
PSNR     40  33.64  34.58  34.80
PSNR-B   10  28.69  29.40  29.61
PSNR-B   20  30.78  31.28  31.77
PSNR-B   40  33.13  33.78  33.98
SSIM     10  0.822  0.846  0.851
SSIM     20  0.888  0.900  0.909
SSIM     40  0.931  0.941  0.944
TABLE VII: Comparison of the evaluation metrics computed on full-color restored images.

VI Conclusion

In this paper we proposed a deep residual autoencoder exploiting Residual-in-Residual Dense Blocks (RRDB) to remove artifacts in JPEG compressed images that is independent of the QF used. The proposed model operates in the YCbCr color space and performs a two-phase restoration of JPEG artifacts: in the first phase, an autoencoder exploiting 2D convolutions is used to restore the luma channel; in the second phase, a second autoencoder, stacking along the channel dimension the result of the first autoencoder and the original chroma channels, employs 3D convolutions to exploit the restored luma channel as a guide and restores the chroma channels.

The main contributions of this paper are: i) the design of a method for the restoration of JPEG compression artifacts that is independent of the QF used; ii) the design of a model trainable end-to-end that fully exploits knowledge about the JPEG compression pipeline; iii) a thorough comparison with the state of the art on three standard datasets at fixed QFs; iv) an analysis of the robustness of the restoration results at QFs not used for training.

Extensive experimental results on three widely used benchmark datasets (i.e. LIVE1, BSD500, and CLASSIC-5) show that our model is able to outperform the state of the art with respect to all the evaluation metrics considered (i.e. PSNR, PSNR-B, and SSIM). This result is remarkable since the approaches in the state of the art use a different set of weights for each compression quality, while the proposed model uses the same weights for all of them, making it applicable to images in the wild where the QF used for compression is unknown. Furthermore, the proposed model shows a greater robustness than state-of-the-art methods when applied to compression qualities not seen during training. Since preliminary experiments with the same architecture showed good results for the restoration of other artifacts (i.e. noise removal, in the CVPRW NTIRE2019 challenge), as future work we plan to investigate its extension to other single and multiple distortions [36].

References

  • [1] S. Dodge and L. Karam, “Understanding how image quality affects deep neural networks,” in 2016 eighth international conference on quality of multimedia experience (QoMEX).   IEEE, 2016, pp. 1–6.
  • [2] S. Bianco, L. Celona, and R. Schettini, “Robust smile detection using convolutional neural networks,” Journal of Electronic Imaging, vol. 25, no. 6, p. 063002, 2016.
  • [3] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE transactions on circuits and systems for video technology, vol. 13, no. 7, pp. 614–619, 2003.
  • [4] H. C. Reeve and J. S. Lim, “Reduction of blocking effects in image coding,” Optical Engineering, vol. 23, no. 1, p. 230134, 1984.
  • [5] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblocking,” Signal Processing: Image Communication, vol. 28, no. 5, pp. 522–530, 2013.
  • [6] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.
  • [7] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive dct for high-quality deblocking of compressed color images,” in Signal Processing Conference, 2006 14th European.   IEEE, 2006, pp. 1–5.
  • [8] J. Jancsary, S. Nowozin, and C. Rother, “Loss-specific training of non-parametric image restoration models: A new state of the art,” in European Conference on Computer Vision.   Springer, 2012, pp. 112–125.
  • [9] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 576–584.
  • [10] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [11] L. Cavigelli, P. Hager, and L. Benini, “Cas-cnn: A deep convolutional neural network for image compression artifact suppression,” in Neural Networks (IJCNN), 2017 International Joint Conference on.   IEEE, 2017, pp. 752–759.
  • [12] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang, “D3: Deep dual-domain based fast restoration of jpeg-compressed images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2764–2772.
  • [13] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep generative adversarial compression artifact removal,” arXiv preprint arXiv:1704.02518, 2017.
  • [14] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-cnn for image restoration,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [15] DMCNN: Dual-domain Multi-scale Convolutional Neural Network for Compression Artifacts Removal, 2018.
  • [16] B. Zheng, R. Sun, X. Tian, and Y. Chen, “S-net: a scalable convolutional neural network for jpeg compression artifact reduction,” Journal of Electronic Imaging, vol. 27, no. 4, p. 043037, 2018.
  • [17] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
  • [19] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [21] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” CVPR, 2017.
  • [22] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao, “Coupled deep autoencoder for single image super-resolution,” IEEE transactions on cybernetics, vol. 47, no. 1, pp. 27–37, 2017.
  • [23] J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in Advances in neural information processing systems, 2012, pp. 341–349.
  • [24] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, “Attentive generative adversarial network for raindrop removal from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2482–2491.
  • [25] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in The European Conference on Computer Vision Workshops (ECCVW), September 2018.
  • [26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in The IEEE conference on computer vision and pattern recognition (CVPR) workshops, vol. 1, no. 2, 2017, p. 4.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [28] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, vol. 1, no. 2, 2017, p. 3.
  • [29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [30] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
  • [31] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee et al., “Ntire 2017 challenge on single image super-resolution: Methods and results,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on.   IEEE, 2017, pp. 1110–1121.
  • [32] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2011.
  • [33] S. Corchs, F. Gasparini, and R. Schettini, “No reference image quality classification for jpeg-distorted images,” Digital Signal Processing, vol. 30, pp. 86–100, 2014.
  • [34] C. Yim and A. C. Bovik, “Quality assessment of deblocked images,” IEEE Transactions on Image Processing, vol. 20, no. 1, pp. 88–98, 2011.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [36] S. Corchs and F. Gasparini, “A multidistortion database for image quality,” in International Workshop on Computational Color Imaging.   Springer, 2017, pp. 95–104.