An End-to-End Compression Framework Based on Convolutional Neural Networks
Deep learning, e.g., convolutional neural networks (CNNs), has achieved great success in image processing and computer vision, especially in high-level vision applications such as recognition and understanding. However, it is rarely used to solve low-level vision problems such as image compression, which is studied in this paper. Here, we move a step forward and propose a novel compression framework based on CNNs. To achieve high-quality image compression at low bit rates, two CNNs are seamlessly integrated into an end-to-end compression framework. The first CNN, named compact convolutional neural network (ComCNN), learns an optimal compact representation from an input image, which preserves the structural information and is then encoded using an image codec (e.g., JPEG, JPEG2000 or BPG). The second CNN, named reconstruction convolutional neural network (RecCNN), is used to reconstruct the decoded image with high quality at the decoding end. To make the two CNNs collaborate effectively, we develop a unified end-to-end learning algorithm to simultaneously learn ComCNN and RecCNN, which facilitates the accurate reconstruction of the decoded image using RecCNN. Such a design also makes the proposed compression framework compatible with existing image coding standards. Experimental results validate that the proposed compression framework greatly outperforms several compression frameworks that use existing image coding standards with state-of-the-art deblocking or denoising post-processing methods.
I Introduction
In recent years, image compression has attracted increasing interest in image processing and computer vision due to its potential applications in many vision systems. The aim of image compression is to reduce irrelevance and redundancy of an image in order to store or transmit the image at low bit rates [wallace1992jpeg]. Traditional image coding standards [ghanbari2003standard] (such as JPEG and JPEG2000) attempt to distribute the available bits for every nonzero quantized transform coefficient in the whole image. As the compression ratio increases, the bits per pixel (BPP) decrease as a result of the use of bigger quantization steps, which causes the decoded image to have blocking artifacts or noise. To overcome this problem, much effort has been devoted to improving the quality of the decoded image using a post-processing deblocking or denoising method. Zhai et al. [zhai2008efficient] propose an effective deblocking method for JPEG images through post-filtering in shifted windows of image blocks. Foi et al. [foi2007pointwise] develop an image deblocking filter based on shape-adaptive DCT, in conjunction with the anisotropic local polynomial approximation-intersection of confidence intervals technique. Inspired by the success of nonlocal filters and bilateral filters for image deblocking, several nonlocal filters have been proposed for image deblocking [zhang2011image, francisco2012generic, wang2013adaptive]. Recently, Zhang et al. [zhang2016concolor] propose a constrained non-convex low-rank model for image deblocking. Although the desired performance is achieved, these post-processing methods are very time-consuming because solving for the optimal solutions involves computationally expensive iterative processes. Therefore, it is difficult to apply them in practical applications.
Motivated by the excellent performance of convolutional neural networks (CNNs) in low-level computer vision [dong2015compression, guo2016building, wang2016d3] in recent years and the fact that existing image codecs are extensively used across the world, we propose an end-to-end compression framework, which consists of two CNNs and an image codec as shown in Fig. 2. The first CNN, named compact convolutional neural network (ComCNN), learns an optimal compact representation from an input image, which is then encoded using an image codec (e.g., JPEG, JPEG2000 or BPG). The second CNN, named reconstruction convolutional neural network (RecCNN), is used to reconstruct the decoded image with high quality at the decoding end. Existing image coding standards usually consist of transformation, quantization and entropy coding. Unfortunately, the rounding function in quantization is not differentiable, which makes it very challenging to train deep neural networks with the backpropagation algorithm. To address this problem, we present a simple but effective learning algorithm to train the proposed end-to-end compression framework by simultaneously learning ComCNN and RecCNN to facilitate the accurate reconstruction of the decoded image using RecCNN. An example of image compression is shown in Fig. 1, from which we can see that the proposed framework achieves much better quality with more visual details. In addition, as shown in Fig. 2, the compact representation obtained by ComCNN preserves the structural information of the image; therefore, an image codec can be effectively utilized to compress the compact representation.
The contributions of this work are summarized as follows:
We propose an end-to-end compression framework using two CNNs and an image codec. ComCNN produces a compact representation for encoding using an image codec, and RecCNN reconstructs the decoded image. To the best of our knowledge, this is the first work to connect existing image coding standards with CNNs using a compact intermediate representation.
We further propose an effective learning algorithm to simultaneously learn two CNNs, which addresses the problem that the gradient can not be passed in the backpropagation algorithm since the rounding function in quantization is not differentiable.
The proposed compression framework is compatible with existing image codecs (e.g. JPEG, JPEG2000 or BPG), which makes our method more applicable to other tasks.
The remainder of this paper is organized as follows. Section 2 presents a brief review of related work. Section 3 elaborates the proposed compression framework, including the architectures of ComCNN and RecCNN. Section 4 illustrates the training parameters setting and the solutions to train the ComCNN and RecCNN. Experimental results are also reported in Section 4. In Section 5, we conclude this paper.
II Related Work
II-A Image Deblocking and Artifacts Reduction
In the literature, several methods have been proposed to improve the quality of decoded images using post-processing techniques, which can be roughly categorized into deblocking oriented and restoration oriented methods. The deblocking oriented methods focus on removing blocking and ringing artifacts of the decoded images. Yeh et al. [yeh2014self] propose a self-learning based post-processing method for image/video deblocking by formulating deblocking as a morphological component analysis based image decomposition problem. Yoo et al. [yoo2014post] propose a two-step framework for reducing blocking artifacts in different regions based on increment of inter-block correlation, which classifies the coded image into flat regions and edge regions. Liu et al. [liu2016data] learn sparse representations within the dual DCT-pixel domain and achieve very promising results. Recently, Dong et al. [dong2015compression] propose a compact and efficient network (ARCNN) for seamless attenuation of different compression artifacts. D3 [wang2016d3] and DDCN [guo2016building] integrate dual-domain sparse coding and the prior knowledge of JPEG compression, and achieve impressive results.
The restoration oriented methods regard the compression operation as a distortion process and reduce artifacts by restoring the clear images. Sun et al. [sun2007postprocessing] model the quantization distortion as Gaussian noises and use field of experts as image priors to restore the images. Zhang et al. [zhang2012reducing, zhang2013compression] propose to utilize similarity priors of image blocks to reduce compression artifacts by estimating the transform coefficients of overlapped blocks from non-local blocks. Recently, Zhang et al.[zhang2016concolor] develop a novel algorithm for image deblocking using a constrained non-convex low-rank model, which formulates image deblocking as an optimization problem within maximum a posteriori framework.
In the aforementioned methods, image prior models play important roles in both the deblocking oriented and restoration oriented methods. However, these methods involve computationally expensive iterative processes when solving for the optimal solutions with complex formula derivations. Therefore, they may not be suitable for practical applications. In short, all the related methods reviewed above attempt to improve image quality only from the perspective of image post-processing. In other words, the connection between the encoder front-end processing and the decoder back-end processing is ignored. We attempt to jointly optimize the encoder and decoder to improve the compression performance.
II-B Image Super-Resolution Based on Deep Learning
Recently, CNNs have been used successfully for image super-resolution (SR), especially since residual learning [he2015deep] and gradient-based optimization algorithms [duchi2011adaptive, zeiler2012adadelta, kingma2014adam] were proposed to train deeper networks efficiently. Dong et al. propose a CNN based SR method [dong2016image] named SRCNN, which consists of three layers: patch extraction, non-linear mapping and reconstruction. Although Dong et al. conclude in their paper that deeper networks do not result in better performance in some cases, other researchers argue that increasing depth significantly boosts performance. For example, VDSR [kim2015accurate], which uses 20 weight layers, shows a significant improvement in accuracy. DRCN [kim2015deeply] has a very deep recursive layer (up to 16 recursions) and outperforms previous methods by a large margin.
II-C Image Compression Based on Deep Learning
Recently, deep learning has been used for both lossy and lossless image compression and has achieved competitive performance. For lossy image compression, Toderici et al. [toderici2015variable] propose a general framework for variable-rate image compression and a novel architecture based on convolutional and deconvolutional LSTM recurrent networks. Further, Toderici et al. [toderici2016full] propose a neural network which is competitive across compression rates on images of arbitrary size. For a given compression rate, both methods learn the compression models by minimizing the distortion. Theis et al. [theis2017lossy] propose compressive autoencoders, which use a smooth approximation of the derivative of the rounding function and upper-bound the discrete entropy rate loss for continuous relaxation. Ballé et al. [balle2016end] make use of a generalized divisive normalization (GDN) for joint nonlinearity and replace rounding quantization with additive uniform noise for continuous relaxation. Li et al. [li2017learning] propose a content-weighted compression method based on an importance map of the image. For lossless image compression, the methods proposed by Theis et al. [theis2015generative] and van den Oord et al. [oord2016pixel] achieve state-of-the-art results.
Overall, although the image compression methods based on deep learning achieve competitive performance, they ignore compatibility with existing image codecs, which limits their use in some existing systems. In contrast, the proposed compression framework takes into account both compression performance and compatibility with existing image codecs.
III The Proposed Compression Framework
In this section, we first introduce the architecture of the proposed compression framework and then present the detailed learning algorithm.
III-A Architecture of End-to-End Compression Framework
As shown in Fig. 2, the proposed compression framework consists of two CNNs and an image codec. The compact representation CNN (ComCNN) is used to generate a compact representation of the input image for the encoding, which preserves structural information of the image and therefore facilitates the accurate reconstruction of high-quality images. The reconstruction CNN (RecCNN) is used to enhance the quality of the decoded image. These two CNNs collaborate with each other and are optimized simultaneously to achieve high-quality image compression at low bit rates.
III-A1 Compact Representation Convolutional Neural Network (ComCNN)
As shown in Fig. 3, ComCNN has 3 weight layers, which maintain the spatial structure of the original image and therefore facilitate the accurate reconstruction of the decoded image using RecCNN. (We have tried deeper networks to obtain better performance, but observed only negligible improvements at the expense of much more training time and cost.) The combination of convolution and ReLU [krizhevsky2012imagenet] is used in ComCNN. The first layer performs patch extraction and representation, which extracts overlapping patches from the input image. Let c represent the number of image channels. A total of 64 filters of size are used to generate 64 feature maps and the ReLU nonlinearity is utilized as an activation function. The second layer has two purposes: downscaling and enhancing the features, which are implemented by convolutional operations with the stride set to 2. The sizes of 64 filters are and ReLU is also applied. For the last layer, c filters of size are utilized to construct the compact representation.
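As a rough sanity check on the description above, the following sketch traces feature-map sizes and weight counts through the three ComCNN layers. The 3×3 kernel, the padding of 1 and the 180×180 single-channel input are assumptions made only for illustration. The text above fixes just the filter counts (64, 64, c) and the stride of 2 in the second layer.

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def comcnn_summary(height, width, c=1, kernel=3):
    """Return (name, out_h, out_w, out_channels, weights) per ComCNN layer.

    kernel=3 and pad=1 are illustrative assumptions, not values from the text.
    """
    # (name, in_channels, out_channels, stride) for the three weight layers
    layers = [
        ("conv1+relu", c, 64, 1),
        ("conv2+relu", 64, 64, 2),  # stride 2 halves the spatial resolution
        ("conv3", 64, c, 1),
    ]
    rows = []
    h, w = height, width
    for name, cin, cout, stride in layers:
        h = conv_out(h, kernel, stride)
        w = conv_out(w, kernel, stride)
        weights = kernel * kernel * cin * cout  # biases are not counted
        rows.append((name, h, w, cout, weights))
    return rows


for row in comcnn_summary(180, 180, c=1):
    print(row)
```

Under these assumptions the compact representation has half the width and half the height of the input, so the codec sees roughly a quarter of the original pixels.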
III-A2 Reconstruction Convolutional Neural Network (RecCNN)
As shown in Fig. 4, RecCNN is composed of 20 weight layers, which have three types of layer combinations: Convolution + ReLU, Convolution + Batch Normalization [ioffe2015batch] + ReLU, and Convolution. For the first layer, 64 filters of size are used to generate 64 feature maps, followed by ReLU. For the 2nd to 19th layers, 64 filters of size are used, and batch normalization is added between convolution and ReLU. For the last layer, c filters of size are used to reconstruct the output. Residual learning and batch normalization are applied to speed up the training process and boost the performance. The compressed image is upsampled to the original image size using bicubic interpolation.
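The layer layout described above can be written down as a small plan. This is only a structural sketch: the tuple format and the function name `reccnn_layers` are hypothetical, and kernel sizes are left out because the text does not give them.

```python
def reccnn_layers(c=1, depth=20, width=64):
    """Return (index, ops, in_channels, out_channels) for each RecCNN layer."""
    layers = [(1, ("conv", "relu"), c, width)]           # first layer
    for index in range(2, depth):                         # layers 2..19
        layers.append((index, ("conv", "bn", "relu"), width, width))
    layers.append((depth, ("conv",), width, c))           # last layer
    return layers


plan = reccnn_layers(c=1)
print(len(plan), plan[0], plan[-1])
```

Keeping the plan as data makes it easy to check the count of each layer type before building the network in a deep learning framework.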
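III-B Learning Algorithm
According to the proposed architecture, both ComCNN and RecCNN try to make the reconstructed image as similar as possible to the original image. Therefore, the end-to-end optimization goal can be formulated as
$$(\hat{\theta}_1, \hat{\theta}_2) = \arg\min_{\theta_1, \theta_2} \left\| Re\big(\theta_2, Co(Cr(\theta_1, x))\big) - x \right\|^2 \tag{1}$$
where $x$ represents the original image, and $\theta_1$ and $\theta_2$ are the parameters of ComCNN and RecCNN, respectively. $Cr(\cdot)$ and $Re(\cdot)$ represent ComCNN and RecCNN, respectively. $Co(\cdot)$ represents an image codec (e.g. JPEG, JPEG2000 or BPG). From this objective function, we can see that an original image passes through the compression pipeline, including ComCNN, the image codec and RecCNN, and finally outputs the reconstructed image $Re(\theta_2, Co(Cr(\theta_1, x)))$. Such a process is an end-to-end compression.
Unfortunately, the $Co(\cdot)$ in Eq.(1) involves a rounding function, which is not differentiable when performing the backpropagation algorithm. To solve this problem, we design an iterative optimization learning algorithm. By fixing $\theta_2$, we can get
$$\hat{\theta}_1 = \arg\min_{\theta_1} \left\| Re\big(\hat{\theta}_2, Co(Cr(\theta_1, x))\big) - x \right\|^2 \tag{2}$$
and by fixing $\theta_1$, we can obtain
$$\hat{\theta}_2 = \arg\min_{\theta_2} \left\| Re\big(\theta_2, Co(Cr(\hat{\theta}_1, x))\big) - x \right\|^2 \tag{3}$$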
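The difficulty with the rounding function noted above can be seen numerically. In the minimal sketch below, integer rounding stands in for the codec's quantizer and a single scale factor stands in for ComCNN. Both are illustrative assumptions, as are the names `quantize` and `finite_difference`. Away from the rounding boundaries the finite-difference gradient through the rounding is exactly zero, so backpropagation receives no signal.

```python
def quantize(value):
    """Stand-in for the codec's quantizer: round to the nearest integer."""
    return round(value)


def finite_difference(f, at, eps=1e-3):
    """Central finite-difference estimate of df/dx at `at`."""
    return (f(at + eps) - f(at - eps)) / (2 * eps)


scale = 0.37   # hypothetical one-parameter "ComCNN"
pixel = 100.0  # a single pixel value

grad_with_rounding = finite_difference(lambda s: quantize(s * pixel), scale)
grad_without_rounding = finite_difference(lambda s: s * pixel, scale)
print(grad_with_rounding, grad_without_rounding)
```

The first estimate is zero while the second is close to the pixel value, which is why the parameters of the network before the codec cannot be updated by plain backpropagation through it.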
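III-B1 Updating the Parameters of RecCNN
According to the network topology, an auxiliary variable $\hat{x}_m$ is introduced and defined as the decoded compact representation of the original image $x$, which can be formulated as
$$\hat{x}_m = Co(Cr(\hat{\theta}_1, x)) \tag{4}$$
where $Cr(\hat{\theta}_1, \cdot)$ denotes ComCNN with fixed parameters $\hat{\theta}_1$ and $Co(\cdot)$ denotes the image codec. After combining Eq.(4) and Eq.(3), we can obtain
$$\hat{\theta}_2 = \arg\min_{\theta_2} \left\| Re(\theta_2, \hat{x}_m) - x \right\|^2 \tag{5}$$
where $Re(\theta_2, \cdot)$ denotes RecCNN with parameters $\theta_2$.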
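III-B2 Updating the Parameters of ComCNN
From Eq.(2), we can see that it is not a trivial task to obtain the optimal ComCNN parameters $\hat{\theta}_1$, since the codec $Co(\cdot)$ is an inherently non-differentiable operation when performing backpropagation. To solve this problem, we define an auxiliary variable $\hat{x}_m^{*}$ as the optimal input of RecCNN:
$$\hat{x}_m^{*} = \arg\min_{x_m} \left\| Re(\hat{\theta}_2, x_m) - x \right\|^2 \tag{6}$$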
Here we make a reasonable and general assumption that is monotonic with respect to , which can be expressed as
, if and only if
Let be the solution of , i.e., for any possible , it satisfies that
Following assumption (7), we can obtain that
Combining with Eq.(2), we can get , which is
Since is an codec, a reasonable solution of Eq.(12) is
Combining Eq.(13) with assumption (7) above, we arrive at
Here, Eq.(13) is obtained under a reasonable assumption through the derivation above, and it serves as an approximation of Eq.(2). In this paper, we use Eq.(13) to train ComCNN instead of Eq.(2).
We can obtain the optimal and by iteratively optimizing Eq.(5) and Eq.(13), respectively. In light of all derivations above, the complete description of the proposed algorithm is given in Algorithm 1.
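The alternating scheme described above can be illustrated with a deliberately tiny stand-in. This is not the paper's Algorithm 1: two scalar gains `a` and `b` play the roles of ComCNN and RecCNN, integer rounding plays the role of the codec, and closed-form least-squares solutions replace network training. All of these, and every function name, are assumptions for illustration. The sketch follows one reading of the derivation: the codec output is treated as fixed input data when updating the reconstruction side, and the codec is bypassed when updating the compaction side.

```python
def codec(values):
    """Stand-in for the image codec: integer rounding (non-differentiable)."""
    return [float(round(v)) for v in values]


def sq_error(recon, target):
    return sum((r - t) ** 2 for r, t in zip(recon, target))


def update_rec(a, xs):
    """'RecCNN' step: least-squares fit of b on (decoded, original) pairs.

    The decoded values are plain inputs here, so nothing has to be
    differentiated through the rounding.
    """
    z = codec([a * x for x in xs])
    denom = sum(v * v for v in z)
    return sum(v * x for v, x in zip(z, xs)) / denom if denom else 1.0


def update_com(b, a_old):
    """'ComCNN' step with the codec bypassed: minimise sum((b*a*x - x)^2) over a."""
    return 1.0 / b if b else a_old


def train(xs, a=0.1, b=1.0, iterations=3):
    def error(a_cur, b_cur):
        return sq_error([b_cur * v for v in codec([a_cur * x for x in xs])], xs)

    history = [error(a, b)]
    for _ in range(iterations):
        b = update_rec(a, xs)       # fix a, update b
        history.append(error(a, b))
        a = update_com(b, a)        # fix b, update a
    return a, b, history


xs = [float(v) for v in range(0, 256, 5)]
a, b, history = train(xs)
print(a, b, history[0], history[1])
```

In this toy setting the first reconstruction update already lowers the end-to-end error, because `b` is fitted to the actual codec output rather than to an idealized one.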
III-C Loss Functions
III-C1 For ComCNN training
Given a set of original images and trained parameters , we use mean squared error (MSE) as the loss function
where and represents the batch size and the trainable parameter, respectively.
III-C2 For RecCNN training
Having obtained a set of compact representations from ComCNN and original images , we use MSE as the loss function:
where represents the trainable parameter and represents the residual learned by RecCNN. This loss looks somewhat different from Eq.(5), but the two are not contradictory: they are essentially identical, and Eq.(15) simply expresses Eq.(5) in residual form.
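The equivalence between the residual form and the direct form can be checked on a few numbers. The sketch below assumes the reconstruction is the decoded input minus the predicted residual. That sign convention, the plain mean used for the MSE and the function names are assumptions, not details taken from the text.

```python
def mse_direct(decoded, residual, original):
    """MSE between the reconstruction (decoded - residual) and the original."""
    n = len(original)
    return sum(((d - r) - x) ** 2
               for d, r, x in zip(decoded, residual, original)) / n


def mse_residual(decoded, residual, original):
    """MSE between the predicted residual and the true residual (decoded - original)."""
    n = len(original)
    return sum((r - (d - x)) ** 2
               for d, r, x in zip(decoded, residual, original)) / n


decoded = [10.0, 20.0, 30.0]    # decoded, upsampled compact representation
residual = [1.0, -2.0, 0.5]     # residual predicted by the network
original = [9.5, 21.0, 29.0]    # ground-truth pixels

print(mse_direct(decoded, residual, original),
      mse_residual(decoded, residual, original))
```

Both functions square the same difference with opposite sign, so they return the same value for any inputs.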
IV Experiments
To evaluate the performance of the proposed compression framework, we conduct experimental comparisons against standard compression methods (e.g., JPEG, JPEG 2000 and BPG) with a post-processing deblocking or denoising method. Five representative image deblocking methods, i.e. Sun’s [sun2007postprocessing], DicTV [chang2014reducing], Zhang’s [zhang2013compression], Ren’s [ren2013image], Zhang’s [zhang2016concolor], and two representative image denoising methods, i.e. BM3D [dabov2007image] and WNNM [gu2014weighted], are chosen due to their state-of-the-art performance. Moreover, ARCNN [dong2015compression] is also chosen since it is a landmark deblocking method based on deep learning and achieves state-of-the-art performance. Meanwhile, in order to demonstrate the effectiveness of ComCNN, we remove ComCNN from the framework and use only RecCNN to reconstruct the decoded image. Similarly, we remove RecCNN to examine the effect of ComCNN, using bicubic interpolation to obtain a reconstructed image of the same size as the original image. The results of all the compared methods are obtained by running the source code of the original authors with the optimal parameters. Throughout this section, we use the name of the post-processing method to denote a compared method.
We use the MatConvNet package [vedaldi2015matconvnet] to train the proposed networks. All experiments are carried out in the Matlab (R2015b) environment running on a computer with an Intel(R) Xeon(R) CPU E5-2670 2.60GHz and an Nvidia Tesla K40c GPU. It takes about 28 hours and one day to train ComCNN and RecCNN on the GPU, respectively.
IV-A Datasets for Training and Testing
Following [chen2015trainable], we use 400 images of size for training. A total of 204800 (400 × 8 × 64) patches are sampled with a stride of 20 on the training images as well as their augmentations (flip and rotate with different angles); for each image, there are 8 augmentations and 64 patches of size extracted. Our experiments indicate that using a larger training set brings only negligible performance improvements. For testing, we use 7 images as shown in Fig. 5, which are widely used in the literature. Please note that none of the test images is included in the training dataset.
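The patch count can be reproduced with a short calculation. The image and patch sizes are not stated in the text above, so the 180×180 images and 40×40 patches used here are assumptions, chosen as one combination that yields 64 patches per image at a stride of 20.

```python
def patches_per_side(image_size, patch_size, stride):
    """Number of patch positions along one side when sliding with `stride`."""
    return (image_size - patch_size) // stride + 1


num_images = 400     # from the text
augmentations = 8    # from the text
stride = 20          # from the text
image_size = 180     # assumed for illustration
patch_size = 40      # assumed for illustration

per_image = patches_per_side(image_size, patch_size, stride) ** 2
total = num_images * augmentations * per_image
print(per_image, total)  # prints: 64 204800
```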
IV-B Model Initialization
We initialize the weights of ComCNN using the method in [he2015delving] and use the Adam algorithm [kingma2014adam] with and . We train ComCNN for 50 epochs using a batch size of 128. The learning rate is decayed exponentially from 0.01 to 0.0001 over the 50 epochs. The weight initialization and gradient updating of RecCNN are the same as those of ComCNN. RecCNN is also trained for 50 epochs using the same batch size as ComCNN. The learning rate is decayed exponentially from 0.1 to 0.0001 over the 50 epochs.
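One common way to realise such an exponential decay is to space the per-epoch learning rates geometrically between the two endpoints. The sketch below assumes that interpolation rule and one rate per epoch. The exact schedule used in the experiments is not specified above, and `exponential_schedule` is a hypothetical helper.

```python
def exponential_schedule(lr_start, lr_end, epochs):
    """Geometric interpolation from lr_start (first epoch) to lr_end (last epoch)."""
    if epochs == 1:
        return [lr_start]
    ratio = (lr_end / lr_start) ** (1.0 / (epochs - 1))
    return [lr_start * ratio ** epoch for epoch in range(epochs)]


com_lrs = exponential_schedule(1e-2, 1e-4, 50)  # ComCNN endpoints from the text
rec_lrs = exponential_schedule(1e-1, 1e-4, 50)  # RecCNN endpoints from the text
print(com_lrs[0], com_lrs[-1], rec_lrs[0], rec_lrs[-1])
```

With this rule each epoch multiplies the learning rate by a constant factor, so the rates are evenly spaced on a logarithmic scale.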
IV-C Experimental Results
In our experiments, we set different quality factors (QF) to achieve the same bits per pixel (bpp) for both the proposed and compared methods. For the proposed method, we first manually adjust the QF for the JPEG compression of the compact representation to achieve almost the same bpp as the compared image enhancement methods. Then we compare the PSNR and SSIM of the proposed method with those of the compared methods. The comparison results for all test images with QF = 5 and QF = 10 are provided in Table I and Table II, respectively, where the best results are highlighted in bold.
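For reference, the two quantities used to match and compare methods can be computed as follows. This is a minimal sketch using the standard definitions of bpp and PSNR. It assumes 8-bit images (peak value 255) and flat sequences of pixel values. SSIM is omitted because it needs windowed statistics.

```python
import math


def bits_per_pixel(file_size_bytes, height, width):
    """Bit rate of a compressed file relative to the original image size."""
    return 8.0 * file_size_bytes / (height * width)


def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)


print(bits_per_pixel(8192, 512, 512))
print(psnr([0, 10, 20], [1, 11, 21]))
```

Note that the bit rate is measured against the original image size, even though in the proposed framework the codec compresses the smaller compact representation.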
As seen from Table I and Table II, in the case of QF = 5, the proposed compression framework achieves gains of 1.20dB in PSNR and 0.0227 in SSIM over Zhang’s [zhang2016concolor], which is the state of the art among the compared methods. It is worth mentioning that the proposed framework outperforms all the compared image enhancement methods, including ARCNN [dong2015compression], which is a milestone among CNN-based methods. Meanwhile, in the case of QF = 10, the proposed compression framework achieves gains of 0.43dB in PSNR and 0.0067 in SSIM over ARCNN [dong2015compression]. The visual quality comparisons in the case of QF = 5 for Lena are provided in Fig. 6. We can see that the blocking artifacts are obvious in the image decoded directly by standard JPEG. DicTV [chang2014reducing], Sun’s [sun2007postprocessing], WNNM [gu2014weighted] and BM3D [dabov2007image] remove the artifacts partially, but some artifacts are still visible in the reconstructed image. Zhang’s [zhang2013compression] and Ren’s [ren2013image] generate better results than Sun’s [sun2007postprocessing] and BM3D [dabov2007image]. However, blur along the edges is introduced at the same time. Zhang’s [zhang2016concolor] achieves a better PSNR and SSIM, but it over-smooths the image and discards some details at image edges. ARCNN [dong2015compression] and RecCNN achieve better visual quality than the other compared methods. The proposed compression framework not only removes most of the artifacts, but also preserves more details on both edges and textures than all the compared methods. In order to verify the effect of ComCNN, we remove ComCNN and use RecCNN alone to reconstruct the decoded image. Similarly, we remove RecCNN and only use ComCNN and bicubic interpolation to examine the effect of RecCNN. As shown in Table I and Table II, worse performance is obtained with only ComCNN or only RecCNN.
In addition, we show examples of the compact representation produced by ComCNN in Fig. 7. It can be seen that the compact representation maintains the structural information of the original image, but it is different from traditional downsampling methods. In a nutshell, both ComCNN and RecCNN play key roles in the proposed compression framework. Due to the collaboration of ComCNN and RecCNN, the compact representation preserves more useful information for the final image reconstruction. Our testing codes are available on GitHub (https://github.com/compression-framework/compression_framwork_for_tesing).
We also evaluate our framework on JPEG 2000 and BPG (Better Portable Graphics; F. Bellard, The BPG Image Format, http://bellard.org/bpg/) and achieve excellent performance. BPG compression is based on High Efficiency Video Coding (HEVC), which is considered a major advance in compression techniques. For JPEG2000, we test the proposed compression framework at different bit rates (from 0.1bpp to 0.4bpp) and compare it with JPEG 2000. Table III presents the comparison results with JPEG 2000. It can be seen that our framework significantly outperforms JPEG2000 at all test bit rates on all test images in terms of both PSNR and SSIM. For bpp from 0.1 to 0.4, the proposed framework achieves average gains of 3.06dB, 2.45dB, 1.34dB and 1.09dB in PSNR and 0.1047, 0.0709, 0.0525 and 0.0435 in SSIM over JPEG2000. In Fig. 8, one can see that the proposed compression framework achieves much better subjective quality than JPEG2000, especially at very low bit rates. Our framework preserves more high-frequency information and recovers sharp edges and clean textures in the reconstructed image. For BPG, we test the BPG codec at QP (quality parameter) = 43 and 47. Further, we keep the bit rates of the proposed compression framework almost the same for each image. The results are shown in Table IV. One can see that, if we treat RecCNN as a post-processing method, RecCNN achieves average gains of 0.81dB in PSNR and 0.0168 in SSIM. The proposed compression framework achieves average gains of 0.99dB in PSNR and 0.0218 in SSIM while saving 5.22% of the bit rate. It is worth noting that the gains of the proposed compression framework on BPG are less pronounced than those on JPEG and JPEG2000, because BPG is already a very good compression method, which might not be significantly improved further.
Table IV. Columns: Test Images | BPG | BPG + RecCNN | ComCNN + BPG + RecCNN
In order to show the effectiveness of the proposed compression framework, we test our method on the Set5 [kim2015accurate], Set14 [kim2015accurate], LIVE1 [sheikh2005live] and General-100 [dong2016accelerating] datasets. It is worth mentioning that the General-100 dataset contains 100 uncompressed bmp-format images, which are very suitable for the compression task. Results are shown in Table VI, from which we can see that the proposed compression framework outperforms JPEG and JPEG2000 by a large margin on all four testing datasets.
IV-D Running Time
The running time of all compared methods when dealing with a grayscale image on CPU or GPU is shown in Table V. It should be noted that it is not possible to test the running time on GPU for the other compared methods. As we can see from Table V, the proposed framework needs only 1.56s and 0.017s on CPU and GPU, respectively. Our compression framework is faster than the other post-processing methods. In addition, we also measure the running time of our sub-network RecCNN, which accounts for almost the entire running time of our method, because RecCNN has 20 layers and is much deeper than ComCNN, which has only 3 layers.
V Conclusion
In this paper, we propose an effective end-to-end compression framework based on two CNNs, one of which is used to produce a compact intermediate representation for encoding using an image encoder. The other CNN is used to reconstruct the decoded image with high quality. These two CNNs collaborate with each other and are trained using a unified optimization method. Experimental results demonstrate that the proposed compression framework achieves state-of-the-art performance and is much faster than most post-processing algorithms. Our work indicates that the performance of existing image codecs can be significantly improved by applying the proposed framework, which we hope will inspire other researchers to design better deep neural networks for image compression along this direction.
This work is partially funded by the Major State Basic Research Development Program of China (973 Program 2015CB351804), the Science and Technology Commission of China No.17-H863-03-ZT-003-010-01 and the Natural Science Foundation of China under Grant No. 61572155 and 61672188.
Feng Jiang received the B.S., M.S., and Ph.D. degrees in computer science from Harbin Institute of Technology (HIT), Harbin, China, in 2001, 2003, and 2008, respectively. He is now an Associate Professor in the Department of Computer Science, HIT and a visiting scholar in the School of Electrical Engineering, Princeton University. His research interests include computer vision, image and video processing and pattern recognition.
Wen Tao received the B.S. degree in computer science from Harbin Institute of Technology (HIT), Harbin, China, in 2016. He is now working towards the M.S. degree at the School of Computer Science and Technology, HIT. His current research interests are in image processing and computer vision.
Shaohui Liu, Associate Professor, received a Bachelor of Science degree in Computational Mathematics and Its Application Software Development, a Master of Science degree in Computational Mathematics and a Doctor of Philosophy degree in Computer Application Technology from Harbin Institute of Technology, Harbin, P. R. China, in 1999, 2001 and 2007, respectively. He is a senior member of CCF. Dr. Liu was a visiting professor and post-doctoral researcher at Sejong University in South Korea, and a Visiting Scholar at the University of Missouri Columbia in America. His research mainly includes image and video processing and analysis, computer vision and multimedia security. In the related fields, he has coauthored more than 80 papers, which have been cited more than 1000 times in total.
Jie Ren received his B.S. degree in 2015 from University of Electronic Science and Technology of China. He is now a master's student at Harbin Institute of Technology. His main research interests include image processing and multimedia compression technology.
Xun Guo received his Ph.D. degree in computer science from Harbin Institute of Technology, China, in 2007. From 2007 to 2012, he was with MediaTek Inc. as group manager, where he led a research team and worked mostly on video compression, especially on technology development for HEVC standard. He joined Microsoft Research Asia in 2012, where he is now a lead researcher. His research interests include video coding and processing, multimedia system and computer vision.
Debin Zhao (M’11) received the B.S., M.S., and Ph.D. degrees in computer science from Harbin Institute of Technology (HIT), Harbin, China, in 1985, 1988, and 1998, respectively. He is now a Professor in the Department of Computer Science, HIT. He has published over 200 technical articles in refereed journals and conference proceedings in the areas of image and video coding, video processing, video streaming and transmission, and pattern recognition.