A Unified End-to-End Framework for Efficient Deep Image Compression

Abstract

Image compression is a widely used technique to reduce the spatial redundancy in images. Recently, learning based image compression has achieved significant progress by exploiting the powerful representation ability of neural networks. However, the current state-of-the-art learning based image compression methods suffer from huge computational complexity, which limits their use in practical applications. In this paper, we propose a unified framework called Efficient Deep Image Compression (EDIC) based on three new technologies, including a channel attention module, a Gaussian mixture model and a decoder-side enhancement module. Specifically, we design an auto-encoder style network for learning based image compression. To improve the coding efficiency, we exploit the channel relationship between latent representations by using the channel attention module. Besides, the Gaussian mixture model is introduced for the entropy model and improves the accuracy of bitrate estimation. Furthermore, we introduce the decoder-side enhancement module to further improve image compression performance. Our EDIC method can also be readily incorporated into the Deep Video Compression (DVC) framework [1] to further improve video compression performance. Our EDIC method boosts the coding performance significantly while only slightly increasing the computational complexity. More importantly, experimental results demonstrate that the proposed approach outperforms the current state-of-the-art image compression methods and is more than 150 times faster in terms of decoding speed when compared with Minnen’s method [2]. The proposed framework also successfully improves the performance of the recent deep video compression system DVC [1].

Image compression, neural network, auto-encoder, attention mechanism, Gaussian mixture model.

I Introduction

Image compression aims to reduce the spatial redundancy in images and is widely used to save bandwidth and storage in many applications. Traditional image compression methods [3, 4, 5, 6] rely on hand-crafted techniques to improve the compression efficiency. For example, JPEG [3] uses the discrete cosine transform (DCT) to convert images from the pixel domain to the frequency domain for high compression efficiency. However, the traditional compression methods cannot be optimized by using large-scale training data, which may limit their performance.

Recently, learning based image and video compression methods [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 2, 18, 19, 1] have attracted more and more attention. Ballé et al. [9] propose an end-to-end optimized image compression approach by using a convolutional neural network (CNN) based auto-encoder. To further improve the compression efficiency, Minnen et al. [2] employ auto-regressive prior information to obtain an accurate entropy model and achieve comparable or even better performance than the traditional codec [6].

Although the current state-of-the-art learning based methods [18, 2] improve the compression performance, they also increase the computational complexity significantly. When compared with the previous learning approaches [9, 10], the current state-of-the-art methods [18, 2] exploit the spatial redundancy in the latent feature space by using auto-regressive prior information. Therefore, the decoding procedure in [18, 2] is performed sequentially for each pixel, while the previous approaches [9, 10] can reconstruct all the pixels through convolution layers in a parallel manner. As shown in Table I, the average GPU decoding time for Kodak images using Ballé’s method [10] is 0.013 seconds, while the corresponding decoding time using Minnen’s method [2] is 2.426 seconds.

Methods Decoding Time BDBR
Ballé’s [10] 0.013s
Minnen’s [2] 2.426s
EDIC (Ours) 0.016s
TABLE I: Decoding time and BDBR improvement over JPEG2000 [4] of different methods on the Kodak [20] image dataset. The decoding time is evaluated using one RTX 2080Ti GPU. BDBR stands for “Bjontegaard delta bitrate”, which refers to the relative bitrate reduction in percentage under the same PSNR.

In this paper, we ask the question: is it possible to improve the compression efficiency without significantly increasing the computational complexity? To address this issue, we propose a unified framework named Efficient Deep Image Compression (EDIC), which consists of three new components, including the channel attention module, the Gaussian mixture model and the decoder-side enhancement module. Specifically, we utilize an auto-encoder style network for building the image compression framework. To further improve the compression performance, we exploit the channel relationship in the latent features at the encoder side and use an effective channel attention module to enhance the corresponding representation power. More importantly, instead of using a single Gaussian model for entropy estimation like [9, 10, 2, 18], we propose to use a Gaussian mixture model (GMM) for more accurate entropy estimation. Besides, we introduce the decoder-side enhancement module to reduce the compression artifacts. The channel attention technique, the Gaussian mixture model and the decoder-side enhancement module are seamlessly combined, which leads to much better image compression performance with only slightly increased computational cost when compared with the auto-regressive prior technique in [2, 18]. Experimental results demonstrate that the proposed image compression approach achieves comparable compression performance with the current state-of-the-art approach [2], while the decoding speed of our method is over 150 times faster than [2] on Kodak images. Our method can be readily used for video compression and also achieves promising results.

The contributions of this paper are summarized in the following aspects. First, to the best of our knowledge, we are the first to introduce the channel attention technique to improve image compression efficiency. Second, the Gaussian mixture model is introduced to model the distribution of the latent representations in a more accurate way. Third, we additionally apply the decoder-side enhancement module to further improve image compression performance. Fourth, our proposed EDIC framework achieves state-of-the-art image compression performance while significantly reducing the decoding time when compared to Minnen’s method [2]. Fifth, the proposed framework is general and also improves the performance of the recent learning based video compression system [1].

Fig. 1: The framework of our proposed EDIC. Each convolution layer is denoted by the number of filters, kernel size, and stride. Downsampling and upsampling factors are indicated for each convolutional layer, and the hyper-parameters that set the number of channels for specific layers are also shown. “GDN” means generalized divisive normalization proposed in [21], and “IGDN” means inverse GDN. “Q” denotes quantization. “AE” and “AD” represent the arithmetic encoder and arithmetic decoder, respectively. The output of the hyper-decoder refers to the estimated parameters of the Gaussian mixture model.

II Related Work

II-A Traditional Image and Video Compression

Image and video compression techniques are widely used to save bandwidth and storage in practical applications. In the past decades, a lot of image and video compression methods have been proposed and several standards have also been successfully established. To improve the compression efficiency, the traditional image and video compression methods [3, 4, 5, 6] rely on manually designed techniques, such as linear transforms and block based motion estimation and motion compensation schemes.

Image compression methods mainly focus on reducing the spatial redundancy in images. One straightforward method is to convert the images from the pixel domain to the frequency domain, which is easier to compress. For example, JPEG [3] uses the discrete cosine transform, while JPEG2000 [4] employs the discrete wavelet transform. After the transform procedure, the coefficients are quantized and then sent to the decoder side. To further improve the compression efficiency, the quantized coefficients are losslessly encoded by using entropy coding tools, such as arithmetic coding [22]. Recently, the intra prediction technique from video compression has also been exploited for image compression. For example, the BPG [6] format is based on HEVC/H.265 [23] and achieves state-of-the-art image compression performance when compared with the previous image codecs, such as JPEG and JPEG2000. BPG adopts the prediction-transform technique and employs 35 encoding modes to obtain the predicted image, which further reduces the spatial redundancy.

Video compression is used to reduce the temporal redundancy in video sequences. Most video compression algorithms follow the hybrid coding architecture for high compression efficiency. In particular, H.264 [24] is the most widely used video codec. In H.264, block based motion estimation and motion compensation modules are utilized to obtain the predicted frame. Then the residual information is calculated and compressed by using a linear transform. Recently, HEVC/H.265 [23] and versatile video coding (VVC) have been proposed as the next generation video codecs. These standards build upon the previous hybrid coding architecture and utilize more advanced techniques for high efficiency coding. For example, HEVC uses the so-called Coding Unit (CU) Tree technique with the CU size ranging from 8×8 to 64×64, which provides flexible coding units for different video contents.

II-B Learning based Image and Video Compression

In the past few years, deep neural networks (DNNs) have demonstrated their effectiveness for a lot of computer vision tasks, including super-resolution, denoising, etc. Recently, researchers have tried to exploit the powerful representation ability of neural networks to enhance image/video compression performance [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 2, 18]. Toderici et al. [7] proposed the first learning based image compression framework by using a recurrent neural network (RNN). Their approach can generate multiple bitrates through a single model. In [17], more advanced RNN modules and effective reconstruction techniques are introduced to achieve comparable or even better performance than BPG in terms of MS-SSIM [25]. However, these methods [7, 8, 17] are designed to minimize the bitrates instead of considering the rate-distortion trade-off.

In [9], Ballé et al. proposed a CNN based image compression framework by optimizing the rate-distortion criterion. To improve the accuracy of the entropy model, a hyper-prior model is proposed in [10], where the latent representations are modeled based on a zero-mean Gaussian distribution. In [2], Minnen et al. employed auto-regressive priors to further improve the compression and achieved better performance than BPG in terms of PSNR. However, these CNN based image compression systems have to train different models for different bitrates, which increases the model sizes significantly. In [26], Choi et al. proposed a variable rate deep image compression framework by using a conditional autoencoder, which generates different bitrates through a single model.

Considering that the quantization procedure itself is not differentiable, it is non-trivial to optimize the image compression system in an end-to-end manner. In [9], the quantization operation is approximated by adding uniform noise in the training stage. In [11], the gradients of the quantization operation are replaced in the training stage for end-to-end optimization. To further improve the compression efficiency, Rippel et al. [14] used the multi-scale image decomposition technique to exploit the relationship between different scales. Agustsson et al. [16] proposed a generative adversarial network based image compression system, which provides visually pleasing reconstructed images for very low bitrate compression. In addition, Li et al. [13] investigated the spatial relation in the latent representations and computed an importance map to guide the learning based image compression method. Inspired by the intra-prediction technique in traditional video coding, Baig et al. [27] used an inpainting method to obtain the predicted block in the reconstructed frame and encoded the corresponding residual by using a neural network. Although the current state-of-the-art learning based methods [2, 18] achieve better performance than the traditional methods such as BPG [6], the computational complexity increases nearly 100 times when the auto-regressive prior [2, 18] is employed. Therefore, it is critical to build a more efficient image compression framework for practical applications.
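As an illustration, a minimal sketch of the additive uniform noise approximation from [9], with hard rounding at test time, might look as follows; this is a generic sketch rather than the exact implementation of any cited work.

```python
import torch

def quantize(y: torch.Tensor, training: bool = True) -> torch.Tensor:
    # During training, hard rounding is approximated by adding uniform noise
    # in [-0.5, 0.5], which keeps the operation differentiable (as in [9]).
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5)
        return y + noise
    # At test time the latents are actually rounded before entropy coding.
    return torch.round(y)
```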

Recently, learning based video compression has attracted more and more attention. Wu et al. [19] formulated video compression as frame interpolation and applied neural networks to encode the residual information. Lu et al. [1] followed the traditional hybrid coding architecture and employed neural networks to implement the video compression procedure, which can be optimized in an end-to-end manner. Cheng et al. [28] used an interpolation loop in the coding procedure and designed a spatial energy compaction-based penalty term in the loss function for better coding efficiency. In [29], a 3D autoencoder scheme is proposed for video compression without computing the motion information. In [30], the proposed framework decodes the latent representations into motion and blending coefficients, and the residual information is compressed in the latent space instead of the pixel domain.

III Proposed Methods

III-A Overall Architecture for Image Compression

In this section, we introduce the proposed efficient deep image compression framework called EDIC. The architecture of the proposed scheme is illustrated in Fig. 1. Inspired by the recent progress in learning based image compression [9, 10], we also utilize an auto-encoder style network. Specifically, there are four modules in the proposed scheme, i.e., the encoder network, the decoder network, the hyper-encoder network, and the hyper-decoder network. The encoder network takes the original image $x$ as the input and generates the corresponding latent representations $y$ by using several convolutional layers and non-linear functions. The latent representations are quantized to $\hat{y}$. After arithmetic encoding and arithmetic decoding, the quantized latent representations $\hat{y}$ are sent to the decoder network to reconstruct the final decoded image $\hat{x}$. We adopt the same quantization strategy as [10, 2]. Considering that image compression methods aim to achieve a high quality reconstructed image at a given bitrate target, it is critical to build an accurate entropy model. In the proposed framework, we follow the pipeline in [10, 2] and apply the hyper-encoder and hyper-decoder modules to estimate the parameters of the entropy model. Specifically, based on the latent representations $y$, the hyper-encoder module extracts the hyper-prior information and encodes it into the hyper-latent representations $z$. Similarly, $z$ is quantized to $\hat{z}$ and transmitted by arithmetic coding. Finally, the hyper-decoder reconstructs the hyper-prior information from the quantized hyper-latent representations $\hat{z}$ and estimates the corresponding parameters of the entropy model. The entropy model of the hyper-latent representations is the same as in [10, 9]. The network architecture and entropy model of our proposed method will be discussed in the next three sections.
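To summarize how the four modules interact, the sketch below outlines one forward pass; the class and sub-network names are illustrative placeholders, and the concrete strides, GDN/IGDN layers and channel hyper-parameters of Fig. 1 are omitted.

```python
import torch
import torch.nn as nn

class CompressionPipeline(nn.Module):
    """Minimal sketch of the four-module pipeline in Fig. 1; not the exact
    EDIC architecture. The sub-networks are passed in as placeholders."""

    def __init__(self, encoder, decoder, hyper_encoder, hyper_decoder):
        super().__init__()
        self.encoder = encoder              # image x -> latents y
        self.decoder = decoder              # quantized latents -> reconstruction
        self.hyper_encoder = hyper_encoder  # y -> hyper-latents z
        self.hyper_decoder = hyper_decoder  # quantized z -> entropy-model parameters

    def forward(self, x):
        y = self.encoder(x)                          # analysis transform
        z = self.hyper_encoder(y)                    # hyper-prior analysis
        y_hat = torch.round(y)                       # quantization (uniform noise at training time)
        z_hat = torch.round(z)
        entropy_params = self.hyper_decoder(z_hat)   # mixture means / scales / weights
        x_hat = self.decoder(y_hat)                  # synthesis transform
        return x_hat, y_hat, z_hat, entropy_params
```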

The whole learning based image compression framework is optimized by considering the rate-distortion trade-off in the following way:

L = \lambda \cdot D + R = \lambda \cdot d(x, \hat{x}) + H(\hat{y}) + H(\hat{z}) \qquad (1)

where $D$ and $R$ represent the distortion and bitrate, respectively, and $\lambda$ is the trade-off parameter. $d(x, \hat{x})$ is the distortion metric (mean square error or MS-SSIM [25]). $R$ represents the bitrate for encoding the latent representations $\hat{y}$ and $\hat{z}$. In the proposed method, the bitrate is approximated by using the entropy of the corresponding latent representations, i.e., $H(\hat{y})$ and $H(\hat{z})$, where $p_{\hat{y}}$ and $p_{\hat{z}}$ represent the distributions of $\hat{y}$ and $\hat{z}$, respectively.
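As an illustration, the objective in Eq. (1) can be sketched as follows, assuming the entropy model returns per-element likelihoods for the quantized latents $\hat{y}$ and $\hat{z}$; the function name and arguments are illustrative.

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(x, x_hat, y_likelihoods, z_likelihoods, lam):
    # Distortion term D: mean square error here (MS-SSIM can be substituted).
    distortion = F.mse_loss(x_hat, x)
    # Rate term R: estimated entropy of the quantized latents in bits per pixel,
    # computed from the per-element likelihoods given by the entropy model.
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate = (-torch.log2(y_likelihoods).sum() - torch.log2(z_likelihoods).sum()) / num_pixels
    # Eq. (1): lambda * D + R.
    return lam * distortion + rate
```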

III-B Channel Attention Scheme

In [2, 18], the auto-regressive prior model, which captures the spatial relationship in the latent representations, is used to improve the compression performance. Meanwhile, some works have applied spatial attention mechanisms implemented by non-local blocks [31] to image compression [32, 33], which aims to reduce the spatial redundancy. Based on the aforementioned two motivations and inspired by [34], we propose to use a light-weight channel attention technique to exploit the channel relationship in the latent representations $y$ and $z$. The architecture of the proposed attention module is shown in Fig. 2. Let us denote the input feature map as $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width and channel dimension of the feature map, respectively. First, we apply global average pooling to obtain the channel-wise statistics $g \in \mathbb{R}^{C}$, which is formulated below:

g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j) \qquad (2)

where $g_c$ means the $c$-th element of $g$, and $X_c$ represents the $c$-th channel of the input feature map $X$. Then, we apply several non-linear transforms to capture the channel-wise relationship. Specifically, the non-linear transforms are described in the following formula:

s = \sigma\left(W_2 \, \delta(W_1 g)\right) \qquad (3)

where $s$ refers to the output channel-wise attention value, $W_1$ and $W_2$ denote the fully-connected layers, $\delta$ is the ReLU activation function [35] for the non-linear transform, and $\sigma$ represents the sigmoid activation. To reduce the dimension, we set the reduction ratio to 16. Finally, we re-scale the input feature map $X$ with $s$. In addition, we add a residual operation in our implementation.
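Below is a minimal PyTorch-style sketch of this module, following the squeeze-and-excitation design of [34] with a reduction ratio of 16 and the residual connection mentioned above; the class and layer names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module in Fig. 2."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling, Eq. (2)
        self.fc = nn.Sequential(              # two fully-connected layers, Eq. (3)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # channel-wise attention values
        return x + x * s   # re-scaling plus the residual operation mentioned above
```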

Fig. 2: The structure of our channel attention module. “GAP” represents global average pooling. “FC” means fully-connected layer.

As shown in Fig. 1, the proposed channel attention module is integrated into the encoder network and the hyper-encoder network and is utilized to exploit the channel relationship for high quality compression. The re-weighted feature maps are then fed to the following quantization and entropy coding modules.

Fig. 3: The structure of our GMM module. The channel hyper-parameter of each layer is indicated, and the number of output channels depends on the number of Gaussian models. (See Section III-C for more explanations.)

III-C Gaussian Mixture Model for Entropy Estimation


Fig. 4: Bit allocation map of the latent representations $\hat{y}$. The left column is the original image from Kodak [20]. The middle-left column is the bit allocation map of $\hat{y}$ when using a single Gaussian model as the entropy model. The middle-right column is the bit allocation map of $\hat{y}$ when using the Gaussian mixture model as the entropy model. The right column is the bit allocation difference between them. We take “kodim20.png” (top) and “kodim23.png” (bottom) for visualization.

In learning based image compression methods, accurate bitrate estimation is critical. In [2, 18], the learning based systems adopt the hyper-prior compression scheme and the latent representations $\hat{y}$ are modeled with a Gaussian distribution as follows:

p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z}) = \prod_{i}\left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right)(\hat{y}_i) \qquad (4)

where the distribution $p_{\hat{z}}(\hat{z})$ of the hyper-latents is represented by using the factorized entropy model [9]. The goal of the hyper-encoder and hyper-decoder is to estimate the parameters $\mu$ and $\sigma$ of the Gaussian model.

Although the single Gaussian based entropy model has achieved significant improvements when compared with the previous work [9], the representation ability of a single Gaussian model is still limited, especially for complex contents. Therefore, we utilize the Gaussian mixture model to further improve the efficiency of the image compression system. Specifically, the distribution of $\hat{y}$ is formulated as follows:

p_{\hat{y}|\hat{z}}(\hat{y} \mid \hat{z}) = \prod_{i}\left(\sum_{k=1}^{K} w_i^{(k)} \, \mathcal{N}\left(\mu_i^{(k)}, \big(\sigma_i^{(k)}\big)^2\right) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right)(\hat{y}_i) \qquad (5)

where $w_i^{(k)}$ represents the weight of the $k$-th Gaussian model and $K$ is the number of Gaussian models. As shown in Fig. 3, we design three convolutional layers with two LeakyReLU layers to estimate the parameters of the Gaussian mixture model. In our implementation, $K$ is set to 2, and the number of output channels of the GMM module is set accordingly: the first group of channels is used to estimate the means and variances of the two Gaussian models, and a sigmoid layer is applied on the remaining channels to estimate the mixture weights. If the weight of one Gaussian model is $w$, the weight of the other Gaussian model is $1 - w$. If we design more than two Gaussian models, we change the number of output channels of the GMM module accordingly: the first group of channels estimates the mean and variance parameters of the $K$ Gaussian models, and a softmax layer is added after the remaining channels to calculate the weight of each Gaussian model.
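Below is a sketch of how the mixture likelihood in Eq. (5) can be evaluated for the quantized latents, assuming the GMM module has already produced per-element means, scales and normalized weights for the $K$ components; the function name and tensor layout are illustrative.

```python
import torch
from torch.distributions import Normal

def gmm_likelihood(y_hat, means, scales, weights):
    # y_hat: quantized latents of shape (B, C, H, W).
    # means, scales, weights: mixture parameters of shape (B, C, H, W, K),
    # with the weights assumed normalized over the last dimension
    # (sigmoid for K = 2, softmax otherwise).
    y = y_hat.unsqueeze(-1)                           # broadcast against the K components
    component = Normal(means, scales.clamp(min=1e-6))
    # Probability mass of the unit-width quantization bin under each component,
    # i.e. the Gaussian convolved with U(-1/2, 1/2) evaluated at y_hat.
    p = component.cdf(y + 0.5) - component.cdf(y - 0.5)
    likelihood = (weights * p).sum(dim=-1)            # mixture over the K components, Eq. (5)
    return likelihood.clamp(min=1e-9)                 # avoid log(0) in the rate term
```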

We provide further analysis of the GMM module. As shown in Fig. 4, the left part shows the original images from Kodak [20], and the right part shows the difference between the estimated bit allocation maps of the latent representations $\hat{y}$ under the single Gaussian model and the Gaussian mixture model. A brighter region indicates that the Gaussian mixture model saves more bits. From Fig. 4, it is clear that the Gaussian mixture model saves more bits, especially in edge regions.

III-D Decoder-side Enhancement

Since the proposed compression scheme is a lossy procedure, the reconstructed image inevitably contains compression artifacts. To further improve the reconstruction quality, we introduce an enhancement module at the decoder side after image reconstruction. We adopt several residual blocks to restore the original image based on the input reconstructed image. Inspired by the network design strategy for super-resolution [36], we introduce the residual block to learn the high frequency information for image compression. As shown in Fig. 5, we first add a convolution layer to increase the channel dimension from 3 to 32. Then, we apply three enhancement blocks to the output of this convolution layer, where every enhancement block consists of three residual blocks. Finally, we add a convolution layer and a global residual connection to obtain the enhanced reconstructed image. Moreover, the decoder-side enhancement module can be readily integrated into the whole compression system and optimized in an end-to-end manner with high efficiency. As shown in Fig. 6, we provide an analysis of the decoder-side enhancement module. The learned residual image is the output of the final convolution layer. We observe that it mainly contains the high frequency information, which means that the decoder-side enhancement module helps to predict the high frequency components.
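A minimal PyTorch-style sketch of this structure (a 3-to-32 convolution, three enhancement blocks of three residual blocks each, a final convolution and a global residual connection) is given below; the kernel sizes and the internal form of the residual blocks are assumptions, since only the overall topology is specified above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # A plain conv-ReLU-conv residual block; the exact kernel sizes used in
    # the paper's enhancement blocks may differ from this sketch.
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class DecoderSideEnhancement(nn.Module):
    """Sketch of Fig. 5: head convolution, three enhancement blocks of three
    residual blocks each, and a tail convolution with a global residual."""

    def __init__(self, channels: int = 32, num_stages: int = 3, blocks_per_stage: int = 3):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.Sequential(*[
            nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks_per_stage)])
            for _ in range(num_stages)
        ])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x_hat):
        residual = self.tail(self.stages(self.head(x_hat)))
        return x_hat + residual   # the learned residual carries high-frequency detail
```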

Fig. 5: The structure of our decoder-side enhancement module. “RB” refers to the residual block.
Fig. 6: The left image is the reconstructed image after the decoder-side enhancement module, and the right image is the learned residual image. We take “kodim01.png” from Kodak [20] for illustration.

III-E Extension for Video Compression

In order to further demonstrate the effectiveness of our newly proposed method, we also apply our proposed method for the video compression task. In our work, we choose DVC [1] as our baseline algorithm.

The overall framework is shown in Fig. 7. Let $\{x_1, x_2, \ldots, x_T\}$ denote the current video sequence, where $x_t$ refers to the frame at time-step $t$ and $\hat{x}_t$ represents the corresponding reconstructed frame. $v_t$ and $r_t$ are the motion and residual information, respectively. The procedure of our video compression framework is as follows.

Motion Estimation and Compression

We utilize the CNN model proposed in [37] to predict the optical flow, which represents the motion information $v_t$. Instead of encoding the motion information directly, we send $v_t$ to the encoder network of the motion compression module to obtain its latent representation, which is then quantized. The decoder network of the motion compression module reconstructs the motion information $\hat{v}_t$.

Motion Compensation, Residual Compression and Frame Reconstruction

The motion compensation module takes the previously reconstructed frame $\hat{x}_{t-1}$ and the reconstructed motion information $\hat{v}_t$ as the input and obtains the predicted frame $\bar{x}_t$, which is supposed to be as close to the current frame $x_t$ as possible. After that, we use the original frame $x_t$ and $\bar{x}_t$ to obtain the residual information $r_t = x_t - \bar{x}_t$. The encoder network of the residual compression module encodes the residual information $r_t$ and quantizes it to obtain the latent representations. Similarly, the decoder network of the residual compression module reconstructs the residual information $\hat{r}_t$. Then, the final reconstructed frame is obtained as $\hat{x}_t = \bar{x}_t + \hat{r}_t$.
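To make the above pipeline concrete, the following sketch outlines one coding step; the entries of the modules dictionary (optical_flow, motion_codec, motion_compensation, residual_codec) are illustrative placeholders for the corresponding networks rather than the actual implementation.

```python
def code_one_frame(x_t, x_prev_hat, modules):
    # Motion estimation and compression.
    v_t = modules["optical_flow"](x_prev_hat, x_t)        # raw motion (optical flow)
    v_t_hat, motion_bits = modules["motion_codec"](v_t)   # encode / quantize / decode motion

    # Motion compensation to obtain the predicted frame.
    x_bar_t = modules["motion_compensation"](x_prev_hat, v_t_hat)

    # Residual compression and final reconstruction.
    r_t = x_t - x_bar_t
    r_t_hat, residual_bits = modules["residual_codec"](r_t)
    x_t_hat = x_bar_t + r_t_hat
    return x_t_hat, motion_bits + residual_bits
```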

Optimization of the framework

The overall framework is optimized by minimizing the following Rate-Distortion trade-off:

L_t = \lambda \cdot d(x_t, \hat{x}_t) + R_t^{r} + R_t^{m} \qquad (6)

where $L_t$ is the loss at the current time step $t$, $d(x_t, \hat{x}_t)$ is the distortion between the current frame $x_t$ and the reconstructed frame $\hat{x}_t$, and $R_t^{r}$ and $R_t^{m}$ are the bitrates of the latent representations of the residual information and the motion information, respectively, which are estimated by the bitrate estimation module.
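A minimal sketch of how Eq. (6) could be computed per frame is given below, assuming the bitrate estimation module already returns the residual and motion rates in bits per pixel; names are illustrative.

```python
import torch.nn.functional as F

def video_rd_loss(x_t, x_t_hat, residual_bpp, motion_bpp, lam):
    # Eq. (6): per-frame rate-distortion trade-off. The two bitrate terms are
    # assumed to come from the bitrate estimation module in bits per pixel.
    distortion = F.mse_loss(x_t_hat, x_t)   # d(x_t, x_t_hat); MS-SSIM can be used instead
    return lam * distortion + residual_bpp + motion_bpp
```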

DVC utilizes the method proposed by Ballé et al. [10] to compress the residual information and Ballé’s method [9] to compress the motion information. In our work, we propose to use our EDIC image compression framework to compress both the residual and motion information. Specifically, in the encoder networks of the residual compression module and the motion compression module, we utilize the channel attention scheme described in Section III-B to reduce the redundancy of the latent representations of the residual and motion information. In the bitrate estimation module, we introduce the Gaussian mixture model described in Section III-C as the entropy model to estimate the bitrates of the latent representations more accurately, in which the hyper-encoder and hyper-decoder networks are used to estimate the parameters of the Gaussian mixture model. Furthermore, in the decoder networks of the residual compression module and the motion compression module, we add the decoder-side enhancement module described in Section III-D to effectively improve the reconstructed quality of the residual and motion information.

Fig. 7: The framework of our video compression method. The network structures of the residual compression module and the motion compression module are the same as in Fig. 1. The bitrate estimation module is our method for estimating the bitrate of the latent representations. “Q” denotes quantization.

IV Experiments

In this section, we perform extensive experiments to demonstrate the effectiveness of our proposed EDIC framework, which consists of the attention module, the GMM module and the decoder-side enhancement module. For image compression, we adopt high-quality images from Flickr.com and randomly take cropped patches for training. For performance evaluation, we calculate the Rate-Distortion (RD) performance averaged over all images in the Kodak PhotoCD image dataset [20]. For video compression, we use the Vimeo-90k [38] dataset, which has 89,800 video clips with a resolution of 448×256, as our training dataset, and evaluate our model on the HEVC Standard Test Sequences (i.e., Class B, Class C, Class D and Class E) [23], which are widely used for evaluating video compression methods. Our EDIC framework is implemented on the PyTorch [39] platform. All experiments are conducted on an NVIDIA RTX 2080Ti GPU with 11 GB of memory.

IV-A Performance and Implementation Details for Image Compression

For image compression optimized with the MSE loss function as the quality metric, we train our model using different $\lambda$ values (i.e., 256, 512, 1024, 2048, 4096, 8192). In the first stage, we train the model for the highest bitrate point on the Rate-Distortion (RD) curve with $\lambda$ set to 8192. The model is trained on 1 GPU with a batch size of 4. We apply the Adam optimizer [40] for the first 3,000,000 iterations and continue with a reduced learning rate for the remaining 500,000 iterations. For the other bitrates, we adopt the model trained at the high bitrate ($\lambda$ = 8192) as a pre-trained model and fine-tune it with the Adam optimizer for the first 500,000 iterations, again followed by a reduced learning rate for the remaining 500,000 iterations. Other training settings remain the same. When our model is optimized with other quality metrics, such as the MS-SSIM loss function, we adopt the model optimized by the MSE loss function with $\lambda$ of 8192 as our pre-trained model. Then, we change the MSE loss function to the MS-SSIM loss function and fine-tune the pre-trained model with different $\lambda$ values (i.e., 16, 32, 64, 128, 256) for 500,000 iterations. Besides, the two channel hyper-parameters shown in Fig. 1 are set to fixed values for all models.

As shown in Fig. 8, we adopt the peak signal-to-noise ratio (PSNR) as the quality metric. We compare our EDIC method with the well-known image compression standards, including BPG [6], JPEG [3] and JPEG2000 [4], and recent neural network based methods, including Ballé’s work [10], Minnen’s work [2] and Lee’s work [18]. The results of Lee’s work [18] are from their released source code1. The results of Ballé’s work [10] and Minnen’s work [2] are based on our implementation. When compared with the traditional methods, our EDIC surpasses BPG [6], JPEG [3] and JPEG2000 [4] by a large margin. When compared with the existing deep learning based methods, our EDIC achieves significant improvement over Ballé’s work [10]. As far as we know, the method proposed by Minnen et al. [2] has achieved the state-of-the-art performance for image compression. Our method achieves comparable results with Minnen’s work [2] and Lee’s method [18] at low bitrates, and apparent performance improvement over them at high bitrates. In addition, Minnen’s work and Lee’s method are very slow, because their inference strategies are sequential. By contrast, our method can be readily parallelized. As a result, our method is very efficient, which is important for practical application scenarios. Furthermore, the attention module, the GMM module, and the decoder-side enhancement module are all independent modules and can be easily incorporated into other methods. As shown in Fig. 8, when we incorporate our method into Minnen’s work [2], which uses the context model for estimating more accurate entropy parameters, our EDIC method with the context model also achieves over 0.2 dB improvement compared with our EDIC method, which again demonstrates the effectiveness of our proposed schemes.

Fig. 8: Rate-distortion curves of our proposed EDIC method and the competitive methods for image compression when using the PSNR metric. The “Context Model” is from Minnen’s work [2], which must be executed sequentially in the inference stage.

As shown in Fig. 9, we also conduct experiments in terms of the MS-SSIM quality metric. In order to describe the improvement more clearly, we report the MS-SSIM values in decibels. It is clear that our EDIC is better than BPG [6], JPEG [3], JPEG2000 [4], and Ballé’s method [10]. When compared with the state-of-the-art methods, our EDIC is comparable with Minnen’s method [2] and lower than Lee’s work [18] at low bitrates. However, our EDIC is clearly superior to both methods at high bitrates.
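The decibel form referred to above is the conversion commonly used for MS-SSIM curves, and is assumed here:

```latex
\text{MS-SSIM}_{\mathrm{dB}} = -10 \cdot \log_{10}\left(1 - \text{MS-SSIM}\right)
```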

Fig. 9: Rate-distortion curves of our proposed EDIC method and the competitive methods for image compression when using the MS-SSIM metric.
Fig. 10: Rate-distortion curves of our proposed EDIC method and the competitive methods for video compression when using the PSNR metric.
Fig. 11: Effectiveness of each module in our newly proposed framework.
Fig. 12: Rate-distortion curves of our proposed EDIC method and the competitive methods for video compression when using the MS-SSIM metric.
Fig. 13: The results when using different numbers of Gaussian models in our method.

IV-B Performance and Implementation Details for Video Compression

Each video clip in the Vimeo dataset consists of 7 frames. For each frame in a video clip, we use the previous frame as the reference frame and the current frame as the original frame during the training process. The HEVC test dataset contains videos with different resolutions and different contents. We set the two channel hyper-parameters in Fig. 1 to 192 and 288, respectively. In the training process, when using the MSE loss function as the quality metric, we first obtain a pre-trained model by training for 2,000,000 iterations. Then, we apply different $\lambda$ values (i.e., 256, 512, 1024, 2048) to fine-tune this pre-trained model with a reduced learning rate. When optimizing with the MS-SSIM loss function, we fine-tune the model trained at the high bitrate with the MSE loss function for 80,000 iterations. The remaining training strategies are similar to the implementation details of image compression described in Section IV-A.

As shown in Fig. 10 and Fig. 12, we compare our method with the traditional video compression standards, i.e., H.264 [24] and H.265 [41], and the deep learning based method DVC [1]. In terms of the Rate-Distortion (RD) curves with the PSNR quality metric, our method is much better than DVC [1] and H.264 [24], and achieves comparable performance with H.265 [41]. With regard to MS-SSIM, it is clear that our newly proposed method is superior to DVC [1], H.264 [24], and H.265 [41] for almost all the HEVC test classes.

Fig. 14: Visualization of reconstructed sample images of the ground truth, JPEG [3], BPG [6], Ballé’s method [10], Minnen’s method [2] and our proposed EDIC method. We take “kodim23.png” from Kodak [20] for illustration.

IV-C Ablation Study

Effectiveness of Each Module

In order to verify the effectiveness of each proposed module, we perform an ablation study for image compression in this section. For the baseline model, we utilize a single Gaussian model as our entropy model. When implementing the baseline model, we simply remove the GMM module, the attention module and the decoder-side enhancement module (see Fig. 1). In this case, the last convolution layer of the hyper-decoder outputs two groups of channels: the first group is used to estimate the mean parameters and the second group is used to estimate the variance parameters of the single Gaussian model. Then, we add each module to the baseline model, respectively. When we utilize the Gaussian mixture model as our entropy model, we add the GMM module described in Fig. 3 on top of the baseline model to estimate the parameters of the Gaussian mixture model. As shown in Fig. 11, we compare the performance of the baseline model, the baseline model with the additional attention module, the baseline model with the additional decoder-side enhancement module, the baseline model with the additional GMM module, our proposed EDIC method without the decoder-side enhancement module, and our overall EDIC method consisting of the GMM module, the attention module and the decoder-side enhancement module. For all experiments, we use the same training strategy described in Section IV-A. As shown in Fig. 11, we observe that each module brings significant performance improvement when compared to our baseline model. For the attention module, the baseline model with the attention module is about 0.2 dB better than the baseline model. The baseline model with the GMM module is also superior to the baseline model, which demonstrates the effectiveness of the Gaussian mixture model. Furthermore, when we add the decoder-side enhancement module to any of these models, we achieve better performance.

Results when using different numbers of Gaussian Models

We also conduct experiments to report the results when using different numbers of Gaussian models for image compression. By default, we adopt two Gaussian models in our implementation of the Gaussian mixture model. When we use three Gaussian models, we simply change the number of output channels in the last layer of the GMM module (see Fig. 3) accordingly: the first group of channels estimates the mean and variance parameters, while the last group of channels estimates the weight of each Gaussian model. In order to make the sum of the weights equal to 1, we add a softmax layer on the output of the last group of channels. As shown in Fig. 13, we observe that our method using three Gaussian models achieves similar performance to that using two Gaussian models, which demonstrates that the performance of our approach cannot be improved significantly by increasing the number of Gaussian models.


Fig. 15: Comparison between the bit allocation map of the latent representations (the second row) and the original image (the first row). We take “kodim04.png” and “kodim19.png” from Kodak [20] for illustration.

IV-D Visualization

In order to demonstrate the effectiveness of our EDIC more clearly, we provide some visualization results. As shown in Fig. 15, we visualize the bit allocation map of the latent representations $\hat{y}$. A brighter region means more bits are allocated. In smooth regions, our proposed EDIC method allocates only a few bits, while more bits are needed in edge regions, which means that our neural network learns to allocate bits according to different types of regions automatically. Furthermore, we compare the reconstructed sample images of our proposed EDIC method and other competitive methods in Fig. 14. The reconstructed image of our method achieves higher quality in terms of both the PSNR metric and visual quality when the compression ratios of all methods are close.
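For completeness, a sketch of how such a bit allocation map can be derived from the entropy model is shown below: the estimated bits of each latent element are the negative base-2 logarithm of its likelihood, summed over the channel dimension to obtain one value per spatial location. The function name is illustrative.

```python
import torch

def bit_allocation_map(y_likelihoods):
    # y_likelihoods: per-element likelihoods of the quantized latents y_hat,
    # shape (B, C, H, W), as produced by the entropy model.
    bits = -torch.log2(y_likelihoods.clamp(min=1e-9))  # estimated bits per latent element
    return bits.sum(dim=1)                             # sum over channels -> one (H, W) map per image
```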

V Conclusion

In this paper, we have proposed a unified framework EDIC to boost image compression performance while keeping fast inference speed for practical scenarios. We first adopt a light-weight channel-wise attention mechanism to reduce channel-wise redundancy of the latent representations. Moreover, we propose to use the Gaussian mixture model to estimate the bitrate more accurately, which has been shown to be very useful for edge regions. Finally, we introduce a simple decoder-side enhancement module to further improve image compression performance. Our framework can be trained in an end-to-end fashion and readily used for video compression. Experimental results have demonstrated the superiority of our proposed EDIC method for image and video compression over the existing state-of-the-art methods.

Footnotes

  1. https://github.com/JooyoungLeeETRI/CA_Entropy_Model

References

  1. G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,CVPR, 2019, pp. 11 006–11 015.
  2. D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10 771–10 780.
  3. G. K. Wallace, “The jpeg still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
  4. A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001.
  5. “Webp.” https://developers.google.com/speed/webp/, accessed: 2018-10-30.
  6. “F. bellard, bpg image format.” http://bellard.org/bpg/, accessed: 2018-10-30.
  7. G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in 4th International Conference on Learning Representations, ICLR, 2016.
  8. G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks.” in CVPR, 2017, pp. 5435–5443.
  9. J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in 5th International Conference on Learning Representations, ICLR, 2017.
  10. J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in 6th International Conference on Learning Representations, ICLR, 2018.
  11. L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in 5th International Conference on Learning Representations, ICLR, 2017.
  12. E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in NIPS, 2017, pp. 1141–1151.
  13. M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in CVPR, June 2018.
  14. O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in ICML, 2017.
  15. F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in CVPR, no. 2, 2018, p. 3.
  16. E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” arXiv preprint arXiv:1804.02958, 2018.
  17. N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in CVPR, June 2018.
  18. J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropy model for end-to-end optimized image compression,” arXiv preprint arXiv:1809.10452, 2018.
  19. C.-Y. Wu, N. Singhal, and P. Krahenbuhl, “Video compression through image interpolation,” in ECCV, September 2018.
  20. “E. kodak, kodak lossless true color image suite (photocd pcd0992). [online].” http://r0k.us/graphics/kodak/.
  21. J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” arXiv preprint arXiv:1511.06281, 2015.
  22. I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
  23. G. J. Sullivan, J.-R. Ohm, W.-J. Han, T. Wiegand et al., “Overview of the high efficiency video coding (HEVC) standard,” TCSVT, vol. 22, no. 12, pp. 1649–1668, 2012.
  24. “x264, the best h.264/avc encoder.” https://www.videolan.org/developers/x264.html, accessed: 2018-10-30.
  25. Z. Wang, E. Simoncelli, A. Bovik et al., “Multi-scale structural similarity for image quality assessment,” in ASILOMAR CONFERENCE ON SIGNALS SYSTEMS AND COMPUTERS, vol. 2.   IEEE; 1998, 2003, pp. 1398–1402.
  26. Y. Choi, M. El-Khamy, and J. Lee, “Variable rate deep image compression with a conditional autoencoder,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3146–3154.
  27. M. H. Baig, V. Koltun, and L. Torresani, “Learning to inpaint for image compression,” in NIPS, 2017, pp. 1246–1255.
  28. Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019, pp. 10 071–10 080.
  29. A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen, “Video compression with rate-distortion autoencoders,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7033–7042.
  30. A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural inter-frame compression for video coding,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6421–6429.
  31. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  32. T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural image compression via non-local attention optimization and improved context modeling,” arXiv preprint arXiv:1910.06244, 2019.
  33. H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, “Non-local attention optimized deep image compression,” arXiv preprint arXiv:1904.09757, 2019.
  34. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  35. V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  36. B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
  37. A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  38. T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, IJCV, vol. 127, no. 8, pp. 1106–1125, 2019.
  39. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32.   Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  40. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  41. “x265 hevc encoder / h.265 video codec.” http://x265.org, accessed: 2018-10-30.