Block-optimized Variable Bit Rate Neural Image Compression
Abstract
In this work, we propose an end-to-end block-based autoencoder system for image compression. We introduce several novel contributions to neural-network-based image compression: binarization simulation, variable bit rates via multiple networks, entropy-friendly representations, inference-stage code optimization, and performance-improving normalization layers in the autoencoder. We evaluate each contribution and show its incremental performance gain.
1 Introduction
Image compression has traditionally been addressed by transform-based methods such as JPEG [14] and BPG [11]. Recently, neural-network-based approaches have also been utilized, either as hybrid approaches, where neural networks are used together with a traditional codec, or as end-to-end learned approaches, where the codec consists solely of neural networks.
Regarding the hybrid approach, several works use neural networks as post-processing filters ([5], [6]) to enhance the decoded image. In [8] and [15], both pre-processing and post-processing neural networks are used; however, due to the non-differentiable traditional codec, end-to-end training cannot be achieved. [15] proposes alternate training to overcome this issue: in the first stage, the pre-processing network is trained via a differentiable virtual codec; in the second stage, the real codec is used and only the post-processing network is trained.
Regarding the end-to-end learned approach, a typical architecture consists of an autoencoder (see [9], [12]), where the encoder maps the input image to a low-dimensional tensor and the decoder reconstructs the image.
The encoder's output, which typically consists of floating-point values, needs to be quantized in order to achieve reasonable compression rates. However, the quantization operation provides zero gradients almost everywhere. To approximate quantization during training, [4] proposes to add a random sample from a uniform distribution, while [13] uses a random mapping of floating-point values to binary values with a probability derived from the floating-point value.
Training autoencoders for data compression needs to account for both decoding quality and compression efficiency. A straightforward training loss for decoding quality is the mean squared error (MSE) between the input and output of the autoencoder. Minimizing the MSE maximizes the peak signal-to-noise ratio (PSNR), a widely used evaluation metric in data compression. However, a model trained with the MSE loss tends to produce blurred decoded images. Alternative losses include variational losses [7], adversarial losses [2] and structural similarity losses [12].
Regarding compression efficiency, [9] proposes an adaptive code-length regularization term that encourages structure in the code, so that the arithmetic coder can exploit it to adapt the final code length to the complexity of the input. In [4] and [12], the authors optimize for rate-distortion performance, where the rate is represented by the entropy.
Other neural network architectures used for image compression include recurrent models, as in [13].
In this paper, we propose a system for block-based image compression using autoencoders. In particular, our contributions are:

Using multiple networks for variable bit rate, with inference-stage code optimization.

Using an L2-normalization layer as the first layer of the decoder, which improves training and inference performance.

An entropy-friendly loss designed for block-based neural autoencoders.

Fine-tuning each network on a separate subset of blocks, according to the blocks' encoding difficulty.

Interval-preserving binarization noise, which ensures that the noisy signal lies in a certain interval so as to provide consistent input to the decoder during training.
2 Method
In this section, we describe our end-to-end image compression method. It is based on fully-convolutional deep autoencoders and is applied on 32x32 blocks from the image.
2.1 Network Description
Autoencoder Network: The encoder part contains five consecutive convolutional blocks. Each block consists of a convolutional layer with stride 2 followed by a parametric rectified linear unit (PReLU). These five blocks are followed by a 1x1 convolutional layer and a sigmoid activation. The output of this layer is the compressed signal and will be referred to as the block-code from now on. The block-codes are 1-dimensional, as the input to the network is of size 32x32 and there are five downsampling convolutions.
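As a minimal sketch (not the actual implementation), the shape arithmetic behind the 1-dimensional block-codes can be traced as follows; the code length 216 is one of the three values given in Section 3.1:

```python
# Sketch: a 32x32 input passed through five stride-2 convolutions (with
# padding chosen to halve the size exactly) collapses to a 1x1 spatial grid,
# so the code is a plain vector whose length equals the number of kernels
# in the final 1x1 convolution.
def encoder_output_shape(input_size=32, num_downsamples=5, code_channels=216):
    h = input_size
    for _ in range(num_downsamples):
        h = h // 2  # each stride-2 convolution halves the spatial size
    return (code_channels, h, h)

print(encoder_output_shape())  # (216, 1, 1): a 216-dimensional block-code
```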
The first layer of the decoder is an L2-normalization layer. It has been shown that mapping autoencoder representations to the hypersphere surface improves clustering [3]. We also find L2-normalization beneficial in this work; we provide more details about its benefits in the experimental results. The normalization layer is followed by five consecutive deconvolutional blocks, each upsampling by a factor of two. Each deconvolutional block consists of a deconvolutional layer followed by a PReLU. The five deconvolutional blocks are followed by a final 1x1 convolutional layer with a sigmoid activation.
Multiple Networks: The block-code length (the number of entries in the code vector) of the above network is fixed and equal to the number of convolutional kernels in the last layer of the encoder. A fixed code length for all blocks can be suboptimal, as blocks may differ in content complexity and thus in compression difficulty. To allow for variable bit rate encoding (beyond entropy coding), we use three separate networks with different code lengths. We encode/decode each block with the network that provides the smallest bit rate for a target PSNR value.
Deblocking Network: Due to the block-based compression, the decoded image contains blocking artefacts. To suppress these artefacts, we employ a fully-convolutional deblocking filter that operates over the entire image. The network's structure is similar to U-Net [10].
2.2 Inference
During encoding, the image is first divided into 32x32 blocks in raster-scan order. Each block is encoded by the lowest bit rate neural network (out of three) that satisfies a target PSNR. The output of the encoder is binarized.
We optimize each block-code by optimizing the encoder per block: we keep the weights of the decoder frozen and fine-tune the encoder on a single block. To this end, we set a target PSNR and start optimizing the encoder of the lowest bit rate neural network. If this network cannot achieve the target PSNR, we move on to the next higher bit rate neural network and optimize its encoder. This process continues until the target PSNR is achieved, and the corresponding block-code is selected as the final one.
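The escalation logic above can be sketched as follows; the `codecs` callables are hypothetical stand-ins for "fine-tune this codec's encoder on the block, then measure PSNR", assumed sorted from lowest to highest bit rate:

```python
# Sketch of the per-block network-selection loop: try each codec in order of
# increasing bit rate and keep the first one that reaches the target PSNR.
def select_and_optimize(block, codecs, target_psnr):
    for idx, codec in enumerate(codecs):
        code, psnr = codec(block)      # optimize this codec's encoder, evaluate
        if psnr >= target_psnr:
            return idx, code           # idx becomes the two-bit indicator
    return len(codecs) - 1, code       # fall back to the highest bit rate codec
```

For example, with stub codecs `[lambda b: ("short", 26.0), lambda b: ("long", 29.0)]` and a target of 28.0 dB, the second (higher bit rate) codec is selected.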
We use a two-bit indicator signal for each block, indicating which neural network was used to encode that block. All the indicator signals are concatenated into one long indicator vector for the entire image. The indicator vector is entropy-coded for further bit rate reduction. Similarly, the block-codes are concatenated into a long image-code, which is first difference-coded and then entropy-coded. In the end, each image is encoded into three vectors: 1) the entropy-coded image-code, 2) the entropy-coded indicator vector, and 3) the shape of the original image.
During decoding, the entropy-coded vectors are decoded first. Then, the next two bits of the indicator vector are read; from the indicator, the decoder knows which of the three decoder networks is needed for the current block and therefore the encoding dimension. The next bit sequence of the same length as this encoding dimension is read from the image-code and decoded by the selected neural decoder. We repeat this procedure for all blocks. Next, we combine all blocks to reconstruct the entire image, using the read shape information. Finally, the reconstructed image is passed through the deblocking network as a post-processing step.
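The block-wise bitstream parsing (after entropy decoding) can be sketched as follows; the code lengths 64/216/368 follow Section 3.1, and bits are represented as '0'/'1' strings purely for clarity:

```python
# Sketch of the decode-side parsing loop: each two-bit indicator selects one
# of the three decoder networks, and therefore how many bits to read from the
# image-code for the current block.
def parse_blocks(indicator_bits, image_code_bits, code_lengths=(64, 216, 368)):
    blocks, pos = [], 0
    for i in range(0, len(indicator_bits), 2):
        idx = int(indicator_bits[i:i + 2], 2)   # which network encoded the block
        n = code_lengths[idx]
        blocks.append((idx, image_code_bits[pos:pos + n]))
        pos += n
    return blocks
```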
2.3 Training
Binarization Simulation: The block-codes consist of floating-point numbers in the interval [0,1], which need to be binarized in order to achieve a reasonable compression rate. Yet, the binarization operation is non-differentiable and cannot be used as is for training the autoencoder end-to-end. Therefore, during training, we simulate the binarization by adding a noise $n$ to each value $c$ in the block-code. The noise is random with a uniform distribution within the interval $[-|c - \lfloor c \rceil|, |c - \lfloor c \rceil|]$, where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer and $|\cdot|$ is the absolute value operator. The noise is selected such that the resulting value $c + n$ remains in the interval $[0,1]$. This allows providing a consistent input to the decoder both when we use this approximation and when we use the real binarization.
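A minimal sketch of this interval-preserving noise, assuming the interval reconstruction above:

```python
import random

# The noise magnitude is bounded by the value's distance to its nearest
# integer, so the noisy value c + n always stays in [0, 1] for c in [0, 1].
def noisy_binarize(c):
    d = abs(c - round(c))        # distance to the nearest integer (0 or 1)
    n = random.uniform(-d, d)
    return c + n
```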
Entropy-Friendly Loss: We concatenate the 1-dimensional block-codes from the image, and the resulting image-code is entropy-coded to achieve higher compression rates. To make the image-code more suitable for entropy coding, we propose the loss in Eq. 1.
$$L_e = \sum_{i=0}^{N} \left| \hat{c}_{i+1} - \hat{c}_i \right| \quad (1)$$

In Eq. 1, $\hat{c}$ is obtained by padding the code $c$ with a 0 on each side, where $c$ corresponds to a block-code and $N$ is the number of elements in $c$. This padding is beneficial since, in the end, we concatenate all the block-codes; enforcing both the beginning and the end of each block-code to be close to zero helps achieve a smooth image-code after concatenation.
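A minimal sketch of this loss, assuming the zero-padded total-variation form of Eq. 1:

```python
# Entropy-friendly loss: summed absolute differences of the zero-padded
# block-code. The zero padding penalizes non-zero values at both ends of
# each block-code, which keeps the concatenated image-code smooth.
def entropy_loss(code):
    padded = [0.0] + list(code) + [0.0]
    return sum(abs(padded[i + 1] - padded[i]) for i in range(len(padded) - 1))
```

For instance, an all-zero code incurs zero loss, while a single value of 1.0 is penalized twice (once at each padded boundary).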
Training Process: During training, we combine the MSE-based reconstruction loss $L_{MSE}$ and the entropy loss $L_e$ with a regularization parameter $\lambda$ as follows: $L = L_{MSE} + \lambda L_e$.
Although we simulate the binarization via additive random noise as discussed above, we found it beneficial to further utilize alternate training. In each epoch, we first train the autoencoder end-to-end with the binarization simulation over the entire training dataset. Next, we freeze the encoder part, perform actual binarization on the codes, and train only the decoder over the entire training dataset.
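The two phases of an epoch can be sketched as follows; `train_end_to_end` and `train_decoder_only` are hypothetical trainer callables standing in for the actual optimization steps:

```python
# Sketch of one alternate-training epoch: phase 1 trains encoder+decoder with
# the noise-based binarization simulation; phase 2 freezes the encoder,
# binarizes the codes for real, and updates only the decoder.
def alternate_epoch(train_end_to_end, train_decoder_only, dataset):
    for batch in dataset:      # phase 1: full autoencoder, simulated binarization
        train_end_to_end(batch)
    for batch in dataset:      # phase 2: decoder only, real binarization
        train_decoder_only(batch)
```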
Since each of the three neural networks is used for a different encoding difficulty level, the above training can be suboptimal. For example, training the lowest bit rate neural network with all blocks (including the hardest ones) would be inconsistent with the inference stage, where that network would never be used on hard-to-encode blocks. To make each network an expert on its targeted blocks, we fine-tune each network with the blocks for which that network satisfies the target PSNR.
To make the decoder even more suitable for binarized codes, we keep the encoder frozen, use real binarization, and fine-tune each expert decoder on its own training blocks.
The training of the deblocking network is performed separately: the input images are the images reconstructed via the inference stage (without the deblocking part), and the ground truth images are the originals.
3 Experimental Results
In this section, we quantitatively evaluate our method on the CLIC image compression dataset [1]. As the evaluation metric we use the peak signal-to-noise ratio (PSNR). We calculate a single mean squared error (MSE) over the entire image dataset and compute the corresponding PSNR according to Eq. 2. Note that the MSE is calculated on RGB images.
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right) \quad (2)$$
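A minimal helper for this metric, assuming 8-bit RGB images (peak value 255):

```python
import math

# PSNR from a dataset-level MSE, as in Eq. 2.
def psnr_from_mse(mse, peak=255.0):
    return 10.0 * math.log10(peak * peak / mse)
```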
3.1 Implementation Details
All the convolutional and deconvolutional layers have 3x3 kernels. The numbers of filters in the first five encoder layers are 64, 128, 256, 512 and 1024. The number of filters in the last layer of the encoder differs between the three neural networks: 64, 216 and 368; these determine the number of values output by each encoder. The decoder follows the encoder's filter sizes in reverse order. The final layer has 3 filters to convert back to RGB space. The training is performed on 32x32 blocks extracted from the training dataset images with half-overlapping blocks; we refer to this half-overlapping variant as data augmentation training. The regularization parameter for the entropy coding loss is selected as . All neural networks were trained using a batch size of 256 and the Adam optimizer with learning rate . We have two variants: NT-codec, where we apply the deblocking filter on blocks of size 255x255 for memory efficiency, and NT-codec-Full, where we apply the deblocking filter on the entire image.
Table 1: Validation PSNR with individual components removed (code length 216).
     | No Noise Sim. | No L2-Norm. | No Data Aug. | No Alt. Train. | No Entropy Loss | Batch-Norm | Full Model
PSNR | 23.882        | 26.778      | 26.946       | 25.751         | 27.258          | 26.882     | 27.055
Table 2: Validation PSNR for each incremental contribution.
     | Single NN | Multiple NN | Expert NN | Decoder Fine-Tune | Deblocking | Code Optimize
PSNR | 27.055    | 27.691      | 27.779    | 27.792            | 28.088     | 28.929
3.2 Effect of Each Contribution
First, we investigate the effect of each contribution through controlled experiments. In particular, we investigate the effects of noise simulation, L2-normalization, data augmentation, alternate training and the entropy loss. In each of these experiments, one of the above components was removed and all others were kept fixed. We also compare our full model with a standard architecture in which each convolution and deconvolution layer is followed by a batch-normalization layer; in this standard model, we remove the introduced L2-normalization layer but keep all other components the same. We conducted these experiments only on the 216-bit neural network. We report the obtained validation PSNR in Table 1.
Noise simulation and alternate training have significant effects, as they are crucial for approximating the binarization process. The network with L2-normalization results in a compression rate of 0.151 bits per pixel (bpp), whereas the one without achieves 0.134 bpp. The network without L2-normalization can reach the same performance (both in bpp and PSNR) as the network with L2-normalization, but only at an encoding dimension of 236. As the encoding dimension increases, training takes longer and the network size grows; therefore, L2-normalization has a positive effect on training speed and final network size. Data augmentation has a very minor effect on performance, due to the already large number of training blocks and the correlation between the augmented blocks. The model with batch-normalization behaves similarly to the no-normalization network in terms of bit rate and PSNR, i.e., L2-normalization achieves similar performance with faster training and fewer network parameters. Finally, since the entropy loss acts as a regularizer, it reduces the final PSNR value. However, the validation-set bit rate (after entropy coding) with the entropy loss is 0.151 bpp, whereas without it the bit rate is 0.216 bpp; the large compression rate improvement dominates the slight PSNR decrease.
Next, we investigate the effect of multiple networks, expert neural network training, final decoder fine-tuning, code optimization and deblocking post-processing. The experiments are applied incrementally in the above order. We report the validation PSNRs for each incremental training in Table 2. As we observe, using multiple neural networks provides a decent performance increase, whereas the expert training and decoder fine-tuning have only minor incremental effects. The deblocking post-processing was aimed at visually better quality images, yet we observe that it also increases the PSNR. Finally, block-wise code optimization greatly improves the performance and achieves a decent PSNR. We note that the average bit rate of the final model with code optimization is 0.149 bpp, which is below that of our single-network baseline with no additional processing (0.151 bpp).
Test-set results: Table 3 reports the PSNR and bit rates on the test set for our method and two traditional codecs (JPEG and BPG).
Table 3: Test-set PSNR and bit rate for JPEG, BPG and our method.
     | JPEG   | BPG    | OURS
PSNR | 25.612 | 29.587 | 27.920
bpp  | 0.149  | 0.148  | 0.148
4 Conclusion
We have proposed an end-to-end block-based autoencoder system for learned image compression. We have evaluated each building block of our method and shown that each contributes to the performance to some degree. Our novel contributions, namely L2-normalization, the concatenation-friendly entropy loss, expert neural network fine-tuning and code optimization, greatly contribute to our final performance.
References
 [1] Workshop and challenge on learned image compression (CLIC). http://www.compression.cc/challenge/. Accessed: 2018-04-26.
 [2] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool. Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958, 2018.
 [3] Ç. Aytekin, X. Ni, F. Cricri, and E. Aksu. Clustering and unsupervised anomaly detection with L2 normalized deep autoencoder representations. CoRR, abs/1802.00187, 2018.
 [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In ICLR, 2017.
 [5] L. Cavigelli, P. Hager, and L. Benini. CAS-CNN: A deep convolutional neural network for image compression artifact suppression. In International Joint Conference on Neural Networks (IJCNN), 2017.
 [6] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In International Conference on Computer Vision (ICCV), 2015.
 [7] K. Gregor, F. Besse, D. Jimenez Rezende, I. Danihelka, and D. Wierstra. Towards conceptual compression. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3549–3557. Curran Associates, Inc., 2016.
 [8] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao. An end-to-end compression framework based on convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
 [9] O. Rippel and L. Bourdev. Real-time adaptive image compression. In International Conference on Machine Learning, 2017.
 [10] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.
 [11] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, Dec 2012.
 [12] L. Theis, W. Shi, A. Cunningham, and F. Huszár. Lossy image compression with compressive autoencoders. In International Conference on Learning Representations, 2017.
 [13] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [14] G. K. Wallace. The JPEG still picture compression standard. Communications of the ACM, pages 30–44, 1991.
 [15] L. Zhao, H. Bai, A. Wang, and Y. Zhao. Learning a virtual codec based on deep convolutional neural network to compress image. CoRR, abs/1712.05969, 2017.