# Variable Rate Deep Image Compression with Modulated Autoencoder

## Abstract

Variable rate is a requirement for flexible and adaptable image and video compression. However, deep image compression methods are typically optimized for a single fixed rate-distortion tradeoff. While this can be addressed by training multiple models for different tradeoffs, the memory requirements increase proportionally to the number of models. Scaling the bottleneck representation of a shared autoencoder can provide variable rate compression with a single model. However, the R-D performance of this simple mechanism degrades at low bitrates, and the effective range of bitrates shrinks. Addressing these limitations, we formulate the problem of variable rate-distortion optimization for deep image compression, and propose modulated autoencoders (MAEs), where the representations of a shared autoencoder are adapted to the specific rate-distortion tradeoff via a modulation network. Jointly training the shared autoencoder and the modulation network provides an effective way to navigate the R-D operational curve. Our experiments show that the proposed method can achieve almost the same R-D performance as independent models, with significantly fewer parameters.

## I Introduction

Image compression is a fundamental and well-studied problem in image processing and computer vision [8, 20, 16]. The goal is to design binary representations (i.e. bitstreams) with minimal entropy [15] that minimize the number of bits required to represent an image (i.e. rate) at a given level of fidelity (i.e. distortion) [6]. In many communication scenarios the network or storage device imposes a constraint on the maximum bitrate, which requires the image encoder to adapt to a given bitrate budget. In other scenarios that constraint may even change dynamically over time (e.g. video). In all these cases, a rate control mechanism is required, and it is available in most traditional image and video compression codecs. In general, reducing the rate increases the distortion (i.e. the rate-distortion tradeoff). This mechanism is typically based on scaling the latent representation prior to quantization to obtain finer or coarser quantization, and then inverting the scaling at the decoder side (see Fig. 1a).

Recent studies show that deep image compression (DIC) achieves comparable or even better results than classical image compression techniques [18, 2, 5, 19, 17, 7, 10, 11, 12, 9]. In this paradigm, the parameters of the encoder and decoder are learned from image data by jointly minimizing rate and distortion at a particular rate-distortion tradeoff (instead of being engineered by experts). However, variable bitrate requires an independent model for every R-D tradeoff. This is an obvious limitation, since each model must be stored separately, resulting in large memory requirements.

Addressing this limitation, Theis et al. [17] use a single autoencoder whose bottleneck representation is scaled before quantization depending on the target rate (see Fig. 1b). However, this approach only adjusts the relative importance of the different channels of the bottleneck representation of the learned autoencoder under the R-D tradeoff constraint. In addition, the autoencoder is optimized for a single specific R-D tradeoff (typically a high rate). These two aspects lead to a drop in performance at low rates and a narrow effective range of bitrates.

Addressing the limitations of multiple independent models and bottleneck scaling, we formulate the problem of variable rate-distortion optimization for DIC, and propose the modulated autoencoder (MAE) framework, where the representations of a shared autoencoder at different layers are adapted to a specific rate-distortion tradeoff via a modulating network. The modulating network is conditioned on the target R-D tradeoff, which is synchronized with the tradeoff used in the loss that learns the parameters of both the autoencoder and the modulating network. MAEs can achieve almost the same operational R-D points as independent models with far fewer overall parameters (i.e. just the shared autoencoder plus the small overhead of the modulating network). Multi-layer modulation does not suffer from the main limitations of bottleneck scaling, namely the drop in performance at low rates and the shrinkage of the effective range of rates.

## II Background

Almost all lossy image and video compression approaches follow the transform coding paradigm [4]. The basic structure is a transform $z = f(x)$ that takes an input image $x$ and obtains a transformed representation $z$, followed by a quantizer $q = Q(z)$, where $q$ is a discrete-valued vector. The decoder reverses the quantization (i.e. dequantizer $\hat{z} = Q^{-1}(q)$) and the transform (i.e. inverse transform), reconstructing the output image $\hat{x} = g(\hat{z})$. Before transmission (or storage), the discrete-valued vector $q$ is binarized and serialized into a bitstream $b$. Entropy coding [21] is used to exploit the statistical redundancy in that bitstream and reduce its length.

In deep image compression [17, 18, 2], the handcrafted analysis and synthesis transforms are replaced by the encoder $f(x; \theta)$ and decoder $g(\hat{z}; \phi)$ of a convolutional autoencoder, parametrized by $\theta$ and $\phi$. The fundamental difference is that the transforms are not designed but learned from training data.

The model is typically trained end-to-end by solving the following optimization problem

$$\theta^*, \phi^* = \arg\min_{\theta, \phi}\; R(b) + \lambda D(x, \hat{x}) \qquad (1)$$

where $R(b)$ measures the rate of the bitstream $b$, $D(x, \hat{x})$ represents a distortion metric between $x$ and $\hat{x}$, and the Lagrange multiplier $\lambda$ controls the tradeoff between rate and distortion, i.e. the R-D tradeoff. Note that $\lambda$ is a fixed constant in this case. The problem is solved using gradient descent and backpropagation [14].

To make the model differentiable, which is required to apply backpropagation, during training the quantizer is replaced by a differentiable proxy function [17, 18, 2]. Similarly, entropy coding is lossless and thus invertible, but during training the length of the bitstream $b$ must be computed. This length is usually approximated by the entropy of the distribution of the quantized vector, $R(b) \approx H\left[p_q\right]$, which is a lower bound of the actual bitstream length.

In this paper, we will use scalar quantization by (element-wise) rounding to the nearest integer, i.e. $q = \lfloor z \rceil$, which during training is replaced by additive uniform noise as a proxy [2], i.e. $\tilde{q} = z + \Delta z$, with $\Delta z \sim \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$. There is no dequantization in the decoder, and the reconstructed representation is simply $\hat{z} = \tilde{q}$. To estimate the entropy we will use the entropy model described in [2] to approximate $p_q(q)$ by $p_{\tilde{q}}(\tilde{q})$. Finally, we will use mean squared error (MSE) as the distortion metric. With these particular choices, (1) becomes

$$\theta^*, \phi^* = \arg\min_{\theta, \phi}\; R(\tilde{q}) + \lambda D(x, \hat{x}) \qquad (2)$$

with

$$R(\tilde{q}) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}}\left[ -\log_2 p_{\tilde{q}}(\tilde{q}) \right] \qquad (3)$$

$$D(x, \hat{x}) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}}\left[ \lVert x - \hat{x} \rVert^2 \right] \qquad (4)$$
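As a toy illustration of these choices, the rounding quantizer and its additive-uniform-noise training proxy can be sketched in NumPy (hypothetical tensor shapes; this is not the actual training code and does not include the entropy model of [2]):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z):
    # Test time: element-wise rounding to the nearest integer.
    return np.round(z)

def noisy_proxy(z, rng):
    # Training time: differentiable proxy, additive uniform noise in [-0.5, 0.5).
    return z + rng.uniform(-0.5, 0.5, size=z.shape)

z = rng.normal(0.0, 4.0, size=(2, 8, 8))   # toy bottleneck feature map
q = quantize(z)
q_tilde = noisy_proxy(z, rng)

# Both the quantized value and the noisy proxy stay within half a bin of z.
assert np.all(np.abs(q - z) <= 0.5)
assert np.all(np.abs(q_tilde - z) <= 0.5)
```

The noise proxy has the same per-element error range as rounding, which is what makes it a reasonable differentiable surrogate during training.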

## III Multi-rate deep image compression with modulated autoencoders

### III-A Problem definition

We are interested in deep image compression models that can operate satisfactorily on different R-D tradeoffs, and adapt to the specific R-D tradeoff when required. Note that (2) optimizes rate and distortion for a fixed tradeoff $\lambda$. We extend that formulation to multiple R-D tradeoffs (i.e. $\lambda \in \Lambda$, a discrete set of tradeoffs) as the multi-rate-distortion problem

$$\theta^*, \phi^* = \arg\min_{\theta, \phi}\; \sum_{\lambda \in \Lambda} \left[ R(\tilde{q}; \lambda) + \lambda D(x, \hat{x}; \lambda) \right] \qquad (5)$$

with

$$R(\tilde{q}; \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}}\left[ -\log_2 p_{\tilde{q}}(\tilde{q}; \lambda) \right] \qquad (6)$$

$$D(x, \hat{x}; \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}}\left[ \lVert x - \hat{x} \rVert^2 \right] \qquad (7)$$

Here we omitted the explicit dependency on $\lambda$ of the features, i.e. $z(\lambda)$ and $\tilde{q}(\lambda)$, and (implicitly) of the encoder and decoder. In the following we may also omit this explicit dependency for conciseness. Note also that this formulation can be easily extended to a continuous range of tradeoffs. Note also that these optimization problems assume that all R-D operational points are equally important. It would be possible to integrate an importance function to give more weight to certain R-D operational points if the application requires it. We assume uniform importance (continuous or discrete) for simplicity.
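A minimal sketch of this multi-rate-distortion objective, with hypothetical stand-ins for the per-tradeoff rate and distortion terms (a real implementation would run the encoder and decoder conditioned on each $\lambda$ and use the entropy model of (6)):

```python
import numpy as np

# Hypothetical stand-ins for R(q~; lambda) and D(x, x^; lambda):
# larger lambda -> finer quantization -> higher rate, lower distortion.
def rate(x, lam):
    return float(np.mean(np.abs(x)) * lam)

def distortion(x, lam):
    return float(np.mean(x ** 2) / lam)

def multi_rd_loss(x, tradeoffs):
    # Eq. (5): sum of per-tradeoff R-D objectives over the discrete set Lambda.
    return sum(rate(x, lam) + lam * distortion(x, lam) for lam in tradeoffs)

x = np.ones((4, 4))
loss = multi_rd_loss(x, [0.1, 0.5, 1.0])   # each term is lam + 1 here -> 4.6
```

In practice the sum is approximated by sampling one tradeoff per training step, but the objective being minimized is the same.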

### III-B Bottleneck scaling

A possible way to make the encoder and decoder aware of the tradeoff $\lambda$ is by simply scaling the latent representation $z$ in the bottleneck before quantization (implicitly scaling the quantization bin), and then inverting the scaling in the decoder. In that case, $q = \lfloor s \odot z \rceil$ and $\hat{z} = q \oslash s$ (element-wise product and division), where $s = s(\lambda)$ is the scaling factor specific to the tradeoff $\lambda$. Conventional codecs use predefined tables for $s$ (the descaling is often implicitly subsumed in the dequantization, e.g. JPEG). Instead, [17] learns the scaling factors while keeping the encoder and decoder fixed, optimized for a particular R-D tradeoff (see Fig. 1b).
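The bottleneck scaling mechanism can be sketched as follows (a toy example with a single scalar scale factor, not the learned per-channel factors of [17]); larger scales yield finer quantization and thus lower distortion:

```python
import numpy as np

def bottleneck_scale_codec(z, s):
    # Encoder side: scale the latent, then round (larger s -> finer bins).
    q = np.round(s * z)
    # Decoder side: invert the scaling (no explicit dequantizer).
    return q / s

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=(64,))

err_fine = np.mean((z - bottleneck_scale_codec(z, 8.0)) ** 2)
err_coarse = np.mean((z - bottleneck_scale_codec(z, 1.0)) ** 2)
assert err_fine < err_coarse  # finer quantization -> lower distortion
```

The rate control here comes entirely from $s$; the transform itself is unchanged, which is precisely the limited flexibility discussed next.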

We observe several limitations in this approach [17]: (1) scaling only the bottleneck feature is not flexible enough to adapt to a large range of R-D tradeoffs, (2) using the inverse of the scaling factor in the decoder may also limit the flexibility of the adaptation mechanism, (3) optimizing the parameters of the autoencoder only for a single R-D tradeoff leads to suboptimal parameters for other distant tradeoffs, and (4) training the autoencoder and the scaling factors separately may also be limiting. In order to overcome these limitations we propose the modulated autoencoder (MAE) framework.

### III-C Modulated autoencoders

Variable rate is achieved in MAEs by modulating the internal representations in the encoder and the decoder (see Fig. 2). Given a set of internal representations $Z$ in the encoder and $U$ in the decoder, each $z \in Z$ and $u \in U$ is replaced by its modulated and demodulated version $m(\lambda) \odot z$ and $d(\lambda) \odot u$, where $m(\lambda)$ and $d(\lambda)$ are the modulating and demodulating functions.

Our MAE architecture extends the deep image compression architecture proposed in [2], which combines convolutional layers and GDN/IGDN layers [1]. In our experiments we choose to modulate the outputs of the convolutional layers in the encoder and decoder.

The modulating function $m(\lambda)$ for the encoder is learned by a modulating network with parameters $\theta_m$, and the demodulating function $d(\lambda)$ by the demodulating network with parameters $\phi_d$. As a result, the encoder has learnable parameters $\{\theta, \theta_m\}$ and the decoder $\{\phi, \phi_d\}$.
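Channel-wise modulation of a convolutional feature map can be sketched as follows (hypothetical shapes; in the actual MAE the vectors $m$ and $d$ are produced by the modulating and demodulating networks, and the demodulating vector is learned independently rather than being the inverse of $m$):

```python
import numpy as np

def modulate(feature, m):
    # feature: (C, H, W) conv output; m: (C,) channel-wise modulation vector.
    return feature * m[:, None, None]

def demodulate(feature, d):
    # Decoder side uses its own learned vector d, not 1/m.
    return feature * d[:, None, None]

C, H, W = 3, 4, 4
z = np.ones((C, H, W))
m = np.array([0.5, 1.0, 2.0])   # hypothetical modulating-network output
z_mod = modulate(z, m)

assert z_mod.shape == (C, H, W)
assert np.allclose(z_mod[0], 0.5) and np.allclose(z_mod[2], 2.0)
```

Decoupling $m$ and $d$ is one of the ways MAE is more flexible than bottleneck scaling, where the decoder is forced to use the exact inverse of the encoder scale.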

Finally, the optimization problem for the MAE is

$$\theta^*, \phi^*, \theta_m^*, \phi_d^* = \arg\min_{\theta, \phi, \theta_m, \phi_d}\; \sum_{\lambda \in \Lambda} \left[ R(\tilde{q}; \lambda) + \lambda D(x, \hat{x}; \lambda) \right] \qquad (8)$$

which extends equation (5) with the modulating/demodulating networks and their corresponding parameters. All parameters are learned jointly using gradient descent and backpropagation.

This mechanism is more flexible than bottleneck scaling since it allows multi-level modulation, decouples encoder and decoder scaling, and enables effective joint training of both the autoencoder and the modulating networks, thereby optimizing jointly for all R-D tradeoffs of interest.

### III-D Modulating and demodulating networks

The modulating network is a perceptron with two fully connected layers with ReLU [13] and exponential nonlinearities (see Fig. 2). The exponential nonlinearity guarantees positive outputs, which we found beneficial in training. The output is directly the modulation vector $m(\lambda)$. A small first hidden layer allows learning a meaningful nonlinear function between tradeoffs and modulation vectors, which is more flexible than simple scaling factors and allows more expressive interpolation between tradeoffs. In practice, we use normalized tradeoffs $\hat{\lambda} = \lambda / \lambda_{\max}$. The demodulating network follows a similar architecture.
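A minimal NumPy sketch of such a modulating network, with hypothetical layer sizes and random (untrained) weights:

```python
import numpy as np

def modulating_network(lam_norm, W1, b1, W2, b2):
    # Two fully connected layers: ReLU on the hidden layer,
    # exponential on the output to guarantee positive modulation values.
    h = np.maximum(0.0, W1 @ np.atleast_1d(lam_norm) + b1)
    return np.exp(W2 @ h + b2)

rng = np.random.default_rng(0)
hidden, channels = 4, 8                       # hypothetical sizes
W1 = rng.normal(size=(hidden, 1)); b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(channels, hidden)); b2 = rng.normal(size=channels)

m = modulating_network(0.25, W1, b1, W2, b2)  # normalized tradeoff in (0, 1]
assert m.shape == (channels,)
assert np.all(m > 0)                          # exp output is strictly positive
```

In the full model these weights would be learned jointly with the autoencoder via (8), and a separate network of the same form produces the demodulating vector.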

## IV Experiments

### IV-A Experimental setup

We evaluated MAE on the CLIC dataset^{1} and compare it against the following baselines.

Independent models. Each R-D operational point is obtained by training a new model with a different R-D tradeoff in (2), requiring each model to be stored separately. This provides the optimal R-D performance, but also requires more memory to store all the models for different R-D tradeoffs.

Bottleneck scaling [17]. The autoencoder is optimized for the highest R-D tradeoff in the range. Then the autoencoder is frozen and the scaling parameters are learned for the other tradeoffs.

### IV-B Results

Fig. 3 shows the R-D operational curves for the proposed MAE and the two baselines, for both PSNR and MS-SSIM. The best R-D performance is obtained by independent models; hyperprior models [3] also have superior R-D performance. Bottleneck scaling is satisfactory at high bitrates, close to the optimal R-D operational point of the autoencoder, but degrades at lower bitrates. Interestingly, bottleneck scaling cannot reach bitrates as low as independent models can, since the autoencoder is optimized for a high bitrate; this can be observed in the R-D curve as a narrower range of bitrates. The proposed MAEs achieve R-D performance very close to the corresponding independent models, demonstrating that multi-layer modulation with joint training is a more powerful mechanism for effective variable rate compression.

The main advantage of bottleneck scaling and MAEs is the shared autoencoder, which results in far fewer parameters than independent models, whose total size grows with the number of R-D tradeoffs (see Table I). Both methods have a small overhead due to the modulating networks or the scaling factors (smaller in the case of bottleneck scaling).
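As a back-of-the-envelope illustration with hypothetical parameter counts (not the actual figures of Table I), the savings grow with the number of R-D tradeoffs:

```python
# Hypothetical parameter counts for illustration only.
autoencoder_params = 4_000_000   # one shared (or one independent) autoencoder
modnet_overhead = 10_000         # modulating + demodulating networks
num_tradeoffs = 6                # number of R-D operational points

independent_total = num_tradeoffs * autoencoder_params  # grows with tradeoffs
mae_total = autoencoder_params + modnet_overhead        # constant in tradeoffs

assert mae_total < independent_total
```

With these toy numbers the MAE stores roughly one sixth of the parameters, and its size does not change as more operational points are added.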

In order to illustrate the differences between the bottleneck scaling and MAE rate adaptation mechanisms, we consider the image in Fig. 4b and its reconstructions at high and low bitrates (see Fig. 4a). We show two of the 192 channels of the bottleneck feature before quantization (see Fig. 4a), and observe that the maps for the two bitrates are similar, but the range is larger for the high-rate tradeoff, so the quantization will be finer. This is also what we would expect with bottleneck scaling. However, a closer look highlights the difference between the two methods. We also compute the element-wise ratio between the bottleneck features at the high and low bitrates, and show the ratio image for the same channels of the example (see Fig. 4c). The MAE learns to perform a more complex adaptation of the features, beyond simple channel-wise bottleneck scaling, since different parts of the image have different ratios (the ratio image would be constant with bottleneck scaling). This allows the MAE to allocate bits more freely when optimizing for different R-D tradeoffs, especially at low bitrates.

## V Conclusion

In this work, we introduce the modulated autoencoder, a novel variable rate deep image compression framework based on multi-layer feature modulation and joint learning of the autoencoder parameters. MAEs can realize variable rate image compression with a single model, while keeping performance close to the upper bound of independent models, which require significantly more memory. We show that MAE outperforms bottleneck scaling [17], especially at low bitrates.


### References

- [1] (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281.
- [2] (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.
- [3] (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
- [4] (2001) Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18 (5), pp. 9–21.
- [5] (2016) Towards conceptual compression. In Advances in Neural Information Processing Systems, pp. 3549–3557.
- [6] (1989) Fundamentals of digital image processing. Prentice Hall.
- [7] (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
- [8] (1992) Image compression using the 2-D wavelet transform. IEEE Transactions on Image Processing 1 (2), pp. 244–250.
- [9] (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223.
- [10] (2018) CNN-based DCT-like transform for image compression. In International Conference on Multimedia Modeling, pp. 61–72.
- [11] (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
- [12] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
- [13] (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- [14] (1985) Learning internal representations by error propagation. Technical report, University of California San Diego, Institute for Cognitive Science.
- [15] (1948) A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423.
- [16] (2012) JPEG2000 image compression fundamentals, standards and practice. Vol. 642, Springer Science & Business Media.
- [17] (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
- [18] (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085.
- [19] (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
- [20] (1995) Wavelet filter evaluation for image compression. IEEE Transactions on Image Processing 4 (8), pp. 1053–1060.
- [21] (1972) Transform picture coding. Proceedings of the IEEE 60 (7), pp. 809–820.