Overfitting for Fun and Profit: Instance-Adaptive Data Compression
Abstract
Neural data compression has been shown to outperform classical methods in terms of rate-distortion (RD) performance, with results still improving rapidly. At a high level, neural compression is based on an autoencoder that tries to reconstruct the input instance from a (quantized) latent representation, coupled with a prior that is used to losslessly compress these latents. Due to limitations on model capacity and imperfect optimization and generalization, such models will in general compress test data suboptimally. However, one of the great strengths of learned compression is that if the test-time data distribution is known and relatively low-entropy (e.g. a camera watching a static scene, a dash cam in an autonomous car, etc.), the model can easily be finetuned or adapted to this distribution, leading to improved RD performance. In this paper we take this concept to the extreme, adapting the full model to a single video, and sending model updates (quantized and compressed using a parameter-space prior) along with the latent representation. Unlike previous work, we finetune not only the encoder/latents but the entire model, and during finetuning we take into account both the effect of model quantization and the additional costs incurred by sending the model updates. We evaluate an image compression model on I-frames (sampled at 2 fps) from videos of the Xiph dataset, and demonstrate that full-model adaptation improves RD performance by approximately 1 dB with respect to encoder-only finetuning.
1 Introduction
The most common approach to neural lossy compression is to train a VAE-like model on a training dataset to minimize the expected rate-distortion (RD) cost (Theis et al., 2017; Kingma and Welling, 2013). Although this approach has proven to be very successful (Ballé et al., 2018), a model trained to minimize expected RD cost over a full dataset is unlikely to be optimal for every test instance because the model has limited capacity, and both optimization and generalization will be imperfect. The problem of generalization will be especially significant when the testing distribution is different from the training distribution, as is likely to be the case in practice.
Suboptimality of the encoder has been studied extensively under the term inference suboptimality (Cremer et al., 2018), and it has been shown that finetuning the encoder or latents for a particular instance can lead to improved compression performance (Lu et al., 2020; Campos et al., 2019; Yang et al., 2020b; Guo et al., 2020). This approach is appealing as no additional information needs to be added to the bitstream, and nothing changes on the receiver side. Performance gains, however, are limited, because the prior and decoder cannot be adapted.
In this paper we present a method for full-model instance-adaptive compression, i.e. adapting the entire model to a single data instance. Unlike previous work, our method takes into account the costs of sending not only the latent-prior but also the decoder model updates, as well as the quantization of these updates. This is achieved by extending the typical RD loss with an additional model rate term that measures the number of bits required to send the model updates under a newly introduced model prior, resulting in a combined loss.
As an initial proof of concept, we show that this approach can lead to very substantial gains in RD performance (approximately 1 dB PSNR gain at the same bitrate) on the problem of I-frame video coding, where a set of key frames, sampled from a video at 2 fps, is independently coded using an I-frame (image compression) model. Additionally, we show how the model rate bits are distributed across the model, and (by means of an ablation study) quantify the individual gains achieved by including a model-rate loss and using quantization-aware finetuning.
The rest of this paper is structured as follows. Section 2 discusses the basics of neural compression and related work on adaptive compression. Section 3 presents our method, including details on the loss, the choice of the model prior, its quantization, and the (de)coding procedure. In Sections 4 and 5 we present our experiments and results, followed by a discussion in Section 6.
2 Preliminaries and Related Work
2.1 Neural data compression
The standard approach to neural compression can be understood as a particular kind of VAE (Kingma and Welling, 2013). In the compression literature the encoder $q_\varphi(z|x)$ is typically defined by a neural network parameterized by $\varphi$, with either deterministic output (so $q_\varphi$ is one-hot) (Habibian et al., 2019) or with fixed uniform noise on the outputs (Ballé et al., 2018). In both cases, sampling is used during training while quantization is used at test time.
The latent $z$ is encoded to the bitstream using entropy coding in conjunction with a latent prior $p_\theta(z)$, so that coding $z$ takes about $-\log_2 p_\theta(z)$ bits (up to discretization). On the receiving side, the entropy decoder is used with the same prior to decode $z$, after which the input is reconstructed as $\hat{x} = g_\theta(z)$ using the decoder network (note that we use the same symbol $\theta$ to denote the parameters of the prior and decoder jointly, as in our method both will have to be coded and added to the bitstream).
From these considerations it is clear that the rate and distortion can be measured by the two terms in the following loss:
(1) $\mathcal{L}_{\mathcal{D}}(\varphi, \theta) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{z \sim q_\varphi(z|x)} \Big[ \underbrace{-\log_2 p_\theta(z)}_{\text{rate}} + \beta \underbrace{d\big(x, g_\theta(z)\big)}_{\text{distortion}} \Big]$
This loss is equal (up to the tradeoff parameter $\beta$ and an additive constant) to the standard negative evidence lower bound (ELBO) used in VAE training. The rate term of the ELBO is written as a KL divergence between encoder and prior, but since $D_{\mathrm{KL}}(q_\varphi \,\|\, p_\theta) = -H(q_\varphi) - \mathbb{E}_{q_\varphi}[\log p_\theta(z)]$, and the encoder entropy $H(q_\varphi)$ is constant in our case, minimizing the KL loss is equivalent to minimizing the rate loss.
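To make eq. 1 concrete, here is a minimal sketch (not the paper's implementation; the unit-width-bin Gaussian prior and MSE distortion are illustrative assumptions) of how the rate and distortion terms combine:

```python
import math

def gauss_cdf(x, sigma=1.0):
    # Standard Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def latent_rate_bits(z_quantized, sigma=1.0):
    # Rate term of eq. 1: -log2 p(z) for each integer-quantized latent,
    # with p the mass of a zero-mean Gaussian in the unit-width bin.
    bits = 0.0
    for z in z_quantized:
        mass = gauss_cdf(z + 0.5, sigma) - gauss_cdf(z - 0.5, sigma)
        bits -= math.log2(mass)
    return bits

def rd_loss(x, x_hat, z_quantized, beta, sigma=1.0):
    # Eq. 1: rate + beta * distortion, with MSE standing in for d.
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return latent_rate_bits(z_quantized, sigma) + beta * mse
```

Latents near the prior mode are cheap to code (`latent_rate_bits([0])` is far smaller than `latent_rate_bits([3])`), which is what drives the encoder toward high-probability latents.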
Neural video compression is typically decomposed into the problem of independently compressing a set of key frames (i.e. I-frames) and conditionally compressing the remaining frames (Lu et al., 2019; Liu et al., 2020; Wu et al., 2018; Djelouah et al., 2019; Yang et al., 2020a). In this work, we specifically focus on improving I-frame compression.
2.2 Adaptive Compression
A compression model is trained on a dataset with the aim of achieving optimal RD performance on test data. However, because of limited model capacity, optimization difficulties, or insufficient data (resulting in poor generalization), the model will in general not achieve this goal. When the test data distribution differs from that of the training data, generalization is not guaranteed even in the limit of infinite data, infinite model capacity, and perfect optimization.
A convenient feature of neural compression, however, is that a model can easily be finetuned on new data or data from a specific domain. A model can for instance be trained (further) after deployment; Habibian et al. (2019) showed improved RD gains after finetuning a video compression model on footage from a dash cam, an approach dubbed adaptive compression.
In adaptive compression, decoding requires access to the adapted prior and decoder models. These models (or their delta relative to a pretrained shared model) thus need to be signaled. When the amount of data coded with the adapted model is large, the cost of signaling the model update will be negligible as it is amortized. However, a tradeoff exists: the more restricted the domain of adaptation, the more we can expect to gain from adaptation (e.g. an image compared to a video or a collection of videos). In this paper we consider the case where the domain of adaptation is a set of I-frames from a single video, so that the costs for sending model updates become very relevant.
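The amortization tradeoff can be illustrated with a toy calculation (the 10 kB update size below is a made-up placeholder, not a measured value):

```python
def model_rate_bpp(update_bits, num_frames, width, height):
    # Overhead of signaling a model update, amortized over all coded
    # pixels of the instance, in bits per pixel.
    return update_bits / (num_frames * width * height)

# Hypothetical 10 kB model update, 1080p frames:
single_frame = model_rate_bpp(10_000 * 8, 1, 1920, 1080)
forty_frames = model_rate_bpp(10_000 * 8, 40, 1920, 1080)
```

For a single image such an update would cost about 0.039 bpp; amortized over 40 I-frames it drops 40-fold, which is why adapting to a set of I-frames is far more attractive than adapting to one image.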
2.3 Closing the amortization gap
Coding model updates can easily become prohibitively expensive when the model is adapted for every instance. However, if we only adapt the encoder or latents, no model update needs to be added to the bitstream, since the encoder is not needed for decoding as the latents are sent anyway. We can thus close, or at least reduce, the amortization gap (the difference between the amortized encoder and the instance-optimal encoder; Cremer et al. (2018)) without paying any bits for model updates. Various authors have investigated this approach: Aytekin et al. (2018) and Lu et al. (2020) adapt the encoder, while Campos et al. (2019), Yang et al. (2020b), and Guo et al. (2020) adapt the latents directly. This simple approach was shown to provide a modest boost in RD performance.
2.4 Encoding model updates
As mentioned, when adapting (parts of) the decoder or prior to an instance, model updates have to be added to the bitstream in order to enable decoding. Recent works have proposed ways to finetune parts of the model while keeping the resulting bitrate overhead small. For instance, Klopp et al. (2020) train a reconstruction-error-predicting network at encoding time, quantize its parameters, and add them to the bitstream. Similarly, Lam et al. (2019; 2020) propose to finetune all parameters or only the convolutional biases, respectively, of an artifact removal filter that operates after decoding. A sparsity-enforcing and magnitude-suppressing penalty is leveraged, and additional thresholding is applied to enforce sparsity even more strictly. The update vector is thereafter quantized using k-means clustering. Finally, Zou et al. (2020) finetune the latents in a meta-learning framework, in addition to updating the decoder convolutional biases, which are quantized by k-means and thereafter transmitted. All these methods perform quantization post-training, leading to a potentially unbounded reduction in performance. Also, despite the use of regularizing loss terms, no valid proxy for the actual cost of sending model updates is adopted. Finally, none of these methods performs adaptation of the full model.
The field of model compression is related to our work, as the main question to be answered is how to most efficiently compress a neural network without compromising on downstream task performance (Han et al., 2016; Kuzmin et al., 2019). Bayesian compression is closely related, where the model weights are sent under a model prior (Louizos et al., 2017; Havasi et al., 2018), as is the case in our method. However, instead of modeling uncertainty in parameters, we assume a deterministic posterior (i.e. a point estimate). Another key difference with these works is that we send the model parameter updates relative to an existing baseline model, which enables extreme compression rates (0.02–0.2 bits/param). This concept of compressing updates has been used before in the context of federated learning (McMahan et al., 2017; Alistarh et al., 2017) as well. We distinguish ourselves from that context, as there the model has to be compressed during every iteration, allowing for error corrections in later iterations. We only transmit the model updates once for every data instance that we finetune on.
3 Full-Model Instance-Adaptive Compression
In this section we present full-model finetuning on one instance, while (during finetuning) taking into account both model quantization and the costs for sending model updates. The main idea is described in Section 3.1, after which Sections 3.2 and 3.3 respectively provide details regarding the model prior and its quantization. The algorithm is described in Section 3.4.
3.1 Finetuning at inference time
Full-model instance-adaptive compression entails finetuning of a set of global model parameters $\{\varphi, \theta\}$ (obtained by training on dataset $\mathcal{D}$) on a single instance $x$. This results in updated parameters $\{\bar\varphi, \bar\theta\}$, of which only $\bar\theta$ has to be signaled in the bitstream. In practice we only learn the changes $\delta = \bar\theta - \theta$ with respect to the global model, and encode the model updates of the decoding model. In order to encode $\delta$, we introduce a continuous model prior $p(\delta)$ to regularize these updates, and use the quantized counterpart $P(\bar\delta)$ for entropy (de)coding them (more on quantization in Section 3.3).
The overhead for sending the quantized model update $\bar\delta$ is given by the model rate $R_\delta = -\log_2 P(\bar\delta)$, and can be approximated by its continuous counterpart $-\log_2 p(\delta)$ (see Appendix A.1 for justification). Adding this term to the standard RD loss using the same tradeoff parameter $\beta$, we obtain the instance-adaptive compression objective:

(2) $\mathcal{L}_x(\bar\varphi, \delta) = \mathbb{E}_{z \sim q_{\bar\varphi}(z|x)} \Big[ \underbrace{-\log_2 p_{\theta + \bar\delta}(z)}_{\text{latent rate}} + \beta \underbrace{d\big(x, g_{\theta + \bar\delta}(z)\big)}_{\text{distortion}} \Big] \underbrace{- \log_2 p(\delta)}_{\text{model rate}}$
At inference time, this objective can be minimized directly to find the optimal model parameters for transmitting datapoint $x$. It takes into account the additional costs for encoding model updates (the $-\log_2 p(\delta)$ term), and incorporates model quantization during finetuning (latent rate and distortion are evaluated at the quantized parameters $\theta + \bar\delta$).
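A minimal sketch of the combined objective in eq. 2, assuming a scalar Gaussian model prior and precomputed latent-rate/distortion values (all helper names are illustrative):

```python
import math

def gauss_log2_pdf(x, sigma):
    # log2 of a zero-mean Gaussian density.
    return (-0.5 * math.log(2 * math.pi * sigma * sigma)
            - x * x / (2 * sigma * sigma)) / math.log(2)

def instance_adaptive_loss(latent_rate_bits, distortion, beta, delta, sigma_prior):
    # Eq. 2: latent rate + beta * distortion + model rate, where the
    # model rate uses the continuous prior p(delta) as a differentiable
    # proxy for the discrete -log2 P(delta_bar).
    model_rate_bits = sum(-gauss_log2_pdf(d, sigma_prior) for d in delta)
    return latent_rate_bits + beta * distortion + model_rate_bits
```

The zero update minimizes the model-rate term, so any nonzero update must buy a commensurate latent-rate or distortion reduction.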
3.2 Model prior design
A plethora of options exists for designing the model prior $p(\delta)$, as any PDF could be chosen, but a natural choice for modeling the parameter update $\delta$ is a Gaussian distribution centered around the zero-update. Specifically, we can define the model prior on the updates as a multivariate zero-centered Gaussian with diagonal covariance and a single shared standard deviation $\sigma$ (a hyperparameter): $p(\delta) = \mathcal{N}(\delta \mid \mathbf{0}, \sigma^2 \mathbf{I})$. Note that this is equivalent to modeling the updated parameters $\bar\theta$ by $\mathcal{N}(\bar\theta \mid \theta, \sigma^2 \mathbf{I})$.
When entropy (de)coding the quantized updates under $P(\bar\delta)$, we must realize that even the zero-update, i.e. $\bar\delta = \mathbf{0}$, is not free. We define these initial static costs as $R_0 = -\log_2 P(\bar\delta = \mathbf{0})$. Because the mode of the defined model prior is zero, these initial costs equal the minimum costs. Minimization of eq. 2 thus ensures that, after overcoming these static costs, any extra bit spent on model updates is accompanied by a commensurate improvement in RD performance.
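The initial cost R₀ can be sized with a short sketch (assuming the slab-only Gaussian prior; any concrete parameter and pixel counts passed in would be hypothetical):

```python
import math

def gauss_cdf(x, sigma):
    # Zero-mean Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def zero_update_bits(t, sigma):
    # Bits to code one zero update: -log2 of the Gaussian prior mass in
    # the central quantization bin [-t/2, t/2].
    p0 = gauss_cdf(t / 2, sigma) - gauss_cdf(-t / 2, sigma)
    return -math.log2(p0)

def zero_update_cost_bpp(num_params, num_pixels, t, sigma):
    # R0 in bits/pixel: every decoder parameter signals "no change",
    # amortized over the pixels of the finetuning instance.
    return num_params * zero_update_bits(t, sigma) / num_pixels
```

Finer bins make the zero-update more expensive, so the bin width t (relative to σ) directly controls the entry fee of full-model adaptation.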
Since our method works best when signaling the zero-update is cheap, we want to increase the probability mass $P(\bar\delta = \mathbf{0})$. We propose to generalize our earlier proposed model prior to a so-called spike-and-slab prior (Ročková and George, 2018), which drastically reduces the costs for this zero-update. More specifically, we redefine the PDF as a (weighted) sum of two Gaussians: a wide (slab) and a more narrow (spike) distribution:
(3) $p(\delta) = \frac{1}{\alpha + 1} \Big( \alpha\, \mathcal{N}(\delta \mid \mathbf{0}, \sigma_s^2 \mathbf{I}) + \mathcal{N}(\delta \mid \mathbf{0}, \sigma^2 \mathbf{I}) \Big)$

where $\alpha$ is a hyperparameter determining the height of the spiky Gaussian with respect to the wider slab, $t$ is the bin width used for quantization (more details in Section 3.3), and $\sigma_s = t/6$. By choosing the standard deviation of the spike to be $t/6$, the mass within six standard deviations (i.e. 99.7% of the total mass) falls in the central quantization bin after quantization. Note that the (slab-only) Gaussian prior is a special case of the spike-and-slab Gaussian prior in eq. 3, obtained when $\alpha = 0$. As such, $p(\delta)$ refers to the spike-and-slab prior in the rest of this work. Appendix A.2 compares the continuous and discrete spike-and-slab prior and their gradients.
Adding the spike distribution not only decreases $R_0$, it also more heavily enforces sparsity on the updates via the regularizing $-\log_2 p(\delta)$ term in eq. 2. In fact, a high spike (i.e. large $\alpha$) can make the bits for signaling a zero-update almost negligible, so that the model effectively learns to make a binary choice: a parameter is either worth updating at the cost of some additional rate, or it is not updated and the 'spent' bits are negligible.
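A scalar sketch of the spike-and-slab prior of eq. 3, showing how the spike drives down the cost of the central quantization bin (the helper names and the test values are illustrative):

```python
import math

def gauss_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def spike_and_slab_pdf(delta, alpha, sigma, t):
    # Eq. 3: weighted mix of a narrow spike (std t/6) and a wide slab.
    return (alpha * gauss_pdf(delta, t / 6.0) + gauss_pdf(delta, sigma)) / (alpha + 1)

def central_bin_bits(alpha, sigma, t):
    # -log2 of the prior mass falling in the central bin [-t/2, t/2].
    def mass(s):
        return gauss_cdf(t / 2, s) - gauss_cdf(-t / 2, s)
    return -math.log2((alpha * mass(t / 6.0) + mass(sigma)) / (alpha + 1))
```

With `alpha = 0` the prior reduces to the slab-only Gaussian; a large `alpha` makes the zero-update nearly free while making large updates only modestly more expensive.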
3.3 Quantization
In order to quantize a scalar update (denoted by $w$ in this section to avoid clutter), we use equispaced bins of width $t$, and we define the following quantization function:

(4) $\bar{w} = Q(w) = t \cdot \operatorname{clip}\!\left( \operatorname{round}\!\left( \tfrac{w}{t} \right),\, -K,\, K \right)$

As both rounding and clipping are nondifferentiable, the gradient of $Q$ is approximated using the Straight-Through estimator (STE) proposed by Bengio et al. (2013). That is, we assume $\partial \bar{w} / \partial w = 1$.
The bins are the intervals $[(k - \frac{1}{2})t, (k + \frac{1}{2})t]$ for $k \in \{-K, \dots, K\}$. We view $t$ as a hyperparameter, and define $K$ to be the smallest integer such that the region covered by the $2K+1$ bins (i.e. the interval $[-(K + \frac{1}{2})t, (K + \frac{1}{2})t]$) covers at least a fixed fraction (close to one) of the probability mass of $p(\delta)$. Indeed, the number of bins is proportional to the ratio of the width of $p(\delta)$ and the width of the bins: $2K + 1 \propto \sigma / t$. The number of bins presents a tradeoff between finetuning flexibility and model rate costs, so the ratio $\sigma / t$ is an important hyperparameter. The higher $\sigma / t$, the higher these costs due to finer quantization, but simultaneously the lower the quantization gap, enabling more flexible finetuning.
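A minimal sketch of the quantizer in eq. 4 and the choice of K (the 0.9999 coverage threshold is an assumed placeholder for the unspecified mass fraction):

```python
import math

def gauss_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def choose_K(sigma, t, coverage=0.9999):
    # Smallest K such that the 2K+1 bins, i.e. the interval
    # [-(K + 1/2)t, (K + 1/2)t], hold at least `coverage` of the mass.
    K = 0
    while gauss_cdf((K + 0.5) * t, sigma) - gauss_cdf(-(K + 0.5) * t, sigma) < coverage:
        K += 1
    return K

def quantize(w, t, K):
    # Eq. 4: round w to the nearest bin centre, clip to [-K*t, K*t].
    # During finetuning the gradient of this op is taken to be 1 (STE).
    return t * max(-K, min(K, round(w / t)))
```

A larger ratio sigma/t yields more bins (finer quantization), mirroring the tradeoff described above.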
Since $\bar\delta = Q(\delta)$, the discrete model prior $P(\bar\delta)$ is the pushforward of $p(\delta)$ through $Q$:

(5) $P(\bar\delta = kt) = \int_{(k - \frac{1}{2})t}^{(k + \frac{1}{2})t} p(\delta)\, \mathrm{d}\delta = F\big((k + \tfrac{1}{2})t\big) - F\big((k - \tfrac{1}{2})t\big)$

That is, $P(\bar\delta)$ equals the mass of $p(\delta)$ in the bin of $\bar\delta$, which can be computed as the difference of the CDF $F$ of $p(\delta)$ evaluated at the edges of that bin.
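The pushforward of eq. 5 can be checked numerically; in this sketch the two outer bins absorb the tail mass clipped by Q, so the resulting PMF is properly normalized (an implementation detail implied by the clipping in eq. 4):

```python
import math

def gauss_cdf(x, sigma):
    if math.isinf(x):
        return 1.0 if x > 0 else 0.0
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def discrete_model_prior(t, K, sigma):
    # Eq. 5: P(delta_bar = k*t) is the CDF difference over bin k;
    # bins -K and K additionally receive the tails clipped by Q.
    pmf = {}
    for k in range(-K, K + 1):
        lo = -math.inf if k == -K else (k - 0.5) * t
        hi = math.inf if k == K else (k + 0.5) * t
        pmf[k] = gauss_cdf(hi, sigma) - gauss_cdf(lo, sigma)
    return pmf
```

Entropy coding an update then costs -log2 of its bin's mass, with the central bin the cheapest symbol.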
3.4 Entropy coding and decoding
After finetuning the compression model on instance $x$ by minimizing the loss in eq. 2, both the latents $\bar{z}$ and the model updates $\bar\delta$ are entropy coded (under their respective priors $p_{\bar\theta}$ and $P$) into a bitstream. Decoding starts by decoding $\bar\delta$ using $P$, followed by decoding $\bar{z}$ using $p_{\bar\theta}$ (where $\bar\theta = \theta + \bar\delta$), and finally reconstructing $\hat{x}$ using the decoder $g_{\bar\theta}$. The whole process is shown in Figure 1 and defined formally in Algorithms 1 and 2.
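The decoding order matters: the receiver can only build the adapted latent prior after decoding the model updates. A hedged sketch of the procedure, with ideal code lengths standing in for a real entropy coder (all names are illustrative, not the paper's algorithms verbatim):

```python
import math

def send(theta, delta_bar, z_bar, model_pmf, latent_pmf_fn):
    # Sender: code delta_bar under the shared prior P, then the latents
    # under the *adapted* prior built from theta + delta_bar.
    theta_bar = [p + d for p, d in zip(theta, delta_bar)]
    bits = sum(-math.log2(model_pmf[d]) for d in delta_bar)
    bits += sum(-math.log2(latent_pmf_fn(theta_bar)[z]) for z in z_bar)
    return (delta_bar, z_bar), bits  # symbols stand in for the bitstream

def receive(theta, bitstream):
    # Receiver: recover delta_bar first (prior P is shared), rebuild
    # theta_bar = theta + delta_bar, then decode the latents under the
    # adapted prior and reconstruct with the adapted decoder.
    delta_bar, z_bar = bitstream
    theta_bar = [p + d for p, d in zip(theta, delta_bar)]
    return theta_bar, z_bar
```

Note that only the decoder-side parameters travel in the stream; the adapted encoder stays on the sender side.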
4 Experimental setup
4.1 Datasets
The experiments in this paper use images and videos from the following two datasets:
CLIC19
Xiph5N 2fps
4.2 Global model architecture and training
The model rate can be restricted by, among other things, choosing a low-complexity neural compression model and amortizing the additional model update costs over a large number of pixels.
The most natural case for full-model adaptation is therefore to finetune a low-complexity model on a video instance. Typical video compression setups combine an image compression (I-frame) model to compress key frames with a predicted- or between-frame model to reconstruct the remaining frames. Without loss of generality for video-adaptive finetuning, we showcase our full-model finetuning framework on the I-frame compression problem, a subproblem of video compression.
Specifically, we use the (relatively low-complexity) hyperprior model proposed by Ballé et al. (2018), including the mean-scale prior (without context) from Minnen et al. (2018). Before finetuning, this model is trained on the training fold of the CLIC19 dataset, using the RD objective given in eq. 1. Appendix C provides more details on both its architecture and the adopted training procedure.
4.3 Instanceadaptive finetuning
Each instance in the Xiph5N 2fps dataset (i.e. a set of I-frames belonging to one video) is (separately) used for full-model adaptation. The resulting model rate costs are amortized over the set of I-frames, rather than the full video. As a benchmark, we only finetune the encoding procedure, as it does not induce additional bits in the bitstream. Encoder-only finetuning can be implemented by either finetuning the encoding model parameters $\varphi$, or directly optimizing the latents as in Campos et al. (2019). We implement both benchmarks, as the former is an ablated version of our proposed full-model finetuning, whereas the latter is expected to perform better due to the amortization gap (Cremer et al., 2018).
Global models trained with RD tradeoff parameter $\beta$ are used to initialize the models that are then finetuned with the corresponding value of $\beta$. Both encoder-only and direct-latent finetuning minimize the RD loss as given in eq. 1. For encoder-only tuning we use a constant learning rate, whereas for latent optimization one constant learning rate is used for the low bitrate region (i.e. the two highest values of $\beta$) and another for the high rate region. In case of direct latent optimization, the pre-quantized latents are finetuned, which are initialized using a forward pass of the encoder.
Our instance-adaptive full-model finetuning framework extends encoder-only finetuning by jointly updating $\varphi$ and $\delta$ using the loss from eq. 2. In this case we finetune the global model trained for the low bitrate regime (the highest $\beta$), independent of the value of $\beta$ during finetuning. Empirically, this resulted in a negligible difference in performance compared to using the global model of the corresponding finetuning $\beta$, while it alleviated memory constraints thanks to the smaller size of this low bitrate model architecture (see Appendix C). The training objective in eq. 2 is expressed in bits per pixel and optimized using a fixed learning rate. The model prior is parameterized by the quantization bin width $t$, the standard deviation $\sigma$, and the multiplicative factor of the spike $\alpha$. We empirically found that sensitivity to changing $\alpha$ in the range 50–5000 was low. Realize that, instead of empirically setting $\alpha$, its value could also be solved for a target initial cost $R_0$, given the number of decoding parameters and the number of pixels in the finetuning instance.
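The closing remark can be made concrete: under the spike-and-slab prior the central-bin mass is P(0) = (α·m_spike + m_slab)/(α + 1), which inverts to α in closed form for a target R₀. A sketch under these assumptions (the parameter and pixel counts in the test are hypothetical):

```python
import math

def gauss_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def bin_mass(sigma, t):
    # Prior mass of the central quantization bin [-t/2, t/2].
    return gauss_cdf(t / 2, sigma) - gauss_cdf(-t / 2, sigma)

def initial_cost_bpp(alpha, num_params, num_pixels, t, sigma):
    # R0: zero-update bits for every decoder parameter, per pixel.
    p0 = (alpha * bin_mass(t / 6.0, t) + bin_mass(sigma, t)) / (alpha + 1)
    return num_params * (-math.log2(p0)) / num_pixels

def solve_alpha(target_r0_bpp, num_params, num_pixels, t, sigma):
    # Closed-form inversion of initial_cost_bpp for alpha.
    bits_per_param = target_r0_bpp * num_pixels / num_params
    p0 = 2.0 ** (-bits_per_param)            # required central-bin mass
    m_spike, m_slab = bin_mass(t / 6.0, t), bin_mass(sigma, t)
    if not m_slab < p0 < m_spike:
        raise ValueError("target R0 unreachable for these t and sigma")
    return (p0 - m_slab) / (m_spike - p0)
```

The guard reflects the feasible range: the slab alone lower-bounds the central-bin cost's mass, while the spike upper-bounds it.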
All finetuning experiments (both encoder-only and full-model) ran for the same fixed number of steps, each step processing one minibatch consisting of a single, full-resolution I-frame.
5 Results
5.1 Ratedistortion gains
Figure 2a shows the compression performance for encoder-only finetuning, direct latent optimization, and full-model finetuning, averaged over all videos in the Xiph5N 2fps dataset for different rate-distortion tradeoffs. Finetuning of the entire parameter space results in much higher RD gains (on average approximately 1 dB for the same bitrate) compared to only finetuning the encoding parameters or the latents directly. Figure 8 in Appendix E shows this plot for each video separately. Note that encoder-only finetuning performance is on par with direct latent optimization, implying that the amortization gap (Cremer et al., 2018) is close to zero when finetuning the encoder model on a moderate number of I-frames.
Table 1 provides insight into the distribution of bits over the latent rate and the model rate, which both increase for lower values of $\beta$. However, the relative contribution of the model rate increases in the higher bitrate regime, which could be explained by the fact that in this regime the latent rate can be reduced more heavily by finetuning. Figure 2a indeed confirms that the total rate reduction is higher in the high bitrate regime, and thus fully originates from the reduction in latent rate after finetuning. Table 1 also shows that the static initial costs only marginally contribute to the total model rate. Figure 2b shows for one video how finetuning progressed over training steps, confirming that the compression gains cover these initial costs already at the beginning of finetuning. This effect was visible for all tested videos (see Appendix E).
5.2 Ablations
Figure 2c shows several ablation results for one video. Case I is our proposed full-model finetuning, optimizing the loss in eq. 2 while quantizing the updates to compute the distortion. One can see that omitting quantization-aware finetuning (case II) deteriorates the distortion gains during evaluation. Removing the (continuous) model rate penalty from the finetuning procedure (case III) imposes an extreme increase in rate during evaluation, caused by unbounded growth of the model rate during finetuning. As such, these models result in a rate much higher than the baseline model's rate, showing that finetuning without model rate awareness gives extremely poor results. Case IV shows the performance deterioration when finetuning is both quantization and model rate unaware. Analyzing these runs while (naively) not taking into account the additional model update costs (cases V and VI) provides upper bounds on the compression improvement. Case VI shows the most naive bound: finetuning without quantization. Case V is a tighter bound that does include quantization. The gap between V and VI is small, suggesting that the used quantization strategy only mildly harms performance.
An ablation study on the effect of the number of finetuning frames is provided in Appendix D. It reveals that, under the spike-and-slab model prior, full-model finetuning works well for a large range of instance lengths. Only in the worst-case scenario, when finetuning on a single frame in the low bitrate regime, was full-model finetuning found to be too costly.
5.3 Distribution of bits
Figure 3 shows, for different (mutually exclusive) parameter groups, the distribution of the model updates (top) and their corresponding bits (bottom). Interestingly, one can see how for (almost) all groups the updates are being clipped by our earlier defined quantizer $Q$. This suggests the need for large, expensive updates in these parameter groups, for which the additional RD gain thus appears to outweigh the extra costs. At the same time, all groups show an elevated center bin, thanks to training with the spike-and-slab prior (see Appendix F). By design of this prior, the bits paid for this zero-update are extremely low, which can best be seen in the bits histogram (Fig. 3, bottom) of the Codec Decoder Weights and Biases. The parameter updates of the Codec Decoder IGDN group are the only ones that are non-symmetrically distributed around the zero-update, which can be explained by the fact that IGDN (Ballé et al., 2016) is an (inverse) normalization layer. The Codec Decoder Weights were found to contribute most to the total model rate.
β | PSNR (dB) | Total rate (bits/pixel) | Latent rate (bits/pixel) | Model rate (bits/pixel) | Initial cost R₀ (bits/pixel) | Model rate (bits/parameter) | Model rate (kB/frame)
3.0e-03 | 34.0 | 0.175 | 0.174 | 0.001 | 0.00033 | 0.025 | 0.38
1.0e-03 | 36.2 | 0.304 | 0.302 | 0.002 | 0.00033 | 0.041 | 0.63
2.5e-04 | 39.0 | 0.551 | 0.545 | 0.006 | 0.00033 | 0.106 | 1.65
1.0e-04 | 40.8 | 0.866 | 0.852 | 0.014 | 0.00033 | 0.229 | 3.57
6 Discussion
This work presented instance-adaptive neural compression, the first method that enables finetuning of a full compression model on a set of I-frames from a single video, while restricting the additional bits for encoding the (quantized) model updates. To this end, the typical rate-distortion loss was extended by incorporating both model quantization and the additional model rate costs during finetuning. This loss guarantees pure RD performance gains after overcoming a small initial cost for encoding the zero-update. We showed improved RD performance on all five tested videos from the Xiph dataset, with an average distortion improvement of approximately 1 dB for the same bitrate.
Among videos, we found differences in the achieved finetuning RD gain (see Appendix E). Possible causes are threefold. First, the performance of the global model differs per video, influencing the maximum gains achievable by finetuning. Second, video characteristics such as (non)stationarity greatly influence the diversity of the set of I-frames, thereby affecting the ease of model adaptation. Third, the number of I-frames differs per video and thus trades off model update costs (which are amortized over the set of I-frames) against ease of finetuning.
The results of the ablation in Fig. 2c show that the quantization gap (V vs VI) is considerably smaller than the performance deterioration due to (additionally) regularizing the finetuning with the model prior (I vs V). Most future improvement is thus expected from leveraging more flexible model priors, e.g. by learning their parameters and/or modeling dependencies among updates.
We showed how instance-adaptive full-model finetuning greatly improves RD performance for I-frame compression, a subproblem of video compression. Equivalently, one could exploit full-model finetuning to enhance compression of the remaining (non-key) frames of a video, compressing the total video even further. Also, neural video compression models that exploit temporal redundancy could be finetuned, as long as the model's complexity is low enough to restrict the model rate. Leveraging such a low-complexity video model moves computational complexity of data compression from the receiver to the sender, by heavily finetuning this small model on each video. We foresee benefits of this computational shift in applications where bandwidth and receiver compute power are scarce, but encoding compute time is less important. In practice this use case arises e.g. in (non-live) video streaming to low-power edge devices. Finding such low-complexity video compression models is however nontrivial, as their capacity must still suffice to compress an entire video. We foresee great opportunities here, and will therefore investigate this in future work.
Appendix A Model rate loss
A.1 Gradient definition
The proof below shows that the gradient of the continuous model rate $-\log_2 p(\delta)$ is an unbiased first-order approximation of the gradient of the discrete model rate $-\log_2 P(\bar\delta)$, validating its use during finetuning.

The gradient of the continuous model rate loss towards $\delta$ is defined as:

(6) $\frac{\partial}{\partial \delta} \left[ -\log_2 p(\delta) \right] = -\frac{1}{\ln 2} \frac{p'(\delta)}{p(\delta)}$

The gradient of the nondifferentiable discrete model rate loss towards $\delta$ can be defined by exploiting the Straight-Through gradient estimator (Bengio et al., 2013), i.e. $\partial \bar\delta / \partial \delta := 1$.

As such, writing $P(\bar\delta) = F(\bar\delta + \frac{t}{2}) - F(\bar\delta - \frac{t}{2})$ with $F$ the CDF of $p$:

(7) $\frac{\partial}{\partial \delta} \left[ -\log_2 P(\bar\delta) \right] = \frac{\partial}{\partial \bar\delta} \left[ -\log_2 P(\bar\delta) \right] = -\frac{1}{\ln 2} \frac{p(\bar\delta + \frac{t}{2}) - p(\bar\delta - \frac{t}{2})}{P(\bar\delta)}$

By first-order approximation we can write:

(8) $p(\bar\delta + \tfrac{t}{2}) - p(\bar\delta - \tfrac{t}{2}) \approx t \cdot p'(\bar\delta)$

and

(9) $P(\bar\delta) = F(\bar\delta + \tfrac{t}{2}) - F(\bar\delta - \tfrac{t}{2}) \approx t \cdot p(\bar\delta),$

so that $\frac{\partial}{\partial \delta} \left[ -\log_2 P(\bar\delta) \right] \approx -\frac{1}{\ln 2} \frac{p'(\bar\delta)}{p(\bar\delta)} = \frac{\partial}{\partial \delta} \left[ -\log_2 p(\delta) \right] \Big|_{\delta = \bar\delta}.$
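This first-order argument can be verified numerically for the slab-only Gaussian case: for small bin width t, the straight-through gradient of the discrete model rate (eq. 7) closely matches the gradient of the continuous proxy (eq. 6). A small sketch (helper names are illustrative):

```python
import math

def gauss_pdf(x, s):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def gauss_cdf(x, s):
    return 0.5 * (1.0 + math.erf(x / (s * math.sqrt(2.0))))

def grad_continuous(delta, s):
    # Eq. 6 for a Gaussian slab: d/d_delta [-log2 p(delta)] = delta / (s^2 ln 2).
    return delta / (s * s * math.log(2))

def grad_discrete(delta_bar, t, s):
    # Eq. 7: gradient of -log2 P(delta_bar), with P a CDF difference,
    # passed through the straight-through estimator.
    num = gauss_pdf(delta_bar + t / 2, s) - gauss_pdf(delta_bar - t / 2, s)
    den = gauss_cdf(delta_bar + t / 2, s) - gauss_cdf(delta_bar - t / 2, s)
    return -num / (den * math.log(2))
```

At the symmetric zero-update the discrete gradient vanishes exactly, and away from zero the two gradients agree up to higher-order terms in t.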
A.2 Continuous vs discrete model rate penalty
The proof provided in the previous section is not restricted to specific designs for $p(\delta)$, and thus holds both for the spike-and-slab prior including a spike ($\alpha > 0$) and for the special case where no spike is used ($\alpha = 0$).
Figure 4 shows quantization of the model updates (top left) and its corresponding gradient (bottom left), exploiting the Straight-Through gradient estimator (Bengio et al., 2013). Also shown are the discrete (true) model rate and its continuous analogue (middle top), with their corresponding gradients (middle bottom), for the special case of not using a spike, i.e. $\alpha = 0$. One can see that the continuous model rate proxy is a shifted version of the discrete costs. Since such a translation does not influence the gradient (nor gradient-based optimization), the continuous model rate loss can be used during finetuning, preventing unstable training thanks to its smooth gradient.
Figure 4 right shows the same figure but for a model prior that includes a spike. Comparing the true (discrete) model rate costs of the slab-only prior (middle top) to these costs under the spike-and-slab prior (right top) shows how the introduced spike reduces the number of bits to encode the zero-update (i.e. the center bin), at the cost of making larger updates more expensive in number of bits. Another interesting phenomenon is visible when comparing the gradients of the discrete and continuous model rates for the slab-only (middle bottom) versus the spike-and-slab prior (right bottom): the effect of the spike almost fully disappears in the gradient of the discrete model rate. This is caused by the fact that most of the spike's mass is (by design) positioned inside the center quantization bin after quantization. Mathematically, this can be seen by substituting the spike-and-slab prior of eq. 3 into eq. 7, with $\phi_s, \Phi_s$ the PDF and CDF of the spike $\mathcal{N}(0, \sigma_s^2)$ and $\phi, \Phi$ those of the slab $\mathcal{N}(0, \sigma^2)$ (the normalization $\frac{1}{\alpha+1}$ cancels in the ratio):

(11) $\frac{\partial}{\partial \delta} \left[ -\log_2 P(\bar\delta) \right] = -\frac{1}{\ln 2} \cdot \frac{\alpha \left( \phi_s(\bar\delta + \frac{t}{2}) - \phi_s(\bar\delta - \frac{t}{2}) \right) + \phi(\bar\delta + \frac{t}{2}) - \phi(\bar\delta - \frac{t}{2})}{\alpha \left( \Phi_s(\bar\delta + \frac{t}{2}) - \Phi_s(\bar\delta - \frac{t}{2}) \right) + \Phi(\bar\delta + \frac{t}{2}) - \Phi(\bar\delta - \frac{t}{2})}$
We can distinguish different behavior of eq. 11 for two distinct ranges of $\bar\delta$:

- Zero-update ($\bar\delta = 0$): by symmetry around the zero-update, $\phi_s(\frac{t}{2}) = \phi_s(-\frac{t}{2})$ and $\phi(\frac{t}{2}) = \phi(-\frac{t}{2})$, so the numerator of eq. 11 vanishes and the gradient equals zero, for any value of $\alpha$.

- Nonzero update ($|\bar\delta| \geq t$): since $\sigma_s = \frac{t}{6}$, the bin edges lie at least three spike standard deviations from zero, so $\phi_s(\bar\delta \pm \frac{t}{2}) \approx 0$ and $\Phi_s(\bar\delta + \frac{t}{2}) - \Phi_s(\bar\delta - \frac{t}{2}) \approx 0$. The spike terms thus drop out, and the gradient reduces to the slab-only gradient of eq. 7.

Note that this approximation becomes less tight for large $\alpha$, as the small probabilities and cumulative densities of the spike distribution are multiplied by $\alpha$.
From the $\bar\delta = 0$ case, one can see that inclusion of the spike leaves the gradient unbiased. The $|\bar\delta| \geq t$ case shows that the spike does not influence the gradient in the limit where the spike's mass is entirely positioned within the center quantization bin. As the standard deviation of the spiky Gaussian was chosen to be $t/6$, roughly 0.3% of its mass is in practice quantized into an off-center quantization bin. This explains the slight increase of the off-center bins in the gradient of the discrete model rate costs in Fig. 4 (right bottom). Comparing the gradient of the discrete versus continuous model rate costs for the spike-and-slab prior in Fig. 4 (right bottom), we can see that the first-order approximation between the two introduces a larger error than in the slab-only case (Fig. 4, middle bottom). This can be explained by the fact that the introduced spiky Gaussian has a steeper tangent due to its narrower support than the Gaussian slab. Nevertheless, the use of the continuous model rate loss is preferred for finetuning as it much more strictly enforces zero-updates (thanks to the gradient peaks around the center bin) than its discrete counterpart.
Appendix B Sample selection from Xiph dataset
B.1 Xiph dataset
The Xiph test videos can be found at https://media.xiph.org/video/derf/. Like Rippel et al. (2019), we select all 1080p videos, and exclude computer-generated videos and videos with inappropriate licences, which leaves us with the following videos:
aspen_1080p  ducks_take_off_1080p50  red_kayak_1080p  speed_bag_1080p 
blue_sky_1080p25  in_to_tree_1080p50  riverbed_1080p25  station2_1080p25 
controlled_burn_1080p  old_town_cross_1080p50  rush_field_cuts_1080p  sunflower_1080p25 
crowd_run_1080p50  park_joy_1080p50  rush_hour_1080p25  tractor_1080p25 
dinner_1080p30  pedestrian_area_1080p25  snow_mnt_1080p  west_wind_easy_1080p 
B.2 Xiph5N 2 fps dataset
Due to computational limits, we draw a sample of five videos, which we refer to as the Xiph5N dataset. When drawing such a small sample randomly, there is a high probability of obtaining an unrepresentative sample, for example one containing too many videos with either low or high finetuning potential. To alleviate this problem, we use the following heuristic to select videos:

1. Evaluate the global model's \glsRD performance on all videos for all values of .
2. Per value, rank all videos based on their respective loss.
3. For each video, average the rank over all values of .
4. Order all videos according to their average rank, and select videos at evenly spaced percentiles.
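The selection heuristic above can be sketched as follows. The loss matrix here is randomly generated placeholder data standing in for the actual per-video, per-trade-off \glsRD losses:

```python
import numpy as np

# Hypothetical RD losses: rows = videos, cols = trade-off settings.
rng = np.random.default_rng(0)
losses = rng.random((20, 4))          # 20 Xiph videos, 4 trade-off values
video_names = [f"video_{i}" for i in range(20)]

# 1) rank videos per trade-off value (0 = lowest loss)
ranks = losses.argsort(axis=0).argsort(axis=0)

# 2) average each video's rank over all trade-off values
avg_rank = ranks.mean(axis=1)

# 3) order by average rank and pick evenly spaced percentiles (1/6 ... 5/6)
order = np.argsort(avg_rank)
n, k = len(video_names), 5
targets = [(i + 1) / (k + 1) for i in range(k)]   # target percentiles
picks = [video_names[order[round(t * (n - 1))]] for t in targets]
print(picks)  # five videos spread over the difficulty spectrum
```

Averaging ranks rather than raw losses keeps the selection insensitive to the loss scale differing across trade-off settings.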
The global model's \glsRD performance for all videos from the Xiph dataset is shown in Figure 5. The five videos that are part of Xiph5N are indicated with colors, and Table 2 provides more details about these five videos. The column titled RD-rank percentile shows the actual percentile at which the selected videos are ranked. The computed (target) percentiles for are 1/6, 2/6, …, 5/6. For each target percentile we selected the video ranked closest to it. The last column denotes the number of I-frames after subsampling to 2 fps. Videos whose original frame rate could not be subsampled to exactly 2 fps by an integer factor were subsampled with the factor resulting in the I-frame sampling frequency closest to 2 fps.
Video | RD-rank percentile | Target percentile | width x height x frames | Original fps | Duration (s) | Nr. of I-frames at 2 fps
in_to_tree | 0.22 | 0.167 | 1920 x 1080 x 500 | 50 | 10.0 | 20
aspen | 0.37 | 0.333 | 1920 x 1080 x 570 | 30 | 19.0 | 38
controlled_burn | 0.50 | 0.500 | 1920 x 1080 x 570 | 30 | 19.0 | 38
sunflower | 0.69 | 0.667 | 1920 x 1080 x 500 | 25 | 20.0 | 42
pedestrian_area | 0.82 | 0.833 | 1920 x 1080 x 375 | 25 | 15.0 | 32
AVERAGE | 0.52 | 0.500 | 1920 x 1080 x 503 | 32 | 16.6 | 34
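The subsampling rule for frame rates not evenly divisible down to 2 fps can be read as nearest-integer rounding of the frame-skip factor. A minimal sketch (note that for 25 fps the tie between factors 12 and 13 is resolved by Python's round-half-to-even, which is consistent with the I-frame counts in the table above):

```python
def iframe_skip(fps, target_fps=2.0):
    # keep every k-th frame, where k is the nearest integer to fps / target;
    # for 25 fps, round(12.5) resolves to 12 (about 2.08 fps)
    return round(fps / target_fps)

for fps in (50, 30, 25):
    k = iframe_skip(fps)
    print(f"{fps} fps -> keep every {k}th frame -> {fps / k:.2f} fps")
```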
Appendix C Global model architecture and training
For our neural compression model, we adopt the architecture proposed by Ballé et al. (2018), including the mean-scale prior from Minnen et al. (2018). We use a shared hyper-decoder to predict the mean and scale parameters. Like Ballé et al. (2018), we use a model architecture with fewer parameters for the low bitrate regime (). Table 3 lists the model architecture and the number of parameters, grouped per submodel. The upper row in this table links the terminology proposed by Ballé et al. (2018) to the conventional \glsVAE terminology that we follow in this work.
Figure 6 provides a visual overview of the model architecture, where we use and to indicate the latent and hyper-latent space respectively (referred to as and in the original paper by Ballé et al. (2018)). Note that even though we adopt a hierarchical latent variable model, we simplify notation by defining a single latent space throughout this work.
During training we adopt a mixed quantization strategy: the quantized latents are used to calculate the distortion loss (with their gradients estimated using the Straight-Through estimator of Bengio et al. (2013)), while noisy samples are used when computing the rate loss.
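A minimal numpy sketch of this mixed quantization strategy; the Straight-Through gradient copy is shown only as a comment, since plain numpy has no autograd:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)  # continuous latents produced by the encoder

# Distortion branch: hard rounding. In an autograd framework the
# Straight-Through estimator copies gradients past round(), e.g.
#   z_hat = z + stop_gradient(round(z) - z)
z_hat = np.round(z)

# Rate branch: additive uniform noise on [-0.5, 0.5) acts as a
# differentiable surrogate for quantization in the rate (likelihood) term
z_noisy = z + rng.uniform(-0.5, 0.5, size=z.shape)

# both surrogates stay within half a quantization bin of the input
assert np.all(np.abs(z_hat - z) <= 0.5)
assert np.all(np.abs(z_noisy - z) <= 0.5)
```

The noisy sample is passed to the prior's density when evaluating the rate loss, while the hard-rounded latent feeds the decoder for the distortion loss.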
Optimization of eq. 1 on the training fold of the CLIC19 dataset was done using the Adam optimizer with default settings (Kingma and Ba, 2014), and took 1.5M steps for the low bitrate models and 2.0M steps for the high bitrate models. Each step used random crops of pixels, and the initial learning rate was set to and lowered to at of training.
| Transmitter | | Receiver | | | |
| Encoder | Hyper Encoder | Hyperprior | Hyper Decoder | Decoder | Nr. of receiver parameters |

Low bitrate model
Layers x output channels | 4x192 | 3x128 | 3x3 | 2x128 + 1x256 | 3x192 + 1x3 | -
Nr. of parameters | 2.89M | 1.04M | 5.50k | 1.26M | 2.89M | 4.16M

High bitrate model
Layers x output channels | 4x320 | 3x192 | 3x3 | 2x192 + 1x384 | 3x320 + 1x3 | -
Nr. of parameters | 8.01M | 2.40M | 8.26k | 2.95M | 8.01M | 10.97M
Appendix D Temporal Ablation
In this experiment we investigate the trade-off between the number of frames that the model is finetuned on and the final performance. The more frames, the higher (potentially) the diversity of the content (making finetuning more difficult), but the lower the bitrate overhead (in bits/pixel) due to model updates. This ablation repeats our main experiment on the sunflower video (Fig. 8) for a varying number of I-frames.
We sample frames (equispaced) from the full video, starting at the zeroth index. The experiment is run for , and the two outermost rate-distortion trade-offs: . Note that the original experiment was done with frames sampled at 2 fps, resulting in for the sunflower video.
Figure 7 shows (for the low and high bitrate regimes) the total loss and its subdivision into distortion and the different rate terms, as a function of the number of finetuning frames. Full-model finetuning outperforms encoder-only finetuning in all cases, except for in the low bitrate regime. In this case, the model rate causes the total rate to become too high to be competitive with the baselines. This in turn is mainly caused by the initial cost , which can only be amortized over a single frame. In general, for other values of , these initial costs were found to contribute only little to the total rate (with even a negligible contribution in the low and high bitrate regions for and , respectively).
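The amortization effect can be made concrete with a small back-of-the-envelope helper. The 1 Mbit update size below is a hypothetical figure for illustration, not a measured value:

```python
def model_rate_bpp(model_bits, n_frames, height=1080, width=1920):
    # model-update overhead in bits/pixel, amortized over all transmitted frames
    return model_bits / (n_frames * height * width)

# e.g. 1 Mbit of (hypothetical) quantized model updates
print(model_rate_bpp(1e6, 1))   # ~0.48 bpp: prohibitive for a single frame
print(model_rate_bpp(1e6, 42))  # ~0.01 bpp: negligible over a full video
```

The same absolute update cost is thus either dominant or negligible, depending purely on how many frames share it.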
Note that the global model's performance varies noticeably as a function of . Apparently, the first frame of this video is easy to compress, thereby lowering the total loss for small sets of I-frames. To make fair comparisons, one should thus only consider the performance of encoder-only and full-model finetuning relative to the \glsRD-trained baseline. In line with our main findings in Fig. 2c, full-model finetuning shows the biggest improvements in the high bitrate setting.
Interestingly, when comparing the low and high bitrate regimes, the total relative gain of full-model finetuning follows a similar pattern for varying values of (higher gain for higher ). However, the subdivision of this gain into rate and distortion gains differs, due to the different trade-off setting . For the high bitrate, mainly distortion is reduced (row 2), whereas for the low bitrate, rate is predominantly reduced (row 3). These rate and distortion reduction plots clearly show how the flexibility of full-model (compared to encoder-only) finetuning improves results under various conditions.
This experiment has shown that the potential of full-model finetuning (under the current model architecture and prior) seems highest for video compression purposes, as gains are negative (due to the relatively high static initial costs in the low bitrate regime) or only marginal (in the high bitrate regime) when overfitting on a single frame. Yet, we hypothesize that full-model finetuning could still be useful for (single-)image compression as well, given other choices for the model architecture and/or model prior. Also, the provided ablation is run on one video only, so further research is needed to investigate full-model finetuning in an image compression setup.
Appendix E Rate-distortion finetuning performance per video
Figure 8 shows the \glsRD plots for the different videos after finetuning for 100k steps. Full-model finetuning outperforms the global model, encoder-only finetuning, and direct latent optimization for all videos. The blue lines, indicating global model performance, differ per video, which might influence the finetuning gains, which also differ per video (e.g. controlled_burn versus sunflower). True entropy-coded results are used to create these graphs, rather than the computed \glsRD values. Deviations between entropy-coded and computed rates were found to be negligible (mean deviation of 1.94e-04 bits/pixel for and 1.06e-03 bits/pixel for ). Throughout this paper, all training graphs and ablations are therefore reported using the computed values rather than the entropy-coded results.
Figure 9 shows the finetuning progression over training steps for each of these videos. Here too, differences in performance are visible among videos. The videos with the highest finetuning gains, e.g. sunflower, show quicker performance improvements at the start of finetuning, and also continue to improve more over time.
Appendix F Model Updates Distributions
Figure 10 shows how the (quantized) model updates become much sparser (top row) when finetuning includes the spike-and-slab model rate loss , compared to unregularized finetuning (bottom row).
Footnotes
 https://www.compression.cc/2019/challenge/
 https://media.xiph.org/video/derf/
 All frames of the videos in Xiph5N 2 fps have spatial resolution . To make this shape compatible with the strided convolutions in our model, we pad each frame to before encoding. After reconstruction, the frame is cropped back to its original resolution for evaluation.
 Due to cutting (at most) of mass from the tails of to enable quantization, the difference in cumulative masses in section A.1 should be renormalized by . As this is a constant division inside a , it results in a constant subtraction of . Since this normalization does not influence the gradient, we omitted it for the sake of clarity.
References
 QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720. Cited by: §2.4.
 Block-optimized variable bit rate neural image compression. In CVPR Workshops, pp. 2551–2554. Cited by: §2.3.
 Density modeling of images using a generalized normalization transformation. In 4th International Conference on Learning Representations, ICLR 2016, Cited by: §5.3.
 Variational image compression with a scale hyperprior. In International Conference on Learning Representations, Cited by: Table 3, Appendix C, Appendix C, §1, §2.1, §4.2.
 Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §A.1, §A.2, Appendix C, §3.3.
 Content adaptive optimization for neural image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1, §2.3, §4.3.
 Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pp. 1078–1086. Cited by: §1, §2.3, §4.3, §5.1.
 Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6421–6429. Cited by: §2.1.
 Variable rate image compression with content adaptive optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 122–123. Cited by: §1, §2.3.
 Video compression with ratedistortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042. Cited by: §2.1, §2.2.
 Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.4.
 Minimal random code learning: getting bits back from compressed model parameters. In International Conference on Learning Representations, Cited by: §2.4.
 Adam: a method for stochastic optimization. ICLR. Cited by: Appendix C, §4.3.
 Auto-Encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.1.
 Utilising low complexity CNNs to lift non-local redundancies in video coding. IEEE Transactions on Image Processing. Cited by: §2.4.
 Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv preprint arXiv:1912.09802. Cited by: §2.4.
 Compressing weight-updates for image artifacts removal neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §2.4.
 Efficient adaptation of neural network filter for video compression. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 358–366. Cited by: §2.4.
 Learned video compression via joint spatial-temporal correlation exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11580–11587. Cited by: §2.1.
 Bayesian compression for deep learning. In Advances in neural information processing systems, pp. 3288–3298. Cited by: §2.4.
 Content adaptive and error propagation aware deep video compression. arXiv preprint arXiv:2003.11282. Cited by: §1, §2.3.
 DVC: an end-to-end deep video compression framework. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
 Communicationefficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §2.4.
 Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780. Cited by: Appendix C, §4.2.
 Learned video compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3454–3463. Cited by: §B.1.
 The spike-and-slab lasso. Journal of the American Statistical Association 113 (521), pp. 431–444. Cited by: §3.2.
 Lossy image compression with compressive autoencoders. In International Conference on Learning Representations. Cited by: §1.
 Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431. Cited by: §2.1.
 Feedback recurrent autoencoder. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3347–3351. Cited by: §2.1.
 Improving inference for neural image compression. Advances in Neural Information Processing Systems 33. Cited by: §1, §2.3.
 LC–learning to learn to compress. In 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. Cited by: §2.4.