Learning Better Lossless Compression Using Lossy Compression
Abstract
We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.
1 Introduction
The need to efficiently store the ever-growing amounts of data generated continuously on mobile devices has spurred a lot of research on compression algorithms. Algorithms like JPEG [50] for images and H.264 [52] for videos are used by billions of people daily.
After the breakthrough results achieved with deep neural networks in image classification [26], and the subsequent rise of deep-learning-based methods, learned lossy image compression has emerged as an active area of research (e.g. [5, 44, 45, 36, 1, 3, 29, 27, 47]). In lossy compression, the goal is to achieve small bitrates given a certain allowed distortion in the reconstruction, i.e., the rate-distortion trade-off is optimized. In contrast, in lossless compression, no distortion is allowed, and we aim to reconstruct the input perfectly by transmitting as few bits as possible. To this end, a probabilistic model of the data can be used together with entropy coding techniques to encode and transmit data via a bitstream. The theoretical foundation for this idea is given in Shannon’s landmark paper [39], which proves a lower bound for the bitrate achievable by such a probabilistic model, and the overhead incurred by using an imprecise model of the data distribution. One beautiful result is that maximizing the likelihood of a parametric probabilistic model is equivalent to minimizing the bitrate obtained when using that model for lossless compression with an entropy coder (see, e.g., [28]). Learning parametric probabilistic models by likelihood maximization has been studied to a great extent in the generative modeling literature (e.g. [49, 48, 38, 33, 24]). Recent works have linked these results to learned lossless compression [28, 17, 46, 23].
Even though recent learned lossy image compression methods achieve state-of-the-art results on various data sets, the results obtained by the non-learned H.265-based BPG [42, 6] are still highly competitive, without requiring sophisticated hardware accelerators such as GPUs to run. While BPG was outperformed by learning-based approaches across the bitrate spectrum in terms of PSNR [29] and visual quality [3], it still excels particularly at high-PSNR lossy reconstructions.
In this paper, we propose a learned lossless compression system by leveraging the power of the lossy BPG, as illustrated in Fig. 1.
Specifically, we decompose the input image $x$ into the lossy reconstruction $x_l$ produced by BPG and the corresponding residual $r = x - x_l$. We then learn a probabilistic model $p(r \mid x_l)$ of the residual, conditionally on the lossy reconstruction $x_l$. This probabilistic model is fully convolutional and can be evaluated using a single forward pass, both for encoding and decoding. We combine it with an arithmetic coder to losslessly compress the residual and store or transmit the image as the concatenation of the bitstrings produced by BPG and the residual compressor. Further, we use a computationally inexpensive technique from the generative modeling literature, tuning the “certainty” (temperature) of $p(r \mid x_l)$, as well as an auxiliary shallow classifier to predict the quantization parameter of BPG in order to optimize our compressor on a per-image basis. These components together lead to a state-of-the-art full-resolution learned lossless compression system.
All of our code and data sets are available on GitHub.
In contrast to recent work in lossless compression, we do not need to compute and store any side information (as opposed to L3C [28]), and our CNN is lightweight enough to train and evaluate on high-resolution natural images (as opposed to [17, 23], which have not been scaled to full-resolution images to our knowledge).
In summary, our main contributions are:

We leverage the power of the classical state-of-the-art lossy compression algorithm BPG in a novel way to build a conceptually simple learned lossless image compression system.

Our system is optimized on a per-image basis with a lightweight post-training step, where we obtain a lower-bitrate probability distribution by adjusting the confidence of the predictions of our probabilistic model.

Our system outperforms the state of the art in learned lossless full-resolution image compression, L3C [28], as well as the classical engineered algorithms WebP, JPEG2000, and PNG. Further, in contrast to L3C, we also outperform FLIF on Open Images, the domain where our approach (as well as L3C) is trained.
2 Related Work
Learned Lossless Compression
Arguably most closely related to this paper, Mentzer et al. [28] build a computationally cheap hierarchical generative model (termed L3C) to enable practical compression on full-resolution images.
Townsend et al. [46] and Kingma et al. [23] leverage the “bits-back scheme” [16] for lossless compression of an image stream, where the overall bitrate of the stream is reduced by leveraging previously transmitted information. Motivated by recent progress in generative modeling using (continuous) flow-based models (e.g. [34, 22]), Hoogeboom et al. [17] propose Integer Discrete Flows (IDFs), defining an invertible transformation for discrete data. In contrast to L3C, the latter works focus on smaller data sets such as MNIST, CIFAR-10, ImageNet32, and ImageNet64, where they achieve state-of-the-art results.
Likelihood-Based Generative Modeling
As mentioned in Section 1, virtually every generative model can be used for lossless compression, when used with an entropy coding algorithm. Therefore, while the following generative approaches do not take a compression perspective, they are still related. The state-of-the-art PixelCNN [49]-based models rely on autoregression in RGB space to efficiently model a conditional distribution. The original PixelCNN [49] and PixelRNN [48] model the probability distribution of a pixel given all previous pixels (in raster-scan order). To use these models for lossless compression, $\mathcal{O}(H \cdot W)$ forward passes are required, where $H$ and $W$ are the image height and width, respectively. Various speed optimizations and a probability model amenable to faster training were proposed in [38]. Several other parallelization techniques were developed, including those from [33], modeling the image distribution conditionally on subsampled versions of the image, as well as those from [24], conditioning on an RGB pyramid and grayscale images. Similar techniques were also used by [8, 30].
Engineered Lossless Compression Algorithms
The widespread PNG [32] applies simple autoregressive filters to remove redundancies from the RGB representation (e.g. replacing pixels with the difference to their left neighbor), and then uses the DEFLATE [10] algorithm for compression. In contrast, WebP [51] uses larger windows to transform the image (enabling patch-wise conditional compression), and relies on a custom entropy coder for compression. Mainly in use for lossy compression, JPEG2000 [40] also has a lossless mode, where an invertible mapping from RGB to compression space is used. At the heart of FLIF [41] is an entropy coding method called “meta-adaptive near-zero integer arithmetic coding” (MANIAC), which is based on the CABAC method used in, e.g., H.264 [52]. In CABAC, the context model used to compress a symbol is selected from a finite set based on local context [35]. The “meta-adaptive” part in MANIAC refers to the context model, which is a decision tree learned per image.
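To make the left-neighbor differencing mentioned for PNG concrete, the following minimal Python sketch applies and inverts a Sub-style filter on one row of 8-bit values. It assumes a single-channel row, and the function names are illustrative rather than taken from any PNG library; real PNG encoders choose among several filter types per row and offset by the pixel size in bytes.

```python
def sub_filter(row):
    """Replace each byte by its difference to the left neighbor (mod 256)."""
    return [(v - (row[i - 1] if i > 0 else 0)) % 256 for i, v in enumerate(row)]


def sub_unfilter(filtered):
    """Invert sub_filter by accumulating the differences (mod 256)."""
    out = []
    for i, d in enumerate(filtered):
        left = out[i - 1] if i > 0 else 0
        out.append((d + left) % 256)
    return out


# A smooth row turns into small values that a generic compressor handles well.
row = [100, 101, 103, 103, 102]
filtered = sub_filter(row)        # [100, 1, 2, 0, 255]
restored = sub_unfilter(filtered)
```

The filter is exactly invertible, so no information is lost; it only reshapes the value distribution so that most symbols cluster around zero (modulo 256).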
Artifact Removal
Artifact removal methods in the context of lossy compression are related to our approach in that they aim to make predictions about the information lost during the lossy compression process. In this context, the goal is to produce sharper and/or more visually pleasing images given a lossy reconstruction from, e.g., JPEG. Dong et al. [11] proposed the first CNN-based approach using a network inspired by super-resolution networks. [43] extends this using a residual structure, and [7] relies on hierarchical skip connections and a multi-scale loss. Generative models in the context of artifact removal are explored by [12], which proposes to use GANs [13] to obtain more visually pleasing results.
3 Background
3.1 Lossless Compression
We give a very brief overview of lossless compression basics here and refer to the information theory literature for details [39, 9]. In lossless compression, we consider a stream of symbols $x_1, \dots, x_N$, where each $x_i$ is an element from the same finite set $\mathcal{X}$. The stream is obtained by drawing each symbol $x_i$ independently from the same distribution $\tilde{p}$, i.e., the $x_i$ are i.i.d. according to $\tilde{p}$. We are interested in encoding the symbol stream into a bitstream, such that we can recover the exact symbols by decoding. In this setup, the entropy of $\tilde{p}$ is equal to the expected number of bits needed to encode each $x_i$:
$$H(\tilde{p}) = \mathbb{E}_{x_i \sim \tilde{p}}\left[-\log_2 \tilde{p}(x_i)\right].$$
In general, however, the exact $\tilde{p}$ is unknown, and we instead consider the setup where we have an approximate model $p$. Then, the expected bitrate will be equal to the cross-entropy between $\tilde{p}$ and $p$, given by:
$$H(\tilde{p}, p) = \mathbb{E}_{x_i \sim \tilde{p}}\left[-\log_2 p(x_i)\right]. \tag{1}$$
Intuitively, the larger the discrepancy between the model $p$ used for coding and the real $\tilde{p}$, the more bits we need to encode data that is actually distributed according to $\tilde{p}$.
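As a small worked example of the entropy and cross-entropy discussed above, the following Python sketch computes both in bits for a toy alphabet. The four-symbol distributions and the function names are made up for illustration; the gap between the two quantities is the overhead paid for coding with an imprecise model.

```python
import math


def entropy_bits(p_true):
    """Shannon entropy in bits: expected code length with a perfect model."""
    return -sum(p * math.log2(p) for p in p_true if p > 0)


def cross_entropy_bits(p_true, p_model):
    """Expected bits per symbol when data from p_true is coded with p_model."""
    return -sum(pt * math.log2(pm) for pt, pm in zip(p_true, p_model) if pt > 0)


# Hypothetical 4-symbol alphabet.
p_true = [0.5, 0.25, 0.125, 0.125]
p_model = [0.25, 0.25, 0.25, 0.25]  # an imprecise (uniform) model

h = entropy_bits(p_true)                   # 1.75 bits per symbol
ce = cross_entropy_bits(p_true, p_model)   # 2.0 bits per symbol
overhead = ce - h                          # 0.25 bits per symbol
```

The overhead equals the Kullback-Leibler divergence between the true distribution and the model, which is why fitting the model more closely directly lowers the bitrate.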
Entropy Coding
Given a symbol stream $x_1, \dots, x_N$ as above and a probability distribution $p$ (not necessarily $\tilde{p}$), we can encode the stream using entropy coding. Intuitively, we would like to build a table that maps every element in $\mathcal{X}$ to a bit sequence, such that $x_i$ gets a short sequence if $p(x_i)$ is high. The optimum is to output $\log_2 (1 / p(x_i))$ bits for symbol $x_i$, which is what entropy coding algorithms aim to achieve. Examples include Huffman coding [18] and arithmetic coding [53].
In general, we can use a different distribution $p_i$ for every symbol $x_i$ in the stream, as long as the $p_i$ are also available for decoding. Adaptive entropy coding algorithms work by allowing such varying distributions as a function of previously encoded symbols. In this paper, we use adaptive arithmetic coding [53].
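The relation between symbol probabilities and code lengths can be illustrated with Huffman coding, one of the entropy coders named above. The sketch below is an illustrative implementation (not the adaptive arithmetic coder used in this paper) that computes only the code lengths; the example distribution is made up and chosen so that all probabilities are powers of two.

```python
import heapq


def huffman_code_lengths(probs):
    """Return the Huffman code length in bits for each symbol index."""
    # Heap items: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to all symbols below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)                   # [1, 2, 3, 3]
avg_bits = sum(p * n for p, n in zip(probs, lengths))   # 1.75
```

For this distribution every length equals $\log_2(1/p)$ exactly. For general distributions Huffman coding has to round to whole bits per symbol, which is one reason arithmetic coding is preferred when the model assigns very high probability to some symbols.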
3.2 Lossless Image Compression with CNNs
As explained in the previous section, all we need for lossless compression is a model $p$, since we can use entropy coding to encode and decode any input losslessly given $p$. In particular, we can use a CNN to parametrize $p$. To this end, one general approach is to introduce (structured) side information $z$ available both at encoding and decoding time, and model the probability distribution of natural images $x$ conditionally on $z$, using the CNN to parametrize $p(x \mid z)$.
One key difference among the approaches in the literature is the factorization of $p(x, z)$. In the original PixelCNN paper [48] the image $x$ is modeled as a sequence of pixels, and $z$ corresponds to all previous pixels. Encoding as well as decoding are done autoregressively. In IDF [17], $x$ is mapped to a $z$ using an invertible function, and $z$ is then encoded using a fixed prior $p(z)$, i.e., here $z$ is a deterministic function of $x$. In approaches based on the bits-back paradigm [46, 23], while encoding, $z$ is obtained by decoding from additional available information (e.g. previously encoded images). In L3C [28], $z$ corresponds to features extracted with a hierarchical model that are also saved to the bitstream using hierarchically predicted distributions.
3.3 BPG
BPG is a lossy image compression method based on the HEVC video coding standard [42], essentially applying HEVC on a single image. To motivate our usage of BPG, we show the histogram of the marginal pixel distribution of the residuals obtained by BPG on Open Images (one of our testing sets, see Section 5.1) in Fig. 2. Note that while the possible range of a residual is $\{-255, \dots, 255\}$, we observe that for most images, nearly every point in the residual is in the restricted set , which is indicative of the high-PSNR nature of BPG. Additionally, Fig. A1 (in the suppl.) presents a comparison of BPG to the state-of-the-art learned image compression methods, showing that BPG is still very competitive in terms of PSNR.
BPG follows JPEG in having a chroma format parameter to enable color space subsampling, which we disable by setting it to 4:4:4. The only remaining parameter to set is the quantization parameter $Q$, where . Smaller $Q$ results in less quantization and thus better quality (i.e., different to the quality factor of JPEG, where larger means better reconstruction quality). We learn a classifier to predict $Q$, described in Section 4.4.
4 Proposed Method
We give an overview of our method in Fig. 1. To encode an image $x$, we first obtain the quantization parameter $Q$ from the Q-Classifier (QC) network (Section 4.4). Then, we compress $x$ with BPG to obtain the lossy reconstruction $x_l$, which we save to a bitstream. Given $x_l$, the Residual Compressor (RC) network (Section 4.1) predicts the probability mass function of the residual $r = x - x_l$, i.e., $p(r \mid x_l) = \mathrm{RC}(x_l)$.
We model $p(r \mid x_l)$ as a discrete mixture of logistic distributions (Section 4.2). Given $p(r \mid x_l)$ and $r$, we compress $r$ to the bitstream using an adaptive arithmetic coding algorithm (see Section 3.1). Thus, the bitstream consists of the concatenation of the codes corresponding to $x_l$ and $r$. To decode $x$ from the bitstream, we first obtain $x_l$ using the BPG decoder, then we obtain once again $p(r \mid x_l) = \mathrm{RC}(x_l)$, and subsequently decode $r$ from the bitstream using $p(r \mid x_l)$. Finally, we can reconstruct $x = x_l + r$. In the formalism of Section 3.2, the side information is $z = x_l$.
Note that no matter how bad RC is at predicting the real distribution of $r$, we can always do lossless compression. Even if RC were to predict, e.g., a uniform distribution, decoding would still be exact; we would just need many bits to store $r$.
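The decomposition into a lossy reconstruction and a residual can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_lossy_codec` is a crude uniform quantizer standing in for BPG (it is not the real codec), and the entropy coding of the residual is omitted, since it affects the bit count but not whether the reconstruction is exact.

```python
def toy_lossy_codec(pixels, step=8):
    """Stand-in for BPG (NOT the real codec): snap 8-bit values to bin centers."""
    return [(v // step) * step + step // 2 for v in pixels]


def encode(pixels):
    """Split the input into a lossy reconstruction x_l and a residual r."""
    x_l = toy_lossy_codec(pixels)
    r = [a - b for a, b in zip(pixels, x_l)]  # r = x - x_l
    return x_l, r


def decode(x_l, r):
    """Recover the input exactly: x = x_l + r."""
    return [a + b for a, b in zip(x_l, r)]


x = [0, 3, 17, 128, 200, 255]
x_l, r = encode(x)          # r == [-4, -1, -3, -4, -4, 3]
x_decoded = decode(x_l, r)  # identical to x
```

Even with this crude stand-in, the residual occupies a much smaller range than the original 8-bit values, which is the property the residual model exploits.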
4.1 Residual Compressor
We use a CNN inspired by ResNet [14] and U-Net [37], shown in detail in Fig. 3. We first extract an initial feature map with channels, which we then downscale using a stride-2 convolution, and feed through 16 residual blocks. Instead of BatchNorm [19] layers as in ResNet, our residual blocks contain GDN layers proposed by [4]. Subsequently, we upscale back to the resolution of the input image using a transposed convolution. The resulting features are concatenated with the initial feature map, and convolved to contract the channels back to , like in U-Net. Finally, the network splits into four tails, predicting the different parameters of the mixture model, $\pi$, $\mu$, $\sigma$, $\lambda$, described next.
4.2 Logistic Mixture Model
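We use a discrete mixture of logistics to model the probability mass function of the residual, $p(r \mid x_l)$, similar to [28, 38]. We closely follow the formulation of [28] here: Let $c$ denote the RGB channel and $u, v$ the spatial location. We define
$$p(r \mid x_l) = \prod_{u,v} p(r_{1uv}, r_{2uv}, r_{3uv} \mid x_l). \tag{2}$$
We use a (weak) autoregression over the three RGB channels to define the joint distribution over channels via logistic mixtures $p_m$:
$$p(r_1, r_2, r_3 \mid x_l) = p_m(r_1 \mid x_l) \cdot p_m(r_2 \mid x_l, r_1) \cdot p_m(r_3 \mid x_l, r_2, r_1), \tag{3}$$
where we removed the indices $uv$ to simplify the notation. For the mixture $p_m$ we use a mixture of $K$ logistic distributions $p_L$. Our distributions are defined by the outputs of the RC network, which yields mixture weights $\pi^k_{cuv}$, means $\mu^k_{cuv}$, variances $\sigma^k_{cuv}$, as well as mixture coefficients $\lambda^k_{cuv}$. The autoregression over RGB channels is only used to update the means using a linear combination of $\mu$ and the target $r$ of previous channels, scaled by the coefficients $\lambda$. We thereby obtain $\tilde{\mu}$:
$$\tilde{\mu}^k_{1uv} = \mu^k_{1uv}, \qquad \tilde{\mu}^k_{2uv} = \mu^k_{2uv} + \lambda^k_{\alpha uv}\, r_{1uv}, \qquad \tilde{\mu}^k_{3uv} = \mu^k_{3uv} + \lambda^k_{\beta uv}\, r_{1uv} + \lambda^k_{\gamma uv}\, r_{2uv}. \tag{4}$$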
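For readers unfamiliar with discretized logistic mixtures, the following Python sketch evaluates such a probability mass function for a single sub-pixel. It follows the common formulation in which each integer value receives the logistic mass of its unit-width bin, with the two boundary bins absorbing the tails; this convention, the residual range of -255 to 255, and all parameter values are assumptions for illustration and are not taken from the RC network.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_pmf(r, mu, sigma, lo=-255, hi=255):
    """Mass of integer r under a logistic(mu, sigma) discretized to unit bins."""
    upper = 1.0 if r >= hi else sigmoid((r + 0.5 - mu) / sigma)
    lower = 0.0 if r <= lo else sigmoid((r - 0.5 - mu) / sigma)
    return upper - lower


def mixture_pmf(r, weights, mus, sigmas):
    """Weighted sum of discretized logistics (weights must sum to one)."""
    return sum(w * logistic_pmf(r, m, s) for w, m, s in zip(weights, mus, sigmas))


# Hypothetical two-component mixture for one sub-pixel.
weights, mus, sigmas = [0.7, 0.3], [0.0, 2.0], [1.0, 3.0]
total_mass = sum(mixture_pmf(r, weights, mus, sigmas) for r in range(-255, 256))
bits_for_zero = -math.log2(mixture_pmf(0, weights, mus, sigmas))
```

The channel autoregression described above would only change the `mus` passed in for the second and third channel, by adding the scaled residuals of the previous channels before the mass is evaluated.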
4.3 Loss
As motivated in Section 3.1, we are interested in minimizing the cross-entropy between the real distribution of the residual $\tilde{p}(r \mid x_l)$ and our model $p(r \mid x_l)$: the smaller the cross-entropy, the closer $p$ is to $\tilde{p}$, and the fewer bits an entropy coder will use to encode $r$. We consider the setting where we have $N$ training images $x^{(1)}, \dots, x^{(N)}$. For every image, we compute the lossy reconstruction $x_l^{(i)}$ as well as the corresponding residual $r^{(i)} = x^{(i)} - x_l^{(i)}$. While the true distribution $\tilde{p}$ is unknown, we can consider the empirical distribution obtained from the samples and minimize:
$$\mathcal{L} = -\sum_{i=1}^{N} \log p\left(r^{(i)} \mid x_l^{(i)}\right). \tag{7}$$
This loss decomposes over samples, allowing us to minimize it over mini-batches. Note that minimizing Eq. 7 is the same as maximizing the likelihood of $p(r \mid x_l)$, which is the perspective taken in the likelihood-based generative modeling literature.
4.4 Q-Classifier
A random set of natural images is expected to contain images of varying “complexity”, where complex can mean a lot of high-frequency structure and/or noise. While virtually all lossy compression methods have a parameter like BPG’s $Q$ to navigate the trade-off between bitrate and quality, it is important to note that compressing a random set of natural images with the same fixed $Q$ will usually lead to the bitrates of these images being spread around some $Q$-dependent mean. Thus, in our approach, it is suboptimal to fix $Q$ for all images.
Indeed, in our pipeline we have a trade-off between the bits allocated to BPG and the bits allocated to encoding the residual. This trade-off can be controlled with $Q$: For example, if an image contains components that are easier for the RC network to model, it is beneficial to use a higher $Q$, such that BPG does not waste bits encoding these components. We observe that for a fixed image, and a trained RC, there is a single optimal $Q$.
To efficiently obtain a good $Q$, we train a simple classifier network, the Q-Classifier (QC), and then use $Q = \mathrm{QC}(x)$ to compress $x$ with BPG. For the architecture, we use a lightweight ResNet-inspired network with 8 residual blocks for QC, and train it to predict a class in , given an image $x$ ( was selected using the Open Images validation set). In contrast to ResNet, we employ no normalization layers (to ensure that the prediction is independent of the input size). Further, the final features are obtained by average pooling each of the final 256 channels of the feature map. The resulting 256-dimensional vector is fed to a fully connected layer, to obtain the logits for the classes, which are then normalized with a softmax. Details are shown in Section A.1 in the supplementary material.
While the input to QC is the full-resolution image, the network is shallow and downsamples multiple times, making this a computationally lightweight component.
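The final stage of the classifier described above (channel-wise average pooling, a fully connected layer, and a softmax) can be sketched in plain Python. The feature maps, weights, and the choice of three classes below are hypothetical placeholders; the point is only that global average pooling yields a fixed-size vector regardless of the spatial size of the input.

```python
import math


def global_average_pool(feature_map):
    """feature_map[c][y][x] -> one mean per channel, for any spatial size."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(vec, weights, biases):
    """Fully connected layer with one row of weights per output class."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-channel feature maps at two different spatial sizes.
small = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
large = [[row * 2 for row in ch] * 2 for ch in small]  # same content tiled to 4x4

weights = [[0.5, -0.25], [0.1, 0.3], [-0.2, 0.4]]  # 3 hypothetical classes
biases = [0.0, 0.1, -0.1]

probs_small = softmax(linear(global_average_pool(small), weights, biases))
probs_large = softmax(linear(global_average_pool(large), weights, biases))
predicted_class = max(range(len(probs_small)), key=probs_small.__getitem__)
```

Both inputs produce the same class probabilities here because their channel means coincide, which illustrates why the head itself places no constraint on the input resolution.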
4.5 Optimization
Inspired by the temperature scaling employed in the generative modeling literature (e.g. [21]), we further optimize the predicted distribution with a simple trick: Intuitively, if RC predicts a $\tilde{\mu}$ that is close to the target $r$, we can make the cross-entropy in Eq. 7 (and thus the bitrate) smaller by making the predicted logistic “more certain” by choosing a smaller $\sigma$. This shifts probability mass towards $r$. However, there is a breaking point, where we make it “too certain” (i.e., the probability mass concentrates too tightly around $\tilde{\mu}$) and the cross-entropy increases again.
While RC is already trained to learn a good $\sigma$, the prediction is only based on $x_l$. We can improve the final bitrate during encoding, when we additionally have access to the target $r$, by rescaling the predicted $\sigma^k_c$ with a factor $\tau^k_c$, chosen for every mixture $k$ and every channel $c$. This yields a better $\tilde{\sigma}^k_c = \tau^k_c \cdot \sigma^k_c$. Since $\tau$ also needs to be known for decoding, we have to transmit it via the bitstream. However, since we only learn a $\tau$ for every channel and every mixture (and not for every spatial location), this causes a completely negligible overhead of 15 floats.
We find $\tau$ for a given image by minimizing the negative log-likelihood in Eq. 7 on that image, i.e., we optimize
$$\min_{\tau} \; -\sum_{c,u,v} \log p_{\tau}(r_{cuv} \mid x_l), \tag{8}$$
where $p_{\tau}$ is equal to $p$ predicted from RC but using $\tilde{\sigma}$. To optimize Eq. 8, we use stochastic gradient descent with a very high learning rate of and momentum , which converges in 10-20 iterations, depending on the image.
We note that this is also computationally cheap. Firstly, we only need to do the forward pass through RC once, to get $\pi$, $\mu$, $\sigma$, $\lambda$, and then in every step of the optimization, we only need to evaluate $\tau \cdot \sigma$ and subsequently Eq. 8. Secondly, the optimization is only over 15 parameters. Finally, since practical images contain far more pixels than there are parameters to fit, we can do the sum in Eq. 8 over a spatially subsampled set of locations.
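The effect of rescaling the predicted scale can be reproduced on a toy example. The sketch below makes several simplifying assumptions that differ from the procedure described above: it uses a single logistic instead of a mixture, one scalar factor `tau` instead of one per channel and mixture, and a grid search instead of stochastic gradient descent. The residual values and predicted parameters are made up.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_pmf(r, mu, sigma):
    """Probability of integer r under a logistic discretized to unit bins."""
    return sigmoid((r + 0.5 - mu) / sigma) - sigmoid((r - 0.5 - mu) / sigma)


def total_bits(residuals, mus, sigmas, tau):
    """Code length when every predicted scale is multiplied by tau."""
    return -sum(math.log2(logistic_pmf(r, m, tau * s))
                for r, m, s in zip(residuals, mus, sigmas))


# Hypothetical case: the model predicts the right mean but is under-confident.
residuals = [0, 0, 1, -1, 0, 0, 1, 0]
mus = [0.0] * len(residuals)
sigmas = [2.0] * len(residuals)

candidates = [0.1 * k for k in range(1, 21)]  # tau from 0.1 to 2.0
best_tau = min(candidates, key=lambda t: total_bits(residuals, mus, sigmas, t))
bits_default = total_bits(residuals, mus, sigmas, 1.0)
bits_best = total_bits(residuals, mus, sigmas, best_tau)
```

In this toy setting the best factor is below one, so sharpening the distribution saves bits, while the smallest candidate is already past the breaking point and costs more than the optimum.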
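Table 1: Compression performance in bpsp on the four test sets.

| Method [bpsp] | Open Images | CLIC.mobile | CLIC.pro | DIV2K |
|---|---|---|---|---|
| RC (Ours) | 2.790 | 2.538 | 2.933 | 3.079 |
| L3C | 2.991 | 2.639 | 2.944 | 3.094 |
| PNG | 4.005 | 3.896 | 3.997 | 4.235 |
| JPEG2000 | 3.055 | 2.721 | 3.000 | 3.127 |
| WebP | 3.047 | 2.774 | 3.006 | 3.176 |
| FLIF | 2.867 | 2.492 | 2.784 | 2.911 |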
5 Experiments
5.1 Data sets
Training
Like L3C [28], we train on images from the Open Images data set [25]. These images are made available as JPEGs, which is not ideal for the lossless compression task we are considering, but we are not aware of a similarly large scale lossless training data set. To prevent overfitting on JPEG artifacts, we downscale each training image using a factor randomly selected from by means of the Lanczos filter provided by the Pillow library [31]. For a fair comparison, the L3C baseline results were also obtained by training on the exact same data set.
Evaluation
We evaluate our model on four data sets: Open Images is a subset of 500 images from the Open Images validation set, preprocessed like the training data. CLIC.mobile and CLIC.pro are two new data sets commonly used in recent image compression papers, released as part of the “Workshop and Challenge on Learned Image Compression” (CLIC) [54]. CLIC.mobile contains 61 images taken using cell phones, while CLIC.pro contains 41 images from DSLRs, retouched by professionals. Finally, we evaluate on the 100 images from DIV2K [2], a super-resolution data set with high-quality images. We show examples from these data sets in Section A.3.
For a small fraction of exceptionally high-resolution images (note that the considered testing sets contain images of widely varying resolution), we follow L3C in extracting 4 non-overlapping crops from the image such that combining the crops yields the original image. We then compress the crops individually. However, we evaluate the non-learned baselines on the full images to avoid a bias in favor of our method.
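One simple way to obtain four non-overlapping crops that recombine into the original image is to split it into quadrants. The sketch below assumes this quadrant layout, which is an illustrative choice and not necessarily the exact cropping scheme used in the experiments; images are represented as nested lists of rows.

```python
def split_into_quadrants(img):
    """Split img[y][x] into top-left, top-right, bottom-left, bottom-right."""
    h, w = len(img), len(img[0])
    hh, hw = h // 2, w // 2
    top, bottom = img[:hh], img[hh:]
    return [
        [row[:hw] for row in top], [row[hw:] for row in top],
        [row[:hw] for row in bottom], [row[hw:] for row in bottom],
    ]


def merge_quadrants(quadrants):
    """Inverse of split_into_quadrants."""
    tl, tr, bl, br = quadrants
    return [a + b for a, b in zip(tl, tr)] + [a + b for a, b in zip(bl, br)]


# Odd sizes are fine: the right and bottom crops simply get the extra pixels.
image = [[10 * y + x for x in range(5)] for y in range(3)]
crops = split_into_quadrants(image)
restored = merge_quadrants(crops)
```

Each crop can then be compressed independently, and the per-crop bit counts are summed to obtain the cost of the full image.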
5.2 Training Procedures
Residual Compressor
We train for epochs on batches of 16 random crops extracted from the training set, using the RMSProp optimizer [15]. We start with an initial learning rate (LR) of , which we decay every iterations by a factor of . Since our Q-Classifier is trained on the output of a trained RC network, it is not available while training the RC network. Thus, we compress the training images with a random $Q$ selected from , obtaining a pair $(x_l, r)$ for every image.
Q-Classifier
Given a trained RC network, we randomly select 10% of the training set, and compress each selected image $x$ once for each candidate $Q$, obtaining an $x_l$ for each $Q$. We then evaluate RC for each pair $(x, x_l)$ to find the optimal $Q'$ that gives the minimum bitrate for that image. The resulting list of pairs $(x, Q')$ forms the training set for the QC. For training, we use a standard cross-entropy loss between the softmax-normalized logits and the one-hot encoded ground truth $Q'$. We train for 11 epochs on batches of 32 random crops, using the Adam optimizer [20]. We set the initial LR to the Adam-default , and decay after 5 and 10 epochs by a factor of .
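The label for each training image is simply the candidate that minimizes the total bit count, i.e., the bits spent by BPG plus the bits spent on the residual. A minimal sketch of this selection is shown below; the candidate values and all bit counts are invented for illustration, and `best_q` is a hypothetical helper name.

```python
def best_q(bpg_bits, residual_bits):
    """Return the candidate Q with the smallest total bit count, plus all totals.

    Both arguments map a candidate Q to a number of bits for one image.
    """
    totals = {q: bpg_bits[q] + residual_bits[q] for q in bpg_bits}
    return min(totals, key=totals.get), totals


# Hypothetical measurements for one image: a higher Q makes the BPG stream
# smaller but leaves more information for the residual coder.
bpg_bits = {12: 90_000, 14: 70_000, 16: 55_000}
residual_bits = {12: 60_000, 14: 75_000, 16: 95_000}

q_label, totals = best_q(bpg_bits, residual_bits)  # q_label == 14
```

Computing this label requires one BPG run and one RC evaluation per candidate, which is the cost that the trained classifier avoids at encoding time.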
5.3 Architecture and Training Ablations
Training on Fixed Q
As noted in Section 5.2, we select a random $Q$ during training, since QC is only available after training. We explored fixing $Q$ to one value (trying ) and found that this hurts generalization performance. This may be explained by the fact that RC sees more varied residual statistics during training if $Q$ is chosen at random.
Effect of the Crop Size
Using crops of to train a model evaluated on full-resolution images may seem too constraining. To explore the effect of crop size, we trained different models, each seeing the same number of pixels in every iteration, but distributed differently in terms of batch size vs. crop size. We trained each model for iterations, and then evaluated on the Open Images validation set (using a fixed $Q$ for training and testing). The results are shown in the following table and indicate that smaller crops and bigger batch sizes are beneficial.
| Batch Size | Crop Size | BPSP on Open Images |
|---|---|---|
| 16 | | 2.854 |
| 4 | | 2.864 |
| 1 | | 2.877 |
GDN
We found that the GDN layers are crucial for good performance. We also explored instance normalization, and conditional instance normalization layers, in the latter case conditioning on the bitrate of BPG, in the hope that this would allow the network to distinguish different operation modes. However, we found that instance normalization is more sensitive to the resolution used for training, which led to worse overall bitrates.
6 Results and Discussion
6.1 Compression performance in bpsp
We follow previous work in evaluating bits per sub-pixel (each RGB pixel has 3 sub-pixels), bpsp for short, sometimes called bits per dimension. In Table 1, we show the performance of our approach on the described test sets. On Open Images, the domain where we train, we outperform all methods, including FLIF. Note that while L3C was trained on the same data set, it does not outperform FLIF. On the other data sets, we consistently outperform both L3C and the non-learned approaches PNG, WebP, and JPEG2000.
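For reference, bits per sub-pixel can be computed from a compressed size and the image dimensions as in the following short Python sketch. The function name and the example numbers are illustrative only.

```python
def bits_per_subpixel(num_bytes, height, width, channels=3):
    """Compressed size in bits divided by the number of sub-pixels."""
    return 8 * num_bytes / (height * width * channels)


# Hypothetical example: a 100x100 RGB image stored in 7500 bytes.
bpsp = bits_per_subpixel(7500, 100, 100)  # 60000 bits / 30000 sub-pixels = 2.0
```

An uncompressed 8-bit RGB image corresponds to 8 bpsp, so the values in Table 1 can be read directly as a fraction of the raw size.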
These results indicate that our simple approach of using a powerful lossy compressor to compress the high-level image content and leveraging a complementary learned probabilistic model to model the low-level variations for lossless residual compression is highly effective. Even though we only train on Open Images, our method can generalize to various domains of natural images: mobile phone pictures (CLIC.mobile), images retouched by professional photographers (CLIC.pro), as well as high-quality images with diverse complex structures (DIV2K).
In Fig. 4 we show the bpsp of each of the 500 images of Open Images, when compressed using our method, FLIF, and PNG. For our approach, we also show the bits used to store $x_l$ for each image, measured in bpsp on top (“$x_l$ only”), and as a percentage on the bottom. The percentage averages at 42%, going up towards the high-bpsp end of the figure. This plot shows the wide range of bpsp covered by a random set of natural images, and motivates our Q-Classifier. We can also see that while our method tends to outperform FLIF on average, FLIF is better for some high-bpsp images, where the bpsp of both FLIF and our method approach that of PNG.
[Figure 5. Panels: input/output; lossy reconstruction; residual; two samples from our predicted residual distribution.]
6.2 Runtime
We compare the decoding speed of RC to that of L3C for images, using an NVIDIA Titan Xp. Our components take: BPG: 163ms; RC: 166ms; arithmetic coding: 89.1ms; i.e., 418ms in total, compared to L3C’s 374ms.
QC and optimization are only needed for encoding. We discussed above that both components are computationally cheap. In terms of actual runtime: QC: 6.48ms; optimization: 35.2ms.
6.3 Q-Classifier and Optimization
In Table 2 we show the benefits of using the Q-Classifier as well as the optimization. We show the resulting bpsp for the Open Images validation set (top) and for DIV2K (bottom), as well as the percentage of predicted $Q$ that are close to the optimal $Q'$, against a baseline of using a fixed $Q$ (the mean over QC’s training set, see Section 5.2). The last column shows the required number of forward passes through RC.
Q-Classifier
We first note that even though the QC was only trained on Open Images (see Section 5.2), we get similar behavior on Open Images and DIV2K. Moreover, we see that using QC is clearly beneficial over using a fixed $Q$ for all images, and only incurs a small increase in bpsp compared to using the optimal $Q'$ (2.794 vs. 2.789 for Open Images, 3.088 vs. 3.080 for DIV2K). This can be explained by the fact that QC manages to predict a $Q$ close to $Q'$ for 94.8% of the images in Open Images and 90.2% of the DIV2K images.
Furthermore, the small increase in bpsp is traded for a reduction from requiring one forward pass per candidate $Q$ to compute $Q'$ to a single one. In that sense, using the QC is similar to the “fast” modes common in image compression algorithms, where speed is traded against bitrate.
Optimization
Table 2 shows that using Optimization on top of QC reduces the bitrate on both testing sets.
Discussion
While the gains of both components are small, their computational complexity is also very low (see Section 6.2). As such, we found it quite impressive to get the reported gains. We believe that tuning a handful of parameters post-training on an instance basis is a very promising direction for image compression. One fruitful direction could be using dedicated architectures and including a tuning step end-to-end as in meta-learning.
6.4 Visualizing the learned $p(r \mid x_l)$
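Table 2: Effect of the Q-Classifier (QC) and the optimization on bpsp.

| Data set | Setup | bpsp | Predicted Q close to optimal | # forward passes |
|---|---|---|---|---|
| Open Images | Optimal Q | 2.789 | 100% | |
| Open Images | Fixed Q | 2.801 | 82.6% | 1 |
| Open Images | Our QC | 2.794 | 94.8% | 1 |
| Open Images | Our QC + optimization | 2.790 | | 1 |
| DIV2K | Optimal Q | 3.080 | 100% | |
| DIV2K | Fixed Q | 3.096 | 73.0% | 1 |
| DIV2K | Our QC | 3.088 | 90.2% | 1 |
| DIV2K | Our QC + optimization | 3.079 | | 1 |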
While the bpsp results from the previous section validate the compression performance of our model, it is interesting to investigate the distribution predicted by RC. Note that we predict a mixture distribution per pixel, which is hard to visualize directly. Instead, we sample from the predicted distribution. We expect the samples to be visually similar to the ground-truth residual $r$.
The sampling results are shown in Fig. 5, where we visualize two images from CLIC.pro with their lossy reconstructions, as obtained by BPG. We also show the ground-truth residuals $r$. Then, we show two samples obtained from the probability distribution predicted by our RC network. For the top image, $r$ is in , for the bottom it is in (cf. Fig. 2), and we re-normalized $r$ to the RGB range for visualization, but to reduce eye strain we replaced the most frequent value (, i.e., gray), with white.
We can clearly see that our approach i) learned to model the noise patterns discarded by BPG that are inherent to these images, ii) learned to correctly predict a zero residual where BPG manages to perfectly reconstruct, and iii) learned to predict structures similar to the ones in the ground truth.
7 Conclusion
In this paper, we showed how to leverage BPG to achieve state-of-the-art results in full-resolution learned lossless image compression. Our approach outperforms L3C, PNG, WebP, and JPEG2000 consistently, and also outperforms the hand-crafted state-of-the-art FLIF on images from the Open Images data set. Future work should investigate input-dependent optimizations, which are also used by FLIF and which we started to explore here by optimizing the scale of the probabilistic model for the residual (optimization). Similar approaches could also be applied to latent probability models of lossy image and video compression methods.
Appendix A Learning Better Lossless Compression Using Lossy Compression – Supplementary
A.1 Q-Classifier Architecture
We show the architecture for the Q-Classifier in Table A1. Residual denotes a sequence of convolution, ReLU, convolution, with a skip connection adding the input to the output (as in [14], but without BatchNorm).
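Table A1: Q-Classifier architecture.

| Layer | Input channels | Output channels | Stride |
|---|---|---|---|
| Conv + ReLU | 3 | 64 | 2 |
| Conv + ReLU | 64 | 128 | 2 |
| Residual | 128 | 128 | |
| Conv | 128 | 256 | 2 |
| Residual | 256 | 256 | |
| ChannelAvg. | 256 | 256 | |
| Linear | 256 | | |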
A.2 BPG Performance
A.3 Examples from the testing sets
We provide additional visual examples here:
Specifically, we show one image from each of our testing sets, alongside the residual $r$ and a sample from $p(r \mid x_l)$, which is expected to be visually similar to $r$. Please refer to Section 6.4 for details on sampling and the visualization.
Footnotes
 https://github.com/fabjul/RCPyTorch
 We write $p(x)$ to denote the entire probability mass function and $p(x_i)$ to denote $p(x)$ evaluated at $x_i$.
References
 (2017) SofttoHard Vector Quantization for EndtoEnd Learning Compressible Representations. In NIPS, Cited by: §1.
 (2017) NTIRE 2017 Challenge on Single Image SuperResolution: Dataset and Study. In CVPR Workshops, Cited by: §5.1.
 (2019) Generative Adversarial Networks for Extreme Learned Image Compression. In ICCV, Cited by: §1, §1.
 (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281. Cited by: §4.1.
 (2016) End-to-end Optimized Image Compression. In ICLR, Cited by: §1.
 BPG Image format. https://bellard.org/bpg/ Cited by: §1.
 (2017) CAS-CNN: a deep convolutional neural network for image compression artifact suppression. In IJCNN, pp. 752–759. Cited by: §2.
 (2018) PixelSNAIL: An Improved Autoregressive Generative Model. In ICML, Cited by: §2.
 (2012) Elements of Information Theory. John Wiley & Sons. Cited by: §3.1.
 (1996) DEFLATE compressed data format specification version 1.3. Technical report Cited by: §2.
 (2015) Compression artifacts reduction by a deep convolutional network. In ICCV, pp. 576–584. Cited by: §2.
 (2017) Deep generative adversarial compression artifact removal. In ICCV, pp. 4826–4835. Cited by: §2.
 (2014) Generative adversarial nets. In NIPS, Cited by: §2.
 (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §A.1, §4.1.
 Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent. Cited by: §5.2.
 (1993) Keeping neural networks simple by minimizing the description length of the weights. In COLT, Cited by: §2.
 (2019) Integer discrete flows and lossless compression. In NIPS, Cited by: §1, §1, §2, §3.2.
 (1952) A method for the construction of minimum-redundancy codes. Proc. IRE 40 (9), pp. 1098–1101. Cited by: §3.1.
 (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, Cited by: §4.1.
 (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.2.
 (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §4.5.
 (2016) Improved variational inference with inverse autoregressive flow. In NIPS, Cited by: §2.
 (2019) Bit-Swap: recursive bits-back coding for lossless compression with hierarchical latent variables. In ICML, Cited by: §1, §1, §2, §3.2.
 (2017) PixelCNN Models with Auxiliary Variables for Natural Image Modeling. In ICML, Cited by: §1, §2.
 (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html. Cited by: §5.1.
 (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
 (2018) Conditional Probability Models for Deep Image Compression. In CVPR, Cited by: §1.
 (2019) Practical full resolution learned lossless image compression. In CVPR, Cited by: 3rd item, §1, §1, §2, §3.2, §4.2, §4.2, Table 1, §5.1.
 (2018) Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In NeurIPS, Cited by: Figure A1, §A.2, §1, §1.
 (2018) Image Transformer. ICML. Cited by: §2.
 Pillow Library for Python. https://python-pillow.org Cited by: §5.1.
 Portable Network Graphics (PNG). http://libpng.org/pub/png/libpng.html Cited by: §2.
 (2017) Parallel Multiscale Autoregressive Density Estimation. In ICML, Cited by: §1, §2.
 (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. Cited by: §2.
 (2004) H.264 and MPEG-4 video compression: video coding for next-generation multimedia. John Wiley & Sons. Cited by: §2.
 (2017) Real-Time Adaptive Image Compression. In ICML, Cited by: §1.
 (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §4.1.
 (2017) PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR, Cited by: §1, §2, §4.2, §4.2.
 (1948) A Mathematical Theory of Communication. Bell System Technical Journal 27 (3), pp. 379–423. Cited by: §1, §3.1.
 (2001) The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine 18 (5), pp. 36–58. Cited by: §2.
 (2016) FLIF: Free lossless image format based on MANIAC compression. In ICIP, Cited by: §2.
 (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12), pp. 1649–1668. Cited by: §1, §3.3.
 (2016) Compression artifacts removal using convolutional neural networks. arXiv preprint arXiv:1605.00366. Cited by: §2.
 (2017) Lossy Image Compression with Compressive Autoencoders. In ICLR, Cited by: §1.
 (2017) Full Resolution Image Compression with Recurrent Neural Networks. In CVPR, Cited by: §1.
 (2019) Practical lossless compression with latent variables using bits back coding. In ICLR, Cited by: §1, §2, §3.2.
 (2018) Deep Generative Models for Distribution-Preserving Lossy Compression. In NeurIPS, Cited by: §1.
 (2016) Conditional Image Generation with PixelCNN Decoders. In NIPS, Cited by: §1, §2, §3.2.
 (2016) Pixel Recurrent Neural Networks. In ICML, Cited by: §1, §2.
 (1992) The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38 (1), pp. xviii–xxxiv. Cited by: §1.
 WebP Image format. https://developers.google.com/speed/webp/ Cited by: §2.
 (2003) Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13 (7), pp. 560–576. Cited by: §1, §2.
 (1987) Arithmetic coding for data compression. Communications of the ACM 30 (6), pp. 520–540. Cited by: §3.1, §3.1.
 Workshop and Challenge on Learned Image Compression. https://www.compression.cc/challenge/ Cited by: §5.1.