Compression is at the heart of effective representation learning. However, lossy compression is typically achieved through simple parametric models like Gaussian noise to preserve analytic tractability, and the limitations this imposes on learning are largely unexplored. Further, the Gaussian prior assumptions in models such as variational autoencoders (VAEs) provide only an upper bound on the compression rate in general. We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. The noise is constructed in a data-driven fashion that does not require restrictive distributional assumptions. With its complex encoding mechanism and exact rate regularization, Echo leads to improved bounds on log-likelihood and dominates $\beta$-VAEs across the achievable range of rate-distortion trade-offs. Further, we show that Echo noise can outperform state-of-the-art flow methods without the need to train complex distributional transformations.
Exact Rate-Distortion in Autoencoders via Echo Noise
Rob Brekelmans 1 Daniel Moyer 1 Aram Galstyan 1 Greg Ver Steeg 1
Rate-distortion theory provides an organizing principle for representation learning that is enshrined in machine learning as the information bottleneck principle (Tishby et al., 2000). The goal is to compress input random variables $X$ into a representation $Z$ with mutual information rate $R = I(X;Z)$, while minimizing a distortion measure $D$ that captures our ability to use the representation for a task. For the rate to be restricted, some information must be lost through noise. Despite the use of increasingly complex encoding functions via neural networks, simple noise models like Gaussians still dominate the literature because of their analytic tractability. Unfortunately, the effect of restrictive noise assumptions on the quality of learned representations is not well understood.
The Variational Autoencoding (VAE) framework (Kingma & Welling, 2013; Rezende et al., 2014) has provided the basis for a number of recent developments in representation learning (Achille & Soatto, 2016; Chen et al., 2016; Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Tschannen et al., 2018). While VAEs were originally motivated as performing posterior inference under a generative model, several recent works have viewed the Evidence Lower Bound objective as corresponding to an unsupervised rate-distortion problem (Alemi et al., 2018; Rezende & Viola, 2018; Achille & Soatto, 2016). From this perspective, reconstruction of the input provides the distortion measure, while the KL divergence between encoder and prior gives an upper bound on the information rate that depends heavily on the choice of prior (Tomczak & Welling, 2017; Alemi et al., 2018; Rosca et al., 2018; Lastras-Montano, 2018).
In this work, we deconstruct this interpretation of VAEs and their extensions. Do the restrictive assumptions of the Gaussian noise model limit the quality of VAE representations? Does forcing the latent space to be independent and Gaussian constrain the expressivity of our models? We find evidence to support both claims, showing that a powerful noise model can achieve more efficient lossy compression and that relaxing prior or marginal assumptions can lead to better bounds on both information rate and log-likelihood under the same encoding-decoding architecture.
The main contribution of this paper is the introduction of the Echo noise channel, a powerful, data-driven improvement over Gaussian channels whose compression rate can be precisely expressed for arbitrary input distributions. Echo noise is constructed from the empirical distribution of its inputs, allowing its variation to reflect that of the source (see Fig. 1). We leverage this relationship to derive an analytic form for mutual information that avoids distributional assumptions on either the noise or the encoding marginal.
While Echo noise can be used to measure compression in any rate-distortion setting, we focus on the case of unsupervised learning. Adopting the view of latent variable modeling as performing lossy compression, we demonstrate that the Echo model provides improved bounds on log-likelihood for several image datasets. We show that the Echo channel avoids the need to specify a prior, and instead implicitly uses the optimal prior in the Evidence Lower Bound. This marginal distribution is neither Gaussian nor independent in general.
After introducing the Echo noise channel and an exact characterization of its information rate, we proceed to motivate Variational Autoencoders from the perspective of rate-distortion. We then formally define our objective and provide additional context on rate-distortion and related work. Finally, we report log-likelihood results and visualize the space of possible compression-reconstruction trade-offs using rate-distortion curves.
To avoid learning representations that memorize the data, we would like to constrain the mutual information between the input $X$ and the representation $Z$. Since we have freedom to choose how to encode the data, we can design a noise model that allows us to calculate this generally intractable quantity.
The Echo noise channel has a shift-and-scale form that mirrors the reparameterization trick in VAEs. Referring to the observed data distribution as $q(x)$, with $x \in \mathbb{R}^{d_x}$, we can define the stochastic encoder $q(z|x)$ using:

$$z = f(x) + S(x)\,\epsilon \tag{1}$$

(Footnote: Our approach is also easily adapted to multiplicative noise, such as in Achille & Soatto (2016).)
For brevity, we omit the subscripts that indicate that the function $f$ and matrix function $S$ depend on neural networks parameterized by $\phi$. All that remains to specify the encoder is to fix the distribution of the noise variable, $\epsilon$. For VAEs, the noise is typically chosen to be Gaussian, $\epsilon \sim \mathcal{N}(0, I)$.
For our goal of calculating mutual information, we will need to compare the marginal entropy $H(Z)$, which integrates over samples $x \sim q(x)$, and the conditional entropy $H(Z|X)$, whose stochasticity is only due to the noise $\epsilon$ for deterministic $f$ and $S$. The choice of noise will affect both quantities, and our approach is to relate them by enforcing an equivalence between the distributions $q(z)$ and $q(\epsilon)$.
Since $q(z)$ is constructed using the source, we can also imagine defining the noise in a data-driven way. For instance, we could draw $\epsilon \sim q(z)$ in an effort to make the noise resemble the channel output. However, this changes the distribution of $z$ and the noise would need to be updated to continue resembling the output.
By iteratively applying Eq. 1, we can guarantee that the noise and marginal distributions match in the limit. Echo noise is thus constructed using an infinite sum over terms that resemble attenuated “echoes” of the transformed data samples. Using superscripts on $x$ to indicate iid samples, $x^{(\ell)} \sim q(x)$, we draw $\epsilon$ according to:

$$\epsilon = f(x^{(0)}) + S(x^{(0)}) \Big( f(x^{(1)}) + S(x^{(1)}) \big( f(x^{(2)}) + \cdots \big) \Big) \tag{2}$$
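This can be written more compactly as follows.
The Echo noise distribution, $\mathrm{Echo}(f, S, q(x))$, for functions $f$, $S$ defined over $q(x)$ with $x \in \mathbb{R}^{d_x}$, is defined by the sampling procedure:

$$\epsilon = \sum_{\ell=0}^{\infty} \Big( \prod_{\ell'=0}^{\ell-1} S(x^{(\ell')}) \Big) f(x^{(\ell)}), \qquad x^{(\ell)} \overset{iid}{\sim} q(x) \tag{3}$$

The noise distribution, $q(\epsilon)$, is only defined implicitly through a sampling procedure. For this to be meaningful, we must ensure the infinite sum converges.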
Lemma 2.1. The infinite sum in Eq. 3 converges, and thus Echo noise sampling is well-behaved, if $\exists\, M, r$ s.t. $|f(x)| \le M$ and $\rho(S(x)) \le r < 1$ for all $x$, where $\rho$ is the spectral radius.
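The sampling procedure can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the authors' implementation: $S$ is taken to be diagonal, the infinite sum is truncated at a finite number of terms, and the names `echo_noise`, `echo_encode`, `f_vals`, and `s_vals` are hypothetical.

```python
import numpy as np

def echo_noise(f_vals, s_vals, n_terms, rng):
    """Draw one truncated Echo noise sample for a diagonal S.

    f_vals, s_vals: arrays of shape (n_pool, d_z) holding f(x) and the
    diagonal of S(x) for a pool of iid inputs (hypothetical layout).
    n_terms: number of terms kept from the infinite sum (an assumption).
    """
    # iid indices into the pool, drawn with replacement
    idx = rng.integers(0, f_vals.shape[0], size=n_terms)
    eps = np.zeros(f_vals.shape[1])
    # Evaluate the nested form f0 + s0 * (f1 + s1 * (f2 + ...)) from the inside out
    for i in idx[::-1]:
        eps = f_vals[i] + s_vals[i] * eps
    return eps

def echo_encode(f_x, s_x, eps):
    """Shift-and-scale encoder z = f(x) + S(x) * eps for diagonal S."""
    return f_x + s_x * eps

# Toy pool with constant f = 1 and s = 0.5, so the result is deterministic
rng = np.random.default_rng(0)
f_vals = np.ones((4, 2))
s_vals = 0.5 * np.ones((4, 2))
eps = echo_noise(f_vals, s_vals, n_terms=3, rng=rng)
z = echo_encode(np.zeros(2), 0.5 * np.ones(2), eps)
```

With bounded `f_vals` and `s_vals` strictly below one, each dropped term shrinks geometrically, which is why a finite truncation can approximate the infinite sum.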
Although the form of the noise distribution may be complex, it has the key property that it exactly matches the encoding marginal $q(z)$.
Lemma 2.2 (Echo noise matches signal).
If $\epsilon \sim \mathrm{Echo}(f, S, q(x))$ and $z = f(x) + S(x)\,\epsilon$, then $z$ has the same distribution as $\epsilon$.
We can observe this by simply re-labeling the sample indices in the expanded expression for the noise in Eq. 2. In particular, the training example that we condition on in Eq. 1 can be seen as specifying the first sample, $x^{(0)}$, in a draw from the noise. This equivalence allows us to derive an exact expression for the mutual information rate across an Echo channel for arbitrary input distributions.
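Theorem 2.3 (Echo Information).
For any source distribution $q(x)$ and a noisy channel defined by Eq. 1 that satisfies the conditions of Lemma 2.1, the mutual information is

$$I(X;Z) = -\mathbb{E}_{q(x)} \log \left| \det S(x) \right| \tag{4}$$

We start by expanding the definition of mutual information in terms of entropies, reiterating that the stochasticity underlying $z$ after conditioning on $x$ is due only to the random variable $\epsilon$.

$$\begin{aligned}
I(X;Z) &= H(Z) - H(Z|X) \\
&= H(Z) - \mathbb{E}_{q(x)}\, H(Z \mid X = x) \\
&= H(Z) - \mathbb{E}_{q(x)}\, H\big(f(x) + S(x)\,\epsilon \mid X = x\big) \\
&= H(Z) - \mathbb{E}_{q(x)}\, H\big(S(x)\,\epsilon \mid X = x\big) \\
&= H(Z) - \mathbb{E}_{q(x)} \big[ H(\epsilon) + \log \left| \det S(x) \right| \big] \\
&= -\mathbb{E}_{q(x)} \log \left| \det S(x) \right|
\end{aligned}$$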
Since $f(x)$ and $S(x)$ are treated as constants inside the expectation, we can use the translation invariance of differential entropy in the fourth line, and the scaling property in the fifth line (Cover & Thomas, 2006). Finally we use Lemma 2.2 to cancel out the entropy terms. ∎
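As a sanity check on the theorem, consider a scalar Gaussian case that is not taken from the paper: assume $f(x) = x$ with $x \sim \mathcal{N}(0, \sigma^2)$ and a constant scale $S(x) = s$ with $0 < s < 1$. Under these illustrative assumptions both the noise and the output are Gaussian, so the rate can be computed directly and compared with the determinant expression.

```latex
\begin{aligned}
\epsilon &= \sum_{\ell=0}^{\infty} s^{\ell} x^{(\ell)}
  \;\sim\; \mathcal{N}\!\Big(0, \; \sigma^2 \sum_{\ell=0}^{\infty} s^{2\ell}\Big)
  = \mathcal{N}\!\Big(0, \; \frac{\sigma^2}{1 - s^2}\Big) \\
\mathrm{Var}(z) &= \mathrm{Var}(x) + s^2\, \mathrm{Var}(\epsilon)
  = \sigma^2 + \frac{s^2 \sigma^2}{1 - s^2}
  = \frac{\sigma^2}{1 - s^2} = \mathrm{Var}(\epsilon) \\
I(X;Z) &= H(Z) - H(Z|X)
  = \tfrac{1}{2} \log\!\big(2 \pi e \, \mathrm{Var}(z)\big)
  - \tfrac{1}{2} \log\!\big(2 \pi e \, s^2\, \mathrm{Var}(\epsilon)\big)
  = -\log s
\end{aligned}
```

The second line reproduces the matching of noise and output distributions, and the third agrees with $-\mathbb{E}_{q(x)} \log |\det S(x)| = -\log s$.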
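In this work, we consider only diagonal $S(x)$ as is typical for VAEs, so that the determinant in Eq. 4 simplifies as

$$I(X;Z) = -\sum_{j} \mathbb{E}_{q(x)} \log |s_j(x)|$$

where $s_j(x)$ denotes the $j$-th diagonal entry of $S(x)$. In this case, the Echo channel can be interpreted as parallel channels, with mutual information additive across dimensions.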
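For the diagonal case, the rate reduces to an average of log-scales, which is cheap to evaluate on a batch. The following is a minimal sketch assuming the diagonal entries of $S(x)$ are available as an array; the function name `echo_rate_diagonal` and the array layout are hypothetical.

```python
import numpy as np

def echo_rate_diagonal(s_vals):
    """Batch average of -sum_j log|s_j(x)|, in nats.

    s_vals: array of shape (batch, d_z) with the diagonal of S(x)
    for each example (hypothetical layout).
    """
    s_vals = np.asarray(s_vals, dtype=float)
    # The rate is additive across latent dimensions for diagonal S
    per_example = -np.sum(np.log(np.abs(s_vals)), axis=1)
    # The expectation over q(x) is approximated by the batch mean
    return per_example.mean()

# Two latent dimensions, each scaled by 0.5 for every example
rate = echo_rate_diagonal(0.5 * np.ones((3, 2)))
```

Scales close to one contribute almost nothing to the rate, so such dimensions pass very little information about the input.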
We can visualize applying Echo noise to a complex input distribution in Fig. 1, using the identity transformation $f(x) = x$ and a constant noise scaling $S(x)$. Here, we directly observe the equivalence of the noise and output distributions. Further, the data-driven nature of the Echo channel means it can leverage the structure in the (transformed) input to destroy information in a more targeted way than spherical Gaussian noise.
In particular, the ability of the Echo channel to add noise that is correlated across dimensions distinguishes it from common diagonal noise assumptions. It is important to note that the noise still reflects dependence in $f(x)$ even when $S(x)$ is diagonal. In fact, we show in the Appendix that $TC(Z) = TC(Z|X)$ for this case, where total correlation measures the divergence from independence, e.g. $TC(Z) = D_{KL}[\,q(z)\,\|\,\prod_j q(z_j)\,]$ (Watanabe, 1960).
In the setting of learned $f(x)$ and $S(x)$, notice that the noise depends on the parameters. This means that training gradients are propagated through $\epsilon$, unlike traditional VAEs where the noise distribution is fixed. This may be a factor in improved performance: data samples are used as both signal and noise in different parts of the optimization, leading to a more efficient use of data.
Finally, the Echo channel fulfills several of the desirable properties that often motivate Gaussian noise and prior assumptions. Eqs. 1 and 3 define a simple sampling procedure that only requires a supply of iid samples from the input distribution. It is easy to sample both the noise and conditional distributions for the purposes of evaluating expectations, as is the case with Gaussian noise. However, Echo also provides a way to sample from the true encoding marginal $q(z)$ via its equivalence with $q(\epsilon)$. While we cannot evaluate the density of a given $z$ under $q(z|x)$ or $q(z)$, we can characterize their relationship on average using the mutual information in Eq. 4. These ingredients make Echo noise useful for representation learning within the autoencoding framework.
Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) seek to maximize the log-likelihood of data under a latent factor generative model defined by $p_\theta(x, z) = p(z)\, p_\theta(x|z)$. Here, $\theta$ represents the parameters of the generative model decoder and $p(z)$ is the prior distribution over latent variables. However, maximum likelihood is intractable in general due to the difficult integral over $z$, $p_\theta(x) = \int p(z)\, p_\theta(x|z)\, dz$.
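To avoid this problem, VAEs introduce a variational distribution, $q_\phi(z|x)$, which encodes the input data and approximates the generative model posterior $p_\theta(z|x)$. This leads to the tractable (average) Evidence Lower Bound (ELBO) on likelihood:

$$\mathbb{E}_{q(x)} \log p_\theta(x) \;\ge\; \mathbb{E}_{q(x)} \Big[ \mathbb{E}_{q_\phi(z|x)} \log p_\theta(x|z) - D_{KL}[\,q_\phi(z|x)\,\|\,p(z)\,] \Big]$$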
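The connection between VAEs and rate-distortion theory arises from a decomposition of the KL divergence term seen in Hoffman & Johnson (2016). In particular, the true encoding marginal $q_\phi(z) = \mathbb{E}_{q(x)}\, q_\phi(z|x)$ provides the unique, optimal choice for the prior, which tightens the lower bound on likelihood.

$$\mathbb{E}_{q(x)} D_{KL}[\,q_\phi(z|x)\,\|\,p(z)\,] = I(X;Z) + D_{KL}[\,q_\phi(z)\,\|\,p(z)\,]$$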
This decomposition lends insight into the orthogonal goals of the ELBO regularization term. The mutual information encourages lossy compression of the data into a latent code, while the marginal divergence enforces consistency with the prior. The non-negativity of the KL divergence implies that each of these terms detracts from our bound.
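The split of the average KL term into a mutual information and a marginal divergence can be checked numerically on a small discrete example. The distributions below are made-up toy values chosen only for illustration, and the helper `kl` is a hypothetical name.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy source q(x), encoder q(z|x) (one row per x), and prior p(z)
q_x = np.array([0.5, 0.3, 0.2])
q_z_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
p_z = np.array([0.5, 0.25, 0.25])

# Encoding marginal q(z) = sum_x q(x) q(z|x)
q_z = q_x @ q_z_given_x

avg_kl = sum(q_x[i] * kl(q_z_given_x[i], p_z) for i in range(3))
mutual_info = sum(q_x[i] * kl(q_z_given_x[i], q_z) for i in range(3))
marginal_div = kl(q_z, p_z)
```

Here `avg_kl` equals `mutual_info + marginal_div` up to floating point error, and replacing `p_z` with `q_z` makes the marginal divergence vanish.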
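In the absence of any domain-relevant structure encoded in the prior, we can choose to optimally set $p(z) = q_\phi(z)$, removing the marginal divergence from the objective and leading to a tighter version of the ELBO:

$$\mathbb{E}_{q(x)} \log p_\theta(x) \;\ge\; \mathbb{E}_{q(x)} \mathbb{E}_{q_\phi(z|x)} \log p_\theta(x|z) - I(X;Z) \tag{7}$$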
While the VAE is motivated from a generative modeling approach based on inference of the latent variables, we choose to focus on the lossy compression aspect of the KL decomposition above. We follow Alemi et al. (2018) in advocating that unsupervised learning be motivated from this encoding perspective. In particular, we study the following optimization problem:

$$\min_{\theta, \phi} \; \mathbb{E}_{q(x)} \mathbb{E}_{q_\phi(z|x)} \big[ -\log p_\theta(x|z) \big] + \beta\, I(X;Z) \tag{8}$$
While this resembles the $\beta$-VAE objective of Higgins et al. (2017), we proceed to interpret the parameter $\beta$ in the context of rate-distortion theory. The special choice of $\beta = 1$ gives a bound on log-likelihood according to Eq. 7, which we use to compare results across methods in our experiments.
Having presented a method for exact calculation of $I(X;Z)$ and shown how this quantity can lead to a tighter bound on likelihood, we now draw connections to rate-distortion theory within the context of recent related works.
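Given a source distribution $q(x)$ and a distortion function $d(x, z)$ over samples and their codes, the rate-distortion function is defined as an optimization over conditional distributions $q(z|x)$:

$$R(D) = \min_{q(z|x)} I(X;Z) \quad \text{s.t.} \quad \mathbb{E}_{q(x)\, q(z|x)} \big[ d(x, z) \big] \le D$$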
It is common to optimize an unconstrained problem by introducing a Lagrange multiplier which, at optimality, reflects the tradeoff between compression and fidelity as the slope of the rate-distortion function at , i.e. :
Eq. 8 suggests the cross entropy reconstruction loss as a distortion measure, so that $d(x, z) = -\log p_\theta(x|z)$. We can then observe the equivalence between the rate-distortion optimization and our problem definition, as only the tradeoff between rate and distortion affects the characterization of solutions. Alemi et al. (2018) also analyse this distortion measure in the context of upper and lower bounds on $I(X;Z)$:

$$H(X) - \mathbb{E}_{q(x)\, q(z|x)} \big[ -\log p_\theta(x|z) \big] \;\le\; I(X;Z) \;\le\; \mathbb{E}_{q(x)} D_{KL}[\,q(z|x)\,\|\,p(z)\,]$$

With the data entropy as a constant, minimizing the cross entropy distortion corresponds to the variational information maximization lower bound of Agakov (2005). The upper bound is identical to the KL decomposition above, and it can be shown that $I(X;Z) \le \mathbb{E}_{q(x)} D_{KL}[\,q(z|x)\,\|\,p(z)\,]$ using the non-negativity of the KL divergence. Thus, any marginal distribution $p(z)$ will provide an upper bound on mutual information with a gap of $D_{KL}[\,q(z)\,\|\,p(z)\,]$.
Several recent works have observed this fact and considered ‘learned priors’ or marginal approximations (Chen et al., 2016; Tomczak & Welling, 2017; Alemi et al., 2018) that seek to match $q(z)$, thereby reducing $D_{KL}[\,q(z)\,\|\,p(z)\,]$ and tightening our bounds. From this perspective, a static Gaussian prior can be seen as providing a particular and possibly loose marginal approximation (Gao et al., 2018; Alemi et al., 2018). The analysis in Rosca et al. (2018) suggests that the marginal divergence can indeed be significant for VAE models and their flow-based or adversarial extensions.
With this in mind, it is interesting to note the self-consistent equations which solve the variational problem above (see, e.g. Tishby et al. (2000))
Notice that, regardless of the choice of distortion measure, our Echo noise channel enforces the second equation throughout optimization by using the encoding marginal as the ‘optimal prior.’ For our choice of distortion, the solution simplifies as in Theorem 1 of Lastras-Montano (2018):
This provides an interesting comparison with the generative modeling approach. While the Evidence Lower Bound objective can be interpreted as performing posterior inference with prior $p(z)$ in the numerator, we see that the information theoretic perspective prescribes using the exact encoding marginal $q(z)$. Indeed, we can interpret our tighter bound in Eq. 7 as corresponding to the likelihood under the generative model $p(x) = \int q(z)\, p_\theta(x|z)\, dz$. The gap in the ELBO then becomes $\mathbb{E}_{q(x)} D_{KL}[\,q(z|x)\,\|\,q(z)\, p_\theta(x|z) / p(x)\,]$, encouraging the encoder to match the rate-distortion solution for $\beta = 1$.
Several other recent works have drawn connections between maximum likelihood and rate-distortion theory. Lastras-Montano (2018) demonstrate a symmetry between improvements in the marginal approximation and likelihood function. Rezende & Viola (2018) analyse the latent space dynamics for solutions to the rate-distortion problem in Eq. 9, although they use the traditional prior and its upper bound on rate. An ‘annealed posterior’ similar to Eq. 10 is considered from a Bayesian perspective in Mandt et al. (2016).
Existing models are usually trained with a static $\beta$ (Alemi et al., 2018; Higgins et al., 2017) or a heuristic annealing schedule (Bowman et al., 2015; Burgess et al., 2018), which implicitly correspond to constant constraints. However, setting target values for either the rate or distortion remains an interesting direction for future discussion. Rezende & Viola (2018) view the distortion as an intuitive user-specified choice, while Zhao et al. (2018) train a separate model to provide constraint values. As both works show, specifying a constant constraint value and optimizing the Lagrange multiplier with gradient descent can improve performance.
Finally, our method for calculating mutual information suggests exploring distortion measures for other applications. The Information Bottleneck method defines ‘relevant’ information through another random variable $Y$, and can be seen as a special case of rate-distortion with distortion measure $d(x, z) = D_{KL}[\,p(y|x)\,\|\,p(y|z)\,]$ (Tishby et al., 2000; Banerjee et al., 2005). Applied to supervised learning problems using an upper bound on the rate, Alemi et al. (2016) show that Information Bottleneck can improve generalization and adversarial robustness. Further, reinforcement learning can be framed in terms of rate-distortion using the theory of bounded rationality (Ortega et al., 2015). Here, the agent trades off a rate term, the ‘deliberation cost’ associated with choosing a state-dependent action, against a distortion measure which maximizes reward.
A number of recent works have argued that the maximum likelihood objective may be insufficient to guarantee useful representations of data (Zhao et al., 2017; Alemi et al., 2018). In particular, when paired with powerful decoders that can match the data distribution, Variational Autoencoders (VAEs) may learn to completely ignore the latent code (Bowman et al., 2015).
To rectify these issues, a commonly proposed solution has been to add terms to the objective function that maximize, minimize, or constrain the mutual information between data and representation (Zhao et al., 2017; 2018; Phuong et al., 2018; Braithwaite & Kleijn, 2018; Alemi et al., 2018). However, justifications for these approaches have varied. Further, since mutual information is intractable in general, different estimation methods have been employed in practice. These include sampling (Phuong et al., 2018), autoregressive density estimation (Alemi et al., 2018), mixture entropy estimation (Kolchinsky et al., 2017), learned mixtures (Tomczak & Welling, 2017), indirect optimization via other divergences (Zhao et al., 2017), and a dual form of the KL divergence (Belghazi et al., 2018).
Among these approaches, the InfoVAE model of Zhao et al. (2017) provides a potentially interesting comparison with our method. The objective adds a parameter to more heavily regularize the marginal divergence, and a parameter to control mutual information. After some simplification, the objective becomes:
While it is difficult to directly optimize , (Zhao et al., 2017) instead minimize the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), which will also be zero when . We choose to analyse the setting of (as in the original paper) and (no information preference), so the objective becomes:
The sizeable MMD penalty encourages $q(z) \approx p(z)$, so that $D_{KL}[\,q(z)\,\|\,p(z)\,] \approx 0$. Thus, the KL divergence term in the ELBO should more closely reflect a mutual information regularizer, facilitating comparison with the rate in Echo models.
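To make the MMD penalty concrete, the sketch below computes a biased sample estimate of squared MMD between encoder samples and prior samples. The Gaussian (RBF) kernel, the bandwidth value, and the function names are assumptions for illustration; the text does not specify the kernel settings used in the experiments.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between rows of a and rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clamp tiny negative values caused by floating point cancellation
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    return (rbf_kernel(a, a, bandwidth).mean()
            + rbf_kernel(b, b, bandwidth).mean()
            - 2.0 * rbf_kernel(a, b, bandwidth).mean())

rng = np.random.default_rng(0)
z_encoder = rng.normal(size=(50, 2))   # stand-in for samples from q(z)
z_shifted = z_encoder + 3.0            # clearly mismatched samples
same = mmd_squared(z_encoder, z_encoder)
different = mmd_squared(z_encoder, z_shifted)
```

The estimate is zero when both sample sets coincide and grows as they separate, which is the behaviour the penalty relies on to pull the encoding marginal toward the prior.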
Flow models, which evaluate densities on simple distributions such as Gaussians but apply complex transformations with tractable Jacobians, are another prominent recent development in unsupervised learning (Germain et al., 2015; Rezende & Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2017). We compare against models using inverse autoregressive flow (IAF) in the encoder. This is not a noise model, but can be seen as transforming the output of a Gaussian noise channel into an approximate posterior sample using an autoregressive network (Kingma et al., 2016).
We consider several marginal approximations alongside the IAF encoder, including a Gaussian prior with MMD penalty as described above (IAF+MMD). Masked autoregressive flow (MAF) (Papamakarios et al., 2017) models a similar transformation as IAF, but with computational tradeoffs suited to density estimation (IAF-MAF). Finally, we consider the VampPrior (Tomczak & Welling, 2017), which estimates $q(z)$ as a mixture of Gaussian encoders conditioned on a set of learned ‘pseudo-inputs’ (IAF-Vamp).
Before proceeding to analyse results for the Echo noise channel, we first mention several implementation choices unique to our method. In particular, we must ensure that the infinite sum defining the noise in Eq. 3 converges and is precisely approximated using a finite number of terms.
Activation Functions: We parameterize the encoding functions $f$ and $S$ using a neural network and can choose our activation functions to satisfy the convergence conditions of Lemma 2.1. We let the final layer of $f$ use a bounded element-wise activation to guarantee that its magnitude is bounded. We found it useful to expand the linear range of the function rather than increase the magnitude of the activations.
For the experiments in this paper, $S(x)$ is diagonal, with functions $s_j(x)$ on the diagonal. We implement each $s_j$ using a sigmoid activation, which bounds the spectral radius by one. However, this is not quite enough to ensure convergence, as values of $s_j(x)$ saturating at one would lead to an infinite amount of noise. We thus introduce a clipping factor on $s_j(x)$ to further limit the spectral radius and ensure accurate sampling in this high noise, low rate regime.
Sampling Precision: When can our infinite sum be truncated without sacrificing numerical precision? We consider the sum of the remainder terms after truncating at using geometric series identities. For and , we know that the sum of the infinite series will be less than . The first terms will have a sum given by , so the remainder will be less than . For a given choice of , we can numerically solve for such that the sum of truncated terms falls within machine precision . For example, with and , we obtain We therefore scale our element-wise sigmoid to for calculating both the noise and the rate.
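The geometric series argument can be turned into a small numerical routine. The sketch below assumes a bound `M` on the magnitude of the encoder mean and a bound `r` on the noise scale, and searches by bisection for the largest `r` whose truncation remainder `M * r**n / (1 - r)` stays below a tolerance. The function names and the example values (100 terms, a tolerance of `2**-23`, and `M = 1`) are hypothetical choices for illustration, not the settings reported in the paper.

```python
import math

def remainder_bound(r, n_terms, M=1.0):
    """Upper bound on the dropped tail of a geometric series after n_terms."""
    return M * r**n_terms / (1.0 - r)

def max_clip(n_terms, tol, M=1.0, iters=200):
    """Largest r in (0, 1) with remainder_bound(r) <= tol, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The bound is increasing in r, so move the bracket accordingly
        if mid < 1.0 and remainder_bound(mid, n_terms, M) <= tol:
            lo = mid
        else:
            hi = mid
    return lo

def rate_lower_bound(clip, d_z):
    """Smallest rate reachable when every diagonal scale is at most clip."""
    return -d_z * math.log(clip)

clip = max_clip(n_terms=100, tol=2.0**-23)
floor_nats = rate_lower_bound(clip, d_z=32)
```

Keeping more terms permits a clipping value closer to one and therefore a lower rate floor, at the cost of more samples per noise draw.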
Low Rate Limit: This clipping factor limits the magnitude of noise we can add in practice, and thus defines a lower limit on the achievable rate in an Echo model. For diagonal , the mutual information can be bounded in terms of , so that . Note that is increasing in , since considering more terms in the sum reduces the magnitude of remainder. Each included term can then have higher magnitude, leading to lower achievable rates. Thus, this limit can be tuned to achieve strict compression by increasing or simply using fewer latent factors .
Batch Optimization: Another consideration in choosing is that we train using mini-batches of size for stochastic gradient descent. For a given training example, we can use the other iid samples in a batch to construct Echo noise, thereby avoiding additional forward passes to evaluate and . There is also a choice of whether to sample with or without replacement, although these will be equivalent in the large batch limit. In App. Document, we see little difference between these strategies, and proceed to sample without replacement to mirror the treatment of training examples. We let to set the rate limit as low as possible for this sampling scheme.
We first seek to confirm several properties of the Echo noise described in Sec. Document. We can calculate a second-order approximation of total correlation to show that the noise is not independent across dimensions (App.Document). In App. Document, we visualize the noise and latent activations to show that Echo can learn non-Gaussian marginals with different means and variances. This flexibility does not come with a cost in the objective as it would for a VAE with prior assumptions. Finally, we compare and across data points in App. Document, where Echo adds more noise on test examples for which the eventual model likelihood is low.
Binary MNIST
Method      Rate    Distortion   -ELBO
Echo        26.8    61.9         88.7
VAE         25.9    67.7         93.6
InfoVAE     25.6    67.1         92.7
VAE-Vamp    26.0    67.4         92.9
VAE-MAF     26.8    65.8         92.6
IAF-Prior   25.6    64.9         90.5
IAF+MMD     26.0    64.1         90.1
IAF-Vamp    25.7    64.1         89.8
IAF-MAF     26.6    63.5         90.1

Omniglot
Method      Rate    Distortion   -ELBO
Echo        30.5    83.6         114.1
VAE         29.6    91.4         121.0
InfoVAE     30.1    89.4         119.4
VAE-Vamp    29.5    91.1         120.6
VAE-MAF     28.3    92.4         120.7
IAF-Prior   30.5    86.9         117.4
IAF+MMD     30.9    85.8         116.7
IAF-Vamp    30.0    86.5         116.5
IAF-MAF     31.5    84.6         116.1

Fashion MNIST
Method      Rate    Distortion   -ELBO
Echo        16.2    218.5        234.7
VAE         14.7    221.9        236.6
InfoVAE     15.8    220.6        236.3
VAE-Vamp    13.9    222.7        236.6
VAE-MAF     14.6    221.8        236.4
IAF-Prior   16.0    218.7        234.7
IAF+MMD     15.9    218.5        234.4
IAF-Vamp    15.4    219.1        234.5
IAF-MAF     16.1    218.2        234.3

Figure 2: Test Log Likelihood Bounds
We proceed to analyse the log-likelihood performance of relevant models on three image datasets: static Binary MNIST (Salakhutdinov & Murray, 2008), Omniglot (Lake et al., 2015) as adapted by Burda et al. (2015), and Fashion MNIST (fMNIST) (Xiao et al., 2017). All models are trained with 32 latent variables using the same convolutional architecture as in (Alemi et al., 2018) except with ReLU activations. We trained using Adam optimization for 200 epochs, with an initial learning rate of 0.0003 decaying linearly to 0 over the last 100 epochs. See Sec. Document and Appendix Document for additional details.
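The learning rate schedule described above can be written as a small function. This is a sketch assuming zero-indexed epochs and a rate that is updated once per epoch; the text does not say whether the decay is applied per epoch or per step, and the function name is hypothetical.

```python
def learning_rate(epoch, total_epochs=200, decay_epochs=100, base_lr=3e-4):
    """Constant rate, then linear decay to zero over the final decay_epochs."""
    decay_start = total_epochs - decay_epochs
    if epoch < decay_start:
        return base_lr
    # Fraction of the decay window that is still remaining
    remaining = (total_epochs - epoch) / decay_epochs
    return base_lr * max(remaining, 0.0)

schedule = [learning_rate(e) for e in range(201)]
```

A schedule like this can be passed to an optimizer by setting its learning rate at the start of each epoch.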
Table 2 shows negative test ELBO values, with the rate column reported as the appropriate upper bound for comparison methods. We compare Echo against diagonal Gaussian noise and IAF encoders, each with several marginal approximations: a standard Gaussian prior (with and without an MMD penalty), an MAF density estimator, and a VampPrior mixture model. Note that VAE is still used to denote the Gaussian encoder when paired with a different marginal (e.g. VAE-MAF).
We find that the Echo noise autoencoder obtains superior likelihood bounds on Binary MNIST and Omniglot and is competitive with IAF encoders on fMNIST. In analyzing these results, we attempt to quantify the impact of three key elements of the Echo approach: a data-driven noise model, a flexible marginal distribution, and exact rate regularization throughout training.
Additional complexity in the encoding procedure is clearly helpful, as we see by comparing IAF and Gaussian encoders for the same marginal approximations. The expressivity of the autoregressive transformation contributes several nats of improvement across datasets, but precludes interpretation of IAF as a noise model and a more direct comparison with Echo. While the Echo encoder is evidently more flexible than the Gaussian channel in VAEs, it is difficult to isolate the effect of the noise model due to the differences in marginal assumptions.
We can gain insight into the benefit of exact rate regularization for Gaussian priors via the MMD penalty. Using a notably smaller bandwidth than in Zhao et al. (2017) (see the Appendix), we enforce $q(z) \approx p(z)$ to tighten our upper bound on rate. This leads to slightly improved performance for InfoVAE and IAF+MMD on all datasets. It may be surprising that additional regularization can lead to better ELBOs, but we argue that MMD can guide optimization toward solutions that make efficient use of the rate penalty.
Recall that the KL divergence term in the ELBO consists of the true mutual information and a marginal divergence. The data processing inequality (Cover & Thomas, 2006) states that only those nats of mutual information can be translated into mutual information between the data and model reconstructions. While this should already encourage encoders to match a given marginal, the MMD penalty appears to be useful in enforcing this condition throughout training.
A learned approximation such as MAF or VampPrior could also help ensure a tight bound on rate while adding flexibility in the marginal space. However, it is difficult to distinguish these effects and we do not observe large differences between marginals or consistent behavior across experiments. For example, ensuring prior consistency appears to be more beneficial for VAE models on Omniglot, while the IAF encoder is able to better utilize the learned marginals. In contrast, Echo achieves both an exact rate and an adaptive prior by directly linking the choice of encoder and marginal.
We emphasize that the Echo approach provides performance gains of up to two nats with significantly fewer parameters than comparison methods. IAF and MAF each require training an additional autoregressive model, whose size is of the same order as the encoder-decoder network, while the VampPrior in our experiments uses 250 or 500 learned pseudo-inputs of the same dimension as the data. Although Echo involves extra computation to construct the noise for each training example, it has the same number of parameters as a standard VAE and runs in approximately the same wall clock time.
Moving beyond the special case of $\beta = 1$, rate-distortion theory provides the practitioner with an entire space of compression-relevance tradeoffs corresponding to constraints on the rate. We plot R-D curves for Binary MNIST in Fig. 3, Omniglot in Fig. 4, and Fashion MNIST in the Appendix. We also show model reconstructions at several points along the curve, with the output averaged over 10 encoding samples to observe how stochasticity in the latent space is translated through the decoder. These visualizations are organized to compare models with similar rates, which we emphasize may occur at different values of $\beta$ for different methods depending on the shape of their respective curves.
Echo continues to outperform comparison methods across most of the rate-distortion curve. However, we note that performance begins to drop off as we approach the lower limit on achievable rate, shown with a dashed vertical line in each plot. Recall that this bound arises due to the clipping factor introduced above, ensuring that the rate calculation accurately reflects the noise for a finite number of samples. In this regime, the sigmoids parameterizing $S(x)$ are saturated for much of training, with unused dimensions still counting against the rate. We reiterate that this low rate limit may be adjusted by raising the number of terms in the truncated sum or decreasing the number of latent factors.
At low rates, our models maintain only high level features of the input image and have higher variance in encoding noise. The blurred average reconstructions reflect that different samples can lead to semantically different generations. For example, Echo produces several ways of rounding the leftmost 9, 2, and 6 on Binary MNIST. On both datasets, Echo gives qualitatively different output variation than Gaussian noise at low rate and nearly identical distortion. Intermediate rate models still reflect some of this sample diversity, particularly on the more difficult Omniglot dataset.
For high capacity models, we observe that Echo slightly extends its gains over comparison methods, with up to 4 nats of improvement. Intuitively, a more complex marginal can be harder to approximate, leading to a loose upper bound on mutual information. Note that the Echo approach may be particularly useful in this regime, as it avoids explicitly constructing the marginal while still providing exact rate regularization.
VAEs can be interpreted as performing a rate-distortion optimization, but are handicapped by their weak compression mechanism, independent Gaussian marginal assumptions, and upper bound on rate. We introduced a new type of channel, Echo noise, that provides a more flexible, data-driven approach to constructing noise and admits an exact expression for mutual information. Our results demonstrate that using Echo noise in autoencoders can lead to better bounds on log-likelihood, as well as superior trade-offs between compression and reconstruction. While this work focused on the case of unsupervised learning, the Echo channel should naturally translate to other rate-distortion settings via the choice of distortion measure.
Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (DARPA) under awards FA8750-17-C-0106 and W911NF-16-1-0575, and are grateful to anonymous reviewers for helpful comments.
- Achille & Soatto (2016) Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. arXiv preprint arXiv:1611.01353, 2016.
- Agakov (2005) Agakov, F. V. Variational Information Maximization in Stochastic Environments. PhD thesis, University of Edinburgh, 2005.
- Alemi et al. (2016) Alemi, A., Fischer, I., Dillon, J., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Alemi et al. (2018) Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. Fixing a broken elbo. In International Conference on Machine Learning, pp. 159–168, 2018.
- Banerjee et al. (2005) Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with bregman divergences. Journal of machine learning research, 6(Oct):1705–1749, 2005.
- Belghazi et al. (2018) Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., and Courville, A. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
- Bowman et al. (2015) Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Józefowicz, R., and Bengio, S. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
- Braithwaite & Kleijn (2018) Braithwaite, D. and Kleijn, W. B. Bounded information rate variational autoencoders. arXiv preprint arXiv:1807.07306, 2018.
- Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599, 2018.
- Chen et al. (2018) Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
- Chen et al. (2016) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
- Cover & Thomas (2006) Cover, T. M. and Thomas, J. A. Elements of information theory. Wiley-Interscience, 2006.
- Dillon et al. (2017) Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. Tensorflow distributions. arXiv preprint arXiv:1711.10604, 2017.
- Gao et al. (2018) Gao, S., Brekelmans, R., Ver Steeg, G., and Galstyan, A. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.
- Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
- Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Hoffman & Johnson (2016) Hoffman, M. D. and Johnson, M. J. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
- Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
- Kolchinsky et al. (2017) Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
- Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- Lastras-Montano (2018) Lastras-Montano, L. A. Information theoretic lower bounds on negative log likelihood. 2018. URL https://openreview.net/forum?id=rkemqsC9Fm.
- Mandt et al. (2016) Mandt, S., McInerney, J., Abrol, F., Ranganath, R., and Blei, D. Variational tempering. In Artificial Intelligence and Statistics, pp. 704–712, 2016.
- Ortega et al. (2015) Ortega, P. A., Braun, D. A., Dyer, J., Kim, K.-E., and Tishby, N. Information-theoretic bounded rationality. arXiv preprint arXiv:1512.06789, 2015.
- Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
- Phuong et al. (2018) Phuong, M., Welling, M., Kushman, N., Tomioka, R., and Nowozin, S. The mutual autoencoder: Controlling information in latent code representations. 2018. URL https://openreview.net/forum?id=HkbmWqxCZ.
- Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538, 2015.
- Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming vaes. arXiv preprint arXiv:1810.00597, 2018.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
- Rosca et al. (2018) Rosca, M., Lakshminarayanan, B., and Mohamed, S. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.
- Salakhutdinov & Murray (2008) Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872–879. ACM, 2008.
- Tishby et al. (2000) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Tomczak & Welling (2017) Tomczak, J. M. and Welling, M. Vae with a vampprior. AISTATS 2018, 2017. arXiv preprint arXiv:1705.07120.
- Tschannen et al. (2018) Tschannen, M., Bachem, O., and Lucic, M. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
- Ver Steeg et al. (2017) Ver Steeg, G., Brekelmans, R., Harutyunyan, H., and Galstyan, A. Disentangled representations via synergy minimization. In Allerton Conference on Communication, Control, and Computing, pp. 180–187. IEEE, 2017.
- Watanabe (1960) Watanabe, S. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.
- Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. 2017.
- Zhao et al. (2017) Zhao, S., Song, J., and Ermon, S. Infovae: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017. URL http://arxiv.org/abs/1706.02262.
- Zhao et al. (2018) Zhao, S., Song, J., and Ermon, S. The information autoencoding family: A lagrangian perspective on latent variable generative models. CoRR, abs/1806.06514, 2018.
Supplementary Material for “Exact Rate-Distortion in Autoencoders via Echo Noise”
All models were trained using a convolutional architecture similar to that of Alemi et al. (2018), but with ReLU activations, unnormalized gradients, and fewer latent factors. We use Keras notation and list convolutional layers using the arguments (filters, kernel size, stride, padding). We show an example parametrization of Echo in the hidden layer.
Conv2D(32, 5, 1, 'same')
Conv2D(32, 5, 2, 'same')
Conv2D(64, 5, 1, 'same')
Conv2D(64, 5, 2, 'same')
Conv2D(256, 7, 1, 'valid')
echo_input = [Dense(32, tanh()),
Conv2DTranspose(64, 7, 1, 'valid')
Conv2DTranspose(64, 5, 1, 'same')
Conv2DTranspose(64, 5, 2, 'same')
Conv2DTranspose(32, 5, 1, 'same')
Conv2DTranspose(32, 5, 2, 'same')
Conv2DTranspose(32, 4, 1, 'same')
Conv2D(1, 4, 1, 'same', activation='sigmoid')
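As a quick sanity check on the listing above, the following sketch traces the spatial size of the feature maps through the convolutional stack. It assumes a 28x28 single-channel input (not stated in this section) and the usual Keras output-size rules for 'same' and 'valid' padding. The dense Echo layer between encoder and decoder is omitted, and the helper names are hypothetical.

```python
import math


def conv_out(n, kernel, stride, padding):
    """Output size along one axis for a Keras-style Conv2D."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1  # 'valid'


def conv_transpose_out(n, kernel, stride, padding):
    """Output size along one axis for a Keras-style Conv2DTranspose."""
    if padding == "same":
        return n * stride
    return n * stride + max(kernel - stride, 0)  # 'valid'


# (filters, kernel size, stride, padding), copied from the listing.
ENCODER = [(32, 5, 1, "same"), (32, 5, 2, "same"), (64, 5, 1, "same"),
           (64, 5, 2, "same"), (256, 7, 1, "valid")]
DECODER = [(64, 7, 1, "valid"), (64, 5, 1, "same"), (64, 5, 2, "same"),
           (32, 5, 1, "same"), (32, 5, 2, "same"), (32, 4, 1, "same")]
OUTPUT = (1, 4, 1, "same")


def trace_shapes(size=28):
    """Return (height, width, channels) after every convolutional layer.

    The default input size of 28 is an assumption, not taken from the text.
    """
    shapes, n = [], size
    for filters, kernel, stride, padding in ENCODER:
        n = conv_out(n, kernel, stride, padding)
        shapes.append((n, n, filters))
    for filters, kernel, stride, padding in DECODER:
        n = conv_transpose_out(n, kernel, stride, padding)
        shapes.append((n, n, filters))
    filters, kernel, stride, padding = OUTPUT
    n = conv_out(n, kernel, stride, padding)
    shapes.append((n, n, filters))
    return shapes


for shape in trace_shapes():
    print(shape)
```

Under these assumptions the encoder reduces the input to a single spatial position with 256 channels before the Echo layer, and the decoder restores the input resolution.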
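For the Echo models considered in this work, we can also derive an interesting equivalence between the conditional and overall total correlation. Observe that the expression for mutual information in Eq. 4 decomposes for diagonal $S(x)$ with diagonal entries $s_j(x)$:
$$I(X;Z) = -\mathbb{E}_x \log |\det S(x)| = -\sum_j \mathbb{E}_x \log |s_j(x)|$$
This additivity across dimensions implies that $I(X;Z) = \sum_j I(X;Z_j)$. Before proceeding, we first recall the definitions of total correlation and conditional total correlation (Watanabe, 1960), which measure the divergence from independence of the marginal and conditional, respectively:
$$TC(Z) = D_{KL}\Big[\, q(z) \,\Big\|\, \prod_j q(z_j) \Big], \qquad TC(Z|X) = \mathbb{E}_x\, D_{KL}\Big[\, q(z|x) \,\Big\|\, \prod_j q(z_j|x) \Big]$$
Now consider the quantity $\mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{\prod_j q(z_j)}$. We can decompose this in two different ways, first by projecting onto the joint marginal:
$$\mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{\prod_j q(z_j)} = \mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{q(z)} + \mathbb{E}_{q(x,z)} \log \frac{q(z)}{\prod_j q(z_j)} = I(X;Z) + TC(Z)$$
We can also decompose using the factorized conditional:
$$\mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{\prod_j q(z_j)} = \mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{\prod_j q(z_j|x)} + \mathbb{E}_{q(x,z)} \log \frac{\prod_j q(z_j|x)}{\prod_j q(z_j)} = TC(Z|X) + \sum_j I(X;Z_j)$$
The equality of $I(X;Z)$ and $\sum_j I(X;Z_j)$ implies equality for $TC(Z)$ and $TC(Z|X)$.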
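The two decompositions hold for any joint distribution, so they can be checked numerically. The sketch below builds a small random discrete joint over one input variable and two latent variables (a hypothetical toy example, unrelated to the Echo models themselves) and confirms that mutual information plus total correlation equals conditional total correlation plus the sum of per-dimension mutual informations.

```python
import itertools
import math
import random


def random_joint(nx=3, nz=2, seed=0):
    """Random strictly positive joint p(x, z1, z2) over small discrete alphabets."""
    rng = random.Random(seed)
    keys = list(itertools.product(range(nx), range(nz), range(nz)))
    weights = [rng.random() + 0.05 for _ in keys]
    total = sum(weights)
    return {k: w / total for k, w in zip(keys, weights)}


def marginal(p, axes):
    """Sum the joint over every axis not listed in `axes`."""
    out = {}
    for key, val in p.items():
        sub = tuple(key[a] for a in axes)
        out[sub] = out.get(sub, 0.0) + val
    return out


def info_terms(p):
    """Return (I(X;Z), TC(Z), TC(Z|X), sum_j I(X;Z_j)) in nats."""
    px = marginal(p, (0,))
    pz = marginal(p, (1, 2))
    pz1, pz2 = marginal(p, (1,)), marginal(p, (2,))
    pxz1, pxz2 = marginal(p, (0, 1)), marginal(p, (0, 2))
    mi = tc = ctc = mi_sum = 0.0
    for (x, a, b), v in p.items():
        z_given_x = v / px[(x,)]
        a_given_x = pxz1[(x, a)] / px[(x,)]
        b_given_x = pxz2[(x, b)] / px[(x,)]
        mi += v * math.log(z_given_x / pz[(a, b)])
        tc += v * math.log(pz[(a, b)] / (pz1[(a,)] * pz2[(b,)]))
        ctc += v * math.log(z_given_x / (a_given_x * b_given_x))
        mi_sum += v * math.log(a_given_x / pz1[(a,)])
        mi_sum += v * math.log(b_given_x / pz2[(b,)])
    return mi, tc, ctc, mi_sum


mi, tc, ctc, mi_sum = info_terms(random_joint())
print(mi + tc, ctc + mi_sum)  # the two sides of the identity
```

Rearranging the identity shows that the gap between the joint and summed mutual informations always equals the gap between the conditional and overall total correlations, so additivity of the former forces equality of the latter.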
The effects of this relationship have not been widely studied, as $TC(Z|X) = 0$ for traditional VAE models, whose conditionals factorize across dimensions. On the other hand, $TC(Z)$ is usually non-zero and has been minimized as a proxy for ‘disentanglement’ (Kim & Mnih, 2018; Chen et al., 2018). We could imagine similar regularization for Echo.
We have shown that the parallel Echo channels are perfectly additive in that $I(X;Z) = \sum_j I(X;Z_j)$. However, general channels could be sub- or super-additive (Ver Steeg et al., 2017), so that $I(X;Z)$ may be less than, equal to, or greater than $\sum_j I(X;Z_j)$. Extending Echo to non-diagonal $S(x)$ could allow us to explore the various relationships between $TC(Z)$ and $TC(Z|X)$ and more precisely characterize those which are useful for representation learning.
We include several additional experimental results in this section, including rate-distortion on Fashion MNIST, analysis of marginal activations, and a comparison of sampling with and without replacement.
We show a full rate-distortion curve for Fashion MNIST in Fig. 5, along with reconstructions at various rates. Echo performance nearly matches that of IAF encoders except at low rates.
We visualize marginal activations by dimension for Echo, VAE, and InfoVAE on the Omniglot dataset in Fig. 6. We show the marginals of nine dimensions for each method, including six with the highest rates, two with low rates, and one with minimal rate. For each, we combine activations from 500 encoder samples on each test example and fit a Gaussian kernel density estimator with bandwidth chosen according to the Scott criterion.
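The density fit described above can be sketched in a few lines. This is a minimal one-dimensional illustration, assuming one common form of Scott's rule in which the bandwidth is the sample standard deviation times n^(-1/5). The synthetic activations are a hypothetical stand-in for real encoder samples, and the function names are not from the original work.

```python
import math
import random
import statistics


def scott_bandwidth(samples):
    """One common 1-D form of Scott's rule: std * n^(-1/5)."""
    n = len(samples)
    return statistics.stdev(samples) * n ** (-1.0 / 5.0)


def gaussian_kde(samples, bandwidth=None):
    """Return a function evaluating the Gaussian kernel density estimate."""
    h = scott_bandwidth(samples) if bandwidth is None else bandwidth
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - s) / h) ** 2) for s in samples)

    return density


# Hypothetical stand-in for the activations of one latent dimension.
rng = random.Random(0)
activations = [rng.gauss(0.3, 0.5) for _ in range(500)]

density = gaussian_kde(activations)
print(scott_bandwidth(activations), density(0.3))
```

A rule-of-thumb bandwidth of this kind is tuned for roughly Gaussian data, so a visibly poor fit is itself a hint that the underlying marginal is non-Gaussian.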
As discussed in the main text, Echo puts no distributional assumptions on the marginals of each dimension. We can directly observe non-Gaussianity in the activation shapes and rough fit of the estimator, and note that these plots also reflect the distribution of the noise, by the equivalence of the noise distribution and the marginal. Further, different dimensions are free to learn different means and variances without incurring a penalty in the objective.
The VAE prior, on the other hand, pushes each marginal to resemble a standard Gaussian $\mathcal{N}(0, 1)$. However, we observe that this is not necessarily achieved in all dimensions. The MMD penalty more strictly enforces prior consistency for InfoVAE, leading to the improved log-likelihood bounds reported in the main text. As an interesting aside, we note that an Echo model trained with the MMD penalty performs similarly to InfoVAE. In this case, the noise and the marginals are close to independent and Gaussian, removing much of the flexibility that powers our approach.
We can analyse the Echo mutual information at each data point by noting that the expression in Eq. 4 involves an expectation over . Since and do not depend on in the proof of Thm. 2.3, we can evaluate as a pointwise mutual information. We compare this quantity with the L2-norm of as a proxy for signal to noise ratio. Test examples are sorted by conditional likelihood on the x-axis, and we see that Echo indeed adds less noise on examples where the generative model is more confident.
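To make the per-example quantity concrete, the sketch below assumes that Eq. 4 takes the form of a negative expected log-determinant of a diagonal noise scaling with entries in (0, 1]. That form is an assumption here, since the equation is not reproduced in this section. Under it, the pointwise value for one example is the negative sum of the log scales, and the overall rate is the average over examples. The scale values and function names are hypothetical.

```python
import math


def pointwise_rate(scales):
    """Per-example rate in nats: -sum_j log s_j(x), for scales in (0, 1]."""
    return -sum(math.log(s) for s in scales)


def average_rate(batch_scales):
    """Average the pointwise rates over a batch of examples."""
    return sum(pointwise_rate(s) for s in batch_scales) / len(batch_scales)


# Hypothetical per-dimension noise scales for two examples.
batch_scales = [
    [0.9, 0.5, 0.1],    # one small scale: little noise, more information kept
    [0.99, 0.99, 0.6],  # scales near 1: mostly noise, little information kept
]

for scales in batch_scales:
    print(pointwise_rate(scales))
print(average_rate(batch_scales))
```

Smaller scales mean less noise is mixed in, so examples with smaller scales contribute more nats to the rate.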
Finally, we compare Test ELBOs for Echo noise sampled with and without replacement from within a batch. We use a batch size of 100, which gives a maximum choice of for sampling without replacement. We also use the setting for sampling with replacement, and see small differences between these strategies across datasets.
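The two sampling strategies can be illustrated with index selection alone. This is a sketch under stated assumptions: the noise for each example is built from other examples in the same batch, and the example's own index is excluded. The function name and the parameter `d` (number of noise samples) are hypothetical.

```python
import random


def sample_noise_indices(batch_size, i, d, replace, rng):
    """Pick d indices of other batch examples to build the noise for example i.

    Excluding index i itself is an assumption of this sketch.
    """
    others = [j for j in range(batch_size) if j != i]
    if replace:
        # With replacement: indices may repeat, so d is not limited by batch size.
        return rng.choices(others, k=d)
    if d > len(others):
        raise ValueError("cannot draw more distinct indices than available")
    # Without replacement: every selected index is distinct.
    return rng.sample(others, d)


rng = random.Random(0)
without = sample_noise_indices(batch_size=100, i=7, d=20, replace=False, rng=rng)
with_rep = sample_noise_indices(batch_size=100, i=7, d=20, replace=True, rng=rng)
print(len(set(without)), len(set(with_rep)))
```

Sampling without replacement caps the number of noise samples by the batch size, while sampling with replacement removes that cap at the cost of possible repeats.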