Exact Rate-Distortion in Autoencoders via Echo Noise
Compression is at the heart of effective representation learning. However, lossy compression is typically achieved through simple parametric models like Gaussian noise to preserve analytic tractability, and the limitations this imposes on learning are largely unexplored. Further, the Gaussian prior assumptions in models such as variational autoencoders (VAEs) provide only an upper bound on the compression rate in general. We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. The noise is constructed in a data-driven fashion that does not require restrictive distributional assumptions. With its complex encoding mechanism and exact rate regularization, Echo leads to improved bounds on log-likelihood and dominates $\beta$-VAEs across the achievable range of rate-distortion trade-offs. Further, we show that Echo noise can outperform flow-based methods without the need to train additional distributional transformations.
Rate-distortion theory provides an organizing principle for representation learning that is enshrined in machine learning as the Information Bottleneck principle (tishby2000information). The goal is to compress input random variables $X$ into a representation $Z$ with mutual information rate $I(X;Z)$, while minimizing a distortion measure that captures our ability to use the representation for a task. For the rate to be restricted, some information must be lost through noise. Despite the use of increasingly complex encoding functions via neural networks, simple noise models like Gaussians still dominate the literature because of their analytic tractability. Unfortunately, the effect of these assumptions on the quality of learned representations is not well understood.
The Variational Autoencoding (VAE) framework (kingma2013auto; rezende2014stochastic) has provided the basis for a number of recent developments in representation learning (informationdropout; chen2018isolating; chen2016variational; higgins2017; kim2018disentangling; tschannen2018recent). While VAEs were originally motivated as performing posterior inference under a generative model, several recent works have viewed the Evidence Lower Bound objective as corresponding to an unsupervised rate-distortion problem (informationdropout; alemi2018fixing; rezende2018taming). From this perspective, reconstruction of the input provides the distortion measure, while the KL divergence between encoder and prior gives an upper bound on the information rate that depends heavily on the choice of prior (alemi2018fixing; rosca2018distribution; tomczak2017vae).
In this work, we deconstruct this interpretation of VAEs and their extensions. Do the restrictive assumptions of the Gaussian noise model limit the quality of VAE representations? Does forcing the latent space to be independent and Gaussian constrain the expressivity of our models? We find evidence to support both claims, showing that a powerful noise model can achieve more efficient lossy compression and that relaxing prior or marginal assumptions can lead to better bounds on both the information rate and log-likelihood.
The main contribution of this paper is the introduction of the Echo noise channel, a powerful, data-driven improvement over Gaussian channels whose compression rate can be precisely expressed for arbitrary input distributions. Echo noise is constructed from the empirical distribution of its inputs, allowing its variation to reflect that of the source (see Fig. 1). We leverage this relationship to derive an analytic form for mutual information that avoids distributional assumptions on either the noise or the encoding marginal. Further, the Echo channel avoids the need to specify a prior, and instead implicitly uses the optimal prior in the Evidence Lower Bound. This marginal distribution is neither Gaussian nor independent in general.
After introducing the Echo noise channel and an exact characterization of its information rate in Sec. 2, we proceed to interpret Variational Autoencoders from an encoding perspective in Sec. 3. We formally define our rate-distortion objective in Sec. 3.1, and draw connections with recent related works in Sec. 4. Finally, we report log-likelihood results, visualize the space of compression-reconstruction trade-offs, and evaluate disentanglement in Echo representations in Sec. 5.
2 Echo Noise
To avoid learning representations that memorize the data, we would like to constrain the mutual information between the input $X$ and the representation $Z$. Since we have freedom to choose how to encode the data, we can design a noise model that facilitates calculating this generally intractable quantity.
The Echo noise channel has a shift-and-scale form that mirrors the reparameterization trick in VAEs. Referring to the observed data distribution as $q(x)$, with samples $x \sim q(x)$, we can define the stochastic encoder $q(z|x)$ using:

$$z = f(x) + S(x)\,\epsilon \qquad (1)$$
For brevity, we omit the subscripts that indicate that the function $f$ and matrix function $S$ depend on neural networks parameterized by $\phi$. All that remains to specify the encoder is to fix the distribution of the noise variable, $\epsilon$. For VAEs, the noise is typically chosen to be Gaussian, $\epsilon \sim \mathcal{N}(0, I)$. (Footnote 1: Our approach is also easily adapted to multiplicative noise, such as in informationdropout.)
With the goal of calculating mutual information, we will need to compare the marginal entropy $H(Z)$, which integrates over samples $x$, and the conditional entropy $H(Z|X)$, whose stochasticity is only due to the noise for deterministic $f$ and $S$. The choice of noise will affect both quantities, and our approach is to relate them by enforcing an equivalence between the distributions $q(z)$ and $q(\epsilon)$.
Since is defined in terms of the source, we can also imagine constructing the noise in a data-driven way. For instance, we could draw in an effort to make the noise match the channel output. However, this changes the distribution of and the noise would need to be updated to continue resembling the output.
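Instead, by iteratively applying Eq. 1, we can guarantee that the noise and marginal distributions match in the limit. Using superscripts to indicate iid samples $x^{(\ell)} \sim q(x)$, we draw $\epsilon$ according to:

$$\epsilon = f(x^{(0)}) + S(x^{(0)})\Big( f(x^{(1)}) + S(x^{(1)})\big( f(x^{(2)}) + \cdots \big)\Big) \qquad (2)$$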
Echo noise is thus constructed using an infinite sum over attenuated “echoes” of the transformed data samples. This can be written more compactly as follows.
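The Echo noise distribution is defined for functions $f$, $S$ and probability density function $q(x)$ over the input space, by sampling $\epsilon$ according to the following procedure: draw $x^{(\ell)} \sim q(x)$ iid for $\ell = 0, 1, 2, \ldots$ and set

$$\epsilon = \sum_{\ell=0}^{\infty} \Big( \prod_{\ell'=0}^{\ell-1} S(x^{(\ell')}) \Big) f(x^{(\ell)}) \qquad (3)$$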
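As a rough illustration of this sampling procedure, the sketch below approximates an Echo noise draw with a truncated sum for diagonal scaling. The function names, the toy shift and scale functions (`tanh` and a constant 0.5), and the truncation length are assumptions made for illustration; this is not the implementation used in the experiments. The second function writes the same draw as an explicit sum of attenuated echoes, which makes the equivalence with the nested form easy to check.

```python
import numpy as np

def echo_noise(f_vals, s_vals, num_terms, rng):
    """Draw one truncated Echo noise sample (diagonal scaling assumed).

    f_vals, s_vals: arrays of shape (n, d) holding f(x) and the diagonal of
    S(x) for n data points; rows are resampled iid to play the role of x^(l).
    """
    n, d = f_vals.shape
    idx = rng.integers(0, n, size=num_terms)   # iid sample indices
    eps = np.zeros(d)
    # Evaluate the nested form from the innermost echo outward:
    # eps = f(x0) + S(x0) (f(x1) + S(x1) (f(x2) + ...))
    for i in idx[::-1]:
        eps = f_vals[i] + s_vals[i] * eps
    return eps, idx

def echo_noise_explicit_sum(f_vals, s_vals, idx):
    """The same sample written as an explicit sum of attenuated echoes."""
    d = f_vals.shape[1]
    eps = np.zeros(d)
    scale = np.ones(d)             # running product of earlier scalings
    for i in idx:
        eps += scale * f_vals[i]
        scale *= s_vals[i]
    return eps

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 2))
f_vals = np.tanh(x)                # bounded shift, |f| <= 1
s_vals = np.full_like(x, 0.5)      # constant diagonal scaling below 1
eps, idx = echo_noise(f_vals, s_vals, num_terms=50, rng=rng)
z = f_vals[0] + s_vals[0] * eps    # channel output for the first data point
```

With a bounded shift and scaling strictly below one, each coordinate of the draw is bounded by a geometric series, which is why a modest number of terms already gives an accurate sample in this toy setting.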
Although the noise distribution may be complex, it has the interesting property that it exactly matches the eventual output marginal $q(z)$.
Lemma 2.1 (Echo noise matches channel output).
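If $x \sim q(x)$ and $\epsilon$ is drawn from the Echo noise distribution defined above, then $z = f(x) + S(x)\,\epsilon$ has the same distribution as $\epsilon$.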
We can observe this relationship by simply re-labeling the sample indices in the expanded expression for the noise in Eq. 2. In particular, the training example that we condition on in Eq. 1 corresponds to the first sample in a draw from the noise. This equivalence is the key insight leading to an exact expression for the mutual information:
Theorem 2.2 (Echo Information).
For a channel of the form in Eq. 1 with Echo noise sampled as in Eq. 3, the mutual information is

$$I(X;Z) = -\,\mathbb{E}_{q(x)} \log \left| \det S(x) \right| \qquad (4)$$

We start by expanding the definition of mutual information in terms of entropies, reiterating that the stochasticity underlying $Z$ after conditioning on $X$ is due only to the random variable $\epsilon$.

$$\begin{aligned}
I(X;Z) &= H(Z) - H(Z \mid X) \\
&= H(Z) - \mathbb{E}_{q(x)}\, H\big(f(x) + S(x)\,\epsilon \mid X = x\big) \\
&= H(Z) - \mathbb{E}_{q(x)}\, H\big(S(x)\,\epsilon \mid X = x\big) \\
&= H(Z) - H(\epsilon) - \mathbb{E}_{q(x)} \log \left| \det S(x) \right| \\
&= -\,\mathbb{E}_{q(x)} \log \left| \det S(x) \right|
\end{aligned}$$

Since $f$ and $S$ are deterministic, we treat them as constants inside the expectation. We use the translation invariance of differential entropy in the third line, and the scaling property in the fourth line (cover). Finally, we use Lemma 2.1 to cancel out the entropy terms. ∎
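In this work, we consider only diagonal $S(x)$ as is typical for VAEs, so that the determinant in Eq. 4 simplifies as

$$I(X;Z) = -\sum_{j} \mathbb{E}_{q(x)} \log \left| s_j(x) \right|$$

where $s_j(x)$ denotes the $j$-th diagonal entry of $S(x)$.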
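For diagonal scaling, the rate reduces to a minibatch average of summed log-magnitudes. The following sketch estimates this quantity in nats from the diagonal entries of the scaling evaluated on a batch; the function name `echo_rate` and the array layout are hypothetical choices for illustration.

```python
import numpy as np

def echo_rate(s_vals):
    """Monte Carlo estimate of I(X;Z) = -E_x sum_j log|s_j(x)| in nats.

    s_vals: array of shape (batch, d) with the diagonal entries of S(x).
    """
    return -np.mean(np.sum(np.log(np.abs(s_vals)), axis=1))

# A constant scaling of 0.5 on 32 latent dimensions gives 32 * log(2) nats.
rate = echo_rate(np.full((128, 32), 0.5))
```

Because the estimate depends only on the scaling outputs, it can be added to a training loss without any density evaluation of the noise or the marginal.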
Finally, we note that the noise distribution is only defined implicitly through a sampling procedure. For this to be meaningful, we must ensure that the infinite sum converges.
The infinite sum in Eq. 3 converges, and thus Echo noise sampling is well-behaved, if $\exists\, M$ s.t. $|f_j(x)| \le M$ and $\rho(S(x)) < 1$ for all $x$, where $\rho$ is the spectral radius.
In App. B, we discuss several implementation choices to guarantee that these conditions are met and that Echo noise can be accurately sampled using a finite number of terms. This can be particularly difficult in the high noise, low information regime, as zero mutual information would imply an infinite amount of noise. To avoid this issue and ensure precise sampling, we clip the magnitude of $S(x)$ so that, for a given bound $M$ on the magnitude of $f$ and number of samples, the sum of remainder terms is guaranteed to be within machine precision. This imposes a lower bound on the achievable rate across the Echo channel, which depends on the number of terms considered and can be tuned by the practitioner.
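One way to make the clipping concrete is to bound the dropped terms by a geometric series. The sketch below assumes diagonal scaling with every entry at most `c` in magnitude and shift values bounded by `f_bound`; the tail criterion, the tolerance of `2**-23`, and both function names are illustrative assumptions rather than the exact rule described in App. B. It also shows how the clip value translates into a floor on the rate.

```python
import math

def max_scale_for_precision(num_terms, f_bound=1.0, tol=2.0 ** -23):
    """Largest clip value c in (0, 1) whose geometric tail stays below tol.

    Assumes |f_j(x)| <= f_bound and |s_j(x)| <= c, so the terms dropped after
    num_terms echoes sum to at most f_bound * c**num_terms / (1 - c).
    """
    lo, hi = 0.0, 1.0
    for _ in range(200):                       # bisection on a monotone bound
        mid = 0.5 * (lo + hi)
        if mid < 1.0 and f_bound * mid ** num_terms / (1.0 - mid) <= tol:
            lo = mid
        else:
            hi = mid
    return lo

def min_rate_nats(c, latent_dim):
    """Rate floor implied by |s_j(x)| <= c: I(X;Z) >= -latent_dim * log(c)."""
    return -latent_dim * math.log(c)

c100 = max_scale_for_precision(100)
c200 = max_scale_for_precision(200)
```

Using more terms permits a clip value closer to one and therefore a lower achievable rate, which matches the trade-off described above.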
2.1 Properties of Echo Noise
We can visualize applying Echo noise to a complex input distribution in Fig. 1, using the identity transformation and constant noise scaling . Here, we directly observe the equivalence of the noise and output distributions. Further, the data-driven nature of the Echo channel means it can leverage the structure in the (transformed) input to destroy information in a more targeted way than spherical Gaussian noise.
In particular, Echo’s ability to add noise that is correlated across dimensions distinguishes it from common diagonal noise models. It is important to note that the noise still reflects the dependence in even when is diagonal. In fact, we show in App. C that for the diagonal case, where total correlation measures the divergence from independence, e.g. (watanabe).
In the setting of learned $f$ and $S$, notice that the noise depends on the parameters. This means that training gradients are propagated through $\epsilon$, unlike traditional VAEs where $\epsilon$ is fixed. This may be a factor in improved performance: data samples are used as both signal and noise in different parts of the optimization, leading to a more efficient use of data.
Finally, the Echo channel fulfills several of the desirable properties that often motivate Gaussian noise and prior assumptions. Eqs. 1 and 3 define a simple sampling procedure that only requires a supply of iid samples from the input distribution. It is easy to sample both the noise and conditional distributions for the purposes of evaluating expectations, while Echo also provides a natural way to sample from the true encoding marginal via its equivalence with . While we cannot evaluate the density of a given under or , as might be useful in importance sampling burda2015importance, we can characterize their relationship on average using the mutual information in Eq. 4. These ingredients make Echo noise useful for learning representations within the autoencoder framework.
3 Lossy Compression in VAEs
Variational Autoencoders (VAEs) (kingma2013auto; rezende2014stochastic) seek to maximize the log-likelihood of data under a latent factor generative model defined by , where represents parameters of the generative model decoder and is the prior distribution over latent variables. However, maximum likelihood is intractable in general due to the difficult integral over , .
To avoid this problem, VAEs introduce a variational distribution, , which encodes the input data and approximates the generative model posterior . This leads to the tractable (average) Evidence Lower Bound (ELBO) on likelihood:
The connection between VAEs and rate-distortion theory can be seen using a decomposition of the KL divergence term from elbosurgery.
This decomposition lends insight into the orthogonal goals of the ELBO regularization term. The mutual information encourages lossy compression of the data into a latent code, while the marginal divergence enforces consistency with the prior. The non-negativity of the KL divergence implies that each of these terms detracts from our likelihood bound.
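The decomposition can be checked numerically on a small discrete example. The sketch below assumes the identity in question is the standard one, in which the average KL divergence between encoder and prior equals the mutual information plus the KL divergence between the encoding marginal and the prior. The three-symbol source, the randomly drawn encoder, and the uniform prior are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q_x = np.array([0.5, 0.3, 0.2])                   # data distribution q(x)
q_z_given_x = rng.dirichlet(np.ones(4), size=3)   # encoder rows q(z|x)
p_z = np.full(4, 0.25)                            # fixed prior p(z)

q_z = q_x @ q_z_given_x                           # encoding marginal q(z)

def kl(p, q):
    """KL divergence between two discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

# Average KL to the prior, as in the ELBO regularization term.
avg_kl = sum(q_x[i] * kl(q_z_given_x[i], p_z) for i in range(3))
# Mutual information: average KL from each encoder row to the marginal.
mutual_info = sum(q_x[i] * kl(q_z_given_x[i], q_z) for i in range(3))
# Divergence between the encoding marginal and the prior.
marginal_kl = kl(q_z, p_z)
```

Since the marginal divergence is non-negative, the average KL term can only overestimate the mutual information, and the two agree exactly when the prior equals the encoding marginal.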
Similarly, we observe that $\mathbb{E}_{q(x)} D_{KL}[q(z|x) \,\|\, p(z)]$ gives an upper bound on the mutual information, with a gap of $D_{KL}[q(z) \,\|\, p(z)]$. From this perspective, a static Gaussian prior can be seen as a particular and possibly loose marginal approximation (alemi2018fixing; gao2018auto; rosca2018distribution). The true encoding marginal $q(z)$ provides the unique, optimal choice of prior and leads to a tighter bound on the likelihood:
Our exact expression for the mutual information over an Echo channel provides the first general method to directly optimize this objective. This corresponds to adaptively setting equal to throughout training, so that Eq. 7 can be seen as bounding the likelihood under the generative model .
3.1 Rate-Distortion Objective
While the VAE is motivated as performing amortized inference of the latent variables in a generative model, the prior is rarely leveraged to encode domain-specific structure. Further, we have shown that enforcing prior consistency can detract from likelihood bounds.
We instead follow alemi2018fixing in advocating that representation learning be motivated from an encoding perspective using rate-distortion theory. In particular, we choose reconstruction under the generative model as the distortion measure $d(x, z) = -\log p_\theta(x|z)$, and study the following optimization problem:

$$\min_{\theta, \phi} \; \mathbb{E}_{q(x)\, q(z|x)} \big[ -\log p_\theta(x|z) \big] + \beta\, I(X;Z) \qquad (8)$$
While this resembles the $\beta$-VAE objective of higgins2017, we highlight two notable distinctions. First, treating $I(X;Z)$ rather than the upper bound $\mathbb{E}_{q(x)} D_{KL}[q(z|x) \,\|\, p(z)]$ avoids the need to specify a prior and facilitates a direct interpretation in terms of lossy compression. Further, the parameter $\beta$ is naturally interpreted as a Lagrange multiplier enforcing a constraint on $I(X;Z)$. The special choice of $\beta = 1$ gives a bound on log-likelihood according to Eq. 7, which we use to compare results across methods in Sec. 5. We direct the reader to App. A for a more formal treatment of rate-distortion.
4 Related Work
Rate-Distortion Theory: A number of recent works have made connections between the Evidence Lower Bound objective and rate-distortion theory (informationdropout; alemi2018fixing; infolowerbounds; rezende2018taming), with the average distortion corresponding to the cross entropy reconstruction loss as above. In particular, alemi2018fixing consider the following upper and lower bounds on the mutual information $I(X;Z)$:
With the data entropy as a constant, minimizing the cross entropy distortion corresponds to the variational information maximization lower bound of barber2003algorithm. The upper bound matches the decomposition in Eq. 6 for the generalized choice of marginal . Several recent works have also considered ‘learned priors’ or flow-based density estimators alemi2018fixing; chen2016variational; tomczak2017vae that seek to reduce the marginal divergence by approximating (see below). Using this upper bound on the rate term, alemi2018fixing and rezende2018taming obtain objectives similar to Eq. 8.
Existing models are usually trained with a static $\beta$ alemi2018fixing; higgins2017 or a heuristic annealing schedule bowman2016; burgess2018understanding, which implicitly correspond to constant constraints (see App. A). However, setting target values for either the rate or distortion remains an interesting direction for future discussion. rezende2018taming view the distortion as an intuitive quantity to specify in practice, while zhao2018lagrangian train a separate model to provide constraint values. As both works show, specifying a constant constraint and optimizing the Lagrange multiplier with gradient descent can lead to improved performance.
Mutual Information in Unsupervised Learning: A number of recent works have argued that the maximum likelihood objective may be insufficient to guarantee useful representations of data alemi2018fixing; infovae. In particular, when paired with powerful decoders that can match the data distribution, VAEs may learn to completely ignore the latent code bowman2016; chen2016variational.
To rectify these issues, a commonly proposed solution has been to add terms to the objective function that maximize, minimize, or constrain the mutual information between data and representation (alemi2018fixing; braithwaite2018bounded; phuong2018mutual; infovae; zhao2018lagrangian). However, justifications for these approaches have varied and numerous methods have been employed for estimating the mutual information. These include sampling (phuong2018mutual), indirect optimization via other divergences (infovae), mixture entropy estimation (kolchinsky2017nonlinear), learned mixtures (tomczak2017vae), autoregressive density estimation (alemi2018fixing), and a dual form of the KL divergence (belghazi2018mine). poole2019variational provide a thorough review and analysis of variational upper and lower bounds on mutual information, although recent results have shown limits on our ability to construct high confidence estimators directly from samples (mcallester2018formal). Echo notably avoids this limitation by providing an analytic expression for the rate whenever the representation is sampled according to Eq. 3.
Among these approaches, the InfoVAE model of infovae provides a potentially interesting comparison with our method. The objective adds a parameter to more heavily regularize the marginal divergence and a parameter to control mutual information. However, since is intractable, the Maximum Mean Discrepancy (MMD) gretton2012kernel between the encoding outputs and a standard Gaussian is used as a proxy. For the choice of (as in the original paper) and (no information preference), the objective simplifies to:
The sizeable MMD penalty encourages , so that . Thus, the KL divergence term in the ELBO should more closely reflect a mutual information regularizer, facilitating comparison with the rate in Echo models.
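To illustrate the kind of sample-based proxy involved, the sketch below computes a biased estimate of squared MMD between two sample sets with a Gaussian (RBF) kernel. The kernel choice, the bandwidth of 1.0, and the function names are assumptions for illustration and are not taken from the InfoVAE implementation or from the experiments reported here.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(z, prior_samples, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    k_zz = rbf_kernel(z, z, bandwidth).mean()
    k_pp = rbf_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_zp = rbf_kernel(z, prior_samples, bandwidth).mean()
    return k_zz + k_pp - 2.0 * k_zp

rng = np.random.default_rng(0)
prior = rng.normal(size=(200, 2))           # samples from a standard Gaussian
matched = mmd_squared(prior, prior)         # identical sample sets
shifted = mmd_squared(prior + 3.0, prior)   # encodings far from the prior
```

The estimate vanishes when the two sample sets coincide and grows as the encoding outputs drift away from the prior samples, which is what lets a large penalty weight push the encoding marginal toward the Gaussian.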
Flow models, which evaluate densities on simple distributions such as Gaussians but apply complex transformations with tractable Jacobians, are another prominent recent development in unsupervised learning (germain2015made; kingma2016improved; papamakarios2017masked; rezende2015variational). Flows can be used both as an encoding mechanism and marginal approximation for our purposes. In particular, Inverse Autoregressive Flow kingma2016improved can be seen as transforming the output of a Gaussian noise channel into an approximate posterior sample using a stack of autoregressive networks. Masked Autoregressive Flow papamakarios2017masked models a similar transformation with computational tradeoffs suited for density estimation, mapping latent samples to high probability under a Gaussian base distribution to approximate .
Finally, the VampPrior tomczak2017vae may also be used as a marginal approximation, modeling using a mixture distribution evaluated on a set of ‘pseudo-inputs’ learned by backpropagation.
In this section, we would ideally like to quantify the impact of three key elements of the Echo approach: a data-driven noise model, exact rate regularization throughout training, and a flexible marginal distribution. In App. D.2, we observe that the dimension-wise marginals learned by Echo appear Gaussian despite our lack of explicit constraints. However, the joint marginal over (or equivalently ) may still have a complex dependence structure, which is not penalized for deviating from independence or Gaussianity. We calculate a second-order approximation of total correlation in App. C to confirm that this noise is indeed dependent across dimensions.
5.1 ELBO Results
We proceed to analyse the log-likelihood performance of relevant models on three image datasets: static Binary MNIST (salakhutdinov2008quantitative), Omniglot (lake2015human) as adapted by burda2015importance, and Fashion MNIST (fMNIST) (xiao2017). All models are trained with 32 latent variables using the same convolutional architecture as in alemi2018fixing except with ReLU activations. We trained using Adam optimization for 200 epochs, with an initial learning rate of 0.0003 decaying linearly to 0 over the last 100 epochs. See App. E for additional details.
Table 1 shows negative test ELBO values, with the rate column reported as the appropriate upper bound for comparison methods. Results are averaged from ten runs of each model after removing the highest and lowest outliers. We compare Echo against diagonal Gaussian noise and IAF encoders, each with four marginal approximations: a Gaussian prior with and without the MMD penalty (e.g. IAF-Prior, IAF+MMD), MAF papamakarios2017masked, and VampPrior tomczak2017vae. Note that VAE is still used to denote the Gaussian encoder when paired with a different marginal (e.g. VAE-Vamp).
We find that the Echo noise autoencoder obtains improved likelihood bounds on Binary MNIST and Omniglot, with competitive results on fMNIST. We emphasize that Echo achieves this performance with significantly fewer parameters than comparison methods. IAF and MAF each require training an additional autoregressive model with size similar to the original network, while the VampPrior uses 750 learned pseudoinputs of the same dimension as the data. Although Echo involves extra computation to construct the noise for each training example, it has the same number of parameters as a standard VAE and runs in approximately the same wall clock time.
We observe only minor differences based on the choice of encoding mechanism, which is somewhat surprising given the additional expressivity of the IAF transformation. The benefit of the flow transformations may be more readily observed on more difficult datasets or with more advanced architecture tuning kingma2016improved.
We do find that a more complex marginal approximation can help performance. Although we see minimal gains from the MMD penalty and MAF marginal, the VampPrior bridges much of the performance gap with Echo noise. Recall that a learned prior can help ensure a tight rate bound while providing flexibility to learn a more complex marginal (in this case, a mixture model). However, the relative contribution of these effects is difficult to decouple. Echo instead provides both an exact rate and an adaptive prior by directly linking the choice of encoder and marginal.
The Echo encoder is evidently more flexible than a Gaussian noise channel, and Echo dominates VAEs across datasets with up to six nats of improvement. The additional expressivity of the IAF transformation is necessary to approach Echo performance, although IAF is not directly interpretable as a noise model and may still be paired with an inexact marginal.
We can gain insight into the benefit of exact rate regularization for Gaussian priors using the MMD penalty to enforce . This leads to slightly improved performance for InfoVAE, although IAF encoders do not appear to be sensitive to this change. It may be surprising that additional regularization can lead to better ELBOs, but we argue MMD can guide optimization toward solutions that make efficient use of the rate penalty . The data processing inequality cover states that only those nats of mutual information can be translated into mutual information between the data and model reconstructions, . While this should already encourage encoders to match a given marginal, the MMD penalty appears useful in enforcing this condition throughout training for VAEs.
A learned approximation such as MAF or VampPrior could also help ensure a tight rate bound while adding flexibility to learn a complex marginal space. However, these effects can be difficult to distinguish. Gaussian encoders appear unable to leverage the complexity of the MAF transformation, with prior consistency proving particularly useful on Omniglot. VampPrior is useful for both encoders and IAF+Vamp provides competitive performance across datasets, although we cannot easily quantify the gap in the marginal approximation. Alternatively, Echo provides both an exact rate and an adaptive prior by directly linking the choice of encoder and marginal.
5.2 Rate Distortion Curves
Moving beyond the special case of $\beta = 1$, rate-distortion theory provides the practitioner with an entire space of compression-relevance tradeoffs corresponding to constraints on the rate. We plot R-D curves for Binary MNIST in Fig. 2, Omniglot in Fig. 3, and Fashion MNIST in App. D.1. We also show model reconstructions at several points along the curve, with the output averaged over 10 encoding samples to observe how stochasticity in the latent space is translated through the decoder. These visualizations are organized to compare models with similar rates, which we emphasize may occur at different values of $\beta$ for different methods depending on the shape of their respective curves.
Echo continues to outperform comparison methods across most of the rate-distortion curve. However, we note that performance begins to drop off as we approach the lower limit on achievable rate, shown with a dashed vertical line in each plot. This bound arises from ensuring that the rate calculation accurately reflects the noise for a finite number of samples, and is described in detail in App. B. In this regime, the sigmoids parameterizing $S(x)$ are saturated for much of training, and unused dimensions still count against the rate since we cannot achieve zero rate. We reiterate that this low rate limit may be adjusted by considering more terms in the infinite sum or decreasing the number of latent factors.
At low rates, our models maintain only high level features of the input image, and the blurred average reconstructions reflect that different samples can lead to semantically different generations. On both datasets, Echo gives qualitatively different output variation than Gaussian noise at low rate and similar distortion. Intermediate-rate models still reflect some of this sample diversity, particularly on the more difficult Omniglot dataset.
For very high capacity models, we observe that Echo slightly extends its gains on both datasets, with three to five nats lower distortion than comparison methods at the same rates. Intuitively, a more complex encoding marginal may be harder to match to a (learned) prior, loosening the upper bound on mutual information. The Echo approach can be particularly useful in this regime, as it avoids explicitly constructing the marginal while still providing exact rate regularization.
5.3 Disentangled Representations
Significant recent attention has been devoted to learning disentangled representations of data, which reflect the true generative factors of variation in the data chen2018isolating; mathieu2019disentangling and may be useful for downstream tasks locatello2018challenging; van2019disentangled. While prevailing definitions and metrics for disentanglement have recently been challenged locatello2018challenging, existing methods often rely on the inductive bias of independent ground truth factors, either via total correlation (TC) regularization chen2018isolating; kim2018disentangling, or by using higher $\beta$ to more strongly penalize the KL divergence to an independent prior burgess2018understanding; higgins2017. Since Echo does not assume a factorized encoder or marginal, we investigate whether it can better preserve disentanglement when the ground truth factors are not independent.
To evaluate the quality of Echo noise representations, we compare against VAE models with diagonal Gaussian noise and priors, and consider the effects of increasing or adding independence regularization with parameter chen2018isolating; kim2018disentangling:
TC regularization is implemented as in kim2018disentangling, where a discriminator is trained to distinguish samples from and . We keep when modifying . Note that enforcing marginal independence will also limit the dependence in the noise learned by Echo, since and are linked as described in Sec. 2.1.
We calculate disentanglement scores on the dSprites dataset dsprites17, where the ground truth factors of shape, scale, x-y position, and rotation are known and sampled independently across the dataset. To induce dependence in the ground truth factors, we downsample the dataset by partitioning each factor into 4 bins and randomly excluding pairwise combinations of bins with probability 0.15. This leads to a dataset of 15% of the original size, with a total correlation of 1.49 nats in the generative factors. We use both the implementation and experimental setup of locatello2018challenging and average scores over ten runs of each method.
Table 2 reports FactorVAE kim2018disentangling and Mutual Information Gap chen2018isolating scores for both independent and dependent ground truth factors. We find that Echo provides superior disentanglement scores to VAEs across the board, although the relative improvement does not increase in the case of dependent latent factors. On the full dataset, independence regularization improves the MIG score for Echo and both scores for VAE, but may guide both models toward more entangled representations when this inductive bias does not match the ground truth. Finally, we note that increasing $\beta$ need not improve disentanglement for Echo noise, since we have relaxed assumptions of independence in both the encoder and marginal. Higher values of $\beta$ actually appear to hurt disentanglement scores on the dependent dataset for both methods.
In Figure 4, we visualize an Echo model that has successfully learned to disentangle position and scale, but not rotation, on the full dSprites dataset. Each row represents a single latent dimension, and each column shows mean values as a function of the respective ground truth factors. Note that the first column shows a heatmap in the x-y plane, while the orange, blue, and green lines indicate ellipse, square, and heart, respectively (see chen2018isolating). In general, we observed that Echo models achieved their highest MIG scores on position, scale, and shape, with rotation often entangled across two or more dimensions.
VAEs can be interpreted as performing a rate-distortion optimization, but may be handicapped by their weak compression mechanism, independent Gaussian marginal assumptions, and upper bound on rate. We introduced a new type of channel, Echo noise, that provides a more flexible, data-driven approach to constructing noise and admits an exact expression for mutual information. Our results demonstrate that using Echo noise in autoencoders can lead to better bounds on log-likelihood, favorable trade-offs between compression and reconstruction, and more disentangled representations.
The Echo channel can be substituted for Gaussian noise in most scenarios where VAEs are used, with similar runtime and the same number of parameters. Echo should also translate to other rate-distortion problems via the choice of distortion measure, including supervised learning with the traditional Information Bottleneck method alemi2016deep; tishby2000information and invariant representation learning as in moyer2018invariant. Exploring further settings where mutual information provides meaningful regularization for neural network representations remains an exciting avenue for future work.
Appendix A Rate-Distortion Theory
Given a source $x \sim q(x)$ and a distortion function $d(x, z)$ over samples $x$ and their codes $z$, the rate-distortion function is defined as an optimization over conditional distributions $q(z|x)$:
$$R(D) = \min_{q(z|x)} I(X;Z) \quad \text{subject to} \quad \mathbb{E}_{q(x)\,q(z|x)}\left[ d(x,z) \right] \le D.$$
It is common to optimize an unconstrained problem by introducing a Lagrange multiplier $1/\beta$ which, at optimality, reflects the tradeoff between compression and fidelity as the slope of the rate-distortion function at $D$, i.e. $\frac{dR}{dD} = -\frac{1}{\beta}$:
$$\min_{q(z|x)} \; I(X;Z) + \frac{1}{\beta}\, \mathbb{E}_{q(x)\,q(z|x)}\left[ d(x,z) \right].$$
(Footnote 2: Note, we have constrained the distortion here, instead of the rate as in the main text. We write the Lagrange multiplier as $1/\beta$ to maintain a correspondence between the parameterizations of each problem.)
Eq. 8 suggests the cross entropy reconstruction loss as a distortion measure, so that $d(x,z) = -\log p_\theta(x|z)$. We can then observe the equivalence between the rate-distortion optimization and our problem definition, as only the tradeoff between rate and distortion affects the characterization of solutions.
It is also interesting to note the self-consistent equations which solve the variational problem above (see, e.g. tishby2000information):
$$q(z|x) = \frac{q(z)}{Z(x,\beta)} \exp\left( -\frac{1}{\beta}\, d(x,z) \right), \qquad q(z) = \int q(x)\, q(z|x)\, dx,$$
where $Z(x,\beta)$ is a normalization constant.
Notice that, regardless of the choice of distortion measure, our Echo noise channel enforces the second equation throughout optimization by using the encoding marginal as the ‘optimal prior.’ For our choice of distortion, $d(x,z) = -\log p_\theta(x|z)$, the solution simplifies as:
$$q(z|x) = \frac{q(z)\, p_\theta(x|z)^{1/\beta}}{Z(x,\beta)}.$$
This provides an interesting comparison with the generative modeling approach. While the Evidence Lower Bound objective can be interpreted as performing posterior inference with prior $p(z)$ in the numerator, we see that the information theoretic perspective prescribes using the exact encoding marginal $q(z)$. Indeed, our version of the ELBO in Eq. 7 bounds the likelihood under the generative model $p_\theta(x) = \int q(z)\, p_\theta(x|z)\, dz$. The gap in this bound then becomes $D_{KL}\left[ q(z|x) \,\|\, q(z)\, p_\theta(x|z) / p_\theta(x) \right]$, encouraging the encoder to match the rate-distortion solution for $\beta = 1$.
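To make the self-consistent equations concrete, the following is a minimal sketch of Blahut-Arimoto-style alternating updates on a toy discrete problem. The binary source, the Hamming distortion, the function name `blahut_arimoto`, and the convention that the multiplier enters as exp(-d / beta) are all assumptions chosen for illustration; none of this is taken from the paper's experiments.

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=500):
    """Alternate the two self-consistent updates for a discrete source.

    p_x  : (n_x,) source probabilities
    d    : (n_x, n_z) distortion matrix d(x, z)
    beta : trade-off parameter; distortion is weighted by 1 / beta (assumed convention)
    """
    n_x, n_z = d.shape
    q_z = np.full(n_z, 1.0 / n_z)  # start from a uniform marginal
    for _ in range(n_iter):
        # Encoder update: q(z|x) proportional to q(z) * exp(-d(x, z) / beta)
        q_z_given_x = q_z[None, :] * np.exp(-d / beta)
        q_z_given_x /= q_z_given_x.sum(axis=1, keepdims=True)
        # Marginal update: q(z) = sum_x p(x) q(z|x)
        q_z = p_x @ q_z_given_x
    # Rate I(X;Z) in nats and expected distortion at the fixed point
    rate = np.sum(p_x[:, None] * q_z_given_x * np.log(q_z_given_x / q_z[None, :]))
    distortion = np.sum(p_x[:, None] * q_z_given_x * d)
    return q_z_given_x, q_z, rate, distortion

# Toy example: uniform binary source with Hamming distortion
p_x = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
q_z_given_x, q_z, rate, distortion = blahut_arimoto(p_x, d, beta=0.5)
```

For this toy source the resulting point lies on the known curve R(D) = log 2 - H_b(D) (in nats), which gives a quick sanity check of the updates.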
Appendix B Implementation of Echo Noise Sampling
Numerically, Gaussian noise cannot be sampled exactly and is instead approximated to within machine precision. We discuss several unique implementation choices that allow us to generate similarly precise Echo noise samples. In particular, we must ensure that the infinite sum defining the noise in Eq. 3 converges and is accurately approximated using a finite number of terms.
Activation Functions: We parameterize the encoding functions $f(x)$ and $S(x)$ using a neural network and can choose our activation functions to satisfy the convergence conditions of Lemma 2.3. We let the final layer of $f$ use an element-wise tanh to guarantee that the magnitude is bounded: $|f(x)| \le 1$. We found it useful to expand the linear range of the tanh function for training stability, although differences were relatively minor and may vary by application. One could also consider clipping the range of a linear activation to enforce a desired magnitude $M$.
For the experiments in this paper, is diagonal, with functions on the diagonal. We implement each using a sigmoid activation, making the spectral radius . However, this is not quite enough to ensure convergence, as would lead to an infinite amount of noise. We thus introduce a clipping factor on to further limit the spectral radius and ensure accurate sampling in this high noise, low rate regime.
Sampling Precision: When can our infinite sum be truncated without sacrificing numerical precision? We consider the sum of the remainder terms after truncating at using geometric series identities. For and , we know that the sum of the infinite series will be less than . The first terms will have a sum given by , so the remainder will be less than . For a given choice of , we can numerically solve for such that the sum of truncated terms falls within machine precision . For example, with and , we obtain We therefore scale our element-wise sigmoid to for calculating both the noise and the rate.
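The truncation argument above can be sketched numerically. Writing the magnitude bound as `magnitude`, the clipping value as `r`, and the number of retained terms as `d_max` (all names assumed here), the geometric series bound on the discarded terms is magnitude * r**d_max / (1 - r). The sketch below finds, by bisection, the largest `r` for which this remainder stays within a tolerance. The settings d_max = 99, magnitude = 1, and a float32 tolerance of 2**-23 are illustrative assumptions, not values confirmed by the text.

```python
import math

def max_clip(d_max, magnitude=1.0, eps=2.0 ** -23, tol=1e-12):
    """Largest r in (0, 1) with magnitude * r**d_max / (1 - r) <= eps."""
    def remainder(r):
        # Geometric-series bound on the sum of all terms after the first d_max
        return magnitude * r ** d_max / (1.0 - r)

    lo, hi = 0.0, 1.0 - 1e-12  # remainder is increasing in r on (0, 1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if remainder(mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative settings (assumed): 99 terms, unit magnitude, float32 precision
r = max_clip(d_max=99)
```

Under these assumed settings the bisection returns a value of roughly 0.836. Retaining more terms permits a larger clipping value, and therefore more noise.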
Low Rate Limit: This clipping factor limits the magnitude of noise we can add in practice, and thus defines a lower limit on the achievable rate in an Echo model. For diagonal , the mutual information can be bounded in terms of , so that . Note that is increasing in , since the first term in the remainder decreases exponentially with the number of terms. Each included term can then have higher magnitude, leading to lower achievable rates. Thus, this limit can be tuned to achieve strict compression by increasing or simply using fewer latent factors .
Batch Optimization: Another consideration in choosing is that we train using mini-batches of size for stochastic gradient descent. For a given training example, we can use the other iid samples in a batch to construct Echo noise, thereby avoiding additional forward passes to evaluate and . There is also a choice of whether to sample with or without replacement, although these will be equivalent in the large batch limit. In experiments we saw little difference between these strategies, and proceed to sample without replacement to mirror the treatment of training examples. We let to set the rate limit as low as possible for this sampling scheme.
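The batch construction can be sketched as follows. This is a minimal illustration and not the authors' implementation (see the linked repository for that). It assumes the noise has the recursive form z = f(x) + S(x) * eps with diagonal S, that each example's noise is built from the other B - 1 examples drawn without replacement, and that the rate for diagonal S is the negative expected sum of log diagonal entries. The names `echo_sample_batch`, `echo_rate`, and the clipping value 0.8 are hypothetical.

```python
import numpy as np

def echo_sample_batch(f, s, rng):
    """Batch-wise Echo noise sketch for diagonal S.

    f : (B, d) encoder outputs f(x), assumed bounded (e.g. by tanh)
    s : (B, d) diagonal entries of S(x), assumed clipped to (0, r) with r < 1
    Returns z of shape (B, d) with z_i = f_i + s_i * eps_i.
    """
    B, d = f.shape
    z = np.empty_like(f)
    for i in range(B):
        # Use the other B - 1 examples, in random order, without replacement
        order = rng.permutation(np.delete(np.arange(B), i))
        # The l-th term is weighted by the product of s over the previous l - 1 samples
        prev_s = np.vstack([np.ones((1, d)), s[order][:-1]])
        weights = np.cumprod(prev_s, axis=0)
        eps = np.sum(weights * f[order], axis=0)
        z[i] = f[i] + s[i] * eps
    return z

def echo_rate(s):
    """Rate estimate in nats, assuming I(X;Z) = -E[sum_j log s_j(x)] for diagonal S."""
    return -np.mean(np.sum(np.log(s), axis=1))

rng = np.random.default_rng(0)
B, d, r = 100, 4, 0.8                               # illustrative sizes and clip value
f = np.tanh(rng.normal(size=(B, d)))                # bounded encoder outputs
s = r / (1.0 + np.exp(-rng.normal(size=(B, d))))    # sigmoid scaled into (0, r)
z = echo_sample_batch(f, s, rng)
rate = echo_rate(s)
```

Because every diagonal entry is below the clip value, the estimated rate cannot fall below -d * log(r), which is the low rate limit discussed above.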
Appendix C Total Correlation for Echo Noise
To briefly demonstrate that Echo noise is dependent across latent dimensions, we can estimate the total correlation of noise samples in Table 3 using the second-order covariance approximation . This is clearly zero for diagonal Gaussian noise, and provides a sufficient condition to show that the Echo noise is not independent.
Table 3: Estimated total correlation of Echo noise samples (columns: Binary MNIST, Omniglot, Fashion MNIST).
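One common second-order approximation treats the samples as jointly Gaussian, so that the total correlation is half the difference between the sum of log marginal variances and the log determinant of the covariance. The sketch below implements that estimate. Whether it matches the exact expression used for Table 3 is an assumption, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def gaussian_total_correlation(samples):
    """Second-order (Gaussian) estimate of total correlation in nats.

    samples : (n, d) array of noise samples
    Returns 0.5 * (sum_j log cov_jj - log det cov), which is zero for a diagonal covariance.
    """
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

rng = np.random.default_rng(0)
# Independent dimensions: the estimate should be close to zero
tc_indep = gaussian_total_correlation(rng.normal(size=(20000, 3)))
# Correlated dimensions (correlation 0.8): the estimate should be clearly positive
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
tc_corr = gaussian_total_correlation(rng.multivariate_normal([0.0, 0.0], cov, size=20000))
```

A clearly positive estimate indicates dependence between dimensions; a near-zero estimate does not by itself establish independence, since only second-order structure is measured.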
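For the Echo models considered in this work, we can also derive an interesting equivalence between the conditional and overall total correlation. Observe that the expression for mutual information in Eq. 4 decomposes for diagonal $S(x)$ with diagonal entries $s_j(x)$:
$$I(X;Z) = -\mathbb{E}_{q(x)} \log \left| \det S(x) \right| = -\sum_j \mathbb{E}_{q(x)} \log \left| s_j(x) \right|.$$
This additivity across dimensions implies that $I(X;Z) = \sum_j I(X;Z_j)$. Before proceeding, we first recall the definitions of total correlation and conditional total correlation (watanabe), which measure the divergence from independence of the marginal and conditional, respectively:
$$TC(Z) = D_{KL}\Big[ q(z) \,\Big\|\, \prod_j q(z_j) \Big], \qquad TC(Z|X) = \mathbb{E}_{q(x)}\, D_{KL}\Big[ q(z|x) \,\Big\|\, \prod_j q(z_j|x) \Big].$$
Now consider the quantity $\mathbb{E}_{q(x)}\, D_{KL}\big[ q(z|x) \,\|\, \prod_j q(z_j) \big]$. We can decompose this in two different ways, first by projecting onto the joint marginal:
$$\mathbb{E}_{q(x)}\, D_{KL}\Big[ q(z|x) \,\Big\|\, \prod_j q(z_j) \Big] = \mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{q(z)} + \mathbb{E}_{q(z)} \log \frac{q(z)}{\prod_j q(z_j)} = I(X;Z) + TC(Z).$$
We can also decompose using the factorized conditional:
$$\mathbb{E}_{q(x)}\, D_{KL}\Big[ q(z|x) \,\Big\|\, \prod_j q(z_j) \Big] = \mathbb{E}_{q(x,z)} \log \frac{q(z|x)}{\prod_j q(z_j|x)} + \sum_j \mathbb{E}_{q(x,z)} \log \frac{q(z_j|x)}{q(z_j)} = TC(Z|X) + \sum_j I(X;Z_j).$$
The equality of $I(X;Z)$ and $\sum_j I(X;Z_j)$ implies equality for $TC(Z)$ and $TC(Z|X)$.
The effects of this relationship have not been widely studied, as $TC(Z|X) = 0$ for traditional VAE models. On the other hand, $TC(Z)$ is usually non-zero and has been minimized as a proxy for ‘disentanglement’ (kim2018disentangling; chen2018isolating). We evaluate similar regularization for Echo in Sec. 5.3.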
We have shown that parallel Echo channels are perfectly additive in that . However, general channels could be sub- or super-additive, so that , , or (e.g. Sec. 4.2 of griffith). Extending Echo to non-diagonal could allow us to explore the various relationships between and and more precisely characterize those which are useful for representation learning.
Appendix D Additional Results
D.1 Fashion MNIST Rate-Distortion
We show a full rate-distortion curve for Fashion MNIST in Fig. 5, along with reconstructions at various rates. Echo performance nearly matches that of comparison methods except at low rates.
D.2 Marginal Activations
We visualize dimension-wise marginal activations for Echo on Binary MNIST and Omniglot in Fig. 6. We show the marginal $q(z_j)$ for thirteen dimensions in each method, including nine with highest rates, three with low rates, and one with minimal rate. For each, we combine activations from 2000 encoder samples on each test example and fit a kernel density estimate with an RBF kernel and bandwidth chosen according to the Scott criterion.
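A density fit of this kind can be sketched in a few lines. The following is a minimal one-dimensional Gaussian (RBF) kernel density estimate using the Scott factor n**(-1/(d+4)) with d = 1, as in common implementations. The paper does not state which tooling was used, so the function name `scott_kde` and the synthetic activations are assumptions for illustration.

```python
import numpy as np

def scott_kde(samples, grid):
    """1-D Gaussian KDE evaluated on `grid`, with a Scott-rule bandwidth."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott factor for one dimension, scaled by the sample standard deviation
    h = samples.std(ddof=1) * n ** (-1.0 / 5.0)
    u = (grid[:, None] - samples[None, :]) / h
    # Average of Gaussian kernels centred on each sample
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.3, scale=0.5, size=2000)  # stand-in for one latent dimension
grid = np.linspace(-3.0, 3.0, 601)
density = scott_kde(samples, grid)
```

The returned values can be plotted against `grid` to obtain one panel of a marginal activation figure.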
As discussed in Sec. 2, Echo avoids assumptions that the marginals are independent and Gaussian as in VAEs. However, we observe the individual Echo marginals to be approximately Gaussian, with the Anderson-Darling test failing to reject the null hypothesis of Gaussianity for any dimension. Nevertheless, the joint marginal may still be dependent (see App. C).
Individual dimensions are also free to learn different means and variances without incurring a penalty in the objective; factors that retain more mutual information with the data generally have less variance in their marginals. The highest mean dimension in the Omniglot plot corresponds to an ‘unused’ dimension that saturates the lower limit on achievable rate.
D.3 Echo vs.
We can analyse the Echo mutual information at each data point by noting that the expression in Eq. 4 involves an expectation over . Since and do not depend on in the proof of Thm. 2.2, we can evaluate as a pointwise mutual information. We compare this quantity with the L2-norm of as a proxy for signal to noise ratio. Test examples are sorted by conditional likelihood on the x-axis, and we see that Echo indeed has higher mutual information on examples where the generative model likelihood is high. Further analysis of these pointwise informations remains for future work.
Appendix E Details for Experiments
All models were trained using a similar convolutional architecture as used in alemi2018fixing, but with ReLU activations, unnormalized gradients, and fewer latent factors. We use Keras notation and list convolutional layers using the arguments (filters, kernel size, stride, padding). We show an example parametrization of Echo in the hidden layer.
Conv2D(32, 5, 1, 'same')
Conv2D(32, 5, 2, 'same')
Conv2D(64, 5, 1, 'same')
Conv2D(64, 5, 2, 'same')
Conv2D(256, 7, 1, 'valid')
echo_input = [Dense(32, tanh()),
Conv2DTranspose(64, 7, 1, 'valid')
Conv2DTranspose(64, 5, 1, 'same')
Conv2DTranspose(64, 5, 2, 'same')
Conv2DTranspose(32, 5, 1, 'same')
Conv2DTranspose(32, 5, 2, 'same')
Conv2DTranspose(32, 4, 1, 'same')
Conv2D(1, 4, 1, 'same', activation='sigmoid')
We trained using Adam optimization for 200 epochs, with a learning rate of 0.0003 decaying linearly to 0 over the last 100 epochs. All experiments were run using NVIDIA Tesla V100 GPUs.
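The learning rate schedule described above can be written as a small function. This sketch assumes the rate is updated once per epoch and that the decay starts at epoch 100; the granularity of the updates is not stated in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=3e-4, total_epochs=200, decay_epochs=100):
    """Constant rate, then linear decay to 0 over the final `decay_epochs` epochs."""
    decay_start = total_epochs - decay_epochs
    if epoch < decay_start:
        return base_lr
    # Fraction of the decay window that is still remaining
    remaining = max(total_epochs - epoch, 0) / decay_epochs
    return base_lr * remaining

schedule = [learning_rate(e) for e in range(201)]
```

A function of this shape could be passed to a per-epoch scheduler callback in most deep learning frameworks.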
MAF and IAF models were implemented using the TensorFlow Probability package (dillon2017tensorflow). Each uses four steps of mean-only autoregressive flow, with each step consisting of three layers of 640 units. For the VampPrior, we used 750 pseudoinputs on all datasets. For the IAF-Vamp experiments, note that the VampPrior is calculated with respect to the inputs of the IAF transformation to avoid expensive density evaluations on new samples. This is valid since the mean-only transformation has constant Jacobian, but makes this method closely resemble VAE-Vamp. All MMD penalties had a loss coefficient of 999, and were evaluated using a radial basis kernel with bandwidth as in (infovae; zhao2018lagrangian).
For rate-distortion experiments, we evaluated
, with additional to fill in gaps in the curve as necessary.
For the disentanglement experiments in Sec. 5.3, we followed the architecture and hyperparameters in locatello2018challenging. We trained for 300,000 gradient steps on both the full dataset and the downsampled dataset with dependent factors. The visualization in Figure 4 was generated using code from chen2018isolating.
Code implementing these experiments can be found at https://github.com/brekelma/echo.