Understanding VAEs in Fisher-Shannon Plane
Abstract
In information theory, Fisher information and Shannon information (entropy) measure, respectively, the ability to estimate parameters and the uncertainty among variables. The uncertainty principle asserts a fundamental relationship between the two: the more Fisher information we obtain, the less Shannon information we gain, and vice versa. This sheds light on the essence of the encoding/decoding procedure in variational autoencoders (VAEs) and motivates us to investigate VAEs in the Fisher-Shannon plane. Our studies show that the performance of latent representation learning and of log-likelihood estimation is intrinsically influenced by the trade-off between Fisher information and Shannon information. To adjust this trade-off flexibly, we further propose a variant of VAEs that can explicitly control Fisher information in the encoding/decoding mechanism, termed the Fisher autoencoder (FAE). Through qualitative and quantitative experiments, we show the complementary properties of Fisher information and Shannon information, and give a guide for Fisher information conditioning to achieve high-resolution reconstruction, disentangled feature learning, resistance to over-fitting/over-regularization, etc.
1 Introduction
The Heisenberg uncertainty inequality expresses a limitation of operational possibilities imposed by quantum mechanics [18]: the more precisely the position of a given particle is determined, the less precisely its momentum is known, and vice versa. In information theory, the roles of position and momentum are played by Fisher information and Shannon information [8], which correspond, respectively, to the ability to estimate a parameter and the uncertainty that characterizes a random variable [17]. Based on this information-theoretic uncertainty principle, a Fisher-Shannon plane is usually constructed to facilitate the analysis of the trade-off between Fisher information and Shannon information in an information system [23].
Returning to variational autoencoders (VAEs) [9], the main goal of this type of latent variable model can be explained from two perspectives. First, from the perspective of variational optimization [7], we aim to learn model parameters that maximize the log-likelihood of the data through an evidence lower bound (ELBO). Second, from the perspective of representation learning [3], we aim to learn a latent code that incorporates sufficient information about an observation to facilitate downstream tasks. Based on these two perspectives, several variants of VAEs [21, 22, 6, 24] have been developed in recent years.
Nevertheless, the encoding/decoding mechanism has been reported to fail to connect these two perspectives. For instance, the latent variable modeled by the encoder is largely ignored when the decoder is too expressive [6]. Thus, a series of studies [24, 1] seek to enhance representation learning in VAEs on the basis of information theory. By treating a VAE as an information system, a constraint on mutual information can be introduced into the objective to guarantee that sufficient information about the observation flows into the latent code. For example, Phuong et al. [16] optimize the ELBO plus a mutual information regularizer to explicitly control the amount of information stored in latent codes. Although it is useful to consider mutual information (a member of the Shannon family) in VAE encoding for representation learning, the control of information encoding has not yet been analyzed theoretically in current works. Besides, the preference between a high-level summary and disentangled features is an important concern in representation learning; e.g., a latent representation that is useful for classification may not be suitable for other tasks such as generation. Considering these aspects, further exploration is needed to reach a comprehensive understanding of VAEs from the information-theoretic view.
In this paper, we start from the uncertainty inequality and study VAEs in the Fisher-Shannon (FS) information plane. Our analysis shows that the performance of latent representation learning and of log-likelihood estimation is intrinsically influenced by the trade-off between Fisher information and entropy power (a form of Shannon information). We further show the different effects on the latent representation obtained by conditioning the Fisher information and the entropy power, which makes it possible to link variants of VAE models and gives a theoretical guide on how to adapt VAEs to real-world applications. To facilitate Fisher information and entropy power conditioning, we propose a family of VAEs regularized by the Fisher information, named the Fisher autoencoder (FAE). By conditioning Fisher information in encoding, FAE provides a novel insight into the encoding/decoding of VAEs and can unify various VAE variants. Finally, we conduct a range of experiments to validate our analysis in the Fisher-Shannon plane and demonstrate the effects of Fisher information and Shannon information conditioning. Correspondingly, we show how to adjust FAE to achieve high-resolution reconstruction, disentangled feature learning, and resistance to over-fitting/over-regularization.
2 Related work
Information uncertainty: Fisher information and Shannon information are considered important tools for describing the informational behavior of information systems, respectively from the parametric view and the variable view [4]. The generalization of the Heisenberg uncertainty principle [18] to information systems shows that Fisher information and Shannon information are intrinsically linked through the uncertainty inequality, where higher Fisher information results in lower Shannon information and vice versa [8]. Given this property, Fisher information and Shannon information are considered complementary aspects and are widely used to solve the dual problem when one aspect is intractable [14]. To take better advantage of this property, Vignat and Bercher [23] construct the Fisher-Shannon information plane for signal analysis in a joint view of Fisher and Shannon information.
VAEs in information perspective: Variational autoencoders [9], with their auto-encoding form, can be regarded as an information system that serves two goals. On the one hand, we expect a proper parameter estimation that maximizes the marginal likelihood; on the other hand, we hope the latent code provides sufficient information about a data point to serve downstream tasks. To improve log-likelihood estimation, several works, such as PixelCNN and PixelRNN [21, 22], model the dependencies among pixels with an autoregressive structure to obtain an expressive density estimator. As for the latent code, many works address it from the perspective of Shannon information [6]. Mutual information, a member of the Shannon family, is applied to measure the mutual dependence between data point and latent code [24, 1]. Mutual information is leveraged through Maximum-Mean Discrepancy [24] and Wasserstein distance [20]. More generally, Phuong et al. [16] regularize the mutual information in the VAE objective to control the information in the latent code.
Effects of Fisher information and Shannon information: Fisher information and Shannon information (typically called entropy), as complementary aspects, each have their own properties. In [17], entropy is explained as a measure of "global character" that is not too sensitive to strong changes in the distribution taking place on a small-sized region, whereas Fisher information is interpreted as a measure of the ability to estimate a parameter and is regarded as a "local" measure. In other words, entropy reflects the uncertainty of the information system, and a system benefits from entropy to avoid "surprisal" as much as possible; Fisher information, on the other hand, helps the system estimate the distribution parameter precisely. Thus, in the encoding mechanism, Fisher information and entropy are expected to serve, respectively, the encoding of refined features and of a high-level summary. Motivated by this, we study VAEs in the Fisher-Shannon information plane to provide a complete understanding of VAEs in a joint view of Fisher information and Shannon information, and to give a guide for providing a "useful" encoding mechanism in different scenarios.
3 Fisher-Shannon Information Plane
In this section, we first present the information uncertainty principle that links the Fisher information and the Shannon information [8] of a random variable. After that, a Fisher-Shannon information plane [23] is constructed to jointly analyze the characteristics of the random variable with respect to its distribution. This provides the basic tools for understanding VAEs with the Fisher-Shannon information plane.
3.1 Fisher-Shannon Information Uncertainty Principle
In information theory, consider a random variable $X$ whose probability density function is denoted $f(x)$. The Fisher information $J(X)$ of $X$ and the Shannon entropy $H(X)$ are formulated as:
$$J(X) = \int \frac{1}{f(x)} \left( \frac{\partial f(x)}{\partial x} \right)^2 dx, \qquad H(X) = -\int f(x) \log f(x)\, dx. \tag{1}$$
These two information quantities are related, respectively, to the precision with which parameters can be estimated and to the characteristics of the random variable. For convenience in derivations, the Shannon entropy is usually transformed into the following quantity, called the entropy power [19]:
$$N(X) = \frac{1}{2\pi e} \exp\big(2 H(X)\big). \tag{2}$$
The quantities $N(X)$ and $J(X)$ satisfy a set of closely related inequalities in information theory [19]. In particular, the inequality that connects the two quantities and is most closely related to the phenomena studied in VAEs is the uncertainty inequality [8]:
$$N(X)\, J(X) \ge 1, \tag{3}$$
where equality holds when $X$ is a Gaussian variable. Note that this inequality has several versions; when $X$ is a random vector, the Fisher information term is replaced with the determinant of the Fisher information matrix (see [8]).
For a fixed random variable, the product of the Fisher information and the entropy power is a constant greater than or equal to 1, whose value depends on the form of the distribution, the dimension of the random vector, etc. We can formulate this property as follows:
$$N(X)\, J(X) = K, \tag{4}$$
where $K$ is a constant and $K \ge 1$. Eq. (3) and (4) indicate that Fisher information and Shannon information are complementary measures that should both be considered in the analysis.
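The uncertainty relation can be checked numerically on simple densities. The following is a minimal sketch, assuming the standard one-dimensional definitions $J = \int f'(x)^2 / f(x)\,dx$, $H = -\int f \log f\,dx$ and $N = e^{2H} / (2\pi e)$; the helper names, the integration ranges, and the choice of a logistic density as the non-Gaussian example are illustrative assumptions, not part of the original paper.

```python
import math

def fisher_and_entropy(pdf, lo, hi, n=20001):
    """Riemann-sum estimates of J = int f'(x)^2 / f(x) dx and H = -int f log f dx."""
    h = (hi - lo) / (n - 1)
    fs = [pdf(lo + i * h) for i in range(n)]
    fisher, entropy = 0.0, 0.0
    for i in range(1, n - 1):
        f = fs[i]
        if f <= 0.0:
            continue  # skip points where the density underflows to zero
        df = (fs[i + 1] - fs[i - 1]) / (2.0 * h)  # central finite difference
        fisher += df * df / f * h
        entropy -= f * math.log(f) * h
    return fisher, entropy

def entropy_power(entropy):
    """Entropy power N = exp(2H) / (2*pi*e) of a one-dimensional variable."""
    return math.exp(2.0 * entropy) / (2.0 * math.pi * math.e)

def gaussian_pdf(x, sigma=2.0):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def logistic_pdf(x, s=1.0):
    # Logistic density written with cosh for numerical stability.
    return 1.0 / (4.0 * s * math.cosh(x / (2.0 * s)) ** 2)

def uncertainty_product(pdf, lo, hi):
    fisher, entropy = fisher_and_entropy(pdf, lo, hi)
    return entropy_power(entropy) * fisher

gauss_product = uncertainty_product(gaussian_pdf, -30.0, 30.0)
logistic_product = uncertainty_product(logistic_pdf, -40.0, 40.0)
print(gauss_product, logistic_product)
```

Under these assumptions the Gaussian product should sit at the lower bound of 1 up to discretization error, while the logistic product should come out slightly above it, which is the behavior the uncertainty inequality describes.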
3.2 Fisher-Shannon Information Plane
To analyze these two information quantities jointly, an information plane based on the Fisher information and the Shannon entropy power is proposed in [23] as follows:
$$\mathcal{D} = \{ (N, J) : N \ge 0,\ J \ge 0,\ N \cdot J \ge 1 \}, \tag{5}$$
where $\mathcal{D}$ denotes the admissible region of this plane, which is bounded by the Gaussian case. The plane consists of several Fisher-Shannon (FS) curves, each of which characterizes a random variable under a given form of distribution through its location in the plane.
The FS plane is connected to VAEs by analyzing the latent variable from the perspective of both Fisher information and Shannon information. This is because the Fisher information is closely related to the likelihood estimation of VAEs, while the Shannon entropy power reflects the dependence between data and latent variables. As shown in Eq. (4), when the random variable is fixed, the product of the Fisher information and the Shannon entropy power is a constant, so the quality of likelihood estimation must be traded off against the dependence between data and latent variables. A natural solution is to control the Fisher information and the Shannon entropy power according to the application of the VAE. Thanks to the duality between Fisher information and Shannon information, it is sufficient to control one of them in order to control the other. In this paper, we propose a family of VAEs that control the Fisher information, named the Fisher AutoEncoder (FAE). The methods and properties of FAE are discussed in the next section.
4 The Fisher AutoEncoder
As shown in the previous section, one can control either the Fisher information or the Shannon entropy power to manage the trade-off between the likelihood estimation and the dependence between the data and the latent code $z$. In this section, we present a family of VAEs that takes advantage of the Fisher information, hence named the Fisher AutoEncoder (FAE), and analyze its characteristics.
4.1 Fisher Information Control in VAE
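The Fisher AutoEncoder aims to control the amount of Fisher information in the objective. The goal thus becomes to maximize the evidence lower bound (ELBO) under a constraint on the Fisher information, and we reformulate the VAE objective as follows:
$$\max_{\theta, \phi}\ \mathbb{E}_{\hat{p}(x)} \Big[ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) \Big] \quad \text{s.t.} \quad J\big(q_\phi(z|x)\big) = F_z,\ \ J\big(p_\theta(x|z)\big) = F_x, \tag{6}$$
where $F_z$ and $F_x$ are positive constants that denote the desired Fisher information values. A large value of $F_z$ (resp. $F_x$) implies that we favor precise parameter estimation for the encoder (resp. the decoder), while a low value of $F_z$ (resp. $F_x$) indicates that we weaken the parameter estimation to increase the Shannon information, so that the model can leverage the dependence of the latent codes on the data.
To solve the constrained problem in Eq. (6), we convert it into a Lagrangian objective:
$$\mathcal{L}_{\mathrm{FAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) - \lambda_z \big| J\big(q_\phi(z|x)\big) - F_z \big| - \lambda_x \big| J\big(p_\theta(x|z)\big) - F_x \big|, \tag{7}$$
where $\mathcal{L}_{\mathrm{ELBO}}$ denotes the objective maximized in Eq. (6), and $\lambda_z$ and $\lambda_x$ are positive constants that control the regularizers. With this objective, we can steer the Fisher information in the encoder/decoder toward the desired values $F_z$ and $F_x$. In most cases, the calculation of Fisher information is not difficult: it can be estimated directly from its definition.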
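As an illustration of estimating Fisher information "directly by definition" and plugging it into a penalized objective, here is a minimal sketch. It assumes a one-dimensional Gaussian encoder, the nonparametric definition $J = \mathbb{E}[(\partial \log q(z) / \partial z)^2]$, and an absolute-deviation penalty on each Fisher information term; the function names `gaussian_fisher_mc` and `fae_objective` and all numeric values are hypothetical and are not taken from the paper.

```python
import random

def gaussian_fisher_mc(mu, sigma, num_samples=200_000, seed=0):
    """Monte Carlo estimate of J = E[(d/dz log q(z))^2] for q = N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        score = -(z - mu) / sigma ** 2  # derivative of log q(z) with respect to z
        total += score * score
    return total / num_samples

def fae_objective(elbo, fisher_enc, fisher_dec, f_z, f_x, lam_z, lam_x):
    """ELBO minus absolute-deviation penalties on Fisher information (assumed form)."""
    return elbo - lam_z * abs(fisher_enc - f_z) - lam_x * abs(fisher_dec - f_x)

# Estimated encoder Fisher information; the closed form for a Gaussian is 1 / sigma^2.
j_hat = gaussian_fisher_mc(mu=0.3, sigma=0.5)
print(j_hat)

# Hypothetical ELBO value and targets, with only the encoder term active (lam_x = 0).
print(fae_objective(elbo=-90.0, fisher_enc=j_hat, fisher_dec=1.0,
                    f_z=4.0, f_x=1.0, lam_z=0.1, lam_x=0.0))
```

For a Gaussian the estimate can be compared against the closed form $1/\sigma^2$, which makes this a convenient sanity check before applying the same estimator to distributions without a closed form.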
4.2 Characteristic of FAE: an example of FI regularization in Gaussian encoder
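Here we give an FAE example that only controls the Fisher information in the encoder, which means we set $\lambda_x$ in Eq. (7) to zero. In this model, we assume that all random variables are of dimension 1 for simplicity of presentation. The FAE objective is formulated as:
$$\mathcal{L}_{\mathrm{FAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)} \Big[ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - \mathcal{R}(\phi; \lambda_z, F_z) \Big], \qquad \mathcal{R}(\phi; \lambda_z, F_z) = D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) + \lambda_z \big| J\big(q_\phi(z|x)\big) - F_z \big|, \tag{8}$$
where $\mathcal{R}$ is a generalized regularizer that accounts for the control of Fisher information. As in the VAE [9], both the prior distribution $p(z)$ and the posterior approximation $q_\phi(z|x)$ are Gaussian, so the KL divergence can be computed analytically as:
$$D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) = \frac{1}{2} \left( \mu^2 + \sigma^2 - \log \sigma^2 - 1 \right), \tag{9}$$
where $\mu$ and $\sigma$ are respectively the mean and standard deviation of the Gaussian posterior. The Fisher information can be easily computed from its definition:
$$J\big(q_\phi(z|x)\big) = \frac{1}{\sigma^2}. \tag{10}$$
Finally, putting Eq. (9) and Eq. (10) together, we obtain the following regularizer $\mathcal{R}$:
$$\mathcal{R}(\phi; \lambda_z, F_z) = \frac{1}{2} \left( \mu^2 + \sigma^2 - \log \sigma^2 - 1 \right) + \lambda_z \left| \frac{1}{\sigma^2} - F_z \right|. \tag{11}$$
Considering only the KL divergence term of the original VAE [9], the optimum with respect to the variance is reached at $\sigma^2 = 1$, which aligns the posterior with the standard normal distribution $\mathcal{N}(0, 1)$. In Eq. (11), however, the variance is additionally penalized by the desired Fisher information value $F_z$, which pushes the variance toward zero when $F_z$ is large, or makes the variance larger than 1 when $F_z$ is chosen to be small.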
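The pull of the Fisher information target on the encoder variance can be seen with a small grid search. This is a sketch under assumptions: a one-dimensional Gaussian encoder with zero mean, a regularizer of the form KL plus $\lambda\,|1/\sigma^2 - F|$, and hypothetical names `fae_regularizer` and `best_variance`; the grid and the values of $\lambda$ and $F$ are chosen only for illustration.

```python
import math

def fae_regularizer(mu, var, lam, f_target):
    """KL(N(mu, var) || N(0, 1)) plus lam * |1/var - f_target| for a 1-D Gaussian encoder."""
    kl = 0.5 * (mu * mu + var - math.log(var) - 1.0)
    return kl + lam * abs(1.0 / var - f_target)

def best_variance(lam, f_target, mu=0.0):
    """Grid search over variances 0.01, 0.02, ..., 5.00 for the minimizer of the regularizer."""
    grid = [0.01 * k for k in range(1, 501)]
    return min(grid, key=lambda v: fae_regularizer(mu, v, lam, f_target))

print(best_variance(lam=0.0, f_target=1.0))   # plain KL term only
print(best_variance(lam=1.0, f_target=10.0))  # large desired Fisher information
print(best_variance(lam=1.0, f_target=0.5))   # small desired Fisher information
```

With the penalty switched off the minimizer sits near a variance of 1, a large target moves it toward a small variance, and a small target moves it above 1. Note that for small penalty weights the regularizer is not convex in the variance, so a gradient-based optimizer may behave differently from this exhaustive search.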
The discussion above analyzes how FAE controls the variance. This property corresponds to the Cramér-Rao inequality, from which the uncertainty principle in Eq. (3) can be derived [8]. Given a stochastic variable $X$ with mean $\mu$ and variance $\sigma^2$, the inverse of the Fisher information is a lower bound on the variance of an unbiased estimate:
$$\sigma^2 \ge \frac{1}{J(X)}, \tag{12}$$
where equality holds when the stochastic variable is Gaussian, as in Eq. (10). This inequality gives a first impression of the characteristics of Fisher information: when the Fisher information (FI) is low, the variance of the estimate is forced to be high, which causes larger uncertainty in the model estimation. Thus, we need to enlarge the FI to make the variance more controllable.
Another important property of Fisher information is revealed by its definition: it captures the variability of the gradient. For a distribution with high variability, we intuitively expect estimation of the parameter to be easier. A higher value of Fisher information helps ensure that the gradient is larger during learning.
4.3 Connection to the Mutual AutoEncoder
In this section, we present the Mutual AutoEncoder (MAE) of Phuong et al. [16], which is representative of the family of VAEs that leverage Shannon information, and discuss the connection between FAE and MAE.
MAE controls the mutual information between the latent variable $z$ and the data $x$ as follows:
$$\mathcal{L}_{\mathrm{MAE}}(\theta, \phi) = \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) - \lambda \big| I(x; z) - M \big|, \tag{13}$$
where $\mathcal{L}_{\mathrm{ELBO}}$ is the evidence lower bound, and $\lambda$ and $M$ are positive constants that respectively control the information regularization and the desired mutual information. Since the mutual information is difficult to compute directly, it is bounded using the Gibbs inequality [2]:
$$I(x; z) = H(z) - H(z|x) \ge H(z) + \mathbb{E}_{p(x, z)} \big[ \log r_\omega(z|x) \big], \tag{14}$$
where $r_\omega(z|x)$ is a parametric distribution that can be modeled by a network. Eq. (14) controls the mutual information by controlling the conditional entropy:
$$H(z|x) \le -\mathbb{E}_{p(x, z)} \big[ \log r_\omega(z|x) \big]. \tag{15}$$
Thus, when a high value of mutual information is desired, maximizing $\mathbb{E}_{p(x, z)}[\log r_\omega(z|x)]$ minimizes the conditional entropy, and vice versa. Correspondingly, when we expect a high value of Fisher information in the estimation of $z$, the conditional entropy is also reduced, as shown in Eq. (4), and vice versa. Therefore, FAE can also control the mutual information by setting a proper desired value of Fisher information in the encoder.
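The variational bound on mutual information can be verified in closed form on a toy linear-Gaussian pair. The sketch below assumes $x \sim \mathcal{N}(0, \mathrm{var}_x)$, $z \mid x \sim \mathcal{N}(a x, s^2)$ and a variational conditional $r(z \mid x) = \mathcal{N}(b x, t^2)$; this setup and the names `cross_entropy` and `mi_lower_bound` are assumptions for illustration and are not the MAE implementation.

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a one-dimensional Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def cross_entropy(a, s, var_x, b, t):
    """-E[log r(z|x)] for r(z|x) = N(b*x, t^2) when z | x ~ N(a*x, s^2) and E[x] = 0."""
    expected_sq_err = s * s + (a - b) ** 2 * var_x  # E[(z - b*x)^2]
    return 0.5 * math.log(2.0 * math.pi * t * t) + expected_sq_err / (2.0 * t * t)

def mi_lower_bound(a, s, var_x, b, t):
    """H(z) + E[log r(z|x)], a lower bound on I(x; z) by the Gibbs inequality."""
    h_z = gaussian_entropy(a * a * var_x + s * s)
    return h_z - cross_entropy(a, s, var_x, b, t)

a, s, var_x = 1.5, 0.7, 2.0
true_mi = gaussian_entropy(a * a * var_x + s * s) - gaussian_entropy(s * s)
tight = mi_lower_bound(a, s, var_x, b=a, t=s)      # r equals the true conditional
loose = mi_lower_bound(a, s, var_x, b=1.0, t=1.0)  # mismatched r
print(true_mi, tight, loose)
```

The bound is tight when the variational conditional matches the true one and falls below the true mutual information otherwise, which is why maximizing the expected log-density of the variational conditional lowers the upper bound on the conditional entropy.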
5 Experiments
In this section, we perform a range of experiments to investigate the Fisher-Shannon effects in VAEs. We also show the encoding effects of Fisher information control in the Fisher AutoEncoder.
5.1 Experiment Goals and Experimental Settings
As discussed, the entropy power and the Fisher information correspond to different characteristics. We therefore aim to explore these characteristics and provide a corresponding analysis, in order to give a potential guide for Fisher-Shannon (FS) conditioning. The specific goals are summarized as:

Explore the characteristics of several VAE variants in the FS plane.

Explore the characteristics of the latent code w.r.t. the Fisher information and the entropy power.

Discuss the use of different FI values in VAE encoding.
The experiments are conducted on the MNIST dataset [12] and the SVHN dataset [15]. The first dataset consists of ten categories of 28×28 handwritten digits and is binarized as in [11]. We follow the original partition and split the data into 50,000/10,000/10,000 samples for training, validation and test. The SVHN dataset is MNIST-like but consists of 32×32 color images. We use 73,257 digits for training, 26,032 digits for testing, and 20,000 samples from the additional set [15] for validation. Moreover, we construct a toy dataset from MNIST data to better illustrate the characteristics of FI and entropy power. Its training set consists of 5,800 samples of label "0" and 100 samples of each of the other labels; the validation and test sets remain the same as in MNIST.
In FAE and its baselines, all random variables are assumed to be Gaussian. Here we are only concerned with the hyperparameter that adjusts the penalty on Fisher information in encoding. In practice, we observe that this value is effective when set between 0.01 and 10 (depending on the dataset). For both the inference network and the generative network, we deploy a 5-layer network. Since fully-connected and convolutional architectures do not differ much in the experiments, we present results using 5 fully-connected layers of dimension 300. The latent code has dimension 40.
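The construction of the imbalanced toy training set can be sketched as an index-selection step. This is an illustrative sketch only: the function name `imbalanced_indices`, the random seed, and the stand-in label list are hypothetical, and the real MNIST labels would be substituted for `labels`.

```python
import random
from collections import Counter

def imbalanced_indices(labels, major_label=0, n_major=5800, n_minor=100, seed=0):
    """Pick n_major indices of major_label and n_minor indices of every other label."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    chosen = []
    for label, idxs in sorted(by_label.items()):
        quota = n_major if label == major_label else n_minor
        # random.sample raises ValueError if a class has fewer examples than its quota.
        chosen.extend(rng.sample(idxs, quota))
    return chosen

# Stand-in labels: 6,000 examples per digit (hypothetical, not the real MNIST labels).
labels = [digit for digit in range(10) for _ in range(6000)]
subset = imbalanced_indices(labels)
counts = Counter(labels[i] for i in subset)
print(counts)
```

Sampling indices rather than copying images keeps the step independent of how the image data is stored.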
5.2 Quantitative Results
Fisher-Shannon Plane Analysis
In this part, we conduct a series of experiments on different models and evaluate them in the Fisher-Shannon plane to show the different characteristics of Fisher information and entropy power.
We first evaluate the log-likelihood. To compute the test marginal negative log-likelihood (NLL), we apply importance sampling [5] with 5,000 data points for the previously mentioned models. The results are shown in Table 1. When the desired Fisher information $F_z$ in FAE (or the desired mutual information $M$ between data and latent variable in MAE) is large, the models achieve results competitive with state-of-the-art models such as pixelVAE [22] and IAF [10]. When the desired information $F_z$ or $M$ is set to zero, the results are slightly better than the plain VAE but less competitive than the former models. This is because, when the latent code is set to be less informative, the posterior approximation aligns with the prior and the KL term approaches zero. The model can thus focus more on optimizing the reconstruction error, which results in better performance than the plain VAE. However, since the decoder is not as powerful as that of pixelVAE, the results are not comparable.
Table 1: Test NLL of the compared models.
Model | VAE [9] | pixelVAE [21] | IAF [10]
NLL | 85.56 | 79.21 | 79.85
Model | MAE () | MAE () | FAE () | FAE ()
NLL | 80.86 | 81.58 | 79.30 | 83.24
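The importance-sampling estimate of the marginal log-likelihood can be sketched on a toy model where the exact answer is known. The sketch assumes a one-dimensional model with prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x \mid z) = \mathcal{N}(z, 1)$, whose marginal is $\mathcal{N}(0, 2)$, and a Gaussian proposal; the model, the proposal and the helper names are illustrative assumptions rather than the networks used in the experiments.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def is_log_likelihood(x, q_mean, q_var, num_samples=5000, seed=0):
    """Importance-sampling estimate of log p(x) for p(z) = N(0, 1), p(x|z) = N(z, 1)."""
    rng = random.Random(seed)
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        log_weights.append(log_joint - log_normal_pdf(z, q_mean, q_var))
    return log_mean_exp(log_weights)

x = 1.2
exact = log_normal_pdf(x, 0.0, 2.0)  # the marginal of this toy model is N(0, 2)
estimate = is_log_likelihood(x, q_mean=x / 2.0, q_var=1.0)
print(exact, estimate)
```

When the proposal equals the true posterior, here $\mathcal{N}(x/2, 1/2)$, every importance weight equals $p(x)$ and the estimate is exact; a mismatched proposal gives an estimate that approaches the exact value as the number of samples grows.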
We place the former models in the FS information plane and draw the "NJ curve" of the Gaussian variable (where $N \cdot J = 1$) in the left subfigure of Figure 1. The illustration shows the complementary effects of Fisher information and entropy power in VAEs: when the Fisher information rises, the corresponding entropy power falls, and vice versa. When the dependence between data and latent code is higher, which means we obtain a better parameter estimation, the corresponding models appear in the bottom-right corner of the FS plane. On the contrary, the models that contain less information in the latent code appear in the upper-left corner. It is also interesting to notice that the inverse autoregressive flow VAE [10] lies beyond the curve. This is because IAF transforms the posterior from a simple Gaussian distribution into a more complex distribution. This phenomenon suggests that Fisher information and entropy power can be improved at the same time by using a more complex distribution in modeling. From this plane, we learn that we can control the Fisher information of the latent variable and "move" along this curve. The upper-left corner indicates a less informative code, while the bottom-right corner indicates a more informative code. Which side to approach depends on the goal of the task in the real world.
In the right subfigure of Figure 1, the training process of an FAE is also visualized in the FS plane, where we plot the location of the model at different epochs. The training process moves along the "NJ curve" from the upper-left side to the bottom-right side. In fact, for most models the goal is to move further into the bottom-right corner, so that we gain better knowledge about the data. Controlling the Fisher information affects how far we can move to the right along this curve. In the next part, we discuss when we need to move to the bottom-right and when we need to stay in the upper-left.
Effects of Fisher Information and Entropy Power
The previous part discusses the characteristics of different models and their corresponding performance. It remains to understand the effects of Fisher information and entropy power in encoding: when should we keep a larger Fisher information, and when a larger entropy power? In this part, we show the different effects obtained when varying the Fisher information and the entropy power.
We set the latent code size to 10 and gradually increase the desired Fisher information value $F_z$ in FAE (from 0 to 20). The embedding of the latent variable $z$ is visualized with t-SNE [13]; the distribution of $z$ is visualized by sampling from the encoder distribution and counting the norm of the normalized $z$. The results are presented in Figure 2.
In Figure 2, from left to right, the entropy power increases while the Fisher information decreases. As the entropy power increases, the latent variable embedding spreads out to cover a larger area of the latent space, whereas as the Fisher information increases, the embedding becomes more compact. The distribution of $z$ is more concentrated when the Fisher information is larger, while it swells like a balloon when the Fisher information is small.
As discussed in section 4.2, the Fisher information controls the variance of the encoding distribution. Figure 2 shows that the variance of the distribution becomes smaller as FI increases. Intuitively, when VAEs encode the data with a large Fisher information, the encoding space is compressed to be smaller, so the hashing cost is smaller for the decoder and the model can reach a better likelihood. On the contrary, when we need to enlarge the encoding space, for example to cluster data points, we can set a larger entropy power (or a smaller Fisher information). This ensures that data points with similar attributes are placed together.
In brief, we summarize the characteristics of large Fisher information and large entropy power:

Large Fisher information provides a more refined encoding mechanism, which ensures that the latent code contains more information.

Large entropy power provides a coarser encoding mechanism, which absorbs similar attributes so that data points form clusters.
Therefore, larger Fisher information is helpful for disentangled feature learning, high-quality reconstruction, etc., while larger entropy power is helpful for generalization, data compression, etc.
5.3 Qualitative Evaluation
In this section, we present some qualitative results to provide an intuitive visualization. This helps us better understand the characteristics of Fisher information.
We present reconstruction samples of FAE with large and small desired Fisher information $F_z$ in Figure 7. As shown, the samples reconstructed with large $F_z$ provide more details and are sharper. This is especially noticeable in the case of SVHN, where we can observe clearer texture than in the samples reconstructed with small $F_z$. In contrast, some samples reconstructed with small $F_z$ on MNIST are blurry, and the blur is more obvious on SVHN.
We also present qualitative results on disentangled feature learning in the latent variable in Figure 4. We train two FAE models, one with a large and one with a small desired Fisher information value $F_z$; we then reconstruct samples by traversing the latent variable and visualize the corresponding results. The traversal of the 10 dimensions of the latent variable is over the [-10, 10] range. As shown, the FAE with large $F_z$ learns a better disentangled representation of the generative factors. The latent variable is encoded with a more refined mechanism, in which we can distinguish the characteristic that each dimension controls. For example, the first row shows the variation of the width of the digit zero, the second row shows the variation of the inclination of the digits, etc. However, with the small $F_z$, which indicates a large entropy power in the latent variable, the disentangled representation is less evident than in the former case. This reflects the characteristic discussed in the previous sections: larger entropy power helps the model absorb similar attributes to form a high-level summary of the data.
5.4 Generalization of using FAE
In this section, we conduct a series of experiments on the reversed MNIST dataset (which mostly consists of data with label "0", as described in section 5.1) to describe a scenario in which we should constrain the Fisher information to obtain a larger entropy power.
We train FAE on this dataset with different values of the desired Fisher information $F_z$. The resulting log-likelihoods are presented in Table 2. According to the table, the train NLL decreases as $F_z$ increases, while the test NLL increases. This indicates that the model suffers from over-fitting on this dataset.
The reconstructed samples are presented in Figure 5. When the Fisher information is large, the model tends to fit the data well. However, since most of the training data are the digit zero, the model mainly captures the attributes of the digit "0". In test cases, the model reconstructs samples in a way that contains some attributes of zero. As we constrain the Fisher information, the larger entropy power gives the model a better generalization capacity to overcome the over-fitting.
Table 2: Train and test NLL of FAE on the toy dataset.
Model | FAE () | FAE () | FAE () | FAE ()
Train NLL | 94.48 | 95.12 | 95.89 | 96.67
Test NLL | 164.56 | 134.94 | 109.19 | 108.37
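The over-fitting trend can be made explicit by computing the gap between test and train NLL for each column of the table above. The snippet below is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# NLL values copied from the table, in column order (left to right).
train_nll = [94.48, 95.12, 95.89, 96.67]
test_nll = [164.56, 134.94, 109.19, 108.37]

# Generalization gap per setting: test NLL minus train NLL.
gaps = [round(te - tr, 2) for tr, te in zip(train_nll, test_nll)]
print(gaps)
```

The gap shrinks from left to right, from roughly 70 nats to under 12, while the train NLL rises by only about 2 nats, which matches the reading that constraining the Fisher information trades a small loss in fit for much better generalization on this dataset.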
5.5 Discussion: How to Benefit from FS Information
From the previous parts, one can learn that larger Fisher information helps with precise distribution estimation, disentangled feature learning, and overcoming over-regularization, while larger entropy power leaves more uncertainty in the model for generalization and makes it possible to handle over-fitting.
In the real world, when we train a model on a dataset, it is important to first investigate whether the given dataset covers all kinds of situations. When the dataset is believed to be complete, or to be representative of future data, a larger Fisher information can be preferred in the model to refine the learning. Otherwise, we should prepare the model for uncertainty in the future, so that it gains a better generalization capacity to handle the "surprisal".
6 Conclusion
Based on the information inequality between Fisher information and Shannon information, in this paper we apply the Fisher-Shannon plane to study VAEs with respect to likelihood estimation and latent representation learning, which are the two perspectives of the VAE task. Our study of VAEs in the Fisher-Shannon plane shows that the trade-off between Fisher information and Shannon information affects the encoding mechanism and results in different VAE characteristics. To connect these variants, we propose the Fisher AutoEncoder for encoding control. In our experiments, we demonstrate the complementary characteristics of Fisher information and entropy power and give a comprehensible guide to information encoding in VAEs: when the dataset relates past and future data, the model can require a large Fisher information for refined encoding to learn more disentangled features; otherwise, the model should constrain the Fisher information to gain a better generalization capacity for handling the uncertainty in future data.
Appendix A Nonparametric Fisher information
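The Fisher information formulated in section 3.1 is a nonparametric version. In its original definition, the Fisher information of a density $f(x; \theta)$ with parameter $\theta$ is formulated as [19]:
$$J(\theta) = \int \frac{1}{f(x; \theta)} \left( \frac{\partial f(x; \theta)}{\partial \theta} \right)^2 dx, \tag{16}$$
where $\theta$ can be multidimensional. According to [19], if $f(x; \theta)$ only depends on $x - \theta$, for example the Gaussian distribution w.r.t. its mean $\mu$, then $\theta$ can be dropped from $J(\theta)$ to obtain a nonparametric version:
$$J(X) = \int \frac{1}{f(x)} \left( \frac{\partial f(x)}{\partial x} \right)^2 dx. \tag{17}$$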
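The step from the parametric to the nonparametric form for a location family can be written out explicitly. The following short derivation is a sketch that assumes the density has the form $f(x; \theta) = g(x - \theta)$ for a differentiable, strictly positive $g$; the symbol $g$ is introduced here for illustration only.

```latex
\begin{aligned}
f(x;\theta) &= g(x-\theta)
  \;\Longrightarrow\;
  \frac{\partial f(x;\theta)}{\partial \theta} = -\,g'(x-\theta), \\
J(\theta) &= \int \frac{\big(g'(x-\theta)\big)^2}{g(x-\theta)}\,dx
  = \int \frac{\big(g'(u)\big)^2}{g(u)}\,du
  \qquad (u = x-\theta).
\end{aligned}
```

After the change of variable, the integral no longer depends on the parameter, so the parametric Fisher information of a location family coincides with the Fisher information obtained by differentiating the density with respect to the random variable itself.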
In reality, we can regard the distribution parameter $\theta$ as a variable and formulate the parametric distribution as a posterior:
(18)
Then we can manipulate the prior of $\theta$ to find a distribution that satisfies:
(19)
In this way, we transform a distribution from the parametric version into the nonparametric version. According to [19], this transformation needs to satisfy three conditions:

$f(x)$ is strictly positive

$f(x)$ is differentiable

The integral (17) exists
Appendix B Minor Characteristic: Stability in Learning
In terms of optimization for VAEs, the stability of parameter estimation is also a concern [10, 24], and it reflects a minor characteristic of using Fisher information. According to [25], instability of the parameters leads to degeneration when the VAE architecture is deep, and this is linked to a loss of Fisher information. Here, we show how the Fisher information regularizer helps VAEs resist degeneration.
Specifically, we extend the encoder and the decoder to 20, 30 and 40 layers (marked as 20L, 30L, 40L) to observe the performance of FAE and MAE in the context of degeneration. The performance is illustrated in Figure 6. As can be seen, the latent space and the reconstruction of FAE remain similar to those in the normal setting as the network goes deeper. Fisher information captures the variability of the gradient and can alleviate degeneration in very deep networks. In contrast, although MAE can also control information in the encoding mechanism, it is at risk of vanishing gradients. This causes inaccurate parameter estimation and results in degeneration of the latent space and the reconstruction.
Appendix C Supplementary experiment results on SVHN
Footnotes
Note that we follow the nonparametric definition of Fisher information, which differentiates with respect to the random variable.
References
[1] Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. 2018.
[2] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press, 2003.
[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.50.
[4] Nicolas Brunel and Jean-Pierre Nadal. Mutual information, Fisher information, and population coding. Neural Computation, 1998.
[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[6] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. International Conference on Learning Representations, 2017.
[7] Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. International Conference on Learning Representations, 2018.
[8] A. Dembo, T. M. Cover, and J. A. Thomas. Information theoretic inequalities. IEEE Transactions on Information Theory, 37(6):1501–1518, Nov 1991. ISSN 0018-9448. doi: 10.1109/18.104312.
[9] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
[10] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[11] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
[13] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[14] M. T. Martin, F. Pennini, and A. Plastino. Fisher's information and the analysis of complex signals. Physics Letters A, 256(2-3):173–180, 1999.
[15] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
 Mary Phuong, Max Welling, Nate Kushman, Ryota Tomioka, and Sebastian Nowozin. The mutual autoencoder: Controlling information in latent code representations, 2018. URL https://openreview.net/forum?id=HkbmWqxCZ.
 Osvaldo Anibal Rosso, Felipe Olivares, and Angelo Plastino. Noise versus chaos in a causal fishershannon plane. Papers in physics, 7:070006, 2015.
 E Schrödinger. About heisenberg uncertainty relation. Proc. Prussian Acad. Sci. Phys. Math. XIX, 293, 1930.
 A.J. Stam. Some inequalities satisfied by the quantities of information of fisher and shannon. Information and Control, 2(2):101 – 112, 1959. ISSN 00199958. doi: https://doi.org/10.1016/S00199958(59)903481. URL http://www.sciencedirect.com/science/article/pii/S0019995859903481.
 Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkL7n10b.
 Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. 2016a. URL http://arxiv.org/abs/1601.06759.
 Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. 2016b. URL http://arxiv.org/abs/1606.05328.
 C. Vignat and J.F. Bercher. Analysis of signals in the fishershannon information plane. Physics Letters A, 312(1):27 – 33, 2003. ISSN 03759601. doi: https://doi.org/10.1016/S03759601(03)00570X. URL http://www.sciencedirect.com/science/article/pii/S037596010300570X.
 Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv:1706.02262, 2017.
 H. Zheng, J. Yao, Y. Zhang, and I. W. Tsang. Degeneration in VAE: in the Light of Fisher Information Loss. ArXiv eprints, February 2018.