# Learning in Variational Autoencoders with Kullback-Leibler and Rényi Integral Bounds

###### Abstract

In this paper we propose two novel bounds for the log-likelihood, based on the Kullback-Leibler and Rényi divergences, which can be used for variational inference and in particular for the training of Variational AutoEncoders. Our proposal is motivated by the difficulties encountered in training VAEs on continuous datasets with high contrast images, such as those with handwritten digits and characters, where numerical issues often appear unless noise is added, either to the dataset during training or to the generative model given by the decoder. The new bounds we propose, which are obtained from the maximization of the likelihood of an interval for the observations, allow numerically stable training procedures without the need to add any extra source of noise to the data.

## 1 Introduction

Variational AutoEncoders (VAEs) (Kingma & Welling, 2014), (Rezende et al., 2014) are a class of generative models that provide a powerful approach to statistical inference with complex probabilistic models within the variational Bayesian framework. Like any autoencoder, a VAE is composed of an encoder and a decoder. The encoder maps the high-dimensional input data into a lower-dimensional latent space through a probabilistic model $q_\phi(z|x)$. The decoder does the reverse, by modeling a conditional probability distribution that generates the reconstructed input as its output. The encoder and the decoder are trained jointly to provide reconstructed images of high quality, under the constraint that the latent variable is distributed according to a chosen prior on the latent space. After training, a VAE can act as a generative model, by sampling from the prior distribution in the latent space and then decoding these samples. Both the decoder and the encoder conditional probabilities are parameterized by neural networks and are generally optimized by maximizing a lower bound on the model evidence, known in the literature as the ELBO (Jordan et al., 1999).

Recently, three main research directions have been pursued to provide better learning and, thus, better performance in VAEs: increasing the complexity of the approximate posterior, developing more accurate lower bounds on the log-likelihood as training objectives, and adding noise at different stages of the algorithmic framework. The performance of a VAE can be improved through the use of more complex models for the approximate posterior: normalizing flows (Rezende & Mohamed, 2015), (Rezende & Mohamed, 2016) and products of conditional distributions (Sønderby et al., 2016). The latter is employed both in the model of the approximate posterior distribution, $q_\phi(z|x)$, and in that of the generative distribution, $p_\theta(x|z)$.

In the context of developing bounds on the model evidence, the original ELBO, used as the objective function to train a VAE, is derived using the Kullback-Leibler divergence between the true and approximate posteriors (Kingma & Welling, 2014). This bound is tightened in (Burda et al., 2016) through an importance weighted unbiased estimator of the marginal distribution, $p_\theta(x)$, and its gradient estimator is given in (Mnih & Rezende, 2016). The state-of-the-art results provided by the resulting importance weighted autoencoder (IWAE) are enhanced through several linear combinations of VAE and IWAE (Rainforth et al., 2018). In addition, the authors of (Rainforth et al., 2018) investigate under which conditions tighter bounds improve the learning and generative capabilities of a VAE.

Another extension of the ELBO is derived starting from the Rényi $\alpha$-divergence between the true and approximate posteriors (Li & Turner, 2016). Its importance weighted estimator is given in (Webb & Teh, 2016). With the $\chi$-divergence, an upper bound, termed CUBO, is proposed in (Dieng et al., 2017), and the gap between this newly introduced upper bound and the original ELBO becomes the training objective.

In addition to developing more accurate bounds, learning in VAEs and, in general, in autoencoders is also facilitated by adding noise. Stochastic noise corruption, applied to the input data, makes it possible to train denoising autoencoders and to generate samples of high visual quality (Vicent et al., 2010), (Bengio et al., 2013). Bit-flip and drop-out noise added to the input layer of the encoder of a VAE, and Gaussian noise added to the samples given by the encoder, are fundamental to perform inference in the case of new data inputs (Rezende et al., 2014). By adding noise to the latent model and by proposing a new tractable bound, called the denoising variational lower bound, the authors of (Im et al., 2017) obtain improved performance over VAE and IWAE. In particular, they transform the approximate posterior using a noise distribution and thus obtain a more complex model of the approximate posterior based on a Gaussian mixture model.

With a plethora of possible approaches to choose from in order to successfully train a VAE on continuous datasets and obtain good reconstructions, we feel the need for a deeper understanding and a thorough study of the role of the bounds on the model evidence and of the noise added at different steps in the training procedure. This work-in-progress paper presents our first steps in this direction and focuses on the analysis of training VAEs on the continuous MNIST and OMNIGLOT datasets.

### 1.1 Problem Statement

This paper stems from the difficulties encountered when training a VAE for continuous images with high contrast, such as the handwritten digits and characters from the MNIST and OMNIGLOT datasets, where the majority of the pixel values are concentrated close to $0$ or $1$. In the last layer of the decoder, we used the sigmoid activation function for the mean and the exponential activation function for the standard deviation, to ensure its positivity, and a Gaussian independence model for the generative distribution. The standard ELBO was the objective function used for training.

During the learning process, we observed that many of the standard deviations given by the decoder become very close to $0$, as shown in Figure 3. This phenomenon results in the densities $p_\theta(x|z)$ growing to very large values, which implies that the reconstruction term of the ELBO becomes much larger than the KL term with the prior. This creates an imbalance between the two terms composing the bound and destabilizes the optimization procedure (Figures 1 and 2). In other words, the reward for shrinking the standard deviations is much larger than the penalty for increasing the Kullback-Leibler term of the bound, which could potentially impact the matching of the prior distribution. As a solution to this problem, in the maximum likelihood framework, we propose to maximize the likelihood of intervals for the observations, instead of the standard likelihood.

Our contribution is the derivation of novel integral lower bounds on the model evidence, which take into account that in any stochastic model probabilities are bounded by $1$, whereas densities can tend to infinity. In order to mitigate the above-mentioned issues, we construct a learning objective based on maximizing the likelihood of an interval for the observations, instead of maximizing the standard likelihood. The new learning objective represents a lower bound on the model evidence. We derive this lower bound in two forms: one starting from the Kullback-Leibler divergence between the true and approximate posterior distributions and another one from the Rényi $\alpha$-divergence between the same distributions. We provide proof-of-concept results for the training algorithms constructed with these novel bounds.
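The contrast between unbounded densities and bounded interval probabilities can be illustrated with a minimal sketch for a single Gaussian pixel. The half-width `delta = 0.01`, the function names and the `1e-300` floor inside the logarithm are assumptions made for this illustration only and are not taken from the experiments.

```python
import math


def gaussian_log_density(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma^2)."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_log_interval_prob(a, b, mu, sigma):
    """Log of P(a <= X <= b) for X ~ N(mu, sigma^2)."""
    p = gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)
    # Illustrative floor: avoids log(0) when the interval is far from mu.
    return math.log(max(p, 1e-300))


if __name__ == "__main__":
    x, delta = 0.0, 0.01  # a black pixel and a hypothetical half-width
    for sigma in (1e-1, 1e-3, 1e-6):
        log_density = gaussian_log_density(x, x, sigma)
        log_interval = gaussian_log_interval_prob(x - delta, x + delta, x, sigma)
        print(f"sigma={sigma:g}  log density={log_density:.3f}  "
              f"log interval prob={log_interval:.3f}")
```

As the standard deviation shrinks, the log density at the mean grows without bound, whereas the log probability of the interval never exceeds zero.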

We developed our idea of using likelihoods of intervals independently from (Salimans et al., 2017). The authors of (Salimans et al., 2017) improve the conditional densities model of (van Oord et al., 2016) by introducing a latent model for the color intensity of the pixels, which is given by a mixture of logistic distributions. The logarithm of the probability of an interval around the currently observed pixel, in discretized form, becomes the objective function. With this improved architecture, they obtain state-of-the-art results on the CIFAR-10 benchmark dataset. Their purpose was to alleviate the memory and computational burden of the PixelCNN algorithm, as well as to obtain better results and a speed-up in training. In contrast, our motivation was to solve an unboundedness problem of the probability densities, which appears during the maximization of the objective function.

## 2 Mathematical Derivations

In this section, we provide the technical derivations for the newly introduced Kullback-Leibler and Rényi integral bounds, termed IELBO and IRELBO, respectively.

### 2.1 The Integral ELBO

We start the derivation of the IELBO bound from the definition of the model evidence:

We would like to point out that the approximate posterior, $q_\phi(z|x)$, plays the role of an importance distribution and can be chosen to be any probability density function. Here, in particular, we choose to fix the conditioning variable to an example $x^{(i)}$, i.e., $q_\phi(z|x^{(i)})$.

Applying Jensen’s inequality and taking and , we obtain the following lower bound:

(1) |

We define the integral ELBO bound as

IELBO | ||||

(2) |
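As a sketch of how a bound of this kind can be written, consider an interval $[a, b]$ around the example $x^{(i)}$ (a box when $x$ is multidimensional), a prior $p(z)$, a decoder $p_\theta(x|z)$ and the importance distribution $q_\phi(z|x^{(i)})$. The symbols $a$, $b$, $\theta$ and $\phi$ are notation assumed here for illustration, and the steps below are one possible way to arrive at an interval version of the ELBO rather than a verbatim restatement of equations (1) and (2).

```latex
\begin{align*}
\log \int_a^b p_\theta(x)\,dx
  &= \log \int_a^b \int p_\theta(x|z)\,p(z)\,dz\,dx \\
  &= \log \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[
       \frac{p(z)}{q_\phi(z|x^{(i)})} \int_a^b p_\theta(x|z)\,dx \right] \\
  &\ge \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[
       \log \int_a^b p_\theta(x|z)\,dx \right]
     - \mathrm{KL}\!\left( q_\phi(z|x^{(i)}) \,\|\, p(z) \right).
\end{align*}
```

The second line exchanges the order of integration and inserts the importance distribution, and the last line follows from Jensen's inequality. Since $\int_a^b p_\theta(x|z)\,dx \le 1$, the reconstruction term in the last line is never positive, in contrast to the log density used in the standard ELBO.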

### 2.2 The Integral Rényi ELBO

We start the derivation of the IRELBO bound from Rényi’s $\alpha$-divergence (Rényi, 1961) between the approximate posterior distribution, $q_\phi(z|x^{(i)})$, conditioned on the current example, $x^{(i)}$, and the true posterior distribution, $p_\theta(z|x)$:

The definite integral of this divergence in an arbitrary interval, and , reads:

(3) |

where the last line follows from Jensen’s inequality and

(4) |

Applying Hölder’s inequality in integral form yields

(5) |

For , if we choose , then and

For , we have that:

We define the integral Rényi bound as

(6) |

In our assumption, $p_\theta(x|z)$ is an independence model, with each pixel, $x_j$, distributed as $\mathcal{N}(\mu_j, \sigma_j^2)$. Because this distribution is raised to a negative exponent, we require the Dawson function, $F(u) = e^{-u^2}\int_0^u e^{t^2}\,dt$, to compute the integral

(7) |

with and .
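To see why the Dawson function appears, the following worked identity treats a single Gaussian pixel raised to a generic negative exponent $-k$, with $k > 0$. The symbols $k$, $u_a$, $u_b$ and the per-pixel parameters $\mu$, $\sigma$ are notation assumed for this sketch, and $F(u) = e^{-u^2}\int_0^u e^{t^2}\,dt$ denotes the Dawson function; the identity is a standard calculus fact and is not claimed to coincide term by term with equation (7).

```latex
\begin{align*}
\mathcal{N}(x;\mu,\sigma^2)^{-k}
  &= (2\pi\sigma^2)^{k/2} \exp\!\left( \frac{k\,(x-\mu)^2}{2\sigma^2} \right), \\
\int_a^b \mathcal{N}(x;\mu,\sigma^2)^{-k}\,dx
  &= (2\pi\sigma^2)^{k/2}\, \sigma \sqrt{\frac{2}{k}}
     \int_{u_a}^{u_b} e^{u^2}\,du \\
  &= (2\pi\sigma^2)^{k/2}\, \sigma \sqrt{\frac{2}{k}}
     \left[ e^{u_b^2} F(u_b) - e^{u_a^2} F(u_a) \right],
\end{align*}
\qquad
u_a = \sqrt{\frac{k}{2}}\,\frac{a-\mu}{\sigma}, \quad
u_b = \sqrt{\frac{k}{2}}\,\frac{b-\mu}{\sigma}.
```

The substitution $u = \sqrt{k/2}\,(x-\mu)/\sigma$ turns the exponent into $u^2$, and $\int_0^u e^{t^2}\,dt = e^{u^2} F(u)$ holds for every real $u$ because $F$ is odd. Note that the factors $e^{u^2}$ grow quickly when $\sigma$ is small, so a numerical implementation would typically work in the log domain.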

## 3 Experimental Setting

### 3.1 The Model

An image is represented by pixel values in a certain range. In maximum log likelihood estimation, we replace the with the likelihood of the interval , where can be taken such that and , with and small enough, such that both . The value of can be seen as expressing a numerical tolerance on the value of the pixels and in practice it is motivated by the fact that pixels printed on screen have limited precision.

In VAEs we assume that it is possible to find a latent variable $z$ that could in principle explain the observation $x$. To reduce the comparison to the essential, we will analyze the simplest model present in the literature, composed of a Gaussian approximate posterior with a diagonal covariance matrix and a standard Gaussian distribution for the prior in the latent space. For the encoding process, the input $x$ is passed to a neural network whose output is the mean and the logarithm of the variances of a Gaussian distribution in the latent space, i.e., $q_\phi(z|x)$. The positivity of the entries of the covariance matrix is guaranteed by the exponential transformation of the output of the network. For the decoder, we will use an analogous model: the output of the neural network is given by the mean and covariance of a Gaussian probability distribution for the observations, i.e., $p_\theta(x|z)$, where the sigmoid function is used to restrict the mean between $0$ and $1$.
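The model described above can be turned into a one-sample estimate of an interval-based bound, as in the following minimal sketch. It assumes a symmetric interval of half-width `delta` around each pixel, and `toy_decoder` is a hypothetical stand-in for the decoder network (sigmoid mean, exponential standard deviation); all names and numerical values are illustrative and not taken from the experiments.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_log_likelihood(x, mean, log_std, delta):
    """Sum over pixels of log P(x_j - delta <= X_j <= x_j + delta)."""
    total = 0.0
    for x_j, m_j, ls_j in zip(x, mean, log_std):
        s_j = math.exp(ls_j)  # exponential keeps the std positive
        p = normal_cdf(x_j + delta, m_j, s_j) - normal_cdf(x_j - delta, m_j, s_j)
        total += math.log(max(p, 1e-300))  # illustrative floor against log(0)
    return total


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def toy_decoder(z):
    """Hypothetical decoder: sigmoid mean and constant log-std per pixel."""
    weights = (4.0, -4.0, 0.5)
    mean = [1.0 / (1.0 + math.exp(-w * z[0])) for w in weights]
    log_std = [-2.0 for _ in weights]
    return mean, log_std


def integral_elbo_estimate(x, enc_mu, enc_log_var, delta, rng):
    """One-sample estimate: interval log-likelihood minus KL to the prior."""
    z = sample_latent(enc_mu, enc_log_var, rng)
    mean, log_std = toy_decoder(z)
    return (interval_log_likelihood(x, mean, log_std, delta)
            - kl_to_standard_normal(enc_mu, enc_log_var))


rng = random.Random(0)
x = [1.0, 0.0, 0.5]  # three pixel values in [0, 1]
estimate = integral_elbo_estimate(x, enc_mu=[1.0], enc_log_var=[-1.0],
                                  delta=0.05, rng=rng)
print(estimate)
```

Because both the interval log-likelihood and the negative KL term are at most zero, the estimate stays bounded above regardless of how small the decoder standard deviations become.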

### 3.2 Neural Network Architectures

In the learning schemes for the Kullback-Leibler integral bound (IELBO) we use the following architecture for a VAE: the neural network of the encoder and that of the decoder contain two deterministic hidden layers, each with nodes and the ELU activation function. The standard deviation given by the encoder is transformed through the exponential function, to ensure its positivity. The dimension of the latent space is equal to . The learning rate for the Adam optimizer is equal to . The integration interval is equal to . All images have been rescaled to to avoid numerical issues. We consider samples drawn from the approximate posterior and a batch size of input samples. The weights are initialized with the Xavier initialization, while the biases are set equal to .

In the learning schemes for the Rényi integral bound (IRELBO) we use the following architecture for a VAE: the neural network of the encoder and that of the decoder contain one hidden layer, with nodes and the ReLU activation function. The standard deviation given by the encoder is transformed through the exponential function, to ensure its positivity. The dimension of the latent space is equal to and the learning rate for Adam is equal to . The integration interval is equal to . We consider , one sample drawn from the approximate posterior and a batch size of input samples. The weights are initialized with the Xavier initialization.

## 4 Experimental Results and Discussion

In this section we provide preliminary results for the newly introduced integral bounds, employed for the training of a standard VAE on the continuous MNIST (LeCun, 1998) and OMNIGLOT (Lake et al., 2015) datasets. No extra source of noise has been added during training, neither to the input samples nor to the generative Gaussian distribution $p_\theta(x|z)$, for instance in the form of a lower bound for the entries of the covariance matrix. The figures showing reconstructed test images display the original images in the first and fourth rows; the reconstructed images using the mean of the decoder distribution, obtained by decoding the mean of $q_\phi(z|x)$, in the second and fifth rows; and the reconstructed images using the mean of the decoder distribution, obtained by passing a sample drawn from $q_\phi(z|x)$ through the decoder, in the third and sixth rows.

The results we have presented indicate that the original ELBO is not suited for high contrast images, unless noise is added either to the data or to the model. Using the IELBO, we were able to efficiently reconstruct images, as shown in Figures 5, 6, 7, and 8. The bounds on the train and test datasets saturate after a few hundred epochs, are smooth and have comparable values. The quality of the reconstructed MNIST test images appears to be very good, while that of the reconstructed OMNIGLOT test images is acceptable. On these datasets, the original ELBO without any added source of noise performs poorly. For MNIST, Figure 4 illustrates reconstructed test images of medium quality, while in Figure 1 it is possible to see that the bound evaluated on the test images has severe fluctuations associated with numerical issues, and deviates significantly from the one evaluated on the training set as the epochs increase. For OMNIGLOT, we were not able to train the algorithm with the original ELBO, due to numerical issues such as those in Figure 2.

Using the Rényi integral bound, we obtained very good quality for the reconstructed MNIST test images in Figure 9, and smooth bounds on the train and test examples, in Figure 10. Compared with the IELBO results on MNIST, training VAE with the IRELBO requires a greater number of epochs for convergence.

The integral bounds contain an extra hyper-parameter compared to the original ELBO. In our experiments, we observed that the value of the bound and the reconstruction quality of the test images are significantly affected by the choice of the integration interval.

As future work, we plan to study the impact of the size of the integration interval on the value of the bound and the quality of the reconstructed images. We plan to investigate how to efficiently select the integration interval for , for instance by dividing the interval in fixed bins, or by doing this dynamically based on the pixel mean.

Several relevant issues remain to be solved. We plan to study the behavior of these novel bounds on other continuous datasets, which have a more uniform distribution of the pixel values between $0$ and $1$, in contrast to the continuous MNIST and OMNIGLOT datasets. Preliminary experiments on datasets with low contrast images, such as Frey Faces (http://www.cs.nyu.edu/~roweis/data.html), provided good results. We are also investigating the impact of different models for the decoder, such as the logit-normal distribution, on the reconstruction quality and on the value of the bound. Another relevant research direction is the study of the gradients of the novel bounds, from the point of view of the efficiency of the estimators and the speed of convergence. Finally, it will also be interesting to examine the advantages of the Rényi integral bound over the Kullback-Leibler one, to determine whether a more complicated divergence function provides benefits in the learning capacity of the algorithm.

## 5 Conclusions

Motivated by the numerical difficulties encountered during the training of VAEs using the standard ELBO on continuous datasets characterized by high contrast (non-binary) images, such as MNIST and OMNIGLOT, we introduced two novel lower bounds for the log-likelihood, obtained by maximizing the likelihood of intervals for the continuous observations. We conducted proof-of-concept experiments, which showed the capacity of our algorithms to produce good quality reconstructed test images and avoid numerical issues, without the need to add extra noise either to the data during training or to the generative model. One benefit of our bounds is that they require the computation of likelihoods of intervals, which implicitly prevents the generation of small variances for the reconstructed inputs. Indeed, the likelihood of an interval is bounded, thus avoiding the numerical issues present during the computation of the bound based on the standard likelihood. Preliminary experiments on datasets with low contrast images, such as Frey Faces, show that the proposed bounds also allow the reconstruction of other types of images.

## 6 Acknowledgements

We would like to thank Andrei Ciuparu for useful discussions related to the training issues we encountered with the original ELBO and an anonymous reviewer for suggesting the reference (Salimans et al., 2017).

## References

- Bengio et al. (2013) Bengio, Yoshua, Yao, Li, Alain, Guillaume, and Vincent, Pascal. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems (NIPS), 2013.
- Burda et al. (2016) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. In International Conference on Learning Representations (ICLR), 2016.
- Dieng et al. (2017) Dieng, Adji Bousso, Tran, Dustin, Ranganath, Rajesh, Paisley, John, and Blei, David. Variational inference via upper bound minimization. In Advances in Neural Information Processing Systems, pp. 2729–2738, 2017.
- Frey Faces dataset. http://www.cs.nyu.edu/~roweis/data.html.
- Im et al. (2017) Im, Daniel Jiwoong, Ahn, Sungjin, Memisevic, Roland, and Bengio, Yoshua. Denoising criterion for variational auto-encoding framework. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Jordan et al. (1999) Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999. ISSN 0885-6125.
- Kingma & Welling (2014) Kingma, D.P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR 2014), 2014.
- Lake et al. (2015) Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. In Science, pp. 1332–1338, 2015.
- Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, Pat (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
- LeCun (1998) LeCun, Yann. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, issue 11, 1998.
- Li & Turner (2016) Li, Yingzhen and Turner, Richard E. Rényi divergence variational inference. In Neural Information Processing Systems (NIPS), 2016.
- Mnih & Rezende (2016) Mnih, Andriy and Rezende, Danilo J. Variational inference for Monte Carlo objectives. In Proceedings of the International Conference on Machine Learning (ICML 2016), 2016.
- Rainforth et al. (2018) Rainforth, Tom, Kosiorek, Adam R., Le, Tuan Anh, Maddison, Chris J., Igl, Maximilian, Wood, Frank, and Teh, Yee Whye. Tighter variational bounds are not necessarily better. In arXiv:1802.04537, 2018.
- Rényi (1961) Rényi, Alfréd. On measures of entropy and information. In Proceedings of Fourth Berkeley Symposium on Mathematics Statistics and Probability, pp. 547–561, 1961.
- Rezende & Mohamed (2015) Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1530–1538, 2015.
- Rezende & Mohamed (2016) Rezende, Danilo Jimenez and Mohamed, Shakir. Improved variational inference with inverse autoregressive flow. In International Conference on Learning Representations (ICLR) workshop track, 2016.
- Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1278–1286, 2014.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- Sønderby et al. (2016) Sønderby, Casper Kaae, Raiko, Tapani, Maaløe, Lars, Sønderby, Søren Kaae, and Winther, Ole. Ladder variational autoencoders. In Neural Information Processing Systems (NIPS), 2016.
- van Oord et al. (2016) van Oord, Aaron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. In Neural Information Processing Systems 29 (NIPS), 2016.
- Vicent et al. (2010) Vicent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. In Journal of Machine Learning Research, pp. 3371–3408, 2010.
- Webb & Teh (2016) Webb, Stefan and Teh, Yee Whye. A tighter Monte Carlo objective with Rényi $\alpha$-divergence. In Bayesian deep learning workshop at NIPS, 2016.