Variational Information Bottleneck on Vector Quantized Autoencoders
In this paper, we provide an information-theoretic interpretation of the Vector Quantized-Variational Autoencoder (VQ-VAE). We show that the loss function of the original VQ-VAE can be derived from the variational deterministic information bottleneck (VDIB) principle. On the other hand, the VQ-VAE trained by the Expectation Maximization (EM) algorithm can be viewed as an approximation to the variational information bottleneck (VIB) principle.
Hanwei Wu and Markus Flierl
School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology, Stockholm
Recent advances in variational autoencoders (VAEs) provide new unsupervised approaches to learning the hidden structure of data. The variational autoencoder is a powerful generative model that allows inference of the learned latent representation. However, classic VAEs are prone to the “posterior collapse” phenomenon, in which the latent representations are ignored because of the powerful decoder. The vector quantized variational autoencoder (VQ-VAE) learns discrete representations by incorporating vector quantization into the bottleneck stage, which avoids posterior collapse. Roy et al. propose to train the bottleneck stage of the VQ-VAE with the Expectation Maximization (EM) algorithm and achieve a higher perplexity of the latent space. Both VQ-VAE models make progress in training discrete latent variable models to match their continuous counterparts.
In this paper, we show that the formulation of VQ-VAEs can be interpreted from an information-theoretic perspective. The loss function of the original VQ-VAE can be derived from the variational deterministic information bottleneck (VDIB) principle. On the other hand, the VQ-VAE trained by the EM algorithm can be viewed as an approximation to the variational information bottleneck (VIB).
II Related Work
Given a joint probability distribution $p(X, Y)$ of the input data $X$ and the observed relevant random variable $Y$, the information bottleneck (IB) method seeks a representation $Z$ such that the mutual information $I(X; Z)$ is minimized while the mutual information $I(Z; Y)$ is preserved. $I(Z; Y)$ can be seen as a measure of the predictive power of $Z$ on $Y$, and $I(X; Z)$ can be seen as a compression measure. Hence, the information bottleneck is designed to find the trade-off between accuracy and compression. Tishby and Zaslavsky first used the information bottleneck principle to analyze deep neural networks theoretically, but no practical models were derived from the IB model. Alemi et al. present a variational approximation to the information bottleneck so that IB-based models can be parameterized by neural networks.
The deterministic information bottleneck (DIB) principle introduces an alternative formulation of the IB problem. It focuses on the representational cost of the latent representation $Z$ instead of finding the minimal sufficient statistics for predicting the relevant variable $Y$. Hence, DIB replaces the mutual information $I(X; Z)$ between input and representation with the entropy $H(Z)$. Using techniques similar to those of Alemi et al., Strouse and Schwab derived a variational deterministic information bottleneck (VDIB) to approximate the DIB.
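The difference between the two compression measures can be checked on a toy example. The sketch below is only an illustration: the input distribution and the two encoders are made-up values, and the function names are not taken from any of the cited works. It uses the identity $I(X; Z) = H(Z) - H(Z \mid X)$, so the two measures coincide exactly when the encoder is deterministic.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-q * math.log(q) for q in p if q > 0.0)

def marginal_z(p_x, p_z_given_x):
    """p(z) = sum_x p(x) p(z|x)."""
    n_z = len(p_z_given_x[0])
    return [sum(p_x[i] * p_z_given_x[i][k] for i in range(len(p_x)))
            for k in range(n_z)]

def mutual_information(p_x, p_z_given_x):
    """I(X;Z) = H(Z) - H(Z|X) for a discrete encoder p(z|x)."""
    h_z = entropy(marginal_z(p_x, p_z_given_x))
    h_z_given_x = sum(px * entropy(row) for px, row in zip(p_x, p_z_given_x))
    return h_z - h_z_given_x

# Hypothetical toy setup: four equally likely inputs, two latent values.
p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # deterministic encoder
soft = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # stochastic encoder

h_hard = entropy(marginal_z(p_x, hard))     # DIB compression cost H(Z)
i_hard = mutual_information(p_x, hard)      # IB compression cost I(X;Z)
h_soft = entropy(marginal_z(p_x, soft))
i_soft = mutual_information(p_x, soft)
```

For the deterministic encoder the two costs agree, while for the stochastic encoder $I(X; Z)$ is strictly smaller than $H(Z)$, which is why the DIB objective favors deterministic assignments.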
III Variational Information Bottleneck
We adapt an unsupervised clustering setting to derive the loss functions of the VDIB and VIB. We denote the data point index $i$ as the input data, the codeword index $z$ as the latent variable, the feature representation $x$ of the input data as the observed relevant variable, and $\hat{x}$ as the reconstructed representation. The above variables are subject to the Markov chain constraint $x \leftrightarrow i \leftrightarrow z \leftrightarrow \hat{x}$.
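The information bottleneck principle can be formulated as a rate-distortion-like problem in which the mutual information $I(i; z)$ is minimized subject to a constraint on the expected distortion.
The loss function of the information bottleneck principle is the equivalent Lagrangian formulation of this problem, in which a Lagrangian parameter sets the trade-off between distortion and compression.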
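As a sketch of these two formulations, with $i$ the data point index and $z$ the codeword index, write $D$ for the distortion budget and $\lambda$ for the Lagrangian parameter. Both symbols, and the choice to place the multiplier on the rate term, are assumptions made for illustration.

```latex
% Rate-distortion-like form: compress the index i into z under a distortion budget D.
\min_{p(z \mid i)} \; I(i; z)
\quad \text{subject to} \quad
\mathbb{E}_{p(i, z)}\big[ d_{\mathrm{IB}}(i, z) \big] \le D

% Equivalent Lagrangian loss (placement of the multiplier is an assumption).
\mathcal{L}_{\mathrm{IB}}
  = \mathbb{E}_{p(i, z)}\big[ d_{\mathrm{IB}}(i, z) \big] + \lambda \, I(i; z)
```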
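Consider the information bottleneck distortion defined as $d_{\mathrm{IB}}(i, z) = \mathrm{KL}\big(p(x \mid i) \,\|\, p(x \mid z)\big)$,
where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. Taking the expectation over the joint distribution $p(i, z) = p(i)\, p(z \mid i)$, we can decompose $d_{\mathrm{IB}}$ into two terms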
where (6) is derived by using the chain rule to express the conditional probability as
Since the second term of (6) is determined solely by the given data distribution, it is a constant and can be ignored in the loss function for the purpose of minimization. The first term of (6) can be upper bounded by replacing the conditional $p(x \mid z)$ with a variational approximation $q(x \mid z)$,
where (9) results from the non-negativity of the KL divergence, $\mathrm{KL}\big(p(x \mid z) \,\|\, q(x \mid z)\big) \ge 0$.
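The decomposition and the variational bound can be sketched as follows, assuming the distortion $d_{\mathrm{IB}}(i, z) = \mathrm{KL}(p(x \mid i) \,\|\, p(x \mid z))$ averaged over $p(i, z)$ and the Markov chain $x \leftrightarrow i \leftrightarrow z$. The intermediate steps are a reconstruction under these assumptions; only the tags (6) and (9) are placed to agree with the way the text refers to them.

```latex
\begin{align}
\mathbb{E}_{p(i,z)}\big[ d_{\mathrm{IB}}(i,z) \big]
  &= \sum_{i,z} p(i)\, p(z \mid i) \sum_{x} p(x \mid i)
     \log \frac{p(x \mid i)}{p(x \mid z)} \notag \\
  &= -\sum_{i,z} p(i)\, p(z \mid i) \sum_{x} p(x \mid i) \log p(x \mid z)
     + \sum_{i} p(i) \sum_{x} p(x \mid i) \log p(x \mid i) \tag{6}
\end{align}

% Markov chain: \sum_i p(i) p(z|i) p(x|i) = p(x, z), so the first term equals H(x|z).
\begin{align}
-\sum_{x,z} p(x,z) \log p(x \mid z)
  &= -\sum_{x,z} p(x,z) \log q(x \mid z)
     - \sum_{z} p(z)\, \mathrm{KL}\big( p(x \mid z) \,\|\, q(x \mid z) \big) \notag \\
  &\le -\sum_{x,z} p(x,z) \log q(x \mid z) \tag{9}
\end{align}
```

The second term of (6) equals $-H(x \mid i)$, which depends only on the data, and the bound (9) is tight when $q(x \mid z) = p(x \mid z)$.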
Similarly, the mutual information $I(i; z)$ can be upper bounded by replacing the marginal $p(z)$ with a variational approximation $r(z)$,
where (14) results from the non-negativity of the KL divergence, $\mathrm{KL}\big(p(z) \,\|\, r(z)\big) \ge 0$.
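A sketch of this bound, with $r(z)$ denoting the variational marginal, is given below. The intermediate line is a reconstruction, and only the tag (14) is placed to agree with the reference in the text. The last line applies the same argument to the entropy $H(z)$ that the deterministic variant uses in place of the mutual information.

```latex
\begin{align}
I(i; z)
  &= \sum_{i,z} p(i)\, p(z \mid i) \log \frac{p(z \mid i)}{p(z)} \notag \\
  &= \sum_{i,z} p(i)\, p(z \mid i) \log \frac{p(z \mid i)}{r(z)}
     - \mathrm{KL}\big( p(z) \,\|\, r(z) \big) \notag \\
  &\le \sum_{i} p(i)\, \mathrm{KL}\big( p(z \mid i) \,\|\, r(z) \big) \tag{14}
\end{align}

% Deterministic variant: the entropy is bounded by a cross entropy.
H(z) = -\sum_{z} p(z) \log p(z) \;\le\; -\sum_{z} p(z) \log r(z)
```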
IV Connection to VQ-VAEs
In this section, we connect the VDIB and VIB principles to the original VQ-VAE and to the VQ-VAE trained by the EM algorithm, respectively. In the VQ-VAE setting, the distribution $p(z \mid i)$ is parameterized by the encoder neural network and the distribution $q(x \mid z)$ is parameterized by the decoder neural network.
In the loss function (24) of the VQ-VAE, $\mathrm{sg}[\cdot]$ is the stop gradient operator, $K$ is the number of codewords of the quantizer, $z_e(x_i)$ is the output of the encoder for the $i$-th data point, and $z_q(x_i)$ is the output of the bottleneck quantizer and the input of the decoder. The stop gradient operator outputs its input unchanged in the forward pass and is treated as a constant when gradients are computed during training.
The first term of (24) is the reconstruction error between the output and the input. During backpropagation, the gradients are copied from the decoder input $z_q(x_i)$ to the encoder output $z_e(x_i)$. Hence, the first term only optimizes the encoder and the decoder, and the codewords receive no gradient updates from it. The second term is the commitment loss, which forces the encoder output to commit to the codewords, and the bottleneck codewords are optimized by the third term. $\beta$ is a constant weight parameter for the commitment loss.
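The effect of the stop gradient operator on the second and third terms can be illustrated with hand-derived gradients. The sketch below assumes the usual squared-error form of these terms, $\beta \| z_e - \mathrm{sg}[e] \|^2$ for the commitment loss and $\| \mathrm{sg}[z_e] - e \|^2$ for the codebook loss; the function name and the example values are hypothetical, and the reconstruction term is left out because it depends on the decoder.

```python
def vq_terms_and_grads(z_e, e, beta=0.25):
    """Commitment and codebook terms for one encoder output and its codeword.

    Assumed forms: commitment = beta * ||z_e - sg[e]||^2,
                   codebook   = ||sg[z_e] - e||^2.
    The stop gradient makes each term differentiable w.r.t. one argument only.
    """
    diff = [a - b for a, b in zip(z_e, e)]
    sq = sum(d * d for d in diff)
    commitment = beta * sq
    codebook = sq
    # Commitment term: e is held constant, so only the encoder output is updated.
    grad_z_e = [2.0 * beta * d for d in diff]
    # Codebook term: z_e is held constant, so only the codeword is updated.
    grad_e = [-2.0 * d for d in diff]
    return commitment, codebook, grad_z_e, grad_e

# Hypothetical encoder output and selected codeword.
commitment, codebook, grad_z_e, grad_e = vq_terms_and_grads(
    [1.0, 2.0], [0.0, 0.0], beta=0.25
)
```

Both terms share the same forward value up to the factor $\beta$; they differ only in which of the two arguments receives the gradient.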
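For the regularization term, which is the second term of the loss, VDIB minimizes the cross entropy $-\sum_z p(z) \log r(z)$ between the codeword marginal $p(z)$ and its variational approximation $r(z)$, estimated empirically from the data.
Conventionally, the marginal $r(z)$ is set to be a uniform distribution over the $K$ codewords. The cross entropy then becomes the constant $\log K$ and can be omitted from the loss function. The loss function of VDIB then reduces to the loss function (24) of the VQ-VAE.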
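For the VIB, the KL divergence can be expressed as $\mathrm{KL}\big(p(z \mid i) \,\|\, r(z)\big) = -H(z \mid i) - \sum_z p(z \mid i) \log r(z)$, where $H(z \mid i)$ is the conditional entropy and the second term equals $\log K$ for a uniform $r(z)$ over $K$ codewords.
The classic VQ-VAE applies nearest neighbor search on the codebook in the bottleneck stage, so that $p(z = k \mid i) = 1$ if $k = \arg\min_j \| z_e(x_i) - e_j \|_2$ and $p(z = k \mid i) = 0$ otherwise,
where $z_e(x_i)$ is the encoder output for the $i$-th data point and $e_j$ is the $j$-th codeword. Hence, the conditional entropy $H(z \mid i)$ is zero.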
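The hard assignment and its zero conditional entropy can be checked with a few lines of code. This is a minimal sketch: the codebook and the encoder output are made-up values, and the helper names are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def hard_assignment(z_e, codebook):
    """One-hot p(z | i): all probability mass on the nearest codeword."""
    dists = [squared_distance(z_e, e) for e in codebook]
    k = min(range(len(codebook)), key=lambda j: dists[j])
    return [1.0 if j == k else 0.0 for j in range(len(codebook))]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-q * math.log(q) for q in p if q > 0.0)

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # hypothetical codewords
z_e = [0.9, 1.2]                                  # hypothetical encoder output
p_hard = hard_assignment(z_e, codebook)
h_hard = entropy(p_hard)
```

Because the posterior is one-hot for every data point, its entropy vanishes regardless of the codebook, so the KL term of the VIB stays at its maximum value $\log K$ for a uniform $r(z)$.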
On the other hand, the VQ-VAE trained by the EM algorithm uses a soft clustering scheme based on the distance between the codewords and the output of the encoder. The probability that the $i$-th data point is assigned to the $k$-th codeword is $p(z = k \mid i) = \exp\big(-\| e_k - z_e(x_i) \|_2^2\big) \big/ \sum_{j=1}^{K} \exp\big(-\| e_j - z_e(x_i) \|_2^2\big)$.
That is, the EM algorithm explicitly increases the conditional entropy $H(z \mid i)$ and achieves a lower value for (18). The experiments of Roy et al. also suggest that the VQ-VAE trained by the EM algorithm can achieve a higher perplexity of the codewords than the original VQ-VAE.
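The comparison between hard and soft assignments can be made concrete with a small numerical sketch. It assumes that the soft assignment is a softmax over negative squared distances to the codewords and that $r(z)$ is uniform, so that the KL term equals $\log K - H(z \mid i)$; the codebook, the encoder output, and the helper names are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def hard_assignment(z_e, codebook):
    """One-hot posterior on the nearest codeword (classic VQ-VAE)."""
    dists = [squared_distance(z_e, e) for e in codebook]
    k = min(range(len(codebook)), key=lambda j: dists[j])
    return [1.0 if j == k else 0.0 for j in range(len(codebook))]

def soft_assignment(z_e, codebook):
    """Softmax over negative squared distances (assumed soft clustering rule)."""
    scores = [-squared_distance(z_e, e) for e in codebook]
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-q * math.log(q) for q in p if q > 0.0)

def kl_to_uniform(p):
    """KL(p || uniform over K entries) = log K - H(p)."""
    return math.log(len(p)) - entropy(p)

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # hypothetical codewords
z_e = [0.9, 1.2]                                  # hypothetical encoder output
p_hard = hard_assignment(z_e, codebook)
p_soft = soft_assignment(z_e, codebook)
kl_hard = kl_to_uniform(p_hard)
kl_soft = kl_to_uniform(p_soft)
```

The soft posterior keeps most of its mass on the nearest codeword but has non-zero entropy, so its KL divergence to the uniform marginal is strictly below the value $\log K$ obtained with the hard assignment.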
We derive the loss functions of the VIB and VDIB from a clustering setting. We show that the loss function of the original VQ-VAE can be derived from the VDIB principle. In addition, we show that the VQ-VAE trained with the EM algorithm explicitly increases the perplexity of the latents and can be viewed as an approximation of the VIB principle.
-  A. Oord, K. Kavukcuoglu, and O. Vinyals, “Neural discrete representation learning,” in Advances on Neural Information Processing Systems (NIPS), Long Beach, CA, Dec. 2017.
-  DJ Strouse and D. Schwab, “Variational deterministic information bottleneck,” 2018, [Online]. Available: http://djstrouse.com/downloads/vdib.pdf.
-  A. Roy, A. Vaswani, A. Neelakantan, and N. Parmar, “Theory and experiments on vector quantized autoencoders,” 2018, [Online]. Available: https://arxiv.org/abs/1803.03382.
-  A.A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, Apr. 2017.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proceedings of the International Conference on Learning Representations (ICLR), Banff, Canada, Apr. 2014.
-  DJ Strouse and D. Schwab, “The deterministic information bottleneck,” Neural Comput., vol. 29, no. 6, pp. 1611-1630, 2017.
-  N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” .
-  N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” .
-  A. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Proceedings of the COLT, 2003.