# Bottleneck Conditional Density Estimation

###### Abstract

We introduce a new framework for training deep generative models for high-dimensional conditional density estimation. The Bottleneck Conditional Density Estimator (BCDE) is a variant of the conditional variational autoencoder (CVAE) that employs layer(s) of stochastic variables as the bottleneck between the input $x$ and target $y$, where both are high-dimensional. Crucially, we propose a new hybrid training method that blends the conditional generative model with a joint generative model. Hybrid blending is the key to effective training of the BCDE, which avoids overfitting and provides a novel mechanism for leveraging unlabeled data. We show that our hybrid training procedure enables models to achieve competitive results in the MNIST quadrant prediction task in the fully-supervised setting, and sets new benchmarks in the semi-supervised regime for MNIST, SVHN, and CelebA.

## 1 Introduction

Conditional density estimation (CDE) refers to the problem of estimating a conditional density $p(y \mid x)$ for the input $x$ and target $y$. In contrast to classification, where the target is simply a discrete class label, $y$ is typically continuous or high-dimensional in CDE. Furthermore, we want to estimate the full conditional density (as opposed to its conditional mean in regression), an important task when the conditional distribution has multiple modes. CDE problems in which both $x$ and $y$ are high-dimensional have a wide range of important applications, including video prediction, cross-modality prediction (e.g., image-to-caption), model estimation in model-based reinforcement learning, and so on.

Classical non-parametric conditional density estimators typically rely on local Euclidean distance in the original input and target spaces (Holmes et al., 2012). This approach quickly becomes ineffective in high dimensions from both computational and statistical points of view. Recent advances in deep generative models have led to new parametric models for high-dimensional CDE tasks, namely the conditional variational autoencoder (CVAE) (Sohn et al., 2015). CVAEs have been applied to a variety of problems, such as MNIST quadrant prediction, segmentation (Sohn et al., 2015), attribute-based image generation (Yan et al., 2015), and machine translation (Zhang et al., 2016).

But CVAEs suffer from two statistical deficiencies. First, they do not learn the distribution of the input $x$. We argue that in the case of high-dimensional input $x$, where there might exist a low-dimensional representation (such as a low-dimensional manifold) of the data, recovering this structure is important, even if the task at hand is to learn the conditional density $p(y \mid x)$. Otherwise, the model is susceptible to overfitting. Second, for many CDE tasks, the acquisition of labeled points is costly, motivating the need for semi-supervised CDE. A purely conditional model would not be able to utilize any available unlabeled data (we define a "labeled point" to be a paired sample $(x, y)$, and an "unlabeled point" to be an unpaired $x$ or $y$). We note that while variational methods (Kingma & Welling, 2013; Rezende et al., 2014) have been applied to semi-supervised classification (where $y$ is a class label) (Kingma et al., 2014; Maaløe et al., 2016), semi-supervised CDE (where $y$ is high-dimensional) remains an open problem.

We focus on a set of deep conditional generative models, which we call bottleneck conditional density estimators (BCDEs). In BCDEs, the input $x$ influences the target $y$ via layers of bottleneck stochastic variables $z$ in the generative path. The BCDE naturally has a joint generative sibling model, which we denote the bottleneck joint density estimator (BJDE), where the bottleneck $z$ generates $x$ and $y$ independently. Motivated by Lasserre et al. (2006), we propose a hybrid training framework that regularizes the conditionally-trained BCDE parameters toward the jointly-trained BJDE parameters. This is the key feature that enables semi-supervised learning for conditional density estimation in the BCDEs.

Our BCDE hybrid training framework is a novel approach for leveraging unlabeled data for conditional density estimation. Using our BCDE hybrid training framework, we establish new benchmarks for the quadrant prediction task (Sohn et al., 2015) in the semi-supervised regime for MNIST, SVHN, and CelebA. Our experiments show that 1) hybrid training is competitive for fully-supervised CDE, 2) in semi-supervised CDE, hybrid training helps to avoid overfitting, performs significantly better than conditional training with unlabeled data pre-training, and achieves state-of-the-art results, and 3) hybrid training encourages the model to learn better and more robust representations.

## 2 Background

### 2.1 Variational Autoencoders

The Variational Autoencoder (VAE) is a deep generative model for density estimation. It consists of a latent variable $z$ with unit Gaussian prior $z \sim \mathcal{N}(0, I)$, which in turn generates an observable vector $x$. The observation is usually conditionally Gaussian, $x \mid z \sim \mathcal{N}\big(\mu_\theta(z), \operatorname{diag}(\sigma^2_\theta(z))\big)$, where $\mu_\theta$ and $\sigma^2_\theta$ are neural networks whose parameters are represented by $\theta$ (for discrete $x$, one can instead use a deep network to parameterize a Bernoulli or a discretized logistic distribution). The VAE can be seen as a non-linear generalization of probabilistic PCA (Tipping & Bishop, 1999) and, thus, can recover non-linear manifolds in the data. However, the VAE's flexibility makes posterior inference of the latent variables intractable. This inference issue is addressed via a recognition model $q_\phi(z \mid x)$, which serves as an amortized variational approximation of the intractable posterior $p_\theta(z \mid x)$. Learning in VAEs is done by jointly optimizing the parameters of both the generative and recognition models so as to maximize an objective that resembles an autoencoder's regularized reconstruction loss (Kingma & Welling, 2013), i.e.,

$$\max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big). \tag{1}$$

We note that the objective in Eq. 1 can be rewritten in the following form, which exposes its connection to the variational lower bound of the log-likelihood:

$$\mathbb{E}_{q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \ln p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big). \tag{2}$$

We make two remarks regarding the minimization of the term $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$ in Eq. 2. First, when $q_\phi(z \mid x)$ is a conditionally independent Gaussian, this approximation is at best as good as the mean-field approximation that minimizes the KL divergence to $p_\theta(z \mid x)$ over all independent Gaussians. Second, this term serves as a form of amortized posterior regularization that encourages the posterior $p_\theta(z \mid x)$ to be close to an amortized variational family (Dayan et al., 1995; Ganchev et al., 2010; Hinton et al., 1995). In practice, both $\theta$ and $\phi$ are jointly optimized in Eq. 1, and the reparameterization trick (Kingma & Welling, 2013) is used to transform the expectation over $z \sim q_\phi(z \mid x)$ into one over $\epsilon \sim \mathcal{N}(0, I)$ with $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$, which leads to an easily obtained stochastic gradient.
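To make the training mechanics concrete, the following is a minimal single-sample ELBO estimator using the reparameterization trick. The `encode` and `decode` callables are hypothetical stand-ins for the recognition and generative networks; this is a sketch of the estimator, not the paper's implementation.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo_estimate(x, encode, decode, rng):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which moves the randomness outside the dependence on (mu, sigma)
    # so the single-sample Monte Carlo estimate is differentiable.
    mu, log_var = encode(x)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    recon_log_prob = decode(z, x)  # one-sample estimate of E_q[ln p(x|z)]
    return recon_log_prob - gaussian_kl(mu, log_var)
```

In an actual implementation the gradient of this estimate with respect to the encoder and decoder parameters is obtained by automatic differentiation.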

### 2.2 Conditional VAEs (CVAEs)

In Sohn et al. (2015), the authors introduce the conditional version of the variational autoencoder. The conditional generative model is similar to the VAE, except that the latent variable $z$ and the observed vector $y$ are both conditioned on the input $x$. The conditional generative path is

$$z \sim p_\theta(z \mid x) = \mathcal{N}\big(z \mid \mu_z(x), \operatorname{diag}(\sigma_z^2(x))\big) \tag{3}$$

$$y \sim p_\theta(y \mid x, z) = \mathcal{N}\big(y \mid \mu_y(x, z), \operatorname{diag}(\sigma_y^2(x, z))\big), \tag{4}$$

and the likelihood when we use a Bernoulli decoder is

$$p_\theta(y \mid x, z) = \operatorname{Ber}\big(y \mid \mu_y(x, z)\big). \tag{5}$$

Here, $\theta$ denotes the parameters of the neural networks used in the generative path. The CVAE is trained by maximizing a lower bound of the conditional likelihood,

$$\mathcal{C}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z \mid x, y)}\big[\ln p_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big), \tag{6}$$

but with a recognition network $q_\phi(z \mid x, y)$, which is typically Gaussian and takes both $x$ and $y$ as input.

### 2.3 Blending Generative and Discriminative

It is well-known that a generative model may yield sub-optimal performance when compared to the same model trained discriminatively (Ng & Jordan, 2002), a phenomenon attributable to the generative model being mis-specified (Lasserre et al., 2006). However, generative models can easily handle unlabeled data in the semi-supervised setting. This is the main motivation behind blending generative and discriminative models. Lasserre et al. (2006) proposed a principled method for hybrid blending that duplicates the parameter of the generative model into a discriminatively trained $\theta$ and a generatively trained $\tilde\theta$, i.e.,

$$p(x, y, \theta, \tilde\theta) = p(\theta, \tilde\theta)\, p(y \mid x, \theta)\, p(x \mid \tilde\theta). \tag{7}$$

The discriminatively trained parameter $\theta$ is regularized toward the generatively trained parameter $\tilde\theta$ via a prior $p(\theta, \tilde\theta)$ that prefers a small difference $\|\theta - \tilde\theta\|$. As a result, in addition to learning from the labeled data, the discriminative parameter $\theta$ can be informed by the unlabeled data via $\tilde\theta$, enabling a form of semi-supervised, discriminatively trained generative model. However, this approach is limited to simple generative models (e.g., naive Bayes and HMMs), for which exact inference is tractable.

## 3 Neural Bottleneck Conditional Density Estimation

While Sohn et al. (2015) have successfully applied the CVAE to CDE, the CVAE suffers from two limitations. First, the CVAE does not learn the distribution of its input $x$, and thus, is far more susceptible to overfitting. Second, it cannot incorporate unlabeled data. To resolve these limitations, we propose a new approach to high-dimensional CDE that blends the discriminative model that learns the conditional distribution $p(y \mid x)$ with a generative model that learns the joint distribution $p(x, y)$.

### 3.1 Overview

Figure 1 provides a high-level overview of our approach, which consists of a new architecture and a new training procedure. Our new architecture imposes a bottleneck constraint, resulting in a class of conditional density estimators that we call bottleneck conditional density estimators (BCDEs). Unlike the CVAE, the BCDE generative path prevents $x$ from directly influencing $y$. Following the conditional training paradigm in Sohn et al. (2015), conditional/discriminative training of the BCDE means maximizing the lower bound of a conditional likelihood similar to Eq. 6, i.e.,

$$\mathcal{C}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z \mid x, y)}\big[\ln p_\theta(y \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big).$$

When trained over a dataset $\mathcal{X}_L$ of paired samples, the overall conditional training objective is

$$\mathcal{C}(\theta, \phi; \mathcal{X}_L) = \sum_{(x, y) \in \mathcal{X}_L} \mathcal{C}(\theta, \phi; x, y). \tag{8}$$

However, this approach suffers from the same limitations as the CVAE while additionally imposing a bottleneck that limits the flexibility of the generative model. Instead, we propose a hybrid training framework that takes advantage of the bottleneck architecture to avoid overfitting and support semi-supervision.

One component in our hybrid training procedure tackles the problem of estimating the joint density $p(x, y)$. To do this, we use the joint counterpart of the BCDE: the bottleneck joint density estimator (BJDE). Unlike conditional models, the BJDE allows us to incorporate unpaired $x$ and $y$ data during training. Thus, the BJDE can be trained in a semi-supervised fashion. We will also show that the BJDE is well-suited to factored inference (see Section 3.4), i.e., a factorization procedure that makes the parameter space of the recognition model more compact.

The BJDE also serves as a way to regularize the BCDE, where the regularization constraint can be viewed as soft-tying between the parameters of these two models' generative and recognition networks. Via this regularization, the BCDE benefits from unpaired $x$'s and $y$'s for conditional density estimation.

Table 1. Parameter tying between the BJDE and the BCDE networks, for both the standard and factored parameterizations.

| | BJDE | BCDE |
|---|---|---|
| Standard | | |
| Factored | | |

### 3.2 Bottleneck Joint Density Estimation

In the BJDE, we wish to learn the joint distribution of $x$ and $y$. The bottleneck is introduced in the generative path via the bottleneck variable $z$, which generates both $x$ and $y$ (see Figs. 2(c), 2(b) and 2(a)). Thus, the variational lower bound of the joint likelihood is

$$\mathcal{J}_{xy}(\tilde\theta, \tilde\phi; x, y) = \mathbb{E}_{q_{\tilde\phi}(z \mid x, y)}\big[\ln p_{\tilde\theta}(x \mid z) + \ln p_{\tilde\theta}(y \mid z)\big] - D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid x, y) \,\|\, p(z)\big). \tag{9}$$

We use $(\tilde\theta, \tilde\phi)$ to indicate the parameters of the BJDE networks and reserve $(\theta, \phi)$ for the BCDE parameters. For samples in which $x$ or $y$ is unobserved, we will need to compute the variational lower bound for the marginal likelihoods. Here, the bottleneck plays a critical role. If $x$ were to directly influence $y$ in a non-trivial manner, any attempt to incorporate unlabeled $y$'s would require the recognition model to infer the unobserved $x$ from the observed $y$, a conditional density estimation problem that might be as hard as our original task. In the bottleneck architecture, the conditional independence of $x$ and $y$ given $z$ implies that only the low-dimensional bottleneck needs to be marginalized. Thus, the usual variational lower bounds for the marginal likelihoods yield

$$\mathcal{J}_x(\tilde\theta, \tilde\phi; x) = \mathbb{E}_{q_{\tilde\phi}(z \mid x)}\big[\ln p_{\tilde\theta}(x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid x) \,\|\, p(z)\big),$$
$$\mathcal{J}_y(\tilde\theta, \tilde\phi; y) = \mathbb{E}_{q_{\tilde\phi}(z \mid y)}\big[\ln p_{\tilde\theta}(y \mid z)\big] - D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid y) \,\|\, p(z)\big).$$

Since $z$ takes on the task of reconstructing both $x$ and $y$, the BJDE is sensitive to the distributions of both variables and learns a joint manifold over the two data sources. Thus, the BJDE provides the following benefits: 1) learning the distribution of $x$ makes the inference of $z$ given $x$ robust to perturbations in the inputs, 2) $z$ becomes a joint embedding of $x$ and $y$, and 3) the model can leverage unlabeled data. Following the convention in Eq. 8, the joint training objective is

$$\mathcal{J}(\tilde\theta, \tilde\phi; \mathcal{X}_L, \mathcal{X}_u, \mathcal{Y}_u) = \sum_{(x, y) \in \mathcal{X}_L} \mathcal{J}_{xy}(\tilde\theta, \tilde\phi; x, y) + \sum_{x \in \mathcal{X}_u} \mathcal{J}_x(\tilde\theta, \tilde\phi; x) + \sum_{y \in \mathcal{Y}_u} \mathcal{J}_y(\tilde\theta, \tilde\phi; y), \tag{10}$$

where $\mathcal{X}_L$ is a dataset of paired samples, and $\mathcal{X}_u$ and $\mathcal{Y}_u$ are datasets of unpaired samples.
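Schematically, the semi-supervised joint objective sums per-sample bounds over the paired and unpaired datasets. A minimal sketch, with the three bound functions passed in as hypothetical callables (not the paper's implementation):

```python
def joint_objective(paired, unpaired_x, unpaired_y, bound_xy, bound_x, bound_y):
    # Sum the joint bound J_xy over paired samples and the marginal
    # bounds J_x, J_y over the unpaired samples, mirroring Eq. 10.
    total = sum(bound_xy(x, y) for x, y in paired)
    total += sum(bound_x(x) for x in unpaired_x)
    total += sum(bound_y(y) for y in unpaired_y)
    return total
```

In practice each bound would be a stochastic estimate over a mini-batch rather than a full-dataset sum.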

### 3.3 Blending Joint and Conditional Deep Models

Because of potential model mis-specifications, the BJDE is not expected to yield good performance if applied to the conditional task. Thus, we aim to blend the BJDE and BCDE models in the spirit of Lasserre et al. (2006). However, we note that (7) is not directly applicable since the BCDE and BJDE are two different models, and not two different views (discriminative and generative) of the same model. Therefore, it is not immediately clear how to tie the BCDE and BJDE parameters together. Further, these models involve conditional probabilities parameterized by deep networks and have no closed form for inference.

Any natural prior for the BCDE parameter $\theta$ and the BJDE parameter $\tilde\theta$ should encourage the conditional $p_\theta(y \mid x)$ to be close to its BJDE counterpart. In the presence of the latent variable $z$, it is then natural to encourage $p_\theta(z \mid x)$ to be close to $p_{\tilde\theta}(z \mid x)$ and $p_\theta(y \mid z)$ to be close to $p_{\tilde\theta}(y \mid z)$. However, enforcing the former condition is intractable, as we do not have a closed form for $p_{\tilde\theta}(z \mid x)$. Fortunately, an approximation of $p_{\tilde\theta}(z \mid x)$ is provided by the recognition model $q_{\tilde\phi}(z \mid x)$. Thus, we propose to softly tie together the parameters of the networks defining $p_\theta(z \mid x)$ and $q_{\tilde\phi}(z \mid x)$. This strategy effectively leads to a joint prior over the model network parameters as well as the recognition network parameters, $p(\theta, \tilde\theta, \phi, \tilde\phi)$.

As a result, we arrive at the following hybrid blending of deep stochastic models and its variational lower bound

$$\mathcal{H}(\theta, \phi, \tilde\theta, \tilde\phi) = \ln p(\tilde\theta, \theta, \tilde\phi, \phi) + \sum_{(x, y) \in \mathcal{X}_L} \big[\mathcal{J}_x(\tilde\theta, \tilde\phi; x) + \mathcal{C}(\theta, \phi; x, y)\big] + \sum_{x \in \mathcal{X}_u} \mathcal{J}_x(\tilde\theta, \tilde\phi; x) + \sum_{y \in \mathcal{Y}_u} \mathcal{J}_y(\tilde\theta, \tilde\phi; y). \tag{11}$$

We interpret $\ln p(\tilde\theta, \theta, \tilde\phi, \phi)$ as an $L_2$-regularization term that softly ties the joint parameters and conditional parameters in an appropriate way. For the BCDE and BJDE, there is a natural one-to-one mapping from the conditional parameters to a subset of the joint parameters. For the joint model described in Fig. 2(c) and conditional model in Fig. 2(d), the parameter pairings are provided in Table 1. Formally, we index each network by the Bayesian-network link it parameterizes: for example, $\theta_{z \mid x}$ denotes the parameters of the BCDE network for $p_\theta(z \mid x)$, and $\tilde\theta_{x \mid z}$ those of the BJDE network for $p_{\tilde\theta}(x \mid z)$. With a slight abuse of notation, we let the index also range over the tied recognition networks. The hybrid blending regularization term can be written as

$$\ln p(\tilde\theta, \theta, \tilde\phi, \phi) = -\lambda \sum_{i \in \mathcal{I}} \big\|\tilde\theta_i - \theta_i\big\|_2^2 + \text{const}, \tag{12}$$

where $\mathcal{I}$ denotes the set of common indices of the joint and conditional parameters. When the index corresponds to the link $z \mid x$, the parameters of the BCDE network for $p_\theta(z \mid x)$ are softly tied to those of the BJDE recognition network for $q_{\tilde\phi}(z \mid x)$.

Setting $\lambda = 0$ unties the BCDE from the BJDE and effectively yields a conditionally trained BCDE, while letting $\lambda \to \infty$ forces the corresponding parameters of the BCDE and BJDE to be identical.
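As a concrete sketch of this soft tying, the following computes an $L_2$ penalty between corresponding joint and conditional parameters. The parameter dictionaries keyed by link index are our own hypothetical representation, not the paper's implementation:

```python
import numpy as np

def soft_tying_penalty(joint_params, cond_params, lam):
    # Negative log-prior up to a constant: lambda * sum_i ||theta_tilde_i - theta_i||^2
    # taken over the shared index set (intersection of the network keys).
    shared = joint_params.keys() & cond_params.keys()
    return lam * sum(np.sum((joint_params[k] - cond_params[k]) ** 2)
                     for k in shared)
```

With `lam = 0` the models are untied; a very large `lam` drives the tied parameters toward equality.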

Interestingly, Eq. 11 does not contain the joint term $\mathcal{J}_{xy}$ for the paired samples. Since explicit training of $\mathcal{J}_{xy}$ may lead to learning a better joint embedding in the space of $z$, we note the following generalization of Eq. 11 that trades off the contribution between $\mathcal{J}_{xy}$ and $\mathcal{J}_x + \mathcal{C}$:

$$\mathcal{H}_\alpha(\theta, \phi, \tilde\theta, \tilde\phi) = \ln p(\tilde\theta, \theta, \tilde\phi, \phi) + \sum_{(x, y) \in \mathcal{X}_L} \Big[\alpha\, \mathcal{J}_{xy}(\tilde\theta, \tilde\phi; x, y) + (1 - \alpha)\big(\mathcal{J}_x(\tilde\theta, \tilde\phi; x) + \mathcal{C}(\theta, \phi; x, y)\big)\Big] + \sum_{x \in \mathcal{X}_u} \mathcal{J}_x(\tilde\theta, \tilde\phi; x) + \sum_{y \in \mathcal{Y}_u} \mathcal{J}_y(\tilde\theta, \tilde\phi; y). \tag{13}$$

Intuitively, for each paired sample the equation computes the lower bound of $\ln p(x, y)$ either using the joint parameters $\tilde\theta$, or by factorizing $p(x, y)$ into $p(x)\, p(y \mid x)$ before computing the lower bound of $\ln p(y \mid x)$ with the conditional parameters $\theta$. A proof that the lower bound holds for any $\alpha \in [0, 1]$ is provided in Appendix B. For simplicity, we fix $\alpha$ and do not tune it in our experiments.

### 3.4 Factored Inference

The inference network $q_\phi(z \mid x, y)$ is usually parameterized as a single neural network that takes both $x$ and $y$ as input. Using the precision-weighted merging scheme proposed by Sønderby et al. (2016), we also consider an alternative parameterization of $q_\phi(z \mid x, y)$ that takes a weighted average of a Gaussian distribution conditioned on $x$ and a Gaussian approximate likelihood term for $y$ (see Appendix A). Doing so offers a more compact recognition model and more parameter sharing between the BCDE and BJDE (e.g., see the bottom two rows in Table 1), but at the cost of lower flexibility for the variational family $q_\phi(z \mid x, y)$.

| Models | | | | |
|---|---|---|---|---|
| CVAE (Sohn et al., 2015) | - | - | - | |
| BCDE (conditional) | | | | |
| BCDE (naïve pre-train) | | | | |
| BCDE (hybrid) | | | | |
| BCDE (hybrid + factored) | | | | |

| Models | | | | |
|---|---|---|---|---|
| CVAE (Sohn et al., 2015) | - | - | - | |
| BCDE (conditional) | | | | |
| BCDE (naïve pre-train) | | | | |
| BCDE (hybrid) | | | | |
| BCDE (hybrid + factored) | | | | |

| Models | | | | |
|---|---|---|---|---|
| CVAE (Sohn et al., 2015) | - | - | - | |
| BCDE (conditional) | | | | |
| BCDE (naïve pre-train) | | | | |
| BCDE (hybrid) | | | | |
| BCDE (hybrid + factored) | | | | |

## 4 Experiments

We evaluated the performance of our hybrid training procedure on the permutation-invariant quadrant prediction task (Sohn et al., 2014, 2015) for MNIST, SVHN, and CelebA. The quadrant prediction task is a conditional density estimation problem in which part of each image is occluded. The model is given the observed region $x$ and is evaluated by its perplexity on the occluded region $y$. The quadrant prediction task consists of four sub-tasks depending on the degree of partial observability. 1-quadrant prediction: the bottom-left quadrant is observed. 2-quadrant prediction: the left half is observed. 3-quadrant prediction: only the bottom-right quadrant is unobserved. Top-down prediction: the top half is observed.
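As an illustration of this data preparation, the following sketch splits an image into observed and occluded regions per sub-task; the task-name strings and helper are our own shorthand, not the authors' code.

```python
import numpy as np

def quadrant_split(img, task):
    # Split a 2D image into observed x and occluded target y
    # following the sub-task conventions described above.
    H, W = img.shape
    h, w = H // 2, W // 2
    obs = np.zeros((H, W), dtype=bool)
    if task == "1-quadrant":      # bottom-left quadrant observed
        obs[h:, :w] = True
    elif task == "2-quadrant":    # left half observed
        obs[:, :w] = True
    elif task == "3-quadrant":    # all but the bottom-right quadrant observed
        obs[:] = True
        obs[h:, w:] = False
    elif task == "top-down":      # top half observed
        obs[:h, :] = True
    else:
        raise ValueError(task)
    return img[obs], img[~obs]
```

The boolean mask yields flattened pixel vectors, matching the permutation-invariant setting.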

In the fully-supervised case, the original MNIST training set is converted into our CDE training set by splitting each image into its observed and unobserved regions according to the quadrant prediction task. Note that the training set does not contain the original class label information. In the semi-supervised case, we randomly sub-sampled a fixed number of pairs to create our labeled training set $\mathcal{X}_L$. The remaining paired samples are decoupled and put into our unlabeled training sets $\mathcal{X}_u$ and $\mathcal{Y}_u$. Test performance is the conditional density estimation performance on the entire test set, which is also split into input and target according to the quadrant prediction task. An analogous procedure is used for SVHN and CelebA.

For comparison against Sohn et al. (2015), we evaluate the performance of our models on the MNIST 1-quadrant, 2-quadrant, and 3-quadrant prediction tasks. The MNIST digits are statically-binarized by sampling from the Bernoulli distribution according to their pixel values (Salakhutdinov & Murray, 2008). We use a sigmoid layer to learn the parameter of the Bernoulli observation model.
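The binarization step can be sketched as follows; this is a generic implementation of sampling-based binarization, not the authors' code:

```python
import numpy as np

def binarize(images, rng):
    # Sample each pixel from Bernoulli(p), where p is the grayscale
    # intensity in [0, 1] (Salakhutdinov & Murray, 2008).
    return (rng.random(images.shape) < images).astype(np.float64)
```

Sampling (rather than thresholding) preserves the stochastic interpretation of the grayscale intensities as Bernoulli parameters.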

We provide the performance on the top-down prediction task for SVHN and CelebA. We used a discretized logistic observation model (Kingma et al., 2016) to model the pixel values for SVHN and a Gaussian observation model with fixed variance for CelebA. For numerical stability, we rely on the implementation of the discretized logistic distribution described in Salimans et al. (2017).

In all cases, we extracted a validation set from the training data for hyperparameter tuning. While our training objective uses a single importance-weighted sample (Burda et al., 2015), we measure performance with a larger number of importance-weighted samples to get a tighter bound on the test log-likelihood (Sohn et al., 2015). We run multiple replicates of all experiments and report the mean performance with standard errors. For a more expressive variational family (Ranganath et al., 2015), we use two stochastic layers in the BCDE and perform inference via top-down inference (Sønderby et al., 2016). We use multi-layered perceptrons (MLPs) for MNIST and SVHN, and convolutional neural networks (CNNs) for CelebA. All neural networks are batch-normalized (Ioffe & Szegedy, 2015) and updated with Adam (Kingma & Ba, 2014). The number of training epochs is determined based on the validation set. The dimensionality of each stochastic layer is fixed separately for MNIST, CelebA, and SVHN. All models were implemented in Python (code: github.com/ruishu/bcde) using TensorFlow (Abadi, 2015).

### 4.1 Conditional Log-Likelihood Performance

The results tables show the performance comparisons between the CVAE and the BCDE. For baselines, we use the CVAE, the BCDE trained with the conditional objective, and the BCDE initialized via pre-training using the available $x$ and $y$ data separately (and then trained conditionally). Against these baselines, we measure the performance of the BCDE (with and without factored inference) trained with the hybrid objective $\mathcal{H}$. We tuned the regularization hyperparameter $\lambda$ on the MNIST 2-quadrant semi-supervised tasks and settled on a single value for all tasks.

Fully-supervised regime. By comparing performance in the fully-supervised regime for MNIST (see the MNIST results tables), we show that the hybrid BCDE achieves competitive performance against the pre-trained BCDE and out-performs previously reported results for the CVAE (Sohn et al., 2015).

Semi-supervised regime. As the labeled training size reduces, the benefit of having the hybrid training procedure becomes more apparent. The BCDEs trained with the hybrid objective tend to improve significantly upon their conditionally trained counterparts.

On MNIST, hybrid training of the factored BCDE achieves the best performance. Both hybrid models achieve over a 1-nat improvement over the pre-trained baseline in some cases, a significant difference for binarized MNIST (Wu et al., 2016). The conditional BCDE performs very poorly in the semi-supervised tasks due to overfitting.

On CelebA, hybrid training of the factored BCDE also achieves the best performance. Both hybrid models significantly out-perform the conditional baselines and yield better visual predictions than conditional BCDE (see Appendix C). The hybrid models also outperform pre-trained BCDE with only half the amount of labeled data.

On SVHN, the hybrid BCDE with standard inference model significantly out-performs the conditional baselines. However, the use of factored inference results in much poorer performance. Since the decoder is a discretized logistic distribution with learnable scale, it is possible that the factored inference model is not expressive enough to model the posterior distribution.

Model entropy. In Figure 3, we sample from $p_\theta(y \mid x)$ for the conditional BCDE and the hybrid BCDE. We show that the conditionally trained BCDE achieves poorer performance because it learns a lower-entropy model. In contrast, hybrid training learns a lower-perplexity model, resulting in a higher-entropy conditional image generator that spreads the conditional probability mass over the target output space (Theis et al., 2015).

### 4.2 Conditional Training Overfits

To demonstrate hybrid training's regularization behavior, we show the test set performance during training (Fig. 4) on the 2-quadrant MNIST task. Even with pre-trained initialization of parameters, models that were trained conditionally quickly overfit, resulting in poor test set performance. In contrast, hybrid training regularizes the conditional model toward the joint model, which is much more resilient against overfitting.

### 4.3 Robustness of Representation

Since hybrid training encourages the BCDE to consider the distribution of $x$, we can demonstrate that models trained in a hybrid manner are robust against structured perturbations of the data set. To show this, we experimented with two variants of the MNIST quadrant task, called the shift-sensitive and shift-invariant top-bottom prediction tasks. In these experiments, we use the semi-supervised setting with a fixed number of labeled samples.

#### 4.3.1 Shift-Sensitive Estimation

In the shift-sensitive task, the objective is to learn to predict the bottom half of the MNIST digit ($y$) when given the top half ($x$). However, we introduce structural perturbation to the top and bottom halves of the image in our training, validation, and test sets by randomly shifting each pair $(x, y)$ horizontally by the same number of pixels. We then train the BCDE using either the conditional or hybrid objective in the fully-supervised regime. Note that compared to the original top-down prediction task, the perplexity of the conditional task remains the same after the perturbation is applied.
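A minimal sketch of the paired-shift perturbation, assuming images are arrays whose last axis is the horizontal one; `max_shift` and the helper name are our own illustrative choices:

```python
import numpy as np

def shift_pair(top, bottom, max_shift, rng):
    # Shift both halves horizontally by the SAME random offset so the
    # (x, y) pair stays consistent and the conditional task is unchanged.
    s = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(top, s, axis=-1), np.roll(bottom, s, axis=-1)
```

Note that `np.roll` wraps pixels around; a padded shift would be an equally valid choice of perturbation.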

| Models | No Shift | Shift | |
|---|---|---|---|
| Conditional | | | |
| Hybrid | | | |
| Hybrid + Factored | | | |

Table 7 shows that hybrid training consistently achieves better performance than conditional training. Furthermore, the hybrid-trained models were less affected by the introduction of the perturbation, demonstrating a higher degree of robustness. Because of its more compact recognition model, hybrid + factored is less vulnerable to overfitting, resulting in a smaller gap between performance on the shifted and original data.

#### 4.3.2 Shift-Invariant Estimation

The shift-invariant task is similar to the shift-sensitive top-bottom task, but with one key difference: we only introduce structural noise to the top half of the image in our training, validation, and test sets. The goal is thus to learn that the prediction of $y$ (which is always centered) is invariant to the shifted position of $x$.

| Models | No Shift | Shift | |
|---|---|---|---|
| Conditional | | | |
| Hybrid | | | |
| Hybrid + Factored | | | |

Table 8 shows similar behavior to Table 7. Hybrid training continues to achieve better performance than conditional models and suffers a much smaller performance gap when structural corruption in $x$ is introduced.

In Fig. 5, we show the PCA projections of the latent-space sub-region populated by digits, color-coding all points by the degree of shift. We observe that hybrid versus conditional training of the BCDE results in very different learned representations in the stochastic layer. Because of regularization toward the joint model, the hybrid BCDE's latent representation retains information about $x$ and learns to untangle shift from the other features. As expected, conditional training does not encourage the BCDE to be aware of the distribution of $x$, resulting in a latent representation that is ignorant of the shift feature of $x$.

## 5 Conclusion

We presented a new framework for high-dimensional conditional density estimation. The building blocks of our framework are a pair of sibling models: the Bottleneck Conditional Density Estimator (BCDE) and the Bottleneck Joint Density Estimator (BJDE). These models use layers of stochastic neural networks as the bottleneck between the input and output data. While the BCDE learns the conditional distribution $p(y \mid x)$, the BJDE learns the joint distribution $p(x, y)$. The bottleneck constraint implies that only the bottleneck needs to be marginalized when either the input or the output is missing during training, thus enabling the BJDE to be trained in a semi-supervised fashion.

The key component of our framework is our hybrid objective function that regularizes the BCDE towards the BJDE. Our new objective is a novel extension of Lasserre et al. (2006) that enables the principle of hybrid blending to be applied to deep variational models. Our framework provides a new mechanism for the BCDE, a conditional model, to become more robust and to learn from unlabeled data in semi-supervised conditional density estimation.

Our experiments showed that hybrid training is competitive in the fully-supervised regime against pre-training, and achieves superior performance in the semi-supervised quadrant prediction task in comparison to conditional models, achieving new state-of-the-art performance on MNIST, SVHN, and CelebA. Even with pre-trained weight initializations, the conditional model is still susceptible to overfitting. In contrast, hybrid training is significantly more robust against overfitting. Furthermore, hybrid training transfers the nice embedding properties of the BJDE to the BCDE, allowing the BCDE to learn better and more robust representations of the input $x$. The success of our hybrid training framework makes it a prime candidate for other high-dimensional conditional density estimation problems, especially in semi-supervised settings.

## References

- Abadi (2015) Abadi, M., et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance Weighted Autoencoders. arXiv:1509.00519, 2015.
- Dayan et al. (1995) Dayan, P., Hinton, G., Neal, R., and Zemel, R. The Helmholtz Machine. Neural computation, 1995.
- Ganchev et al. (2010) Ganchev, K., Graca, J., Gillenwater, J., and Taskar, B. Posterior regularization for structured latent variable models. JMLR, 2010.
- Hinton et al. (1995) Hinton, G., Dayan, P., Frey, B., and Radford, R. The “wake-sleep” algorithm for unsupervised neural networks. Science, 1995.
- Holmes et al. (2012) Holmes, M. P., Gray, A. G., and Isbell, C. L. Fast Nonparametric Conditional Density Estimation. arXiv:1206.5278, 2012.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.
- Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
- Kingma & Welling (2013) Kingma, D. P and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2013.
- Kingma et al. (2014) Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-Supervised Learning with Deep Generative Models. arXiv:1406.5298, 2014.
- Kingma et al. (2016) Kingma, Diederik P., Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016. URL http://arxiv.org/abs/1606.04934.
- Lasserre et al. (2006) Lasserre, J., Bishop, C., and Minka, T. Principled hybrids of generative and discriminative models. In The IEEE Conference on Computer Vision and Pattern Recognition, 2006.
- Maaløe et al. (2016) Maaløe, L., Kaae Sønderby, C., Kaae Sønderby, S., and Winther, O. Auxiliary Deep Generative Models. arXiv:1602.05473, 2016.
- Ng & Jordan (2002) Ng, A. and Jordan, M. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Neural Information Processing Systems, 2002.
- Ranganath et al. (2015) Ranganath, R., Tran, D., and Blei, D. M. Hierarchical Variational Models. ArXiv e-prints, 1511.02386, November 2015.
- Rezende et al. (2014) Rezende, D., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ArXiv e-prints, 1401.4082, January 2014.
- Salakhutdinov & Murray (2008) Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. International Conference on Machine Learning, 2008.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017. URL http://arxiv.org/abs/1701.05517.
- Sohn et al. (2014) Sohn, K., Shang, W., and Lee, H. Improved multimodal deep learning with variation of information. Neural Information Processing Systems, 2014.
- Sohn et al. (2015) Sohn, K., Yan, X., and Lee, H. Learning structured output representation using deep conditional generative models. Neural Information Processing Systems, 2015.
- Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Kaae Sønderby, S., and Winther, O. Ladder Variational Autoencoders. arXiv:1602.02282, 2016.
- Theis et al. (2015) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. arXiv:1511.01844, 2015.
- Tipping & Bishop (1999) Tipping, M. and Bishop, C. Probabilistic Principal Component Analysis. J. R. Statist. Soc. B, 1999.
- Wu et al. (2016) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. On the Quantitative Analysis of Decoder-Based Generative Models. arXiv:1611.04273, 2016.
- Yan et al. (2015) Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2Image: Conditional Image Generation from Visual Attributes. arXiv:1512.00570, 2015.
- Zhang et al. (2016) Zhang, B., Xiong, D., Su, J., Duan, H., and Zhang, M. Variational Neural Machine Translation. arXiv:1605.07869, 2016.

## Appendix A Factored Inference

When training the BJDE in the semi-supervised regime, we introduce a factored inference procedure that reduces the number of parameters in the recognition model.

In the semi-supervised regime, the 1-layer BJDE recognition model requires approximating three posteriors: $q(z \mid x)$, $q(z \mid y)$, and $q(z \mid x, y)$. The standard approach would be to assign one recognition network to each approximate posterior. This approach, however, does not take advantage of the fact that these posteriors share the same likelihood functions, i.e., $p(x \mid z)$ and $p(y \mid z)$.

Rather than learning the three approximate posteriors independently, we propose to learn approximate likelihood functions $\hat\ell(z; x) \approx p(x \mid z)$ and $\hat\ell(z; y) \approx p(y \mid z)$, and let $q(z \mid x, y) \propto p(z)\, \hat\ell(z; x)\, \hat\ell(z; y)$. Consequently, this factorization of the recognition model enables parameter sharing within the joint recognition model (which is beneficial for semi-supervised learning) and eliminates the need for constructing a neural network that takes both $x$ and $y$ as inputs. The latter property is especially useful when learning a joint model over multiple, heterogeneous data types (e.g., image, text, and audio).

In practice, we directly learn recognition networks for $q(z \mid x)$ and $q(z \mid y)$ and perform factored inference as follows:

$$q(z \mid x, y) \propto \frac{q(z \mid x)\, q(z \mid y)}{p(z)}, \tag{14}$$

where $\tilde\phi$ parameterizes the recognition networks. To ensure proper normalization in Eq. 14, it is sufficient for the implicit likelihood terms $q(z \mid \cdot)/p(z)$ to be bounded. If the prior belongs to an exponential family with sufficient statistics $T(z)$, we can parameterize $\hat\ell(z; x) = \exp\langle \eta(x), T(z) \rangle$, where $\eta$ is a network mapping $x$ into the natural-parameter space. Then the approximate posterior can be obtained by simple addition in the natural-parameter space of the corresponding exponential family. When the prior and approximate likelihood are both Gaussian, this is exactly precision-weighted merging of the means and variances (Sønderby et al., 2016).
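For diagonal Gaussians, precision-weighted merging has a simple closed form; a small sketch:

```python
import numpy as np

def precision_weighted_merge(mu1, var1, mu2, var2):
    # Product of two Gaussian factors (up to normalization): the
    # precisions add, and the merged mean is the precision-weighted
    # average of the two means.
    prec = 1.0 / var1 + 1.0 / var2
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var
```

For example, merging $\mathcal{N}(0, 1)$ and $\mathcal{N}(2, 1)$ yields mean 1 and variance 0.5, the intuitive behavior: two agreeing sources of evidence shrink the posterior variance.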

## Appendix B Derivation of the Hybrid Objective

We first provide the derivation of Eq. 11. We begin with the factorization proposed in Eq. 7, which we repeat here for self-containedness,

$$p(x, y, \theta, \tilde\theta) = p(\theta, \tilde\theta)\, p(y \mid x, \theta)\, p(x \mid \tilde\theta). \tag{15}$$

Since our model includes unpaired $x$ and $y$, we modify Eq. 15 to include $\mathcal{X}_u$ and $\mathcal{Y}_u$:

$$p(\mathcal{X}_L, \mathcal{Y}_L, \mathcal{X}_u, \mathcal{Y}_u, \theta, \tilde\theta) = p(\theta, \tilde\theta)\, p(\mathcal{Y}_L \mid \mathcal{X}_L, \theta)\, p(\mathcal{X}_L \mid \tilde\theta)\, p(\mathcal{X}_u \mid \tilde\theta)\, p(\mathcal{Y}_u \mid \tilde\theta). \tag{16}$$

To account for the variational parameters, we include them in the joint density as well,

(17) |

By taking the log and replacing the necessary densities with their variational lower bounds,

(18) |

we arrive at Eq. 11. We note, however, that a more general hybrid objective Eq. 13 is achievable. To derive the general objective, we consider an alternative factorization of the joint density in Eq. 17,

(19) |

We factorize the likelihood term such that $x$ and $y$ are always explained by the joint parameters $\tilde\theta$,

(20) |

We then introduce an auxiliary variable,

(21) |

where

(22) | ||||

(23) |

Using Jensen's inequality, we can lower bound this term with

(24) |

By taking the log of Eq. 19, replacing all remaining densities with their variational lower bounds, and choosing the auxiliary variable appropriately,

(25) | |||

(26) |

we arrive at the general hybrid objective. Note that for the appropriate setting of the auxiliary variable, Eq. 26 reduces to Eq. 18.

## Appendix C Visualizations for CelebA and SVHN

We show visualizations of the hybrid BCDE predictions for CelebA and SVHN on the top-down prediction task in the semi-supervised regime. For each data set, we visualize both the images sampled during reconstruction as well as predictions using an approximation of the MAP estimate, obtained by greedily sampling the mode of each conditional distribution in the generative path.