# Universal Conditional Machine

###### Abstract

We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in “one shot”. The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, as well as feature imputation and image inpainting problems, shows the effectiveness of the proposed approach and diversity of the generated samples.


Oleg Ivanov, Lomonosov Moscow State University, Moscow, Russia, tigvarts@gmail.com
Michael Figurnov, Higher School of Economics, Moscow, Russia, michael@figurnov.ru
Dmitry Vetrov, Higher School of Economics, Moscow, Russia, vetrovd@yandex.ru

Preprint. Work in progress.

## 1 Introduction

In past years, a number of generative probabilistic models based on neural networks have been proposed. The most popular approaches include variational autoencoder [1] (VAE) and generative adversarial net [2] (GAN). They learn a distribution over objects and allow sampling from this distribution.

In many cases, we are interested in learning a conditional distribution p(x | y). For instance, if x is an image of a face, y could be the characteristics describing the face (whether glasses are present; the length of hair, etc.). Conditional variational autoencoder [3] and conditional generative adversarial nets [4] are popular methods for this problem.

In this paper, we consider the problem of learning all conditional distributions of the form p(x_b | x_{1−b}), where x is the vector of all features and b is an arbitrary binary mask selecting a subset of them. This problem generalizes both learning the joint distribution p(x) and learning the conditional distribution p(x | y). To tackle this problem, we propose a Universal Conditional Machine (UCM) model. It is a latent variable model similar to VAE, but it allows conditioning on an arbitrary subset of the features. The conditioning features affect the prior on the latent Gaussian variables, which are used to generate the unobserved features. The model is trained using stochastic gradient variational Bayes [1].

We consider two applications of the proposed model. The first one is feature imputation, where the goal is to restore the missing features given the observed ones. The imputed values may be valuable by themselves or may improve the performance of other machine learning algorithms which process the dataset. Another application is image inpainting, in which the goal is to fill in an unobserved part of an image with artificial content in a realistic way. This can be used for removing unwanted objects from images or, vice versa, for completing partially occluded or corrupted objects.

The experimental evaluation shows that the proposed model successfully samples from the conditional distributions. The distribution over samples is close to the true conditional distribution. This property is very important when the true distribution has several modes. We demonstrate the effectiveness of the model on artificial data for which the true conditional distribution is analytically tractable. The model is shown to be effective in the feature imputation problem, which helps to increase the quality of subsequent discriminative models on different problems from the UCI datasets collection [5]. We demonstrate that the model can generate diverse and realistic image inpaintings on the MNIST [6], Omniglot [7] and CelebA [8] datasets.

The paper is organized as follows. In section 2 we briefly describe variational autoencoders and conditional variational autoencoders. In section 3 we define the problem and describe the UCM model and its training procedure. In section 4 we evaluate UCM: firstly, we show its capability of learning complex distributions on synthetic data; then, we evaluate UCM on the missing feature imputation problem; finally, we test UCM on the image inpainting task. In section 5 we review related work. Section 6 concludes the paper. The appendix contains additional explanations, experiments with UCM, and a link to the source code for the model.

## 2 Background

### 2.1 Variational Autoencoder

Variational autoencoder [1] (VAE) is a directed generative model with latent variables. The generative process is as follows: first, a latent variable z is generated from the prior distribution p(z), and then the data x is generated from the generative distribution p_θ(x | z), where θ are the generative model’s parameters. This process induces the distribution p_θ(x) = E_{p(z)} p_θ(x | z). The distribution p_θ(x | z) is modeled by a neural network with parameters θ, and p(z) is a standard Gaussian distribution.

The parameters θ are tuned by maximizing the likelihood of the training data points from the true data distribution p_d(x). In general, this optimization problem is challenging due to intractable posterior inference. However, a variational lower bound can be optimized efficiently using backpropagation and stochastic gradient descent:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \mathcal{L}_{VAE}(x; \theta, \phi) \tag{1}$$

Here q_φ(z | x) is a proposal distribution, parameterized by a neural network with parameters φ, that approximates the posterior p_θ(z | x). Usually this distribution is Gaussian with a diagonal covariance matrix. The closer q_φ(z | x) is to p_θ(z | x), the tighter the variational lower bound L_VAE. To compute the gradient of the variational lower bound with respect to φ, the reparameterization trick is used: z = μ_φ(x) + ε ⊙ σ_φ(x), where ε ∼ N(0, I) and μ_φ and σ_φ are deterministic functions parameterized by neural networks. So the gradient can be estimated using the Monte-Carlo method for the first term and by computing the second term analytically:

$$\nabla_{\theta, \phi} \mathcal{L}_{VAE}(x; \theta, \phi) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \nabla_{\theta, \phi} \log p_\theta\big(x \mid \mu_\phi(x) + \varepsilon \odot \sigma_\phi(x)\big) - \nabla_{\theta, \phi} \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{2}$$

Thus, L_VAE can be optimized using stochastic gradient ascent with respect to θ and φ.
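As a concrete illustration of the bound (1) and the reparameterization trick, here is a minimal NumPy sketch of a single-sample ELBO estimate. The function name and the toy interface (a `decoder_log_lik` callback standing in for the generative network) are our own illustration, not the paper's code:

```python
import numpy as np

def elbo_estimate(mu_q, log_sigma_q, decoder_log_lik, rng):
    """Single-sample estimate of the VAE lower bound L_VAE(x).

    mu_q, log_sigma_q: outputs of the proposal (encoder) network q_phi(z|x);
    decoder_log_lik:   a function z -> log p_theta(x|z).
    """
    # Reparameterization trick: z = mu + eps * sigma, eps ~ N(0, I).
    eps = rng.standard_normal(mu_q.shape)
    z = mu_q + eps * np.exp(log_sigma_q)
    rec = decoder_log_lik(z)  # Monte-Carlo estimate of the first term of (1)
    # KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian, computed analytically.
    kl = 0.5 * np.sum(np.exp(2 * log_sigma_q) + mu_q ** 2 - 1 - 2 * log_sigma_q)
    return rec - kl
```

Because `z` is a deterministic function of `eps` and the encoder outputs, the gradient of this estimate with respect to the encoder parameters is well defined, which is exactly what (2) exploits.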

### 2.2 Conditional Variational Autoencoder

Conditional variational autoencoder [3] (CVAE) approximates the conditional distribution p(x | y). It outperforms deterministic models when the distribution p(x | y) is multi-modal (diverse x’s are probable for a given y). For example, assume that x is a real-valued image. Then a deterministic regression model with mean squared error loss would predict the average, blurry value of x. On the other hand, CVAE learns the distribution of x given y, from which one can sample diverse and realistic objects.

The variational lower bound for CVAE can be derived similarly to VAE by conditioning all considered distributions on y:

$$\log p_{\psi,\theta}(x \mid y) \ge \mathbb{E}_{q_\phi(z \mid x, y)} \log p_\theta(x \mid z, y) - \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\psi(z \mid y)\big) = \mathcal{L}_{CVAE}(x, y; \theta, \psi, \phi) \tag{3}$$

Similarly to VAE, this objective is optimized using the reparameterization trick. Note that the prior distribution p_ψ(z | y) is conditioned on y and is modeled by a neural network with parameters ψ. Thus, CVAE uses three trainable neural networks, while VAE uses only two.

In order to estimate the log-likelihood of CVAE, the authors use two methods:

$$\log p_{\psi,\theta}(x \mid y) \approx \log \frac{1}{S} \sum_{i=1}^{S} p_\theta(x \mid z_i, y), \quad z_i \sim p_\psi(z \mid y) \tag{4}$$

$$\log p_{\psi,\theta}(x \mid y) \approx \log \frac{1}{S} \sum_{i=1}^{S} \frac{p_\theta(x \mid z_i, y)\, p_\psi(z_i \mid y)}{q_\phi(z_i \mid x, y)}, \quad z_i \sim q_\phi(z \mid x, y) \tag{5}$$

The first estimator is called the Monte-Carlo estimator and the second one the Importance Sampling estimator. Both are consistent estimators of the likelihood, but in practice the Monte-Carlo estimator requires many more samples to reach the same accuracy. A small number of samples S leads to underestimation of the log-likelihood for both Monte-Carlo and Importance Sampling [9], but for Monte-Carlo the underestimation is much stronger.
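The difference between the two estimators can be seen on a toy model where everything is tractable. In the sketch below (our own illustration, not from the paper), the latent is z ∼ N(0, 1) and x | z ∼ N(z, 1), so p(x) = N(x | 0, 2) and the exact posterior N(x/2, 1/2) can serve as the proposal. With this ideal proposal, the Importance Sampling estimate (5) is exact for any S, while the Monte-Carlo estimate (4) needs many prior samples:

```python
import math
import numpy as np

def log_norm(x, mean, var):
    # Log-density of a univariate Gaussian N(x | mean, var).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def mc_estimate(x, S, rng):
    # Monte-Carlo estimator (4): sample z from the prior p(z) = N(0, 1).
    z = rng.standard_normal(S)
    return np.logaddexp.reduce(log_norm(x, z, 1.0)) - math.log(S)

def is_estimate(x, S, rng):
    # Importance Sampling estimator (5) with proposal q(z|x) = N(x/2, 1/2).
    # Here q is the exact posterior, so every importance weight equals p(x).
    z = x / 2 + math.sqrt(0.5) * rng.standard_normal(S)
    log_w = log_norm(x, z, 1.0) + log_norm(z, 0.0, 1.0) - log_norm(z, x / 2, 0.5)
    return np.logaddexp.reduce(log_w) - math.log(S)
```

With an imperfect proposal the IS weights vary, but a proposal close to the posterior still gives a far lower-variance estimate than sampling from the prior.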

During training, the proposal distribution q_φ(z | x, y) is used to generate the latent variables z, while at the testing stage the prior p_ψ(z | y) is used. The KL divergence tries to close the gap between the two distributions but, according to the authors, it is not enough. A vector z sampled from p_ψ(z | y) may differ substantially from every vector z produced by q_φ(z | x, y) at the training stage, so the generator network may be confused. To overcome this issue, the authors propose to use a hybrid model (7), a weighted mixture of the variational lower bound (3) and a single-sample Monte-Carlo estimation of the log-likelihood (6). The model corresponding to the second term is called Gaussian Stochastic Neural Network (6), because it is a feed-forward neural network with a single Gaussian stochastic layer in the middle:

$$\mathcal{L}_{GSNN}(x, y; \theta, \psi) = \mathbb{E}_{p_\psi(z \mid y)} \log p_\theta(x \mid z, y) \tag{6}$$

$$\mathcal{L}(x, y; \theta, \psi, \phi) = \alpha \mathcal{L}_{CVAE}(x, y; \theta, \psi, \phi) + (1 - \alpha) \mathcal{L}_{GSNN}(x, y; \theta, \psi) \tag{7}$$

Nevertheless, we discovered one disadvantage of this technique in our model, as described in section 3.3.2.

## 3 Universal Conditional Machine

### 3.1 Problem Statement

Consider a distribution p_d(x) over a D-dimensional vector x with real-valued or categorical components. The components of the vector are called features.

Let b ∈ {0, 1}^D be a binary mask of unobserved features of the object. Then we describe the vector of unobserved features as x_b = {x_i : b_i = 1}. For example, x_{(0, 1, 1, 0)} = (x_2, x_3). Using this notation, we denote x_{1−b} as the vector of observed features.

Our goal is to build a model p_{ψ,θ}(x_b | x_{1−b}, b) of the conditional distribution p_d(x_b | x_{1−b}, b) for an arbitrary b, where ψ and θ are the parameters that are used in the model at the testing stage.

However, the true distribution p_d(x_b | x_{1−b}, b) is intractable without strong assumptions about p_d(x). Therefore, our model p_{ψ,θ} has to be more precise for some masks b and less precise for others. To formalize our requirements about the accuracy of the model, we introduce a distribution p(b) over different unobserved feature masks. This distribution is arbitrary and may be defined by the user depending on the problem. Formally, it should have full support over {0, 1}^D so that p_{ψ,θ} can evaluate arbitrary conditioning. Nevertheless, this is not necessary if the model is used only for specific kinds of conditioning (as we do in section 4.3).

Using p(b), we can introduce the following log-likelihood objective function for the model:

$$\max_{\psi, \theta} \; \mathbb{E}_{p_d(x)} \mathbb{E}_{p(b)} \log p_{\psi,\theta}(x_b \mid x_{1-b}, b) \tag{8}$$

The special cases of the objective (8) are the variational autoencoder (b_i = 1 for all i) and the conditional variational autoencoder (b is constant).

### 3.2 Model Description

The generative process of our model is similar to that of CVAE: for each object, we first generate the latent variables z ∼ p_ψ(z | x_{1−b}, b) using the prior network, and then sample the unobserved features x_b ∼ p_θ(x_b | z, x_{1−b}, b) using the generative network. This process induces the following model distribution over the unobserved features:

$$p_{\psi,\theta}(x_b \mid x_{1-b}, b) = \mathbb{E}_{z \sim p_\psi(z \mid x_{1-b}, b)}\, p_\theta(x_b \mid z, x_{1-b}, b) \tag{9}$$

We use a continuous latent vector z and a fully-factorized Gaussian distribution over it, with parameters produced by a neural network with weights ψ: p_ψ(z_i | x_{1−b}, b) = N(z_i | μ_{ψ,i}(x_{1−b}, b), σ²_{ψ,i}(x_{1−b}, b)). The real-valued components of the distribution p_θ(x_b | z, x_{1−b}, b) are defined likewise. Each categorical component i of this distribution is parameterized by a function h_{θ,i}(z, x_{1−b}, b) whose outputs are the logits of the probabilities of each category: p_θ(x_i | z, x_{1−b}, b) = softmax(h_{θ,i}(z, x_{1−b}, b)). We assume that the components of the latent vector z are conditionally independent given x_{1−b} and b, and that the components of x_b are conditionally independent given z, x_{1−b} and b.

The variables x_b and x_{1−b} have variable length that depends on b. So, in order to use architectures such as multi-layer perceptrons and convolutional neural networks, we consider x_{1−b} = x ∘ (1 − b), where ∘ is the element-wise product. Thus, in the implementation, x_{1−b} has fixed length. The output of the generative network also has fixed length, but only the unobserved components are used to compute the likelihood.
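A minimal sketch of this fixed-length encoding; appending the mask itself to the input (so that the network can distinguish a zeroed-out unobserved feature from a genuinely zero observed one) is our reading of the model description, not an exact transcription of the paper's architecture:

```python
import numpy as np

def network_input(x, b):
    """Fixed-length encoding of the observed part of x.

    b is the binary mask of unobserved features (b_i = 1 means x_i is hidden).
    Unobserved entries are zeroed out via the element-wise product x * (1 - b),
    and the mask is appended so the network can also condition on b.
    """
    return np.concatenate([x * (1 - b), b])

x = np.array([3.0, -1.0, 2.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 1.0])   # features 2 and 4 are unobserved
inp = network_input(x, b)            # length 2 * D, regardless of b
```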

### 3.3 Learning Universal Conditional Machine

#### 3.3.1 Variational Lower Bound

We can perform the derivation for log p_{ψ,θ}(x_b | x_{1−b}, b) as for the variational autoencoder:

$$\log p_{\psi,\theta}(x_b \mid x_{1-b}, b) \ge \mathbb{E}_{q_\phi(z \mid x, b)} \log p_\theta(x_b \mid z, x_{1-b}, b) - \mathrm{KL}\big(q_\phi(z \mid x, b) \,\|\, p_\psi(z \mid x_{1-b}, b)\big) = \mathcal{L}_{UCM}(x, b; \theta, \psi, \phi) \tag{10}$$

Therefore we have the following variational lower bound optimization problem:

$$\max_{\theta, \psi, \phi} \; \mathbb{E}_{p_d(x)} \mathbb{E}_{p(b)} \mathcal{L}_{UCM}(x, b; \theta, \psi, \phi) \tag{11}$$

We use a fully-factorized Gaussian proposal distribution q_φ(z | x, b), which allows us to perform the reparameterization trick and compute the KL divergence analytically in order to optimize (11).
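Unlike plain VAE, the KL term in (11) is between two learned diagonal Gaussians (the proposal and the prior network's output). It still has a closed form; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) ),
    computed analytically per component and summed."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return np.sum(np.log(sigma_p / sigma_q)
                  + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p)
                  - 0.5)
```

When the prior is the standard Gaussian (mu_p = 0, sigma_p = 1), this reduces to the familiar VAE KL term.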

#### 3.3.2 Hybrid Model

The hybrid model was proposed in [3]. The key idea is that samples from the prior p_ψ(z | x_{1−b}, b) do not participate in the training procedure, even though this is the distribution used to generate the latent variables at the testing stage. Such inconsistency can confuse the generator network. In [3] the authors propose to use a weighted mixture of the variational lower bound and a one-sample Monte-Carlo estimate of the log-likelihood (the model corresponding to the latter term is called Gaussian Stochastic Neural Network, or GSNN). They report that this mixture trick improves the log-likelihood on the majority of datasets. The same trick is applicable to our model as well:

$$\mathcal{L}_{GSNN}(x, b; \theta, \psi) = \mathbb{E}_{p_\psi(z \mid x_{1-b}, b)} \log p_\theta(x_b \mid z, x_{1-b}, b) \tag{12}$$

$$\mathcal{L}(x, b; \theta, \psi, \phi) = \alpha \mathcal{L}_{UCM}(x, b; \theta, \psi, \phi) + (1 - \alpha) \mathcal{L}_{GSNN}(x, b; \theta, \psi) \tag{13}$$

Nevertheless, using the hybrid model is often not reasonable. Our experiments in section 4.1 show that the GSNN term makes the shape of the learned distribution closer to unimodal. So if the true conditional distributions are multimodal and α < 1, the model is unlikely to learn these true distributions.

Based on our experiments, we recommend using this model with care, or even avoiding it completely by setting α = 1, i.e. optimizing the variational lower bound (11) only.

#### 3.3.3 Prior In Latent Space

During the optimization of objective (11), the parameters μ_ψ and σ_ψ of the prior distribution p_ψ(z | x_{1−b}, b) may tend to infinity, since there is no penalty for large values of these parameters. We usually observe the growth of these parameters during training, though it is slow enough. To prevent potential numerical instabilities, we put a Normal-Gamma prior on the parameters of the prior distribution. Formally, we redefine p_ψ as follows:

$$p_\psi(z \mid x_{1-b}, b) = \mathcal{N}(z \mid \mu_\psi, \sigma_\psi^2)\, \mathcal{N}(\mu_\psi \mid 0, \sigma_\mu^2)\, \mathrm{Gamma}(\sigma_\psi \mid 2, \sigma_\sigma) \tag{14}$$

As a result, the regularizers −μ_ψ² / (2σ_μ²) and σ_σ(log σ_ψ − σ_ψ) are added to the model log-likelihood. The hyperparameter σ_μ is chosen to be large and σ_σ is taken to be a small positive number. This prior is close to uniform near zero, so it does not affect the learning process significantly.

#### 3.3.4 Missing Features

The optimization objective (11) requires all features of each object at the training stage: some of the features will be observed variables at the input of the model, and others will be unobserved features used to evaluate the model. Nevertheless, in some problem settings the training data contains missing features too. We propose the following slight modification of problem (11) in order to cover such settings as well.

The missing values cannot be observed, so we require p(b_i = 1 | x_i = ω) = 1, where ω denotes a missing value in the data. This requirement means that the unobserved feature mask depends on the object: b ∼ p(b | x). In the reconstruction loss (9) we simply omit the missing features, i.e. marginalize them out:

$$\log p_\theta(x_b \mid z, x_{1-b}, b) = \sum_{i:\, b_i = 1,\, x_i \neq \omega} \log p_\theta(x_i \mid z, x_{1-b}, b) \tag{15}$$

The proposed modification is evaluated in section 4.2.
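A sketch of the modified reconstruction term (15) for real-valued features; encoding missing values ω as NaN is our implementation choice, not the paper's:

```python
import numpy as np

def masked_gaussian_log_lik(x, b, mu, log_sigma):
    """Reconstruction term (15): Gaussian log-density summed only over
    features that are unobserved (b_i = 1) AND actually present in the data.
    Missing values are encoded as NaN (playing the role of omega)."""
    log_p = (-0.5 * np.log(2 * np.pi) - log_sigma
             - 0.5 * ((x - mu) / np.exp(log_sigma)) ** 2)
    keep = (b == 1) & ~np.isnan(x)       # marginalize the missing features out
    return np.sum(np.where(keep, log_p, 0.0))
```

Observed features (b_i = 0) and missing features contribute nothing, so the gradient only flows through the unobserved-but-present components, exactly as (15) prescribes.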

## 4 Experiments

### 4.1 Synthetic Data

In this section we show that the model is capable of learning a complex multimodal distribution of synthetic data. We use a two-dimensional multimodal distribution p_d(x_1, x_2), plotted on figure 2. The dataset contains 100000 points sampled from p_d. We use a multi-layer perceptron with four ReLU layers of size 400-200-100-50 and 25-dimensional Gaussian latent variables.

For different mixture coefficients α we visualize samples from the learned distributions p(x_1, x_2), p(x_1 | x_2), and p(x_2 | x_1). The observed features for the conditional distributions are generated from the marginal distributions p(x_2) and p(x_1) respectively.

Table 1: Importance Sampling (IS) and Monte-Carlo (MC) log-likelihood estimates for different UCM mixture weights.

We see in table 1 and on figure 2 that even with a very small weight, the GSNN term prevents the model from learning distributions with several local optima. GSNN also increases the low-sample Monte-Carlo log-likelihood estimate while decreasing the much more precise Importance Sampling estimate. When the GSNN term dominates, the whole distribution structure is lost.

We see that using α < 1 ruins the multimodality of the restored distribution, so we highly recommend using α = 1 or at least α close to 1. Further in the experiments we optimize the variational lower bound only (α = 1).

### 4.2 Missing Features Imputation

Datasets with missing features are widespread. Consider a dataset with D-dimensional objects x, where each feature may be missing (which we denote by x_i = ω), together with target values y. The majority of discriminative methods do not support missing values in the objects. The procedure of filling in the missing feature values is called missing features imputation.

A simple baseline is to replace the missing feature of an object with the average value of that feature or with some special value. Our model provides a more flexible way of feature imputation: it allows generating a number of different imputations for each object from the distribution over the missing features. In the experiments, we replace each object with missing features by n objects with sampled imputations, so the size of the dataset increases by a factor of n. Then, the classifier or regressor is trained on the extended training set. For the test set, the algorithm is applied to each of the n imputed objects and the predictions are combined: for regression problems we use the mean of the predictions, while for classification we choose the mode.
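The impute-and-ensemble workflow above can be sketched as follows; `sample_imputation` and `predict` are hypothetical stand-ins for the trained generative model and the downstream classifier or regressor:

```python
import numpy as np

def predict_with_imputations(x, sample_imputation, predict, n, task):
    """Impute-and-ensemble prediction for one object with missing features.

    sample_imputation(x) -> a copy of x with its missing values filled in,
                            drawn from the model's conditional distribution;
    predict(x)           -> prediction of the downstream model;
    n                    -> number of sampled imputations per object.
    """
    preds = [predict(sample_imputation(x)) for _ in range(n)]
    if task == "regression":
        return np.mean(preds)            # average the n predictions
    values, counts = np.unique(preds, return_counts=True)
    return values[np.argmax(counts)]     # majority vote for classification
```

At training time the same idea applies: each incomplete object is replaced by n imputed copies before fitting the downstream model.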

Our experiments show that better results are achieved when our model learns the concatenation of the object's features x and target y. An example that shows the necessity of this is given in appendix B. We treat y as an additional feature that is always unobserved at the testing time. To make a fair comparison of feature imputation quality, we consider the workflow where the values of y predicted by UCM are not fed to the classifier or regressor.

To preserve the information about which values were imputed and which were observed, we can extend the feature representation of the object with the binary mask of its missing values.

To train our model we use a mask distribution p(b) in which the components corresponding to values missing from the data are always unobserved, while the remaining components are unobserved with a fixed probability.

We split the dataset into a train and a test set such that every third object is in the test set. Before training we randomly drop 50% of the values in both the train and the test set, and normalize the real-valued features. After that we train UCM, GSNN and a Multivariate Gaussian (MG). We impute the missing features using average imputation and the above algorithms. For the downstream task we use the Gradient Boosting Classifier or Regressor from the XGBoost [10] Python package. We also compare imputation with XGBoost's built-in missing data processing. We repeat this procedure 10 times with different dropped features and then average the results and compute their standard deviation.

As we can see in table 2, in the majority of cases using UCM increases the score of the classifier or regressor trained on the imputed data. In the cases when UCM does not outperform GSNN, we can assume that the conditional distribution to be learned is not multimodal or complicated enough, so the simpler model learns it better.

### 4.3 Image Inpainting

The image inpainting problem has a number of different formulations. The formulation of interest to us is as follows: some of the pixels of an image are unobserved and we want to restore them in a natural way. Unlike the majority of papers, we want to restore not just a single most probable inpainting, but the distribution over all possible inpaintings, from which we can sample. This distribution is extremely multi-modal, because there are often many different plausible ways to inpaint an image.

Unlike the previous subsection, here we have uncorrupted images without missing features in the training set, so p(b | x) = p(b). Also, the distribution over the mask of unobserved pixels is not an element-wise Bernoulli distribution, because otherwise the reconstruction problem would be too simple.

A common case in image inpainting is when the unobserved pixels form a rectangle. We sample the corner points of the rectangles uniformly on the image, rejecting those rectangles whose area is less than a quarter of the image area. Another case is sampling a horizontal line of some width uniformly on the image and restoring the rest of the image.
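A sketch of the rectangular mask sampler with the area-based rejection described above (the exact corner-sampling details may differ from the paper's implementation):

```python
import numpy as np

def random_rectangle_mask(h, w, rng):
    """Sample a rectangular mask of unobserved pixels: draw two corner points
    uniformly and reject rectangles smaller than 1/4 of the image area."""
    while True:
        y = np.sort(rng.integers(0, h, size=2))
        x = np.sort(rng.integers(0, w, size=2))
        if (y[1] - y[0]) * (x[1] - x[0]) >= h * w / 4:
            b = np.zeros((h, w), dtype=np.uint8)
            b[y[0]:y[1], x[0]:x[1]] = 1   # 1 = unobserved pixel
            return b
```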

As we show in section 5, state-of-the-art results use various adversarial losses to achieve sharper and more realistic samples. The UCM model can be adapted to the image inpainting task by using a combination of such adversarial losses as a part of the reconstruction loss. Nevertheless, such a construction is out of the scope of this work, so we leave it for future research. In the current work we show that the model can generate both diverse and realistic inpaintings.

In all experiments we use the Adam optimization method [13], skip-connections between the prior network and the generative network inspired by [14], [15] and [16], and convolutional neural networks based on ResNet blocks [17].

We trained and evaluated UCM on binarized MNIST [6], Omniglot [7] and CelebA [8]. The details of the learning procedure and descriptions of the datasets are available in appendices A and D. Examples of inpaintings are shown in figure 5.

The table 3 shows that UCM learns distribution over inpaintings better than other models.

Table 3: Negative log-likelihood estimates on MNIST, Omniglot, and CelebA for UCM (IS and MC estimates), GSNN (MC estimates), and a Naive Bayes baseline.

## 5 Related Work

Generative adversarial network (GAN) [2] is a popular generative model capable of producing sharp and realistic image samples. Unfortunately, this model suffers from two issues limiting its applicability. The first one is mode collapse: the model learns just a subset of possible outputs [18, 19]. The second issue is that GANs only support real-valued outputs and therefore are not applicable to categorical features. Because of these downsides, we build UCM based on variational autoencoders.

Universal Marginalizer [20] (UM) is a model based on a feed-forward neural network which approximates the marginals of the unobserved features conditioned on the observed values. Sampling from the joint conditional distribution is performed via the chain rule, requiring one pass through the model per unobserved feature instead of just one pass in total for UCM. That makes UM up to 1000 times slower than UCM at the testing stage, even on the MNIST dataset. Also, without complicating the model, it cannot naturally learn and sample distributions where a marginal distribution over a single component is multi-modal itself. The relation between UCM and UM is similar to the relation between VAE and PixelCNN [21]. A more detailed description of UM, its experimental evaluation, and a comparison with UCM are given in appendix E.

Image inpainting is a classic computer vision problem. Most of the earlier methods rely on local and texture information or hand-crafted problem-specific features [22]. In past years multiple neural network based approaches have been proposed.

[23], [24] and [25] use different kinds and combinations of adversarial, reconstruction, texture and other losses. In [26] a GAN is first trained on the whole training dataset. Inpainting is then an optimization procedure that finds the latent variables that best explain the observed features. The obtained latents are passed through the generative model to restore the unobserved part of the image. In this sense, UCM is a similar model which uses a prior network to find proper latents instead of solving an optimization problem.

All the described methods aim to produce a single realistic inpainting, while UCM is capable of sampling diverse inpaintings. Additionally, the last three models have a high test-time computational cost of inpainting, because they require an optimization problem to be solved. On the other hand, UCM is a “single-shot” method with a low computational cost.

## 6 Conclusion

In this paper we consider the problem of simultaneous learning of all conditional distributions p(x_b | x_{1−b}, b) for a feature vector. This problem has a number of different special cases with practical applications. We proposed a neural-network-based probabilistic model with Gaussian latent variables for this conditional distribution learning problem. The model is scalable and efficient in inference and learning. We proposed several tricks to improve optimization and gave recommendations about the choice of hyperparameters. The model was successfully applied to feature imputation and inpainting tasks. The experimental results show that the model is capable of learning rich conditional distributions. An interesting direction for future work is adding an adversarial loss term to avoid blurry outputs, a common problem of VAEs.

## References

- [1] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
- [2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [3] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc., 2015.
- [4] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
- [5] M. Lichman. UCI machine learning repository, 2013.
- [6] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.
- [7] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- [8] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- [9] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- [10] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
- [11] I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete research, 28(12):1797–1808, 1998.
- [12] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
- [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [14] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in neural information processing systems, pages 2802–2810, 2016.
- [15] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
- [16] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [18] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5430–5439, 2017.
- [19] Akash Srivastava, Lazar Valkoz, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3310–3320, 2017.
- [20] Laura Douglas, Iliyan Zarov, Konstantinos Gourgoulias, Chris Lucas, Chris Hart, Adam Baker, Maneesh Sahani, Yura Perov, and Saurabh Johri. A universal marginalizer for amortized inference in generative models. arXiv preprint arXiv:1711.00695, 2017.
- [21] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.
- [22] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 417–424, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
- [23] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016.
- [24] Raymond Yeh, Chen Chen, Teck-Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.
- [25] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4076–4084, July 2017.
- [26] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.

## Appendix

## Appendix A Neural network architectures

The main idea of the neural network architecture is shown in figure 6.

The number of hidden layers, their widths and their structure may vary.

In missing features imputation and in the experiments with synthetic data we use no skip-connections, so all information for the decoder goes through the latent variables. In these experiments we use 3- or 4-layer fully-connected networks.

In image inpainting we found skip-connections very useful, in terms of both log-likelihood improvement and image realism, because the latent variables are responsible only for the global information, while the local information passes through the skip-connections. As a result, the border between the image and the inpainting becomes less conspicuous.

The neural networks we use for image inpainting have He-uniform initialization of the convolutional ResNet blocks, and the skip-connections are implemented using concatenation rather than addition. The proposal network has exactly the same structure as the prior network, except for the skip-connections.

One could also use much simpler fully-connected networks with one hidden layer as the proposal, prior and generative networks in UCM and still obtain good inpaintings on MNIST.

## Appendix B Missing features imputation

Consider a dataset with D-dimensional objects x, where each feature may be missing (which we denote by x_i = ω), together with target values y. Our experiments show that better results are achieved when our model learns the concatenation of the object's features x and target y. The following example shows why this is necessary. Consider a feature that is strongly dependent on the target. Generating its missing values from the marginal distribution p(x) may only confuse the classifier, because with substantial probability it generates a value that is typical for the other value of the target. On the other hand, filling the gaps using p(x | y) may only improve the classifier or regressor by giving it some information from the joint distribution and thus simplifying the dependence to be learned at the training time. So we treat y as an additional feature that is always unobserved at the testing time. To make a fair comparison of feature imputation quality, we consider the workflow where the values of y predicted by UCM are not fed to the classifier or regressor.
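The argument can be checked numerically on a toy binary dataset of our own construction (the paper's concrete example used a similar idea): the feature equals the target, so imputations from p(x | y) are always consistent with the label, while imputations from the marginal p(x) agree with it only half of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
y = rng.integers(0, 2, size=n)       # binary labels
# The (missing) feature is perfectly informative: x = y.
x_joint = y.copy()                   # imputation from p(x | y): always consistent
x_marg = rng.integers(0, 2, size=n)  # imputation from p(x): ignores the target

acc_joint = np.mean(x_joint == y)    # the classifier "predict y = x" is perfect
acc_marg = np.mean(x_marg == y)      # ... but only ~50% right on marginal imputations
```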

## Appendix C Gaussian Stochastic Neural Network

In figure 7 we can see that the inpaintings produced by GSNN are smooth, blurry and not diverse compared with those of UCM.

## Appendix D Image Inpainting Datasets

##### MNIST

is a dataset of 60000 train and 10000 test grayscale images of digits from 0 to 9 of size 28x28. We binarize all images in the dataset. For MNIST we use the Bernoulli log-likelihood as the reconstruction loss, where the Bernoulli probabilities are the outputs of the generative neural network. The observed pixels form a three-pixel-wide horizontal line. We use 16 latent variables.

##### Omniglot

is a dataset of 19280 train and 13180 test black-and-white images of symbols from different alphabets, of size 105x105. As in the previous section, the brightness of each pixel is treated as the Bernoulli probability of it being 1. The mask we use is the random rectangle described above. We use 64 latent variables. We train the model for 50 epochs, which takes about two and a half hours on a GeForce GTX 1080Ti.

##### CelebA

is a dataset of 162770 train, 19867 validation and 19962 test color images of celebrity faces of size 178x218. Before learning we normalize the channels of the dataset. We use the log-density of a fully-factorized Gaussian distribution as the reconstruction loss. The mask we use is the random rectangle described above. We use 32 latent variables. We train the model for 20 epochs, which takes about 30 hours on a GeForce GTX 1080Ti.

## Appendix E Universal Marginalizer

Universal Marginalizer (UM) is a model which uses a single neural network to estimate the marginal distributions over the unobserved features. It optimizes the following objective:

$$\max_{\theta} \; \mathbb{E}_{p_d(x)} \mathbb{E}_{p(b)} \sum_{i:\, b_i = 1} \log p_\theta(x_i \mid x_{1-b}, b) \tag{16}$$

For a given mask b we fix a permutation of its unobserved components: i_1, i_2, …, i_{|b|}, where |b| is the number of unobserved components. Using the learned model and the permutation, we can generate objects from the joint conditional distribution and estimate their probability using the chain rule:

$$\log p_\theta(x_b \mid x_{1-b}, b) = \sum_{j=1}^{|b|} \log p_\theta\big(x_{i_j} \mid x_{1-b}, x_{i_1}, \dots, x_{i_{j-1}}\big) \tag{17}$$

For example, log p(x_1, x_3 | x_2) = log p(x_1 | x_2) + log p(x_3 | x_1, x_2).

Conditional sampling or conditional likelihood estimation for one object requires |b| requests to UM to compute p(x_b | x_{1−b}, b). Each request is a forward pass through the neural network. In the case of conditional sampling, those requests cannot even be parallelized, because the input of the next request contains the outputs of the previous ones.
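The chain-rule sampling loop can be sketched as follows; `marginal_model` is a hypothetical stand-in for the UM network, and the counter makes the cost of |b| sequential forward passes explicit:

```python
import numpy as np

def um_sample(x, b, marginal_model, rng):
    """Sample the unobserved features one by one via the chain rule (17).

    marginal_model(x, b) returns, for every feature, a Bernoulli probability
    conditioned on the currently observed values (a stand-in for the UM
    network). Each unobserved feature costs one forward pass, and the passes
    cannot run in parallel: each call conditions on the previous samples.
    """
    x, b = x.copy(), b.copy()
    n_passes = 0
    for i in np.flatnonzero(b):          # a fixed permutation of unobserved components
        p = marginal_model(x, b)         # one forward pass through the network
        n_passes += 1
        x[i] = rng.random() < p[i]       # sample this component ...
        b[i] = 0                         # ... and treat it as observed from now on
    return x, n_passes
```

In contrast, UCM produces all unobserved components in a single pass through the prior and generative networks.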

A problem the authors did not address in the original paper is the relation between the distribution of unobserved components at the testing stage and the distribution of masks in the requests to UM during training. The distribution over masks p(b) induces a distribution over the requests, and in most cases these two distributions do not coincide. The distribution of requests also depends on the permutations that are used to generate objects.

We observed in experiments that UM must be trained using a mask distribution that covers all the masks occurring in the requests. For example, if all masks from p(b) have a fixed number of unobserved components, then UM will never see an example of a mask with a different number of unobserved components, which results in drastically low likelihood estimates and unrealistic samples, because all those intermediate masks are necessary to generate a sample via the chain rule.

We developed a simple generative process for the training masks for an arbitrary p(b) in the case when the permutation of unobserved components is uniformly random: we first generate a mask from p(b), then uniformly choose how many of its unobserved components are treated as already generated, and move a random subset of that size into the observed part.

A more complicated generative process exists for a sorted permutation of the unobserved components.

The results of using this modification of UM are provided in figure 8 and table 4. We can say that the relation between UCM and UM is similar to the relation between VAE and PixelCNN: the latter is much slower at the testing stage, but it easily takes into account local dependencies in the data, while the former is faster but assumes conditional independence of the outputs.

Method | UCM | UM
---|---|---
Negative log-likelihood | 61 | 41
Training time (30 epochs) | 5min 47s | 3min 14s
Test time (100 samples generation) | 0.7ms | 1s

Nevertheless, there are a number of cases where UM cannot learn the distribution well while UCM can. For example, when the feature space is real-valued and the marginal distributions have many modes, there is no straightforward parametrization which allows UM to approximate them, and therefore also the conditional joint distribution. Figure 9 shows the simplest such example, where the UM marginal distributions are parametrized with Gaussians.

## Appendix F Code

The source code for Universal Marginalizer and Universal Conditional Machine (hybrid model and GSNN included) is available at https://github.com/tigvarts/ucm. The attached Jupyter notebooks reproduce all results from this supplementary material and may be used for further experiments with Universal Conditional Machine and Universal Marginalizer. The code works with Python 3.6, PyTorch 0.4.0, TorchVision 0.2.1, CUDA, NumPy, and Matplotlib.