Flow Models for Arbitrary Conditional Likelihoods
Abstract
Understanding the dependencies among features of a dataset is at the core of most unsupervised learning tasks. However, a majority of generative modeling approaches are focused solely on the joint distribution $p(x)$ and utilize models where it is intractable to obtain the conditional distribution of some arbitrary subset of features $x_u$ given the rest of the observed covariates $x_o$: $p(x_u \mid x_o)$. Traditional conditional approaches provide a model for a fixed set of covariates conditioned on another fixed set of observed covariates. Instead, in this work we develop a model that is capable of yielding all conditional distributions $p(x_u \mid x_o)$ (for arbitrary $x_u$) via tractable conditional likelihoods. We propose a novel extension of (change of variables based) flow generative models, arbitrary conditioning flow models (ACFlow), that can be conditioned on arbitrary subsets of observed covariates, which was previously infeasible. We apply ACFlow to the imputation of features, and also develop a unified platform for both multiple and single imputation by introducing an auxiliary objective that provides a principled single “best guess” for flow models. Extensive empirical evaluations show that our models achieve state-of-the-art performance in both single and multiple imputation across image inpainting and feature imputation in synthetic and real-world datasets. Code is available at https://github.com/lupalab/ACFlow.
1 Introduction
Spurred on by recent impressive results, there has been a surge in interest for generative probabilistic modeling in machine learning. These models learn an approximation of the underlying data distribution and are capable of drawing realistic samples from it. Generative models have a multitude of potential applications, including image restoration [Ledig et al.2017], agent planning [Houthooft et al.2016], and unsupervised representation learning [Chen et al.2016]. However, most generative approaches are solely focused on the joint distribution of features, $p(x)$, and are opaque in the conditional dependencies that are carried among subsets of features. In this work, we propose a framework, arbitrary conditioning flow models (ACFlow), to construct generative models that yield tractable (analytically available) conditional likelihoods of an arbitrary subset of covariates, $x_u$, given the remaining observed covariates $x_o$. We focus on the use of ACFlow for the purpose of imputation, where we infer possible values of $x_u$ given observed values $x_o$, both in general real-valued data and images (for inpainting).
Classic methods for imputation include nearest neighbors [Troyanskaya et al.2001], random forest [Stekhoven and Bühlmann2011] and autoencoder [Gondara and Wang2018] approaches. Often, these methods infer just one single prediction for each missing value, a.k.a. single imputation, and do not provide a conditional distribution. For more complex data, such as high dimensional data or multimodal data, multiple imputation is a more appropriate strategy. This is not only because of the potential to cover multiple modes, but also the ability to quantify the uncertainty of imputations, which may be essential for downstream tasks.
From the perspective of probabilistic modeling, data imputation attempts to learn a distribution of the unobserved covariates, $x_u$, given the observed covariates, $x_o$. Thus, generative modeling is a natural fit for data imputation. It handles single imputation and multiple imputation in a unified framework by allowing the generation of an arbitrary number of samples. More importantly, it quantifies the uncertainty in a principled manner. In this work, we combine flow-based and autoregressive approaches for flexible, tractable generative modeling and extend it for data imputation.
Tractable deep generative models are designed to learn the underlying distribution $p(x)$. There are also works that try to model the conditional distribution conditioned on either class labels [Kingma and Dhariwal2018] or other data points [Li, Gao, and Oliva2019]. For data imputation, however, we are trying to model a subset of the features, $x_u$, conditioned on the rest, $x_o$, i.e. $p(x_u \mid x_o)$. Although the complete data $x$ are from a certain distribution, the conditional distributions vary when conditioned on different $x_o$. Furthermore, the dimensionality of $x_u$ and $x_o$ could be arbitrary, for example, when the data are missing completely at random (MCAR, missing independently of covariates’ values).
Dealing with arbitrary dimensionality is a rarely explored topic in current generative models. One might want to explicitly learn a separate model for each different subset of observed covariates. However, this approach quickly becomes infeasible as it requires an exponential number of models with respect to the dimensionality of the input space. In this work, we propose several conditional transformations that handle arbitrary dimensionality in a principled manner.
Our contributions are as follows. 1) We propose a novel extension of flow-based generative models to model the conditional distribution of arbitrary unobserved covariates in data imputation tasks. Our method is the first to develop invertible transformations that operate on an arbitrary set of covariates. 2) We strengthen a flow-based model by using a novel autoregressive conditional likelihood. 3) We propose a novel penalty to generate a single imputed “best guess” for models without an analytically available mean. 4) We run extensive empirical studies and show that ACFlow achieves state-of-the-art performance for both missing feature imputation and image inpainting on benchmark real-world datasets.
2 Problem Formulation
Consider a real-valued distribution $p(x)$ over $\mathbb{R}^d$ (data with categorical dimensions can be handled by a special case of our model; see Appendix A). We are interested in estimating the conditional distribution of all possible subsets of covariates $u \subseteq \{1, \ldots, d\}$ conditioned on the remaining observed covariates $o = \{1, \ldots, d\} \setminus u$. That is, we shall estimate $p(x_u \mid x_o)$ where $x_u \in \mathbb{R}^{|u|}$ and $x_o \in \mathbb{R}^{|o|}$, for all possible subsets $u$.
For ease of notation, let $b \in \{0, 1\}^d$ be a binary mask indicating which dimensions are observed. Furthermore, let $v[b]$ index the dimensions $i$ for which $b_i = 1$ for the bitmask $b$ on a vector $v$. Thus, $x_o = x[b]$ denotes observed dimensions and $x_u = x[1 - b]$ denotes unobserved dimensions. We also apply this index to matrices such that $W[b, b]$ indexes rows and then columns. Without loss of generality, conditionals may also be conditioned on the bitmask $b$, $p(x_u \mid x_o, b)$, and will be estimated with maximum log-likelihood estimation as described below. In addition, imputation tasks shall be accomplished by generating samples from the conditional distributions $p(x_u \mid x_o, b)$.
3 Background
ACFlow builds on Transformation Autoregressive Networks (TANs) [Oliva et al.2018], a flow-based model that combines transformation of variables with autoregressive likelihoods. We expound on flow-based models and TANs below.
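The change of variable theorem (1) is the cornerstone of flow-based generative models, where $q$ represents an invertible transformation that transforms covariates from input space $\mathcal{X}$ into a latent space $\mathcal{Z}$.
$$p_{\mathcal{X}}(x) = \left|\det \frac{dq}{dx}\right| p_{\mathcal{Z}}(q(x)) \tag{1}$$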
In order to efficiently compute the determinant, the transformation is often designed to have a diagonal or triangular Jacobian [Dinh, Krueger, and Bengio2014, Dinh, SohlDickstein, and Bengio2016, Kingma and Dhariwal2018]. Since this type of transformation is rather restrictive, flow models are often composed of multiple transformations in a sequence to get a more flexible composite transformation, i.e. $q = q_m \circ q_{m-1} \circ \cdots \circ q_1$. Here, the covariates flow through a chain of transformations, with the output of each transformation serving as the input to the next.
Typically, a flow-based model transforms the covariates to a latent space with a simple base distribution, like a standard Gaussian. However, TANs provide more flexibility by modeling the latent distribution with an autoregressive approach [Larochelle and Murray2011]. This alters the earlier equation (1), in that $p_{\mathcal{Z}}(q(x))$ is now represented as the product of conditional distributions.
$$p_{\mathcal{X}}(x) = \left|\det \frac{dq}{dx}\right| \prod_{i=1}^{d} p_{\mathcal{Z}}(z_i \mid z_{i-1}, \ldots, z_1), \qquad z = q(x) \tag{2}$$
Since flow models give the exact likelihood, they can be trained by directly optimizing the log likelihood. In addition, thanks to the invertibility of the transformations, one can draw samples by simply inverting the transformations over a set of samples from the latent space.
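As a concrete illustration of exact likelihoods and sampling by inversion, the following minimal sketch evaluates the change of variable formula for an elementwise affine map with a standard Gaussian base distribution. The function names and the choice of an affine map are assumptions made for illustration; they are not taken from TANs or from the ACFlow code.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian at the scalar z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_forward(x, scale, shift):
    """Map x to z_i = scale_i * x_i + shift_i and return log|det Jacobian|."""
    z = [s * xi + t for xi, s, t in zip(x, scale, shift)]
    # The Jacobian is diagonal, so its log-determinant is a sum of logs.
    log_det = sum(math.log(abs(s)) for s in scale)
    return z, log_det


def affine_inverse(z, scale, shift):
    """Invert the map; this is how samples are pushed back to data space."""
    return [(zi - t) / s for zi, s, t in zip(z, scale, shift)]


def log_likelihood(x, scale, shift):
    """log p_X(x) = log|det dq/dx| + log p_Z(q(x))."""
    z, log_det = affine_forward(x, scale, shift)
    return log_det + sum(standard_normal_logpdf(zi) for zi in z)


# Hypothetical parameters, chosen only for the example.
scale, shift = [2.0, 0.5], [1.0, -1.0]
x = [0.3, -1.2]
z, log_det = affine_forward(x, scale, shift)
x_back = affine_inverse(z, scale, shift)
ll = log_likelihood(x, scale, shift)
```

Training a flow amounts to maximizing `log_likelihood` over the transformation parameters, while sampling draws `z` from the base distribution and applies `affine_inverse`.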
4 Methods
We now develop ACFlow by constructing both conditional transformations of variables and autoregressive likelihoods that work with an arbitrary set of unobserved covariates. To deal with arbitrary dimensionality for conditioning covariates $x_o$, we define a zero imputing function $\phi(v; b)$ that imputes input vector $v$ with zeros based on the specified binary mask $b$:
$$\phi(v; b)_i = \begin{cases} v_{c_i} & \text{if } b_i = 1 \\ 0 & \text{if } b_i = 0 \end{cases}, \qquad c_i = \sum_{j=1}^{i} b_j \tag{3}$$
where $c$ represents the cumulative sum over $b$. For example, $\phi(x_o; b)$ gives a $d$-dimensional vector with missing values imputed by zeros (see Fig. 1(a) for an illustration). Thus, we get a conditioning vector with fixed dimensionality. However, handling the arbitrary dimensionality of $x_u$ requires further care, as discussed below.
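The zero imputing function can be sketched in a few lines. The version below is a plain-Python illustration under the assumption that masks are lists of 0/1 integers; the helper names `index` and `zero_impute` are hypothetical and do not come from the ACFlow repository.

```python
from itertools import accumulate


def index(x, mask):
    """Select the entries of x at positions where mask is 1."""
    return [xi for xi, m in zip(x, mask) if m]


def zero_impute(v, mask, fill=0.0):
    """Scatter the entries of v into the positions where mask is 1.

    The cumulative sum c of the mask tells each selected position which
    entry of v it receives (c is 1-based, hence the `- 1`).
    """
    if len(v) != sum(mask):
        raise ValueError("v must have one entry per 1 in the mask")
    c = list(accumulate(mask))
    return [v[ci - 1] if m else fill for m, ci in zip(mask, c)]


x = [5.0, 6.0, 7.0, 8.0, 9.0]
b = [1, 0, 0, 1, 1]                    # 1 marks an observed dimension
x_o = index(x, b)                      # variable-length observed part
x_o_padded = zero_impute(x_o, b)       # fixed-length conditioning vector
```

Here `x_o` has length 3 while `x_o_padded` always has length 5, which is what allows a single network to consume conditioning inputs for any mask.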
We first consider a conditional extension to the change of variable theorem:
$$p_{\mathcal{X}}(x_u \mid x_o, b) = \left|\det \frac{d q_{x_o, b}}{d x_u}\right| p_{\mathcal{Z}}\big(q_{x_o, b}(x_u) \mid x_o, b\big) \tag{4}$$
where $q_{x_o, b}$ is a transformation on the unobserved covariates $x_u$ with respect to the observed covariates $x_o$ and binary mask $b$, as demonstrated in Fig. 1(a). However, the fact that $x_u$ could have varying dimensionality and arbitrary missing dimensions makes it challenging to define $q_{x_o, b}$ across different bitmasks $b$. One challenge is that the transformation must produce outputs that adapt to the dimensionality of $x_u$. Another challenge is that different missing patterns require the transformation to capture different dependencies. Since the missing pattern could be arbitrary, we require the transformation to learn a large range of possible dependencies. To address these challenges, we propose several conditional transformations that leverage the conditioning information in $x_o$ and $b$ and can be adapted to $x_u$ with arbitrary dimensionality. We describe them in detail below. For notational simplicity, we drop the subscripts $x_o$ and $b$ in the following sections.
Affine Coupling Transformation
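Affine coupling is a commonly used flow transformation. Here, we derive a conditional extension to it. Just as in the unconditional counterparts [Dinh, Krueger, and Bengio2014, Dinh, SohlDickstein, and Bengio2016, Kingma and Dhariwal2018], we divide the unobserved covariates $x_u$ into two parts, $x_u^A$ and $x_u^B$, according to some binary mask $b_u \in \{0, 1\}^{|u|}$, i.e., $x_u^A = x_u[b_u]$ and $x_u^B = x_u[1 - b_u]$. We then keep the first part $x_u^A$ and transform the second part $x_u^B$, i.e.,
$$z_u^A = x_u^A, \qquad z_u^B = x_u^B \odot s(x_u^A, x_o, b) + t(x_u^A, x_o, b) \tag{5}$$
where $\odot$ represents the element-wise (Hadamard) product. $s$ and $t$ compute the scale and shift factors as functions of both the observed covariates and the covariates in group A. Note that both inputs and outputs of $s$ and $t$ contain arbitrary dimensional vectors. To deal with arbitrary dimensionality in inputs, we apply the zero imputing function (3) to $x_o$ and $x_u^A$ respectively to get two $d$-dimensional vectors with missing values imputed by zeros. We also apply $\phi$ to $b_u$ to get a $d$-dimensional mask $\tilde{b}_u = \phi(b_u; 1 - b)$. The shift and scale functions $t$ and $s$ are implemented as deep neural networks (DNN) over the zero-imputed inputs and the masks, i.e.,
$$\tilde{s} = \mathrm{DNN}_s\big([\phi(x_u^A; \tilde{b}_u),\ \phi(x_o; b),\ \tilde{b}_u,\ b]\big), \qquad \tilde{t} = \mathrm{DNN}_t\big([\phi(x_u^A; \tilde{b}_u),\ \phi(x_o; b),\ \tilde{b}_u,\ b]\big) \tag{6}$$
where $[\cdot]$ denotes concatenation and the two DNN functions can share weights.
The outputs $\tilde{s}$ and $\tilde{t}$ need to be adapted to the dimensions of $x_u^B$; thus we apply the indexing mechanism, $[\cdot]$, that takes the corresponding dimensions of nonzero values with respect to binary masks $b$ and $b_u$, i.e.,
$$s = \tilde{s}[1 - b][1 - b_u], \qquad t = \tilde{t}[1 - b][1 - b_u] \tag{7}$$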
A visualization of this transformation is presented in Fig. 1(b).
Note that the way we divide the unobserved covariates might depend on our prior knowledge about the data correlations. For image data, we use checkerboard and channel-wise splits as in [Dinh, SohlDickstein, and Bengio2016]. For real-valued vectors, we use an even-odd split as in [Dinh, Krueger, and Bengio2014].
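The conditional affine coupling described above can be sketched as follows: zero-impute the variable-length inputs, run a conditioner, then index its fixed-length outputs down to the transformed group. This is a toy illustration, not the authors' implementation. `toy_conditioner` is a hypothetical stand-in for the deep networks, and parameterizing the scale as the exponential of a network output is an assumption borrowed from common coupling-layer practice.

```python
import math
from itertools import accumulate


def index(v, mask):
    return [vi for vi, m in zip(v, mask) if m]


def zero_impute(v, mask, fill=0.0):
    c = list(accumulate(mask))
    return [v[ci - 1] if m else fill for m, ci in zip(mask, c)]


def invert(mask):
    return [1 - m for m in mask]


def toy_conditioner(features, d):
    """Hypothetical stand-in for the DNNs: returns d log-scales and d shifts."""
    total = sum(features)
    log_s = [math.tanh(0.1 * total + 0.05 * j) for j in range(d)]
    t = [0.2 * total - 0.1 * j for j in range(d)]
    return log_s, t


def scale_shift(x_a, x_o, b, b_u):
    """Compute log-scale and shift for group B from group A and x_o."""
    d = len(b)
    mask_a = zero_impute(b_u, invert(b), fill=0)   # d-dim mask of group A
    features = zero_impute(x_a, mask_a) + zero_impute(x_o, b) + mask_a + list(b)
    log_s_full, t_full = toy_conditioner(features, d)

    def pick(v):
        # Keep unobserved dimensions first, then the group-B ones among them.
        return index(index(v, invert(b)), invert(b_u))

    return pick(log_s_full), pick(t_full)


def coupling_forward(x_u, x_o, b, b_u):
    x_a, x_b = index(x_u, b_u), index(x_u, invert(b_u))
    log_s, t = scale_shift(x_a, x_o, b, b_u)
    z_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    return x_a, z_b, sum(log_s)       # group A passes through unchanged


def coupling_inverse(z_a, z_b, x_o, b, b_u):
    log_s, t = scale_shift(z_a, x_o, b, b_u)
    return z_a, [(zb - ti) * math.exp(-ls) for zb, ls, ti in zip(z_b, log_s, t)]


x = [0.5, -1.0, 2.0, 0.1, 1.5, -0.7]
b = [1, 0, 1, 0, 0, 0]                # two observed, four unobserved
b_u = [1, 0, 1, 0]                    # even-odd split of the unobserved part
x_o, x_u = index(x, b), index(x, invert(b))
z_a, z_b, log_det = coupling_forward(x_u, x_o, b, b_u)
x_a_rec, x_b_rec = coupling_inverse(z_a, z_b, x_o, b, b_u)
```

Because group A is left unchanged, the inverse can recompute exactly the same scale and shift, so invertibility holds for any mask and any conditioner.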
Linear Transformation
Another common transformation for flow-based models is the linear transformation. In contrast to the coupling transformation, which only leverages correlation between two separated subsets, the linear transformation can take advantage of correlation between all dimensions. Furthermore, the linear transformation can be viewed as a generalized permutation that rearranges dimensions so that the next transformation can be more effective.
In order to transform $x_u$ linearly, we would like to have an adaptive weight matrix $W$ of size $|u| \times |u|$ and a bias vector $t$ of size $|u|$. Similar to the affine coupling described above, we first apply a deep neural network over $\phi(x_o; b)$ and the binary mask $b$ to get a $d \times d$ matrix $\tilde{W}$ and a $d$-dimensional bias vector $\tilde{t}$, then we index them with respect to the binary mask $b$, i.e.,
$$W = \tilde{W}[1 - b, 1 - b], \qquad t = \tilde{t}[1 - b] \tag{8}$$
The linear transformation can then be derived as $z = W x_u + t$. Fig. 1(c) illustrates this transformation over an 8-dimensional example.
In order to decrease complexity, it is straightforward to parametrize the $d \times d$ weight matrix with rank-$r$ matrices by taking the product of two rank-$r$ matrices with size $d \times r$ and $r \times d$ respectively. Hence, the DNN can reduce the number of weight outputs from $d^2$ to $2dr$. During preliminary experiments, we observed minimal drop in performance when using a large enough $r$.
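The indexing step of the conditional linear transformation can be sketched as below: a full-size weight matrix and bias are cut down to the unobserved dimensions, and the log-determinant of the resulting square block enters the likelihood. In this sketch the full-size matrix is a fixed constant standing in for a network output, and the Gaussian-elimination routine is just one simple way to obtain the log-determinant; both are assumptions made for illustration.

```python
import math


def submatrix(w, row_mask, col_mask):
    """Index rows and then columns of w with binary masks."""
    return [[wij for wij, c in zip(row, col_mask) if c]
            for row, r in zip(w, row_mask) if r]


def log_abs_det(a):
    """log|det a| via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n, total = len(a), 0.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular, transformation not invertible")
        a[k], a[p] = a[p], a[k]
        total += math.log(abs(a[k][k]))
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return total


def linear_forward(x_u, w_full, t_full, b):
    """z = W x_u + t with W and t indexed to the unobserved dimensions."""
    u = [1 - m for m in b]
    w = submatrix(w_full, u, u)
    t = [ti for ti, m in zip(t_full, u) if m]
    z = [sum(wij * xj for wij, xj in zip(row, x_u)) + ti
         for row, ti in zip(w, t)]
    return z, log_abs_det(w)


# Hypothetical full-size parameters (in the model these come from a network).
w_full = [[2.0, 1.0, 0.0, 0.0],
          [0.0, 3.0, 1.0, 0.0],
          [0.0, 0.0, 4.0, 1.0],
          [1.0, 0.0, 0.0, 5.0]]
t_full = [0.1, 0.2, 0.3, 0.4]
b = [1, 0, 0, 1]                      # dimensions 1 and 2 are unobserved
z, log_det = linear_forward([1.0, 2.0], w_full, t_full, b)
```

A different mask selects a different square block of the same full-size matrix, so one network output serves every missing pattern.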
RNN Coupling Transformation
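The affine coupling transformation can be viewed as a rather rough recurrent transformation with only one recurrent step. We can generalize it by running an RNN over $x_u$ and transforming each dimension sequentially (shown in Fig. 1(d)). Note that a recurrent transformation naturally handles different dimensionality. To leverage conditioning inputs $x_o$ and $b$, we concatenate $\phi(x_o; b)$ and $b$ to each dimension of $x_u$. The outputs of the RNN are used to derive the shift and scale parameters respectively.
$$o_i, h_i = \mathrm{RNN}\big([x_{u,i-1},\ \phi(x_o; b),\ b],\ h_{i-1}\big), \qquad z_i = x_{u,i} \odot s(o_i) + t(o_i) \tag{9}$$
where $o_i$ and $h_i$ are the RNN output and hidden state, $s(\cdot)$ and $t(\cdot)$ map the output to the scale and shift, and $i$ indexes through dimensions.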
Other Transformations and Compositions
Other transformations like the element-wise leaky-ReLU transformation [Oliva et al.2018] and the reverse transformation [Dinh, Krueger, and Bengio2014] are readily applicable to the unobserved covariates since they do not rely on the conditioning covariates. Beyond the specific transformations described above, any transformation that follows the general formulation shown in Fig. 1(a) can be easily plugged into our model. We obtain flexible, highly nonlinear transformations by composing several of these transformations. (That is, we use the output of the preceding transformation as input to the next transformation.) The Jacobian of the resulting composed (stacked) transformation is accounted for by the chain rule.
Likelihoods
The conditional likelihoods in latent space can be computed by either a base distribution, like a Gaussian, or an autoregressive model as in TANs. For Gaussian based likelihoods, we can get mean and covariance parameters by applying another function over $x_o$ and $b$, which is essentially equivalent to another linear transformation conditioned on $x_o$ and $b$. However, this approach is generally less flexible than using an autoregressive approach.
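For autoregressive likelihoods, conditioning vectors can be used in the same way as in the RNN coupling transformation. The difference is that the RNN outputs are now used to derive the parameters for some base distribution, for example, a Gaussian Mixture Model:
$$o_i, h_i = \mathrm{RNN}\big([z_{i-1},\ \phi(x_o; b),\ b],\ h_{i-1}\big), \qquad p_{\mathcal{Z}}(z_i \mid z_{i-1}, \ldots, z_1, x_o, b) = \mathrm{GMM}\big(z_i \mid f(o_i)\big) \tag{10}$$
where $f$ is a shared fully connected network that maps $o_i$ to the parameters for the mixture model (i.e. each mixture component’s location, scale, and prior weight parameter). During sampling, we iteratively sample each point, $z_i$, before computing the parameters for $z_{i+1}$. Incorporating the autoregressive likelihood into Eq. (4) yields:
$$p_{\mathcal{X}}(x_u \mid x_o, b) = \left|\det \frac{d q_{x_o, b}}{d x_u}\right| \prod_{i=1}^{n_u} p_{\mathcal{Z}}(z_i \mid z_{i-1}, \ldots, z_1, x_o, b), \qquad z = q_{x_o, b}(x_u) \tag{11}$$
where $n_u$ is the cardinality of the unobserved covariates.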
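To make the autoregressive likelihood concrete, the sketch below accumulates per-dimension Gaussian mixture log-densities whose parameters depend on the previously seen latent values and a fixed-length conditioning vector. The recurrence `toy_cell` and the parameter map `toy_params` are hypothetical stand-ins for the RNN and the shared fully connected network; only the mixture density and the sum over dimensions reflect the text.

```python
import math


def gmm_logpdf(z, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture, computed with log-sum-exp."""
    terms = [math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
             - 0.5 * ((z - m) / s) ** 2
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def toy_cell(z_prev, context, h_prev):
    """Hypothetical one-unit recurrent update standing in for the RNN."""
    return math.tanh(0.5 * h_prev + 0.3 * z_prev + 0.1 * sum(context))


def toy_params(h):
    """Hypothetical map from the state to two-component mixture parameters."""
    logits = [h, -h]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    weights = [e / sum(exps) for e in exps]
    means = [h - 1.0, h + 1.0]
    stds = [math.exp(0.1 * h), math.exp(-0.1 * h)]
    return weights, means, stds


def autoregressive_log_likelihood(z, context):
    """Sum of log p(z_i | z_{i-1}, ..., z_1, context) over all dimensions."""
    h, z_prev, total = 0.0, 0.0, 0.0
    for zi in z:
        h = toy_cell(z_prev, context, h)
        total += gmm_logpdf(zi, *toy_params(h))
        z_prev = zi
    return total


# Context: zero-imputed observed values followed by the mask itself.
context = [0.5, 0.0, 2.0, 0.0] + [1, 0, 1, 0]
log_p = autoregressive_log_likelihood([0.2, -0.4], context)
```

The loop runs once per unobserved dimension, so the same code handles latent vectors of any length. Sampling would follow the same loop, drawing each value from its mixture before advancing the state.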
4.1 Training and Best Guess Objective
Since we follow a flow-based approach, we have access to the normalized conditional likelihoods; thus, we can train our model by maximizing log likelihood.
During training, if we have access to complete training data, we will need to manually create binary masks $b$ based on some predefined distribution $p(b)$. $p(b)$ is typically chosen based on the application. For instance, Bernoulli random masks are commonly used for real-valued vectors. Given binary masks, training data are divided into $x_u$ and $x_o$ and fed into the conditional model $p(x_u \mid x_o, b)$.
If training data already contains missing values, we can only train our model on the remaining covariates. As before, we manually split each data point into two parts, $x_u$ and $x_o$, based on a binary mask $b$. Note that dimensions in $b$ corresponding to the missing values are always set to $0$, i.e., they are never observed during training. In this setting, we will need another binary mask $m$ indicating those dimensions that are not missing. Accordingly, we define observed dimensions as $x_o = x[b \odot m]$ and unobserved dimensions as $x_u = x[(1 - b) \odot m]$. In this setting, we maximize the conditional log-likelihood of $x_u$ given $x_o$ and the two masks. During testing, we set $b = m$, that is, the model imputes the missing values conditioned on all the remaining covariates.
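The mask handling for training data with missing values can be sketched as follows. A Bernoulli mask is drawn, forced to zero wherever the value is truly missing, and the available entries are split into a conditioning part and a target part. The function names, the use of `None` for missing entries, and the parameter `p_observed` are assumptions made for this illustration.

```python
import random


def sample_mask(d, p_observed, rng):
    """Draw a Bernoulli mask; 1 marks a dimension treated as observed."""
    return [1 if rng.random() < p_observed else 0 for _ in range(d)]


def split_training_point(x, m, p_observed, rng):
    """Split x into observed and to-be-predicted parts.

    m[i] == 1 where x[i] is actually available. Truly missing entries are
    never marked observed and never used as training targets.
    """
    raw = sample_mask(len(x), p_observed, rng)
    b = [bi * mi for bi, mi in zip(raw, m)]
    x_o = [xi for xi, bi in zip(x, b) if bi]
    x_u = [xi for xi, bi, mi in zip(x, b, m) if mi and not bi]
    return b, x_o, x_u


rng = random.Random(0)
x = [0.2, None, 1.3, -0.4, None, 2.2]     # None marks a missing value
m = [1, 0, 1, 1, 0, 1]
b, x_o, x_u = split_training_point(x, m, p_observed=0.5, rng=rng)

# At test time every available value is used for conditioning.
b_test = list(m)
```

With complete data, `m` is all ones and the split reduces to plain Bernoulli masking.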
Best Guess Objective
Given a trained ACFlow model, multiple imputations can be easily accomplished by drawing multiple samples from the learned conditional distribution. However, certain downstream tasks may require a single imputed “best guess”. Unfortunately, the analytical mean is not available for flow-based deep generative models. Furthermore, getting an accurate empirical estimate could be prohibitive in high dimensional space. In this work, we propose a robust solution that gives a single best guess in terms of the MSE metric (it can be easily extended to other metrics, e.g. an adversarial one).
Specifically, we obtain our best guess by inverting the conditional transformation over the mean of the latent distribution, i.e., $\hat{x}_u = q_{x_o, b}^{-1}(\bar{z})$, where $\bar{z}$ denotes that mean. The mean is analytically available for a Gaussian mixture base model. To ensure that this best guess is close to the unobserved values, we optimize with an auxiliary MSE loss:
$$\mathcal{L} = -\log p_{\mathcal{X}}(x_u \mid x_o, b) + \lambda \left\| x_u - \hat{x}_u \right\|_2^2 \tag{12}$$
where $\lambda$ controls the relative importance of the auxiliary objective. Note that we only penalize one particular point from $p(x_u \mid x_o, b)$ to be close to $x_u$. Hence, it does not affect the diversity of the conditional distribution.
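The best guess objective can be illustrated with a toy flow whose inverse is known in closed form. In the sketch below the flow is an elementwise affine map and the latent base is a Gaussian mixture per dimension; both choices, the helper names, and the use of a summed squared error for the penalty are assumptions made for the example.

```python
def mixture_mean(weights, means):
    """Analytic mean of a Gaussian mixture: the weighted component means."""
    return sum(w * m for w, m in zip(weights, means))


def affine_inverse(z, scale, shift):
    """Inverse of the toy flow z_i = scale_i * x_i + shift_i."""
    return [(zi - t) / s for zi, s, t in zip(z, scale, shift)]


def best_guess_loss(log_likelihood, x_u, x_guess, lam):
    """Negative log-likelihood plus lam times the squared error of the guess."""
    sq_err = sum((a - g) ** 2 for a, g in zip(x_u, x_guess))
    return -log_likelihood + lam * sq_err


# Hypothetical latent mixture (shared by both dimensions) and flow parameters.
weights, means = [0.25, 0.75], [-2.0, 2.0]
z_mean = [mixture_mean(weights, means)] * 2
x_guess = affine_inverse(z_mean, scale=[2.0, 0.5], shift=[1.0, -1.0])

x_u = [0.5, 3.0]                       # ground-truth unobserved values
loss = best_guess_loss(log_likelihood=-3.0, x_u=x_u, x_guess=x_guess, lam=2.0)
```

Only the single point obtained from the latent mean is pulled towards the target, so samples drawn elsewhere in latent space are not directly penalized.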
5 Related Work
5.1 Conditional Generative Models
Unlike our arbitrary conditioning task, conditional generative modeling deals only with a fixed vector of external information (such as a class label) to condition the likelihood of a fixed set of covariates (such as pixels in an image). Typically, approaches supplement the inputs of common generative models with some encoding of the conditional information. Conditional GANs [Mirza and Osindero2014], for example, extended GANs by inputting a class label encoding into both the generator and discriminator function, allowing these models to learn multiple conditional distributions. Similarly, a conditional form of VAE was proposed by [Sohn, Yan, and Lee2015], which introduces an additional network that learns the conditional prior $p(z \mid y)$, where $z$ is some latent encoding and $y$ is a class label.
5.2 Arbitrary Conditioning Models
Previous attempts to learn probability distributions conditioned on arbitrary subsets of known covariates include the Universal Marginalizer [Douglas et al.2017], which is trained as a feedforward network to approximate the marginal posterior distribution of each unobserved dimension conditioned on the observed ones. During sampling, they propose to use a sequential sampling mechanism by adding a newly sampled dimension to the observed sets and running the network in an autoregressive manner. GAIN approaches data imputation with a GAN approach [Yoon, Jordon, and Schaar2016]. The model uses a generator that provides imputations for missing data points while the discriminator attempts to distinguish imputed covariates from observed covariates. VAEAC [Ivanov, Figurnov, and Vetrov2019] utilizes a conditional variational autoencoder and extends it to deal with arbitrary conditioning. The decoder network outputs likelihoods that are over all possible dimensions, although, since they are conditionally independent given the code, it is possible to use only the likelihoods corresponding to the unobserved dimensions. However, we observe that VAEAC suffers from failure modes common in VAEs given that they optimize with respect to ELBO loss [Zhao and Ermon2019]. In particular, VAEs sometimes sacrifice learning the true posterior for minimizing reconstruction loss, which often leads the model to generate blurry samples for multimodal distributions.
Unlike VAEAC and GAIN, ACFlow is capable of producing an analytical (normalized) likelihood and avoids blurry samples and other failure modes in VAE approaches. Furthermore, in contrast to the Universal Marginalizer, ACFlow models dependencies between unobserved covariates at training time and makes use of change of variables.
6 Experiments
In this section, we evaluate our proposed model against VAEAC [Ivanov, Figurnov, and Vetrov2019], GAIN [Yoon, Jordon, and Schaar2016] and other common imputation methods such as MICE [Buuren and GroothuisOudshoorn2010] and MissForest [Stekhoven and Bühlmann2012]. We compare these models on both missing feature imputation tasks and image inpainting tasks below.
6.1 Synthetic Data Imputation
To validate the effectiveness of our model, we conduct experiments on synthetic 2-dimensional datasets [Behrmann, Duvenaud, and Jacobsen2018], i.e., $x = (x_1, x_2) \in \mathbb{R}^2$. The joint distributions are plotted in Fig. 2. Here, the conditional distributions are highly multimodal and vary significantly conditioned on different observations. We compare our model against VAEAC by training them on 100,000 samples from the joint distribution. The masks are generated by dropping out one dimension at random. The details of the training procedure are provided in Appendix B. Fig. 2 shows imputation results from our model and VAEAC. We show imputations from both $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$ by generating 10 samples conditioned on each observed covariate. Observed covariates are sampled from the marginal distributions $p(x_2)$ and $p(x_1)$ respectively. We see that ACFlow is capable of learning multimodal distributions, while VAEAC tends to merge multiple modes into one single mode.
6.2 Image Inpainting
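Table 1: Image inpainting results (mean ± standard deviation).
Model | Metric | MNIST | Omniglot | CelebA
VAEAC | PSNR | 19.613 ± 0.042 | 17.693 ± 0.023 | 23.656 ± 0.027
VAEAC | NLL | 3835.977 ± 6.048 | 4015.213 ± 2.530 | 56300.842 ± 15.214
VAEAC | PRD | (0.975, 0.877) | (0.926, 0.525) | (0.966, 0.967)
ACFlow | PSNR | 17.349 ± 0.018 | 15.572 ± 0.031 | 22.393 ± 0.040
ACFlow | NLL | 1919.555 ± 2.500 | 1834.552 ± 0.695 | 39376.466 ± 47.670
ACFlow | PRD | (0.984, 0.945) | (0.971, 0.962) | (0.988, 0.970)
ACFlow BG | PSNR | 20.828 ± 0.031 | 18.838 ± 0.009 | 25.723 ± 0.020
ACFlow BG | NLL | 1916.887 ± 1.190 | 1833.068 ± 0.760 | 39504.883 ± 20.475
ACFlow BG | PRD | (0.983, 0.947) | (0.970, 0.967) | (0.987, 0.964)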
In this section, we evaluate our method on image inpainting tasks using three common datasets, MNIST, Omniglot and CelebA, with a varied distribution of binary masks. Details about mask generation and image preprocessing are available in Appendix C.1.
We extend the RealNVP model [Dinh, SohlDickstein, and Bengio2016] by replacing all the coupling layers with our proposed arbitrary conditional alternative. For the sake of sampling efficiency, we use standard Gaussian base likelihoods here. Implementation details of ACFlow and baselines are provided in Appendix C.2.
Fig. 3 shows samples drawn from ACFlow and VAEAC. More samples are available in Appendix C.3. We notice our model can generate coherent and diverse inpaintings for all three datasets and different masks. Compared to VAEAC, our model is capable of generating sharp samples and restoring more details. Even when the missing rate is high, ACFlow can still generate decent inpaintings.


To quantitatively evaluate our model, we report peak signal-to-noise ratio (PSNR) and negative log-likelihoods (NLL) in Table 1. We generate 5 different masks for each test image and report the average scores and the standard deviation. We note that PSNR is a metric that prefers blurry images over sample diversity [Hore and Ziou2010]. Log-likelihood measures how well the model matches the real conditional distribution, but may not correlate with visual quality [Theis, Oord, and Matthias2016]. Hence, we evaluate the trade-off between sample quality and diversity via the precision and recall scores (PRD) [Sajjadi et al.2018]. Since we cannot sample from the ground-truth conditional distribution, we compute the PRD score between the imputed joint distribution $\hat{p}(x)$ and the true joint distribution $p(x)$ via 10,000 samples from them. The PRD scores for two distributions measure how much of one distribution can be generated by another. Higher recall means a greater portion of samples from the true distribution $p(x)$ are covered by $\hat{p}(x)$; and similarly, higher precision means a greater portion of samples from $\hat{p}(x)$ are covered by $p(x)$. We report these scores as pairs in Table 1 to represent recall and precision, respectively.
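For reference, PSNR follows from the mean squared error between a reference image and an estimate. The sketch below uses the standard definition on flattened pixel lists. The peak value of 1.0 and the choice to average over all pixels are assumptions, since the exact evaluation convention is given in the appendix rather than here.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf                # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.25]
estimate = [0.1, 0.6, 1.1, 0.35]       # every pixel is off by 0.1
score = psnr(reference, estimate)
```

Because PSNR depends only on squared error, an imputation near the conditional mean scores well even when it is blurrier than any single plausible sample, which is why the text pairs it with NLL and PRD.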
From quantitative results, we can see that our model gives lower NLL and higher PRD, which means our model does better in learning the true distribution. Training with the auxiliary “best guess” penalty (denoted as “ACFlow BG”) can improve the PSNR scores significantly, but hardly impacts the likelihood and PRD scores. Thus, our proposed penalty provides a “free lunch” for the image inpainting task.
6.3 Realvalued Feature Imputation
Next, we evaluate our model on missing feature imputation. We use UCI repository datasets preprocessed as described in [Papamakarios, Pavlakou, and Murray2017]. We construct models by composing several leaky-ReLU, conditional linear, and RNN coupling transformations, along with an autoregressive conditional likelihood (see Appendix D.1 for further architectural details).
First, we consider fully observed training data, where missing covariates are to be imputed at test time. We construct masks, $b$, by dropping a random subset of the dimensions according to a Bernoulli distribution with a fixed drop probability. Afterwards, we also evaluate our model when the data itself contains missing values that are never available during training. We consider training and testing with data features missing completely at random at a 10% and 50% rate.
Baseline models for this experiment include MICE, MissForest, GAIN and VAEAC. We also compare to a denoising autoencoder in which missing covariates are treated as dropout noise. Implementation details about ACFlow and baselines are provided in Appendix D.1.
In Fig. 4, we show the NRMSE (i.e. root mean squared error normalized by the standard deviation of each feature and then averaged across all features) and the test NLL (if available) for ACFlow and baselines. For models that can perform multiple imputation, 10 imputations are drawn and averaged for each test point to compute the NRMSE scores. For our model trained with the auxiliary objective, we use the single “best guess” to compute the NRMSE. In order not to bias towards any specific missing pattern, we report the mean and standard deviations over 5 randomly generated binary masks (standard deviations are reported in Appendix Table D.1).
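The NRMSE described above can be sketched as follows: a per-feature RMSE over the imputed entries is divided by that feature's standard deviation and the results are averaged over features. Restricting the error to missing entries and skipping features with no missing entries are assumptions of this sketch, as is the helper for averaging multiple imputations.

```python
import math


def average_imputations(samples):
    """Elementwise mean of several imputed vectors for one test point."""
    return [sum(column) / len(samples) for column in zip(*samples)]


def nrmse(truth, imputed, missing, feature_std):
    """Per-feature RMSE over missing entries, scaled by std, then averaged."""
    scores = []
    for j, std in enumerate(feature_std):
        errors = [(t[j] - p[j]) ** 2
                  for t, p, miss in zip(truth, imputed, missing) if miss[j]]
        if errors:                     # skip features with nothing imputed
            scores.append(math.sqrt(sum(errors) / len(errors)) / std)
    return sum(scores) / len(scores)


truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
imputed = [[1.5, 10.0], [2.0, 24.0], [3.0, 30.0]]
missing = [[1, 0], [0, 1], [0, 0]]     # 1 marks an entry that was imputed
score = nrmse(truth, imputed, missing, feature_std=[0.5, 2.0])
mean_imputation = average_imputations([[1.0, 2.0], [3.0, 4.0]])
```

Normalizing by the standard deviation puts features with different scales on an equal footing before averaging.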
One can see that our model gives the best NLL on nearly all scenarios, which indicates our model is better at learning the true distribution. In terms of the NRMSE, ACFlow is comparable to the previous state-of-the-art when trained purely by maximizing the likelihood. However, training with the auxiliary objective (ACFlow BG) improves the NRMSE scores significantly and gives state-of-the-art results on all datasets considered. As expected, training with missing values is more difficult; however, our model performs best even when the missing rate is relatively high.
7 Conclusion
In this work, we demonstrated that we can improve samples generated with arbitrary conditioning on NRMSE, PSNR, log-likelihood, visual quality, and PRD by leveraging conditional flow models and conditional autoregressive likelihoods. We also considered the task of training both single and multiple imputations in a unified platform to provide a “best guess” single imputation when the mean is not analytically available. The samples generated from our model show that we improve in both diversity and quality of imputations in many datasets. Our model typically recovers more details than the previous state-of-the-art methods.
References
 [Behrmann, Duvenaud, and Jacobsen2018] Behrmann, J.; Duvenaud, D.; and Jacobsen, J.-H. 2018. Invertible residual networks. arXiv preprint arXiv:1811.00995.
 [Buuren and GroothuisOudshoorn2010] Buuren, S. v., and Groothuis-Oudshoorn, K. 2010. Mice: Multivariate imputation by chained equations in r. In Journal of statistical software, 1–68.
 [Chen et al.2016] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, 2172–2180.
 [Dinh, Krueger, and Bengio2014] Dinh, L.; Krueger, D.; and Bengio, Y. 2014. Nice: Nonlinear independent components estimation. arXiv preprint arXiv:1410.8516.
 [Dinh, SohlDickstein, and Bengio2016] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
 [Douglas et al.2017] Douglas, L.; Zarov, I.; Gourgoulias, K.; Lucas, C.; Hart, C.; Baker, A.; Sahani, M.; Perov, Y.; and Johri, S. 2017. A universal marginalizer for amortized inference in generative models. arXiv preprint arXiv:1711.00695.
 [Gondara and Wang2018] Gondara, L., and Wang, K. 2018. Mida: Multiple imputation using denoising autoencoders. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 260–272. Springer.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
 [Hore and Ziou2010] Hore, A., and Ziou, D. 2010. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, 2366–2369.
 [Houthooft et al.2016] Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; and Abbeel, P. 2016. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 1109–1117.
 [Ivanov, Figurnov, and Vetrov2019] Ivanov, O.; Figurnov, M.; and Vetrov, D. 2019. Variational autoencoder with arbitrary conditioning. In International Conference on Learning Representations.
 [Kingma and Dhariwal2018] Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 10215–10224.
 [Larochelle and Murray2011] Larochelle, H., and Murray, I. 2011. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 29–37.
 [Ledig et al.2017] Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4681–4690.
 [Li et al.2017] Li, Y.; Liu, S.; Yang, J.; and Yang, M.-H. 2017. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3911–3919.
 [Li, Gao, and Oliva2019] Li, Y.; Gao, T.; and Oliva, J. 2019. A forest from the trees: Generation through neighborhoods. arXiv preprint arXiv:1902.01435.
 [Mirza and Osindero2014] Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
 [Oliva et al.2018] Oliva, J. B.; Dubey, A.; Zaheer, M.; Póczos, B.; Salakhutdinov, R.; Xing, E. P.; and Schneider, J. 2018. Transformation autoregressive networks. arXiv preprint arXiv:1801.09819.
 [Papamakarios, Pavlakou, and Murray2017] Papamakarios, G.; Pavlakou, T.; and Murray, I. 2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2338–2347.
 [Pathak et al.2016] Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2536–2544.
 [Sajjadi et al.2018] Sajjadi, M. S.; Bachem, O.; Lucic, M.; Bousquet, O.; and Gelly, S. 2018. Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, 5228–5237.
 [Sohn, Yan, and Lee2015] Sohn, K.; Yan, X.; and Lee, H. 2015. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, 3483–3491.
 [Stekhoven and Bühlmann2011] Stekhoven, D. J., and Bühlmann, P. 2011. Missforest—nonparametric missing value imputation for mixedtype data. Bioinformatics 28(1):112–118.
 [Stekhoven and Bühlmann2012] Stekhoven, D. J., and Bühlmann, P. 2012. Missforest — nonparametric missing value imputation for mixedtype data. In Bioinformatics, Volume 28, Issue 1, 112–118.
 [Theis, Oord, and Matthias2016] Theis, L.; van den Oord, A.; and Bethge, M. 2016. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844.
 [Troyanskaya et al.2001] Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; and Altman, R. B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525.
 [Yoon, Jordon, and Schaar2016] Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GAIN: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920.
 [Zhao and Ermon2019] Zhao, S.; Song, J.; and Ermon, S. 2019. InfoVAE: Information maximizing variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, 5885–5892.
Appendix A ACFlow for Categorical Data
In the main text, we focus mainly on real-valued data, but our model also applies to data that contain categorical dimensions. Consider a $d$-dimensional vector $x$ with both real-valued and categorical features. The conditional distribution $p(x_u \mid x_o)$ can be factorized as $p(x_u^{c} \mid x_o)\, p(x_u^{r} \mid x_u^{c}, x_o)$, where $x_u^{r}$ and $x_u^{c}$ represent the real-valued and categorical components of $x_u$ respectively. Since the conditioning inputs in Eq. (4) can be either real-valued or categorical, $p(x_u^{r} \mid x_u^{c}, x_o)$ can be directly modeled by ACFlow. However, ACFlow is not directly applicable to $p(x_u^{c} \mid x_o)$, because the change of variables theorem, Eq. (1), cannot be applied to categorical covariates. Instead, we can use an autoregressive model, i.e., ACFlow without any transformations, to obtain likelihoods for the categorical components. Hence, our proposed method can be applied to both real-valued and categorical data.
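The factorization described in this appendix can be written out explicitly. The display below is a sketch under assumed notation: $x_u$ and $x_o$ denote the unobserved and observed covariates, the superscripts $r$ and $c$ mark real-valued and categorical components, and the ordering of the categorical dimensions is arbitrary. The second line is simply the chain rule, which is what an autoregressive model without transformations evaluates.

```latex
\begin{align}
p(x_u \mid x_o) &= p(x_u^{c} \mid x_o)\, p(x_u^{r} \mid x_u^{c}, x_o), \\
p(x_u^{c} \mid x_o) &= \prod_{i=1}^{|x_u^{c}|} p\big(x_{u,i}^{c} \mid x_{u,<i}^{c},\, x_o\big).
\end{align}
```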
Appendix B Synthetic Data Imputation
We closely follow the released code for VAEAC (https://github.com/tigvarts/vaeac/blob/master/imputation_networks.py) to construct the proposal, prior, and generative networks. Specifically, we use fully connected layers and skip connections as in their official implementation. We also use shortcut connections between the prior network and the generative network. We search over different combinations of the number of layers, the number of hidden units of the fully connected layers, and the dimension of the latent code. Validation likelihoods are used to select the best model.
ACFlow is constructed by stacking multiple conditional transformations and an autoregressive likelihood. One conditional transformation layer (shown in Fig. B.1) contains one linear transformation, followed by one leaky ReLU transformation, and then one RNN coupling transformation. The DNN in the linear transformation is a 2-layer fully connected network with 256 units. RNN coupling is implemented as a 2-layer GRU with 256 hidden units. We stack four conditional transformation layers and use reverse transformations in between. The autoregressive likelihood model is a 2-layer GRU with 256 hidden units. The base distribution is a Gaussian mixture model with 40 components.
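To illustrate how an elementwise nonlinearity such as leaky ReLU can serve as a flow transformation, the following minimal sketch computes the forward map, its log-determinant, and its inverse. It is an unconditioned toy version, not the ACFlow implementation; the function names and the slope `alpha` are assumptions made for illustration.

```python
import numpy as np

def leaky_relu_forward(x, alpha=0.5):
    """Apply leaky ReLU and return (y, log|det dy/dx|) per sample.

    alpha is an assumed negative-side slope in (0, 1], not a value
    taken from the paper.
    """
    negative = x < 0
    y = np.where(negative, alpha * x, x)
    # The Jacobian is diagonal: 1 on non-negative entries, alpha otherwise.
    log_det = np.where(negative, np.log(alpha), 0.0).sum(axis=-1)
    return y, log_det

def leaky_relu_inverse(y, alpha=0.5):
    """Invert the transformation (the sign of y equals the sign of x)."""
    return np.where(y < 0, y / alpha, y)

x = np.array([[1.0, -2.0, 3.0], [-1.0, -1.0, 0.5]])
y, log_det = leaky_relu_forward(x)
x_recovered = leaky_relu_inverse(y)
```

Because the Jacobian is diagonal, the log-determinant reduces to a sum over the negative entries, which keeps the likelihood computation cheap.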
Appendix C Image Inpainting
C.1 Datasets and Masks
We use MNIST, Omniglot, and CelebA to evaluate our model on image inpainting tasks. MNIST contains grayscale images of size $28 \times 28$. Omniglot data is also resized to $28 \times 28$ and augmented with rotations. For the CelebA dataset, we take a central crop and then resize it. Since a flow model requires continuous inputs, we dequantize the images by adding independent uniform noise to them.
In order to obtain inpaintings for different masks, we train ACFlow and VAEAC using a mixture of different masks. For MNIST and Omniglot, we use MCAR masks, rectangular masks, and half masks. For CelebA, in addition to these three masks, we also use a random pattern mask proposed in [Pathak et al.2016] and the masks used in [Li et al.2017]. Each mask type is described below.
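The dequantization step can be sketched as follows. This is a minimal illustration assuming 8-bit pixel values and rescaling to the unit interval; the function name, the number of levels, and the toy batch shape are assumptions rather than details from the paper.

```python
import numpy as np

def dequantize(images_uint8, rng, num_levels=256):
    """Map integer pixel values to continuous values in [0, 1).

    Uniform noise in [0, 1) is added to every pixel before rescaling,
    so integer level k is spread over [k, k + 1) / num_levels.
    """
    x = images_uint8.astype(np.float64)
    noise = rng.uniform(0.0, 1.0, size=x.shape)
    return (x + noise) / num_levels

rng = np.random.default_rng(0)
# Toy batch of 4 grayscale images (shape chosen only for illustration).
batch = rng.integers(0, 256, size=(4, 8, 8), dtype=np.uint8)
continuous = dequantize(batch, rng)
```

Flooring `continuous * 256` recovers the original integer pixels, so no information is lost by the added noise.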
MCAR mask
An MCAR mask is drawn from a pixel-wise independent Bernoulli distribution in which each pixel is observed with probability 0.2. That is, we randomly drop 80% of the pixels.
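A minimal sketch of such a mask generator is shown below. The function name and the convention that 1 marks an observed pixel and 0 an unobserved one are assumptions made for illustration.

```python
import numpy as np

def mcar_mask(shape, rng, observed_prob=0.2):
    """Sample a binary mask with i.i.d. Bernoulli entries.

    Assumed convention: 1 marks an observed pixel, 0 an unobserved one.
    """
    return (rng.uniform(size=shape) < observed_prob).astype(np.uint8)

rng = np.random.default_rng(0)
mask = mcar_mask((64, 64), rng)
observed_fraction = mask.mean()  # close to 0.2 on average
```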
Rectangular mask
We randomly generate a rectangle inside which pixels are marked as unobserved. The mask is generated by sampling a point to be the upper-left corner and randomly generating the height and width of the rectangle, subject to the constraint that its area is at least 10% of the full image area.
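One way to realize this procedure is sketched below. Enforcing the area constraint by rejection sampling is an assumption, since the text does not say how the constraint is enforced; the function name and the mask convention (0 = unobserved) are likewise illustrative.

```python
import numpy as np

def rectangular_mask(height, width, rng, min_area_frac=0.1):
    """Return a mask with one random unobserved rectangle (0 = unobserved).

    The corner and size are redrawn until the rectangle covers at least
    min_area_frac of the image (rejection sampling, an assumption).
    """
    while True:
        top = rng.integers(0, height)
        left = rng.integers(0, width)
        rect_h = rng.integers(1, height - top + 1)
        rect_w = rng.integers(1, width - left + 1)
        if rect_h * rect_w >= min_area_frac * height * width:
            break
    mask = np.ones((height, width), dtype=np.uint8)
    mask[top:top + rect_h, left:left + rect_w] = 0
    return mask

mask = rectangular_mask(32, 32, np.random.default_rng(0))
```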
Half mask
Half mask means either the upper, lower, left or right half of the image is unobserved.
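A half mask can be sketched in a few lines. Choosing the side uniformly at random is an assumption, as are the function name and the convention that 0 marks unobserved pixels.

```python
import numpy as np

def half_mask(height, width, rng):
    """Mask out one randomly chosen half of the image (0 = unobserved)."""
    mask = np.ones((height, width), dtype=np.uint8)
    side = rng.choice(["upper", "lower", "left", "right"])
    if side == "upper":
        mask[: height // 2, :] = 0
    elif side == "lower":
        mask[height // 2 :, :] = 0
    elif side == "left":
        mask[:, : width // 2] = 0
    else:
        mask[:, width // 2 :] = 0
    return mask

mask = half_mask(32, 32, np.random.default_rng(0))
```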
Random pattern mask
A random pattern mask removes a random region with an arbitrary shape. We take the implementation from VAEAC (https://github.com/tigvarts/vaeac/blob/master/mask_generators.py#L100).
Masks in [Li et al.2017]
Li et al. propose six different masks that cover different parts of a face, such as the eyes, mouth, and cheeks.
C.2 Models
For this experiment, ACFlow is implemented by replacing the affine coupling in RealNVP [Dinh, SohlDickstein, and Bengio2016] with our proposed conditional affine coupling transformation. The DNN in each affine coupling is implemented as a ResNet [He et al.2016]. For MNIST and Omniglot, we use 2 residual blocks; for CelebA, we use 4. We also use the squeeze operation to enable large receptive fields: 3 squeeze operations for MNIST and Omniglot, and 4 for CelebA. Note that the binary mask must be squeezed in the same way so that it stays aligned with the squeezed features.
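The need to squeeze the mask together with the image can be illustrated with a small sketch. The space-to-channel layout below is one common choice and may differ in channel ordering from the RealNVP code; the function name and the assumption that the mask has the same shape as the image are illustrative.

```python
import numpy as np

def squeeze(x, factor=2):
    """Rearrange an (H, W, C) array into (H/f, W/f, f*f*C).

    Each f-by-f spatial block is moved into the channel axis.
    """
    h, w, c = x.shape
    assert h % factor == 0 and w % factor == 0
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // factor, w // factor, factor * factor * c)

rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8, 3))
# Assumed convention: mask has the image's shape, 1 = observed, 0 = unobserved.
mask = (rng.uniform(size=(8, 8, 3)) < 0.5).astype(image.dtype)

image_sq, mask_sq = squeeze(image), squeeze(mask)
```

Since both arrays go through the same permutation of entries, `image_sq * mask_sq` equals `squeeze(image * mask)`, so every squeezed feature keeps its own observed/unobserved flag.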
For ACFlow BG, we set the hyperparameter of the auxiliary objective to 1.0 in all experiments. In preliminary experiments, we found the model to be quite robust to its value.
VAEAC released their implementation for the CelebA dataset, and we modify their code by removing the first pooling layer to suit our image size. For MNIST and Omniglot, we build similar networks using fewer convolution and pooling layers. Specifically, we pad the images and use 5 pooling layers to obtain the latent vector. We use 32 latent variables for MNIST and Omniglot and 256 latent variables for CelebA. We also use shortcut connections between the prior network and the generative network as in [Ivanov, Figurnov, and Vetrov2019].
C.3 Additional Samples
We show additional samples from ACFlow in Fig. C.3. We also show some “best guess” inpaintings obtained by a model trained with the auxiliary objective in Fig. C.2. As we expected, the “best guess” imputations tend to be blurry due to the MSE penalty. However, the samples from the same model are still diverse and coherent as can be seen from Fig. C.4.
Appendix D Realvalued Feature Imputation
D.1 Models
For this experiment, we build ACFlow by combining 6 conditional transformation layers (shown in Fig. B.1) with a 4-layer GRU autoregressive likelihood model. The base distribution is a Gaussian mixture with 40 components. ACFlow BG uses the same network architecture but is trained with the auxiliary objective. We search over 0.1, 1.0, and 10.0 for its hyperparameter and find that a single value gives better likelihood on all datasets. Hence, we report results only for that value in the main text.
We modify the official VAEAC implementation for this experiment. Their original implementation restricts the generative network to output a distribution with a fixed variance. We found that learning the variance improves VAEAC's likelihood significantly while yielding comparable NRMSE scores. Note that during sampling, we still use the mean of the outputs from the generative network, because it gives slightly better NRMSE scores.
MICE and MissForest are trained using default parameters in the R packages.
For GAIN, we use the variant the VAEAC authors proposed. They observed consistent improvement over the original one on UCI datasets. To apply GAIN when we have complete training data, we add another MSE loss over the unobserved covariates to train the generator.
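The additional loss term mentioned for GAIN can be sketched as a mean squared error restricted to the unobserved entries. This is a minimal illustration, not the GAIN code: the function name and the convention that mask value 0 marks an unobserved covariate are assumptions.

```python
import numpy as np

def unobserved_mse(x_true, x_imputed, mask):
    """Mean squared error over entries where mask == 0 (unobserved).

    Only usable when the complete training data x_true is available.
    """
    missing = mask == 0
    sq_err = (x_true - x_imputed) ** 2
    return sq_err[missing].sum() / max(int(missing.sum()), 1)

x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_imputed = np.array([[1.0, 0.0], [0.0, 4.0]])
mask = np.array([[1, 0], [0, 1]])
loss = unobserved_mse(x_true, x_imputed, mask)  # (2^2 + 3^2) / 2
```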
The autoencoder is implemented as a 6-layer fully connected network with 256 hidden units. ReLU is used after each fully connected layer except the last one. We use a standard Gaussian as the base distribution.
D.2 Results
We list all results in Table D.1. We report the mean and standard deviation obtained by generating 5 different masks for each test data point.
| Missing rate | Method | Metric | bsds | gas | hepmass | miniboone | power |
|---|---|---|---|---|---|---|---|
| p=0 | GAIN | NRMSE | 0.895±0.151 | 0.715±0.041 | 0.948±0.006 | 0.620±0.002 | 0.949±0.017 |
| p=0 | AutoEncoder | NRMSE | 0.635±0.000 | 1.016±0.178 | 0.930±0.001 | 0.483±0.003 | 0.887±0.002 |
| p=0 | AutoEncoder | NLL | 3.502±0.014 | 2.543±0.115 | 12.707±0.018 | 6.861±0.114 | 2.488±0.039 |
| p=0 | VAEAC | NRMSE | 0.615±0.000 | 0.574±0.033 | 0.896±0.001 | 0.462±0.002 | 0.880±0.001 |
| p=0 | VAEAC | NLL | 1.708±0.005 | 2.418±0.006 | 10.082±0.010 | 3.452±0.067 | 0.042±0.002 |
| p=0 | ACFlow | NRMSE | 0.603±0.000 | 0.567±0.050 | 0.909±0.000 | 0.478±0.004 | 0.877±0.001 |
| p=0 | ACFlow | NLL | 5.269±0.007 | 8.086±0.010 | 8.197±0.008 | 0.972±0.022 | 0.561±0.003 |
| p=0 | ACFlow BG | NRMSE | 0.572±0.000 | 0.369±0.016 | 0.861±0.001 | 0.442±0.001 | 0.833±0.002 |
| p=0 | ACFlow BG | NLL | 4.841±0.008 | 7.593±0.011 | 6.833±0.006 | 1.098±0.032 | 0.528±0.003 |
| p=0.1 | MICE | NRMSE | 0.631±0.003 | 0.518±0.004 | 0.964±0.004 | 0.605±0.004 | 0.911±0.008 |
| p=0.1 | MissForest | NRMSE | 0.665±0.002 | 0.418±0.005 | 0.985±0.002 | 0.561±0.003 | 0.991±0.019 |
| p=0.1 | GAIN | NRMSE | 0.749±0.128 | 0.502±0.070 | 1.024±0.023 | 0.615±0.017 | 1.074±0.038 |
| p=0.1 | AutoEncoder | NRMSE | 0.648±0.001 | 0.761±0.095 | 0.936±0.001 | 0.498±0.002 | 0.887±0.002 |
| p=0.1 | AutoEncoder | NLL | 4.300±0.038 | 2.266±0.169 | 12.851±0.012 | 7.305±0.043 | 2.521±0.028 |
| p=0.1 | VAEAC | NRMSE | 0.620±0.000 | 0.558±0.047 | 0.899±0.000 | 0.467±0.004 | 0.881±0.003 |
| p=0.1 | VAEAC | NLL | 2.245±0.015 | 2.823±0.009 | 10.389±0.005 | 4.242±0.071 | 0.103±0.005 |
| p=0.1 | ACFlow | NRMSE | 0.610±0.000 | 0.588±0.025 | 0.908±0.001 | 0.533±0.005 | 0.877±0.002 |
| p=0.1 | ACFlow | NLL | 4.225±0.018 | 7.568±0.005 | 7.784±0.006 | 5.150±0.053 | 0.557±0.003 |
| p=0.1 | ACFlow BG | NRMSE | 0.586±0.001 | 0.384±0.018 | 0.863±0.001 | 0.468±0.003 | 0.836±0.002 |
| p=0.1 | ACFlow BG | NLL | 3.187±0.017 | 7.212±0.008 | 9.670±0.007 | 3.577±0.057 | 0.510±0.003 |
| p=0.5 | MICE | NRMSE | 0.628±0.001 | 0.539±0.005 | 0.969±0.002 | 0.615±0.002 | 0.916±0.005 |
| p=0.5 | MissForest | NRMSE | 0.662±0.001 | 0.436±0.003 | 0.990±0.002 | 0.573±0.005 | 0.990±0.012 |
| p=0.5 | GAIN | NRMSE | 0.929±0.123 | 1.152±0.180 | 1.143±0.035 | 0.800±0.042 | 1.101±0.044 |
| p=0.5 | AutoEncoder | NRMSE | 0.739±0.001 | 0.618±0.056 | 0.962±0.001 | 0.567±0.005 | 0.905±0.002 |
| p=0.5 | AutoEncoder | NLL | 10.078±0.021 | 0.990±0.097 | 13.482±0.012 | 10.775±0.091 | 2.858±0.047 |
| p=0.5 | VAEAC | NRMSE | 0.666±0.001 | 0.531±0.036 | 0.915±0.001 | 0.513±0.004 | 0.892±0.002 |
| p=0.5 | VAEAC | NLL | 9.930±0.029 | 1.952±0.023 | 11.415±0.012 | 9.051±0.079 | 0.343±0.004 |
| p=0.5 | ACFlow | NRMSE | 0.667±0.001 | 0.488±0.030 | 0.938±0.000 | 0.614±0.004 | 0.890±0.000 |
| p=0.5 | ACFlow | NLL | 1.508±0.010 | 5.405±0.008 | 10.538±0.006 | 9.892±0.084 | 0.458±0.005 |
| p=0.5 | ACFlow BG | NRMSE | 0.645±0.000 | 0.421±0.016 | 0.890±0.000 | 0.582±0.007 | 0.843±0.001 |
| p=0.5 | ACFlow BG | NLL | 3.497±0.015 | 4.818±0.009 | 10.975±0.006 | 10.849±0.105 | 0.417±0.005 |
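The reporting protocol (mean and standard deviation over several random masks) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: NRMSE is assumed to be the RMSE over unobserved entries after dividing each error by the feature's standard deviation, the test-time missing rate of 0.5 is a placeholder, and `zero_impute` is a hypothetical stand-in for a trained imputation model.

```python
import numpy as np

def nrmse(x_true, x_imputed, mask, feature_std):
    """RMSE over unobserved entries (mask == 0), scaled per feature.

    Dividing each error by the feature's standard deviation is an
    assumed normalization.
    """
    missing = mask == 0
    scaled_err = (x_true - x_imputed) / feature_std
    return np.sqrt(np.mean(scaled_err[missing] ** 2))

def score_over_masks(x_true, impute_fn, rng, num_masks=5, missing_rate=0.5):
    """Mean and standard deviation of NRMSE over independently drawn masks."""
    feature_std = x_true.std(axis=0)
    scores = []
    for _ in range(num_masks):
        # 1 = observed, 0 = unobserved (assumed convention).
        mask = (rng.uniform(size=x_true.shape) >= missing_rate).astype(np.uint8)
        x_imputed = impute_fn(x_true * mask, mask)
        scores.append(nrmse(x_true, x_imputed, mask, feature_std))
    return float(np.mean(scores)), float(np.std(scores))

def zero_impute(x_observed, mask):
    """Toy imputer: keep observed values and fill missing entries with 0."""
    return x_observed

rng = np.random.default_rng(0)
x_test = rng.normal(size=(2000, 5))  # synthetic stand-in for a test set
mean_nrmse, std_nrmse = score_over_masks(x_test, zero_impute, rng)
```

On standardized data, an imputer that always predicts the feature mean scores an NRMSE near 1 under this definition, which gives a rough reference point when reading the table.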