# Structured Output Learning with Conditional Generative Flows

###### Abstract

Traditional structured prediction models try to learn the conditional likelihood, i.e., $p(y \mid x)$, to capture the relationship between the structured output $y$ and the input features $x$. For many models, computing the likelihood is intractable. These models are therefore hard to train, requiring surrogate objectives or variational inference to approximate the likelihood. In this paper, we propose conditional Glow (c-Glow), a conditional generative flow for structured output learning. C-Glow benefits from the ability of flow-based models to compute $p(y \mid x)$ exactly and efficiently. Learning with c-Glow does not require a surrogate objective or inference during training. Once trained, the model can directly and efficiently generate conditional samples for structured prediction. We evaluate this approach on different structured prediction tasks and find c-Glow’s structured outputs comparable in quality with state-of-the-art deep structured prediction approaches.


You Lu Department of Computer Science Virginia Tech Blacksburg, VA 24061 you.lu@vt.edu Bert Huang Department of Computer Science Virginia Tech Blacksburg, VA 24061 bhuang@vt.edu

Preprint. Under review.

## 1 Introduction

Structured prediction models are widely used in tasks such as image segmentation Nowozin and Lampert (2011) and sequence labeling Lafferty et al. (2001). In these structured output tasks, the goal is to model a mapping from the input $x$ to the high-dimensional, structured output $y$. In many such problems, it is also important to make diverse predictions that capture the variability of plausible solutions to the structured output problem Sohn et al. (2015).

Many existing methods for structured output learning use graphical models, such as conditional random fields (CRFs) Wainwright and Jordan (2008), and approximate the conditional distribution $p(y \mid x)$. Approximation is necessary because, for most graphical models, computing the exact likelihood is intractable. Recently, deep structured prediction models Chen et al. (2015); Zheng et al. (2015); Sohn et al. (2015); Wang et al. (2016); Belanger and McCallum (2016); Graber et al. (2018) combine deep neural networks with graphical models, so that they can use the power of deep neural networks to extract high-quality features and graphical models to model correlations and dependencies among variables. The main drawback of these approaches is that, due to the intractable likelihood, they are difficult to train. Training them requires the construction of surrogate objectives that approximate or bound the likelihood, often involving variational inference to infer latent variables. Moreover, once the model is trained, inference and sampling from CRFs require expensive iterative procedures Koller et al. (2009).

In this paper, we develop conditional generative flows (c-Glow) for structured output learning. Our model is a variant of Glow Kingma and Dhariwal (2018), with additional neural networks for capturing the relationship between input features and structured output variables. Compared to most methods for structured output learning, c-Glow has the unique advantage that it can directly model the conditional distribution $p(y \mid x)$ without restrictive assumptions (e.g., variables being fully connected Krähenbühl and Koltun (2011)). We can train c-Glow by exploiting the fact that invertible flows allow exact computation of the log-likelihood, removing the need for surrogates or inference. Compared to other methods using normalizing flows (e.g., Trippe and Turner (2018); Kingma and Dhariwal (2018)), c-Glow’s output label is conditioned on a complex input and is itself a high-dimensional tensor rather than a one-dimensional scalar. We evaluate c-Glow on three structured prediction tasks: semantic segmentation, depth refinement, and image inpainting, finding that c-Glow’s exact likelihood-based training is able to learn models that can efficiently predict structured outputs of comparable quality to state-of-the-art deep structured prediction approaches.

## 2 Related Work

There are two main branches of research related to our paper: deep structured prediction and normalizing flows. In this section, we briefly cover some of the most related literature.

### 2.1 Deep Structured Models

One emerging strategy to construct deep structured models is to combine deep neural networks with graphical models. However, this kind of model can be difficult to train, since the likelihood of graphical models is usually intractable. Chen et al. (2015) proposed joint learning approaches that blend the learning and approximate inference to alleviate some of these computational challenges. Zheng et al. (2015) proposed CRF-RNN, a method that treats mean-field variational CRF inference as a recurrent neural network to allow gradient-based learning of model parameters. Wang et al. (2016) proposed proximal methods for inference. And Sohn et al. (2015) used variational autoencoders Kingma and Welling (2013) to generate latent variables for predicting the output.

Another direction combining structured output learning with deep models is to construct energy functions with deep networks. Structured prediction energy networks (SPENs) Belanger and McCallum (2016) define energy functions for scoring structured outputs as differentiable deep networks. The likelihood of a SPEN is intractable, so the authors used structured SVM loss to learn. SPENs can also be trained in an end-to-end learning framework Belanger et al. (2017) based on unrolled optimization. Methods to alleviate the cost of SPEN inference include replacing the argmax inference with an inference network Tu and Gimpel (2018). Inspired by Q-learning, Gygli et al. (2017) used an oracle value function as the objective for energy-based deep networks. Graber et al. (2018) generalized SPENs by adding non-linear transformations on top of the score function.

### 2.2 Normalizing Flows

Normalizing flows are neural networks constructed with fully invertible components. The invertibility of the resulting network provides various mathematical benefits. Normalizing flows have been successfully used to build likelihood-based deep generative models Dinh et al. (2014, 2016); Kingma and Dhariwal (2018) and to improve variational approximation Rezende and Mohamed (2015); Kingma et al. (2016). Autoregressive flows Kingma et al. (2016); Papamakarios et al. (2017); Huang et al. (2018); Ziegler and Rush (2019) condition each affine transformation on all previous variables, so that they ensure an invertible transformation and triangular Jacobian matrix. Continuous normalizing flows Chen et al. (2018); Grathwohl et al. (2018) define the transformation function using ordinary differential equations. While most normalizing flow models define generative models, Trippe and Turner (2018) developed radial flows to model univariate conditional probabilities.

Most related to our approach are flow-based generative models for complex output. Dinh et al. (2014) first proposed a flow-based model, NICE, for modeling complex high-dimensional densities. They later proposed Real-NVP Dinh et al. (2016), which improves the expressiveness of NICE by adding more flexible coupling layers. The Glow model Kingma and Dhariwal (2018) further improved the performance of such approaches by incorporating new invertible layers. Most recently, Flow++ Ho et al. (2019) improved generative flows with variational dequantization and architecture design, and Ma and Hovy (2019) proposed new invertible layers for flow-based models.

## 3 Background

In this section, we introduce notation and background knowledge directly related to our work.

### 3.1 Structured Output Learning

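Let $x$ and $y$ be random variables with unknown true distribution $p^*(y \mid x)$. We collect a dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ is the $i$th input vector and $y_i$ is the corresponding output. To approximate $p^*(y \mid x)$, we develop a model $p(y \mid x, \theta)$ and then minimize the negative log-likelihood

$$\mathcal{L}(\mathcal{D}) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta).$$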

In structured output learning, the label $y$ comes from a complex, high-dimensional output space $\mathcal{Y}$ with dependencies among output dimensions. Many structured output learning approaches use an energy-based model to define a conditional distribution:

$$p(y \mid x, \theta) = \frac{\exp\left(-E(y, x, \theta)\right)}{\int_{y' \in \mathcal{Y}} \exp\left(-E(y', x, \theta)\right) dy'},$$

where $E(y, x, \theta)$ is the energy function. In deep structured prediction, $E$ depends on $x$ via a deep network. Due to the high dimensionality of $y$, the partition function, i.e., $\int_{y' \in \mathcal{Y}} \exp\left(-E(y', x, \theta)\right) dy'$, is intractable. To train the model, we need methods that approximate the partition function, such as variational inference or surrogate objectives, resulting in complicated training and sub-optimal results.
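To make the cost of the partition function concrete, the following sketch enumerates every labeling of a tiny binary output. The energy function, the Gibbs form $\exp(-E)$, and all names here are illustrative assumptions rather than anything specified in the paper; the point is only that the normalizer needs $2^n$ terms for $n$ binary outputs.

```python
import itertools
import math

def energy(y, x):
    """Hypothetical energy: unary terms tie y_i to x_i, pairwise terms couple neighbours."""
    unary = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    pairwise = sum(abs(a - b) for a, b in zip(y, y[1:]))
    return unary + 0.5 * pairwise

def log_partition(x):
    """Brute-force log Z(x): sums exp(-E) over all 2^n binary labelings."""
    terms = [-energy(y, x) for y in itertools.product((0, 1), repeat=len(x))]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def log_prob(y, x):
    """Exact log p(y|x) under the assumed Gibbs form; feasible only for tiny n."""
    return -energy(y, x) - log_partition(x)

x = [0.9, 0.8, 0.1, 0.2]
labelings = list(itertools.product((0, 1), repeat=len(x)))
total = sum(math.exp(log_prob(y, x)) for y in labelings)
num_terms = len(labelings)  # 2 ** 4 here; an image-sized mask would be far out of reach
```

With four binary outputs the sum has 16 terms, and every additional output doubles that count, which is why energy-based models fall back on approximations during training.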

### 3.2 Conditional Normalizing Flows

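A normalizing flow is a composition of invertible functions $f = f_M \circ \cdots \circ f_1$, which transforms the target $y$ to a latent code $z$ drawn from a simple distribution $p_Z(z)$. In conditional normalizing flows Trippe and Turner (2018), we rewrite each function as $f_i = f_{x, \phi_i}$, making it parameterized by both $x$ and its parameter $\phi_i$. Thus, with the change of variables formula, we can rewrite the conditional likelihood as

$$\log p(y \mid x, \theta) = \log p_Z(z) + \sum_{i=1}^{M} \log \left| \det \left( \frac{\partial f_{x, \phi_i}}{\partial r_{i-1}} \right) \right|, \tag{1}$$

where $r_i = f_{x, \phi_i}(r_{i-1})$, $r_0 = y$, and $r_M = z$.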

In this paper, we address the structured output problem by using normalizing flows. That is, we directly use the conditional normalizing flows, i.e., Equation 1, to calculate the conditional distribution. Thus, the model can be trained by locally optimizing the exact likelihood. Note that conditional normalizing flows have been used for conditional density estimation. Trippe and Turner (2018) use it to solve the one-dimensional regression problem. Our method is different from theirs in that the labels in our problem are high-dimensional tensors rather than scalars. We therefore will build on recently developed methods for (unconditional) flow-based generative models for high-dimensional data.
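As a small illustration of Equation 1, the sketch below evaluates the exact conditional log-likelihood of a scalar output under two stacked affine flows whose parameters depend on $x$. The particular scale and shift functions in `make_layers` are hypothetical stand-ins for learned conditioning, and a standard normal latent is assumed.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def make_layers(x):
    """Hypothetical conditioning: each layer's (scale, shift) is a function of x."""
    return [(math.exp(0.3 * x), 0.5 * x), (1.5, -x)]

def log_likelihood(y, x):
    """log p(y|x) = log p_Z(z) + sum_i log|d f_i / d r_{i-1}| for scalar affine flows."""
    r, log_det = y, 0.0
    for scale, shift in make_layers(x):
        r = scale * r + shift            # r_i = f_i(r_{i-1})
        log_det += math.log(abs(scale))  # Jacobian of a scalar affine map is its scale
    return standard_normal_logpdf(r) + log_det
```

Because the composed map is affine in $y$, the result can be checked against the closed-form Gaussian density it induces; for real tensors the same bookkeeping applies with log-determinants of the layer Jacobians in place of `log(abs(scale))`.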

### 3.3 Glow

Glow Kingma and Dhariwal (2018) is a flow-based generative model that extends other flow-based models: NICE Dinh et al. (2014) and Real-NVP Dinh et al. (2016). Glow’s modifications have demonstrated significant improvements in likelihood and sample quality for natural images. The model mainly consists of three components. Let $u$ and $v$ be the input and output of a layer, whose shape is $[h \times w \times c]$, with spatial dimensions $(h, w)$ and channel dimension $c$. The three components are as follows.

Actnorm layers. Each activation normalization (actnorm) layer performs an affine transformation of activations using two $1 \times c$ parameters, i.e., a scale $s$ and a bias $b$. The transformation can be written as

$$v_{i,j} = s \odot u_{i,j} + b,$$

where $\odot$ is the element-wise product.

Invertible 1×1 convolutional layers. Each invertible 1×1 convolutional layer is a generalization of a permutation operation. Its function format is

$$v_{i,j} = W u_{i,j},$$

where $W$ is a $c \times c$ weight matrix.

Affine layers. As in the NICE and Real-NVP models, Glow also has affine coupling layers to capture the correlations among spatial dimensions. Its transformation is

$$\begin{aligned} u_1, u_2 &= \mathrm{split}(u), \\ s_2, b_2 &= \mathrm{NN}(u_1), \\ v_2 &= s_2 \odot u_2 + b_2, \\ v &= \mathrm{concat}(u_1, v_2), \end{aligned}$$

where NN is a neural network, and the $\mathrm{split}()$ and $\mathrm{concat}()$ functions perform operations along the channel dimension. The $s_2$ and $b_2$ have the same size as $u_2$.

Glow uses a multi-scale architecture Dinh et al. (2016) to combine the layers. This architecture has a “squeeze” layer for shuffling the variables and a “split” layer for reducing the computational cost.
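The affine coupling layer described above is the least obvious of the three components, so a minimal sketch for a single pixel's channel vector may help. The function `nn` is a hypothetical stand-in for the coupling network, and producing the scale as the exponential of a bounded log-scale is an assumption made here to keep the scale positive; it is not necessarily the parameterization Glow uses.

```python
import math

def nn(u1):
    """Stand-in for the coupling network NN: returns (log-scale, bias) for u2."""
    log_s = [math.tanh(a) for a in u1]
    b = [0.5 * a for a in u1]
    return log_s, b

def coupling_forward(u):
    half = len(u) // 2
    u1, u2 = u[:half], u[half:]                     # split along channels
    log_s, b = nn(u1)
    v2 = [math.exp(ls) * a + bb for ls, a, bb in zip(log_s, u2, b)]
    return u1 + v2, sum(log_s)                      # concat, and log|det Jacobian|

def coupling_inverse(v):
    half = len(v) // 2
    v1, v2 = v[:half], v[half:]
    log_s, b = nn(v1)                               # v1 equals u1, so NN output is identical
    u2 = [(a - bb) / math.exp(ls) for ls, a, bb in zip(log_s, v2, b)]
    return v1 + u2

u = [0.3, -1.2, 0.8, 2.0]
v, log_det = coupling_forward(u)
u_back = coupling_inverse(v)
```

The first half of the channels passes through unchanged, which is what makes the layer invertible without inverting `nn`, and the Jacobian is triangular, so its log-determinant is just the sum of the log-scales.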

## 4 Conditional Generative Flows for Structured Output Learning

In this section, we introduce our conditional generative flow, i.e., c-Glow, which is a flow-based generative model for structured prediction.

### 4.1 Conditional Glow

To modify Glow to be a conditional generative flow, we need to add conditioning architectures to its three components: the actnorm layer, the 1×1 convolutional layer, and the affine coupling layer. The main idea is to use a neural network, which we refer to as a conditioning network (CN), to generate the parameter weights for each layer. The details are as follows.

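Conditional actnorm. The parameters of an actnorm layer are two $1 \times c$ vectors, i.e., the scale $s$ and the bias $b$. As in Section 3.3, $u$ and $v$ denote a layer's input and output. In conditional Glow, we use a CN to generate these two vectors and then use them to transform the variable, i.e.,

$$s, b = \mathrm{CN}(x), \qquad v_{i,j} = s \odot u_{i,j} + b.$$

Conditional 1×1 convolutional. The 1×1 convolutional layer uses a $c \times c$ weight matrix to permute each spatial dimension’s variable. In conditional Glow, we use a conditioning network to generate this matrix:

$$W = \mathrm{CN}(x), \qquad v_{i,j} = W u_{i,j}.$$

Conditional affine coupling. The affine coupling layer separates the input variable into two halves, i.e., $u_1$ and $u_2$. It uses $u_1$ as the input to an NN to generate scale and bias parameters for $u_2$. To build a conditional affine coupling layer, we use a CN to extract features $x_r$ from $x$, and then we concatenate them with $u_1$ to form the input of NN:

$$\begin{aligned} x_r &= \mathrm{CN}(x), \\ u_1, u_2 &= \mathrm{split}(u), \\ s_2, b_2 &= \mathrm{NN}(x_r, u_1), \\ v_2 &= s_2 \odot u_2 + b_2, \\ v &= \mathrm{concat}(u_1, v_2). \end{aligned}$$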

We can still use the multi-scale architecture to combine these conditional components, to preserve the efficiency of computation. Figure 1 illustrates the Glow and c-Glow architectures for comparison.

Since the conditioning networks do not need to be invertible when optimizing a conditional model, we do not specify their architecture here. Any differentiable network suffices and preserves the ability of c-Glow to compute the exact conditional likelihood of each input-output pair.
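A minimal sketch of the conditional actnorm idea follows, assuming a toy conditioning network that pools the input features and emits a positive per-channel scale and a bias. The function `conditioning_network` and its formulas are hypothetical; the only property that matters is that the layer remains invertible in its input for any fixed $x$, even though the conditioning network itself is not invertible.

```python
import math

def conditioning_network(x, num_channels):
    """Hypothetical CN: maps a feature vector x to per-channel (scale, bias)."""
    pooled = sum(x) / len(x)
    scale = [math.exp(0.1 * (k + 1) * pooled) for k in range(num_channels)]  # always positive
    bias = [pooled - k for k in range(num_channels)]
    return scale, bias

def cond_actnorm_forward(u, x):
    """u is a list of pixels, each a list of c channel values. Returns (v, log|det J|)."""
    c = len(u[0])
    s, b = conditioning_network(x, c)
    v = [[s[k] * pix[k] + b[k] for k in range(c)] for pix in u]
    log_det = len(u) * sum(math.log(abs(sk)) for sk in s)  # (h*w) * sum_k log|s_k|
    return v, log_det

def cond_actnorm_inverse(v, x):
    c = len(v[0])
    s, b = conditioning_network(x, c)
    return [[(pix[k] - b[k]) / s[k] for k in range(c)] for pix in v]

u = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.5]]   # three pixels, two channels
x = [0.2, 0.4, 0.6]
v, log_det = cond_actnorm_forward(u, x)
u_back = cond_actnorm_inverse(v, x)
```

Changing `x` changes the affine map applied to `u`, but for any given `x` the map can be undone exactly and its log-determinant is available in closed form.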

To learn the model parameters, we can take advantage of the efficiently computable log-likelihood for flow-based models. Therefore, we can back-propagate to differentiate the exact conditional likelihood, i.e., Equation 1, and optimize all the c-Glow parameters using gradient methods.
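The training procedure can be sketched on a toy problem: a one-layer conditional affine flow for scalar outputs, fitted by plain gradient descent on the exact negative log-likelihood. Everything here (the synthetic data, the linear conditioning, the hand-derived gradients, and the step size) is an assumption chosen for illustration; c-Glow itself uses deep networks and automatic differentiation.

```python
import math
import random

random.seed(0)
data = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + 0.5 * random.gauss(0.0, 1.0)))

def nll(w, b, log_s):
    """Exact mean NLL of the flow z = (y - w*x - b) * exp(-log_s), with z ~ N(0, 1)."""
    total = 0.0
    for x, y in data:
        z = (y - w * x - b) * math.exp(-log_s)
        total += 0.5 * z * z + 0.5 * math.log(2.0 * math.pi) + log_s
    return total / len(data)

w, b, log_s = 0.0, 0.0, 0.0
lr = 0.1
initial_nll = nll(w, b, log_s)
for _ in range(2000):
    gw = gb = gs = 0.0
    inv = math.exp(-log_s)
    for x, y in data:
        z = (y - w * x - b) * inv
        gw += -z * x * inv     # d NLL / d w
        gb += -z * inv         # d NLL / d b
        gs += 1.0 - z * z      # d NLL / d log_s
    n = len(data)
    w, b, log_s = w - lr * gw / n, b - lr * gb / n, log_s - lr * gs / n
final_nll = nll(w, b, log_s)
```

No surrogate objective or inference step appears anywhere in the loop: the quantity being differentiated is the likelihood itself, and the fitted `w`, `b`, and `exp(log_s)` should approach the values used to generate the data.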

### 4.2 Inference

Given a learned model $p(y \mid x, \theta)$, we can perform efficient sampling with a single forward pass through c-Glow. We first calculate the transformation functions given $x$ and then sample the latent code $z$ from $p_Z(z)$. Finally, we propagate the sampled $z$ through the model to get the corresponding sample $y$. The whole process can be summarized as

$$z \sim p_Z(z), \qquad y = g_{x, \phi}(z), \tag{2}$$

where $g_{x, \phi} = f^{-1}_{x, \phi}$ is the inverse of the flow $f_{x, \phi}$.
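The sampling procedure of Equation 2 can be sketched with a scalar conditional flow. The function `flow_params` is a hypothetical stand-in for the conditioning networks, and a standard normal latent is assumed.

```python
import math
import random

def flow_params(x):
    """Hypothetical conditioning: a location and a positive scale computed from x."""
    return 2.0 * x + 1.0, math.exp(-0.5 * x)

def f(y, x):
    """Forward flow: target y -> latent z."""
    mu, sigma = flow_params(x)
    return (y - mu) / sigma

def g(z, x):
    """Inverse flow: latent z -> target y."""
    mu, sigma = flow_params(x)
    return mu + sigma * z

def sample(x, rng):
    z = rng.gauss(0.0, 1.0)   # z ~ p_Z(z)
    return g(z, x)            # y = g_x(z)

rng = random.Random(0)
samples = [sample(0.5, rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
```

Each sample costs one draw from the latent distribution and one pass through the inverse flow, with no iterative procedure involved.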

The core task in structured output learning is to predict the best output, i.e., $y^*$, for an input $x$. Many existing approaches solve the most-probable explanation (MPE) problem: $y^* = \arg\max_{y} p(y \mid x, \theta)$. However, MPE can be difficult for c-Glow because the likelihood function is non-convex with a highly multi-modal surface, so optimization over $y$ converges only to a local optimum. In our experiments, we find the local optima to be only slightly better than conditional samples. Therefore, we use sample averages to estimate marginal expectations of output variables. Let $\{z_1, \ldots, z_S\}$ be samples drawn from $p_Z(z)$. Estimated marginal expectations for each variable can be computed from the average

$$y^* \approx \frac{1}{S} \sum_{i=1}^{S} g_{x, \phi}(z_i), \tag{3}$$

with $g_{x, \phi}$ the inverse function from Equation 2.

In the general form of c-Glow, the output variables $y$ are continuous. In some tasks like semantic segmentation, the space of $y$ is discrete. Following previous literature Belanger and McCallum (2016); Gygli et al. (2017), we relax the discrete output space to a continuous space during training. At prediction time, we simply round $y^*$ to discrete values.
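The prediction rule, averaging conditional samples and then rounding for discrete labels, can be sketched as follows. The function `sample_mask` is a hypothetical stand-in for drawing one conditional sample from a trained model; the noise level and the 0.5 threshold are illustrative assumptions.

```python
import random

def sample_mask(x, rng):
    """Hypothetical sampler standing in for a conditional sample: a noisy soft mask."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, 0.2))) for p in x]

def predict(x, num_samples, rng):
    samples = [sample_mask(x, rng) for _ in range(num_samples)]
    # Per-variable sample average, as in the marginal-expectation estimate.
    mean = [sum(column) / num_samples for column in zip(*samples)]
    # Round the relaxed continuous output back to binary labels.
    return [1 if m >= 0.5 else 0 for m in mean]

rng = random.Random(0)
prediction = predict([0.9, 0.1, 0.8, 0.2], num_samples=50, rng=rng)
```

Individual samples can disagree on uncertain pixels, while the average smooths that variation before the final rounding step.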

## 5 Experiments

In this section, we evaluate c-Glow on three structured prediction tasks: semantic segmentation, depth refinement, and image inpainting.

### 5.1 Architecture and Setup

To specify a c-Glow architecture, we need to define conditioning networks that generate weights for the conditional actnorm, 1×1 convolutional, and affine layers.

For the conditional actnorm layer, we use a six-layer conditioning network. The first three layers are convolutional layers that downscale the input to a reasonable size. The last three layers are then fully connected layers, which transform the resized to the scale and the bias vectors. For the downscaling convolutional layers, we use a simple method to determine their kernel size and stride. Let and be the input and output sizes. Then we set the stride to and the kernel size to .

For the conditional 1×1 convolutional layer, we use a similar six-layer network to generate the weight matrix. The only difference is that the last fully connected layer will generate the weight matrix $W$. For the actnorm and 1×1 convolutional conditioning networks, the number of channels of the convolutional layers, i.e., , and the width of fully connected layers, i.e., , will impact the model’s performance. We discuss how we set them in the appendix.

For the conditional affine layer, we use a three-layer conditional network to extract features from , and we concatenate it with . Among the three layers, the first and the last layers use kernels. The middle layer is a downscaling convolutional layer. We vary the number of channels of this conditional network to be , and we find that the model is not very sensitive to this variation. In our experiments, we fix it to have channels. The affine layer itself is composed of three convolutional layers with 256 channels.

We use the same multi-scale architecture as Glow to connect the layers, so the number of levels and the number of steps of each level will also impact the model’s performance. In our experiments, we follow Kingma and Dhariwal (2018) to set , and vary . We use Adam Kingma and Ba (2014) to tune the learning rates, with , and default $\beta$s. We set the mini-batch size to be . Based on our empirical results, this is enough for the model to converge in a reasonable amount of time. For the segmentation and refinement experiments, the training sets are relatively small; we run the program for iterations to ensure that the algorithms have fully converged. For the inpainting experiments, the training set is large; we run the program for iterations.

### 5.2 Semantic Segmentation

In this set of experiments, we use the Weizmann Horse Image Database Borenstein and Ullman (2002), which contains images of horses and their segmentation masks indicating whether pixels are part of horses or not. The training and test sets contain images, respectively. We resize the images and their masks to pixels. We compare our method with non-linear transformations (NLStruct) by Graber et al. (2018) and FCN-VGG Long et al. (2015). We use pixel-wise accuracy and mean intersection-over-union (IOU) as metrics. For c-Glow, we follow Kingma and Dhariwal (2018) to preprocess the masks. That is, we copy each mask three times and tile the copies together, so $y$ has three channels. We find that this transformation can improve the model performance. For c-Glow, we try different parameter settings, and we find that when , and , the model performs the best. We further discuss these parameter settings in the appendix.
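For reference, the two metrics can be computed as in the sketch below for flattened binary masks. Averaging the IOU over the two classes (background and horse) is an assumption made here; the paper does not spell out how its mean IOU is aggregated.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)

def mean_iou(pred, target, num_classes=2):
    """Intersection-over-union per class, averaged over classes that occur."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
accuracy = pixel_accuracy(pred, target)
iou = mean_iou(pred, target)
```

In this toy example four of six pixels agree, and each class has two pixels in the intersection and four in the union.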

#### 5.2.1 Segmentation Results

We compare c-Glow with FCN-VGG (we use code from https://github.com/wkentaro/pytorch-fcn) and NLStruct. We list the scores on test images in Table 1. For NLStruct, we use the performance reported by Graber et al. (2018). In their experiments, they set the input image size to be , and the mask size to be , which is slightly different from our setting. Since they did not report the accuracy of NLStruct, we leave that cell blank. Our c-Glow model generates higher quality segmentations than these other approaches. Figure 2 shows some segmentation results.

| Metric | c-Glow | NLStruct | FCN-VGG |
|---|---|---|---|
| Accuracy | 0.927 | - | 0.850 |
| IOU | 0.830 | 0.755 | 0.670 |

#### 5.2.2 Conditional Samples of Segmentations

Given an input , an important task is generating diverse possible predictions. Doing so requires the model to have the ability to draw conditional samples, i.e., . Unlike many structured output learning models, we can directly use the generative process, i.e., Equation 2, to generate conditional samples exactly from the learned distribution. Figure 3 contains examples of these samples of segmentations for the horse image data. The differences among the samples illustrate the uncertainty of the model, and since each sample is drawn from a joint distribution over all pixels, they retain some non-local characteristics of reasonable segmentations.

### 5.3 Denoising for Depth Refinement

In this set of experiments, we use the seven scenes dataset Newcombe et al. (2011), which contains noisy depth maps of natural scenes. The task is to denoise the depth maps. We use the same method as Wang et al. (2016) to process the dataset. We train our model on images from the Chess scene and test on images from other scenes. The images are randomly cropped to . We use peak signal-to-noise ratio (PSNR) as our metric, where higher PSNR is better. We compare c-Glow with ProximalNet Wang et al. (2016), FilterForest Ryan Fanello et al. (2014), and BM3D Dabov et al. (2007).
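PSNR follows directly from the mean squared error between the refined and the clean depth map. The sketch below assumes depth values scaled to a known peak `max_value`, which is an assumption about preprocessing rather than a detail given in the paper.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

score = psnr([0.5, 0.5], [0.4, 0.6])
```

Here the mean squared error is 0.01 for a peak value of 1, giving 20 dB; each further tenfold reduction in error adds 10 dB.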

We use the same parameter setting as that in the segmentation experiments, i.e, , and . We list the metric scores in Table 2. Our c-Glow performs comparably to these other recent methods. Figure 4 shows some sampled refined images.

| Metric | c-Glow | ProximalNet | FilterForest | BM3D |
|---|---|---|---|---|
| PSNR | 36.43 | 36.31 | 35.63 | 35.46 |

### 5.4 Image Inpainting

Inferring parts of images that are censored or occluded requires modeling of the structure of dependencies across pixels. In this set of experiments, we test c-Glow on the task of inpainting censored images from the CelebA dataset Liu et al. (2015). We randomly select 2,000 images as our test set. We centrally crop the images and resize them to pixels. We use central block masks such that of the pixels are hidden from the input. We use the same parameter settings as the two previous sets of experiments. For training the model, we set the features to be the occluded images and the labels to be the center region that needs to be inpainted. Figure 5 shows some inpainting results. Note that our model is a general-purpose structured prediction model, so we are not able to obtain state-of-the-art results. Models like inpainting GANs Yeh et al. (2017); Yu et al. (2018) can generate better images, because they are specifically designed for image inpainting and use techniques like Poisson blending and contextual attention to improve image quality. However, c-Glow can still generate reasonable inpainted images that appear to capture face shape and skin tone. These experiments demonstrate the flexibility of the general-purpose c-Glow architecture for diverse structured output tasks.
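The construction of training pairs for inpainting, with the occluded image as the feature and the hidden center as the label, can be sketched as follows. The 8×8 image, the 4×4 block, and the choice to zero out hidden pixels are all hypothetical; the image size and hidden fraction used in the experiments are not reproduced here.

```python
def occlude_center(image, block):
    """Split an h x w image (list of rows) into (occluded input, hidden center block)."""
    h, w = len(image), len(image[0])
    top, left = (h - block) // 2, (w - block) // 2
    center = [row[left:left + block] for row in image[top:top + block]]
    occluded = [
        [0.0 if top <= i < top + block and left <= j < left + block else image[i][j]
         for j in range(w)]
        for i in range(h)
    ]
    return occluded, center

image = [[float(i * 8 + j + 1) for j in range(8)] for i in range(8)]
features, label = occlude_center(image, block=4)
hidden_fraction = (4 * 4) / (8 * 8)
```

The model then learns the conditional distribution of `label` given `features`, so sampling yields several plausible completions of the same occluded image.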

## 6 Conclusion

In this paper, we propose conditional generative flows (c-Glow), which are flow-based conditional generative models for structured output learning. The model uses the change-of-variables formula to compute the conditional likelihood of high-dimensional variables exactly. We show how to convert the Glow model to a conditional form by incorporating conditioning networks. In contrast with existing deep structured models, our model can be trained by directly minimizing the exact negative log-likelihood, so it does not need a surrogate objective or approximate inference. With a learned model, we can efficiently draw conditional samples from the exact learned distribution. In our experiments, we test c-Glow on image segmentation, denoising, and inpainting, finding that c-Glow can generate reasonable conditional samples and that its predictive abilities are comparable to recent deep structured prediction approaches.

## Acknowledgments

We thank NVIDIA for their support through the GPU Grant Program and Amazon for their support via the AWS Cloud Credits for Research program.

## References

- Belanger and McCallum (2016) David Belanger and Andrew McCallum. Structured prediction energy networks. In International Conference on Machine Learning, pages 983–992, 2016.
- Belanger et al. (2017) David Belanger, Bishan Yang, and Andrew McCallum. End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 429–439, 2017.
- Borenstein and Ullman (2002) Eran Borenstein and Shimon Ullman. Class-specific, top-down segmentation. In European conference on computer vision, pages 109–122. Springer, 2002.
- Chen et al. (2015) Liang-Chieh Chen, Alexander Schwing, Alan Yuille, and Raquel Urtasun. Learning deep structured models. In International Conference on Machine Learning, pages 1785–1794, 2015.
- Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
- Dabov et al. (2007) K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, Aug 2007.
- Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.
- Graber et al. (2018) Colin Graber, Ofer Meshi, and Alexander Schwing. Deep structured prediction with nonlinear output transformations. In Advances in Neural Information Processing Systems, pages 6320–6331, 2018.
- Grathwohl et al. (2018) Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
- Gygli et al. (2017) Michael Gygli, Mohammad Norouzi, and Anelia Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1341–1351, 2017.
- Ho et al. (2019) Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.
- Huang et al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma and Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
- Kingma et al. (2016) Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
- Koller et al. (2009) Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
- Krähenbühl and Koltun (2011) Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- Ma and Hovy (2019) Xuezhe Ma and Eduard Hovy. MaCow: Masked convolutional generative flow. arXiv preprint arXiv:1902.04208, 2019.
- Newcombe et al. (2011) Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, volume 11, pages 127–136, 2011.
- Nowozin and Lampert (2011) Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6:185–365, 2011.
- Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
- Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Ryan Fanello et al. (2014) Sean Ryan Fanello, Cem Keskin, Pushmeet Kohli, Shahram Izadi, Jamie Shotton, Antonio Criminisi, Ugo Pattacini, and Tim Paek. Filter forests for learning data-dependent convolutional kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1709–1716, 2014.
- Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
- Trippe and Turner (2018) Brian L Trippe and Richard E Turner. Conditional density estimation with Bayesian normalising flows. arXiv preprint arXiv:1802.04908, 2018.
- Tu and Gimpel (2018) Lifu Tu and Kevin Gimpel. Learning approximate inference networks for structured prediction. arXiv preprint arXiv:1803.03376, 2018.
- Wainwright and Jordan (2008) Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.
- Wang et al. (2016) Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Proximal deep structured models. In Advances in Neural Information Processing Systems, pages 865–873, 2016.
- Yeh et al. (2017) Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.
- Yu et al. (2018) Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
- Zheng et al. (2015) Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
- Ziegler and Rush (2019) Zachary M Ziegler and Alexander M Rush. Latent normalizing flows for discrete sequences. arXiv preprint arXiv:1901.10548, 2019.

## Appendix A Detailed Experiment Settings

In this section, we introduce the detailed settings in our experiments.

### A.1 Network Architectures

Figure 6 illustrates the networks we use to generate weights in our experiments. For each layer except for the last layer, we use ReLU to activate the output. As in Glow, we use zero initialization for each layer. That is, we initialize the weights of each layer to be zero.

### A.2 Parameter Selection

We vary the parameters of c-Glow, i.e., , , and , and test their performances on the Horse test set. For each group of parameters, we run the program for iterations to guarantee the program is fully converged. We then calculate the accuracy and IOU for each model. The results of trials with various values are in Table 3. Though model performance is not very sensitive to these parameters, the small model, i.e., the first row in the table, works slightly better. We believe that this is because large models are more prone to overfit the training set. From the results, when , , and , the model gets the best IOU. In our experiments of depth refinement and image inpainting, we use the same setting of parameters.

| | | | Accuracy | IOU |
|---|---|---|---|---|
| 8 | 8 | 32 | 0.927 | 0.830 |
| 8 | 32 | 96 | 0.922 | 0.822 |
| 8 | 64 | 128 | 0.922 | 0.821 |
| 16 | 8 | 32 | 0.920 | 0.817 |
| 32 | 8 | 32 | 0.920 | 0.815 |

### A.3 Conditional Likelihoods

To the best of our knowledge, c-Glow is the first deep structured prediction model whose exact likelihood is tractable. Figure 7 plots the evolution of minibatch negative log likelihoods during training. Since c-Glow learns a continuous density, the negative log likelihoods can become negative as the model fits the data distribution better.
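The remark about negative values can be checked with a one-dimensional Gaussian: a density, unlike a probability mass, can exceed 1, so its negative log can drop below zero once the model concentrates tightly around the data. The numbers below are purely illustrative.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log density of y under N(mu, sigma^2)."""
    return 0.5 * ((y - mu) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)

wide = gaussian_nll(0.0, 0.0, 1.0)    # density about 0.399, so the NLL is positive
narrow = gaussian_nll(0.0, 0.0, 0.1)  # density about 3.99, so the NLL is negative
```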

### A.4 Conditional Samples of Scenes

Figure 8 shows conditional samples of different scenes. Note that the images are gray-scale, so the differences among samples are hard to distinguish.

### A.5 More Results of CelebA

Figure 9 shows more conditional samples and image inpainting results on CelebA data.