Generative flows are attractive because they admit exact likelihood optimization and efficient image synthesis. Recently, Kingma & Dhariwal (2018) demonstrated with Glow that generative flows are capable of generating high quality images. We generalize the $1 \times 1$ convolutions proposed in Glow to invertible $d \times d$ convolutions, which are more flexible since they operate on both channel and spatial axes. We propose two methods to produce invertible convolutions that have receptive fields identical to standard convolutions: emerging convolutions are obtained by chaining specific autoregressive convolutions, and periodic convolutions are decoupled in the frequency domain. Our experiments show that the flexibility of $d \times d$ convolutions significantly improves the performance of generative flow models on galaxy images, CIFAR10 and ImageNet.
Emerging Convolutions for Generative Normalizing Flows
Emiel Hoogeboom, Rianne van den Berg, Max Welling
Preprint. Work in progress.
Generative models aim to learn a representation of the data $p(x)$, in contrast with discriminative models that learn a probability distribution of labels given data $p(y \mid x)$. Generative modeling may be used for numerous applications such as anomaly detection, denoising, inpainting, and super-resolution. The task of generative modeling is challenging, because data is often very high-dimensional, which makes optimization and choosing a successful objective difficult.
Generative flow methods have several advantages over other generative models: i) They optimize the log likelihood of a continuous distribution exactly, as opposed to Variational Auto-Encoders (VAEs) (Kingma & Welling, 2014) which optimize a lower bound to the log-likelihood. ii) Drawing samples has a computational cost comparable to inference, in contrast with Pixel CNNs (Van Oord et al., 2016). iii) Generative flows also have the potential for huge memory savings, because activations necessary in the backward pass can be obtained by computing the inverse of layers (Gomez et al., 2017; Li & Grathwohl, 2018).
The performance of density estimation models can be largely attributed to Masked Autoregressive Flows (MAFs) (Papamakarios et al., 2017) and coupling layers (Dinh et al., 2017). MAFs contain flexible autoregressive transformations, but are computationally expensive to invert, which is a disadvantage for sampling high-dimensional data. Coupling layers transform a subset of the dimensions of the data, parameterized by the remaining dimensions. The inverse of coupling layers is straightforward to compute, which makes them suitable for generative flows. However, since coupling layers can only operate on a subset of the dimensions of the data, they may be limited in flexibility.
To improve their effectiveness, coupling layers are alternated with less complex transformations that do operate on all dimensions of the data. Dinh et al. (2017) use a fixed channel permutation in Real NVP, and Kingma & Dhariwal (2018) utilize learnable $1 \times 1$ convolutions in Glow.
However, $1 \times 1$ convolutions suffer from limited flexibility, and using standard convolutions is not straightforward as they are very computationally expensive to invert. We propose two methods to obtain easily invertible and flexible $d \times d$ convolutions: emerging and periodic convolutions. Both of these convolutions have receptive fields identical to standard convolutions, resulting in flexible transformations over both the channel and spatial axes.
The structure of an emerging convolution is depicted in Figure 1, where the top depicts the convolution filters, and the bottom shows the equivalent matrices of these convolutions. Two autoregressive convolutions are chained to obtain an emerging receptive field identical to a standard convolution. Empirically, we find that replacing $1 \times 1$ convolutions with the generalized invertible convolutions produces significantly better results on galaxy images, CIFAR10 and ImageNet, even when correcting for the increase in parameters.
In addition to invertible $d \times d$ convolutions, we also propose a QR decomposition for $1 \times 1$ convolutions, which resolves flexibility issues of the PLU decomposition proposed by Kingma & Dhariwal (2018).
The main contributions of this paper are listed below:
- Invertible emerging convolutions using autoregressive convolutions.
- Invertible periodic convolutions using decoupling in the frequency domain.
- Numerically stable and flexible $1 \times 1$ convolutions parameterized by a QR decomposition.
- An accelerated inversion module for autoregressive convolutions.
This paper is structured as follows: we first provide background information related to generative flows, then present our methods, discuss related work, and finally describe our results.
Table 1: Overview of several generative flows (columns: Generative Flow | Function | Inverse | Log Determinant).
Consider a bijective map $f: \mathcal{X} \to \mathcal{Z}$ between variables $x$ and $z$. The likelihood of the variable $x$ can be written as the likelihood of the transformation $z = f(x)$ evaluated by $p_Z$, using the change of variables formula:
$$p_X(x) = p_Z(z) \left| \frac{\partial z}{\partial x} \right|, \quad z = f(x). \tag{1}$$
The complicated probability density $p_X(x)$ is equal to the probability density $p_Z(z)$ multiplied by the absolute value of the Jacobian determinant, where $p_Z$ is chosen to be tractable. The function $f$ can be learned, but the choice of $f$ is constrained by two practical issues: Firstly, the Jacobian determinant should be tractable. Secondly, to draw samples from $p_X$, the inverse of $f$ should be tractable.
In machine learning, likelihoods are generally optimized in log-space for numerical precision, and deep learning models are structured in layers. Let $z_1, \ldots, z_{L-1}$ be the intermediate representations produced by the network layers, where $z_0 = x$ and $z_L = z$. The log-likelihood of $x$ is written as the log-likelihood of $z$, and the summation of the log Jacobian determinant of each layer:
$$\log p_X(x) = \log p_Z(z) + \sum_{l=1}^{L} \log \left| \frac{\partial z_l}{\partial z_{l-1}} \right|. \tag{2}$$
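The layer-wise accumulation of log Jacobian determinants can be illustrated with a minimal NumPy sketch. It assumes elementwise affine layers and a factorized standard normal base density; both are choices made only for this illustration, and the names `flow_log_likelihood` and `standard_normal_logpdf` are hypothetical.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of a factorized standard normal, summed over dimensions."""
    return float(np.sum(-0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)))

def flow_log_likelihood(x, layers):
    """Return log p_Z(z) plus the sum of per-layer log |det Jacobian|.

    Assumption for this sketch: each layer is the elementwise affine
    bijection z_l = scale * z_{l-1} + shift, given as a (scale, shift) pair.
    """
    z = np.asarray(x, dtype=float)
    total_logdet = 0.0
    for scale, shift in layers:
        z = scale * z + shift
        # The Jacobian of an elementwise map is diagonal with entries `scale`.
        total_logdet += float(np.sum(np.log(np.abs(scale))))
    return standard_normal_logpdf(z) + total_logdet

layers = [
    (np.array([2.0, 0.5]), np.array([1.0, -1.0])),
    (np.array([0.25, 3.0]), np.array([0.0, 2.0])),
]
x = np.array([0.3, -1.2])
log_px = flow_log_likelihood(x, layers)
```

Because the two affine layers compose into a single affine map, the result can be checked against the change of variables formula applied once to the composed map.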
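We will evaluate our methods with experiments on image datasets, where pixels are discrete-valued from 0 to 255. Since generative flows are continuous density models, they may trivially place infinite mass on discretized bin locations. Therefore, we use the definition of Theis et al. (2016) that defines the relation between a discrete model $P(x)$ and a continuous model $p_X(x)$ as an integration over bins: $P(x) = \int_{[0,1)^n} p_X(x + u)\, du$, where $u \in [0,1)^n$ and $n$ is the number of dimensions of $x$. They further derive a lower bound to optimize this model with Jensen's inequality, resulting in additive uniform noise for the integer valued pixels from the data distribution $\mathcal{D}$:
$$\mathbb{E}_{x \sim \mathcal{D}} \left[ \log P(x) \right] = \mathbb{E}_{x \sim \mathcal{D}} \left[ \log \int_{[0,1)^n} p_X(x + u)\, du \right] \geq \mathbb{E}_{x \sim \mathcal{D},\, u \sim \mathcal{U}[0,1)^n} \left[ \log p_X(x + u) \right]. \tag{3}$$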
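As a hedged illustration of the uniform dequantization described above, the sketch below adds noise $u \sim \mathcal{U}[0,1)$ to integer pixels and converts a log-likelihood in nats to bits per dimension. The helper names `dequantize` and `bits_per_dim` are hypothetical, and the constant density over $[0, 256)^n$ is used only as a sanity check.

```python
import numpy as np

def dequantize(pixels, rng):
    """Add uniform noise u ~ U[0, 1) to integer-valued pixels."""
    noise = rng.uniform(0.0, 1.0, size=pixels.shape)
    return pixels.astype(np.float64) + noise

def bits_per_dim(log_px_nats, num_dims):
    """Convert a log-likelihood in nats to negative bits per dimension."""
    return -log_px_nats / (num_dims * np.log(2.0))

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 4, 4))   # toy image, values in 0..255
noisy = dequantize(pixels, rng)

# Sanity check: a uniform density over [0, 256)^n costs 8 bits per dimension.
n = pixels.size
uniform_bpd = bits_per_dim(-n * np.log(256.0), n)
```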
Generative flows are bijective functions, often structured as deep learning layers, that are designed to have tractable Jacobian determinants and inverses. An overview of several generative flows is provided in Table 1, and a description is given below:
Coupling layers (Dinh et al., 2017) split the input in two parts. The output is a combination of a copy of the first half, and a transformation of the second half, parametrized by the first part. As a result, the inverse and Jacobian determinant are straightforward to compute.
Actnorm layers (Kingma & Dhariwal, 2018) are data dependent initialized layers with scale and translation parameters. They are initialized such that the distribution of activations has mean zero and standard deviation one. Actnorm layers improve training stability and performance.
$1 \times 1$ convolutions (Kingma & Dhariwal, 2018) are easy to invert, and can be seen as a generalization of the permutation operations that were used by Dinh et al. (2017). $1 \times 1$ convolutions improve the effectiveness of the coupling layers.
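To make the coupling layer description concrete, the following sketch implements an affine coupling layer on a flat vector. The affine form and the function `toy_network` are assumptions for illustration only; in practice the scale and shift come from a learned neural network.

```python
import numpy as np

def toy_network(x_a):
    """Hypothetical stand-in for the neural network inside a coupling layer."""
    log_scale = np.tanh(x_a)
    shift = 0.5 * x_a
    return log_scale, shift

def coupling_forward(x):
    """Copy the first half and transform the second half, conditioned on the first."""
    x_a, x_b = np.split(x, 2)
    log_scale, shift = toy_network(x_a)
    z_b = x_b * np.exp(log_scale) + shift
    # The Jacobian is triangular, so its log determinant is a simple sum.
    logdet = float(np.sum(log_scale))
    return np.concatenate([x_a, z_b]), logdet

def coupling_inverse(z):
    """Invert the layer: the copied half makes scale and shift recomputable."""
    z_a, z_b = np.split(z, 2)
    log_scale, shift = toy_network(z_a)
    x_b = (z_b - shift) * np.exp(-log_scale)
    return np.concatenate([z_a, x_b])

x = np.array([0.2, -1.0, 0.7, 1.5])
z, logdet = coupling_forward(x)
x_reconstructed = coupling_inverse(z)
```

The inverse never needs to invert `toy_network` itself, which is why the network inside a coupling layer can be arbitrarily complex.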
Generally, a convolution layer (see footnote 1) with filter $k$ and input $x$ is equivalent to the multiplication of $K$, a $hwc_{out} \times hwc_{in}$ matrix, and a vectorized input $\vec{x}$. An example of a single channel convolution and its equivalent matrix is depicted in Figure 2. The signals $x$ and $z$ are indexed as $i + w \cdot j$, where $i$ is the width index, $j$ is the height index, and $w$ is the total width. Note that the matrix becomes sparser as the image dimensions grow and that the parameters of the filter $k$ occur repeatedly in the matrix $K$. A two-channel convolution is visualized in Figure 3, where we have omitted parameters inside filters to avoid clutter. Here, $x$ and $z$ are vectorized using indexing $\gamma + c \cdot (i + w \cdot j)$, where $\gamma$ denotes the channel index and $c$ the number of channels.
Footnote 1: Note that in deep learning frameworks, convolutions are often actually cross-correlations. In our equations $\star$ denotes a cross-correlation and $*$ denotes a convolution. In addition, a convolution layer is usually implemented as an aggregation of cross-correlations, i.e. a cross-correlation layer. We denote such a layer with the symbol $\star_l$. In the main text we may omit these details, and these operations are referred to as convolutions.
Using standard convolutions as a generative flow is inefficient. The determinant and inverse can be obtained naïvely by operating directly on the corresponding matrix, but this would be very expensive, corresponding to computational complexity $\mathcal{O}(h^3 w^3 c^3)$.
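The equivalence between a convolution and a matrix multiplication can be checked numerically. The sketch below is a single-channel illustration under the assumptions of zero padding and an odd square filter; it builds the dense matrix column by column, which is exactly the naïve and expensive route described above. The function names are hypothetical.

```python
import numpy as np

def cross_correlate_same(x, k):
    """Zero-padded single-channel cross-correlation with an odd square filter."""
    h, w = x.shape
    d = k.shape[0]
    p = d // 2
    xp = np.pad(x, p)  # zero padding on all sides
    z = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            z[i, j] = np.sum(k * xp[i:i + d, j:j + d])
    return z

def equivalent_matrix(k, h, w):
    """Build K such that K @ x.ravel() equals cross_correlate_same(x, k).ravel()."""
    K = np.zeros((h * w, h * w))
    for idx in range(h * w):
        basis = np.zeros(h * w)
        basis[idx] = 1.0
        # Each column of K is the response to one unit impulse.
        K[:, idx] = cross_correlate_same(basis.reshape(h, w), k).ravel()
    return K

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
k = rng.normal(size=(3, 3))
K = equivalent_matrix(k, 4, 5)
z_matrix = (K @ x.ravel()).reshape(4, 5)
z_direct = cross_correlate_same(x, k)
```

Row-major `ravel()` places the entry at width index $i$ and height index $j$ at position $i + w \cdot j$, matching the single-channel indexing above.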
Autoregressive convolutions have been widely used in the field of normalizing flows (Germain et al., 2015; Kingma et al., 2016) because it is straightforward to compute their Jacobian determinant. Although there exist autoregressive convolutions with different input and output dimensions, we let $c_{out} = c_{in}$ for invertibility. In this case, autoregressive convolutions can be expressed as a multiplication between a triangular weight matrix and a vectorized input.
In practice, a filter $k = \mathbf{w} \odot m$ is constructed from weights $\mathbf{w}$ and a binary mask $m$ that enforces the autoregressive structure (see Figure 4). The convolution with the masked filter is autoregressive without the need to mask inputs, which allows parallel computation of the convolution layer:
$$z = (\mathbf{w} \odot m) \star_l x = k \star_l x, \tag{4}$$
where $\star_l$ denotes a convolution layer (see footnote 1). The matrix multiplication $\vec{z} = K \vec{x}$ produces the equivalent result, where $\vec{z}$ and $\vec{x}$ are the vectorized signals, and $K$ is a sparse triangular matrix constructed from $k$ (see Figure 4). The Jacobian is triangular by design and its determinant can be computed in $\mathcal{O}(c)$ since it only depends on the diagonal elements of the matrix $K$:
$$\log \left| \det \frac{\partial z}{\partial x} \right| = h \cdot w \cdot \sum_{c} \log \left| k_{c,c,m_y,m_x} \right|, \tag{5}$$
where index $c$ denotes the channel and $(m_y, m_x)$ denotes the spatial center of the filter. The inverse of an autoregressive convolution can theoretically be computed using $\vec{x} = K^{-1} \vec{z}$. In reality this matrix is large and impractical to invert. Since $K$ is triangular, the solution for $\vec{x}$ can be found through forward substitution:
$$x_t = \frac{z_t - \sum_{i=0}^{t-1} K_{t,i}\, x_i}{K_{t,t}}. \tag{6}$$
The inverse can be computed by sequentially traversing through the input feature map in the imposed autoregressive order. The computational complexity of the inverse is $\mathcal{O}(h w c^2)$ and computation can be parallelized across examples in the minibatch.
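The masking, log determinant, and forward substitution described above can be sketched in NumPy as follows. This is an illustration under stated assumptions (raster-scan order, zero padding, plain Python loops), not the accelerated Cython module used in the experiments, and all function names are hypothetical.

```python
import numpy as np

def autoregressive_mask(c, d):
    """Binary mask for a (c, c, d, d) filter in raster-scan order."""
    m = d // 2
    mask = np.zeros((c, c, d, d))
    mask[:, :, :m, :] = 1.0                       # rows above the center
    mask[:, :, m, :m] = 1.0                       # same row, left of the center
    mask[:, :, m, m] = np.tril(np.ones((c, c)))   # center: earlier or same channel
    return mask

def conv_layer(x, k):
    """Zero-padded cross-correlation layer; x is (c, h, w), k is (c, c, d, d)."""
    c, h, w = x.shape
    d = k.shape[-1]
    m = d // 2
    xp = np.pad(x, ((0, 0), (m, m), (m, m)))
    z = np.zeros((c, h, w))
    for i in range(h):
        for j in range(w):
            z[:, i, j] = np.tensordot(k, xp[:, i:i + d, j:j + d], axes=3)
    return z

def log_det(k, h, w):
    """h * w * sum_c log |k[c, c, m_y, m_x]|, using only the diagonal of K."""
    m = k.shape[-1] // 2
    return h * w * float(np.sum(np.log(np.abs(np.diagonal(k[:, :, m, m])))))

def inverse(z, k):
    """Forward substitution, traversing pixels and channels in raster order."""
    c, h, w = z.shape
    d = k.shape[-1]
    m = d // 2
    xp = np.zeros((c, h + 2 * m, w + 2 * m))  # zero-padded solution buffer
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                # Entries not solved yet are still zero, so this sums known terms only.
                known = np.sum(k[ch] * xp[:, i:i + d, j:j + d])
                xp[ch, i + m, j + m] = (z[ch, i, j] - known) / k[ch, ch, m, m]
    return xp[:, m:m + h, m:m + w]

rng = np.random.default_rng(0)
c, h, w, d = 2, 4, 5, 3
k = 0.1 * rng.normal(size=(c, c, d, d)) * autoregressive_mask(c, d)
k[np.arange(c), np.arange(c), d // 2, d // 2] = [1.5, -0.8]  # nonzero diagonal
x = rng.normal(size=(c, h, w))
z = conv_layer(x, k)
x_reconstructed = inverse(z, k)
```

The forward pass is a single parallel convolution, whereas the inverse visits every position sequentially, which is the cost that an accelerated inversion module aims to reduce.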
We present two methods to generalize $1 \times 1$ convolutions to invertible $d \times d$ convolutions, improving the flexibility of generative flow models. Emerging convolutions are obtained by chaining autoregressive convolutions, and periodic convolutions are decoupled in the frequency domain. In addition, we provide a stable and flexible parameterization for invertible $1 \times 1$ convolutions.
Although autoregressive convolutions are invertible, their transformation is restricted by the imposed autoregressive order, enforced through masking of the filters (as depicted in Figure 4). To alleviate this restriction, we propose emerging convolutions, which are more flexible and nevertheless invertible. Emerging convolutions are obtained by chaining specific autoregressive convolutions, invertible via the autoregressive inverses. To some extent this resembles the combination of stacks used to resolve the blind spot problem in conditional image modeling with PixelCNNs (van den Oord et al., 2016), with the important difference that we do not constrain the resulting convolution itself to be autoregressive.
The emerging receptive field can be controlled by chaining autoregressive convolutions with variations in the imposed order. A collection of achievable receptive fields for emerging convolutions is depicted in Figure 5, based on commonly used autoregressive masking.
The autoregressive inverse requires the solution to a sequential problem, and as a result, it inevitably suffers some additional computational cost. In emerging convolutions we minimize this cost through the use of an accelerated parallel inversion module, implemented in Cython, and by maintaining relatively small dimensionality in the emerging convolutions compared to the internal size of coupling layers.
Deep learning applications tend to use square filters, and libraries are specifically optimized for these shapes. Since most of the receptive fields in Figure 5 are unusually shaped, these would require masking to fit them in rectangular arrays, leading to unnecessary computation.
However, there is a special case in which the emerging receptive field of two specific autoregressive convolutions is identical to a standard convolution. These square emerging convolutions can be obtained by combining off center square convolutions, depicted in the bottom row of Figure 5 (also Figure 1). Our square emerging convolution filters are more efficient since they require fewer masked values in rectangular arrays.
There are two approaches to efficiently compute square emerging convolutions during optimization and density estimation. Either a $d \times d$ emerging convolution is expressed as two smaller consecutive $\frac{d+1}{2} \times \frac{d+1}{2}$ convolutions, or the order of convolution is changed: first the smaller filters ($k_2$ and $k_1$) are convolved to obtain a single equivalent convolution filter. Then, the output of the emerging convolution is obtained by convolving the equivalent filter, $k = k_2 * k_1$, with the feature map $f$, where $\star$ denotes a cross-correlation and $*$ denotes a convolution:
$$k_2 \star (k_1 \star f) = (k_2 * k_1) \star f. \tag{7}$$
This equivalence follows from the associativity of convolutions and the time reversal of real discrete signals in cross-correlations.
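The identity $k_2 \star (k_1 \star f) = (k_2 * k_1) \star f$ can be verified numerically. The sketch below is a single-channel illustration that uses "valid" cross-correlations, so that boundary handling does not enter the comparison; the treatment of zero padding in a full layer is not covered here, and the function names are hypothetical.

```python
import numpy as np

def cross_correlate_valid(f, k):
    """Valid-mode 2D cross-correlation: out[n] = sum_m k[m] * f[n + m]."""
    kh, kw = k.shape
    out_h, out_w = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for a in range(kh):
        for b in range(kw):
            out += k[a, b] * f[a:a + out_h, b:b + out_w]
    return out

def convolve_full(k2, k1):
    """Full 2D convolution of two filters: out[n] = sum_a k2[a] * k1[n - a]."""
    h2, w2 = k2.shape
    h1, w1 = k1.shape
    out = np.zeros((h1 + h2 - 1, w1 + w2 - 1))
    for a in range(h2):
        for b in range(w2):
            out[a:a + h1, b:b + w1] += k2[a, b] * k1
    return out

rng = np.random.default_rng(0)
f = rng.normal(size=(6, 6))
k1 = rng.normal(size=(2, 2))
k2 = rng.normal(size=(2, 2))

chained = cross_correlate_valid(cross_correlate_valid(f, k1), k2)
k_equivalent = convolve_full(k2, k1)   # two 2 x 2 filters give one 3 x 3 filter
combined = cross_correlate_valid(f, k_equivalent)
```

Two $2 \times 2$ filters combine into a single $3 \times 3$ filter, so one larger convolution replaces two smaller ones in the forward direction.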
When $d = 1$, two autoregressive convolutions simplify to an LU decomposed $1 \times 1$ convolution. To ensure that emerging convolutions are flexible, we use emerging convolutions that consist of a single $1 \times 1$ convolution and two square autoregressive convolutions with different masking, as depicted in the bottom row of Figure 1. Again, the individual convolutions may all be combined into a single emerging convolution filter using the associativity of convolutions (Equation 7).
In some cases, data may be periodic or boundaries may contain roughly the same values. In these cases it may be advantageous to use invertible periodic convolutions, which assume that boundaries wrap around. When computed in the frequency domain, this alternative convolution has a tractable Jacobian determinant and inverse. The method leverages the convolution theorem, which states that the Fourier transform of a convolution is given by the element-wise product of the Fourier transformed signals. Specifically, the input and filter are transformed using the Discrete Fourier Transform (DFT) and multiplied element-wise, after which the inverse DFT is taken. By considering the transformation in the frequency domain, the computational complexity of the Jacobian determinant and the inverse are considerably reduced. In contrast with emerging convolutions, which are very specifically parameterized, the filters of periodic convolutions are completely unconstrained.
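A standard convolution layer in deep learning is conventionally implemented as an aggregation of cross-correlations for every output channel. The convolution layer $\star_l$ with input $x$ and filter $k$ outputs the feature map $z = k \star_l x$, which is computed as:
$$z_{c_{out}} = \sum_{c_{in}} k_{c_{out},c_{in}} \star x_{c_{in}}. \tag{8}$$
Let $\mathcal{F}(\cdot)$ denote the Fourier transform and let $\mathcal{F}^{-1}(\cdot)$ denote the inverse Fourier transform. The Fourier transform can be moved inside the channel summation, since it is distributive over addition. Because a convolution differs from a cross-correlation by a time reversal for real signals, let $\tilde{k}_{c_{out},c_{in}}$ denote the reflection of filter $k_{c_{out},c_{in}}$ in both spatial directions. Let $\hat{z}_{c_{out}} = \mathcal{F}(z_{c_{out}})$, $\hat{x}_{c_{in}} = \mathcal{F}(x_{c_{in}})$ and $\hat{k}_{c_{out},c_{in}} = \mathcal{F}(\tilde{k}_{c_{out},c_{in}})$, which are indexed by frequencies $u$ and $v$. Using these definitions, each cross-correlation is written as an element-wise multiplication in the frequency domain:
$$\hat{z}_{c_{out}} = \sum_{c_{in}} \hat{k}_{c_{out},c_{in}} \odot \hat{x}_{c_{in}}, \tag{9}$$
which can be written as a sum of products in scalar form:
$$\hat{z}_{c_{out},uv} = \sum_{c_{in}} \hat{k}_{c_{out},c_{in},uv} \cdot \hat{x}_{c_{in},uv}. \tag{10}$$
The summation of multiplications can be reformulated as a matrix multiplication over the channel axis by viewing the output vector $\hat{z}_{uv}$ at frequency $u, v$ as a multiplication of the matrix $\hat{K}_{uv}$ and the input vector $\hat{x}_{uv}$:
$$\hat{z}_{uv} = \hat{K}_{uv}\, \hat{x}_{uv}. \tag{11}$$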
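The matrix $\hat{K}_{uv}$ has dimensions $c_{out} \times c_{in}$, and the input $\hat{x}_{uv}$ and output $\hat{z}_{uv}$ are vectors with dimension $c_{in}$ and $c_{out}$. The output in the original domain $z_{c_{out}}$ can simply be retrieved by taking the inverse Fourier transform, $\mathcal{F}^{-1}(\hat{z}_{c_{out}})$. The perspective of matrix multiplication in the frequency domain decouples the convolution transformation (see Figure 6). Therefore, the log determinant of a periodic convolution layer is equal to the sum of log determinants of individual frequency components:
$$\log \left| \det \frac{\partial z}{\partial x} \right| = \sum_{u,v} \log \left| \det \hat{K}_{uv} \right|. \tag{12}$$
The determinant remains unchanged by the Fourier transform and its inverse, since these are unitary transformations. The inverse operation requires an inversion of the matrix $\hat{K}_{uv}$ for every frequency $u, v$:
$$\hat{x}_{uv} = \hat{K}_{uv}^{-1}\, \hat{z}_{uv}. \tag{13}$$
The solution of $x$ in the original domain is obtained by the inverse Fourier transform, $x_{c_{in}} = \mathcal{F}^{-1}(\hat{x}_{c_{in}})$, for every channel $c_{in}$.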
Recall that a standard convolution layer is equivalent to a matrix multiplication with a $hwc_{out} \times hwc_{in}$ matrix, where we let $c = c_{out} = c_{in}$ for invertibility. The Fourier transform decouples the transformation of the convolution layer at each frequency, which divides the computation into $hw$ separate matrix multiplications with $c \times c$ matrices. Therefore, the computational cost of the determinant is reduced from $\mathcal{O}(h^3 w^3 c^3)$ to $\mathcal{O}(h w c^3)$ in the frequency domain, and computation can be parallelized since the matrices are independent across frequencies and independent of the data. Furthermore, the inverse matrices $\hat{K}_{uv}^{-1}$ only need to be computed once after the model has converged, which reduces the inverse convolution to an efficient matrix multiplication with computational complexity $\mathcal{O}(h w c^2)$ (see footnote 2).
Footnote 2: The inverse also incurs some overhead due to the Fourier transform of the feature maps, which corresponds to a computational complexity $\mathcal{O}(c h w \log hw)$.
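The frequency-domain view of a periodic convolution can be sketched with NumPy's FFT routines. The code below is a minimal illustration, assuming real-valued filters, an odd filter size, and a filter centered at its middle element; the function names are hypothetical. For real filters, the complex conjugate of the DFT equals the DFT of the reflected filter, which is what the code exploits.

```python
import numpy as np

def filter_to_frequency(k, h, w):
    """Return one (c_out, c_in) matrix per frequency, shape (h, w, c_out, c_in)."""
    c_out, c_in, d, _ = k.shape
    m = d // 2
    kappa = np.zeros((c_out, c_in, h, w))
    for a in range(d):
        for b in range(d):
            # Place each filter tap at its offset relative to the center, wrapping around.
            kappa[:, :, (a - m) % h, (b - m) % w] += k[:, :, a, b]
    # conj(DFT(kappa)) is the DFT of the spatially reflected real filter.
    k_hat = np.conj(np.fft.fft2(kappa))
    return np.transpose(k_hat, (2, 3, 0, 1))

def periodic_forward(x, k):
    """Periodic cross-correlation layer and its log |det Jacobian|."""
    c, h, w = x.shape
    k_hat = filter_to_frequency(k, h, w)
    x_hat = np.transpose(np.fft.fft2(x), (1, 2, 0))[..., None]   # (h, w, c, 1)
    z_hat = np.matmul(k_hat, x_hat)[..., 0]                      # (h, w, c)
    z = np.fft.ifft2(np.transpose(z_hat, (2, 0, 1))).real
    logdet = float(np.sum(np.log(np.abs(np.linalg.det(k_hat)))))
    return z, logdet

def periodic_inverse(z, k):
    """Invert the layer with one small matrix inverse per frequency."""
    c, h, w = z.shape
    k_hat_inv = np.linalg.inv(filter_to_frequency(k, h, w))
    z_hat = np.transpose(np.fft.fft2(z), (1, 2, 0))[..., None]
    x_hat = np.matmul(k_hat_inv, z_hat)[..., 0]
    return np.fft.ifft2(np.transpose(x_hat, (2, 0, 1))).real

rng = np.random.default_rng(0)
c, h, w, d = 2, 4, 4, 3
k = 0.1 * rng.normal(size=(c, c, d, d))
k[np.arange(c), np.arange(c), d // 2, d // 2] += 1.0   # start close to the identity
x = rng.normal(size=(c, h, w))
z, logdet = periodic_forward(x, k)
x_reconstructed = periodic_inverse(z, k)
```

In a trained model, the per-frequency inverses returned by `np.linalg.inv` would be computed once and cached, so that sampling only requires the FFTs and the batched matrix multiplication.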
Standard $1 \times 1$ convolutions are flexible but may be numerically unstable during optimization, causing crashes in the training procedure. Kingma & Dhariwal (2018) propose to learn a PLU decomposition, but since the permutation matrix is fixed during optimization, its flexibility is limited.
In order to resolve the stability issues while retaining the flexibility of the transformation, we propose to use a QR decomposition. Any real square matrix can be decomposed into a multiplication of an orthogonal and a triangular matrix. In a similar fashion to the PLU parametrization, we stabilize the decomposition by choosing $W = Q(R + \mathrm{diag}(s))$, where $Q$ is orthogonal, $R$ is strictly triangular, and elements in $s$ are nonzero. Any $n \times n$ orthogonal matrix can be constructed from at most $n$ Householder reflections through $Q = Q_1 Q_2 \cdots Q_n$, where $Q_i$ is a Householder reflection:
$$Q_i = I - 2\, \frac{v_i v_i^T}{v_i^T v_i}, \tag{14}$$
where $\{v_i\}_{i=1}^{n}$ are learnable parameters. Note that in our case $n = c$. In practice, arbitrary flexibility of $Q$ may be redundant, and we can trade off computational complexity and flexibility by using a smaller number of Householder reflections. The log determinant of the QR decomposition is $h \cdot w \cdot \sum_i \log |s_i|$ and can be computed in $\mathcal{O}(c)$. The computational complexity to construct $Q$ is between $\mathcal{O}(c^2)$ and $\mathcal{O}(c^3)$ depending on the desired flexibility. The QR parametrization has two main advantages: in contrast with the straightforward parameterization it is numerically stable, and it can be completely flexible in contrast with the PLU parametrization.
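A minimal sketch of the QR parametrization is given below. It assumes that the strictly triangular matrix is upper triangular and that the parameters are plain NumPy arrays; the function names are hypothetical. The log determinant only requires the vector $s$, because each Householder reflection has determinant $-1$.

```python
import numpy as np

def householder_product(vs):
    """Q = Q_1 Q_2 ... Q_n with Q_i = I - 2 v_i v_i^T / (v_i^T v_i)."""
    n = vs.shape[1]
    q = np.eye(n)
    for v in vs:
        q = q @ (np.eye(n) - 2.0 * np.outer(v, v) / np.dot(v, v))
    return q

def qr_weight(vs, r_raw, s):
    """W = Q (R + diag(s)), where R is the strictly upper triangular part of r_raw."""
    r = np.triu(r_raw, k=1)
    return householder_product(vs) @ (r + np.diag(s))

def qr_log_det(s, h, w):
    """Log |det Jacobian| of the 1 x 1 convolution applied at h * w positions."""
    return h * w * float(np.sum(np.log(np.abs(s))))

rng = np.random.default_rng(0)
c = 4
vs = rng.normal(size=(c, c))          # one reflection vector per row
r_raw = rng.normal(size=(c, c))
s = np.array([1.2, -0.7, 0.9, 2.0])   # nonzero diagonal entries
Q = householder_product(vs)
W = qr_weight(vs, r_raw, s)
```

Passing fewer rows in `vs` uses fewer reflections, which corresponds to the trade-off between flexibility and computational cost mentioned above.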
The field of generative modeling has been approached from several directions. This work mainly builds upon generative flow methods developed by Dinh et al. (2017) and Kingma & Dhariwal (2018). Another type of normalizing flow (Papamakarios et al., 2017) also uses autoregressive convolutions for density estimation, but both its depth and number of channels make drawing samples computationally expensive. Our method is compared to these previous works in our experiments using negative log-likelihood (bits/dim). We do not use inception based metrics, as they do not generalize to different datasets, and they do not report overfitting (Barratt & Sharma, 2018).
Other likelihood-based methods such as PixelCNNs (Van Oord et al., 2016) impose a specific order on the dimensions of the image, which may not reflect the actual generative process. Furthermore, drawing samples tends to be computationally expensive. Alternatively, VAEs (Kingma & Welling, 2014) optimize a lower bound of the likelihood. The likelihood can be evaluated via an importance sampling scheme, but the quality of the estimate depends on the number of samples and the quality of the proposal distribution.
Many non-likelihood-based methods that can generate high resolution image samples utilize Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Although GANs tend to generate high quality images, they do not directly optimize a likelihood. This makes it difficult to obtain likelihoods and to measure their coverage of the dataset.
The architecture of Kingma & Dhariwal (2018) is the starting point for the architecture in our experiments. In the flow module, the invertible $1 \times 1$ convolution can simply be replaced with a $d \times d$ periodic or emerging convolution. For a detailed overview of the architecture see Figure 7. We quantitatively evaluate models on a variety of datasets in bits per dimension, which is the negative log-likelihood in base 2 divided by the number of dimensions. In addition, we provide image samples generated with periodic convolutions trained on galaxy images, and samples generated with emerging convolutions trained on CIFAR10.
Note that generative models are very computationally expensive in general, and we do not have the computational budget to run extremely high-dimensional image modeling tasks.
Since periodic convolutions assume that image boundaries are connected, they are suited for data where pixels along the boundaries are roughly the same, or are actually connected. An example of such data is pictures taken in space, as they tend to contain some scattered light sources, and boundaries are mostly dark. Ackermann et al. collected a small classification dataset of galaxies with images of merging and non-merging galaxies. On the non-merging galaxy images, we compare the bits per dimension of three models, constrained by the same parameter budget: $1 \times 1$ convolutions (Glow), periodic convolutions and emerging convolutions (see Table 2). Experiments show that both our periodic and emerging convolutions significantly outperform $1 \times 1$ convolutions, and their performance is less sensitive to initialization. Samples of the model using periodic convolutions are depicted in Figure 8.
The performance of emerging convolutions is extensively tested on CIFAR10 and ImageNet, with different architectural sizes. The experiments in Table 3 use the architecture from Kingma & Dhariwal (2018), where emerging convolutions replace the $1 \times 1$ convolutions. Emerging convolutions perform either on par or better than Glow (see footnote 3), which may be caused by the overparameterization of these large models. Samples of the model using emerging convolutions are depicted in Figure 9.
Footnote 3: The CIFAR10 performance of Glow was obtained by running the code from the original GitHub repository.
Table 3 (columns: CIFAR10 | ImageNet 32x32 | ImageNet 64x64).
In some cases, it may not be feasible to run very large models in production because of the large computational cost. Therefore, it is interesting to study the behavior of models when they are constrained in size. We compare $1 \times 1$ and $d \times d$ emerging convolutions with the same number of flows per level, for models of different sizes. Both on CIFAR10 and ImageNet, we observe that models using emerging convolutions perform significantly better. Furthermore, for smaller models the contribution of emerging convolutions becomes more important, as evidenced by the increasing performance gap (see Table 4).
Recall that the inverse of autoregressive convolutions requires solving a sequential problem, which we have accelerated with an inversion module that uses Cython and parallelism across the minibatch. Considering CIFAR10 and the same architecture as used in Table 3, it takes 39ms to sample an image using our accelerated emerging inverses, 46 times faster than the naïvely obtained inverses using TensorFlow bijectors (see Table 5). As expected, sampling from models using $1 \times 1$ convolutions remains faster and takes 5ms.
Masked Autoregressive Flows (MAFs) are a very flexible method for density estimation, and they improve performance over emerging convolutions slightly, 3.33 versus 3.34 bits per dimension. However, the width and depth of MAFs make them a poor choice for sampling, because they considerably increase the time to compute their inverse: 3000ms per sample using a naïve solution, and 650ms per sample using our inversion module. Since emerging convolutions operate on lower dimensions of the data, they are 17 times faster to invert than the MAFs.
QR convolutions are compared with standard and PLU convolutions on the CIFAR10 dataset. The models have 3 levels and 8 flows per level. Experiments confirm that our stable QR decomposition achieves the same performance as the standard parameterization, as shown in Table 6. This is expected, since any real square matrix has a QR decomposition. Furthermore, the experiments confirm that the less flexible PLU parameterization leads to worse performance, which is caused by the fixed permutation matrix.
We have introduced three generative flows: i) $d \times d$ emerging convolutions as invertible standard zero-padded convolutions, ii) $d \times d$ periodic convolutions for periodic data or data with minimal boundary variation, and iii) stable and flexible $1 \times 1$ convolutions using a QR parametrization. Our methods show consistent improvements over various datasets using the same parameter budget, especially when considering models constrained in size.
- Ackermann et al. Ackermann, S., Schawinski, K., Zhang, C., Weigel, A. K., and Turp, M. D. Using transfer learning to detect galaxy mergers. Monthly Notices of the Royal Astronomical Society.
- Barratt & Sharma (2018) Barratt, S. and Sharma, R. A note on the inception score. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, ICLR, 2017.
- Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
- Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
- Kingma & Welling (2014) Kingma, D. P. and Welling, M. Stochastic gradient vb and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR, 2014.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
- Li & Grathwohl (2018) Li, X. and Grathwohl, W. Training glow with constant memory cost. NIPS Workshop on Bayesian Deep Learning, 2018.
- Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
- Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pp. 1–10, 2016.
- van den Oord et al. (2016) van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.
- Van Oord et al. (2016) Van Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756, 2016.
Models are optimized with settings identical to Kingma & Dhariwal (2018). The optimizer Adamax is used with a learning rate of 0.001. Coupling layers contain neural networks with three convolution layers. The first and last convolution are $3 \times 3$ and the center convolution is $1 \times 1$. The number of channels (width) of the two hidden layers is a hyperparameter, and they use ReLU activations. A flow module consists of an actnorm layer, a mixing layer and a coupling layer. A mixing layer is either a $1 \times 1$ convolution (Glow) or a $d \times d$ (emerging or periodic) convolution. The entire model consists of several levels, each containing a fixed number of flows. At the end of a level the squeeze operation reduces the spatial dimensions by two, and increases the channel dimensions by four.
In the MAF model, convolutions are succeeded (not replaced) by two MAF layers with opposite autoregressive order. MAF layers have the same structure as coupling layers, with the difference that the two hidden layers have 96 channels.
The models constrained in size are tested on CIFAR10 and have a different number of flows per level. The other hyperparameters remain unchanged.
The experiments that compare parameterizations of the $1 \times 1$ convolutions all use the same hyperparameters.