Abstract

Generative flows are attractive because they admit exact likelihood optimization and efficient image synthesis. Recently, Kingma & Dhariwal (2018) demonstrated with Glow that generative flows are capable of generating high quality images. We generalize the 1 × 1 convolutions proposed in Glow to invertible d × d convolutions, which are more flexible since they operate on both channel and spatial axes. We propose two methods to produce invertible convolutions that have receptive fields identical to standard convolutions: Emerging convolutions are obtained by chaining specific autoregressive convolutions, and periodic convolutions are decoupled in the frequency domain. Our experiments show that the flexibility of d × d convolutions significantly improves the performance of generative flow models on galaxy images, CIFAR10 and ImageNet.


 

Emerging Convolutions for Generative Normalizing Flows

 

Emiel Hoogeboom   Rianne van den Berg   Max Welling


Correspondence to: Emiel Hoogeboom <e.hoogeboom@uva.nl>.
Preprint. Work in progress
1. Introduction

Generative models aim to learn a representation of the data distribution $p(\mathbf{x})$, in contrast with discriminative models that learn a probability distribution of labels given data, $p(\mathbf{y}|\mathbf{x})$. Generative modeling may be used for numerous applications such as anomaly detection, denoising, inpainting, and super-resolution. The task of generative modeling is challenging, because data is often very high-dimensional, which makes optimization and choosing a successful objective difficult.

Figure 1: Illustration of a square emerging convolution. The input has spatial dimensions 3 × 3 and two channels. Convolutions use one-pixel-wide zero padding at each border. Two consecutive square autoregressive convolutions with filters $\mathbf{k}_1$ and $\mathbf{k}_2$ have a receptive field identical to a standard 3 × 3 convolution, with filter $\mathbf{k}_2 * \mathbf{k}_1$, where $*$ denotes a convolution. These operations are equivalent to the multiplication of matrices $\mathbf{K}_1$ and $\mathbf{K}_2$ with a vectorized input signal $\vec{x}$. Since the filters are learned decomposed, the Jacobian determinant and inverse are straightforward to compute.

Generative flow methods have several advantages over other generative models: i) They optimize the log likelihood of a continuous distribution exactly, as opposed to Variational Auto-Encoders (VAEs) (Kingma & Welling, 2014) which optimize a lower bound to the log-likelihood. ii) Drawing samples has a computational cost comparable to inference, in contrast with Pixel CNNs (Van Oord et al., 2016). iii) Generative flows also have the potential for huge memory savings, because activations necessary in the backward pass can be obtained by computing the inverse of layers (Gomez et al., 2017; Li & Grathwohl, 2018).

The performance of density estimation models can be largely attributed to Masked Autoregressive Flows (MAFs) (Papamakarios et al., 2017) and coupling layers (Dinh et al., 2017). MAFs contain flexible autoregressive transformations, but are computationally expensive to invert, which is a disadvantage for sampling high-dimensional data. Coupling layers transform a subset of the dimensions of the data, parameterized by the remaining dimensions. The inverse of coupling layers is straightforward to compute, which makes them suitable for generative flows. However, since coupling layers can only operate on a subset of the dimensions of the data, they may be limited in flexibility.

To improve their effectiveness, coupling layers are alternated with less complex transformations that do operate on all dimensions of the data. Dinh et al. (2017) use a fixed channel permutation in Real NVP, and Kingma & Dhariwal (2018) utilize invertible 1 × 1 convolutions in Glow.

However, 1 × 1 convolutions suffer from limited flexibility, and using standard d × d convolutions is not straightforward, as they are very computationally expensive to invert. We propose two methods to obtain easily invertible and flexible convolutions: emerging and periodic convolutions. Both of these convolutions have receptive fields identical to standard d × d convolutions, resulting in flexible transformations over both the channel and spatial axes.

The structure of an emerging convolution is depicted in Figure 1, where the top depicts the convolution filters, and the bottom shows the equivalent matrices of these convolutions. Two autoregressive convolutions are chained to obtain an emerging receptive field identical to a standard square convolution. Empirically, we find that replacing 1 × 1 convolutions with the generalized invertible convolutions produces significantly better results on galaxy images, CIFAR10 and ImageNet, even when correcting for the increase in parameters.

In addition to invertible d × d convolutions, we also propose a QR decomposition for 1 × 1 convolutions, which resolves flexibility issues of the PLU decomposition proposed by Kingma & Dhariwal (2018).

The main contributions of this paper are listed below:

  1. Invertible emerging convolutions using autoregressive convolutions.

  2. Invertible periodic convolutions using decoupling in the frequency domain.

  3. Numerically stable and flexible 1 × 1 convolutions parameterized by a QR decomposition.

  4. An accelerated inversion module for autoregressive convolutions.

This paper is structured as follows: Section 2 provides background information related to generative flows, Section 3 presents our methods, Section 4 discusses related work, and Section 5 describes our results.

2. Background
Generative Flow | Function | Inverse | Log Determinant
Actnorm | $\mathbf{z} = \boldsymbol{\gamma} \odot \mathbf{x} + \boldsymbol{\beta}$ | $\mathbf{x} = (\mathbf{z} - \boldsymbol{\beta}) \oslash \boldsymbol{\gamma}$ | $h \cdot w \sum_c \log |\gamma_c|$
Affine coupling | $\mathbf{z}_a = \mathbf{x}_a$, $\mathbf{z}_b = \mathbf{x}_b \odot \mathbf{s}(\mathbf{x}_a) + \mathbf{t}(\mathbf{x}_a)$ | $\mathbf{x}_a = \mathbf{z}_a$, $\mathbf{x}_b = (\mathbf{z}_b - \mathbf{t}(\mathbf{z}_a)) \oslash \mathbf{s}(\mathbf{z}_a)$ | $\sum_i \log |\mathbf{s}(\mathbf{x}_a)_i|$
1 × 1 Conv | $\mathbf{z}_{ij} = \mathbf{W} \mathbf{x}_{ij}$ | $\mathbf{x}_{ij} = \mathbf{W}^{-1} \mathbf{z}_{ij}$ | $h \cdot w \cdot \log |\det \mathbf{W}|$
Emerging Conv | $\mathbf{z} = \mathbf{k}_2 \star_l (\mathbf{k}_1 \star_l \mathbf{x})$ | solved sequentially (Eq. 6) | $h \cdot w \sum_c \log |k_{1,c,c,m_y,m_x} \, k_{2,c,c,m_y,m_x}|$
Periodic Conv | $\hat{\mathbf{z}}_{uv} = \hat{\mathbf{K}}_{uv} \hat{\mathbf{x}}_{uv}$ | $\hat{\mathbf{x}}_{uv} = \hat{\mathbf{K}}_{uv}^{-1} \hat{\mathbf{z}}_{uv}$ | $\sum_{u,v} \log |\det \hat{\mathbf{K}}_{uv}|$
Table 1: The definition of several generative normalizing flows. All flow functions have an inverse and determinant that are straightforward to compute. The height $h$, width $w$ and number of channels $c$ of an output remain identical to the dimensions of the input. The symbols $\odot$ and $\oslash$ denote element-wise multiplication and division. Input and output may be denoted as tensors $\mathbf{x}$ and $\mathbf{z}$ with dimensions $c \times h \times w$. The inputs and outputs may also be denoted as one-dimensional vectors $\vec{x}$ and $\vec{z}$ with dimension $c \cdot h \cdot w$. Input and output in the frequency domain are denoted with $\hat{\mathbf{x}}$ and $\hat{\mathbf{z}}$, with dimensions $c \times h \times w$, where the last two components denote frequencies.
2.1. Change of Variables

Consider a bijective map $f$ between variables $\mathbf{x}$ and $\mathbf{z}$, where $\mathbf{z} = f(\mathbf{x})$. The likelihood of the variable $\mathbf{x}$ can be written as the likelihood of the transformation $\mathbf{z} = f(\mathbf{x})$ evaluated by $p_Z$, using the change of variables formula:

$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left| \det \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \right|, \qquad \mathbf{z} = f(\mathbf{x}) \qquad (1)$$

The complicated probability density $p_X(\mathbf{x})$ is equal to the probability density $p_Z(\mathbf{z})$ multiplied by the Jacobian determinant, where $p_Z$ is chosen to be tractable. The function $f$ can be learned, but the choice of $f$ is constrained by two practical issues: firstly, the Jacobian determinant should be tractable; secondly, to draw samples from $p_X$, the inverse of $f$ should be tractable.
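For illustration, the change of variables formula can be checked numerically for the simplest possible flow, a linear map $\mathbf{z} = \mathbf{A}\mathbf{x}$ with a standard normal $p_Z$; the matrix, dimensionality and use of numpy/scipy below are illustrative assumptions rather than part of any model in this paper.

```python
# A minimal numerical check of Eq. 1 for a linear flow z = A x with a standard
# normal base distribution. For this choice x = A^{-1} z is Gaussian with
# covariance (A^T A)^{-1}, so p_X(x) can also be evaluated directly.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2)) + 2 * np.eye(2)    # an (almost surely) invertible map
x = rng.normal(size=2)

z = A @ x                                       # forward pass of the flow
log_p_z = multivariate_normal(np.zeros(2), np.eye(2)).logpdf(z)
log_det = np.linalg.slogdet(A)[1]               # log |det dz/dx|
log_p_x_flow = log_p_z + log_det                # change of variables, Eq. 1

# Direct density of x ~ N(0, (A^T A)^{-1}) for comparison.
log_p_x_direct = multivariate_normal(np.zeros(2), np.linalg.inv(A.T @ A)).logpdf(x)
assert np.isclose(log_p_x_flow, log_p_x_direct)
```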

2.2. Log-Likelihood of Layered Models

In machine learning, likelihoods are generally optimized in log-space for numerical precision, and deep learning models are structured in layers. Let $\mathbf{z}_1, \ldots, \mathbf{z}_L$ be the intermediate representations produced by the network layers $f_1, \ldots, f_L$, where $\mathbf{z}_l = f_l(\mathbf{z}_{l-1})$, $\mathbf{z}_0 = \mathbf{x}$ and $\mathbf{z}_L = \mathbf{z}$. The log-likelihood of $\mathbf{x}$ is written as the log-likelihood of $\mathbf{z}$, and the summation of the log Jacobian determinant of each layer:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{l=1}^{L} \log \left| \det \frac{\partial \mathbf{z}_l}{\partial \mathbf{z}_{l-1}} \right| \qquad (2)$$
2.3. Dequantization

We will evaluate our methods with experiments on image datasets, where pixels are discrete-valued from 0 to 255. Since generative flows are continuous density models, they may trivially place infinite mass on discretized bin locations. Therefore, we use the definition of Theis et al. (2016) that defines the relation between a discrete model $P$ and a continuous model $p$ as an integration over bins: $P(\mathbf{x}) = \int_{[0,1)^d} p(\mathbf{x} + \mathbf{u}) \, d\mathbf{u}$, where $d$ is the dimensionality of $\mathbf{x}$. They further derive a lower bound to optimize this model with Jensen's inequality, resulting in additive uniform noise for the integer-valued pixels $\mathbf{x}$ from the data distribution $\mathcal{D}$:

$$\mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \left[ \log P(\mathbf{x}) \right] \geq \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \, \mathbb{E}_{\mathbf{u} \sim \mathcal{U}[0,1)^d} \left[ \log p(\mathbf{x} + \mathbf{u}) \right] \qquad (3)$$
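As an illustration of this objective, the sketch below dequantizes a batch of 8-bit images and converts a (placeholder) continuous log-likelihood to bits per dimension; the rescaling of pixels to $[0,1)$ and the placeholder value are assumptions made for the example, not the exact preprocessing of the paper.

```python
# A small sketch of dequantization (Eq. 3) and the bits-per-dimension metric,
# assuming 8-bit images rescaled to [0, 1).
import numpy as np

rng = np.random.default_rng(0)
x_int = rng.integers(0, 256, size=(16, 3, 32, 32))   # a toy batch of 8-bit images
u = rng.uniform(size=x_int.shape)                    # u ~ U[0, 1)
x = (x_int + u) / 256.0                              # dequantized, continuous input

# log_p stands for the continuous model log-likelihood per image (in nats);
# here it is a placeholder value, in practice it comes from Eq. 2.
log_p = np.full(x.shape[0], -8000.0)

num_dims = np.prod(x.shape[1:])
# Rescaling pixels by 1/256 contributes a constant log-Jacobian of
# -num_dims * log(256), which has to be accounted for when comparing models.
bits_per_dim = -(log_p - num_dims * np.log(256)) / (num_dims * np.log(2))
print(bits_per_dim.mean())
```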
2.4. Generative Flows

Generative flows are bijective functions, often structured as deep learning layers, that are designed to have tractable Jacobian determinants and inverses. An overview of several generative flows is provided in Table 1, and a description is given below:

Coupling layers (Dinh et al., 2017) split the input into two parts. The output is a combination of a copy of the first part, and a transformation of the second part, parameterized by the first part. As a result, the inverse and Jacobian determinant are straightforward to compute.

Actnorm layers (Kingma & Dhariwal, 2018) are data dependent initialized layers with scale and translation parameters. They are initialized such that the distribution of activations has mean zero and standard deviation one. Actnorm layers improve training stability and performance.

1 × 1 convolutions (Kingma & Dhariwal, 2018) are easy to invert, and can be seen as a generalization of the permutation operations that were used by Dinh et al. (2017). 1 × 1 convolutions improve the effectiveness of the coupling layers.
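The sketch below chains minimal numpy versions of these three flows and evaluates the log-likelihood as in Equation 2; the toy coupling network, parameter values and tensor shapes are illustrative assumptions and not the architecture used in the experiments.

```python
# Minimal actnorm, affine coupling and 1x1 convolution flows, each returning
# the transformed output and its log-Jacobian determinant, chained as in Eq. 2.
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x = rng.normal(size=(C, H, W))

def actnorm(x, log_scale, bias):
    z = np.exp(log_scale)[:, None, None] * x + bias[:, None, None]
    return z, H * W * np.sum(log_scale)              # per-channel scale, h*w copies

def affine_coupling(x, net):
    xa, xb = x[: C // 2], x[C // 2 :]
    log_s, t = net(xa)                               # parameters predicted from xa
    z = np.concatenate([xa, xb * np.exp(log_s) + t], axis=0)
    return z, np.sum(log_s)

def conv1x1(x, Wmat):
    z = np.einsum("ij,jhw->ihw", Wmat, x)
    return z, H * W * np.linalg.slogdet(Wmat)[1]

# A toy "network" for the coupling layer: any function of xa is allowed, since
# xa is passed through unchanged and the Jacobian stays triangular.
def toy_net(xa):
    return 0.1 * np.tanh(xa), 0.5 * xa

log_scale, bias = 0.1 * rng.normal(size=C), 0.1 * rng.normal(size=C)
Wmat = rng.normal(size=(C, C)) + 2 * np.eye(C)

z, total_logdet = x, 0.0
for layer in (lambda v: actnorm(v, log_scale, bias),
              lambda v: affine_coupling(v, toy_net),
              lambda v: conv1x1(v, Wmat)):
    z, logdet = layer(z)
    total_logdet += logdet

# Log-likelihood under a standard normal base distribution (Eq. 2).
log_p_z = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)
log_p_x = log_p_z + total_logdet
print(log_p_x)
```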

2.5. Standard Convolutions
Figure 2: Illustration of a standard 3 × 3 convolution layer with one input and output channel. The spatial input size is 3 × 3, with input values $x_1, \ldots, x_9$. The convolution uses one-pixel-wide zero padding at each border, and the filter has parameters $w_1, \ldots, w_9$. Left: the convolution $\mathbf{w} \star \mathbf{x}$. Right: the matrix multiplication $\mathbf{W} \vec{x}$ which produces the equivalent result.
Figure 3: A standard 3 × 3 convolution layer with two input and output channels. The input is 3 × 3 spatially, and has two channels. The convolution uses one-pixel-wide zero padding at each border. Left: the convolution filter $\mathbf{w}$. Right: the matrix $\mathbf{W}$ which produces the equivalent result when multiplied with a vectorized input.

Generally, a convolution layer¹ with filter $\mathbf{w}$ and input $\mathbf{x}$ is equivalent to the multiplication of a $c \cdot h \cdot w \times c \cdot h \cdot w$ matrix $\mathbf{W}$ and a vectorized input $\vec{x}$. An example of a single channel convolution and its equivalent matrix is depicted in Figure 2. The signals $\vec{x}$ and $\vec{z}$ are indexed as $x_{i \cdot w + j}$, where $j$ is the width index, $i$ is the height index, and $w$ is the total width. Note that the matrix becomes sparser as the image dimensions grow and that the parameters of the filter occur repeatedly in the matrix $\mathbf{W}$. A two-channel convolution is visualized in Figure 3, where we have omitted parameters inside filters to avoid clutter. Here, $\vec{x}$ and $\vec{z}$ are vectorized with an index that additionally runs over the channel axis, where $c$ denotes the channel index and $n_c$ the number of channels.

¹ Note that in deep learning frameworks, convolutions are often actually cross-correlations. In our equations $\star$ denotes a cross-correlation and $*$ denotes a convolution. In addition, a convolution layer is usually implemented as an aggregation of cross-correlations, i.e. a cross-correlation layer, which we denote with the symbol $\star_l$. In the main text we may omit these details, and these operations are referred to as convolutions.

Using standard convolutions as a generative flow is inefficient. The determinant and inverse can be obtained naïvely by operating directly on the corresponding matrix $\mathbf{W}$, but this would be very expensive, corresponding to a computational complexity of $\mathcal{O}(c^3 h^3 w^3)$.
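The equivalence and its cost can be made concrete with a brute-force sketch: the matrix $\mathbf{W}$ is recovered by pushing unit impulses through a zero-padded 3 × 3 convolution, and its determinant is computed naïvely; the single channel and the 5 × 5 input are illustrative choices.

```python
# Building the equivalent matrix of a zero-padded 3x3 convolution (Figure 2)
# and computing its determinant naively, which is exactly the cubic cost that
# makes standard convolutions impractical as generative flows.
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
h, w = 5, 5
k = rng.normal(size=(3, 3))                       # a standard 3x3 filter

def conv(x):
    # 'same' mode with zero boundaries corresponds to one-pixel zero padding.
    return correlate2d(x, k, mode="same", boundary="fill")

# Column i of W is the response to the i-th unit impulse, so W @ vec(x) == vec(conv(x)).
W = np.stack([conv(np.eye(h * w)[i].reshape(h, w)).reshape(-1)
              for i in range(h * w)], axis=1)

x = rng.normal(size=(h, w))
assert np.allclose(W @ x.reshape(-1), conv(x).reshape(-1))
sign, logabsdet = np.linalg.slogdet(W)            # cubic in h*w: the expensive part
print(logabsdet)
```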

2.6. Autoregressive Convolutions
Figure 4: An autoregressive 3 × 3 convolution layer with two input and output channels. The input has spatial dimensions 3 × 3, and two channels. The convolution uses one-pixel-wide zero padding at each border. Left: the autoregressive convolution filter $\mathbf{k}$. Right: the matrix $\mathbf{K}$ which produces the equivalent result on a vectorized input. Note that the equivalent matrix is triangular.

Autoregressive convolutions have been widely used in the field of normalizing flows (Germain et al., 2015; Kingma et al., 2016) because it is straightforward to compute their Jacobian determinant. Although there exist autoregressive convolutions with different input and output dimensions, we let the number of output channels equal the number of input channels for invertibility. In this case, autoregressive convolutions can be expressed as a multiplication between a triangular weight matrix and a vectorized input.

In practice, a filter $\mathbf{k}$ is constructed from weights $\mathbf{w}$ and a binary mask $\mathbf{m}$ that enforces the autoregressive structure (see Figure 4). The convolution with the masked filter is autoregressive without the need to mask inputs, which allows parallel computation of the convolution layer:

$$\mathbf{z} = \mathbf{k} \star_l \mathbf{x}, \qquad \mathbf{k} = \mathbf{w} \odot \mathbf{m} \qquad (4)$$

where $\star_l$ denotes a convolution layer. The matrix multiplication $\vec{z} = \mathbf{K} \vec{x}$ produces the equivalent result, where $\vec{x}$ and $\vec{z}$ are the vectorized signals, and $\mathbf{K}$ is a sparse triangular matrix constructed from $\mathbf{k}$ (see Figure 4). The Jacobian is triangular by design and its determinant can be computed efficiently, since it only depends on the diagonal elements of the matrix $\mathbf{K}$:

$$\log \left| \det \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \right| = h \cdot w \sum_{c} \log \left| k_{c,c,m_y,m_x} \right| \qquad (5)$$

where the index $c$ denotes the channel and $(m_y, m_x)$ denotes the spatial center of the filter. The inverse of an autoregressive convolution can theoretically be computed using $\vec{x} = \mathbf{K}^{-1} \vec{z}$. In reality this matrix is large and impractical to invert directly. Since $\mathbf{K}$ is triangular, the solution for $\vec{x}$ can be found through forward substitution:

$$x_t = \frac{z_t - \sum_{i=1}^{t-1} K_{t,i} \, x_i}{K_{t,t}} \qquad (6)$$

The inverse can be computed by sequentially traversing through the input feature map in the imposed autoregressive order. The computational cost of the inverse scales linearly with the number of pixels, and computation can be parallelized across examples in the minibatch.
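The following single-channel sketch ties Equations 4-6 together: a raster-scan mask makes the equivalent matrix triangular, the log-determinant reduces to the center tap of the filter, and the inverse is a forward substitution; the specific mask, sizes and diagonal shift are assumptions made for the example.

```python
# A single-channel autoregressive 3x3 convolution: masked filter (Eq. 4),
# log-determinant from the center weight (Eq. 5), inverse by forward
# substitution (Eq. 6).
import numpy as np
from scipy.signal import correlate2d
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
h, w = 5, 5
mask = np.array([[1, 1, 1],
                 [1, 1, 0],
                 [0, 0, 0]], dtype=float)         # center and "past" pixels only
k = rng.normal(size=(3, 3)) * mask
k[1, 1] += 2.0                                    # keep the diagonal well away from zero

def conv(x):
    return correlate2d(x, k, mode="same", boundary="fill")

# Equivalent matrix on raster-ordered pixels, built from unit impulses.
K = np.stack([conv(np.eye(h * w)[i].reshape(h, w)).reshape(-1)
              for i in range(h * w)], axis=1)
assert np.allclose(K, np.tril(K))                 # triangular by construction

# Eq. 5: the log-determinant only depends on the filter's center weight.
assert np.isclose(h * w * np.log(abs(k[1, 1])), np.linalg.slogdet(K)[1])

# Eq. 6: pixels are recovered sequentially in the autoregressive order.
x = rng.normal(size=(h, w))
z = conv(x)
x_rec = solve_triangular(K, z.reshape(-1), lower=True).reshape(h, w)
assert np.allclose(x_rec, x)
```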

3. Method

We present two methods to generalize 1 × 1 convolutions to invertible d × d convolutions, improving the flexibility of generative flow models. Emerging convolutions are obtained by chaining autoregressive convolutions (section 3.1), and periodic convolutions are decoupled in the frequency domain (section 3.2). In section 3.3, we provide a stable and flexible parameterization for invertible 1 × 1 convolutions.

3.1. Emerging Convolutions

Although autoregressive convolutions are invertible, their transformation is restricted by the imposed autoregressive order, enforced through masking of the filters (as depicted in Figure 4). To alleviate this restriction, we propose emerging convolutions, which are more flexible and nevertheless invertible. Emerging convolutions are obtained by chaining specific autoregressive convolutions, invertible via the autoregressive inverses. To some extent this resembles the combination of stacks used to resolve the blind spot problem in conditional image modeling with PixelCNNs (van den Oord et al., 2016), with the important difference that we do not constrain the resulting convolution itself to be autoregressive.

The emerging receptive field can be controlled by chaining autoregressive convolutions with variations in the imposed order. A collection of achievable receptive fields for emerging convolutions is depicted in Figure 5, based on commonly used autoregressive masking.

The autoregressive inverse requires the solution to a sequential problem, and as a result, it inevitably suffers some additional computational cost. In emerging convolutions we minimize this cost through the use of an accelerated parallel inversion module, implemented in Cython, and by maintaining relatively small dimensionality in the emerging convolutions compared to the internal size of coupling layers.

Figure 5: Achievable emerging receptive fields that consist of two distinct auto-regressive convolutions. Grey areas denote the first convolution filter and orange areas denote the second convolution filter. Blue areas denote the emerging receptive field, and white areas are masked. The convolution in the bottom row is a special case, which has a receptive field identical to a standard convolution.
3.1.1. Square Emerging Convolutions

Deep learning applications tend to use square filters, and libraries are specifically optimized for these shapes. Since most of the receptive fields in Figure 5 are unusually shaped, these would require masking to fit them in rectangular arrays, leading to unnecessary computation.

However, there is a special case in which the emerging receptive field of two specific autoregressive convolutions is identical to a standard convolution. These square emerging convolutions can be obtained by combining off-center square convolutions, depicted in the bottom row of Figure 5 (also Figure 1). Our square emerging convolution filters are more efficient since they require fewer masked values in rectangular arrays.

There are two approaches to efficiently compute square emerging convolutions during optimization and density estimation. Either the emerging convolution is expressed as two smaller consecutive convolutions, or the order of convolution is changed: first the smaller filters $\mathbf{k}_1$ and $\mathbf{k}_2$ are convolved to obtain a single equivalent convolution filter. Then, the output of the emerging convolution is obtained by convolving the equivalent filter, $\mathbf{k}_2 * \mathbf{k}_1$, with the feature map $\mathbf{f}$:

$$\mathbf{k}_2 \star (\mathbf{k}_1 \star \mathbf{f}) = (\mathbf{k}_2 * \mathbf{k}_1) \star \mathbf{f} \qquad (7)$$

This equivalence follows from the associativity of convolutions and the time reversal of real discrete signals in cross-correlations.

When the filter size is 1 × 1, two autoregressive convolutions simplify to an LU-decomposed 1 × 1 convolution. To ensure that emerging convolutions are flexible, we use emerging convolutions that consist of a single 1 × 1 convolution and two square autoregressive convolutions with different masking, as depicted in the bottom row of Figure 1. Again, the individual convolutions may all be combined into a single emerging convolution filter using the associativity of convolutions (Equation 7).
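Equation 7 can be checked directly with scipy; 'full' correlations are used below so that no border values are cropped, and the 2 × 2 filter shapes are an illustrative choice matching the off-center filters of Figure 1.

```python
# Chaining two cross-correlations equals a single cross-correlation with the
# filters convolved together (Eq. 7).
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k1 = rng.normal(size=(2, 2))                      # off-center square filters
k2 = rng.normal(size=(2, 2))

chained = correlate2d(correlate2d(x, k1, mode="full"), k2, mode="full")
k_emerging = convolve2d(k2, k1, mode="full")      # a single 3x3 emerging filter
combined = correlate2d(x, k_emerging, mode="full")
assert np.allclose(chained, combined)
```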

3.2. Invertible Periodic Convolutions

In some cases, data may be periodic or boundaries may contain roughly the same values. In these cases it may be advantageous to use invertible periodic convolutions, which assume that boundaries wrap around. When computed in the frequency domain, this alternative convolution has a tractable Jacobian determinant and inverse. The method leverages the convolution theorem, which states that the Fourier transform of a convolution is given by the element-wise product of the Fourier transformed signals. Specifically, the input and filter are transformed using the Discrete Fourier Transform (DFT) and multiplied element-wise, after which the inverse DFT is taken. By considering the transformation in the frequency domain, the computational complexity of the Jacobian determinant and the inverse is considerably reduced. In contrast with emerging convolutions, which are very specifically parameterized, the filters of periodic convolutions are completely unconstrained.

A standard convolution layer in deep learning is conventionally implemented as an aggregation of cross-correlations for every output channel. The convolution layer with input $\mathbf{x}$ and filter $\mathbf{k}$ outputs the feature map $\mathbf{z}$, which is computed as:

$$\mathbf{z}_{c_{out}} = \sum_{c_{in}} \mathbf{x}_{c_{in}} \star \mathbf{k}_{c_{out}, c_{in}} \qquad (8)$$

Let $\mathcal{F}$ denote the Fourier transform and let $\mathcal{F}^{-1}$ denote the inverse Fourier transform. The Fourier transform can be moved inside the channel summation, since it is distributive over addition. Because a convolution differs from a cross-correlation by a time reversal for real signals, let $\mathbf{k}^r$ denote the reflection of the filter $\mathbf{k}$ in both spatial directions. Let $\hat{\mathbf{x}}_{c_{in}} = \mathcal{F}(\mathbf{x}_{c_{in}})$ and $\hat{\mathbf{k}}_{c_{out}, c_{in}} = \mathcal{F}(\mathbf{k}^r_{c_{out}, c_{in}})$, which are indexed by frequencies $u$ and $v$. Using these definitions, each cross-correlation is written as an element-wise multiplication in the frequency domain:

$$\hat{\mathbf{z}}_{c_{out}} = \sum_{c_{in}} \hat{\mathbf{x}}_{c_{in}} \odot \hat{\mathbf{k}}_{c_{out}, c_{in}} \qquad (9)$$

which can be written as a sum of products in scalar form:

$$\hat{z}_{c_{out}, uv} = \sum_{c_{in}} \hat{x}_{c_{in}, uv} \cdot \hat{k}_{c_{out}, c_{in}, uv} \qquad (10)$$

The summation of multiplications can be reformulated as a matrix multiplication over the channel axis by viewing the output at frequency $(u, v)$ as a multiplication of the matrix $\hat{\mathbf{K}}_{uv}$ and the input vector $\hat{\mathbf{x}}_{uv}$:

$$\hat{\mathbf{z}}_{uv} = \hat{\mathbf{K}}_{uv} \hat{\mathbf{x}}_{uv} \qquad (11)$$
Figure 6: Visualization of a periodic 3 × 3 convolution layer in the frequency domain. The input and output have height 3, width 3 and 2 channels. The shape of the filter in the frequency domain is determined by the shape of the image, which is also 3 × 3 spatially in this specific example. Left: the convolution filter transformed to the frequency domain, $\hat{\mathbf{k}}$. Right: the matrix in the frequency domain, which produces the equivalent result on a vectorized input. The equivalent matrix in the frequency domain is partitioned into small independent blocks.

The matrix $\hat{\mathbf{K}}_{uv}$ has dimensions $c \times c$, and the input $\hat{\mathbf{x}}_{uv}$ and output $\hat{\mathbf{z}}_{uv}$ are vectors with dimension $c$. The output in the original domain can simply be retrieved by taking the inverse Fourier transform, $\mathbf{z}_{c_{out}} = \mathcal{F}^{-1}(\hat{\mathbf{z}}_{c_{out}})$. The perspective of matrix multiplication in the frequency domain decouples the convolution transformation (see Figure 6). Therefore, the log determinant of a periodic convolution layer is equal to the sum of the log determinants of the individual frequency components:

$$\log \left| \det \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \right| = \sum_{u,v} \log \left| \det \hat{\mathbf{K}}_{uv} \right| \qquad (12)$$

The determinant remains unchanged by the Fourier transform and its inverse, since these are unitary transformations. The inverse operation requires an inversion of the matrix $\hat{\mathbf{K}}_{uv}$ for every frequency $(u, v)$:

$$\hat{\mathbf{x}}_{uv} = \hat{\mathbf{K}}_{uv}^{-1} \hat{\mathbf{z}}_{uv} \qquad (13)$$

The solution of $\mathbf{x}$ in the original domain is obtained by taking the inverse Fourier transform, $\mathbf{x}_{c} = \mathcal{F}^{-1}(\hat{\mathbf{x}}_{c})$, for every channel $c$.

Recall that a standard convolution layer is equivalent to a matrix multiplication with a $c \cdot h \cdot w \times c \cdot h \cdot w$ matrix, where we let the number of output channels equal the number of input channels for invertibility. The Fourier transform decouples the transformation of the convolution layer at each frequency, which divides the computation into $h \cdot w$ separate matrix multiplications with $c \times c$ matrices. Therefore, the computational cost of the determinant is reduced from $\mathcal{O}(c^3 h^3 w^3)$ to $\mathcal{O}(c^3 h w)$ in the frequency domain, and computation can be parallelized since the matrices are independent across frequencies and independent of the data. Furthermore, the inverse matrices only need to be computed once after the model has converged, which reduces the inverse convolution to an efficient matrix multiplication with computational complexity $\mathcal{O}(c^2 h w)$. (The inverse also incurs some overhead due to the Fourier transform of the feature maps, which corresponds to a computational complexity of $\mathcal{O}(c \cdot h \cdot w \cdot \log(h \cdot w))$.)
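The sketch below implements a periodic convolution layer directly in the frequency domain and verifies Equations 12 and 13 against a brute-force construction of the equivalent spatial-domain matrix; the filter values, sizes and the circular-convolution convention (no filter reflection) are illustrative assumptions.

```python
# A periodic convolution evaluated in the frequency domain: per-frequency c x c
# matrix multiplications (Eq. 11), log-determinant as a sum over frequencies
# (Eq. 12), and inversion by a per-frequency matrix solve (Eq. 13).
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 2, 4, 4
k = 0.05 * rng.normal(size=(c, c, h, w))           # real periodic filter
k[np.arange(c), np.arange(c), 0, 0] += 1.0         # keep the layer well conditioned
x = rng.normal(size=(c, h, w))

K_hat = np.fft.fft2(k)                             # (c_out, c_in, u, v)
x_hat = np.fft.fft2(x)                             # (c_in, u, v)

# Eq. 11: a c x c matrix multiplication at every frequency (u, v).
z_hat = np.einsum("oiuv,iuv->ouv", K_hat, x_hat)
z = np.real(np.fft.ifft2(z_hat))                   # real, since k and x are real

# Eq. 12: the log-determinant is the sum of per-frequency log-determinants.
K_freq = np.transpose(K_hat, (2, 3, 0, 1))         # (u, v, c_out, c_in)
logdet_freq = np.sum(np.log(np.abs(np.linalg.det(K_freq))))

# Brute-force check against the (c*h*w) x (c*h*w) spatial-domain matrix.
def forward(v):
    v_hat = np.fft.fft2(v.reshape(c, h, w))
    out = np.real(np.fft.ifft2(np.einsum("oiuv,iuv->ouv", K_hat, v_hat)))
    return out.reshape(-1)

M = np.stack([forward(np.eye(c * h * w)[i]) for i in range(c * h * w)], axis=1)
assert np.allclose(M @ x.reshape(-1), z.reshape(-1))
assert np.isclose(logdet_freq, np.linalg.slogdet(M)[1])

# Eq. 13: invert every c x c matrix once, then recover x with matrix multiplies.
x_hat_rec = np.einsum("uvio,ouv->iuv", np.linalg.inv(K_freq), z_hat)
assert np.allclose(np.real(np.fft.ifft2(x_hat_rec)), x)
```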

3.3. QR 1 × 1 Convolutions

Standard 1 × 1 convolutions are flexible but may be numerically unstable during optimization, causing crashes in the training procedure. Kingma & Dhariwal (2018) propose to learn a PLU decomposition, but since the permutation matrix is fixed during optimization, its flexibility is limited.

In order to resolve the stability issues while retaining the flexibility of the transformation, we propose to use a QR decomposition. Any real square matrix can be decomposed into a multiplication of an orthogonal and a triangular matrix. In a similar fashion to the PLU parametrization, we stabilize the decomposition by choosing $\mathbf{W} = \mathbf{Q}(\mathbf{R} + \mathrm{diag}(\mathbf{s}))$, where $\mathbf{Q}$ is orthogonal, $\mathbf{R}$ is strictly triangular, and the elements in $\mathbf{s}$ are nonzero. Any orthogonal matrix $\mathbf{Q}$ can be constructed from at most $c$ Householder reflections through $\mathbf{Q} = \mathbf{H}_1 \mathbf{H}_2 \cdots \mathbf{H}_n$, where $\mathbf{H}_i$ is a Householder reflection:

$$\mathbf{H}_i = \mathbf{I} - 2 \frac{\mathbf{v}_i \mathbf{v}_i^{T}}{\|\mathbf{v}_i\|^2} \qquad (14)$$

The vectors $\mathbf{v}_i$ are learnable parameters. Note that in our case the matrices are $c \times c$, where $c$ is the number of channels. In practice, arbitrary flexibility of $\mathbf{Q}$ may be redundant, and we can trade off computational complexity and flexibility by using a smaller number of Householder reflections. The log determinant of the QR decomposition is $\sum_i \log |s_i|$ and can be computed in $\mathcal{O}(c)$. The computational complexity to construct $\mathbf{Q}$ is between $\mathcal{O}(c^2)$ and $\mathcal{O}(c^3)$, depending on the desired flexibility. The QR parametrization has two main advantages: in contrast with the straightforward parameterization it is numerically stable, and it can be completely flexible in contrast with the PLU parametrization.
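A sketch of this parametrization is given below; the exponential parameterization of the diagonal and the random initial values are assumptions made for the example, and fewer than $c$ reflections can be used to reduce computation.

```python
# QR-parameterized 1x1 convolution: Q from Householder reflections (Eq. 14),
# R triangular with a strictly positive diagonal, and a log-determinant that
# depends only on the diagonal of R.
import numpy as np

rng = np.random.default_rng(0)
c = 8                                              # number of channels
n_reflections = c                                  # at most c reflections are needed

# Learnable parameters (here: random initial values).
vs = rng.normal(size=(n_reflections, c))           # Householder vectors v_i
r_diag_log = 0.1 * rng.normal(size=c)              # log of the diagonal of R
r_upper = np.triu(rng.normal(size=(c, c)), k=1)    # strictly upper-triangular part

Q = np.eye(c)
for v in vs:
    H = np.eye(c) - 2.0 * np.outer(v, v) / (v @ v)  # Householder reflection, Eq. 14
    Q = Q @ H

R = r_upper + np.diag(np.exp(r_diag_log))
W = Q @ R

# |det Q| = 1, so log |det W| is the sum of the log-diagonal of R; for a 1x1
# convolution flow this is further multiplied by h * w (cf. Table 1).
assert np.isclose(np.sum(r_diag_log), np.linalg.slogdet(W)[1])
```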

4. Related Work

The field of generative modeling has been approached from several directions. This work mainly builds upon the generative flow methods developed in (Dinh et al., 2017; Kingma & Dhariwal, 2018). Another type of normalizing flow (Papamakarios et al., 2017) also uses autoregressive transformations for density estimation, but both its depth and its number of channels make drawing samples computationally expensive. Our method is compared to these previous works in Section 5 using negative log-likelihood (bits/dim). We do not use inception based metrics, as they do not generalize to different datasets, and they do not report overfitting (Barratt & Sharma, 2018).

Other likelihood-based methods such as PixelCNNs (Van Oord et al., 2016) impose a specific order on the dimensions of the image, which may not reflect the actual generative process. Furthermore, drawing samples tends to be computationally expensive. Alternatively, VAEs (Kingma & Welling, 2014) optimize a lower bound of the likelihood. The likelihood can be evaluated via an importance sampling scheme, but the quality of the estimate depends on the number of samples and the quality of the proposal distribution.

Many non-likelihood-based methods that can generate high-resolution image samples utilize Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Although GANs tend to generate high quality images, they do not directly optimize a likelihood. This makes it difficult to obtain likelihoods and to measure their coverage of the dataset.

5. Experiments

The architecture of Kingma & Dhariwal (2018) is the starting point for the architecture in our experiments. In the flow module, the invertible 1 × 1 convolution can simply be replaced with a periodic or emerging convolution. For a detailed overview of the architecture see Figure 7. We quantitatively evaluate models on a variety of datasets in bits per dimension, which is equivalent to the negative log-likelihood. In addition, we provide image samples generated with periodic convolutions trained on galaxy images, and samples generated with emerging convolutions trained on CIFAR10.

Note that generative models are very computationally expensive in general, and we do not have the computational budget to run extremely high-dimensional image modeling tasks.

Figure 7: Overview of the model architecture. Left and center depict the flow modules we propose, containing either a periodic convolution or an emerging convolution. The diagram on the right shows the entire model architecture, where the flow module is now grouped. The squeeze module reorders pixels by reducing the spatial dimensions by half and increasing the channel depth by four. A hierarchical prior is placed on part of the intermediate representation using the split module, as in (Kingma & Dhariwal, 2018). $\mathbf{x}$ and $\mathbf{z}$ denote input and output. The model has $L$ levels, and $D$ flow modules per level.
5.1. Galaxy Images

Since periodic convolutions assume that image boundaries are connected, they are suited for data where pixels along the boundaries are roughly the same, or are actually connected. An example of such data is pictures taken in space, as they tend to contain some scattered light sources, and boundaries are mostly dark. Ackermann et al. (2018) collected a small classification dataset of galaxies with images of merging and non-merging galaxies. On the non-merging galaxy images, we compare the bits per dimension of three models, constrained by the same parameter budget: 1 × 1 convolutions (Glow), periodic and emerging convolutions (see Table 2). Experiments show that both our periodic and emerging convolutions significantly outperform 1 × 1 convolutions, and their performance is less sensitive to initialization. Samples of the model using periodic convolutions are depicted in Figure 8.

Galaxy
1 × 1 (Glow) 2.03
Periodic 1.98
Emerging 1.98
Table 2: Comparison of 1 × 1, periodic and emerging convolutions on the galaxy images dataset. Performance is measured in bits per dimension. Results are obtained by running 3 times with different random seeds; ± reports the standard deviation.
Figure 8: 100 samples from a generative flow model utilizing periodic convolutions, trained on the galaxy images dataset.
Figure 9: 100 samples from a generative flow model utilizing emerging convolutions, trained on CIFAR10.
5.2. CIFAR10 and ImageNet

The performance of emerging convolutions is extensively tested on CIFAR10 and ImageNet, with different architectural sizes. The experiments in Table 3 use the architecture from Kingma & Dhariwal (2018), where emerging convolutions replace the 1 × 1 convolutions. Emerging convolutions perform either on par with or better than Glow (the CIFAR10 performance of Glow was obtained by running the code from the original github repository), which may be caused by the overparameterization of these large models. Samples of the model using emerging convolutions are depicted in Figure 9.

CIFAR10 ImageNet 32x32 ImageNet 64x64
Real NVP 3.51 4.28 3.98
Glow 3.36 4.09 3.81
Emerging 3.34 4.09 3.81
Table 3: Performance of emerging convolutions on CIFAR10, ImageNet 32x32 and ImageNet 64x64 in bits per dimension (negative log-likelihood); ± reports the standard deviation.
Model CIFAR10 ImageNet 32x32 D
1 × 1 (Glow) 3.46 4.18 8
Emerging 3.43 4.16 8
1 × 1 (Glow) 3.56 4.28 4
Emerging 3.51 4.25 4
Table 4: Performance of generative flow models with different architectures (D flow modules per level) on CIFAR10 and ImageNet 32x32 in bits per dimension. Results are obtained by running 3 times with different random seeds; ± reports the standard deviation.

In some cases, it may not be feasible to run very large models in production because of the large computational cost. Therefore, it is interesting to study the behavior of models when they are constrained in size. We compare 1 × 1 and emerging convolutions with the same number of flows per level ($D$), for $D = 8$ and $D = 4$. Both on CIFAR10 and ImageNet, we observe that models using emerging convolutions perform significantly better. Furthermore, for smaller models the contribution of emerging convolutions becomes more important, as evidenced by the increasing performance gap (see Table 4).

5.3. Sampling Speed

Recall that the inverse of autoregressive convolutions requires solving a sequential problem, which we have accelerated with an inversion module that uses Cython and parallelism across the minibatch. Considering CIFAR10 and the same architecture as used in Table 3, it takes 39 ms to sample an image using our accelerated emerging inverses, 46 times faster than the naïvely obtained inverses using Tensorflow bijectors (see Table 5). As expected, sampling from models using 1 × 1 convolutions remains faster and takes 5 ms.

Masked Autoregressive Flows (MAFs) are a very flexible method for density estimation, and they improve performance over emerging convolutions slightly: 3.33 versus 3.34 bits per dimension. However, the width and depth of MAFs make them a poor choice for sampling, because they considerably increase the time to compute their inverse: 3000 ms per sample using a naïve solution, and 650 ms per sample using our inversion module. Since emerging convolutions operate on the lower-dimensional feature maps of the flow, they are 17 times faster to invert than the MAFs.

CIFAR10 | bits/dim | Naïve sample (ms) | Accelerated sample (ms)
1 × 1 (Glow) | 3.36 | 5 | 5
MAF | 3.33 | 3000 | 650
Emerging | 3.34 | 1800 | 39
Table 5: Comparison of 1 × 1, MAF and emerging convolutions on CIFAR10. Performance is measured in bits per dimension, and in the time required to sample a datapoint, when computed in minibatches of size 100. The naïve implementation uses Tensorflow bijectors, and our accelerated implementation uses Cython with MPI parallelization.
5.4. QR Parametrization

QR 1 × 1 convolutions are compared with standard and PLU-parameterized 1 × 1 convolutions on the CIFAR10 dataset. The models have 3 levels and 8 flows per level. Experiments confirm that our stable QR decomposition achieves the same performance as the standard parameterization, as shown in Table 6. This is expected, since any real square matrix has a QR decomposition. Furthermore, the experiments confirm that the less flexible PLU parameterization leads to worse performance, which is caused by the fixed permutation matrix.

Parametrization CIFAR10
W 3.46
PLU 3.47
QR 3.46
Table 6: Comparison of standard (W), PLU and QR parameterized 1 × 1 convolutions. Performance is measured in bits per dimension (negative log-likelihood). Results are obtained by running 3 times with different random seeds; ± reports the standard deviation.
6. Conclusion

We have introduced three generative flows: i) emerging convolutions as invertible standard zero-padded convolutions, ii) periodic convolutions for periodic data or data with minimal boundary variation, and iii) stable and flexible 1 × 1 convolutions using a QR parametrization. Our methods show consistent improvements over various datasets using the same parameter budget, especially when considering models constrained in size.

References

  • Ackermann et al. (2018) Ackermann, S., Schawinski, K., Zhang, C., Weigel, A. K., and Turp, M. D. Using transfer learning to detect galaxy mergers. Monthly Notices of the Royal Astronomical Society, 2018.
  • Barratt & Sharma (2018) Barratt, S. and Sharma, R. A note on the inception score. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, ICLR, 2017.
  • Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
  • Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Stochastic gradient vb and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR, 2014.
  • Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
  • Li & Grathwohl (2018) Li, X. and Grathwohl, W. Training glow with constant memory cost. NIPS Workshop on Bayesian Deep Learning, 2018.
  • Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
  • Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pp. 1–10, 2016.
  • van den Oord et al. (2016) van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.
  • Van Oord et al. (2016) Van Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756, 2016.
A. Experimental Details

Models are optimized with settings identical to (Kingma & Dhariwal, 2018). The Adamax optimizer is used with a learning rate of 0.001. Coupling layers contain neural networks with three convolution layers: the first and last convolution are 3 × 3, and the center convolution is 1 × 1. The two hidden layers have a number of channels given by the width column in the tables below, and ReLU activations. A flow module consists of an actnorm layer, a mixing layer and a coupling layer. A mixing layer is either a 1 × 1 convolution (Glow) or an emerging or periodic convolution. A forward pass of the entire model uses $D$ flow modules per level and $L$ levels. At the end of a level the squeeze operation reduces the spatial dimensions by two, and increases the channel dimensions by four.

Model L D width batchsize
1 × 1 (Glow) 3 8 512 256
Periodic 3 8 510 256
Emerging 3 8 510 256
Table 7: Experimental setup on galaxy image modeling.

In the MAF model, the 1 × 1 convolutions are succeeded (not replaced) by two MAF layers with opposite autoregressive order. MAF layers have the same structure as coupling layers, with the difference that the two hidden layers have 96 channels.

Model L D width batchsize
Glow 3 32 512 256
MAF 3 32 462 256
Emerging 3 32 512 256
Table 8: Experimental setup on CIFAR10.

The models constrained in size are tested on CIFAR10 and have a different number of flows ($D$) per level. The other hyperparameters remain unchanged.

Model L width batchsize
Glow 3 512 256
Emerging 3 507 256
Table 9: Experimental setup with constrained model size.
Experiment Dataset L D width batchsize
Glow ImageNet 32 3 48 512 256
Emerging ImageNet 32 3 48 512 256
Glow ImageNet 64 3 48 512 128
Emerging ImageNet 64 3 48 512 128
Table 10: Experimental setup on ImageNet.

The experiments that compare parameterizations of the 1 × 1 convolutions all use the same hyperparameters, with $L = 3$ levels and $D = 8$ flow modules per level.
