FIS-GAN: GAN with Flow-based Importance Sampling
In most cases, the training process of Generative Adversarial Networks (GAN) applies uniform or Gaussian sampling in latent space, which probably spends most of the computation on examples that are already handled properly and are easy to generate.
Theoretically, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training examples. In this paper, we explore the possibility of adapting importance sampling to adversarial learning. We use importance sampling to replace uniform and Gaussian sampling in latent space, and we combine a normalizing flow with importance sampling to approximate the posterior distribution in latent space by density estimation.
Empirically, results on MNIST and Fashion-MNIST demonstrate that our method significantly accelerates the convergence of the generative process while retaining visual fidelity in generated samples.
We have witnessed rapid and thriving improvement in the high-resolution images generated by Generative Adversarial Networks (GAN). Ever since the original GAN was proposed, many variants of the original prototype have appeared in the past five years.
The dramatic increase in available training data has made the use of Deep Neural Networks (DNN) feasible, which in turn has significantly improved the state-of-the-art in many fields, in particular Computer Vision and Natural Language Processing. However, due to the complexity of the resulting optimization problem, computational cost is now the core issue in training these large architectures. When training such models, it is apparent to any practitioner that not all samples are equally important; many of them are properly handled after a few epochs of training, and most could be ignored at some point without impacting the final model. Furthermore, a trained GAN tends to generate examples that are much easier to learn, which affects the quality of generation.
Overall, our contributions are as follows:
We combined importance sampling methods with Generative Adversarial Networks (GAN) to accelerate the training process of the generator, so that noise sampling in latent space focuses more on instances with a larger Jacobian matrix norm.
We applied a conditional Gaussian distribution to build up the importance density for each instance in a batch, which encodes the information from the importance sampling distribution, and combined it with deep density estimation models such as normalizing flows.
We validated our framework's performance on some widely used GAN models and ran experiments on MNIST and Fashion-MNIST, which show that our proposed method can significantly improve generative quality and metric indices.
We reduced the number of training epochs required to converge by replacing the uniform and Gaussian sampling methods with importance sampling.
We begin with a review of the building blocks of our model, followed by an elaboration of our approach.
2.1 Generative Adversarial Networks
Generative Adversarial Networks (Goodfellow et al. 2014) is a framework for training deep generative models using a minimax game. The goal of GAN is to learn a generator distribution $p_g$ from latent space that matches the real data distribution $p_{data}$. The GAN framework includes a generator network $G$ that draws samples from the generator distribution $p_g$ by transforming a noise variable $z \sim p_z$ into a sample $G(z)$, instead of assigning probability explicitly to every $x$ in the data distribution. This generator is trained by playing against an adversarial discriminator network $D$ that intends to distinguish between samples from the generator's distribution $p_g$ and the true data distribution $p_{data}$. So for a given generator, the optimal discriminator is
$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}.$$
More specifically, the minimax game can be given by the following expression:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].$$
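As a small numerical illustration of the minimax objective of Goodfellow et al. (2014), the sketch below evaluates the value function $V(D, G) = \mathbb{E}_{p_{data}}[\log D(x)] + \mathbb{E}_{p_g}[\log(1 - D(x))]$ at the optimal discriminator $D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))$. The three-outcome discrete distributions and all function names are assumptions made purely for illustration; they are not part of the proposed method.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each discrete outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_function(d, p_data, p_g):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    v = 0.0
    for dx, pd, pg in zip(d, p_data, p_g):
        if pd > 0:
            v += pd * math.log(dx)
        if pg > 0:
            v += pg * math.log(1.0 - dx)
    return v


# Toy distributions over three outcomes (illustrative values only).
p_data = [0.5, 0.3, 0.2]
p_g_mismatched = [0.1, 0.1, 0.8]
p_g_matched = [0.5, 0.3, 0.2]

v_mismatched = value_function(
    optimal_discriminator(p_data, p_g_mismatched), p_data, p_g_mismatched
)
v_matched = value_function(
    optimal_discriminator(p_data, p_g_matched), p_data, p_g_matched
)
```

When the generator distribution matches the data distribution, $D^*$ equals 1/2 everywhere and the value reaches its minimum $-\log 4$; any mismatch yields a strictly larger value.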
2.2 Normalizing Flow
Normalizing flows transform simple densities into rich complex distributions that can be used for generative models.
We can chain any number of bijectors together, much like we chain layers together in a neural network. This construction is known as a “normalizing flow”. Each bijector function is a learnable “layer”, and an optimizer can learn the parameters of the transformation to suit the data distribution that we are trying to model.
In normalizing flows, we wish to map simple distributions (easy to sample from and to evaluate densities for) to complex ones (learned from data).
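To make the chaining of bijectors concrete, the following minimal sketch composes two one-dimensional bijectors and evaluates the resulting density with the change-of-variables formula. The classes `Affine` and `Exp` and the function `flow_logpdf` are hypothetical names with fixed (not learned) parameters, chosen only so that the result can be checked against the closed-form log-normal density.

```python
import math


class Affine:
    """Bijector y = a * x + b with a != 0."""

    def __init__(self, a, b):
        self.a = a
        self.b = b

    def forward(self, x):
        return self.a * x + self.b

    def inverse(self, y):
        return (y - self.b) / self.a

    def log_abs_det_inverse(self, y):
        # d/dy of (y - b) / a is 1 / a.
        return -math.log(abs(self.a))


class Exp:
    """Bijector y = exp(x), mapping the real line to (0, inf)."""

    def forward(self, x):
        return math.exp(x)

    def inverse(self, y):
        return math.log(y)

    def log_abs_det_inverse(self, y):
        # d/dy of log(y) is 1 / y.
        return -math.log(y)


def std_normal_logpdf(x):
    return -0.5 * (x * x + math.log(2.0 * math.pi))


def flow_logpdf(y, bijectors):
    """Undo the bijectors last-to-first, accumulating log|det| of each inverse."""
    log_det = 0.0
    for b in reversed(bijectors):
        log_det += b.log_abs_det_inverse(y)
        y = b.inverse(y)
    return std_normal_logpdf(y) + log_det


# y = exp(0.5 * x + 1) with x ~ N(0, 1) is log-normal with mu = 1, sigma = 0.5.
flow = [Affine(0.5, 1.0), Exp()]
log_density_at_2 = flow_logpdf(2.0, flow)
```

In a trained flow the constants `a` and `b` would be replaced by learnable parameters and the sum of log-determinants would be maximized over the data.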
2.3 Importance Sampling
The technique of Importance Sampling (IS) can be used to improve sampling efficiency. The basic idea of importance sampling is quite straightforward: instead of sampling from the nominal distribution, we draw samples from an alternative distribution and assign an appropriate weight to each sample. Most recent works focus on its application to stochastic gradient descent.
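Let $(x_i, y_i)$ be the $i$-th input-output pair from the training set, $\Psi(\cdot; \theta)$ a Deep Learning model parameterized by the vector $\theta$ and $\mathcal{L}(\cdot, \cdot)$ the loss function to be minimized during training. The goal of training is to find
$$\theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\Psi(x_i; \theta), y_i)$$
where $N$ corresponds to the number of examples in the training set. Using Stochastic Gradient Descent with learning rate $\eta$ we can iteratively update the parameters of our model, between two consecutive iterations $t$ and $t+1$ with
$$\theta_{t+1} = \theta_t - \eta \alpha_i \nabla_{\theta_t} \mathcal{L}(\Psi(x_i; \theta_t), y_i)$$
where $i$ is the index of the example sampled at iteration $t$ and $\alpha_i$ is its sample weight.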
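The effect of non-uniform sampling on a stochastic gradient step can be checked on a toy problem. The sketch below assumes a scalar parameter with per-example loss $\frac{1}{2}(\theta - x_i)^2$, samples example $i$ with probability $p_i$ proportional to its gradient magnitude, and reweights by $1/(N p_i)$, the standard choice that keeps the estimate unbiased. The data values and function names are illustrative assumptions.

```python
def per_example_grads(theta, xs):
    """Gradient of 0.5 * (theta - x_i)^2 with respect to theta, for each example."""
    return [theta - x for x in xs]


def estimator_mean_and_variance(grads, probs):
    """Exact mean and variance of the reweighted one-sample gradient estimate."""
    n = len(grads)
    # Sampling example i with probability p_i and weighting by 1 / (n * p_i).
    estimates = [g / (n * p) for g, p in zip(grads, probs)]
    mean = sum(p * e for p, e in zip(probs, estimates))
    var = sum(p * (e - mean) ** 2 for p, e in zip(probs, estimates))
    return mean, var


xs = [0.0, 0.1, -0.1, 5.0]   # one example is far from the others
theta = 1.0
grads = per_example_grads(theta, xs)

uniform = [1.0 / len(xs)] * len(xs)
total_norm = sum(abs(g) for g in grads)
importance = [abs(g) / total_norm for g in grads]

mean_uniform, var_uniform = estimator_mean_and_variance(grads, uniform)
mean_is, var_is = estimator_mean_and_variance(grads, importance)
```

Both samplers have the same expected gradient, while the distribution proportional to the gradient norm gives the smaller variance on this example.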
3 Acceleration for GAN Based on Flow and Importance Sampling
We set a frequency for the normalizing-flow-based density estimation and importance sampling process during training iterations. At a flow and importance sampling step, we divide the noise generation process for a batch into four parts: importance density update, density estimation through the normalizing flow, noise re-sampling (importance sampling) from the approximated distribution, and Jacobian matrix norm computation together with the parameter update of the generator and the discriminator. Other iterations only include noise re-sampling and the parameter update of the generator and the discriminator through the minimax game.
In the importance density update, we feed the Jacobian matrix norms from the last batch before the current iteration into the update of the density in latent space. Based on the instances in latent space and their per-example Jacobian matrix norms, we construct a conditional Gaussian distribution around every instance and then sample new noise points, the number of which is proportional to the Jacobian matrix norm of each instance.
After the importance value update, the instance density in latent space carries the information of the last batch instances' Jacobian matrix norms. We apply flow-based models to approximate the posterior importance value distribution in latent space. After computing the approximated importance value distribution with the normalizing flow, we re-sample noise in latent space from the approximated distribution and calculate the per-example Jacobian matrix norm of these sampled instances as they go through the generator. Meanwhile, we optimize the parameters of our network framework. For the following batches, we just repeat the noise re-sampling and Jacobian matrix norm computation until the next flow update iteration. After running the number of batches set by the update frequency, we repeat these steps to update the importance value distribution in latent space.
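The control flow described in this section can be sketched as a schedule that lists which steps run at each iteration. The function name, the step labels and the parameter `flow_every` are hypothetical; skipping the flow update at the very first iteration (when no Jacobian norms from a previous batch exist yet) is an assumption of this sketch.

```python
def training_schedule(num_iters, flow_every):
    """List, for each iteration, the steps that the described procedure runs."""
    schedule = []
    for t in range(num_iters):
        steps = []
        if t > 0 and t % flow_every == 0:
            # Flow and importance sampling step: all four parts run.
            steps.append("importance_density_update")
            steps.append("flow_density_estimation")
            steps.append("noise_resampling")
            steps.append("jacobian_norm_and_gan_update")
        else:
            # Ordinary step: re-sample noise from the current flow, update G and D.
            steps.append("noise_resampling")
            steps.append("gan_update")
        schedule.append(steps)
    return schedule


schedule = training_schedule(num_iters=6, flow_every=3)
```

With `flow_every=3`, only iteration 3 of the six iterations runs the four-part step; all others run the two-part step.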
3.1 Importance Density Update
Importance sampling prioritizes training examples for SGD in a standardized way. This technique suggests that each example sampled in latent space can be assigned a probability proportional to the norm of the generator's Jacobian matrix at that example. This distribution both prioritizes challenging examples and minimizes the variance of the stochastic gradient estimate. To apply importance sampling to GAN's noise generation, we have to approximate a posterior distribution for the importance value over the whole latent space, based on the Jacobian matrix norms of the last batch of examples.
Before applying techniques for approximating the posterior distribution of the importance value, we first define an importance value for each data point sampled during the last batch. This value is computed from the matrix norm of the generator Jacobian evaluated at that point in that batch.
Thus, every instance of the last batch in latent space is given prior knowledge in the form of its Jacobian matrix norm. Most non-parametric estimation algorithms for high-dimensional data are not suitable for a density given by discrete probabilities, and interpolation methods are likely to perform badly when the data dimension is over 50 (Wang and Scott, 2017). We therefore transform the importance value of each example into density information by constructing a conditional Gaussian distribution and sampling new data for each example in latent space.
Consider the set of last-batch instances together with their importance values. Around each instance we construct a random variable that follows a conditional Gaussian distribution, whose mean is the latent-space instance itself and whose covariance is diagonal, with its trace set by a proportionality rule. The total number of new samples is fixed in advance, and for every random variable a share of these samples is drawn to change the density near its originating instance. After this sampling process, the importance value derived from each instance's Jacobian matrix norm has been transformed into density information in latent space, which is then used for density estimation on the augmented data.
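A minimal sketch of this density update is given below. It allocates a fixed budget of new samples to instances in proportion to their importance values and draws each new sample from a Gaussian centred on its instance. The function name, the largest-remainder rounding and the fixed isotropic standard deviation `sigma` are assumptions of this sketch; the text only specifies that the covariance is diagonal.

```python
import random


def importance_density_update(instances, importance, total_new, sigma=0.1, seed=0):
    """Return per-instance sample counts and the augmented latent points."""
    rng = random.Random(seed)
    total = sum(importance)
    # Counts proportional to importance; largest-remainder rounding keeps the sum exact.
    raw = [total_new * w / total for w in importance]
    counts = [int(r) for r in raw]
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_remainder[: total_new - sum(counts)]:
        counts[i] += 1
    augmented = []
    for z, c in zip(instances, counts):
        for _ in range(c):
            # Diagonal Gaussian centred at the instance (sigma is an assumed constant).
            augmented.append([rng.gauss(zj, sigma) for zj in z])
    return counts, augmented


# Three latent instances with illustrative Jacobian-norm importance values.
instances = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
importance = [1.0, 2.0, 7.0]
counts, augmented = importance_density_update(instances, importance, total_new=100)
```

The augmented point cloud is denser around instances with larger importance values, so a density estimator fitted to it inherits that preference.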
3.2 Density Estimation through Flow
A normalizing flow model is constructed as an invertible transformation $f$ that maps observed data $z$ in latent space to a standard Gaussian latent variable $\epsilon = f(z)$, as in non-linear independent component analysis. Stacking individual simple invertible transformations is the key idea in the design of a flow model. Explicitly, $f$ is constructed from a series of invertible flows as $f = f_L \circ \cdots \circ f_1$, with each $f_i$ having a tractable Jacobian determinant. This way, sampling is efficient, as it can be performed by computing $z = f^{-1}(\epsilon) = f_1^{-1} \circ \cdots \circ f_L^{-1}(\epsilon)$ for $\epsilon \sim \mathcal{N}(0, I)$, and so is training by maximum likelihood, since the model density $\log p(z) = \log \mathcal{N}(f(z); 0, I) + \sum_{i=1}^{L} \log \left| \det \frac{\partial f_i}{\partial f_{i-1}} \right|$ is easy to compute and differentiate with respect to the parameters of the flows (Ho et al. 2019).
While computing the Jacobian determinant, in order to make sure the matrices are non-singular, we set thresholds for adding stochastic perturbation to balance computational complexity and precision.
The trained flow model can be considered the maximum a posteriori estimate of the instances' importance values in latent space.
3.3 Noise Re-Sampling (Importance Sampling) and Jacobian Matrix Norm Computation
After constructing the normalizing flow for the posterior distribution, we can sample the noise of a new batch from the standard Gaussian distribution $\mathcal{N}(0, I)$.
Then, we pass the sampled batch through the trained normalizing flow and thus acquire samples in the latent space of the generator.
After sampling the noise of the new batch from the approximated distribution, we feed the sampled data into the generator to generate images.
Here we compute the Jacobian matrix per example from the derivatives of the output variables with respect to the latent space variables, so its shape is given by the dimensionalities of the output space and the latent space. The norm of the Jacobian matrix for every example of a minibatch contains the importance value of that example, which can be used for building the next importance density during another batch.
Then we compute the matrix norm of the Jacobian for each example; Section 4.2 compares the Frobenius norm and the nuclear norm for this purpose.
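The per-example Jacobian norm can be illustrated with a small stand-in generator. The sketch below approximates the Jacobian by central finite differences and takes its Frobenius norm. The function `toy_generator` (mapping a 2-dimensional latent vector to a 3-dimensional output) is a hypothetical placeholder for a real generator network, and a practical implementation would more likely use automatic differentiation.

```python
import math


def toy_generator(z):
    """Hypothetical generator from R^2 to R^3, for illustration only."""
    z1, z2 = z
    return [math.tanh(z1 + 2.0 * z2), z1 * z2, math.sin(z1)]


def jacobian(g, z, eps=1e-6):
    """Central finite-difference Jacobian with J[i][j] = d g_i / d z_j."""
    m = len(g(z))
    n = len(z)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        g_plus = g(z_plus)
        g_minus = g(z_minus)
        for i in range(m):
            J[i][j] = (g_plus[i] - g_minus[i]) / (2.0 * eps)
    return J


def frobenius_norm(J):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in J for v in row))


# One importance value per latent instance in the batch.
batch = [[0.0, 0.0], [1.0, -0.5], [2.0, 1.0]]
norms = [frobenius_norm(jacobian(toy_generator, z)) for z in batch]
```

At the origin the analytic Jacobian of this toy generator is [[1, 2], [0, 0], [1, 0]], so the first norm equals the square root of 6 up to finite-difference error.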
In this section, we evaluate the method proposed above. Our experiment consists of three parts: baseline tests on MNIST and Fashion-MNIST, a comparison of the effects of different flows on GAN acceleration, and a comparison of different matrix norms for importance measurement. By comparing the Fréchet inception distance (FID) at different time steps, we found that flow-based importance sampling could significantly accelerate GAN training.
4.1 Acceleration Test
In this section, we evaluated the performance of our proposed method (FIS-GAN) on various datasets. We believe that FIS-GAN, like other optimization acceleration methods based on importance sampling, can accelerate the training process of GAN. To measure the performance of different GANs, we select FID as the evaluation metric: we regularly draw samples from the generator and evaluate their FID. We keep all GANs in the same architecture; FIS-GAN additionally has flow-based importance sampling acceleration. More details can be found in the Appendix.
For quantitative experiments, we calculated FID every 100 iterations and compared the FID changes over the first 20,000 iterations. For qualitative experiments, we calculated FID every 1000 iterations, running a total of 200,000 iterations. The images of the qualitative experiment are generated by the model with the lowest FID. We used the Adam optimizer for the generator, discriminator and flow with learning rates 1e-3, 1e-4 and 1e-3, respectively.
We tested FIS-GAN on MNIST and selected vanilla GAN as the baseline. We compared FIS-GAN with the baseline under two flow update frequencies to verify the improvement in convergence of FIS-GAN over GAN. The diagram shows the two flow update modes. The more frequent one updates the flow every 10 steps, training it for 5 epochs each time; the other updates the flow every 50 steps, training it for 50 epochs. Because our flow is constructed in a small latent space, usually 64d or 128d, the training time of the flow is shorter than that of the GAN. The experimental results show that FIS-GAN achieves a better FID than the baseline model in both update modes. Further training details can be found in the following figures.
At the same time, the loss curve also shows that, with the flow modeling the generator's latent space, more difficult samples have been better trained, so that the generator loss decreases significantly.
Experiments on Fashion-MNIST reveal more interesting facts: FIS-GAN achieves more robust training than the baseline model. Generally speaking, GAN's first 5,000 steps on Fashion-MNIST bring the greater improvement, while later training is relatively stable before mode collapse. Therefore, we focus on the details of the beginning of training. As can be seen from the figure, FIS-GAN enters the stable training interval faster, and the FID curve of its generated images is smoother.
4.2 Ablation on Matrix Norm
It is noteworthy that the choice of matrix norm used to calculate the importance values also affects the FIS-GAN model. In this section, we evaluated two different matrix norms, the nuclear norm and the Frobenius norm. The parameters of the comparative tests are identical, except for the matrix norm of the Jacobian used in the importance value.
For calculating the matrix norm of the importance index, we compared the performance of the Frobenius norm and the nuclear norm on MNIST. The results over the first 2,000 steps show that the Frobenius and nuclear norms perform similarly and are both superior to the baseline models. Because the computational complexity of the nuclear norm is higher than that of the Frobenius norm, we recommend using the Frobenius norm.
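For reference, the two norms have the following standard definitions for a Jacobian $J \in \mathbb{R}^{m \times n}$ of rank $r$ with singular values $\sigma_1, \dots, \sigma_r$ (the symbols $m$, $n$, $r$ and $\sigma_k$ are notation assumed here):

```latex
\begin{aligned}
\|J\|_F &= \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} J_{ij}^2} = \sqrt{\sum_{k=1}^{r} \sigma_k^2}, \\
\|J\|_* &= \sum_{k=1}^{r} \sigma_k, \\
\|J\|_F &\le \|J\|_* \le \sqrt{r}\, \|J\|_F .
\end{aligned}
```

The Frobenius norm needs only a single pass over the $mn$ entries, whereas the nuclear norm requires the singular values, which generally means a singular value decomposition of each per-example Jacobian. The last inequality shows that the two norms differ by at most a factor of $\sqrt{r}$.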
4.3 Ablation on Flow Type
Given the variety of flows, we selected several important flows to construct the latent space distribution, including Real-NVP, MAF and IAF. The parameters of these models are the same, but they construct invertible transformations in different ways. Real-NVP constructs flows from invertible affine coupling transformations. MAF and IAF are dual autoregressive flows: MAF is fast in training (density evaluation), while IAF is fast in sampling.
The effects of the structural prior of the invertible transformation on FIS-GAN training were demonstrated by ablation experiments with different flows. The figure above shows FIS-GAN with different flows over the first 2,000 steps on MNIST, sampled at an interval of 100 steps. FIS-GAN based on Real-NVP, MAF and IAF is superior to the baseline model in each case, and each shows some different characteristics. Considering the training speed and sampling speed, we recommend Real-NVP. In general, however, MAF brings the most significant improvement for FIS-GAN.
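As an illustration of how an affine coupling transformation stays invertible with a cheap log-determinant, the sketch below implements a single Real-NVP-style coupling layer. The function `scale_and_shift` is a hypothetical fixed stand-in for the learned scale and translation networks, and the layer assumes an even input dimension.

```python
import math


def scale_and_shift(x_a):
    """Hypothetical stand-in for the learned networks s(.) and t(.)."""
    s = [math.tanh(0.5 * v) for v in x_a]
    t = [0.3 * v + 0.1 for v in x_a]
    return s, t


def coupling_forward(x):
    """y_a = x_a, y_b = x_b * exp(s(x_a)) + t(x_a); returns y and log|det J|."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    s, t = scale_and_shift(x_a)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    # The Jacobian is triangular, so its log-determinant is the sum of the scales.
    return x_a + y_b, sum(s)


def coupling_inverse(y):
    """Invert the layer; s and t are recomputed from the untouched half."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    s, t = scale_and_shift(y_a)
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a + x_b


x = [0.2, -1.0, 0.7, 1.5]
y, log_det = coupling_forward(x)
x_reconstructed = coupling_inverse(y)
```

Because the first half passes through unchanged, the inverse never needs to invert `scale_and_shift` itself, which is why both directions of such a layer are equally cheap.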
5 Related Work
GAN: Generative Adversarial Nets (Goodfellow et al. 2014) are a group of implicit generative models, since GAN can only generate samples while the likelihood cannot be directly calculated. The adversarial training is usually not very stable, finally falling into mode collapse after a long training process. To tackle this problem, WGAN (Arjovsky et al. 2017) and WGAN-GP (Gulrajani et al. 2017) introduced optimal transport into GAN, leading to the 1-Lipschitz condition for the discriminator. The numerics of GAN (Mescheder et al. 2017) revealed the key role of the eigenvalues of the Jacobian in the optimization. SNGAN (Miyato et al. 2018) further developed spectral normalization and imposed Lipschitz continuity on the discriminator. Up to now, BigGAN (Brock et al. 2018) has achieved state-of-the-art performance in GAN-based image generation with large-scale training. However, GAN still suffers from long training time and the missing mode issue, partially due to unbalanced training, in which some easy samples are trained so much that the generator falls into a collapsed mode.
Flow: Flow-based models are powerful neural distributions, typically constructed from invertible neural networks with a tractable Jacobian. The change of variables formula enables a direct training process via maximum likelihood estimation (MLE), and invertibility allows sampling from latent space after MLE training. NICE (Dinh et al. 2014), Real-NVP (Dinh et al. 2016), MAF (Papamakarios et al. 2017) and IAF (Kingma et al. 2016) employed a trivial analytic inverse with a triangular Jacobian matrix. NAF (Huang et al. 2018), BNAF (De Cao, Titov, and Aziz 2019) and NSF (Durkan et al. 2019) outperformed large models in density estimation with different neural autoregressive transforms. Glow (Kingma and Dhariwal 2018) introduced matrix multiplication with LU decomposition, taking the place of random dimension permutation. i-ResNet (Behrmann, Duvenaud, and Jacobsen 2018), Residual Flow (Chen et al. 2019) and MintNet (Song, Meng, and Ermon 2019) considered fixed-point iterations as the inverse, setting state-of-the-art records in bits per dimension. Meanwhile, Flow-GAN (Grover, Dhar, and Ermon 2018) first attempted to combine flow and adversarial training in the form of an invertible generator.
Admittedly, flow is less competitive than GAN in image generation. However, flow can directly access the likelihood, which is very attractive for density estimation. Maximum likelihood estimation overcomes the missing mode issue: if a flow is trained successfully, it is expected to cover all possible modes in the data distribution. This is a much harder task than the one GAN faces, because a generator that maps all latent variables to the same high-fidelity sample can easily escape the penalty from the discriminator, which means mode collapse.
Importance Sampling: Originally, importance sampling was widely studied for solving convex optimization problems (Bordes et al. 2005). Inspired by the way human children learn, Bengio et al. (2009) designed a sampling scheme in which they provided the network with examples of increasing difficulty in an arbitrary task.
Prior strategies consider importance sampling for speeding up deep learning. Recent works (Zhao and Zhang, 2015; Needell et al., 2014) connected importance sampling with stochastic gradient descent gradient estimation variance, which shows that the optimal sampling distribution should be proportional to the per sample gradient norm.
More closely related to our work, Schaul et al. (2015) and Loshchilov and Hutter (2015) use the loss to create the sampling distribution. Both approaches require keeping a history of losses for previously seen samples, and sample either proportionally to the loss or based on the loss ranking. One of the main limitations of history-based sampling is the need to tune a large number of hyperparameters that control the effects of “stale” importance scores. For deep models, other algorithms provide closer approximations of gradient norms (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018).
In comparison to all the related works above, our method builds a framework for continuous latent space importance sampling for GAN. Its most significant novelty is our dynamical perspective. When a GAN generator draws noise from latent space, it will not sample the instances that have been generated in past batches, which requires a kind of approximation of the posterior distribution. It is the adaptation of importance sampling to the GAN framework that distinguishes this paper.
This paper studied a GAN framework with importance sampling and normalizing flows that accelerates the training convergence of networks. Specifically, we constructed an importance density based on importance sampling methods, let a normalizing flow approximate the posterior distribution in latent space, and let the generator draw noise from the approximated distribution of importance values, which leads to better performance on GAN metrics during the early epochs.
-  Arjovsky, M., Chintala, S. and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017
-  Behrmann, J., Duvenaud, D. and Jacobsen, J.H. Invertible residual networks. arXiv preprint arXiv:1811.00995.
-  Brock, A., Donahue, J. and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
-  Chen, R.T., Behrmann, J., Duvenaud, D. and Jacobsen, J.H. Residual Flows for Invertible Generative Modeling. arXiv preprint arXiv:1906.02735.
-  De Cao, N., Titov, I. and Aziz, W. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676.
-  Dinh, L., Krueger, D. and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
-  Dinh, L., Sohl-Dickstein, J. and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
-  Durkan, C., Bekasov, A., Murray, I. and Papamakarios, G. Neural Spline Flows. arXiv preprint arXiv:1906.04032.
-  Johnson, T. and Guestrin, C. Training Deep Models Faster with Robust, Approximate Importance Sampling. In Advances in neural information processing systems. 2018
-  Kingma, D.P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems pp. 10215-10224. 2018
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems pp. 2672-2680. 2014
-  Grover, A., Dhar, M. and Ermon, S. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In Thirty-Second AAAI Conference on Artificial Intelligence. 2018
-  Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. and Courville, A.C. Improved training of wasserstein gans. In Advances in neural information processing systems pp. 5767-5777. 2017
-  Ho, J., Chen, X., Srinivas, A., Duan, Y. and Abbeel P. Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. arXiv preprint arXiv:1902.00275.
-  Huang, C.W., Krueger, D., Lacoste, A. and Courville, A. Neural autoregressive flows. arXiv preprint arXiv:1804.00779.
-  Katharopoulos, A. and Fleuret, F. Not All Samples Are Created Equal: Deep Learning with Importance Sampling. In International Conference on Machine Learning, 2018
-  Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I. and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems pp. 4743-4751. 2016
-  Mescheder, L., Nowozin, S. and Geiger, A. The numerics of gans. In Advances in Neural Information Processing Systems pp. 1825-1835. 2017
-  Miyato, T., Kataoka, T., Koyama, M. and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
-  Song, Y., Meng, C. and Ermon, S. MintNet: Building Invertible Neural Networks with Masked Convolutions. arXiv preprint arXiv:1907.07945.
-  Wang, Z. and Scott, D. Nonparametric Density Estimation for High-Dimensional Data - Algorithms and Applications. arXiv preprint arXiv:1904.00176