EvolGAN: Evolutionary Generative Adversarial Networks
We propose to use a quality estimator and evolutionary methods to search the latent space of generative adversarial networks trained on small datasets, difficult datasets, or both. The new method leads to the generation of significantly higher-quality images while preserving the original generator's diversity. Human raters preferred an image from the new version with frequency 83.7% for Cats, 74% for FashionGen, 70.4% for Horses, and 69.2% for Artworks, with minor improvements for the already excellent GANs for faces. This approach applies to any quality scorer and any GAN generator.
Generative adversarial networks (GANs) are the state-of-the-art generative models in many domains. However, they need large amounts of training data to reach decent performance. Using off-the-shelf image quality estimators, we propose a novel but simple evolutionary modification for making them more reliable on small, difficult, or multimodal datasets. Contrary to previous approaches using evolutionary methods for image generation, we do not modify the training phase. We use a generator mapping a latent vector z to an image G(z), built as in a classical GAN. The difference lies in the method used for choosing the latent vector: instead of randomly generating z, we perform an evolutionary optimization, with z as decision variables and the estimated quality of G(z), based on a state-of-the-art quality estimation method, as the objective function. We show that:
The quality of generated images is better, both for the proxy used for estimating the quality, i.e., the objective function, and for human raters. For example, the modified images are preferred by human raters more than 80% of the time for images of cats and around 70% of the time for horses and artworks.
The diversity of the original GAN is preserved: the new images are preferred by humans while staying close to the images the original GAN would produce.
The computational overhead introduced by the evolutionary optimization is moderate, compared to the computational requirement for training the original GAN.
The approach is simple, generic, easy to implement, and fast. It can be used as a drop-in replacement for a classical GAN, provided that we have a quality estimator for the outputs of the GAN. Besides the training of the original GAN, many experiments were performed on a laptop without any GPU.
2 Related Works
2.1 Generative Adversarial Networks
Generative Adversarial Networks (GANs) [?] are widely used in machine learning [?, ?, ?, ?, ?, ?, ?, ?] for generative modeling. A GAN is made up of two neural networks: a Generator G, mapping a latent vector z to an image G(z), and a Discriminator D, mapping an image to a realism value. Given a dataset of real images, GANs are trained using two training steps operating concurrently:
Given a randomly generated z, the generator G tries to fool D into classifying its output as a real image, e.g., by maximizing D(G(z)). For this part of the training, only the weights of G are modified.
Given a minibatch containing both random fake images G(z) and real images randomly drawn from the dataset, the discriminator D learns to distinguish real from fake, e.g., by optimizing the cross-entropy.
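The two alternating steps above can be sketched on a deliberately tiny 1-D example; everything here (the one-parameter networks, the learning rate, the N(4, 1) data distribution, and the hand-derived gradients) is an illustrative assumption, not a setup from the paper:

```python
import math
import random

# Toy 1-D "GAN": G(z) = mu + sigma*z and D(x) = sigmoid(a*x + b),
# trained by plain stochastic gradient ascent on the two GAN objectives.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

random.seed(0)
mu, sigma = 0.0, 1.0        # generator parameters
a, b = 0.1, 0.0             # discriminator parameters
lr = 0.05

for step in range(2000):
    # Discriminator step: ascend log D(x_real) + log(1 - D(G(z))).
    x_real = random.gauss(4.0, 1.0)          # "dataset": samples from N(4, 1)
    x_fake = mu + sigma * random.gauss(0.0, 1.0)
    g_real = 1.0 - sigmoid(a * x_real + b)   # d log D(x_real) / d logit
    g_fake = -sigmoid(a * x_fake + b)        # d log(1 - D(x_fake)) / d logit
    a += lr * (g_real * x_real + g_fake * x_fake)
    b += lr * (g_real + g_fake)
    # Generator step: ascend log D(G(z)) with D frozen (non-saturating loss).
    z = random.gauss(0.0, 1.0)
    x_fake = mu + sigma * z
    g = (1.0 - sigmoid(a * x_fake + b)) * a  # d log D(x_fake) / d x_fake
    mu += lr * g                             # d x_fake / d mu = 1
    sigma += lr * g * z                      # d x_fake / d sigma = z
```

In this toy setting, the generator mean mu is pulled toward the real-data mean as D learns to separate the two distributions; real GANs apply the same alternation with deep networks and image data.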
The ability of GANs to synthesize faces [?] is particularly impressive and of wide interest. However, such results are possible only with huge datasets for each modality and/or after careful cropping, which restricts their applicability. Here we consider the problem of improving GANs trained on small or difficult datasets. Classical tools for making GANs compatible with small datasets include:
Data augmentation, by translation, rotation, symmetries, or other transformations.
Transfer from an existing GAN trained on another dataset to a new dataset [?].
Modification of the distribution in order to match a specific request, as done in several papers. [?] modifies the training, using quality assessment as we do; however, they modify the training whereas we modify inference. In the same vein, [?] works on scale disentanglement; it also operates at training time. These works could actually be combined with ours. [?] generates images conditionally on a classifier output or on a captioning network output. [?] and [?] condition the generation on a playability criterion (estimated by an agent using the GAN output) or on high-level constraints. [?] uses a variational autoencoder (VAE), so that constraints can be added to the generation: they can add an attribute (e.g., black hair) while still taking into account a realism criterion extracted from the VAE; this uses labels from the dataset. [?] uses disentanglement of the latent space for semantic face editing: the user can modify a specific latent variable. [?] allows image editing and manipulation: it uses projections onto the output domain.
Biasing the dataset. [?] augments the dataset by generating images with a distribution skewed towards the minority classes.
Learning a specific probability distribution, rather than using a predefined, for example Gaussian, distribution. Such a method is advocated in [?].
The latter is the closest to the present work in the sense that we stay close to the goal of the original GAN, i.e., modeling some outputs without trying to bias the construction towards some subset. However, whereas [?] learn a probability distribution on the fly while training the GAN, our approach learns a classical GAN and modifies, a posteriori, the probability distribution by considering a subdomain of the space of the latent variables in which images have better quality. We could work on an arbitrary generative model based on latent variables, not only GANs. As opposed to all previously mentioned works, we improve the generation without modifying the target distribution and without using any side-information or handcrafted criterion; our only ingredient is a quality estimator. Other combinations of deep learning and evolutionary algorithms have been published around GANs. For instance, [?] evolves a population of generators, whereas our evolutionary algorithm evolves individuals in the latent space. [?] also evolves individuals in the latent space, but using human feedback rather than the quality estimators that we are using. [?] evolves individuals in the latent space as well, but either guided by human feedback or by using similarity to a target image.
2.2 Quality estimators: Koncept512 and AVA
Quality estimation is a long-standing research topic [15, 16], recently improved by deep learning [?]. In the present work, we focus on quality estimation tools based on supervised convolutional networks. The KonIQ-10k dataset is a large publicly available image quality assessment dataset with 10,073 images rated by humans; each image is annotated by 120 human raters. The Koncept512 image quality scorer [?] is based on an InceptionResNet-v2 architecture and trained on KonIQ-10k to predict the mean opinion score of the annotators: it takes as input an image and outputs a quality estimate. Koncept512 is the state of the art in technical quality estimation [?], and is freely available; we use the release without any modification. [?] provides a tool similar to Koncept512, termed AVA, but dedicated to aesthetics rather than technical quality. It was easy to apply it as a drop-in replacement for Koncept512 in our experiments.
3.1 Our algorithm: EvolGAN
We do not modify the training of the GAN. We use a generator G created by a GAN: G takes as input a random latent vector z and outputs an image G(z). While the latent vector is generally chosen randomly (e.g., z ~ N(0, Id)), we treat it as a free parameter to be optimized according to a quality criterion Q. More formally, we obtain z* as:

z* = argmax_z Q(G(z))    (1)
In this paper, Q is either Koncept512 or AVA. Our algorithm computes an approximate solution z* of problem (1) and outputs G(z*). Importantly, we do not want a global optimum of Eq. (1): we want a local optimum, in order to keep essentially the same image, since G(z*) must stay close to G(z0), which would not happen without this condition. The optimization algorithm used to obtain z* in Eq. (1) is a simple (1+1)-Evolution Strategy with random mutation rates [?], adapted as detailed in Section 3.2 (see Alg. 1). We keep the budget b of our algorithm low, and the mutation strength parameter α can be used to ensure that the image generated by EvolGAN is similar to the initial image. For instance, with α = 1/d, where d is the dimension of the latent space, the expected number of mutated variables is, by construction (see Section 3.2), bounded by the budget b. We sometimes use the aesthetic quality estimator AVA rather than the technical quality estimator Koncept512 for quality estimation. We consider a coordinate-wise mutation rate: each coordinate is mutated, or not, independently with some probability.
3.2 Optimization algorithms
After a few preliminary trials, we decided to use the (1+1)-Evolution Strategy with uniform mixing of mutation rates [?], with a modification as described in Algorithm 1. This modification is designed for tuning the compromise between quality and diversity, as discussed in Table 1.
The parameter α upper-bounds the per-coordinate mutation rate; optionally, it can be provided as an argument, leading to the variants EvolGAN(α). The difference with the standard uniform mixing of mutation rates is that the mutation rate never exceeds α. With α = 1/d, the resulting image G(z*) is close to the original image G(z0), whereas with α = 1 the outcome is not similar to G(z0). Choosing α = 1 (or α = 1/2, closely related to FastGA [?]) leads to faster convergence rates but also to less diversity (see Alg. 1, line 1). We will show that overall, α = 1/d is the best choice for EvolGAN. We therefore get the algorithms presented in Table 1.
|α||behavior of EvolGAN(α) with budget b → ∞|
|α = 1/d||standard evol. alg. with mutation rate 1/d|
|α = 1/2||uniform mixing of mutation rates [?] (also related to [?])|
|α = 1||all variables mutated: equivalent to random search|
|intermediate values||intermediate behavior|
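The (1+1)-Evolution Strategy of Algorithm 1 can be sketched as follows. The toy generator, the quality scorer, and the exact way the mutation probability is drawn are simplifying assumptions for illustration; in the paper, Q would be Koncept512 or AVA and G a trained GAN generator:

```python
import random

def evolgan(generator, quality, z0, budget=300, alpha=None, rng=None):
    """(1+1)-ES hill climb on the latent vector z.
    Each coordinate mutates independently with probability p; p is drawn
    uniformly at random per step (uniform mixing) unless a fixed alpha is
    given (e.g. alpha = 1/len(z0) to stay close to z0, alpha = 1 for
    random-search-like behavior)."""
    rng = rng or random.Random(0)
    z = list(z0)
    best = quality(generator(z))
    for _ in range(budget):
        p = rng.random() if alpha is None else alpha
        cand = [zi + rng.gauss(0.0, 1.0) if rng.random() < p else zi
                for zi in z]
        q = quality(generator(cand))
        if q > best:            # keep the mutant only if quality improves
            z, best = cand, q
    return z

# Toy usage: the "image" is the latent vector itself, and quality rewards
# closeness to an arbitrary target, standing in for G and Koncept512.
target = [1.0] * 16
quality = lambda img: -sum((x - t) ** 2 for x, t in zip(img, target))
z = evolgan(lambda z: z, quality, [0.0] * 16, budget=200, alpha=1.0 / 16)
```

Because a mutant is kept only when the score improves, the quality of the retained image is non-decreasing over the budget, while small alpha keeps most coordinates of z0 untouched.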
3.3 Open source codes
We use the GAN publicly available at https://github.com/moxiegushi/pokeGAN, which is an implementation of Wasserstein GAN [?], the StyleGAN2 [?] available at thispersondoesnotexist.com, and PGAN on FashionGen from Pytorch GAN Zoo [?]. Koncept512 is available at https://github.com/subpic/koniq. Our combination of Koncept512 and PGAN is available at DOUBLEBLIND. We use the evolutionary programming platform Nevergrad [?].
We present applications of EvolGAN on three different GAN models: (i) StyleGAN2 for faces, cats, horses and artworks (ii) PokeGAN for mountains and Pokemons (iii) PGAN from Pytorch GAN zoo for FashionGen.
4.1 Quality improvement on StyleGAN2
The experiments are based on open source codes [?, 10, ?, 4, 3]. We use the StyleGAN2 [?] trained on a horse dataset, a face dataset, an artwork dataset, and a cat dataset.
|Dataset||Budget||Quality estimator||Frequency of EvolGAN image preferred|
|Faces (aggregated)||||Koncept512||60.4% ± 3.4% (208 ratings)|
|Cats||300||Koncept512||83.71% ± 1.75% (446 ratings)|
|Horses||300||Koncept512||70.43% ± 4.27% (115 ratings)|
|Artworks||300||Koncept512||69.19% ± 3.09% (224 ratings)|
Harder settings. Animals and artworks are a much more difficult setting (Fig. 3): StyleGAN2 sometimes fails to propose a high-quality image. Fig. 3 presents examples of generations of the original StyleGAN2 and of EvolGAN in such cases. Here, EvolGAN has more headroom for providing improvements than for faces: results are presented in Table 3. The case of horses or cats is particularly interesting: the failed generations often contain globally unrealistic elements, such as random hair balls flying in the air or unusual positioning of limbs, which are removed by EvolGAN. For illustration purposes, in Fig. 3 we present a few examples of generations which go wrong for the original StyleGAN2 and for EvolGAN; the bad examples in the case of the original StyleGAN2 are much worse.
4.2 Small difficult datasets and small α
In this section we focus on the use of EvolGAN for small datasets. We use the original Pokemon dataset in PokeGAN [?] and an additional dataset created from copyright-free images of mountains. The previous section was quite successful using α = 1 (i.e., random search). The drawback is that the obtained images are not necessarily related to the original ones, and we might lose diversity (though Section 4.4 shows that this is not always the case; see the discussion later). We will see that α = 1 fails in the present case. In this section, we use a small α, and check whether the obtained images are better than G(z0) (see Table 5) and close to the original image (see Fig. 4). Fig. 4 presents a Pokemon generated by the default GAN and its improved counterpart obtained by EvolGAN with a small α. Table 5 presents our experimental setting and the results of our human study conducted on PokeGAN. We see a significant improvement when using Koncept512 on real-world data (as opposed to drawings such as Pokemons, for which Koncept512 fails), whereas we fail with AVA as in previous experiments (see Table 2). We succeed on drawings with Koncept512 only with a small α: on this dataset of drawings (poorly adapted to Koncept512), a large α leads to a pure black image.
|Type||Number of training images||Number of epochs||Budget||Quality estimator||Frequency of image preferred to original|
|Real-world scenes|
|Mountains||84||4900||500||Koncept512||73.3% ± 4.5% (98 ratings)|
|Pokemons||1840||4900||6000||Koncept512||56.3% ± 5.2% (92 ratings)|
|Artificial scenes, higher mutation rates|
4.3 Quality improvement on FashionGen
Pytorch GAN Zoo [?] is an implementation of progressive GANs (PGAN [?]), applied here with FashionGen [?] as a dataset. The dimension of the latent space is 256. In Table 6, we present the results of our human study comparing G(EvolGAN(z0)) to G(z0). With α = 1/d, humans prefer EvolGAN to the baseline in more than 73% of cases, even after only 40 iterations. A small α also ensures that the images stay close to the original images when the budget is low enough (see Table 1). Fig. 5 shows some examples of generations using EvolGAN and the original PGAN.
|α||Budget||Frequency of EvolGAN image preferred|
|α = 1/d||40||73.33% ± 8.21% (30 ratings)|
|||320||75.00% ± 8.33% (28 ratings)|
|||40 and 320 aggregated||74.13% ± 5.79% (58 ratings)|
|α = 1/2||40||48.27% ± 9.44% (29 ratings)|
|||320||67.74% ± 8.53% (31 ratings)|
|||40 and 320 aggregated||58.33% ± 6.41% (60 ratings)|
|α = 1||40||56.66% ± 9.20% (30 ratings)|
|||320||66.66% ± 9.24% (27 ratings)|
|||40 and 320 aggregated||61.40% ± 6.50% (57 ratings)|
|All, aggregated||40||59.55% ± 5.23% (89 ratings)|
|||320||69.76% ± 4.98% (86 ratings)|
|||40 and 320 aggregated||64.57% ± 3.62% (175 ratings)|
4.4 Consistency: preservation of diversity.
Here we show that the generated image is close to the one before the optimization. More precisely, given z0, the following two methods provide related outputs: method 1 (classical GAN) outputs G(z0), and method 2 (EvolGAN) outputs G(z*), where z* is obtained by our evolutionary algorithm starting at z0 with budget b and parameter α (Sect. 3.2). Fig. 5 shows some example images generated using PGAN and EvolGAN. For most examples, G(z*) is very similar to G(z0), so the diversity of the original GAN is well preserved. Following [5, 19], we measure numerically the diversity of the images generated by the PGAN model, and by EvolGAN models based on it, using the LPIPS score. The scores were computed on samples of generated images; for each sample, we computed the LPIPS with another randomly chosen generated image. The results are presented in Table 4. Higher values correspond to higher diversity of samples. EvolGAN preserves the diversity of the generated images when used with a small α.
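The diversity measurement can be sketched as follows, with a plain pixel-space distance standing in as an assumption in place of the learned LPIPS metric (the real measurement uses the lpips package on image tensors):

```python
import random

def mean_pairwise_distance(images, pairs=200, rng=None):
    """Average distance between randomly chosen pairs of generated images.
    Each image is a flat list of pixel values; dist() is a stand-in for
    LPIPS. Higher values mean higher diversity of the sample."""
    rng = rng or random.Random(0)

    def dist(a, b):
        # mean absolute pixel difference (LPIPS substitute for this sketch)
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    total = 0.0
    for _ in range(pairs):
        a, b = rng.sample(images, 2)   # two distinct generated images
        total += dist(a, b)
    return total / pairs

# Usage: two maximally different "images" give a mean pairwise distance of 1.
score = mean_pairwise_distance([[0.0] * 4, [1.0] * 4], pairs=50)
```

Comparing this score between images sampled from the original generator and from EvolGAN quantifies how much diversity the latent-space optimization removes.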
4.5 Using AVA rather than Koncept512
In Table 7 we show that AVA is less suited than Koncept512 as a quality assessor in EvolGAN. The human annotators do not find the images generated using EvolGAN with AVA to be better than those generated without EvolGAN. We hypothesize that this is due to the subjectivity of what AVA estimates: aesthetic quality. While humans generally agree on the factors accounting for the technical quality of an image (sharp edges, absence of blur, right brightness), annotators often disagree on aesthetic quality. Another factor may be that aesthetics are inherently harder to evaluate than technical quality.
|Dataset||Budget||Quality estimator||Frequency of EvolGAN image preferred|
|Faces||500||AVA||50.55% ± 3.05%|
(a) Faces with StyleGAN2: reproducing Table 2 with AVA in lieu of Koncept512.
|Dataset||Budget||Quality estimator||Score|
|Cats||300||AVA||47.05% ± 7.05%|
|Artworks||300||AVA||55.71% ± 5.97%|
(b) Reproducing Table 3 with AVA in lieu of Koncept512.
|Type||Number of training images||Number of epochs||Budget||Quality estimator||Frequency of image preferred to original|
|Mountains||84||4900||500||AVA||42.5%|
|Pokemons||1840||4900||500||AVA||52.6%|
(c) Reproducing Table 5 with AVA rather than Koncept512.
We have shown that, given a generative model G, optimizing the latent vector z by an evolutionary algorithm using Koncept512 as a criterion, preferably with α = 1/d (i.e., the classical (1+1)-Evolution Strategy), leads to higher-quality images while preserving the diversity of the original generator.
Choosing α small, i.e., the classical (1+1)-Evolution Strategy with mutation rate 1/d, is usually the best choice: we preserve the diversity (with provably a small number of mutated variables, and experimentally a resulting image differing from the original one mainly in terms of quality), and the improvement compared to the original GAN is clearer, as we can directly compare G(z*) to G(z0); a small budget was usually enough. Importantly, evolutionary algorithms clearly outperformed random search, and not only in terms of speed: we completely lose the diversity with random search, as well as the ability to improve a given point. Our application of evolution is a case in which we provably preserve diversity: with a per-coordinate mutation rate bounded by α = 1/d, a budget b, and a dimension d, the expected ratio of mutated variables is at most b/d. In our setting, this gives a small expected ratio of mutated variables for the low budgets used in Fig. 5. A tighter, run-dependent bound can be obtained by comparing z0 and z*, and could be used as a stopping criterion.
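The diversity bound can be checked empirically under the per-coordinate reading above: each of the d coordinates mutates with probability α = 1/d at each of the b steps (a worst case, since only accepted mutations actually change z), so the expected fraction of coordinates ever touched is 1 − (1 − 1/d)^b ≤ b/d. A quick Monte-Carlo check:

```python
import random

def expected_mutated_ratio(d, b, alpha, trials=200, seed=0):
    """Monte-Carlo estimate of the fraction of latent coordinates mutated
    at least once over b steps, each coordinate mutating independently
    with probability alpha at each step."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        mutated = [False] * d
        for _ in range(b):
            for i in range(d):
                if rng.random() < alpha:
                    mutated[i] = True
        total += sum(mutated)
    return total / (trials * d)

# With d = 256 (FashionGen latent dimension) and a low budget b = 40,
# the bound b/d = 40/256 holds with room to spare.
ratio = expected_mutated_ratio(d=256, b=40, alpha=1.0 / 256)
```

The analytical value 1 − (1 − 1/256)^40 ≈ 0.145 sits comfortably below the union bound 40/256 ≈ 0.156, which is why a low budget keeps G(z*) close to G(z0).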
Successes. We get an improved GAN without modifying the training. The results are positive in all cases, in particular on difficult real-world data (Table 3), though the gap is moderate when the original model is already excellent (faces, Table 2) or when the data are not real-world (Pokemons, Table 5). EvolGAN with Koncept512 is particularly successful on several difficult cases with real-world data: Mountains with PokeGAN, Cats, Horses and Artworks with StyleGAN2, and FashionGen with Pytorch GAN Zoo.
Remark on quality assessment. Koncept512 can be used on a wide range of applications. As far as our framework can tell, it outperforms AVA as a tool for EvolGAN (Table 7). However, it fails on artificial scenes such as Pokemons, unless we use a small α for staying local.
Computational cost. All the experiments with PokeGAN presented here could be run on a laptop without using any GPU. The experiments with StyleGAN2 and PGAN use at most 500 (and often just 40) calls to the original GAN, without any specific retraining: we just repeat the inference with various latent vectors chosen by our evolutionary algorithm as detailed in Section 3.1.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 251654672, TRR 161 (Project A05).
- (2017) Latent constraints: learning to generate conditionally from unconditional generative models. CoRR abs/1711.05772.
- (2019) Searching the latent space of a generative adversarial network to generate DOOM levels. In 2019 IEEE Conference on Games (CoG), pp. 1–8.
- (2019) Effective aesthetics prediction with multi-level spatially pooled features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9375–9383.
- (2020) KonIQ-10k: an ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing, pp. 1–1.
- (2018) Multimodal unsupervised image-to-image translation. CoRR abs/1804.04732.
- (2019) Analyzing and improving the image quality of StyleGAN.
- (2018) BAGAN: data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655.
- (2016) Plug & play generative networks: conditional iterative generation of images in latent space.
- (2019) Quality aware generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 2948–2958.
- (2019) Pytorch GAN Zoo. GitHub. https://GitHub.com/FacebookResearch/pytorch_GAN_zoo
- (2019) Inspirational adversarial image generation. arXiv preprint arXiv:1906.11661.
- (2019) Interpreting the latent space of GANs for semantic face editing.
- (2018) Evolving Mario levels in the latent space of a deep convolutional generative adversarial network. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '18, New York, NY, USA, pp. 221–228.
- (2018) Evolutionary generative adversarial networks. CoRR abs/1803.00657.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing.
- (2012) Unsupervised feature learning framework for no-reference image quality assessment. IEEE Conference on Computer Vision and Pattern Recognition.
- (2020) BSD-GAN: branched generative adversarial network for scale-disentangled representation learning and image synthesis. IEEE Transactions on Image Processing.
- (2016) Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597–613.
- (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476.