An Improved Evaluation Framework for Generative Adversarial Networks


Abstract


In this paper, we propose an improved quantitative evaluation framework for Generative Adversarial Networks on generating domain-specific images, where we improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing evaluation frameworks, which transfer the representation of the ImageNet inception model to map images onto the feature space, our framework uses a specialized encoder to acquire a fine-grained domain-specific representation. Moreover, for datasets with multiple classes, we propose Class-Aware Frechet Distance (CAFD), which employs a Gaussian mixture model on the feature space to better fit the multi-manifold feature distribution. Experiments and analysis on both the feature level and the image level were conducted to demonstrate the improvements of our proposed framework over the recently proposed state-of-the-art FID method. To our best knowledge, we are the first to provide counterexamples where FID gives results inconsistent with human judgments. It is shown in the experiments that our framework is able to overcome the shortcomings of FID and improve robustness. Code will be made available.2

Keywords:
Generative adversarial network, evaluation, metric, representation

1 Introduction

Generative Adversarial Networks (GANs) have shown outstanding abilities on many computer vision tasks including generating domain-specific images [1], style transfer [2], super resolution [3], etc. The basic idea of GANs is to set up a two-player game between a generator and a discriminator, where the discriminator aims to distinguish between real and fake samples while the generator tries to generate samples as realistic as possible to fool the discriminator.

Researchers [4, 5, 6, 7] have been continuously exploring better GAN architectures. However, developing a widely-accepted GAN evaluation framework remains a challenging topic [8]. Due to the lack of GAN benchmark results, newly proposed GAN variants are validated on different evaluation frameworks and are therefore incomparable. Because human judgements are inherently limited by available manpower, good quantitative evaluation frameworks are of great importance for guiding future research on designing, selecting, and interpreting GAN models.

Figure 1: Comparison between our proposed framework and the recently proposed state-of-the-art evaluation method FID [9]. Our framework uses a domain-specific representation to get better features and employs a multi-manifold Gaussian mixture model to better fit the distribution.

There has been a variety of efforts to evaluate GANs on their ability to generate domain-specific images. The goal is to measure the distance between the generated samples and the real samples in the dataset. Most existing methods utilize the ImageNet [10] inception model to map images onto the feature space. The most widely used criterion is probably the Inception Score [11], which measures the distance via the Kullback-Leibler Divergence (KLD). However, it is probability based and is unable to report overfitting. Recently, the Frechet Inception Distance (FID) was proposed [9] to improve on the Inception Score. It directly measures the Frechet Distance on the feature space under a single-manifold Gaussian assumption. It has been shown that FID is far better than the Inception Score [12, 13, 14]. However, we argue that assuming normality on the whole feature distribution may lose class information on labeled datasets.

In this work, we propose an improved quantitative evaluation framework. A comparison between our framework and the current state-of-the-art FID method is shown in Fig. 1. We improve conventional evaluation methods on two levels: the feature representation and the evaluation metric. Unlike most existing methods including the Inception Score [11] and FID [9], our framework uses a specialized encoder trained on the dataset to get a domain-specific representation. We argue that applying the ImageNet model to either labeled or unlabeled datasets is ineffective. Moreover, we propose Class-Aware Frechet Distance (CAFD) in our framework to measure the distribution distance of each class (mode) separately on the feature space to include class information. Instead of the single-manifold Gaussian assumption, we employ a Gaussian mixture model (GMM) to better fit the feature distribution. We also include the KL divergence (KLD) between the mode distributions of real data and generated samples in the framework to help detect mode dropping.

Experiments and analysis on both the feature level and the image level were conducted to demonstrate the improved effectiveness of our proposed framework. To our best knowledge, we are the first [15] to provide counterexamples where FID is inconsistent with human judgements (see Figs. 3 and 5). It is shown in the experiments that our framework is able to overcome the shortcomings of existing methods.

2 Related Work

Generative Adversarial Networks. The idea of the Generative Adversarial Network was originally proposed in [1]. It has been applied to various computer vision tasks [2, 3, 16, 17]. Researchers have been continuously developing better GAN architectures [18, 19] and training strategies [20, 21] for generating domain-specific images. Deep convolutional networks were first introduced to the GAN community by [4]. Wasserstein GAN (WGAN) [5] was proposed to significantly improve the convergence of GAN training. Recently, several variants were proposed [6, 7, 22, 23, 24, 25, 26] to improve the quality of images generated by GAN models.


Evaluation Methods. Several GAN evaluation methods have been proposed by researchers. While model-based methods including Parzen window estimation and annealed importance sampling (AIS) [27] require either density estimation or access to the inner structure of the decoder, model-agnostic methods [9, 11, 13, 22, 23, 28, 29] are more popular in the GAN community. These methods are sample based. Most of them map images onto the feature space via an ImageNet-pretrained model and measure the similarity of the distributions of the dataset and the generated data. Maximum mean discrepancy (MMD) was proposed in [23, 29] and has been further used in classifier two-sample tests [28], where statistical hypothesis testing is used to assess whether two sample sets are from the same distribution. The Inception Score [11], along with its improved version, the Mode Score [22], was the most widely used metric in the last two years. Recently, FID [9] was proposed to improve on the Inception Score.


Studies on Existing Frameworks. It is common [30] in the literature to see algorithms which use existing metrics to optimize early stopping, hyperparameter tuning, and even model architecture. Thus, comparison and analysis of previous evaluation methods have been attracting more and more attention recently [8, 12, 13, 14]. While the Inception Score was the most popular metric in the last two years, it has been considered misleading in recent literature [9, 12, 14, 15, 30]. Applying the ImageNet model to encode features in the Inception Score is ineffective [8, 30, 31]. The recently proposed FID has been shown to be far better than the Inception Score [9, 12, 13], and its robustness was experimentally demonstrated recently in a technical report by Google Brain [14]. However, in this paper, we argue that FID still has problems, and we provide counterexamples where FID gives results inconsistent with human judgements. Moreover, we propose an improved evaluation framework that overcomes the shortcomings of existing methods.

3 Problems on FID

3.1 Method Formulation

The evaluation problem can be formulated as modeling the distance between two distributions $p_r$ and $p_g$, where $p_r$ denotes the distribution of real samples in the dataset and $p_g$ denotes the distribution of new samples generated by the GAN model.

The main difficulties for GANs in generating domain-specific images can be summarized into the three types below.

  • Lack of generating ability. Either the generator cannot generate useful samples or the GAN training cannot converge.

  • Mode collapse. Different modes collapse to a new mixed mode in the generated samples. (e.g. An animal resembling both a horse and a deer.)

  • Mode dropping. Only part of the modes in the dataset are generated while some modes are implicitly ignored. (e.g. The handwritten ‘5’ can hardly be generated by a GAN trained on MNIST.)

Therefore, a good evaluation framework should be consistent with human judgements and should penalize both mode collapse and mode dropping.

Most conventional methods utilize an ImageNet-pretrained inception model to map images onto the feature space. The Inception Score, originally formulated as Eq. (1), ignores information in the dataset completely. Thus, its original formulation is considered relatively misleading.

$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[\mathrm{KL}\left(p(y|x)\,\|\,p(y)\right)\right]\right) \qquad (1)$

where $p(y|x)$ is the conditional label distribution predicted by the inception model and $p(y)$ is its marginal over the generated samples.
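For concreteness, a minimal sketch of how Eq. (1) can be computed from classifier posteriors is given below; the variable names and the small `eps` smoothing constant are illustrative assumptions rather than part of the original definition.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from predicted class probabilities.

    probs: (N, K) array where probs[i] = p(y | x_i) for generated sample x_i
    (the original metric obtains these from the ImageNet inception model).
    """
    p_y = probs.mean(axis=0)                         # marginal distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))      # exp(E_x[KL(p(y|x) || p(y))])
```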

The Mode Score was proposed [22] to overcome this shortcoming. Its formulation is shown in Eq. (2). By including the prior distribution of the ground-truth labels, the Mode Score improves on the Inception Score [22] in reporting mode dropping.

$\mathrm{MS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[\mathrm{KL}\left(p(y|x)\,\|\,p(y^*)\right)\right] - \mathrm{KL}\left(p(y)\,\|\,p(y^*)\right)\right) \qquad (2)$

where $p(y^*)$ denotes the marginal label distribution of the real data.

FID [9], formulated as Eq. (3), was proposed to improve on the Inception Score [11]. Unlike the previous two metrics, which are probability-based, FID directly measures the Frechet distance on the feature space. It assumes single-manifold normality of the feature distribution and uses an ImageNet model for encoding features. FID is believed to be better than the Inception Score [12, 13, 14]. However, we argue that FID has two major problems (see Sections 3.2 and 3.3).

$\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right) \qquad (3)$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the encoded features of the real and generated samples, respectively.
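As a reference for Eq. (3), the sketch below computes the Frechet distance from two sets of encoded features; the small ridge term `eps` is a numerical-stability assumption and not part of the original definition.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_gen, eps=1e-6):
    """Frechet distance between two feature sets under a Gaussian assumption.

    feat_real, feat_gen: (N, d) arrays of encoded features.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Matrix square root of the covariance product (with a small ridge for stability).
    covmean, _ = linalg.sqrtm((cov_r + eps * np.eye(len(cov_r))) @
                              (cov_g + eps * np.eye(len(cov_g))), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```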

3.2 Ineffective Representation

As both the Inception Score [11] and the Mode Score [22] are probability-based, applying an ImageNet-pretrained model to non-ImageNet datasets is relatively meaningless. This misuse of representation in the Inception Score was mentioned previously [31]. However, we argue that applying the ImageNet model to map the generated images onto the feature space in FID can also be misleading.

On labeled datasets with multiple classes, the class labels do not match those in ImageNet. For example, the class ‘Bird’ in CIFAR-10 [32] is divided into several fine-grained category labels in ImageNet. Therefore, a CNN representation trained on ImageNet is either meaningless or over-complicated.

On unlabeled datasets with images from a single class, such as CelebA [33] and LSUN Bedrooms [34], applying the ImageNet inception model is also inappropriate. The categories of ImageNet labels are so diverse that the trained model needs to encode features for a wide variety of objects. However, the learned features are ineffective on a specific domain. The encoded features are limited to a relatively low-dimensional manifold and lack fine-grained information. In Section 5.1.2, we designed experiments on both the feature level and the image level to demonstrate the effects of the representation. To our best knowledge, we are the first [15] to provide examples where FID gives misleading results on unlabeled datasets (see Fig. 3).

3.3 Single-Manifold vs. Multi-Manifold

We argue that the single-manifold multivariate Gaussian assumption in FID is over-simplified. As training decreases intra-class distances and increases inter-class distances, the features are distributed in groups by their class labels. Thus, on datasets with multiple classes, the feature distribution has a multi-manifold structure, which is better fitted by a multivariate Gaussian mixture model (GMM).

Considering the specific Gaussian mixture model in which $x \sim \mathcal{N}(\mu_i, \Sigma_i)$ with probability $p_i$, we can derive the first and second moments of the feature distribution in Eq. (4) and Eq. (5).

$\mathbb{E}[x] = \sum_{i} p_i \mu_i \qquad (4)$
$\mathbb{E}[x x^T] = \sum_{i} p_i \left(\Sigma_i + \mu_i \mu_i^T\right) \qquad (5)$

It should be noted that when the feature is $n$-dimensional and there are $k$ classes in total, there are a total of $k\left(n + n(n+1)/2\right)$ variables in the mixture model. However, directly modeling the whole distribution as a single Gaussian as in FID results in only $n + n(n+1)/2$ degrees of freedom, which is a relatively small number. Thus, FID detects mode-related problems only in a rather implicit way. Although FID was shown to be robust to mode dropping and mode collapse in recent literature [9, 12, 13, 14], we argue that the experimental demonstrations of its robustness in previous work are insufficient. Either simply dropping a mode or linearly combining images will result in an increased FID. However, in cases where the mode-related problems are more complicated, FID may give misleading results (see Fig. 5). In Section 5.2, we conducted extensive experiments to analyze the properties of the encoded features. To our best knowledge, we are the first to provide counterexamples where FID fails to give results consistent with human judgements on datasets with multiple classes.
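The gap in modeling capacity can be made concrete by counting free parameters, as in the short sketch below (a symmetric covariance matrix contributes n(n+1)/2 entries; the MNIST-like numbers are only an example).

```python
def gaussian_params(n):
    """Free parameters of a single n-dimensional Gaussian: mean plus symmetric covariance."""
    return n + n * (n + 1) // 2

def gmm_params(n, k):
    """Free parameters of a k-component Gaussian mixture (plus k-1 mixture weights)."""
    return k * gaussian_params(n) + (k - 1)

# e.g. a 1024-dimensional feature space with 10 classes (as on MNIST):
print(gaussian_params(1024))   # the single-Gaussian model used by FID
print(gmm_params(1024, 10))    # roughly 10x more variables in the mixture model
```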

4 Proposed Framework

4.1 Domain-Specific Encoder

As discussed in Section 3.2, applying the ImageNet inception model to either labeled or unlabeled datasets is ineffective. Thus, we argue that a specialized domain-specific encoder should be used in the evaluation framework. As shown in Fig. 2(a), while the features encoded by the ImageNet model are limited within a low-dimensional subspace, the domain-specific model could encode more fine-grained information, making the encoded features much more effective.

(a) Representation
(b) Evaluation metric
Figure 2: Visual demonstrations on highlights of our proposed framework. In the left figure, the features encoded by the ImageNet model are limited within a low-dimensional subspace. Thus, we propose that a domain-specific encoder is needed. In the right figure, we show that instead of a single-manifold Gaussian distribution, the features are more like a multi-manifold structure. CAFD employs a Gaussian mixture model to include class information.

Specifically, for datasets with multiple classes such as CIFAR-10 [32], the representation is acquired by training a domain-specific classifier. On datasets without class labels, such as CelebA [33], an unsupervised representation learning method, specifically an AutoEncoder, is used to obtain a more effective representation.
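The encoder is specified here only at a high level (a classification head for labeled data, the encoder half of an AutoEncoder otherwise); the PyTorch sketch below shows the kind of small 4-conv encoder we have in mind, with layer widths that are illustrative assumptions rather than a prescribed architecture.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """A small 4-conv encoder (roughly the inverse of a DCGAN generator).

    For labeled datasets a linear classification head is attached and the
    encoder is trained as a classifier; for unlabeled datasets it is trained
    as the encoder half of an AutoEncoder.
    """
    def __init__(self, in_ch=3, feat_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        h = self.conv(x)
        return h.mean(dim=(2, 3))   # global average pooling to a feat_dim vector
```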

4.2 Class-Aware Frechet Distance

Before introducing our improved evaluation metric, we first take a step back towards existing popular metrics. Both the Inception Score [11] and the Mode Score [22] measure the distance between probability distributions, while FID [9] directly measures distance on the feature space. These are two different perspectives on evaluating GAN models. Probability-based metrics better handle mode-related problems (with the correct use of a domain-specific encoder), while directly measuring the distance between features better models the generating ability. In fact, we believe these two perspectives are complementary. In our framework, we propose a class-aware metric on the feature space to combine the two perspectives.

As shown in Fig. 2(b), for datasets with multiple classes, the feature distribution is more like a multi-manifold structure (See Section 3.3). Thus, we use a Gaussian mixture model (GMM) and propose Class-Aware Frechet Distance (CAFD) to include class information. Specifically, we compute probability-based Frechet Distance between real data and generated samples in each class respectively.

As previously discussed in Section 4.1, we train a domain-specific classifier on datasets with multiple classes and use its derived representation. In our proposed framework, we also make use of the predicted probability $p(y|x)$. To calculate the expected mean of each class in a specific set of generated samples, we can derive the formulation below in Eq. (6).

$\mu_i = \sum_{k} w_{i,k}\, f(x_k) \qquad (6)$

where

$w_{i,k} = \frac{p(y_i \mid x_k)}{\sum_{k'} p(y_i \mid x_{k'})} \qquad (7)$

and $f(x_k)$ denotes the encoded feature of sample $x_k$.

Similarly, the covariance matrix of each class is shown in Eq. (8).

$\Sigma_i = \sum_{k} w_{i,k}\left(f(x_k) - \mu_i\right)\left(f(x_k) - \mu_i\right)^T \qquad (8)$

We compute the Frechet distance within each of the classes and average the results to get the Class-Aware Frechet Distance (CAFD) in Eq. (9).

$\mathrm{CAFD} = \frac{1}{k}\sum_{i=1}^{k}\left(\|\mu_{r,i} - \mu_{g,i}\|_2^2 + \mathrm{Tr}\left(\Sigma_{r,i} + \Sigma_{g,i} - 2\left(\Sigma_{r,i}\Sigma_{g,i}\right)^{1/2}\right)\right) \qquad (9)$

where $(\mu_{r,i}, \Sigma_{r,i})$ and $(\mu_{g,i}, \Sigma_{g,i})$ are the class-$i$ statistics of the real and generated features, respectively.
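A minimal sketch of how CAFD can be computed from encoded features and classifier posteriors is given below, reusing the Frechet distance of Eq. (3); the soft class weighting mirrors Eqs. (6)-(8), although a concrete implementation may differ in details such as hard versus soft class assignments.

```python
import numpy as np
from scipy import linalg

def frechet_from_stats(mu1, cov1, mu2, cov2, eps=1e-6):
    """Frechet distance between two Gaussians given their statistics (cf. Eq. (3))."""
    covmean, _ = linalg.sqrtm((cov1 + eps * np.eye(len(cov1))) @
                              (cov2 + eps * np.eye(len(cov2))), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def class_stats(features, probs):
    """Soft per-class means and covariances in the spirit of Eqs. (6)-(8).

    features: (N, d) encoded features; probs: (N, K) classifier posteriors p(y|x).
    """
    weights = probs / probs.sum(axis=0, keepdims=True)   # normalized weights, Eq. (7)
    mus, covs = [], []
    for i in range(probs.shape[1]):
        w = weights[:, i:i + 1]
        mu = (w * features).sum(axis=0)                   # weighted class mean, Eq. (6)
        diff = features - mu
        mus.append(mu)
        covs.append((w * diff).T @ diff)                  # weighted class covariance, Eq. (8)
    return mus, covs

def cafd(feat_real, prob_real, feat_gen, prob_gen):
    """Class-Aware Frechet Distance: per-class Frechet distances averaged, Eq. (9)."""
    mus_r, covs_r = class_stats(feat_real, prob_real)
    mus_g, covs_g = class_stats(feat_gen, prob_gen)
    return float(np.mean([frechet_from_stats(mr, cr, mg, cg)
                          for mr, cr, mg, cg in zip(mus_r, covs_r, mus_g, covs_g)]))
```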

This improved form, based on the Gaussian mixture model assumption, evaluates the actual distance better than the original FID. Moreover, more comprehensive evaluation results can be derived. When CAFD is applied to evaluating a specific GAN model, we can obtain a better class-aware understanding of its generating ability. For example, as shown in Table 1, the selected model generates digit 1 well but struggles with other classes. This information provides guidance for researchers on how well their generative models perform on each mode and may explain what specific problems exist.

Class 0 1 2 3 4 5
Distance
Class 6 7 8 9 average
Distance
Table 1: Frechet distance on different classes of MNIST dataset.

As both FID and CAFD aim to model how well domain-specific images are generated, they are not designed to deal with mode dropping, where some of the modes are missing in the generated samples. We note that both metrics detect mode dropping only in a relatively implicit way, which may fail in some corner cases. Thus, motivated by the Mode Score [22], we propose that the KL divergence between the mode (class) distributions of the real data and the generated samples should be included in the evaluation framework.

To sum up, the correct use of the encoder, the CAFD metric and the KL divergence term combine into a complete improved evaluation framework. Our proposed method combines the advantages of the Inception Score [11], the Mode Score [22] and FID [9] and overcomes their shortcomings.
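A minimal sketch of the mode-distribution KL term estimated from classifier marginals follows; the direction of the divergence (real relative to generated) is an assumption made here so that dropped modes are penalized heavily.

```python
import numpy as np

def mode_kld(prob_real, prob_gen, eps=1e-12):
    """KL divergence between the marginal mode distributions of real and generated data.

    prob_real, prob_gen: (N, K) classifier posteriors; their column means estimate
    how often each mode (class) appears in each set.
    """
    p = prob_real.mean(axis=0) + eps
    q = prob_gen.mean(axis=0) + eps
    return float(np.sum(p * np.log(p / q)))
```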

4.3 Discussion

Our method is sensitive to the choice of representation. Different choices of encoder may change the evaluation results. Experiments in Section 5.1 demonstrate that the ImageNet inception model may give misleading results (see Fig. 3). Thus, a domain-specific encoder should be used in each evaluation framework. We argue that because the representation is not fixed, the correct use (with a domain-specific representation) of the Inception Score, the Mode Score and FID would suffer from this sensitivity problem as well. It is worth emphasizing that different generative methods should be compared only under the same encoder.

Unlike the Inception Score, CAFD measures distance on the feature space as FID does, and is therefore able to report overfitting. By measuring CAFD with respect to the training set and the test set respectively, researchers can determine whether their GAN models overfit the training data. Moreover, the intermediate results provide researchers with a comprehensive understanding of their GAN models (e.g. see Table 1).

5 Experiments

5.1 Study on Representation

In this section, we study the representation used to map the generated images onto the feature space. As discussed in Section 4.1, applying the pretrained ImageNet inception model to either labeled or unlabeled datasets is inappropriate. We first investigated the problem of unmatched class labels on a labeled dataset, specifically CIFAR-10 [32]. Then, experiments on both the feature level and the image level were conducted on CelebA [33], a dataset consisting only of face images.

Experiments on CIFAR-10 [32]:

We used the Inception-v3 [35] model trained on ImageNet to classify the 5000 images labeled ‘Bird’ and the 5000 images labeled ‘Dog’ in the CIFAR-10 [32] dataset respectively. Table 2 shows the results. The images from the single class ‘Bird’ in CIFAR-10 are classified into various subclasses, where surprisingly the top class is Fox Squirrel (which is not a bird class) with a 10.1% frequency. The classification results are extremely diverse. It can be inferred that the Inception-v3 model trained on ImageNet does not map images with the label ‘Bird’ onto a single-manifold subspace. Results for the label ‘Dog’ show similar patterns. We argue that features determining whether a dog is a Japanese spaniel or an English foxhound are unnecessary on CIFAR-10. The ImageNet representation does not fit non-ImageNet datasets well.

Rank CIFAR-10 ’Bird’ Frequency CIFAR-10 ’Dog’ Frequency
1 Fox Squirrel 10.1% Japanese spaniel 9.8%
2 Limpkin 6.9% Dandie Dinmont 5.2%
3 Black Stork 6.4% English foxhound 4.6%
4 Black Grouse 5.3% Toy terrier 3.2%
5 Brambling 4.1% Bluetick 2.8%
Table 2: The classification results on CIFAR-10 [32] images using the Inception-v3 model trained on ImageNet. The class labels ’Bird’ and ’Dog’ are divided into several subclasses.
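For reference, the classification step behind Table 2 can be reproduced along the lines below; the preprocessing (resizing CIFAR-10 images to 299x299 and using ImageNet normalization) is our assumption, as the exact pipeline is not spelled out above.

```python
import torch
import torchvision.transforms as T
from torchvision import models

# ImageNet-pretrained Inception-v3 used to classify CIFAR-10 'Bird' / 'Dog' images.
model = models.inception_v3(pretrained=True)
model.eval()

# Inception-v3 expects 299x299 inputs normalized with ImageNet statistics.
preprocess = T.Compose([
    T.Resize(299),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def top1_labels(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    logits = model(batch)
    return logits.argmax(dim=1)   # ImageNet class indices; map to names via a label file
```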

Therefore, when the dataset includes multiple classes and its class labels are different from those of ImageNet, the feature encoder should be specifically trained. To attain effective representation on non-ImageNet datasets, we need to ensure that the class labels of data used for training GAN models are consistent with those of data used for training the encoder.

Experiments on CelebA [33]:

Even setting aside the unmatched-classes problem, applying the ImageNet-pretrained model to label-free datasets used for GAN training can still give misleading results. Take the face dataset CelebA [33] as an example. On one hand, in order to evaluate how well the face images are generated, the encoder needs to encode facial texture features, which are hardly learned by the ImageNet inception model. On the other hand, the features determining whether a bird is a limpkin or a grouse are obviously unnecessary on CelebA. Thus, the percentage of effective features in the whole feature space is relatively low.

Experiments were conducted on the CelebA [33] dataset to better demonstrate the deficiency of the ImageNet model. We performed three different types of adjustments on the first 10,000 images of CelebA: a) Random noise uniformly distributed in [-33, 33] was applied to each pixel. b) Each image was divided into 8x8=64 regions and seven of them were sheltered with a pixel value sampled from the face. c) Each image was first divided into 4x4=16 regions and two random exchanges of regions were performed.
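A sketch of the three adjustments, operating on a uint8 HxWx3 image array, is given below; which regions are covered, which pixel value is sampled and the random seed are illustrative choices rather than the exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img):
    """(a) Uniform noise in [-33, 33] added to every pixel."""
    noisy = img.astype(np.int32) + rng.integers(-33, 34, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def shelter(img, grid=8, n_regions=7):
    """(b) Cover n_regions of the grid x grid cells with a pixel value sampled from the image."""
    out = img.copy()
    h, w = img.shape[:2]
    gh, gw = h // grid, w // grid
    fill = img[rng.integers(h), rng.integers(w)]          # a pixel sampled from the face image
    for idx in rng.choice(grid * grid, size=n_regions, replace=False):
        r, c = divmod(int(idx), grid)
        out[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw] = fill
    return out

def exchange(img, grid=4, n_swaps=2):
    """(c) Split into grid x grid cells and randomly exchange pairs of cells n_swaps times."""
    out = img.copy()
    h, w = img.shape[:2]
    gh, gw = h // grid, w // grid
    for _ in range(n_swaps):
        (r1, c1), (r2, c2) = rng.integers(grid, size=(2, 2))
        block = out[r1 * gh:(r1 + 1) * gh, c1 * gw:(c1 + 1) * gw].copy()
        out[r1 * gh:(r1 + 1) * gh, c1 * gw:(c1 + 1) * gw] = \
            out[r2 * gh:(r2 + 1) * gh, c2 * gw:(c2 + 1) * gw]
        out[r2 * gh:(r2 + 1) * gh, c2 * gw:(c2 + 1) * gw] = block
    return out
```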

Results are shown in Fig. 3. With the ImageNet inception model, FID clearly gave results inconsistent with human judgements. In fact, when similar adjustments were conducted with the overall color maintained, FID fluctuated within only a small range. The ImageNet model mainly extracts general features such as color and shape to classify diverse objects, while domain-specific facial textures cannot be well represented.

(a) noise (FID=75.9)
(b) sheltering (FID=74.3)
(c) exchange (FID=70.9)
Figure 3: Examples where FID gives results inconsistent with human judgements on CelebA [33]. The ImageNet inception model fails to encode fine-grained features on faces. a) Random noise uniformly distributed in [-33, 33] was applied to each pixel. b) Each image was divided into 8x8=64 regions and seven of them were sheltered with a pixel value sampled from the face. c) Each image was first divided into 4x4=16 regions and two random exchanges of regions were performed.
Representation    noise     sheltering   exchange
ImageNet          76        74           71
AutoEncoder       83        21417        38609
Discriminator     122466    48322        28557
Human judgement   Good      Bad          Worst
Table 3: FID results under different representations. Only the AutoEncoder used in our proposed framework gives results consistent with human judgements.

To attain a domain-specific representation, we trained an AutoEncoder on the dataset and used its representation to extract features in our proposed framework. In this experiment, the network architecture of the AutoEncoder is the inverse of the 4-conv DCGAN [4] with a feature dimension of 1024. For comparison, we also tried applying the representation of the discriminator after GAN training, which was previously proposed in [22]. Results are shown in Table 3.

Only the representation derived from the AutoEncoder in our proposed framework is effective and gives results consistent with human judgements. The discriminator, which learns to discriminate fake samples from real ones, does not learn a good representation for distance measurement.

To further support our statement that the features encoded by the ImageNet model are limited to a low-dimensional manifold, we trained an AutoEncoder with a feature dimension of 2048, which is the same as the dimension of the features encoded by the ImageNet model. We again applied the inverse structure of DCGAN [4] as the architecture of the AutoEncoder. Principal component analysis (PCA) was conducted on the features encoded by both the AutoEncoder and the ImageNet inception model on CelebA [33]. Table 4 shows the percentage of explained variance for the first 5 components.

We argue that the ImageNet model should have much greater representation capability than the 4-conv encoder. However, its first two components have considerably higher explained variance (9.35% and 7.04%). This supports our claim that the features encoded by the ImageNet model are limited to a low-dimensional subspace.

            AutoEncoder                ImageNet
Component   Explained   Accumulated    Explained   Accumulated
1           5.58%       5.58%          9.35%       9.35%
2           4.66%       10.24%         7.04%       16.39%
3           3.93%       14.17%         3.88%       20.27%
4           3.66%       17.83%         2.67%       22.95%
5           3.41%       21.24%         2.47%       25.42%
Table 4: Explained variance from principal component analysis (PCA) on features encoded by different representations. Although the architecture of the ImageNet model is much more complex than that of the AutoEncoder, the features encoded by the ImageNet model are limited to a relatively low-dimensional subspace.
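The PCA comparison in Table 4 can be computed with a few lines of scikit-learn; the feature-matrix names in the comments are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def explained_variance_report(features, n_components=5):
    """Percentage of variance explained by the leading principal components of a feature set."""
    pca = PCA(n_components=n_components).fit(features)
    explained = pca.explained_variance_ratio_ * 100.0
    return explained, np.cumsum(explained)

# e.g. compare the 2048-d ImageNet inception features with the 2048-d AutoEncoder features:
# explained_variance_report(imagenet_features)
# explained_variance_report(autoencoder_features)
```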

Thus, for datasets where images are from a single class, such as CelebA [33] and LSUN Bedrooms [34], the representation should be acquired by training an AutoEncoder. Our framework employs a domain-specific encoder, which provides more fine-grained information related to the specific domain.

5.2 Study on Evaluation Metric

In this section, we used the domain-specific representation and studied the improvements of the evaluation metric CAFD proposed in our framework over the state-of-the-art metric FID [9]. On datasets with multiple classes, the Gaussian mixture model in CAFD better fits the feature distribution. Experiments and analysis on both the feature level and the image level were conducted on MNIST. First, we study the distribution of the encoded features via statistical normality tests. Then, the features are visualized to provide a better understanding of the feature space. Finally, a specific case is given where CAFD shows great robustness while FID fails to give results consistent with human judgements.

Single-manifold vs. Multi-manifold:

The Gaussian assumption on features has been commonly used in the literature [36]. Although there are non-linear operations such as ReLU and max-pooling in the neural network, assuming normality usually simplifies the model and enables closed-form expressions. However, on labeled datasets with multiple classes, the single-manifold Gaussian assumption is over-simplified.

In this experiment, we performed the Anderson-Darling test (AD-test) [37] to quantitatively study the normality of the data. Specifically, to test the multivariate normality of a set of features, we performed principal component analysis (PCA) on the data, applied the AD-test to the first 10 components and averaged the results. We compared the test results on each class and on the whole training set of MNIST. We used a simple 2-conv structure trained on the MNIST classification task as our feature encoder, with an output dimension of 1024. To reduce the influence of the sample size on the result, we randomly divided the whole feature set into 10 subsets to study the normality of the mixed features. Results are shown in Table 5. Although the p-values in both settings are small, features within a single class yield much larger p-values than the mixed features. It can be inferred that, compared to the whole training set, features within each class are much more Gaussian.

Set Number 0 1 2 3 4
Class
Mixed
Set Number 5 6 7 8 9
Class
Mixed
Table 5: P-value results of AD-test [37] on features of each class and the whole training images. The whole features were randomly divided into 10 sets. Compared to the mixed features, features encoding images from a single class are more like a single-manifold Gaussian structure.
Set Number 0 1 2 3 4
Class
Mixed
Set Number 5 6 7 8 9
Class
Mixed
Table 6: P-value results of Mardia's test [38] on features of each class and the whole test images. The whole features were randomly divided into 10 sets.

In addition, we used Mardia's test [38] from the R package MVN [39] to directly study multivariate normality. We first performed principal component analysis (PCA) on both the images within a class and the whole test set respectively. Then, Mardia's test [38] was used to assess the multivariate normality of the first 5 components. Results (shown in Table 6) are consistent with the previous AD-test [37] experiments. Both normality tests suggest that, compared to a single-manifold multivariate Gaussian model, the overall features are better fitted by a multi-manifold Gaussian mixture model. Thus, the basic assumption of CAFD in our framework is more reasonable than that of the FID [9] method.
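A sketch of the AD-test procedure described above follows. We use the Anderson-Darling implementation from statsmodels because it returns a p-value; this is an illustrative implementation choice, not a claim about the exact tooling used for Tables 5 and 6.

```python
import numpy as np
from sklearn.decomposition import PCA
from statsmodels.stats.diagnostic import normal_ad   # Anderson-Darling normality test with p-value

def pca_normality_pvalue(features, n_components=10):
    """Average AD-test p-value over the leading principal components of a feature set."""
    comps = PCA(n_components=n_components).fit_transform(features)
    pvals = [normal_ad(comps[:, i])[1] for i in range(n_components)]
    return float(np.mean(pvals))

# Compare, e.g., features of a single class against a randomly mixed subset of all classes:
# pca_normality_pvalue(class_features)
# pca_normality_pvalue(mixed_features)
```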

Feature Visualization:

To get an intuitive understanding of the multi-manifold structure of the feature distribution, we performed feature visualization via t-SNE [40] on the MNIST training set and colored the features by their class labels. As shown in Fig. 4, it is clear that features encoding images from the same class cluster together, and the overall features resemble a mixture of ten distributions with their own class centers. Therefore, assuming normality over the whole feature set is over-simplified. The encoder tends to cluster features from the same class, and the overall distribution is multi-manifold in a group-wise manner.
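The visualization in Fig. 4 can be reproduced roughly as follows; perplexity and other t-SNE hyperparameters are left at their defaults and are not the exact settings used for the figure.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """2-D t-SNE embedding of encoded features, colored by class label."""
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=2)
    plt.colorbar(label='class label')
    plt.show()
```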

Figure 4: Visualization of the features encoding the MNIST training set via t-SNE [40]. Features are distributed in groups by their class labels.
(a) generated (FID=73.11)
(b) hack (FID=72.82)
Figure 5: Examples where FID gives inconsistent results with human judgements on MNIST. Due to the over-simplified Gaussian assumption, FID can be hacked by mode collapse. a) Samples generated by a DCGAN model. b) Handmade images via axis permutation and FGSM [41].

Comparison between FID and CAFD:

In this experiment, we designed cases where FID fails to give results consistent with human judgements. FID, as an overall statistical measure, is able to detect either a single dropped mode or a trivial linear combination of two images. However, as its formulation imposes relatively limited constraints, it can be hacked in more complicated situations.

Consider the features extracted from the MNIST test data, which have a zero FID with themselves. We performed the operations below on the features.

  1. Performed principal component analysis (PCA) on the original features.

  2. Normalized each axis to zero mean and unit variance.

  3. Switched the normalized projections of the first two components.

  4. Un-normalized the data and reconstructed the features.

The adjusted features are completely different from the original ones while the FID remains zero. The over-simplified Gaussian assumption on the overall distribution cannot tell the difference, while our proposed method reports the change, with CAFD rising from 0 to 539.8 (see Table 7). A sketch of these feature-level operations is given below.
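A sketch of steps 1-4 on a feature matrix; full-rank PCA (as many components as feature dimensions) is assumed so that the reconstruction is lossless and the overall mean and covariance, and hence the FID, are preserved.

```python
import numpy as np
from sklearn.decomposition import PCA

def swap_first_two_components(features):
    """Permute the first two principal axes of a feature set (steps 1-4 above)."""
    pca = PCA(n_components=features.shape[1]).fit(features)
    proj = pca.transform(features)                     # 1. project onto the principal axes
    mu, sigma = proj.mean(axis=0), proj.std(axis=0)
    z = (proj - mu) / sigma                            # 2. normalize each axis
    z[:, [0, 1]] = z[:, [1, 0]]                        # 3. swap the first two components
    proj_adj = z * sigma + mu                          # 4. un-normalize ...
    return pca.inverse_transform(proj_adj)             #    ... and reconstruct the features
```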

We used FGSM [41] to reconstruct the images from the adjusted features. Specifically, we first trained a decoder for initialization via an AutoEncoder with the encoder fixed. Then, we performed pixel-wise adjustment via FGSM [41] to lower the reconstruction error. Because the encoder has a relatively simple structure, the final reconstruction error is still relatively high after optimization. We trained a simple DCGAN [4] model and took samples (generated by intermediate models during training) with FID comparable to that of our constructed images. Results are shown in Fig. 5.

            FID     CAFD     KLD
test        0       0        0
adjusted    0       539.8    0.03675
generated   73.1    201.4    0.001893
hack        72.8    468.6    0.04941
train       22.0    99.8     0.000572
Table 7: Results of FID, CAFD and KLD on MNIST. Lower scores indicate better image quality. ‘test’ denotes the MNIST test set, ‘adjusted’ denotes the features after axis permutation, and ‘generated’ and ‘hack’ are the sampled images in Fig. 5. Compared to FID, CAFD is more robust to feature-level adjustments.

It is obvious that the quality of the constructed images is much worse than that of the generated samples. After axis permutation, the constructed images suffer from mode collapse. There are many pictures on the right which resemble more than one digit and are hard to recognize. However, they still received an FID of 72.82, lower than that (73.11) received by the generated samples. CAFD and KLD results for these cases are shown in Table 7. While FID gives misleading results, CAFD is much more robust on the adjusted features. Compared to the constructed images (468.6), the generated images received a much lower CAFD (201.4), which is consistent with human judgements. This demonstrates the improved effectiveness of the evaluation metric in our proposed framework.

6 Conclusions

In this paper, we have presented an improved evaluation framework for Generative Adversarial Networks, which improves conventional methods on both the representation and the evaluation metric. We argue that a domain-specific encoder is needed and propose the Class-Aware Frechet Distance to better fit the feature distribution. To our best knowledge, we are the first to provide counterexamples where the state-of-the-art FID is inconsistent with human judgements. Experiments and analysis on both the feature level and the image level have shown that our framework is more effective and more robust than FID.


G A Benchmark for Popular GANs

(a) Results on MNIST
(b) Results on FASHION-MNIST [42]
Figure 6: Results of our evaluation framework on popular GAN models. The experiments were performed on MNIST and FASHION-MNIST [42].
              MNIST        FASHION-MNIST [42]
DCGAN [4]
LSGAN [7]
BEGAN [6]
EBGAN [24]
DRAGAN [25]   120.2±0.6    51.9±0.4
WGAN [5]
WGAN-GP [26]  126.7±0.9    54.1±0.5
Table 8: CAFD results of different GAN models on MNIST and FASHION-MNIST [42]. The encoder was specifically trained on each dataset.

In order to benchmark the performance of GANs on generating domain-specific images, we conducted experiments on 7 popular GAN models3: DCGAN [4], LSGAN [7], BEGAN [6], EBGAN [24], DRAGAN [25], WGAN [5] and WGAN-GP [26]. Our experiments were performed on MNIST and FASHION-MNIST [42]. We will include other popular datasets such as CIFAR-10 [32], CelebA [33] and ImageNet [10] in the future.

Results are shown in Fig. 6 and Table 8. All of the tested models converge. DCGAN [4], which was the first to introduce convolutional neural networks into the GAN framework, converges less stably than the newly proposed GAN variants. DRAGAN [25] and WGAN-GP [26] obtain the top two scores on both datasets. Both BEGAN [6] and WGAN [5] focus more on stable training, while the quality of their generated images is not the best. WGAN-GP [26] improves on WGAN [5] by replacing weight clipping with a gradient norm penalty, and it generates higher-quality images than its baseline. DRAGAN [25] utilizes a gradient penalty scheme and mitigates the problem of mode collapse. It is worth noting that the recently proposed DRAGAN [25] and WGAN-GP [26] outperform the other models by a relatively large margin. We can infer that the exploration of better GAN architectures and training strategies is still highly active.

H Qualitative Visualization

In this section, we provide qualitative visualizations of images with different scores under our evaluation framework. The images were generated by intermediate models during GAN training. Experiments were conducted on FASHION-MNIST [42]. Figs. 7, 8 and 9 show the results. The Class-Aware Frechet Distance (CAFD) metric in our proposed framework gives results consistent with human judgements.

Figure 7: Qualitative visualization of different scores on FASHION-MNIST [42].
Figure 8: Qualitative visualization of different scores on FASHION-MNIST [42].
Figure 9: Qualitative visualization of different scores on FASHION-MNIST [42].

Footnotes

  1. Equal contribution.
  2. https://github.com/B1ueber2y/CAFD
  3. We used the off-the-shelf tensorflow package https://github.com/hwalsuklee/tensorflow-generative-model-collections.

References

  1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014) 2672–2680
  2. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. (2017) 2223–2232
  3. Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR. (2017) 4681–4690
  4. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434 (2015)
  5. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML. (2017) 214–223
  6. Berthelot, D., Schumm, T., Metz, L.: BEGAN: boundary equilibrium generative adversarial networks. CoRR abs/1703.10717 (2017)
  7. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: ICCV. (2017) 2794–2802
  8. Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: ICLR. (Apr 2016)
  9. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NIPS. (2017) 6629–6640
  10. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
  11. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., Chen, X.: Improved techniques for training gans. In: NIPS. (2016) 2234–2242
  12. Huang, G., Yuan, Y., Xu, Q., Guo, C., Sun, Y., Wu, F., Weinberger, K.: An empirical study on evaluation metrics of generative adversarial networks. https://openreview.net/forum?id=Sy1f0e-R- (2018)
  13. Im, D.J., Ma, A.H., Taylor, G.W., Branson, K.: Quantitatively evaluating GANs with divergences proposed for training. In: ICLR. (2018)
  14. Lucic, M., Kurach, K., Michalski, M., Gelly, S., Bousquet, O.: Are GANs Created Equal? A Large-Scale Study. ArXiv e-prints (November 2017)
  15. Borji, A.: Pros and Cons of GAN Evaluation Measures. ArXiv e-prints (February 2018)
  16. Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: ECCV. (2016) 597–613
  17. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. (2017) 1125–1134
  18. Gurumurthy, S., Kiran Sarvadevabhatla, R., Venkatesh Babu, R.: Deligan : Generative adversarial networks for diverse and limited data. In: CVPR. (2017) 166–174
  19. Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., Belongie, S.: Stacked generative adversarial networks. In: CVPR. (2017) 5077–5086
  20. Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (gans). In: ICML. (2017) 224–232
  21. Hoang, Q., Nguyen, T.D., Le, T., Phung, D.: MGAN: Training generative adversarial nets with multiple generators. In: ICLR. (2018)
  22. Che, T., Li, Y., Jacob, A.P., Bengio, Y., Li, W.: Mode regularized generative adversarial networks. In: ICLR. (2017)
  23. Dziugaite, G.K., Roy, D.M., Ghahramani, Z.: Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906 (2015)
  24. Zhao, J., Mathieu, M., LeCun, Y.: Energy-based generative adversarial network. In: ICLR. (2017)
  25. Kodali, N., Abernethy, J.D., Hays, J., Kira, Z.: How to train your DRAGAN. CoRR abs/1705.07215 (2017)
  26. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. In: NIPS. (2017) 5767–5777
  27. Wu, Y., Burda, Y., Salakhutdinov, R., Grosse, R.: On the quantitative analysis of decoder-based generative models. In: ICLR. (2017)
  28. Lopez-Paz, D., Oquab, M.: Revisiting classifier two-sample tests. In: ICLR. (2017)
  29. Li, Y., Swersky, K., Zemel, R.: Generative moment matching networks. In: ICML. (2015) 1718–1727
  30. Barratt, S., Sharma, R.: A Note on the Inception Score. ArXiv e-prints (January 2018)
  31. Rosca, M., Lakshminarayanan, B., Warde-Farley, D., Mohamed, S.: Variational Approaches for Auto-Encoding Generative Adversarial Networks. ArXiv e-prints (June 2017)
  32. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. (2009)
  33. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV. (2015) 3730–3738
  34. Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
  35. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015) 1–9
  36. Jin, J., Dundar, A., Culurciello, E.: Robust convolutional neural networks under adversarial noise. In: ICLR Workshop. (2016)
  37. Scholz, F.W., Stephens, M.A.: K-sample anderson-darling tests. Journal of the American Statistical Association 82(399) (1987) 918–924
  38. Mardia, K.V.: Mardia’s test of multinormality. Encyclopedia of statistical sciences (1985)
  39. Korkmaz, S., Goksuluk, D., Zararsiz, G.: Mvn: an r package for assessing multivariate normality. The R Journal 6(2) (2014) 151–162
  40. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research 9(Nov) (2008) 2579–2605
  41. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
  42. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms (2017)