# On the Evaluation of Conditional GANs

## Abstract

Conditional Generative Adversarial Networks (cGANs) are finding increasingly widespread use in many application domains. Despite outstanding progress, quantitative evaluation of such models often involves multiple distinct metrics to assess different desirable properties, such as image quality, conditional consistency, and intra-conditioning diversity. In this setting, model benchmarking becomes a challenge, as each metric may indicate a different “best” model. In this paper, we propose the Fréchet Joint Distance (FJD), which is defined as the Fréchet distance between joint distributions of images and conditioning, allowing it to implicitly capture the aforementioned properties in a single metric. We conduct proof-of-concept experiments on a controllable synthetic dataset, which consistently highlight the benefits of FJD when compared to currently established metrics. Moreover, we use the newly introduced metric to compare existing cGAN-based models for a variety of conditioning modalities (e.g. class labels, object masks, bounding boxes, images, and text captions). We show that FJD can be used as a promising single metric for cGAN benchmarking and model selection. Code can be found at https://github.com/facebookresearch/fjd.

## 1 Introduction

The use of generative models is growing across many domains (van den Oord et al., 2016a; Vondrick et al., 2016; Serban et al., 2017; Karras et al., 2018; Brock et al., 2019). Among the most promising approaches, Variational Auto-Encoders (VAEs) (Kingma and Welling, 2014), auto-regressive models (van den Oord et al., 2016b, c), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have been driving significant progress, with the latter at the forefront of a wide-range of applications (Mirza and Osindero, 2014; Reed et al., 2016; Zhang et al., 2018; Vondrick et al., 2016; Almahairi et al., 2018; Subramanian et al., 2018; Salvador et al., 2019). In particular, significant research has emerged from practical applications, which require generation to be based on existing context. For example, tasks such as image inpainting, super-resolution, or text-to-image synthesis have been successfully addressed within the framework of conditional generation, with conditional GANs (cGANs) among the most competitive approaches. Despite these outstanding advances, quantitative evaluation of GANs remains a challenge (Theis et al., 2016; Borji, 2018).

In the last few years, a significant number of evaluation metrics for GANs have been introduced in the literature (Salimans et al., 2016; Heusel et al., 2017; Bińkowski et al., 2018; Shmelkov et al., 2018; Zhou et al., 2019; Kynkäänniemi et al., 2019; Ravuri and Vinyals, 2019). Although there is no clear consensus on which quantitative metric is most appropriate to benchmark GAN-based models, Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) have been extensively used. However, both IS and FID were introduced in the context of unconditional image generation and, hence, focus on capturing certain desirable properties such as visual quality and sample diversity, which do not fully encapsulate all the different phenomena that arise during conditional image generation.

In conditional generation, we care about visual quality, conditional consistency (i.e., verifying that the generation respects its conditioning), and intra-conditioning diversity (i.e., sample diversity per conditioning). Although visual quality is captured by both metrics, IS is agnostic to intra-conditioning diversity and FID only captures it indirectly. Moreover, neither of them can capture conditional consistency. In order to overcome these shortcomings, researchers have resorted to reporting conditional consistency and diversity metrics in conjunction with FID (Zhao et al., 2019; Park et al., 2019).

Consistency metrics often use some form of concept detector to ensure that the requested conditioning appears in the generated image as expected. Although intuitive to use, these metrics require pre-trained models that cover the same target concepts in the same format as the conditioning (i.e., classifiers for image-level class conditioning, semantic segmentation for mask conditioning, etc.), which may or may not be available off-the-shelf. Moreover, using different metrics to evaluate different desirable properties may hinder the process of model selection, as there may not be a single model that surpasses the rest in all measures. In fact, it has recently been demonstrated that there is a natural trade-off between image quality and sample diversity (Yang et al., 2019), which calls into question how we might select the correct balance of these properties.

In this paper we introduce a new metric called Fréchet Joint Distance (FJD), which is able to implicitly assess image quality, conditional consistency, and intra-conditioning diversity. FJD computes the Fréchet distance on an embedding of the joint image-conditioning distribution, and introduces only small computational overhead over FID compared to alternative methods. We evaluate the properties of FJD on a variant of the synthetic dSprite dataset (Matthey et al., 2017) and verify that it successfully captures the desired properties. We provide an analysis on the behavior of both FID and FJD under different types of conditioning such as class labels, bounding boxes, and object masks, and evaluate a variety of existing cGAN models for real-world datasets with the newly introduced metric. Our experiments show that (1) FJD captures the three highlighted properties of conditional generation; (2) it can be applied to any kind of conditioning (e.g., class, bounding box, mask, image, text, etc.); and (3) when applied to existing cGAN-based models, FJD demonstrates its potential to be used as a promising unified metric for hyperparameter selection and cGAN benchmarking. To our knowledge, there are no existing metrics for conditional generation that capture all of these key properties.

## 2 Related Work

Conditional GANs have witnessed outstanding progress in recent years. Training stability has been improved through the introduction of techniques such as progressive growing (Karras et al., 2018), spectral normalization (Miyato et al., 2018), and the two time-scale update rule (Heusel et al., 2017). Architecturally, conditional generation has been improved through the use of auxiliary classifiers (Odena et al., 2017) and the introduction of projection-based conditioning for the discriminator (Miyato and Koyama, 2018). Image quality has also benefited from the incorporation of self-attention (Zhang et al., 2018), as well as increases in model capacity and batch size (Brock et al., 2019).

All of this progress has led to impressive results, paving the road towards the challenging task of generating more complex scenes. To this end, a flurry of works has tackled different forms of conditional image generation, including class-based (Mirza and Osindero, 2014; Heusel et al., 2017; Miyato et al., 2018; Odena et al., 2017; Miyato and Koyama, 2018; Brock et al., 2019), image-based (Isola et al., 2017; Zhu et al., 2017a; Wang et al., 2018; Zhu et al., 2017b; Almahairi et al., 2018; Huang et al., 2018; Mao et al., 2019), mask- and bounding box-based (Hong et al., 2018; Hinz et al., 2019; Park et al., 2019; Zhao et al., 2019), as well as text- (Reed et al., 2016; Zhang et al., 2017, 2018; Xu et al., 2018; Hong et al., 2018) and dialogue-based conditionings (Sharma et al., 2018; El-Nouby et al., 2019). This intensified research has led to the development of a variety of metrics to assess the three factors that determine the quality of conditional image generation, namely visual quality, conditional consistency, and intra-conditioning diversity.

Visual quality. A number of GAN evaluation metrics have emerged in the literature to assess visual quality of generated images in the case of unconditional image generation. Most of these metrics either focus on the separability between generated images and real images (Lehmann and Romano, 2005; Radford et al., 2016; Yang et al., 2017; Isola et al., 2017; Zhou et al., 2019), compute the distance between distributions (Gretton et al., 2012; Heusel et al., 2017; Arjovsky et al., 2017), assess sample quality and diversity from conditional or marginal distributions (Salimans et al., 2016; Gurumurthy et al., 2017; Zhou et al., 2018), measure the similarity between generated and real images (Wang et al., 2004; Xiang and Li, 2017; Snell et al., 2017; Juefei-Xu et al., 2017), or are log-likelihood based (Theis et al., 2016). Among these, the most accepted automated visual quality metrics are Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017).

Conditional consistency. To assess the consistency of the generated images with respect to model conditioning, researchers have resorted to available, pre-trained feed-forward models. The structure of these models depends on the modality of the conditioning (e.g., segmentation models are used for mask conditioning, and image captioning models are applied to evaluate text conditioning). Moreover, the metric used to evaluate the forward model on the generated distribution depends on the conditioning modality and includes: accuracy in the case of class-conditioned generation, Intersection over Union when using bounding box- and mask-conditionings, BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) or CIDEr (Vedantam et al., 2015) in the case of text-based conditionings, and Structural Similarity (SSIM) or peak signal-to-noise ratio (PSNR) for image-conditioning.

Intra-conditioning diversity. The most common metric for evaluating sample diversity is Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018a), which measures the distance between samples in a learned feature space. Alternatively, Miyato and Koyama (2018) proposed Intra-FID, which calculates a FID score separately for each conditioning and reports the average score over all conditionings. This method should in principle capture the desirable properties of image quality, conditional consistency, and intra-class diversity. However, it scales poorly with the number of unique conditions: the computationally intensive FID calculation must be repeated for each case, and FID behaves poorly when the sample size is small (Bińkowski et al., 2018). Furthermore, in cases where the conditioning cannot be broken down into a set of discrete classes (e.g., pixel-based conditioning), Intra-FID is intractable. As a result, it has not been applied beyond class-conditioning.

## 3 Review of Fréchet Inception Distance (FID)

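FID aims to compare the statistics of generated samples to samples from a real dataset. Given two multivariate Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$, the Fréchet Distance (FD) is defined as:

$$d^2\big((\mu, \Sigma), (\hat{\mu}, \hat{\Sigma})\big) = \lVert \mu - \hat{\mu} \rVert_2^2 + \mathrm{Tr}\big(\Sigma + \hat{\Sigma} - 2(\Sigma \hat{\Sigma})^{1/2}\big). \tag{1}$$

When evaluating a generative model, $\mathcal{N}(\mu, \Sigma)$ represents the data (reference) distribution, obtained by fitting a Gaussian to samples from a reference dataset, and $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$ represents the learned (generated) distribution, a result of fitting to samples from a generative model.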
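As an illustration of Equation 1, the following is a minimal NumPy sketch rather than the authors' released implementation. The function name `frechet_distance` is hypothetical. It relies on the fact that, for positive semi-definite covariances, the trace of the matrix square root of the product equals the sum of the square roots of the product's eigenvalues.

```python
import numpy as np


def frechet_distance(mu, sigma, mu_hat, sigma_hat):
    """Squared Fréchet distance between N(mu, sigma) and N(mu_hat, sigma_hat)."""
    mu = np.asarray(mu, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    sigma_hat = np.atleast_2d(np.asarray(sigma_hat, dtype=float))

    mean_term = np.sum((mu - mu_hat) ** 2)
    # For PSD inputs the eigenvalues of sigma @ sigma_hat are real and
    # non-negative, so Tr((sigma sigma_hat)^{1/2}) = sum of their square roots.
    # Clipping guards against tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma @ sigma_hat)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(mean_term + np.trace(sigma) + np.trace(sigma_hat) - 2.0 * tr_sqrt)
```

Production implementations often compute the matrix square root directly instead; the eigenvalue route is used here only to keep the sketch dependency-free beyond NumPy.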

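In FID, both the real images and model samples are embedded in a learned feature space using a pre-trained Inception v3 model (Szegedy et al., 2016). Thus, the Gaussian distributions are defined in the embedded space. More precisely, given a dataset of images $\{x^{(i)}\}_{i=0}^{N}$, a set of model samples $\{\hat{x}^{(i)}\}_{i=0}^{M}$, and an Inception embedding function $f$, we estimate the Gaussian parameters $\mu$, $\Sigma$, $\hat{\mu}$, and $\hat{\Sigma}$ as:

$$\mu = \frac{1}{N} \sum_{i=0}^{N} f(x^{(i)}), \qquad \Sigma = \frac{1}{N-1} \sum_{i=0}^{N} \big(f(x^{(i)}) - \mu\big)\big(f(x^{(i)}) - \mu\big)^{T}, \tag{2}$$

$$\hat{\mu} = \frac{1}{M} \sum_{i=0}^{M} f(\hat{x}^{(i)}), \qquad \hat{\Sigma} = \frac{1}{M-1} \sum_{i=0}^{M} \big(f(\hat{x}^{(i)}) - \hat{\mu}\big)\big(f(\hat{x}^{(i)}) - \hat{\mu}\big)^{T}. \tag{3}$$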
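The estimation step in Equations 2 and 3 amounts to taking the sample mean and the unbiased sample covariance of the embedded images. A minimal sketch follows, assuming the embeddings are already available as a `(num_samples, dim)` array. The helper name `gaussian_stats` is hypothetical, and random vectors stand in for real Inception v3 features.

```python
import numpy as np


def gaussian_stats(features):
    """Sample mean and unbiased covariance of row-wise embeddings."""
    features = np.asarray(features, dtype=float)  # shape: (num_samples, dim)
    mu = features.mean(axis=0)
    # np.cov divides by (num_samples - 1) by default; rowvar=False treats
    # each column as one embedding dimension.
    sigma = np.atleast_2d(np.cov(features, rowvar=False))
    return mu, sigma


# Placeholder features (assumption): random vectors instead of f(x).
rng = np.random.default_rng(0)
real_features = rng.normal(size=(1000, 8))
mu, sigma = gaussian_stats(real_features)
```

The same function would be applied to the embedded model samples to obtain the generated-distribution statistics.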

## 4 Fréchet Joint Distance (FJD)

In conditional image generation, a dataset is composed of image-condition pairs $\{(x^{(i)}, y^{(i)})\}_{i=0}^{N}$, where the conditioning can take variable forms, such as image-level classes, segmentation masks, or text. The goal of conditional image generation is to produce realistic looking, diverse images that are consistent with the conditioning $y$. Thus, a set of model samples with corresponding conditioning can be defined as $\{(\hat{x}^{(i)}, \hat{y}^{(i)})\}_{i=0}^{M}$.

As discussed in Section 3, the Fréchet distance (FD) compares any two Gaussians defined over arbitrary spaces. In FJD, we propose to compute the FD between two Gaussians defined over the joint image-conditioning embedding space.

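More precisely, given an image embedding function $f$, a conditioning embedding function $h$, a conditioning embedding scaling factor $\alpha$, and a merging function $g$ that combines the image embedding with the conditioning embedding into a joint one, we can estimate the respective Gaussian parameters $\mu$, $\Sigma$, $\hat{\mu}$, and $\hat{\Sigma}$ as:

$$\mu = \frac{1}{N} \sum_{i=0}^{N} g\big(f(x^{(i)}), \alpha h(y^{(i)})\big), \qquad \hat{\mu} = \frac{1}{M} \sum_{i=0}^{M} g\big(f(\hat{x}^{(i)}), \alpha h(\hat{y}^{(i)})\big), \tag{4}$$

$$\Sigma = \frac{1}{N-1} \sum_{i=0}^{N} \big(g(f(x^{(i)}), \alpha h(y^{(i)})) - \mu\big)\big(g(f(x^{(i)}), \alpha h(y^{(i)})) - \mu\big)^{T}, \tag{5}$$

$$\hat{\Sigma} = \frac{1}{M-1} \sum_{i=0}^{M} \big(g(f(\hat{x}^{(i)}), \alpha h(\hat{y}^{(i)})) - \hat{\mu}\big)\big(g(f(\hat{x}^{(i)}), \alpha h(\hat{y}^{(i)})) - \hat{\mu}\big)^{T}. \tag{6}$$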

Note that by computing the FD over the joint image-conditioning distribution, we are able to simultaneously assess image quality, conditional consistency, and intra-conditioning diversity, all of which are important factors in evaluating the quality of conditional image generation models.
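To make the joint computation concrete, here is a small sketch under stated assumptions: concatenation is used as the merging function, the embeddings are precomputed arrays, and all function names (`frechet_distance`, `joint_stats`, `fjd`) and the toy data are hypothetical, not taken from the released code. The toy "generated" set reuses the reference images but shuffles their conditionings, so the image marginal is unchanged while the pairing is broken.

```python
import numpy as np


def frechet_distance(mu, sigma, mu_hat, sigma_hat):
    """Squared Fréchet distance between two Gaussians (eigenvalue form)."""
    eigvals = np.linalg.eigvals(sigma @ sigma_hat)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu - mu_hat) ** 2)
    return float(mean_term + np.trace(sigma) + np.trace(sigma_hat) - 2.0 * tr_sqrt)


def joint_stats(image_emb, cond_emb, alpha):
    """Mean and covariance of the concatenated embedding [f(x), alpha * h(y)]."""
    joint = np.concatenate([image_emb, alpha * cond_emb], axis=1)
    return joint.mean(axis=0), np.cov(joint, rowvar=False)


def fjd(image_emb, cond_emb, image_emb_hat, cond_emb_hat, alpha):
    mu, sigma = joint_stats(image_emb, cond_emb, alpha)
    mu_hat, sigma_hat = joint_stats(image_emb_hat, cond_emb_hat, alpha)
    return frechet_distance(mu, sigma, mu_hat, sigma_hat)


# Toy data (hypothetical): 3 classes, 4-d "image embeddings" whose mean
# depends on the class, and one-hot conditioning embeddings.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=2000)
cond = np.eye(3)[labels]
images = 3.0 * np.eye(3, 4)[labels] + rng.normal(size=(2000, 4))

# "Generated" set: identical images, conditionings randomly shuffled.
shuffled_cond = cond[rng.permutation(len(cond))]

fid_like = fjd(images, cond, images, shuffled_cond, alpha=0.0)   # conditioning ignored
fjd_value = fjd(images, cond, images, shuffled_cond, alpha=1.0)  # joint distance
```

In this toy setting the distance with `alpha=0.0` is zero up to round-off, because only the image marginal is compared, whereas the joint distance is clearly positive.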

To ensure reproducibility, when reporting FJD scores it is important to include details such as which conditioning embedding function was used, which dataset is used for the reference distribution, and the value of $\alpha$. We report these values for all of our experiments in Appendix B.

### 4.1 Conditioning embedding function: h

The purpose of the embedding function $h$ is to reduce the dimensionality and extract a useful feature representation of the conditioning. As such, the choice of $h$ will vary depending on the modality of conditioning. In most cases, an off-the-shelf, pretrained embedding can be used for the purposes of extracting a useful representation. In the absence of preexisting, pretrained conditioning embedding functions, a new one should be learned. For example, for bounding box and mask conditionings the embedding function could be learned with an autoencoder. For suggested assignments of conditioning modalities to embedding functions, please refer to Table 1.

### 4.2 Conditioning embedding scaling factor: α

In order to control the relative contribution of the image component and the conditioning component to the final FJD value, we scale the conditioning embedding by a constant $\alpha$. In essence, $\alpha$ indicates how much we care about the conditioning component compared to the image component. When $\alpha = 0$, the conditioning component is ignored and FJD is equivalent to FID. As the value of $\alpha$ increases, the perceived importance of the conditioning component is also increased and reflected accordingly in the resulting measure. To equally weight the image component and the conditioning component, we recommend setting $\alpha$ to be equal to the ratio between the average norm of the image embedding and the average norm of the conditioning embedding. This weighting ensures that FJD retains consistent behaviour across conditioning embeddings, even with varying dimensionality or magnitude. We note that $\alpha$ should be calculated on data from the reference distribution (real data distribution), and then applied to all conditioning embeddings thereafter. See Appendix F for an example of the effect of the $\alpha$ hyperparameter.
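The recommended balancing rule can be sketched as follows. This is an illustration only: the helper name `balancing_alpha` is hypothetical, and the use of the Euclidean (L2) norm is an assumption, since the text does not state which norm is meant.

```python
import numpy as np


def balancing_alpha(image_emb, cond_emb):
    """Ratio of the average image-embedding norm to the average conditioning-embedding norm.

    Both inputs are (num_samples, dim) arrays computed on the reference
    (real) data. The L2 norm is an assumption of this sketch.
    """
    image_norm = np.linalg.norm(np.asarray(image_emb, dtype=float), axis=1).mean()
    cond_norm = np.linalg.norm(np.asarray(cond_emb, dtype=float), axis=1).mean()
    return float(image_norm / cond_norm)


# Toy reference embeddings: image norms 5 and 10, one-hot conditionings of norm 1.
image_emb = np.array([[3.0, 4.0], [6.0, 8.0]])
cond_emb = np.eye(2)
alpha = balancing_alpha(image_emb, cond_emb)
```

After scaling by this value, the conditioning embeddings have the same average norm as the image embeddings, which is the sense in which the two components are weighted equally.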

### 4.3 Merging function: g

The purpose of the merging function $g$ is to combine the image embedding and conditioning embedding into a single joint embedding. We compared several candidate merging functions and found concatenation of the image embedding and conditioning embedding vectors to be most effective, both in terms of simplicity and performance. As such, concatenation is used as the merging function in all following experiments.

## 5 Evaluation of the Properties of Fréchet Joint Distance

In this section, we demonstrate that FJD captures the three desiderata of conditional image generation, namely image quality, conditional consistency and intra-conditioning diversity.

### 5.1 Dataset

dSprite-textures. The dSprite dataset (Matthey et al., 2017) is a synthetic dataset where each image depicts a simple 2D shape on a black background. Each image can be fully described by a set of factors, including shape, scale, rotation, $x$ position, and $y$ position. We augment the dSprite dataset to create dSprite-textures by adding three texture patterns for each sample. Additionally, we include class labels indicating shape, as well as bounding boxes and mask labels for each sample (see Figure 1). In total, the dataset contains 2,211,840 unique images. This synthetic dataset allows us to exactly control our sample distribution and, thereby, simulate a generator with image-conditioning inconsistencies or reduced sample diversity. To embed the conditioning for calculating FJD in the following experiments, we use one-hot encoding for the class labels, and autoencoder representations for the bounding box and mask labels.

### 5.2 Image Quality

In this subsection, we aim to test the sensitivity of FJD to image quality perturbations. To do so, we draw k random samples from the dSprite-textures dataset to form a reference dataset. The generated dataset is simulated by duplicating the reference dataset and adding Gaussian noise drawn from to the images, where and pixel values are normalized (and clipped after noise addition) to the range . The addition of noise mimics a generative model that produces low quality images. We repeat this experiment for all three conditioning types in dSprite-textures: class, bounding box, and mask.

Results are shown in Figure 2, where we plot both FID and FJD as a function of the added Gaussian noise (the noise level is indicated on the $x$-axis as Noise Magnitude). We find that, in all cases, FJD tracks FID very closely, indicating that it successfully captures image quality. Interestingly, we note that FJD increases slightly compared to FID as image quality decreases, likely due to a decrease in perceived conditional consistency. Additional image quality experiments on the large scale COCO-Stuff dataset can be found in Appendix C.
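The perturbation used in this experiment can be sketched as below. Several details are assumptions made for illustration, because they are not recoverable from the text here: the noise is taken to be zero-mean with standard deviation `sigma`, and pixel values are taken to lie in `[0, 1]`. The function name `add_gaussian_noise` is hypothetical.

```python
import numpy as np


def add_gaussian_noise(images, sigma, rng):
    """Simulate a low-quality generator by adding clipped Gaussian noise.

    Assumptions of this sketch: zero-mean noise with standard deviation
    `sigma`, and pixel values normalized to the range [0, 1].
    """
    images = np.asarray(images, dtype=float)
    noisy = images + rng.normal(loc=0.0, scale=sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
reference = rng.uniform(size=(4, 8, 8))          # stand-in for real images
generated = add_gaussian_noise(reference, sigma=0.1, rng=rng)
```

Sweeping `sigma` over a grid and recomputing FID and FJD at each value would reproduce the shape of the experiment, though not necessarily its exact settings.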

### 5.3 Conditional Consistency

In this subsection, we aim to highlight the sensitivity of FJD to conditional consistency. In particular, we target specific types of inconsistencies, such as incorrect scale, orientation, or position. We draw a set of k samples from the dSprite-textures dataset and duplicate it to represent the reference dataset and the generated dataset, each with identical image and conditioning marginal distributions. For of the generated dataset samples we swap conditionings of pairs of samples that are identical in all but one of the attributes (scale, orientation, $x$ position or $y$ position). For example, if one generated sample has attribute position 4 and a second generated sample has attribute position 7, swapping their conditionings leads to generated samples that are offset by 3 pixels w.r.t. their ground truth position. Swapping conditionings in this manner allows us to control for specific attributes’ conditional consistency, while keeping the image and conditioning marginal distributions unchanged. As a result, all changes in FJD can be attributed solely to conditional inconsistencies.

Figure 3 depicts the results of this experiment for four different types of alterations: scale, orientation, and $x$ and $y$ positions. We observe that the FID between image distributions (solid blue line) remains constant even as the degree of conditional inconsistency increases. For class conditioning (dotted orange line), FJD also remains constant, as changes to scale, orientation, and position are independent of the object class. Bounding box and mask conditionings, as they contain spatial information, produce variations in FJD that are proportional to the offset. Interestingly, for the orientation offsets, FJD with mask conditioning fluctuates rather than increasing monotonically. This behaviour is due to the orientation masks partially re-aligning with the ground truth around and . Each of these cases emphasizes the sensitivity of FJD with respect to conditional consistency. Additional conditional consistency experiments with text conditioning can be found in Appendix D.
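The key property of the swapping procedure is that it only permutes conditionings, so both marginal distributions stay fixed while the pairing changes. The sketch below illustrates that property in simplified form: it swaps randomly chosen pairs, whereas the experiment swaps pairs that differ in exactly one attribute. The function name `swap_conditionings` and the `fraction` parameter are hypothetical.

```python
import numpy as np


def swap_conditionings(cond, fraction, rng):
    """Swap conditionings between random pairs for a fraction of the samples.

    Simplification: pairs are chosen at random rather than being matched on
    all attributes but one. Only the pairing changes; the multiset of
    conditionings (the conditioning marginal) is preserved.
    """
    cond = np.array(cond, copy=True)
    n_swap = int(fraction * len(cond)) // 2 * 2  # need an even number of samples
    if n_swap == 0:
        return cond
    idx = rng.choice(len(cond), size=n_swap, replace=False)
    first, second = idx[: n_swap // 2], idx[n_swap // 2:]
    # Fancy indexing on the right-hand side yields copies, so this is a true swap.
    cond[first], cond[second] = cond[second], cond[first]
    return cond


rng = np.random.default_rng(0)
positions = np.arange(10)                      # toy conditioning: one position per sample
swapped = swap_conditionings(positions, fraction=1.0, rng=rng)
```

Because the images are left untouched and the conditionings are merely re-paired, any change in a joint metric computed afterwards can be attributed to the broken pairing.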

### 5.4 Intra-conditioning Diversity

In this subsection, we aim to test the sensitivity of FJD to intra-conditioning diversity by altering the per-conditioning image texture variability. More precisely, we vary the texture based on four different image attributes: shape, which is captured in all tested conditionings, as well as scale, orientation and position, which are captured by bounding box and mask conditionings only. To create attribute-texture assignments, we stratify attributes based on their values. For example, one possible shape-based stratification of a dataset with three shapes might be: [squares, ellipses, hearts]. To quantify the dataset intra-conditioning diversity, we introduce a diversity score. A diversity score of 1 means that the per-attribute texture distribution is uniform across strata, while a diversity score of 0 means that each stratum is assigned to a single texture. Middling diversity scores indicate that the textural distribution is skewed towards one texture type in each stratum. We create our reference dataset by randomly drawing k samples. The generated distribution is created by duplicating the reference distribution and adjusting the per-attribute texture variability to achieve the desired diversity score.

The results of these experiments are shown in Figure 4, which plots the increase in FID and FJD, for different types of conditioning, as the diversity of textures within each subset decreases. For all tested scenarios, we observe that FJD is sensitive to intra-conditioning diversity changes. Moreover, not surprisingly, since a change in the joint distribution of attributes and textures also implies a change to the image marginal distribution, we observe that FID increases with reduced diversity. This experiment suggests that FID is able to capture intra-conditioning diversity changes when the image marginal distribution is also affected. However, if the image marginal distribution were to stay constant, FID would be blind to intra-conditioning diversity changes (as is shown in Section 5.3).

## 6 Evaluation of existing conditional generation models

In this section, we seek to demonstrate the application of FJD to evaluate models with several different conditioning modalities, in contrast to FID and standard conditional consistency and diversity metrics. We focus on testing class-conditioned, image-conditioned, and text-conditioned image generation tasks, which have been the focus of numerous works. Multi-label, bounding box, and mask conditioning are also explored in Appendix I. We note that FJD and FID yield similar rankings of models in this setting, which is to be expected since most models use similar conditioning mechanisms. Rankings are therefore dominated by image quality, rather than conditional consistency. We refer the reader to Appendices F and H for examples of cases where FJD ranks models differently than FID.

Class-conditioned cGANs. Table 2 compares three state-of-the-art class-conditioned generative models trained on ImageNet at resolution. Specifically, we evaluate SN-GAN (Miyato et al., 2018) trained with and without a projection discriminator (Miyato and Koyama, 2018), and BigGAN (Brock et al., 2019). Accuracy is used to evaluate conditional consistency, and is computed as the Inception v3 accuracy of each model’s generated samples, using their conditioning as classification ground truth. Class labels from the validation set are used as conditioning to generate k samples for each model, and the training set is used as the reference distribution. One-hot encoding is used to embed the class conditioning for the purposes of calculating FJD.

We find that FJD follows the same trend as FID for class-conditioned models, preserving their ranking and highlighting FJD’s ability to capture image quality. Additionally, we note that the difference between FJD and FID correlates with each model’s classification accuracy, with smaller gaps appearing to indicate better conditional consistency. Diversity scores, however, rank models in the opposite order compared to all other metrics.

This behaviour evokes the trade-off between realism and diversity highlighted by Yang et al. (2019). Ideally, we would like a model that produces diverse outputs, but this property is not as attractive if it also results in a decrease in image quality. At what point should diversity be prioritized over image quality, and vice versa? FJD is a suitable metric for answering this question if the goal is to find a model that best matches the target conditional data generating distribution. We refer the reader to Appendix F for examples of cases where models with greater diversity are favoured over models with better image quality.

Image-conditioned cGANs. Table 3 compares four state-of-the-art image translation models: Pix2pix (Isola et al., 2017), BicycleGAN (Zhu et al., 2017b), MSGAN (Mao et al., 2019), and MUNIT (Huang et al., 2018). We evaluate on four different image-to-image datasets: Facades (Tyleček and Šára, 2013), Maps (Isola et al., 2017), Edges2Shoes and Edges2Handbag (Zhu et al., 2016). To assess conditional consistency we utilize LPIPS to measure the average distance between generated images and their corresponding ground truth images. Conditioning from the validation sets is used to generate images, while the training sets are used as reference distributions. An Inception v3 model is used to embed the image conditioning for the FJD calculation. Due to the small size of the validation sets, we report scores averaged over 5 evaluations of each model.

In this setting we encounter some ambiguity with regard to model selection, as for all datasets, each metric ranks the models differently. BicycleGAN appears to have the best image quality, Pix2pix produces images that are most visually similar to the ground truth, and MSGAN and MUNIT achieve the best sample diversity scores. This scenario demonstrates the benefits of using a single unified metric for model selection, for which there is only a single best model.

Text-conditioned cGANs. Table 4 shows FJD and FID scores for three state-of-the-art text-conditioned models trained on the Caltech-UCSD Birds 200 dataset (CUB-200) (Welinder et al., 2010) at resolution: HDGan (Zhang et al., 2018b), StackGAN++ (Zhang et al., 2018), and AttnGAN (Xu et al., 2018). Conditional consistency is evaluated using visual-semantic similarity, as proposed by Zhang et al. (2018b). Conditioning from the test set captions is used to generate k images, and the same test set is also used as the reference distribution. We use pre-computed Char-CNN-RNN sentence embeddings as the conditioning embedding for FJD, since they are commonly used with CUB-200 and are readily available.

In this case we find that AttnGAN dominates in terms of conditional consistency compared to HDGan and StackGAN++, while all models are comparable in terms of diversity. AttnGAN is ranked best overall by FJD. In cases where the biggest differentiator between the models is image quality, FID and FJD will provide a consistent ranking as we see here. In cases where the trade-off is more subtle we believe practitioners will opt for a metric that measurably captures intra-conditioning diversity.

## 7 Conclusions

In this paper we introduce Fréchet Joint Distance (FJD), which is able to assess image quality, conditional consistency, and intra-conditioning diversity within a single metric. We compare FJD to FID on the synthetic dSprite-textures dataset, validating its ability to capture the three properties of interest across different types of conditioning, and highlighting its potential to be adopted as a unified cGAN benchmarking metric. We also demonstrate how FJD can be used to address the potentially ambiguous trade-off between image quality and sample diversity when performing model selection. Looking forward, FJD could serve as a valuable metric to ground future research, as it has the potential to help elucidate the most promising contributions within the scope of conditional generation.

#### Acknowledgments

The authors would like to thank Alaa El-Nouby, Nicolas Ballas, Lluis Castrejon, Mohamed Ishmael Belghazi, Nissan Pow, Mido Assran, Anton Bakhtin, and Vinakayk Tantia for insightful and entertaining discussions.

#### Changelog

v1 Initial arXiv release.
v2 Use dedicated embedding for each conditioning type. Add evaluation of text conditioned models. Add measure of diversity with LPIPS. Minor wording changes.
v3 Add link to code release. Add example of using FJD for hyperparameter tuning to appendix. Minor wording changes.

## Appendix A Illustration of FID and FJD on two dimensional Gaussian data

In this section, we illustrate the claim made in Section 1 that FID cannot capture intra-conditioning diversity when the joint distribution of two variables changes but the marginal distribution of one of them is not altered.

Consider two multivariate Gaussian distributions, and , where

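$$\Sigma_1 = \begin{bmatrix} 4 & 2 \\ 2 & 2 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 2.1 & 2 \\ 2 & 2 \end{bmatrix}.$$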

Figure 5 (left) shows samples drawn from each of these distributions, labeled as Dist1 and Dist2, respectively. While the joint distributions of and are different from each other, the marginal distributions and are the same ( and ). Figure 5 (center) shows the histograms of the two marginal distributions computed from samples.

If we let take the role of the embedding of the conditioning variables (e.g., position) and take the role of the embedding of the generated variables (i.e., images), then computing FID in this example would correspond to computing the FD between and , which is zero. On the other hand, computing FJD would correspond to the FD between and , which equals . But note that Dist1 and Dist2 have different degrees of intra-conditioning diversity, as illustrated by Figure 5 (right), where two histograms of are displayed, showing marked differences to each other (similar plots can be constructed for other values of ). Therefore, this example illustrates a situation in which FID is unable to capture changes in intra-conditioning diversity, while FJD is able to do so.
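The comparison in this appendix can be checked numerically from the two covariance matrices alone. The sketch below assumes that both distributions share the same mean (taken to be zero here, which is an assumption of the sketch), and treats the second coordinate, whose variance is 2 in both matrices, as the one with the unchanged marginal. The function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(mu, sigma, mu_hat, sigma_hat):
    """Squared Fréchet distance between two Gaussians (eigenvalue form)."""
    mu = np.asarray(mu, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    sigma_hat = np.atleast_2d(np.asarray(sigma_hat, dtype=float))
    eigvals = np.linalg.eigvals(sigma @ sigma_hat)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu - mu_hat) ** 2)
    return float(mean_term + np.trace(sigma) + np.trace(sigma_hat) - 2.0 * tr_sqrt)


sigma_1 = np.array([[4.0, 2.0], [2.0, 2.0]])
sigma_2 = np.array([[2.1, 2.0], [2.0, 2.0]])
shared_mean = np.zeros(2)  # assumption: identical means for both distributions

# Distance between the marginals of the second coordinate (variance 2 in both).
fd_marginal = frechet_distance([0.0], sigma_1[1:, 1:], [0.0], sigma_2[1:, 1:])

# Distance between the full joint distributions.
fd_joint = frechet_distance(shared_mean, sigma_1, shared_mean, sigma_2)

print(f"marginal FD = {fd_marginal:.3f}, joint FD = {fd_joint:.3f}")
```

Under these assumptions the marginal distance is zero while the joint distance comes out at roughly 0.68, which matches the qualitative point of the example: a metric that only sees the unchanged marginal cannot register the difference between the two joint distributions.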

## Appendix B Experimental Settings for Calculating FJD

Important details pertaining to the computation of the FID and FJD metrics for different experiments included in this paper are reported in Table 5. For each dataset we report which conditioning modality was used, as well as the conditioning embedding function. Information about which split and image resolution are used for the reference and generated distributions is also included, as well as how many samples were generated per conditioning. Values for $\alpha$ reported here are calculated according to the balancing mechanism recommended in Section 4.2. Dataset splits marked by “-” indicate that the distribution is a randomly sampled subset of the full dataset.

## Appendix C Image Quality Evaluation on COCO-Stuff Dataset

We repeat the experiment initially conducted in Section 5.2 on a real world dataset to see how well FJD tracks image quality. Specifically, we use the COCO-Stuff dataset (Caesar et al., 2018), which provides class labels, bounding box annotations, and segmentation masks. We follow the same experimental procedure as outlined in Section 5.2: Gaussian noise is drawn from and added to the images, where and pixel values are normalized (and clipped after noise addition) to the range . The original dataset of clean images is used as the reference distribution, while noisy images are used to simulate a generated distribution with poor image quality. For the purposes of calculating FJD, we use N-hot encoding to embed the labels of the classes present in each image, and autoencoder representations for the bounding box and mask labels. As shown in Figure 6, FID and FJD both track image quality well, increasing as more noise is added to the generated image distribution. In this case we observe a more drastic increase of FJD over FID as image quality decreases, which is likely due to the decrease in perceived conditional consistency as objects within the image become unrecognizable due to the addition of noise.

## Appendix D Conditional Consistency Evaluation with Text Conditioning

In order to test the effectiveness of FJD at detecting conditional inconsistencies in the text domain, we use the Caltech-UCSD Birds 200 dataset (Welinder et al., 2010). This dataset is a common benchmark for text-conditioned image generation models, containing 200 fine-grained bird categories, 11,788 images, and 10 descriptive captions per image. Also included in the dataset are vectors of detailed binary annotations describing the attributes of the bird in each image. Each annotation indicates the presence or absence of specific features, such as has\_bill\_shape::curved or has\_wing\_color::blue.

Our goal in this experiment is to swap captions between images, and in this fashion introduce inconsistencies between images and their paired captions, while preserving the marginal distributions of images and labels. We compare attribute vectors belonging to each image using the Hamming distance to get an indication of how well the captions belonging to one image might describe another. Small Hamming distances indicate a good match between image and caption, while at larger values the captions appear to describe a very different bird than what is pictured (as demonstrated in Figure 7).

To test FJD we create two datasets: one which contains the original image-caption pairs from CUB-200 to act as the reference distribution, and another in which captions have been swapped to act as a generated distribution that has poor conditional consistency. Char-CNN-RNN embeddings are used to encode the captions for the purposes of calculating FJD. In Figure 8 we observe that as the average Hamming distance across captions increases (i.e., the captions become worse at describing their associated images), FJD also increases. FID, which is unable to detect these inconsistencies, remains constant throughout.
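The quantity on the horizontal axis of this experiment, the average Hamming distance induced by a caption swap, can be sketched as below. The helper name `mean_hamming_after_swap` and the toy attribute matrix are hypothetical; the swap is modelled as a permutation of image indices, which keeps the marginal distributions of images and captions unchanged.

```python
import numpy as np


def mean_hamming_after_swap(attributes, perm):
    """Average Hamming distance between each image's binary attribute vector
    and the attribute vector of the image whose captions it now carries."""
    a = attributes.astype(bool)
    # XOR marks the attributes on which the two birds disagree.
    return float((a ^ a[perm]).sum(axis=1).mean())


rng = np.random.default_rng(0)
attributes = rng.integers(0, 2, size=(100, 16))  # toy binary attribute vectors
identity = np.arange(len(attributes))            # original pairing
shuffled = rng.permutation(len(attributes))      # captions swapped at random

print(mean_hamming_after_swap(attributes, identity))  # 0.0: captions unchanged
print(mean_hamming_after_swap(attributes, shuffled))
```

Choosing permutations that pair images at progressively larger attribute distances would produce the increasingly inconsistent datasets described above.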

## Appendix E List of Sources of Pre-trained Models

Table 6 includes the hyperlinks to all of the pretrained conditional generation models used in our experiments in Section 6.

## Appendix F Effect of α Parameter

The parameter α in the FJD equation acts as a weighting factor indicating the importance of the image component versus the conditional component. When α = 0, FJD is equal to FID, since we only care about the image component. As the value of α increases, the magnitude of the conditional component’s contribution to the value of FJD increases as well. In our experiments, we attempt to find a neutral value for α that will balance the contribution from the conditional component and the image component. This balancing is done by finding the value of α that would result in equal magnitude between the image and conditioning embeddings (as measured by the average L2 norm of the embedding vectors).
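A minimal sketch of this balancing rule, assuming the embeddings are available as NumPy arrays with one row per sample (the function name `balanced_alpha` and the toy data are hypothetical):

```python
import numpy as np


def balanced_alpha(image_emb, cond_emb):
    """Return alpha such that alpha * cond_emb has the same average L2 norm
    as image_emb."""
    image_norm = np.linalg.norm(image_emb, axis=1).mean()
    cond_norm = np.linalg.norm(cond_emb, axis=1).mean()
    return float(image_norm / cond_norm)


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(500, 64))      # stand-in for image embeddings
cond_emb = 0.1 * rng.normal(size=(500, 8))  # stand-in for conditioning embeddings
alpha = balanced_alpha(image_emb, cond_emb)
```

Because the average norm scales linearly with a positive multiplier, the ratio of the two average norms is exactly the value that equalizes them.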

Instead of reporting FJD at a single α, an alternative approach is to calculate and plot FJD for a range of α values, as shown in Figure 9. Plotting α versus FJD allows us to observe any change in rank of models as the importance weighting on the conditional component is increased. Here we use the truncation trick to evaluate BigGAN (Brock et al., 2019) at several different truncation values. The truncation trick is a technique wherein the noise vector used to condition a GAN is scaled by the truncation value in order to trade sample diversity for image quality and conditional consistency, without needing to retrain the model (as shown in Table 7).
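The sweep over α can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released implementation: it assumes FJD is the Fréchet distance between Gaussians fitted to the concatenation of image embeddings and α-scaled conditioning embeddings, and the function names and toy data are hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to the rows of x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_x @ cov_y, which are real and non-negative up to numerical error.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def fjd(ref_img, ref_cond, gen_img, gen_cond, alpha):
    """Fréchet distance on joint [image, alpha * conditioning] embeddings."""
    ref = np.concatenate([ref_img, alpha * ref_cond], axis=1)
    gen = np.concatenate([gen_img, alpha * gen_cond], axis=1)
    return frechet_distance(ref, gen)


rng = np.random.default_rng(0)
n = 2000
ref_img = rng.normal(size=(n, 4))
ref_cond = ref_img[:, :2] + 0.1 * rng.normal(size=(n, 2))  # tied to the image
gen_img = rng.normal(size=(n, 4))
gen_cond = rng.permutation(ref_cond)  # same marginal, pairing with images broken

for alpha in (0.0, 0.5, 1.0, 2.0):
    print(alpha, round(fjd(ref_img, ref_cond, gen_img, gen_cond, alpha), 3))
```

In this toy setup the image and conditioning marginals match, so the score at α = 0 stays near zero, while larger α exposes the broken image-conditioning pairing.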

We find that in several cases, the ranking of models changes when comparing them at α = 0 (equivalent to FID), versus comparing them using FJD at higher α values. Models with low truncation values initially achieve good performance when α is also low. However, as α increases, these models rapidly drop in rank due to lack of sample diversity, and instead models with higher truncation values are favoured. This is most obvious when comparing the blue and yellow lines in Figure 9.

## Appendix G Autoencoder Architecture

To create embeddings for the bounding box and mask conditionings evaluated in this paper we utilize a variant of the Regularized AutoEncoder with Spectral Normalization (RAE-SN) introduced by Ghosh et al. (2019) and enhance it with residual connections (Tables 9 and 9). For better reconstruction quality, we replace the strided convolution and transposed convolution with average pooling and nearest neighbour upsampling, respectively. Spectral normalization (Miyato et al., 2018) is applied to all linear and convolution layers in the decoder, and an L2 penalty is applied to the latent representation during training. Hyperparameters such as the weighting factor on the penalty and the number of dimensions in the latent space are selected based on which combination produces the best reconstructions on a held-out validation set.

In Tables 11 and 11 we depict the architecture for an autoencoder at a fixed input resolution, but this can be scaled up or down by adding or removing residual blocks as required. The architecture is parameterized by a channel multiplier, which is used to control the capacity of the model, by the number of dimensions in the latent representation, and by the number of classes in the bounding box or mask representation.

## Appendix H FJD for Model Selection and Hyperparameter Tuning

In order to demonstrate the utility of FJD for the purposes of model selection and hyperparameter tuning, we consider the loss function of the generator from an auxiliary classifier GAN (ACGAN) (Odena et al., 2017), as shown in Equations 7 to 9. Here S indicates the data source, and C indicates the class label.

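$$
\begin{aligned}
L_S &= \mathbb{E}\left[\log P(S = \mathrm{real} \mid X_{\mathrm{real}})\right] + \mathbb{E}\left[\log P(S = \mathrm{fake} \mid X_{\mathrm{fake}})\right] && (7) \\
L_C &= \mathbb{E}\left[\log P(C = c \mid X_{\mathrm{real}})\right] + \mathbb{E}\left[\log P(C = c \mid X_{\mathrm{fake}})\right] && (8) \\
L_G &= \lambda L_C - L_S && (9)
\end{aligned}
$$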

The generator loss $L_G$ is maximized during training, and consists of two components: an adversarial component $L_S$, which encourages generated samples to look like real samples, and a classification component $L_C$, which encourages samples to look more like their target class. In this experiment we add a weighting parameter λ, which weights the importance of the conditional component of the generator loss. The original formulation of ACGAN is equivalent to always setting λ = 1, but it is unknown whether this is the most suitable setting as it is never formally tested. To this end, we train models on the MNIST dataset and perform a sweep over the λ parameter, training a single model for each value tested. Each model is evaluated using FID, FJD, and classification accuracy to indicate conditional consistency. For FID and FJD we use the training set as the reference distribution, and generate samples for the generated distribution. Classification accuracy is measured using a pretrained LeNet classifier (LeCun et al., 1998), where the conditioning label is used as the ground truth.
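As a numerical illustration of Equations 7 to 9, the weighted objective can be evaluated from discriminator outputs as below. This is a sketch under assumptions: the function and argument names are hypothetical, expectations are replaced by batch means, and the inputs are taken to be probabilities the discriminator assigns to the correct source and to the target class.

```python
import numpy as np


def acgan_generator_objective(p_src_real, p_src_fake, p_cls_real, p_cls_fake, lam=1.0):
    """Weighted ACGAN generator objective, lam * L_C - L_S.

    p_src_real: P(S = real | X_real) for a batch of real images
    p_src_fake: P(S = fake | X_fake) for a batch of generated images
    p_cls_real: P(C = c | X_real), probability of the true class
    p_cls_fake: P(C = c | X_fake), probability of the conditioning class
    """
    l_s = np.mean(np.log(p_src_real)) + np.mean(np.log(p_src_fake))  # Eq. 7
    l_c = np.mean(np.log(p_cls_real)) + np.mean(np.log(p_cls_fake))  # Eq. 8
    return float(lam * l_c - l_s)                                     # Eq. 9


half = np.full(16, 0.5)
# lam = 1 recovers the original ACGAN weighting; lam = 0 drops the class term.
original = acgan_generator_objective(half, half, half, half, lam=1.0)
adversarial_only = acgan_generator_objective(half, half, half, half, lam=0.0)
```

Sweeping `lam` and retraining for each value is what produces the family of models compared in this appendix.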

Scores from the best performing models as indicated by FID, FJD, and classification accuracy are shown in Table 12. Sample sheets are provided in Figure 10, where each column is conditioned on a different digit from 0 to 9. We find that the model with the best FID (Figure 9(a)) has good image quality, but almost no conditional consistency. The model with the best accuracy (Figure 9(c)) has good conditional consistency, but limited image quality. Finally, the model with the best FJD (Figure 9(b)) demonstrates a balance between image quality and conditional consistency. These results demonstrate the importance of considering both image quality and conditional consistency simultaneously when performing hyperparameter tuning.

## Appendix I Training and Evaluating with Multi-label, Bounding Box, and Mask Conditioning on COCO-Stuff

To demonstrate FJD applied to multi-label, bounding box, and mask conditioning on a real-world dataset, we train a GAN on the COCO-Stuff dataset (Caesar et al., 2018). To this end, we train three generative models, one for each conditioning type. Following Johnson et al. (2018), we select only images containing between 3 and 8 objects, and also ignore any objects that occupy less than 2% of the total image area. Two image resolutions are considered: and . We adopt a BigGAN-style model (Brock et al., 2019), but modify the design such that a single fixed architecture can be trained with any of the three conditioning types. See Section I.1 for architectural details. We train each model 5 times, with different random seeds, and report mean and standard deviation of both FID and FJD in Table 13. N-hot encoding is used as the embedding function for the multi-label conditioning, while autoencoder representations are used to calculate FJD for bounding box and mask conditioning.
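The image selection rule and the N-hot embedding can be sketched in a few lines of plain Python. The helper names and the toy annotation values are hypothetical, and the sketch assumes that small objects are dropped before the object count is checked, which the text does not specify.

```python
def keep_image(object_areas, image_area, min_fraction=0.02, min_objects=3, max_objects=8):
    """Return indices of objects to keep, or None if the image is skipped.

    Assumption: objects below 2% of the image area are dropped first, and the
    3-to-8 object count is then checked on what remains.
    """
    kept = [i for i, area in enumerate(object_areas) if area / image_area >= min_fraction]
    if min_objects <= len(kept) <= max_objects:
        return kept
    return None


def n_hot(labels, num_classes):
    """Binary vector with a 1 for every class present in the image."""
    vec = [0] * num_classes
    for label in set(labels):
        vec[label] = 1
    return vec


areas = [5000, 300, 2500, 1200, 40]  # pixel areas of the annotated objects
labels = [2, 7, 2, 5, 9]             # class index of each object
kept = keep_image(areas, image_area=100 * 100)
embedding = n_hot([labels[i] for i in kept], num_classes=10)
```

Repeated classes collapse to a single 1 in the N-hot vector, so it records which classes are present but not how many instances of each.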

In most cases we find that FID values are very close between conditioning types. A similar trend is observed in FJD at the resolution. For models trained at resolution however, we notice a more drastic change in FJD between conditioning types. Mask conditioning achieves the lowest FJD score, followed by multi-label conditioning and bounding box conditioning. This could indicate that the mask conditioning models are more conditionally consistent (or diverse) compared to other conditioning types.

### I.1 COCO-Stuff GAN Architecture

In order to modify BigGAN (Brock et al., 2019) to work with multiple types of conditioning we make two major changes. The first change occurs in the generator, where we replace the conditional batch normalization layers with SPADE (Park et al., 2019). This substitution allows the generator to receive spatial conditioning such as bounding boxes or masks. In the case of class conditioning with a spatially tiled class vector, SPADE behaves similarly to conditional batch normalization. The second change we make is to the discriminator. The original BigGAN implementation utilizes a single projection layer (Miyato and Koyama, 2018) in order to provide class-conditional information to the discriminator. To extend this functionality to bounding box and mask conditioning, we add additional projection layers after each ResBlock in the discriminator. The input to each projection layer is a downsampled version of the conditioning that has been resized using nearest neighbour interpolation to match the spatial resolution of each layer. In this way we provide conditioning information at a range of resolutions, allowing the discriminator to use whichever is most useful for the type of conditioning it has received. Aside from these specified changes, and using smaller batch sizes, models are trained with the same hyperparameters and training scheme as specified by Brock et al. (2019).
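The resizing of the conditioning for the per-block projection layers can be illustrated with a small NumPy sketch. Only the nearest neighbour downsampling step is shown, not the projection layers themselves; the function names, the 64-pixel toy mask, and the list of resolutions are assumptions for the example.

```python
import numpy as np


def nearest_downsample(cond, out_h, out_w):
    """Nearest neighbour resize of a (C, H, W) conditioning map."""
    h, w = cond.shape[-2:]
    rows = (np.arange(out_h) * h) // out_h  # source row for each output row
    cols = (np.arange(out_w) * w) // out_w  # source column for each output column
    return cond[..., rows[:, None], cols[None, :]]


def conditioning_pyramid(cond, resolutions):
    """One resized copy of the conditioning per discriminator block."""
    return [nearest_downsample(cond, r, r) for r in resolutions]


mask = np.zeros((3, 64, 64), dtype=np.float32)  # toy one-hot mask, 3 classes
mask[0, :, :32] = 1.0  # class 0 fills the left half
mask[1, :, 32:] = 1.0  # class 1 fills the right half
pyramid = conditioning_pyramid(mask, resolutions=(32, 16, 8, 4))
```

Nearest neighbour interpolation keeps the resized maps one-hot, which would not hold for bilinear resizing of label masks.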

### I.2 Samples of Generated Images

In this section, we present some random samples of conditional generation for the models covered in Section I. In particular, Figures 11 to 13 show class, bounding box, and mask conditioning samples, respectively. Each row displays a depiction of the conditioning, followed by 4 different samples, and finally the real image corresponding to the conditioning. As shown in Figure 11, conditioning on classes leads to variable samples with respect to object positions, scales and textures. As we increase the conditioning strength, we reduce the freedom of the generation and hence, in Figure 12, we observe how the variability starts appearing in more subtle regions. Similarly, in Figure 13, taking different samples per conditioning only changes the textures. Although the degree of variability decreases as the conditioning strength increases, we obtain sharper, better looking images.

### Footnotes

1. FID compares image distributions and, as such, should be able to roughly capture the intra-conditioning diversity. Since it cares about the image marginal distribution exclusively, it fails to capture intra-conditioning diversity when changes only affect the image-conditioning joint distribution. See Appendix A.
2. We refer the reader to (Borji, 2018) for a detailed overview and insightful discussion of existing metrics.
3. In the initial stages of this project, we also explored methods to bypass this additional training step by projecting a visual representation of bounding box or mask conditioning into an Inceptionv3 embedding space. However, the Inceptionv3 embedding may not properly capture object positions as it is trained to classify, discarding precise spatial information. Therefore, we consider autoencoders (AE) to be better suited to our setup since they are trained to recover both object appearance and spatial information from the embedded representation.
4. Architectural details for autoencoders used in this paper can be found in Appendix G.
5. Note that for real datasets, intra-conditioning diversity is most often reduced as the strength of conditioning increases (e.g., mask conditionings usually present a single image instantiation, presenting no diversity).
6. A list of pre-trained models used in these evaluations can be found in Appendix E.

### References

1. Augmented CycleGAN: learning many-to-many mappings from unpaired data. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 195–204. External Links: Link Cited by: §1, §2.
2. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 214–223. External Links: Link Cited by: §2.
3. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Cited by: §2.
4. Demystifying MMD GANs. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
5. Pros and cons of GAN evaluation measures. CoRR abs/1802.03446. Cited by: §1, footnote 2.
6. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: Appendix F, §I.1, Appendix I, §1, §2, §2, §6.
7. COCO-stuff: thing and stuff classes in context. In CVPR, pp. 1209–1218. Cited by: Appendix C, Appendix I.
8. Tell, draw, and repeat: generating and modifying images based on continual linguistic instruction. CoRR abs/1811.09845. Cited by: §2.
9. From variational to deterministic autoencoders. CoRR abs/1903.12436. External Links: Link, 1903.12436 Cited by: Appendix G, Table 1.
10. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger (Eds.), pp. 2672–2680. External Links: Link Cited by: §1.
11. A kernel two-sample test. J. Mach. Learn. Res. 13 (1), pp. 723–773. External Links: ISSN 1532-4435, Link Cited by: §2.
12. DeLiGAN: generative adversarial networks for diverse and limited data. In Computer Vision and Pattern Recognition, pp. 4941–4949. External Links: ISBN 978-1-5386-0457-1, Link Cited by: §2.
13. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 6626–6637. External Links: Link Cited by: §1, §2, §2, §2.
14. Generating multiple objects at spatially distinct locations. In International Conference on Learning Representations, External Links: Link Cited by: §2.
15. Inferring semantic layout for hierarchical text-to-image synthesis. In Computer Vision and Pattern Recognition, pp. 7986–7994. Cited by: §2.
16. Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §2, §6.
17. Image-to-image translation with conditional adversarial networks. Computer Vision and Pattern Recognition (CVPR). Cited by: §2, §2, §6.
18. Image generation from scene graphs. In CVPR, pp. 1219–1228. Cited by: Appendix I.
19. Gang of gans: generative adversarial networks with maximum margin ranking. CoRR abs/1704.04865. External Links: Link, 1704.04865 Cited by: §2.
20. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
21. Auto-encoding variational bayes. In ICLR, Cited by: §1.
22. Improved precision and recall metric for assessing generative models. arXiv preprint arXiv:1904.06991. Cited by: §1.
23. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Appendix H.
24. Testing statistical hypotheses. Third edition, Springer Texts in Statistics, Springer. External Links: ISBN 0-387-98864-5, MathReview Cited by: §2.
25. Mode seeking generative adversarial networks for diverse image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §6.
26. DSprites: disentanglement testing sprites dataset. Note: https://github.com/deepmind/dsprites-dataset/ Cited by: §1, §5.1.
27. Conditional generative adversarial nets. CoRR abs/1411.1784. Cited by: §1, §2.
28. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, External Links: Link Cited by: Appendix G, §2, §2, §6.
29. CGANs with projection discriminator. In International Conference on Learning Representations, External Links: Link Cited by: §I.1, §2, §2, §2, §6.
30. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 2642–2651. External Links: Link Cited by: Appendix H, §2, §2.
31. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §2.
32. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I.1, §1, §2.
33. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, Cited by: §2.
34. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887. Cited by: §1.
35. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1060–1069. External Links: Link Cited by: §1, §2.
36. Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: Table 1.
37. Improved techniques for training gans. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett (Eds.), pp. 2234–2242. External Links: Link Cited by: §1, §2.
38. Inverse cooking: recipe generation from food images. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
39. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 3295–3301. Cited by: §1.
40. ChatPainter: improving text to image generation using dialogue. CoRR abs/1802.08216. Cited by: §2.
41. How good is my gan?. CoRR abs/1807.09499. External Links: Link, 1807.09499 Cited by: §1.
42. Learning to generate images with perceptual similarity metrics. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017, pp. 4277–4281. Cited by: §2.
43. Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 7551–7563. External Links: Link Cited by: §1.
44. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826. Cited by: §3, Table 1.
45. A note on the evaluation of generative models. In International Conference on Learning Representations, Cited by: §1, §2.
46. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany. Cited by: §6.
47. WaveNet: a generative model for raw audio. In Arxiv, External Links: Link Cited by: §1.
48. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett (Eds.), pp. 4790–4798. External Links: Link Cited by: §1.
49. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1747–1756. External Links: Link Cited by: §1.
50. Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §2.
51. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett (Eds.), pp. 613–621. External Links: Link Cited by: §1.
52. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
53. Image quality assessment: from error visibility to structural similarity. IEEE TRANSACTIONS ON IMAGE PROCESSING 13 (4), pp. 600–612. Cited by: §2.
54. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: Appendix D, §6.
55. On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971. Cited by: §2.
56. Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324. Cited by: §2, §6.
57. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024. Cited by: §1, §6.
58. LR-GAN: layered recursive generative adversarial networks for image generation. In International Conference on Learning Representations, Cited by: §2.
59. StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2, §2, §6.
60. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, Cited by: §2.
61. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §2.
62. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6199–6208. Cited by: §6.
63. Image generation from layout. In Computer Vision and Pattern Recognition, Cited by: §1, §2.
64. Hype: human eye perceptual evaluation of generative models. arXiv preprint arXiv:1904.01121. Cited by: §1, §2.
65. Activation maximization generative adversarial nets. In International Conference on Learning Representations, External Links: Link Cited by: §2.
66. Generative visual manipulation on the natural image manifold. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V, pp. 597–613. Cited by: §6.
67. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, Cited by: §2.
68. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, Cited by: §2, §6.