Unsupervised Discovery of Interpretable Directions in the GAN Latent Space
The latent spaces of typical GAN models often have semantically meaningful directions. Moving in these directions corresponds to human-interpretable image transformations, such as zooming or recoloring, enabling a more controllable generation process. However, the discovery of such directions is currently performed in a supervised manner, requiring human labels, pretrained models, or some form of self-supervision. These requirements can severely limit the range of directions that existing approaches can discover.
In this paper, we introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model. By a simple model-agnostic procedure, we find directions corresponding to sensible semantic manipulations without any form of (self-)supervision. Furthermore, we reveal several non-trivial findings, which would be difficult to obtain by existing methods, e.g., a direction corresponding to background removal. As an immediate practical benefit of our work, we show how to exploit this finding to achieve a new state-of-the-art for the problem of saliency detection. The implementation of our method is available online.
Nowadays, generative adversarial networks (GANs) Goodfellow et al. (2014) have become a leading paradigm of generative modeling in the computer vision domain. The state-of-the-art GANs Brock et al. (2019); Karras et al. (2019) are currently able to produce good-looking high-resolution images that are often indistinguishable from real ones. The exceptional generation quality paves the road to ubiquitous usage of GANs in applications, e.g., image editing Isola et al. (2017); Zhu et al. (2017), super-resolution Ledig et al. (2017), video generation Wang et al. (2018), and many others.
However, in most practical applications, GAN models are typically used as black-box instruments without complete understanding of the underlying generation process. While several recent papers Bau et al. (2019); Voynov and Babenko (2019); Yang et al. (2019); Karras et al. (2019); Jahanian et al. (2020); Plumerault et al. (2020) address the interpretability of GANs, this research area is still in its preliminary stage.
An active line of study on GAN interpretability investigates the structure of their latent spaces. Namely, several works Jahanian et al. (2020); Plumerault et al. (2020); Goetschalckx et al. (2019); Shen et al. (2019) aim to identify semantically meaningful directions, i.e., directions corresponding to human-interpretable image transformations. Prior works have provided enough evidence that a wide range of such directions exists. Some of them induce domain-agnostic transformations, like zooming or translation Jahanian et al. (2020); Plumerault et al. (2020), while others correspond to domain-specific transformations, e.g., adding a smile or glasses to face images Radford et al. (2015).
While the discovery of interpretable directions has already been addressed by prior works, all these works require some form of supervision. For instance, Shen et al. (2019); Goetschalckx et al. (2019); Karras et al. (2019) require explicit human labeling or pretrained supervised models, which can be expensive or even impossible to obtain. Recent works Jahanian et al. (2020); Plumerault et al. (2020) employ self-supervised approaches, but are limited to directions corresponding to simple transformations achievable by automatic data augmentation. Moreover, all the existing works identify only the directions we “anticipate to discover” and cannot reveal unexpected ones.
In this paper, we propose a completely unsupervised approach to discover the interpretable directions in the latent space of a pretrained generator. In a nutshell, the approach seeks a set of directions corresponding to diverse image transformations, i.e., it is easy to distinguish one transformation from another. Intuitively, under such formulation, the learning process aims to find the directions corresponding to the independent factors of variation in the generated images. For several generators, we observe that many of the obtained directions are human-interpretable, see Figure 1.
As another significant contribution, our approach discovers new practically important directions, which would be difficult to obtain with existing techniques. For instance, we discover the direction corresponding to the background removal, see Figure 1. In the experimental section, we exploit it to generate high-quality synthetic data for saliency detection and achieve a new state-of-the-art for this problem in a weakly-supervised scenario. We expect that exploitation of other directions can also benefit other computer vision tasks in the unsupervised and weakly-supervised niches.
As our main contributions we highlight the following:
We propose the first unsupervised approach for the discovery of semantically meaningful directions in the GAN latent space. The approach is model-agnostic and does not require costly generator re-training.
For several common generators, we managed to identify non-trivial and practically important directions. The existing methods from prior works are not able to identify them without expensive supervision.
We provide an example of immediate practical benefit from our work. Namely, we show how to exploit the background removal direction to achieve a new state-of-the-art for weakly-supervised saliency detection.
2 Related work
In this section, we describe the relevant research areas and explain the scientific context of our study.
Generative adversarial networks Goodfellow et al. (2014) currently dominate the generative modeling field. In essence, GANs consist of two networks, a generator and a discriminator, which are trained jointly in an adversarial manner. The role of the generator is to map samples from the latent space, distributed according to a standard Gaussian distribution, to the image space. The discriminator aims to distinguish the generated images from the real ones. A more complete understanding of the latent space structure is an important research problem, as it would make the generation process more controllable.
Interpretable directions in the latent space. Since the appearance of earlier GAN models, it has been known that the GAN latent space often possesses semantically meaningful vector space arithmetic, e.g., there are directions corresponding to adding smiles or glasses for face image data Radford et al. (2015). Since exploitation of these directions would make image editing more straightforward, the discovery of such directions currently receives much research attention. A line of recent works Goetschalckx et al. (2019); Shen et al. (2019); Karras et al. (2019) employs explicit human-provided supervision to identify interpretable directions in the latent space. For instance, Shen et al. (2019); Karras et al. (2019) use classifiers pretrained on the CelebA dataset Liu et al. (2015) to predict certain face attributes. These classifiers are then used to produce pseudo-labels for the generated images and their latent codes. Based on these pseudo-labels, a separating hyperplane is constructed in the latent space, and a normal to this hyperplane becomes a direction that captures the corresponding attribute. Another work Goetschalckx et al. (2019) solves an optimization problem in the latent space that maximizes the score of a pretrained model predicting image memorability. Thus, the result of the optimization is a direction corresponding to an increase in memorability. The crucial weakness of the supervised approaches above is their need for human labels or pretrained models, which can be expensive to obtain. Two recent works Jahanian et al. (2020); Plumerault et al. (2020) employ self-supervised approaches and seek the vectors in the latent space that correspond to simple image augmentations such as zooming or translation. While these approaches do not require supervision, they can be used to find only the directions capturing simple transformations that can be obtained automatically.
Overall, all existing approaches are able to discover only those directions that researchers expect to identify. In contrast, our unsupervised approach often identifies surprising directions corresponding to non-trivial image manipulations.
Before a formal description of our method, we explain its underlying motivation with a simple example in Figure 3. Figure 3 (top) shows the transformations of an original image (in a red frame) obtained by moving in two random directions of the latent space for the Spectral Norm GAN model Miyato et al. (2018) trained on the MNIST dataset LeCun (1989). As one can see, moving in a random direction typically affects several factors of variation at once, and different directions “interfere” with each other. This makes it difficult to interpret these directions or to use them for semantic manipulations in image editing.
The observation above provides the main intuition behind our method. Namely, we aim to learn a set of directions inducing “orthogonal” image transformations that are easy to distinguish from each other. We achieve this by jointly learning a set of directions and a model that distinguishes the corresponding image transformations. High quality of this model implies that the directions do not interfere; hence, we hope, each of them affects only a single factor of variation and is easy to interpret.
The learning protocol is schematically presented in Figure 2. Our goal is to discover the interpretable directions in the latent space of a pretrained GAN generator, which maps samples from the latent space to the image space. The generator is a non-trainable component of our method, and its parameters do not change during learning. Two trainable components of our method are:
A matrix of directions, whose number of rows equals the dimensionality of the latent space of the generator. The number of columns determines the number of directions our method aims to discover. It is a hyperparameter of our method, and we discuss its choice in the next section. In essence, the columns of this matrix correspond to the directions we aim to identify.
A reconstructor, which receives an image pair in which the first image is generated from a latent code, while the second one is generated from a shifted code. The shift is obtained by applying the matrix of directions to an axis-aligned unit vector multiplied by a scalar. In other words, the second image is a transformation of the first one, corresponding to moving by the scalar magnitude in the direction defined by the selected column of the matrix in the latent space. The reconstructor’s goal is to reproduce the shift in the latent space that induces a given image transformation. In more detail, the reconstructor produces two outputs: a prediction of the direction index and a prediction of the shift magnitude. More formally, the reconstructor maps an image pair to a pair consisting of a predicted direction index and a predicted shift magnitude.
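Optimization objective. Learning is performed by minimizing a loss function that is the expected sum of a classification term on the direction index and a weighted regression term on the shift magnitude.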
For the classification term we use the cross-entropy function, and for the regression term we use the mean absolute error. In all our experiments we use a weight coefficient .
As the matrix of directions and the reconstructor are optimized jointly, the minimization process seeks columns of the matrix for which the corresponding image transformations are easier to distinguish from each other, which makes the classification problem for the reconstructor simpler. In the experimental section below, we demonstrate that these “disentangled” directions often appear to be human-interpretable.
The role of the regression term is to force shifts along discovered directions to have a continuous effect, thereby preventing “abrupt” transformations, e.g., mapping all the images to some fixed image. See Figure 4 for an example of a latent direction that maps all the images to a fixed one.
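The objective described above, a cross-entropy term on the predicted direction index plus a weighted mean-absolute-error term on the predicted shift magnitude, can be sketched for one batch as follows. This is a minimal illustration in plain Python rather than the authors' implementation; the function name `reconstructor_loss`, the argument names, and any value passed for the weight `lam` are assumptions introduced here.

```python
import math
from typing import Sequence


def reconstructor_loss(
    logits: Sequence[Sequence[float]],  # per-sample scores over the directions
    true_index: Sequence[int],          # sampled direction index per sample
    pred_shift: Sequence[float],        # predicted shift magnitude per sample
    true_shift: Sequence[float],        # sampled shift magnitude per sample
    lam: float,                         # weight of the regression term (hypothetical value)
) -> float:
    """Mean cross-entropy on the index plus lam times mean absolute error on the shift."""
    n = len(logits)
    cross_entropy = 0.0
    for scores, k in zip(logits, true_index):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        cross_entropy += log_norm - scores[k]
    cross_entropy /= n
    mae = sum(abs(p - t) for p, t in zip(pred_shift, true_shift)) / n
    return cross_entropy + lam * mae
```

Because the classification term depends on the direction matrix only through the generated image pairs, gradients of this loss would reach the matrix through the frozen generator; that part requires an automatic differentiation framework and is omitted from the sketch.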
3.3 Practical details
Here we describe the practical details of the pipeline and explain our design choices.
Reconstructor architecture. For reconstructor models we use the LeNet backbone LeCun et al. (1998) for MNIST and Anime Faces and the ResNet- model He et al. (2016) for Imagenet and CelebA-HQ. In all experiments, the number of input channels is set to six (two for MNIST), since we concatenate the input image pair along the channel dimension. We also use two separate “heads”, predicting a direction index and a shift magnitude, respectively.
Distribution of training samples. The latent codes are always sampled from the normal distribution, as for the original generator. The direction index is sampled from a uniform distribution over the directions. The shift magnitude is sampled from a uniform distribution. We also found a minor advantage in forcing the shift magnitude to be separated from zero, as too small shifts barely affect the generated image. Thus, in practice, after sampling we adjust the magnitude so that its absolute value is bounded away from zero. We did not observe any difference from using other distributions for the shift magnitude.
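The sampling of a direction index and a shift magnitude described above might look like the sketch below. The symmetric range, the clamping rule (keep the sign, raise the absolute value to a minimum), and the names `sample_shift`, `max_shift`, and `min_shift` are assumptions made for illustration; the concrete constants are not recoverable from the text and are left as parameters.

```python
import random
from typing import Tuple


def sample_shift(
    num_directions: int,
    max_shift: float,   # hypothetical bound of the uniform range [-max_shift, max_shift]
    min_shift: float,   # hypothetical smallest allowed absolute shift
    rng: random.Random,
) -> Tuple[int, float]:
    """Draw a uniform direction index and a signed shift kept away from zero."""
    k = rng.randrange(num_directions)
    eps = rng.uniform(-max_shift, max_shift)
    sign = 1.0 if eps >= 0 else -1.0
    # Shifts that are too small barely change the image, so clamp the magnitude.
    eps = sign * max(abs(eps), min_shift)
    return k, eps
```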
Choice of . The number of directions is set to be equal to the latent space dimensionality for Spectral Norm GAN Miyato et al. (2018) with Anime Faces dataset, BigGAN Brock et al. (2019) and ProgGAN Karras et al. (2018) (which are , and respectively). For Spectral Norm GAN with MNIST dataset, we use , since its latent space is -dimensional, and it is too difficult for our model to obtain so many different interpretable directions for simple digit images.
Choice of the direction matrix. We experimented with three options for the matrix:
a general linear operator;
a linear operator with all matrix columns having unit length;
a linear operator with orthonormal matrix columns.
The first option appeared to be impractical, as during the optimization we frequently observed the columns of the matrix to have very high norms. The reason is that for a constant latent shift of a high norm, most of the samples generated from the shifted codes appear to be almost the same for all latent codes. Thus the classification term in the loss (1) pushes the matrix to have columns of a high norm to simplify the classification task.
In all of the reported experiments, we use a matrix either with unit-length columns (the second option) or with orthonormal columns (the third option). To guarantee that the matrix has unit-norm columns, we divide each column by its length. For the orthonormal case, we parametrize the matrix with a skew-symmetric matrix (that is, a matrix equal to the negative of its transpose) and define the direction matrix as the leading columns of the exponential of this skew-symmetric matrix, as many as there are directions (see, e.g., Fulton and Harris (2013)).
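Both parametrizations can be illustrated on small matrices in plain Python. The sketch below is an assumption-laden illustration, not the authors' code: the helper names are hypothetical, and the matrix exponential is computed with a truncated Taylor series, which is adequate only for small, well-scaled inputs.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize_columns(a: Matrix) -> Matrix:
    """Divide every column by its Euclidean length."""
    rows, cols = len(a), len(a[0])
    norms = [math.sqrt(sum(a[i][j] ** 2 for i in range(rows))) for j in range(cols)]
    return [[a[i][j] / norms[j] for j in range(cols)] for i in range(rows)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)] for row in a]


def matrix_exp(s: Matrix, terms: int = 30) -> Matrix:
    """Truncated Taylor series: exp(S) is approximated by the sum of S^n / n! for n < terms."""
    n = len(s)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms):
        term = [[x / order for x in row] for row in matmul(term, s)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def orthonormal_columns(s: Matrix, k: int) -> Matrix:
    """First k columns of exp(S); they are orthonormal when S is skew-symmetric."""
    return [row[:k] for row in matrix_exp(s)]
```

In a real training loop one would more likely rely on a library routine for the matrix exponential (for example `scipy.linalg.expm`) and let an automatic differentiation framework propagate gradients to the skew-symmetric parameters.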
In experiments, we observe that both options perform well and discover similar sets of interpretable directions. In general, using a matrix with unit-norm columns is more expressive and is able to find more directions. However, on some datasets, the option with orthonormal columns managed to discover more interesting directions, and we discuss the details in the experimental section.
In this section, we thoroughly evaluate our approach on several datasets in terms of both quantitative and qualitative results. In all experiments, we do not exploit any form of external supervision and operate in a completely unsupervised manner.
Datasets and generator models. We experiment with four common datasets and generator architectures:
Optimization. In all the experiments, we use the Adam optimizer to learn both the matrix and the reconstructor . We always train the models with a constant learning rate . We perform gradient steps for ProgGAN and steps for others as the first has a significantly higher latent space dimension. We use a batch size of for Spectral Norm GAN on the MNIST, and Anime Faces datasets, a batch size of for BigGAN, and a batch size of for ProgGAN. All the experiments were performed on the NVIDIA Tesla v100 card.
Evaluation metrics. Since it is challenging to measure interpretability directly, we propose two proxy evaluation measures described below.
Reconstructor Classification Accuracy (RCA). As described in Section 3, the reconstructor aims to predict which direction in the latent space produces a given image transformation. In essence, the reconstructor’s classification “head” solves a multi-class classification problem. Therefore, high RCA values imply that directions are easy to distinguish from each other, i.e., the corresponding image transformations do not “interfere” and influence different factors of variation. While this does not imply interpretability directly, in practice, transformations affecting only a few factors of variation are easier to interpret. RCA allows us to compare the directions obtained with our method with random directions or with directions corresponding to coordinate axes. To obtain RCA values for random or standard coordinate directions, we set the direction matrix to a random or identity matrix and do not optimize it during learning.
Direction Variation Naturalness (DVN). DVN measures how “natural” is a variation of images obtained by moving in a particular direction in the latent space. Intuitively, a natural factor of variation should appear in both real and generated images. Furthermore, if one splits the images based on the large/small values of this factor, the splitting should operate similarly for real and generated data. We formalize this intuition as follows. Let us have a direction in the latent space. Then we can construct a pseudo-labeled dataset for binary classification with . Given this dataset, we train a binary classification model to fit . After that, can induce pseudolabels for the dataset of real images , which results in pseudo-labeled dataset . We expect that if the variation induced by is natural, the corresponding variation factor will be the same for both generated and real data. Thus, we re-train the model from scratch on and compute its accuracy on . The obtained accuracy value is referred to as DVN. In the experiments below, we report the DVN averaged over all directions. Since the directions with higher DVN values are typically more interpretable, we additionally report the average DVN over the top directions ().
While DVN is a proxy for the interpretability of an individual direction, high RCA values indicate that discovered directions are substantially different, so both metrics are important. Therefore, we report RCA and DVN for directions discovered by our method for all datasets. We compare to directions corresponding to coordinate axes and random orthonormal directions in Table 1. Along with quantitative comparison, we also provide the qualitative results for each dataset below.
Qualitative examples of transformations induced by directions obtained with our method are shown on Figure 6. The variations along learned directions are easy to interpret and transform all samples in the same manner for any .
Evolution of directions. In Figure 7 we illustrate how the image variation along a given direction evolves during the optimization process. Namely, we take five snapshots of the direction matrix from different optimization steps, where the first snapshot is the identity transformation and the last one is the final matrix of directions. Here we fix a direction index and a latent code. Each row of Figure 7 shows the images generated with the corresponding snapshot. As one can see, in the course of optimization the direction stops affecting the digit type and “focuses” on thickness.
4.2 Anime Faces
On this dataset, we observed an advantage of an orthonormal direction matrix compared to one with unit-norm columns. We conjecture that the requirement of orthonormality can serve as a regularization, enforcing diversity of directions. However, we do not advocate the usage of an orthonormal matrix for all data, since it did not result in practical benefits for MNIST/CelebA. In Figure 8, we provide examples discovered by our approach.
Since the latent space dimensionality for ProgGAN equals and is remarkably higher compared to other models, we observed that the reconstructor fails to achieve reasonable RCA values with and we set in this experiment. See Figure 9 for examples of discovered directions for ProgGAN. These directions are likely to be useful for face image editing and are challenging to obtain without strong supervision.
Several examples of directions discovered by our method are presented in Figure 10. On this dataset, our method reveals several interesting directions, which can be of significant practical importance. For instance, we discover directions corresponding to background blur and background removal, which can serve as a valuable source of training data for various computer vision tasks, as we show in the following section. For BigGAN we also use an orthonormal direction matrix, since it results in a more diversified set of directions.
5 Weakly-supervised saliency detection
In this section we provide a simple example of practical usage of directions discovered by our method. Namely, we describe a straightforward way to exploit the background removal direction from the BigGAN latent space to achieve a new state-of-the-art for a problem of weakly supervised saliency detection. In a nutshell, this direction can be used to generate high-quality synthetic data for this task. Below we always explicitly specify an Imagenet class passed to the BigGAN generator, i.e., .
Figure 10 shows that is responsible for the background opacity variation. After moving in this direction, the pixels of a foreground object remain unchanged, while the background pixels become white. Thus, for a given BigGAN sample , one can label the white pixels from the corresponding shifted image as background, see Figure 11. Namely, to produce labeling, we compare an average intensity over three color channels for the image to a threshold :
Assuming that intensity values lie in the range , we set .
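The thresholding rule above can be sketched as follows: a pixel of the shifted image whose mean intensity over the color channels reaches the threshold is treated as (white) background, and the remaining pixels as the foreground object. The function name `saliency_mask`, the nested-list image layout, and the strict inequality are assumptions for illustration; the threshold is left as a parameter.

```python
from typing import List, Sequence

# Image layout assumed here: height x width x color channels.
Image = Sequence[Sequence[Sequence[float]]]


def saliency_mask(shifted_image: Image, threshold: float) -> List[List[int]]:
    """Return 1 for foreground pixels (mean intensity below threshold), 0 for background."""
    return [
        [1 if sum(pixel) / len(pixel) < threshold else 0 for pixel in row]
        for row in shifted_image
    ]
```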
Given such synthetic masks, it is possible to train a model that achieves high quality on real data. Let us have an image dataset . Then one can train a binary segmentation model on the samples with classes that frequently appear in images from . While the images of can be unlabeled, we perform the following trick. We take an off-the-shelf pretrained Imagenet classifier (namely, ResNet-18) . For each sample we consider five most probable classes from the prediction . Thus, for each of 1000 ILSVRC classes, we count a number of times it appears in the top-5 prediction over . Then we define a subset of classes as the top most frequent classes. Finally, we form a pseudo-labeled segmentation dataset with the samples with . We exclude samples with mask area below and above of the whole image area. Then we train a segmentation model on these samples and apply it to the real data .
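The class selection and mask filtering steps above can be sketched with the standard library. The helper names, the inclusive area bounds, and the representation of classifier output as lists of five class indices are assumptions made for this illustration; the number of classes to keep and the area limits are left as parameters.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def frequent_classes(top5_predictions: Iterable[Sequence[int]], num_classes: int) -> List[int]:
    """Count how often each class appears in the top-5 predictions and keep the most frequent."""
    counts: Counter = Counter()
    for top5 in top5_predictions:
        counts.update(top5)
    return [cls for cls, _ in counts.most_common(num_classes)]


def mask_area_ok(mask: Sequence[Sequence[int]], min_fraction: float, max_fraction: float) -> bool:
    """Keep a synthetic sample only if its foreground covers a plausible share of the image."""
    total = sum(len(row) for row in mask)
    area = sum(sum(row) for row in mask) / total
    return min_fraction <= area <= max_fraction
```

Generated samples would then be drawn only for the selected classes, and pairs whose mask fails the area check would be discarded before training the segmentation model.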
Note that the only supervision needed for the saliency detection method described above is image-level ILSVRC class labels. Our method does not require any pixel-level or dataset-specific supervision.
5.1 Experiments on the ECSSD dataset
We evaluate the described method on the ECSSD dataset Yan et al. (2013), which is a standard benchmark for weakly-supervised saliency detection. The dataset has separate train and test subsets, and we obtain the subset of classes from the train subset and evaluate on the test subset. For the segmentation model , we take a simple U-net architecture Ronneberger et al. (2015). We train on the pseudo-labeled dataset with Adam optimizer and the per-pixel cross-entropy loss. We perform 15000 steps with the initial rate of and decrease it by every 4000 steps and a batch size equal to 128. During inference, we rescale an input image to have a size 128 along its shorter side.
We measure the model performance in terms of the mean absolute error (MAE), which is an established metric for weakly-supervised saliency detection. For an image and a groundtruth mask , MAE is defined as:
where and are the image sizes. Our method based on BigGAN achieves MAE equal to , which is a new state-of-the-art on ECSSD across the methods using the same amount of supervision Wang et al. (2019) (i.e., image-level class labels from the ILSVRC dataset). To the best of our knowledge, the previous state-of-the-art, achieved by the method with the same source of weak supervision, achieves only MAE equal to . Figure 12 demonstrates several examples of saliency detection, provided by our method.
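For reference, MAE is conventionally computed as the absolute per-pixel difference between the predicted saliency map and the groundtruth mask, averaged over all pixels. The sketch below assumes both are same-sized two-dimensional arrays with values in the range from 0 to 1; the function name is hypothetical.

```python
from typing import Sequence


def mean_absolute_error(
    prediction: Sequence[Sequence[float]],
    mask: Sequence[Sequence[float]],
) -> float:
    """Average of |prediction - mask| over all pixels of a height x width map."""
    height, width = len(mask), len(mask[0])
    total = sum(
        abs(prediction[y][x] - mask[y][x]) for y in range(height) for x in range(width)
    )
    return total / (height * width)
```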
In this paper we have addressed the discovery of interpretable directions in the GAN latent space, which is an important step to an understanding of generative models required for researchers and practitioners. Unlike existing techniques, we have proposed a completely unsupervised method, which can be universally applied to any pretrained generator. On several standard datasets, our method reveals interpretable directions that have never been observed before or require expensive supervision to be identified. Finally, we have shown that one of the revealed directions can be used to generate high-quality synthetic data for the challenging problem of weakly supervised saliency detection. We expect that other interpretable directions can also be used to improve the performance of machine learning in existing computer vision tasks.
- In fact, the exponential map yields operators with determinant 1. This is not a limitation, as the shift multiplier is signed.
- GAN dissection: visualizing and understanding generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR).
- Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- Representation theory: a first course. Vol. 129, Springer Science & Business Media.
- Ganalyze: toward visual definitions of cognitive image properties. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5744–5753.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
- On the "steerability" of generative adversarial networks. Proceedings of the International Conference on Learning Representations (ICLR).
- Towards the high-quality anime characters generation with generative adversarial networks. In Proceedings of the Machine Learning for Creativity and Design Workshop at NIPS.
- Progressive growing of gans for improved quality, stability, and variation. Proceedings of the International Conference on Learning Representations (ICLR).
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690.
- Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
- Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
- Controlling generative models with continuous factors of variations. Proceedings of the International Conference on Learning Representations (ICLR).
- Unsupervised representation learning with deep convolutional generative adversarial networks. Proceedings of the International Conference on Learning Representations (ICLR).
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
- Interpreting the latent space of gans for semantic face editing. arXiv preprint arXiv:1907.10786.
- RPGAN: gans interpretability via random routing. arXiv preprint arXiv:1912.10920.
- Video-to-video synthesis. In Advances in Neural Information Processing Systems, pp. 1144–1156.
- Salient object detection in the deep learning era: an in-depth survey. arXiv preprint arXiv:1904.09146.
- Hierarchical saliency detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1155–1162.
- Semantic hierarchy emerges in deep generative representations for scene synthesis. arXiv preprint arXiv:1911.09267.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.