Variation Network: Learning High-level Attributes for Controlled Input Manipulation
This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling pre-defined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also easily be considered at the same time, which makes this model applicable to a wide variety of tasks. Further, VarNet has a sound information-theoretic interpretation which grants us interpretable means to control how these high-level attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and meaningful.
We focus on the problem of learning to generate variations of a given input in an intended way. Concretely, this means that given some input element x, which can be considered as a template, we want to generate transformed versions of x by modifying only its high-level attributes. The objective is that the link between the original x and its transformed versions is preserved while the difference between their attributes is patent. Such a mechanism can be of great use in many domains: in image editing, it allows images to be modified using more abstract controls, and it can be of crucial importance for creative usages since it allows new content to be generated in a controlled and meaningful way.
In a majority of recently proposed methods tackling this problem, such as [29, 18, 27], the attributes of interest that we want to control are assumed to be given (often as a discrete variable). While these methods are indeed successful in generating meaningful transformations of any input, we can nonetheless identify two shortcomings which can restrict their use: 1) labeled data is not always available or can be costly to obtain; 2) attributes which are hard to specify in an absolute way cannot be considered.
The novelty of our approach resides in the fact that it grants the possibility, within the same framework, of controlling generations by modifying user-specified attributes while also learning meaningful high-level attributes, which can then be used for control.
This problem may seem ill-posed in many respects: firstly, in the case of specified attributes, there is no ground truth for variations, since the dataset contains no input x observed with two different attributes. Secondly, it can be hard to determine whether a learned attribute is relevant. However, we provide empirical evidence that our general approach is capable of learning such relevant attributes and that they can be used for generating meaningful variations.
This paper introduces the Variation Network (VarNet), a probabilistic neural network which provides means to manipulate an input by changing its high-level attributes, which can be either learned or provided. Our model has a sound probabilistic interpretation which makes the variations obtained by changing the attributes statistically meaningful. This architecture is general and provides a wide range of design choices.
Our contributions are the following:
An easy-to-use framework: any encoder-decoder architecture can be easily framed into our framework in order to provide it control over generations, even in the case of unlabeled data,
Information-theoretic interpretation of our model providing a way to control the behavior of the learned attributes.
The plan of this paper is the following: Sect. 2 presents the VarNet architecture together with its training algorithm. For better clarity, we introduce separately all the components featured in our model and postpone the discussion about their interplay and the motivation behind our modeling choices to Sect. 3. We illustrate in Sect. 4 the possibilities offered by our proposed model and show that its faculty to generate variations in an intended way is of particular interest. Finally, Sect. 5 discusses related work. In particular, we show that VarNet provides an interesting solution to many constrained generation problems already considered in the literature while bearing interesting connections with the literature on fair representations [22, 27] and disentangled representations [12, 2, 21].
Our code is available at https://github.com/Ghadjeres/VarNet.
2 Proposed model
We now introduce our novel encoder-decoder architecture, which we name the Variation Network. Our architecture borrows principles from the traditional Variational AutoEncoder (VAE) architecture [16] and from the Wasserstein AutoEncoder (WAE) architecture [28, 26]. It uses an adversarially-learned regularization [7, 18], introduces a decomposition of the latent space into two parts, a template and an attribute, and decomposes the attributes on an adaptive basis [30].
We detail the different parts involved in our model in the following sections. In Sect. 2.1, we focus on the encoder-decoder part of VarNet; in Sect. 2.2, we introduce the adversarially-learned regularization whose aim is to disentangle attributes from templates; and in Sect. 2.3, we discuss the special parametrization that we adopted for the space of attributes.
2.1 Encoder-decoder part
We consider a dataset of labeled elements (x, m), where x belongs to the input space and m to the metadata space. Here, the metadata information is optional since our architecture is also applicable in the totally unsupervised case. We suppose that the x's follow the data distribution p(x), that to each x corresponds a unique label m, and we write p(x, m) for the distribution of the x's together with their metadata information.
Similar to VAE architectures, we suppose that our data depend on a pair of low-dimensional latent variables z and v through some decoder p(x | z, v) parametrized by a neural network. In this paper, we term the variables z templates and the variables v attributes. More details about the attribute space are given in Sect. 2.3.
We introduce a factorized prior p(z, v) = p(z) p(v) over this latent space so that the joint probability distribution is expressed as p(x, z, v) = p(x | z, v) p(z) p(v). The objective is to maximize the log-likelihood of the data under the model. Since the posterior distribution p(z | x) is usually intractable, an approximate posterior distribution q(z | x) parametrized by a neural network is introduced (for simplicity, we let the approximate posterior depend on x only, but conditioning it on the metadata m, or even on the attributes v, is feasible). Concerning the computation of v given x, we introduce a deterministic attribute function φ parametrized by a neural network, which can optionally rely on the metadata information m.
We then obtain the following mean reconstruction loss:

$$\mathcal{L}_{\mathrm{rec}} := -\,\mathbb{E}_{(x, m) \sim p(x, m)}\; \mathbb{E}_{z \sim q(z \mid x)} \left[ \log p\big(x \mid z, \phi(x, m)\big) \right], \tag{1}$$

and regularize the latent space by adding the usual Kullback-Leibler (KL) divergence term appearing in the VAE Evidence Lower Bound (ELBO) on z:

$$\mathcal{L}_{\mathrm{KL}} := \mathbb{E}_{x \sim p(x)} \left[ \mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big) \right]. \tag{2}$$
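As a reading aid, the KL term of Eq. (2) admits the usual closed form when the posterior is a diagonal Gaussian (as in our experiments, Sect. 4.1). A minimal sketch in plain Python (the function name is ours):

```python
import math

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over the latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# A posterior equal to the prior carries no extra information: KL = 0.
assert kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]) == 0.0
```

In practice this quantity is computed per batch element and averaged, which is why the estimator can be computed in closed form rather than sampled.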
2.2 Disentangling attributes from templates
Our decoder thus depends exclusively on the template z and on the attributes v = φ(x, m). However, there is no reason for a sample of p(x | z, ṽ), where ṽ is a random attribute and z the template of x, to be a meaningful variation of the original x. Indeed, all the information needed for reconstructing x could potentially already be contained in z, and changing v could have no effect on the reconstructions.
To enforce this property, we propose to add an adversarially-learned cost on the latent variables z and v in order to force the encoder to discard information about the attributes of x: Specifically, we train a discriminator neural network D whose role is to evaluate the probability that there exists a pair (x, m) such that z ∼ q(z | x) and v = φ(x, m). In other words, the aim of the discriminator is to determine whether the attributes v and the template code z originate from the same x or whether the attributes were sampled from the prior p(v).
The encoder-decoder architecture presented in Sect. 2.1 is trained to fool the discriminator: this means that, for a given x, it tries to produce a template code z and attributes v such that no information about v can be recovered from z.
In an optimal setting, i.e. when the discriminator is unable to match any template code z with a particular attribute v, the space of template codes and the space of attributes are decorrelated, and the aggregated distribution of the z's (the pushforward measure of p(x) by the encoder) matches the prior p(z).
The discriminator is trained to maximize

$$\mathcal{L}_{D} := \mathbb{E}_{(x, m) \sim p(x, m)}\; \mathbb{E}_{z \sim q(z \mid x)} \left[ \log D\big(z, \phi(x, m)\big) \right] + \mathbb{E}_{z \sim q(z \mid x)}\; \mathbb{E}_{\tilde v \sim p(v)} \left[ \log\big(1 - D(z, \tilde v)\big) \right], \tag{3}$$

while the encoder-decoder architecture is trained to minimize

$$\mathcal{L}_{\mathrm{adv}} := \mathbb{E}_{(x, m) \sim p(x, m)}\; \mathbb{E}_{z \sim q(z \mid x)} \left[ \log D\big(z, \phi(x, m)\big) \right]. \tag{4}$$
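One common instantiation of this adversarial game can be sketched as follows (a toy plain-Python sketch; the exact loss forms used by VarNet may differ, `D` is any scoring network returning a logit, and the function names are ours):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_loss(D, z, v_matched, v_prior):
    """Train D to output 1 on matched pairs (z, v) coming from the same x
    and 0 on pairs where the attribute is drawn from the prior p(v)."""
    matched = sigmoid(D(z, v_matched))
    shuffled = sigmoid(D(z, v_prior))
    return -(math.log(matched) + math.log(1.0 - shuffled))

def encoder_fooling_loss(D, z, v_matched):
    """Train the encoder so that a matched pair looks like a prior pair,
    i.e. D can no longer exploit information about v contained in z."""
    return -math.log(1.0 - sigmoid(D(z, v_matched)))

# An uninformative discriminator (logit 0) yields the equilibrium value 2*log(2).
blind = lambda z, v: 0.0
assert abs(discriminator_loss(blind, None, None, None) - 2 * math.log(2)) < 1e-9
```

At the equilibrium sketched here, D outputs 1/2 everywhere, which is exactly the optimal setting described above where template codes carry no attribute information.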
The final objective we minimize for the encoder-decoder architecture is a combination of the reconstruction loss Eq. (1), the Kullback-Leibler penalty on z Eq. (2) and the adversarially-learned loss Eq. (4).
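Collecting the three terms, the combined objective thus has the form below (the weight symbols β and λ are our notation for the weighting hyperparameters, not necessarily the paper's original symbols):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{rec}}
\;+\; \beta \, \mathcal{L}_{\mathrm{KL}}
\;+\; \lambda \, \mathcal{L}_{\mathrm{adv}}
```

Increasing the weight on the KL term constrains the approximate posterior toward the prior, which is the lever used in Sect. 4.3 to control how "spread out" the variation spaces are.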
Our final architecture is shown in Fig. 1 and the proposed training procedure is shown in Alg. 1. Estimators of Eq. (3), Eq. (1) and Eq. (4) are given by Eq. (5), Eq. (7) and Eq. (8), respectively. The estimator of the KL term can be either sampled or computed in closed form.
2.3 Parametrization of the attribute space
We adopt a particular parametrization of our attribute function φ. In the following, we make a distinction between two different cases: the case of continuous free attributes and the case of fixed continuous or discrete attributes.
2.3.1 Free attributes
In order to handle free attributes, i.e. attributes that are not specified a priori but learned, we introduce attribute vectors u_1, …, u_d together with an attention module α. By denoting α_i(x) the coordinates of α(x), we then write our attribute function as

$$\phi(x) = \sum_{i=1}^{d} \alpha_i(x)\, u_i. \tag{10}$$

This approach is similar to the style tokens approach presented in [30]. The u_i's are global and do not depend on a particular instance x. By varying the values of the α_i's between 0 and 1, we can then span a d-dimensional hypercube which stands for our attribute space. It is worth noting that the u_i's are also learned and thus constitute an adaptive basis of the attribute space.
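A minimal sketch of this decomposition in plain Python (names ours): the attribute is a weighted combination of learned token vectors, with weights in [0, 1].

```python
def free_attribute(alphas, tokens):
    """phi(x) = sum_i alpha_i * u_i, with each alpha_i in [0, 1]:
    a point of the hypercube spanned by the learned tokens u_i."""
    dim = len(tokens[0])
    return [sum(a * u[k] for a, u in zip(alphas, tokens)) for k in range(dim)]

# Two toy 2-dimensional tokens; the alphas select a point between them.
tokens = [[1.0, 0.0], [0.0, 1.0]]
assert free_attribute([0.5, 0.25], tokens) == [0.5, 0.25]
```

In the model, the `alphas` would be produced by the attention network from x and the `tokens` would be trainable parameters, so both the coordinates and the basis of the attribute space adapt during training.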
The prior p(v) over the attribute space (note that this subspace also varies during training) is obtained by pushing forward a fixed distribution over the α_i's (often taken to be the uniform distribution) using Eq. (10).
2.3.2 Fixed attributes
We now suppose that the metadata variable m contains attributes that we want to vary at generation time. For simplicity, we suppose that this metadata information can be either continuous, with values in a bounded interval (with a natural order on each dimension), or discrete, with values in a finite label set.
In the continuous case, we write our attribute function

$$\phi(m) = \sum_{i} m_i\, u_i,$$

while in the discrete case, we just consider

$$\phi(m) = e(m),$$

where e(m) is a d-dimensional embedding of the symbol m. It is important to note that even if the attributes are fixed, the u_i's or the embeddings e(m) are learned during training.
These two equations define a prior p(v) via the pushforward of the metadata distribution by the attribute function φ.
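The two fixed-attribute cases can be sketched in the same style (names ours; in the model the embedding table and the tokens would be learned jointly with the rest of the networks):

```python
def fixed_continuous_attribute(metadata, tokens):
    """phi(m) = sum_i m_i * u_i: continuous metadata scales learned tokens."""
    dim = len(tokens[0])
    return [sum(m * u[k] for m, u in zip(metadata, tokens)) for k in range(dim)]

def fixed_discrete_attribute(label, embeddings):
    """phi(m) = e(m): look up the learned embedding of a discrete label."""
    return embeddings[label]

embeddings = {"digit_3": [0.1, -0.2]}   # toy learned embeddings
assert fixed_discrete_attribute("digit_3", embeddings) == [0.1, -0.2]
```

Both functions output a vector living in the same attribute space as the free attributes, which is what makes concatenating attribute functions (Sect. 3.2) straightforward.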
We now detail our objective Eq. (6) and notably make explicit, in Sect. 3.1, its connections with [27], which gives an information-theoretic interpretation of a similar objective. In Sect. 3.2, we further discuss the multiple possibilities concerning the implementation of the attribute function φ and list, in Sect. 3.3, the different sampling schemes of VarNet.
3.1 Information-theoretic interpretation
Our objective bears similarity to the objective from [27], which is derived from an optimization problem trading off a mutual information term against a conditional mutual information term (computed using the joint distribution of the data and the latent variables).
The differences between our work and [27] are the following: first, our objective Eq. (6) is motivated from a generative point of view, while the motivation in [27] is to obtain fair representations [22, 33, 24]. This allows us to consider attributes computed from both the input x and the metadata m, while in [27] the "sensitive attributes" depend only on m and must be provided. A second difference is our formulation of the discriminator function: [27] considers training a discriminator to predict v given z. In our case, such a formulation would prevent us from imposing a prior distribution over the attribute space.
3.2 Flexibility in the choice of the attribute function
In this section, we focus on the parametrization of the attribute function φ. The formulation of Sect. 2.3 is in fact too restrictive, as it considered only particular attribute functions. It is possible to mix different attribute functions by simply concatenating the resulting vectors. By doing so, we can combine free and fixed attributes in a natural way, consider different attention modules similarly to what is done in [4], and consider different distributions over the attention values.
It is important to note that the free attributes presented in Sect. 2.3.1 can only capture global attributes, which are attributes that are relevant for all elements of the dataset. In the presence of discrete labels, it can be interesting to consider label-dependent free attributes, which are attributes specific to a subset of the dataset. In this case, the attribute function can be written as

$$\phi(x, m) = \sum_{i=1}^{d} \alpha_i(x)\, u_i^{(m)}, \tag{15}$$

where u_i^(m) designates the i-th attribute vector of the label m. With all these possibilities at hand, it is possible to devise numerous applications in which the notions of template and attribute of an input may have diverse interpretations.
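A sketch of Eq. (15) in the same style as before (names ours): each label owns its own token basis, so the free attributes become label-specific.

```python
def label_dependent_attribute(alphas, label, tokens_per_label):
    """phi(x, m) = sum_i alpha_i(x) * u_i^(m): the attention weights are
    shared, but the token basis depends on the discrete label m."""
    tokens = tokens_per_label[label]
    dim = len(tokens[0])
    return [sum(a * u[k] for a, u in zip(alphas, tokens)) for k in range(dim)]

# Toy bases: the same attention weight selects different directions per label.
tokens_per_label = {"cat": [[1.0, 0.0]], "dog": [[0.0, 1.0]]}
assert label_dependent_attribute([0.5], "cat", tokens_per_label) == [0.5, 0.0]
assert label_dependent_attribute([0.5], "dog", tokens_per_label) == [0.0, 0.5]
```

This is the design contrasted in Sect. 4.2.3: dropping the dependence on the label recovers global free attributes, keeping it yields per-class "local" free attributes.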
Our choice of using a discriminator over the attributes v instead of, for instance, over the metadata values m themselves allows us to encompass discrete and continuous fixed attributes within the same framework. It also makes combinations of such attribute functions natural.
3.3 Sampling schemes
We now quickly review the different sampling schemes of VarNet; we believe that this range of usages makes it a promising model for a wide variety of applications. We can, for instance:
generate random samples from the estimated dataset distribution: $\hat{x} \sim p(x \mid z, v)$ with $z \sim p(z)$ and $v \sim p(v)$; (16)
sample with given attributes v: $\hat{x} \sim p(x \mid z, v)$ with $z \sim p(z)$; (17)
generate variations of an input x with attributes v: $\hat{x} \sim p(x \mid z, v)$ with $z \sim q(z \mid x)$; (18)
generate random variations of an input x: $\hat{x} \sim p(x \mid z, v)$ with $z \sim q(z \mid x)$ and $v \sim p(v)$. (19)
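The schemes above differ only in where the template and the attributes come from. A sketch with stand-in callables (`encode`, `decode`, `sample_z`, `sample_v` are hypothetical stubs for the trained networks and priors, and the function names are ours):

```python
def generate(decode, sample_z, sample_v):
    """Random sample: both template and attributes from their priors."""
    return decode(sample_z(), sample_v())

def sample_with_attributes(decode, sample_z, v):
    """Attributes fixed by the user, template from its prior."""
    return decode(sample_z(), v)

def vary_input(encode, decode, x, v):
    """Variation of x (cf. Eq. (18)): keep the template of x, swap in v."""
    return decode(encode(x), v)

def random_variation(encode, decode, x, sample_v):
    """Random variation of x (cf. Eq. (19)): template of x, attributes from the prior."""
    return decode(encode(x), sample_v())

# Toy check with identity-like stubs.
out = vary_input(lambda x: ("z", x), lambda z, v: (z, v), "input", "attr")
assert out == (("z", "input"), "attr")
```

In the full model, `encode` would sample from q(z | x) and `decode` from p(x | z, v), so each scheme is a proper ancestral sampling procedure rather than a deterministic mapping.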
We now illustrate the different sampling schemes presented in Sect. 3.3 on different image datasets. Section 4.1 is devoted to the implementation details and our particular modeling choices, Sect. 4.2 and 4.3 showcase some applications of VarNet ensuring that the learned attributes are meaningful and controllable. Finally, Sect. 4.4 discusses the impact of the modeling choices.
4.1 Implementation details
In all these experiments, we choose a ResNet-50 [11] for the encoder network. We also use a ResNet-50 for the attribute function when it depends on the input x and not only on the metadata information m. For the decoder network, we use a conditional transposed ResNet-50 similar to the decoder used in [15]. The batch normalization parameters [13] are conditioned on v, similarly to what is done in [8, 25].
Following [6], we consider a simple parametrization for the encoder and decoder networks: the encoder is a diagonal Gaussian distribution and the decoder is a Gaussian distribution with scalar covariance matrix γI, where γ is a learnable scalar parameter and I denotes the identity matrix.
The prior p(z) over templates is a centered unit-variance Normal distribution. For the sampling of the attention values in the free-attributes case, we consider a uniform distribution over [0, 1]. In the fixed-attributes case, we simply obtain random samples of attributes by shuffling the already-computed batches of attributes (lines 4 and 6 in Alg. 1).
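The batch-shuffling trick can be sketched as follows (name ours): permuting the attributes within a batch pairs each template with an attribute drawn from the empirical marginal attribute distribution, which plays the role of the prior p(v) in the fixed-attributes case.

```python
import random

def shuffle_attributes(v_batch, rng=None):
    """Return the batch of attribute vectors in a random order: each template
    is then paired with an attribute drawn from the empirical marginal."""
    rng = rng or random.Random()
    permuted = list(v_batch)
    rng.shuffle(permuted)
    return permuted

batch = [[0.0], [1.0], [2.0]]
# The multiset of attributes is preserved; only the pairing with templates changes.
assert sorted(shuffle_attributes(batch)) == sorted(batch)
```

This avoids having to model p(v) explicitly when the attributes are deterministic functions of the observed metadata.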
4.2 Influence of the attribute function
We now apply VarNet on the different image datasets mentioned above in order to illustrate the different sampling schemes presented in Sect. 3.3 and the influence of the choice of the attribute functions (Sect. 3.2). We primarily focus on the sampling schemes of Eq. (18) and Eq. (19), which help in understanding the effect of the learned attributes.
4.2.1 Fixed discrete attributes
We start by considering attribute functions depending exclusively on discrete labels (Eq. (12)). Figure 2 shows the effect of changing the attributes of a given input using Eq. (18) on the MNIST and KMNIST datasets. We observe consistency between all variations. In these two cases, the template encompasses the handwriting, and it is interesting to note that characters with two distinct ways of writing appear in the generated variations.
4.2.2 Free continuous attributes
We now focus on the case where we let the network learn relevant attributes of the data. Figure 3 displays typical examples of variations obtained by considering free attributes on different datasets. These plots are obtained by varying the attention values α_i in Eq. (10) at a constant speed between 0 and 1.
In all cases, the learned attributes are relevant to the dataset and meaningful. We find it interesting to note that there is no disentanglement between the dimensions of the attributes and that the learned attributes are highly dependent on the dataset at hand.
4.2.3 Mixing attributes
In this part, we give examples of attribute functions obtained by combining the ones mentioned in the preceding sections. In particular, we consider attribute functions created by combining
several discrete fixed attributes,
discrete fixed attributes and continuous free attributes,
continuous free attributes conditioned by discrete fixed attributes.
The objective is to showcase how the design of the attribute functions influences the learned attributes and the "meaning" of the templates.
Figure 4 considers the case where two attribute functions relying solely on binary labels are concatenated. We observe that the resulting model allows varying every label independently during generation. Such a design choice can be useful since concatenating two attribute functions, each taking into account a different label, instead of choosing a unique attribute function over the Cartesian product of the labels, tends to make generalization easier.
We now consider the case where free attributes are combined with fixed attributes. There are two different ways to do so: one is to concatenate an attribute function on free attributes with one taking into account fixed discrete attributes; the other is to make the free attributes label-dependent as in Eq. (15).
Figure 4(a) displays one such example in the first case. This corresponds to an attribute function as in Eq. (15), but where the dependence of the attribute vectors on the label m is dropped. As a result, the learned free attributes are global and possess the same "meaning" for all classes. In Fig. 4(b), on the contrary, the dependence of the u_i's on m is kept. This has the effect that each class possesses a "local" free attribute.
4.3 Influence of the hyperparameters
From the preceding examples of Sect. 4.2, we saw that the fixed label attributes have clearly been taken into account, but it can be hard to guess a priori which high-level attribute the free attribute function might capture. However, for a given architecture, a given dataset and fixed hyperparameters, we observed that the network tended to learn the same high-level features across multiple runs. In this section, we show that the information-theoretic interpretation from Sect. 3.1 gives us an additional way to control what is learned by the free attributes, by modifying the hyperparameters weighting the KL and adversarial terms.
For some applications, variation spaces such as the one displayed in Fig. 5(a) are not desirable because they may tend to move too "far away" from the original input. As discussed in Sect. 3.1, it is possible to reduce how "spread out" the spaces of variation are by modifying the parameter multiplying the KL term in the objective Eq. (6). An example of such a variation space is displayed in Fig. 5(b). While the interpretation of the high-level features learned by the free attributes seems identical in both cases, the variations spanned in Fig. 5(b) all stay closer to the original input.
4.4 Inductive biases
In order to check how crucial the choice of architecture for the encoder and the decoder is, we choose in this part to train a VarNet on representations obtained from a pre-trained VAE, similarly to the TwoStageVAE from [6]. The pre-trained VAE has the same encoder as the one described in Sect. 4.1 and a transposed ResNet-50 decoder. For the encoder and decoder of the VarNet trained on these representations, we use simple three-layered MLPs. The attribute function network is, as in Sect. 4.1, a ResNet-50. The results in the case of a 2-dimensional free attribute are displayed in Fig. 7. One striking observation is the difference with Fig. 2(c), where changing the free attributes corresponded to changing the background color. Such a difference is not surprising [21], and both cases could be useful: removing information about background color from the template could be of help for some downstream tasks where this piece of information is irrelevant. Here, in the case of Fig. 7, the free attributes tend to learn the "pose" and "skin color" high-level attributes. This seems to be consistent across different runs (e.g. Fig. 6(a) and 6(b)) and across different inputs (e.g. Fig. 6(b) and 6(c)).
5 Related work
The Variation Network generalizes many existing models used for controlled input manipulation by providing a unified probabilistic framework for this task. We now review the related literature and discuss the connections with VarNet.
The problem of controlled input manipulation has been considered in the Fader networks paper [18], where the authors are able to modify the attributes of an input image in a continuous manner. Similarly to our work, this approach uses an encoder-decoder architecture together with an adversarial loss used to decouple templates and attributes. The major difference with VarNet is that this model has a deterministic encoder, which limits the sampling possibilities as discussed in Sect. 4.3. Also, this approach can only deal with fixed attributes, while VarNet is also able to learn meaningful free attributes. In fact, VAEs [16], WAEs [28, 26] and Fader networks can be seen as special cases of VarNet.
Recently, the Style Tokens paper [30] proposed a solution to learn relevant free attributes in the context of text-to-speech. The similarity with our approach is that the authors condition an encoder model on an adaptive basis of style tokens (what we called the attribute space in this work). VarNet borrows this idea but casts it in a probabilistic framework, where a distribution over the attribute space is imposed and where the encoder is stochastic. Our approach also allows taking into account fixed attributes, which, as we saw, can help shape the free attributes.
The traditional way to explore the latent space of VAEs is by doing linear (or spherical [31]) interpolations between two points. However, there are two major caveats in this approach: the requirement of always needing two points in order to explore the latent space is cumbersome, and the interpolation scheme is arbitrary and bears no probabilistic interpretation. Concerning the first point, a common approach is to find, a posteriori, directions in the latent space that account for a particular change of the (fixed) attributes. These directions are then used to move in the latent space. Similarly, some models provide these directions of interest a priori. Concerning the second point, it has been proposed to compute interpolation paths minimizing some energy functional, which results in interpolation curves rather than straight interpolation lines. However, this interpolation scheme is computationally demanding, since an optimization problem must be solved for each point of the interpolation path.
Another trend in controlled input manipulation is to perform a posteriori analysis on a trained generative model [9, 1, 29, 3] using different means. One possible advantage of these methods compared to ours is that different attribute manipulations can be devised after the training of the generative model. However, these procedures are still costly and thus preclude any real-time application where a user could provide, on the fly, the attributes they would like to modify. One of these approaches consists in using the trained decoder to obtain a mapping from the latent space to the data space and then performing gradient descent on an objective which accounts for the constraints or the change of the attributes. Another related approach consists in training a Generative Adversarial Network which learns to move in the vicinity of a given point in the latent space so that the decoded output enforces some constraints. The major difference of these two approaches with our work is that these movements are done in a unique latent space, while in our case we consider separate latent spaces. More importantly, these approaches implicitly consider that the variation of interest lies in a neighborhood of the provided input. In [1], the authors introduce an additional latent space called the interpretable lens, used to interpret the latent space of a generative model. This space shares similarity with our VarNet trained on VAE representations. The authors also propose a joint optimization for their model, where the encoder-decoder architecture and the interpretable lens are learned jointly. The difference with our approach is that the authors optimize an "interpretability" loss which requires labels and still need to perform a posteriori analysis to find relevant directions in the latent space.
6 Conclusion and future work
We presented the Variation Network, a generative model able to vary attributes of a given input. The novelty is that these attributes can be fixed or learned and have a sound probabilistic interpretation. Many sampling schemes have been presented together with a detailed discussion and examples. We hope that the flexibility in the design of the attribute function and the simplicity, from an implementation point of view, in transforming existing encoder-decoder architectures (it suffices to provide the encoder and decoder networks) will be of interest in many applications.
We saw that our architecture is indeed capable of decoupling templates from learned attributes and that we have three ways of controlling the free attributes that are learned: by modifying the hyperparameters in the objective Eq. (6), by carefully devising the attribute functions, or by working on different input representations with different encoder/decoder architectures.
For future work, we would like to extend our approach in two different ways: being able to deal with partially-given fixed attributes and handling discrete free attributes as in [14]. We also want to investigate the use of stochastic attribute functions. Indeed, it appeared to us that using deterministic attribute functions was crucial, and we would like to deepen our understanding of the interplay between all VarNet components.
References

- [1] (2018) Discovering interpretable representations for both deep generative and discriminative models. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholm, Sweden, pp. 50–59.
- [2] (2018) Understanding disentangling in β-VAE. ArXiv e-prints.
- [3] (2018) Deep Learning-Based Decoding for Constrained Sequence Codes. ArXiv e-prints.
- [4] (2016) Variational lossy autoencoder. CoRR abs/1611.02731.
- [5] (2018-12-03) (Website).
- [6] (2019) Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789.
- [7] (2016) Adversarially Learned Inference. ArXiv e-prints.
- [8] (2018) Feature-wise transformations. Distill 3 (7), e11.
- [9] (2017) Latent constraints: learning to generate conditionally from unconditional generative models. CoRR abs/1711.05772.
- [10] (2017) GLSR-VAE: geodesic latent space regularization for variational autoencoder architectures. CoRR abs/1707.04588.
- [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [12] (2016) Beta-VAE: learning basic visual concepts with a constrained variational framework.
- [13] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- [14] (2019) Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning, pp. 3091–3099.
- [15] (2018) RedNet: residual encoder-decoder network for indoor RGB-D semantic segmentation. CoRR abs/1806.01054.
- [16] (2013) Auto-encoding variational Bayes. ArXiv e-prints.
- [17] (2018) Feature-based metrics for exploring the latent space of generative models.
- [18] (2017) Fader networks: manipulating images by sliding attributes. CoRR abs/1706.00409.
- [19] (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- [20] (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
- [21] (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
- [22] (2015) The Variational Fair Autoencoder. ArXiv e-prints.
- [23] (2017) dSprites: disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/.
- [24] (2018) Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pp. 9084–9093.
- [25] (2018) FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
- [26] (2018) On the Latent Space of Wasserstein Auto-Encoders. ArXiv e-prints.
- [27] (2018) Learning controllable fair representations. CoRR abs/1812.04218.
- [28] (2017) Wasserstein Auto-Encoders. ArXiv e-prints.
- [29] (2016) Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507.
- [30] (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. CoRR abs/1803.09017.
- [31] (2016) Sampling generative networks: notes on a few effective techniques. arXiv preprint arXiv:1609.04468.
- [32] (2017-08-28) (Website).
- [33] (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pp. 585–596.
- [34] (2019) Lipschitz generative adversarial nets. CoRR abs/1902.05687.