Variation Network: Learning High-level Attributes for Controlled Input Manipulation
Abstract
This paper presents the Variation Network (VarNet), a generative model providing means to manipulate the high-level attributes of a given input. The originality of our approach is that VarNet is not only capable of handling predefined attributes but can also learn the relevant attributes of the dataset by itself. These two settings can also easily be considered at the same time, which makes this model applicable to a wide variety of tasks. Further, VarNet has a sound information-theoretic interpretation which grants us interpretable means to control how these high-level attributes are learned. We demonstrate experimentally that this model is capable of performing interesting input manipulation and that the learned attributes are relevant and meaningful.
1 Introduction
We focus on the problem of learning to generate variations of a given input in an intended way. Concretely, this means that given some input element $x$, which can be considered as a template, we want to generate transformed versions of $x$ by modifying only its high-level attributes. The objective is that the link between the original $x$ and its transformed version is preserved while the difference between their attributes is patent. Such a mechanism can be of great use in many domains such as image editing, since it allows images to be edited using more abstract controls, and can be of crucial importance for creative uses, since it allows new content to be generated in a controlled and meaningful way.
In a majority of the recently proposed methods tackling this problem, such as [29, 18, 27], the attributes of interest that we want to control are assumed to be given (often as a discrete variable). While these methods are indeed successful in generating meaningful transformations of any input, we can nonetheless identify two shortcomings which can restrict their use: 1) labeled data is not always available or can be costly to obtain; 2) attributes which are hard to specify in an absolute way cannot be considered.
The novelty of our approach resides in the fact that it makes it possible, within the same framework, to control generations by modifying user-specified attributes and, at the same time, to learn meaningful high-level attributes which can then be used for control.
This problem may seem ill-posed in many respects: firstly, in the case of specified attributes, there is no ground truth for variations, since the same input $x$ is never observed with two different attributes. Secondly, it can be hard to determine whether a learned attribute is relevant. However, we provide empirical evidence that our general approach is capable of learning such relevant attributes and that they can be used for generating meaningful variations.
This paper introduces the Variation Network (VarNet), a probabilistic neural network which provides means to manipulate an input by changing its high-level attributes, which can be either learned or provided. Our model has a sound probabilistic interpretation which makes the variations obtained by changing the attributes statistically meaningful. This architecture is general and offers a wide range of design choices.
Our contributions are the following:

An easy-to-use framework: any encoder-decoder architecture can easily be framed into our framework in order to gain control over generations, even in the case of unlabeled data,

An information-theoretic interpretation of our model, providing a way to control the behavior of the learned attributes.
The plan of this paper is the following: Sect. 2 presents the VarNet architecture together with its training algorithm. For better clarity, we introduce separately all the components featured in our model and postpone the discussion about their interplay and the motivation behind our modeling choices to Sect. 3. We illustrate in Sect. 4 the possibilities offered by our proposed model and show that its faculty to generate variations in an intended way is of particular interest. Finally, Sect. 5 discusses related work. In particular, we show that VarNet provides an interesting solution to many constrained generation problems already considered in the literature while bearing interesting connections with the literature on fair representations [22, 27] and disentangled representations [12, 2, 21].
Our code is available at https://github.com/Ghadjeres/VarNet.
2 Proposed model
We now introduce our novel encoder-decoder architecture, which we name the Variation Network. Our architecture borrows principles from the traditional Variational Auto-Encoder (VAE) architecture [16] and from the Wasserstein Auto-Encoder (WAE) architecture [28, 26]. It uses an adversarially-learned regularization [7, 18], introduces a decomposition of the latent space into two parts, a template and an attribute [1], and decomposes the attributes on an adaptive basis [30].
We detail in the following sections the different parts involved in our model. In Sect. 2.1, we focus on the encoder-decoder part of VarNet; in Sect. 2.2, we introduce the adversarially-learned regularization whose aim is to disentangle attributes from templates; and in Sect. 2.3, we discuss the special parametrization that we adopted for the space of attributes.
2.1 Encoderdecoder part
We consider a dataset $\mathcal{D} = \{(x_i, m_i)\}_i$ of labeled elements, where $x_i \in \mathcal{X}$, the input space, and $m_i \in \mathcal{M}$, the metadata space. Here, the metadata information is optional, as our architecture is also applicable in the totally unsupervised case. We suppose that the $x_i$'s follow the data distribution $p_d(x)$, that to each $x$ corresponds a unique label $m$, and write $p_d(x, m)$ for the distribution of the $x$'s together with their metadata information.
Similar to VAE architectures, we suppose that our data $x$ depends on a couple of low-dimensional latent variables $z$ and $\phi$ through some decoder $p(x \mid z, \phi)$ parametrized by a neural network. In this paper, we term the variables $z$ templates and the variables $\phi$ attributes. More details about the attribute space are given in Sect. 2.3.
We introduce a factorized prior $p(z, \phi) = p(z)\, p(\phi)$ over this latent space, so that the joint probability distribution is expressed as $p(x, z, \phi) = p(x \mid z, \phi)\, p(z)\, p(\phi)$. The objective is to maximize the log-likelihood of the data under the model. Since the posterior distribution is usually intractable, an approximate posterior distribution $q(z \mid x)$ parametrized by a neural network is introduced (for simplicity, we let the approximate posterior depend on $x$ only, but conditioning on $m$ or even $\phi$ is feasible). Concerning the computation of $\phi$ given $x$, we introduce a deterministic attribute function $\phi(x, m)$ parametrized by a neural network, which can optionally rely on the metadata information $m$.
We then obtain the following mean reconstruction loss:
$\mathcal{L}_{\text{rec}} := -\,\mathbb{E}_{(x, m) \sim p_d}\, \mathbb{E}_{z \sim q(z \mid x)} \big[ \log p\big(x \mid z, \phi(x, m)\big) \big]$   (1)
and regularize the latent space by adding the usual Kullback-Leibler (KL) divergence term appearing in the VAE Evidence Lower Bound (ELBO) on $\log p(x)$:
$\mathcal{L}_{\mathrm{KL}} := \mathbb{E}_{x \sim p_d} \big[ D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, p(z) \big) \big]$   (2)
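For concreteness, the two terms above can be sketched in a few lines of NumPy. This is a simplified stand-in, assuming a diagonal Gaussian posterior and a Gaussian decoder with scalar covariance (as used later in Sect. 4.1); the function names are ours, not part of the released implementation:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), per sample.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def gaussian_recon_loss(x, x_hat, log_gamma=0.0):
    # Negative log-likelihood of x under N(x_hat, exp(log_gamma) * I),
    # up to additive constants (the Monte Carlo term inside Eq. (1)).
    gamma = np.exp(log_gamma)
    d = x.shape[-1]
    return 0.5 * (np.sum((x - x_hat) ** 2, axis=-1) / gamma + d * log_gamma)

# Sanity check: a perfect reconstruction with a standard-normal posterior
# contributes zero to both terms.
assert np.allclose(kl_diag_gaussian(np.zeros((4, 2)), np.zeros((4, 2))), 0.0)
assert np.allclose(gaussian_recon_loss(np.zeros((4, 8)), np.zeros((4, 8))), 0.0)
```

In practice both quantities are averaged over a minibatch and the reconstruction expectation over $z \sim q(z \mid x)$ is estimated with a single reparametrized sample.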
2.2 Disentangling attributes from templates
Our decoder $p(x \mid z, \phi)$ thus depends exclusively on $z$ and on the attributes $\phi$. However, there is no reason that, for a random attribute $\tilde{\phi}$, a sample $\tilde{x} \sim p(\cdot \mid z, \tilde{\phi})$, where $z \sim q(z \mid x)$, is a meaningful variation of the original $x$. Indeed, all the information needed for reconstructing $x$ could potentially be contained in $z$, and changing $\phi$ could have no effect on the reconstructions.
To enforce this property, we propose to add an adversarially-learned cost on the latent variables in order to force the encoder to discard information about the attribute of $x$. Specifically, we train a discriminator neural network $D(z, \phi)$ whose role is to evaluate the probability that there exists a pair $(x, m)$ in the dataset such that $z \sim q(z \mid x)$ and $\phi = \phi(x, m)$. In other words, the aim of the discriminator is to determine whether the attributes $\phi$ and the template code $z$ originate from the same input $x$, or whether the attributes were sampled from the prior $p(\phi)$.
The encoder-decoder architecture presented in Sect. 2.1 is trained to fool the discriminator: this means that, for a given $x$, it tries to produce a template code $z$ and attributes $\phi$ such that no information about the attributes can be recovered from $z$.
In an optimal setting, i.e. when the discriminator is unable to match any template code $z$ with a particular attribute $\phi$, the space of template codes and the space of attributes are decorrelated, and the aggregated distribution of the $z$'s (the push-forward measure of $p_d$ by the encoder) matches the prior $p(z)$.
The discriminator is trained to maximize
$\mathcal{L}_D := \mathbb{E}_{(x, m) \sim p_d}\, \mathbb{E}_{z \sim q(z \mid x)} \big[ \log D\big(z, \phi(x, m)\big) \big] + \mathbb{E}_{x \sim p_d}\, \mathbb{E}_{z \sim q(z \mid x)}\, \mathbb{E}_{\tilde{\phi} \sim p(\phi)} \big[ \log\big(1 - D(z, \tilde{\phi})\big) \big]$   (3)
while the encoder-decoder architecture is trained to minimize
$\mathcal{L}_{\text{adv}} := \mathbb{E}_{(x, m) \sim p_d}\, \mathbb{E}_{z \sim q(z \mid x)} \big[ \log D\big(z, \phi(x, m)\big) \big] + \mathbb{E}_{x \sim p_d}\, \mathbb{E}_{z \sim q(z \mid x)}\, \mathbb{E}_{\tilde{\phi} \sim p(\phi)} \big[ \log\big(1 - D(z, \tilde{\phi})\big) \big]$.   (4)
This is the same setting as in [7, 28], but other GAN training methods could be considered [34]. Note the presence of two terms in Eq. (4): since the "true" examples depend on training parameters, both terms contribute gradients to the encoder-decoder.
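The adversarial objective of Sect. 2.2 can be sketched as follows; this toy NumPy version (our own naming, working directly on discriminator logits) only illustrates the two-sided structure, not the actual training code:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adversarial_objective(logits_matched, logits_prior):
    # E[log D(z, phi(x, m))] + E[log(1 - D(z, phi~))], with phi~ ~ p(phi).
    # The discriminator ascends this quantity; the encoder-decoder descends
    # it (both terms, since the matched pairs depend on encoder parameters).
    return np.mean(np.log(sigmoid(logits_matched))
                   + np.log(1.0 - sigmoid(logits_prior)))
```

A confident, correct discriminator (large positive logits on matched pairs, large negative logits on prior pairs) drives this quantity towards its maximum of 0, while an uninformative discriminator sits at $2 \log \tfrac{1}{2}$.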
The final objective we minimize for the encoder-decoder architecture is a combination of the reconstruction loss of Eq. (1), the Kullback-Leibler penalty of Eq. (2) and the adversarially-learned loss of Eq. (4).
Our final architecture is shown in Fig. 1 and the proposed training procedure in Alg. 1. Estimators of Eq. (3), Eq. (1) and Eq. (4) are given by Eq. (5), Eq. (7) and Eq. (8) respectively. The estimator of the KL term can be either sampled or computed in closed form.
$\hat{\mathcal{L}}_D = \frac{1}{n} \sum_{i=1}^{n} \Big[ \log D\big(z_i, \phi(x_i, m_i)\big) + \log\big(1 - D(z_i, \tilde{\phi}_i)\big) \Big], \quad z_i \sim q(z \mid x_i),\; \tilde{\phi}_i \sim p(\phi)$   (5)
$\mathcal{L} = \hat{\mathcal{L}}_{\text{rec}} + \beta\, \hat{\mathcal{L}}_{\mathrm{KL}} + \gamma\, \hat{\mathcal{L}}_{\text{adv}}$   (6)
$\hat{\mathcal{L}}_{\text{rec}} = -\frac{1}{n} \sum_{i=1}^{n} \log p\big(x_i \mid z_i, \phi(x_i, m_i)\big), \quad z_i \sim q(z \mid x_i)$   (7)
$\hat{\mathcal{L}}_{\text{adv}} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \log D\big(z_i, \phi(x_i, m_i)\big) + \log\big(1 - D(z_i, \tilde{\phi}_i)\big) \Big]$   (8)
2.3 Parametrization of the attribute space
We adopt a particular parametrization of our attribute function $\phi$. In the following, we distinguish between two cases: continuous free attributes, and fixed continuous or discrete attributes.
2.3.1 Free attributes
In order to handle free attributes, i.e. attributes that are not specified a priori but learned, we introduce $d$ attribute vectors $v_1, \dots, v_d$ of dimension $k$ together with an attention module $\alpha$. Denoting by $\alpha_i(x)$ the coordinates of $\alpha(x)$, we then write our attribute function as
$\phi(x) = \sum_{i=1}^{d} \alpha_i(x)\, v_i.$   (9)
This approach is similar to the style-tokens approach presented in [30]. The $v_i$'s are global and do not depend on a particular instance $x$. By varying the values of the $\alpha_i$'s between $0$ and $1$, we can then span a $d$-dimensional hypercube which stands for our attribute space. It is worth noting that the $v_i$'s are also learned and thus constitute an adaptive basis of the attribute space.
The prior over this attribute space (note that this subspace also varies during training) is obtained by pushing forward a fixed distribution $p(\alpha)$ over $[0, 1]^d$ (often taken to be the uniform distribution) using Eq. (10):
$\phi = \sum_{i=1}^{d} \alpha_i\, v_i, \quad \alpha \sim p(\alpha).$   (10)
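The free-attribute mechanism can be sketched in NumPy as follows. The random projection standing in for the attention module, and the names `attention`, `phi_free` and `sample_phi_prior`, are ours; in the model both the attention network and the basis $V$ are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, in_dim = 3, 16, 8           # attribute vectors, their dim, input dim
V = rng.normal(size=(d, k))       # learned attribute vectors v_1..v_d (global)
W = rng.normal(size=(in_dim, d))  # toy stand-in weights for the attention module

def attention(x):
    # alpha(x) in [0, 1]^d: a linear map + sigmoid stands in for the
    # learned attention network.
    return 1.0 / (1.0 + np.exp(-(x @ W)))

def phi_free(x):
    # Eq. (9): phi(x) = sum_i alpha_i(x) v_i, a point in the d-dimensional
    # "hypercube" spanned by the v_i.
    return attention(x) @ V

def sample_phi_prior(n):
    # Eq. (10): push a uniform distribution over [0, 1]^d through the basis.
    alpha = rng.uniform(size=(n, d))
    return alpha @ V

x = rng.normal(size=(5, in_dim))
assert phi_free(x).shape == (5, k)
assert sample_phi_prior(7).shape == (7, k)
```

Every attribute, whether computed from an input or sampled from the prior, therefore lives in the span of the same (learned) basis.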
2.3.2 Fixed attributes
We now suppose that the metadata variable $m$ contains the attributes that we want to vary at generation time. For simplicity, we suppose that this metadata information is either continuous, with values in $[0, 1]^d$ (with a natural order on each dimension), or discrete, with values in a finite label set.
In the continuous case, we write our attribute function as
$\phi(m) = \sum_{i=1}^{d} m_i\, v_i,$   (11)
while in the discrete case, we simply consider
$\phi(m) = v_m,$   (12)
where $v_m$ is a $k$-dimensional embedding of the symbol $m$. It is important to note that even if the attributes are fixed, the $v_i$'s or the embeddings $v_m$ are learned during training.
These two equations define a prior $p(\phi)$ via:
$p(\phi) = \phi_{*}\, p_d(m),$   (13)
i.e. the push-forward of the metadata distribution by the attribute function.
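A minimal sketch of the two fixed-attribute cases, again with our own (hypothetical) names; the basis `V` and embedding table `E` would be learned jointly with the rest of the model:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 16                        # attribute dimension
n_labels, n_cont = 10, 3      # discrete vocabulary size, continuous metadata dim
E = rng.normal(size=(n_labels, k))  # learned embeddings v_m, one per symbol
V = rng.normal(size=(n_cont, k))    # learned basis for continuous metadata

def phi_continuous(m):
    # Eq. (11): continuous metadata m in [0, 1]^n_cont weights a learned basis.
    return m @ V

def phi_discrete(m):
    # Eq. (12): a plain embedding lookup v_m for a discrete label m.
    return E[m]

assert phi_continuous(np.array([[0.2, 0.5, 1.0]])).shape == (1, k)
assert phi_discrete(np.array([3, 7])).shape == (2, k)
```

Sampling from the prior of Eq. (13) then amounts to drawing metadata from the dataset and mapping it through these functions, which is how it is implemented in practice (see Sect. 4.1).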
3 Comments
We now comment on our objective Eq. (6), and notably make explicit, in Sect. 3.1, connections with [27], which gives an information-theoretic interpretation of a similar objective. In Sect. 3.2, we further discuss the multiple possibilities concerning the implementation of the attribute function $\phi$ and list, in Sect. 3.3, the different sampling schemes of VarNet.
3.1 Informationtheoretic interpretation
Our objective bears similarity to the objective Eq. (11) from [27], which is derived from the following optimization problem
$\max_{q} \; I(x ; z \mid \phi) \quad \text{s.t.} \quad I(z ; \phi) \le \epsilon,$   (14)
where $I(z ; \phi)$ and $I(x ; z \mid \phi)$ denote the mutual information and the conditional mutual information respectively (both computed under the joint distribution of $(x, z, \phi)$ induced by $p_d$ and $q$).
The differences between our work and [27] are the following. First, our objective Eq. (6) is motivated from a generative point of view, while the motivation in [27] is to obtain fair representations ([22, 33, 24]). This allows us to consider attributes computed from both $x$ and $m$, while in [27] the "sensitive attributes" depend only on the metadata and must be provided. A second difference is our formulation of the discriminator: [27] considers training a discriminator to predict the attributes given $z$. In our case, such a formulation would prevent us from imposing a prior distribution over the attribute space.
3.2 Flexibility in the choice of the attribute function
In this section, we focus on the parametrization of the attribute function $\phi$. The formulation of Sect. 2.3 is in fact too restrictive, as it considered only particular attribute functions. It is possible to mix different attribute functions by simply concatenating the resulting vectors. By doing so, we can combine free and fixed attributes in a natural way, consider different attention modules similarly to what is done in [4], and consider different distributions over the attention vectors $\alpha$.
It is important to note that the free attributes presented in Sect. 2.3.1 can only capture global attributes, i.e. attributes that are relevant for all elements of the dataset. In the presence of discrete labels $m$, it can be interesting to consider label-dependent free attributes, i.e. attributes specific to a subset of the dataset. In this case, the attribute function can be written as
$\phi(x, m) = \sum_{i=1}^{d} \alpha_i(x)\, v_i^{(m)},$   (15)
where $v_i^{(m)}$ designates the $i$-th attribute vector of the label $m$. With all these possibilities at hand, it is possible to devise numerous applications in which the notions of template and attribute of an input may have diverse interpretations.
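Label-dependent free attributes only change the indexing of the basis: each label $m$ owns its own set of attribute vectors. A toy sketch (the random attention stand-in and the names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n_labels, d, k, in_dim = 10, 3, 16, 8
V = rng.normal(size=(n_labels, d, k))  # one attribute basis v_i^(m) per label m
W = rng.normal(size=(in_dim, d))       # toy stand-in for the attention module

def phi_label_dependent(x, m):
    # Eq. (15): phi(x, m) = sum_i alpha_i(x) v_i^(m) -- the attention
    # coefficients come from x, while the basis is selected by the label m.
    alpha = 1.0 / (1.0 + np.exp(-(x @ W)))   # alpha(x) in [0, 1]^d
    return np.einsum('bd,bdk->bk', alpha, V[m])

x = rng.normal(size=(4, in_dim))
m = np.array([0, 1, 1, 9])
assert phi_label_dependent(x, m).shape == (4, k)
```

Dropping the dependence of $V$ on $m$ recovers the global free attributes of Eq. (9), which is the distinction illustrated later in Fig. 4(a) and 4(b).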
Our choice of using a discriminator over $(z, \phi)$ pairs instead of, for instance, over the values of $m$ themselves allows us to encompass discrete and continuous fixed attributes within the same framework. This also makes combinations of such attribute functions natural.
3.3 Sampling schemes
We quickly review the different sampling schemes of VarNet. We believe that this versatility makes VarNet a promising model for a wide range of applications. We can, for instance:

generate random samples from the estimated dataset distribution:
$z \sim p(z), \quad \phi \sim p(\phi), \quad x \sim p(x \mid z, \phi);$   (16)

sample with given attributes $\phi$:
$z \sim p(z), \quad x \sim p(x \mid z, \phi);$   (17)

generate variations of an input $x$ with attributes $\phi$:
$z \sim q(z \mid x), \quad \tilde{x} \sim p(\tilde{x} \mid z, \phi);$   (18)

generate random variations of an input $x$:
$z \sim q(z \mid x), \quad \phi \sim p(\phi), \quad \tilde{x} \sim p(\tilde{x} \mid z, \phi).$   (19)
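The four schemes differ only in where the template and the attributes come from. The following sketch makes this explicit with toy stand-ins for the trained networks (the real encoder, decoder and priors are the neural networks and distributions of Sect. 2; every name here is ours):

```python
import numpy as np

rng = np.random.default_rng(3)
z_dim, k = 4, 16

# Toy stand-ins for the trained networks.
def decode(z, phi):            # sample / mean of p(x | z, phi)
    return np.concatenate([z, phi], axis=-1)

def encode(x):                 # mean of q(z | x)
    return x[..., :z_dim]

def sample_prior_z(n):         # p(z) = N(0, I)
    return rng.normal(size=(n, z_dim))

def sample_prior_phi(n):       # p(phi), e.g. Eq. (10)
    return rng.uniform(size=(n, k))

# Eq. (16): random samples from the estimated dataset distribution.
x_new = decode(sample_prior_z(2), sample_phi := sample_prior_phi(2))
# Eq. (17): samples sharing a given attribute phi.
phi = sample_prior_phi(1)
x_attr = decode(sample_prior_z(2), np.repeat(phi, 2, axis=0))
# Eq. (18): variation of an input x with chosen attributes phi.
x = rng.normal(size=(1, z_dim + k))
x_var = decode(encode(x), phi)
# Eq. (19): random variations of x.
x_rand_var = decode(encode(x), sample_prior_phi(1))
```

In all four cases the same decoder is used; only the provenance of $z$ (prior vs. posterior) and of $\phi$ (prior, user-chosen, or computed) changes.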
4 Experiments
We now illustrate the different sampling schemes presented in Sect. 3.3 on different image datasets. Section 4.1 is devoted to the implementation details and our particular modeling choices; Sects. 4.2 and 4.3 showcase some applications of VarNet, checking that the learned attributes are meaningful and controllable. Finally, Sect. 4.4 discusses the impact of these modeling choices.
4.1 Implementation details
In all these experiments, we choose to use a ResNet-50 [11] for the encoder network. We also use a ResNet-50 for the attribute function $\phi$ when it depends on the input $x$ and not only on the metadata information $m$. For the decoder network, we use a conditional transposed ResNet-50 similar to the decoder used in [15]. The batch normalization parameters [13] are conditioned on the attributes, similarly to what is done in [8, 25].
Following [6], we consider a simple parametrization for the encoder and decoder networks: the encoder $q(z \mid x)$ is a diagonal Gaussian distribution and the decoder $p(x \mid z, \phi)$ is a Gaussian distribution with scalar covariance matrix $\sigma^2 I$, where $\sigma^2$ is a learnable scalar parameter and $I$ denotes the identity matrix.
The prior $p(z)$ is a unit-variance centered Normal distribution. For the sampling of the $\alpha$ values in the free-attribute case, we take $p(\alpha)$ to be the uniform distribution over $[0, 1]^d$. In the fixed-attribute case, we simply obtain random samples from the prior by shuffling the already-computed batches of attributes (lines 4 and 6 in Alg. 1).
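The shuffling trick in the fixed-attribute case can be sketched in two lines; the function name is ours. Permuting the attributes of the current batch detaches each attribute from its original input, which approximates sampling from the marginal distribution of the attributes over the dataset:

```python
import numpy as np

rng = np.random.default_rng(4)

def shuffle_attributes(phi_batch):
    # Approximate samples from p(phi) in the fixed-attribute case:
    # permute the attributes already computed for the current batch,
    # breaking the pairing between inputs and attributes.
    perm = rng.permutation(len(phi_batch))
    return phi_batch[perm]

phi_batch = np.arange(12).reshape(6, 2).astype(float)
shuffled = shuffle_attributes(phi_batch)
# The same attribute vectors appear, detached from their original inputs.
assert sorted(map(tuple, shuffled)) == sorted(map(tuple, phi_batch))
```

This avoids having to model the marginal attribute distribution explicitly, at the cost of a batch-level approximation of the prior.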
4.2 Influence of the attribute function
We now apply VarNet to the different image datasets mentioned above in order to illustrate the different sampling schemes presented in Sect. 3.3 and the influence of the choice of the attribute functions (Sect. 3.2). We primarily focus on the sampling schemes of Eq. (18) and Eq. (19), which help in understanding the effect of the learned attributes.
4.2.1 Fixed discrete attributes
We start by considering attribute functions depending exclusively on discrete labels (Eq. (12)). Figure 2 shows the effect of changing the attributes of a given input using Eq. (18) on the MNIST and KMNIST datasets. We observe consistency between all variations. In these two cases, the template encompasses the handwriting, and it is interesting to note that characters with two distinct ways of writing appear in the generated variations.
4.2.2 Free continuous attributes
We now focus on the case where we let the network learn relevant attributes of the data. Figure 3 displays typical examples of variations obtained by considering free attributes on different datasets. These plots are obtained by varying the $\alpha_i$'s in Eq. (10) at a constant speed between 0 and 1.
In all cases, the learned attributes are relevant to the dataset and meaningful. We find it interesting to note that the dimensions of the attributes are not disentangled, and that the learned attributes are highly dependent on the dataset at hand.
4.2.3 Mixing attributes
In this part, we give examples of attribute functions obtained by combining the ones mentioned in the preceding sections. In particular, we consider attribute functions created by combining

several discrete fixed attributes,

discrete fixed attributes and continuous free attributes,

continuous free attributes conditioned by discrete fixed attributes.
The objective is to showcase how the design of the attribute functions influences the learned attributes and the "meaning" of the templates $z$.
Figure 4 considers the case where two attribute functions relying solely on binary labels are concatenated. We observe that the resulting model allows every label to be varied independently during generation. Such a design choice can be useful, since concatenating two attribute functions that each take into account a different label, instead of choosing a unique attribute function over the Cartesian product of the labels, tends to make generalization easier.
We now consider the case where the free attributes are combined with fixed attributes. There are two different ways to do so: one is to concatenate an attribute function on free attributes with one taking into account fixed discrete attributes; the other is to make the free attributes label-dependent as in Eq. (15).
Figure 4(a) displays one such example in the first case. This corresponds to an attribute function as in Eq. (15), but where the dependence of the $v_i$'s on $m$ is dropped. As a result, the learned free attributes are global and possess the same "meaning" for all classes. In Fig. 4(b), on the contrary, the dependence of the $v_i$'s on $m$ is kept. This has the effect that each class possesses "local" free attributes.
4.3 Influence of the hyperparameters
From the preceding examples of Sect. 4.2, we saw that the fixed label attributes are clearly taken into account, but it can be hard to guess a priori which high-level attribute the free attribute function might capture. However, for a given architecture, a given dataset and fixed hyperparameters, we observed that the network tended to learn the same high-level features across multiple runs. In this section, we show that the information-theoretic interpretation of Sect. 3.1 gives us an additional way to control what is learned by the free attributes, by modifying the hyperparameters weighting the KL and adversarial terms of the objective Eq. (6).
For some applications, variation spaces such as the one displayed in Fig. 5(a) are not desirable, because they may tend to move too "far away" from the original input. As discussed in Sect. 3.1, it is possible to reduce how "spread out" the spaces of variation are by modifying the parameter multiplying the KL term in the objective Eq. (6). An example of such a variation space is displayed in Fig. 5(b). While the interpretation of the high-level features learned by the free attributes seems identical in both cases, the variations spanned in Fig. 5(b) all stay closer to the original input.
4.4 Inductive biases
In order to check how crucial the choice of the architecture for the encoder and the decoder is, we choose in this part to train a VarNet on representations obtained from a pretrained VAE, similarly to the Two-Stage VAE of [6]. The first, pretrained VAE has the same encoder as the one described in Sect. 4.1 and a transposed ResNet-50 decoder. For the encoder and decoder of the VarNet trained on these representations, we use simple three-layered MLPs. The attribute function network is, as in Sect. 4.1, a ResNet-50. The results in the case of a 2-dimensional free attribute are displayed in Fig. 7. One striking observation is the difference with Fig. 2(c), where changing the free attributes corresponded to changing the background color. Such a difference is not surprising [21], and both cases could be useful: removing information about background color from the template could help downstream tasks for which this piece of information is irrelevant. Here, in the case of Fig. 7, the free attributes tend to learn the "pose" and "skin color" high-level attributes. This seems to be consistent across different runs (e.g. Fig. 6(a) and 6(b)) and across different inputs (e.g. Fig. 6(b) and 6(c)).
5 Related work
The Variation Network generalizes many existing models used for controlled input manipulation by providing a unified probabilistic framework for this task. We now review the related literature and discuss the connections with VarNet.
The problem of controlled input manipulation has been considered in the Fader networks paper [18], where the authors are able to modify in a continuous manner the attributes of an input image. Similar to ours, this approach uses an encoder-decoder architecture together with an adversarial loss to decouple templates and attributes. The major difference with VarNet is that this model has a deterministic encoder, which limits the sampling possibilities as discussed in Sect. 4.3. Also, this approach can only deal with fixed attributes, while VarNet is also able to learn meaningful free attributes. In fact, VAEs [16], WAEs [28, 26] and Fader networks can be seen as special cases of VarNet.
Recently, the Style Tokens paper [30] proposed a solution to learn relevant free attributes in the context of text-to-speech. The similarity with our approach is that the authors condition an encoder model on an adaptive basis of style tokens (what we call the attribute space in this work). VarNet borrows this idea but casts it in a probabilistic framework, in which a distribution over the attribute space is imposed and the encoder is stochastic. Our approach also allows fixed attributes to be taken into account which, as we saw, can help shape the free attributes.
The traditional way to explore the latent space of VAEs is to perform linear (or spherical [31]) interpolations between two points. However, there are two major caveats to this approach: always requiring two points in order to explore the latent space is cumbersome, and the interpolation scheme is arbitrary and bears no probabilistic interpretation. Concerning the first point, a common approach is to find, a posteriori, directions in the latent space that account for a particular change of the (fixed) attributes [29]. These directions are then used to move in the latent space. Similarly, [10] proposes a model where these directions of interest are given a priori. Concerning the second point, [17] proposes to compute interpolation paths minimizing some energy functional, which results in interpolation curves rather than straight interpolation lines. However, this interpolation scheme is computationally demanding, since an optimization problem must be solved for each point of the interpolation path.
Another trend in controlled input manipulation is to perform a posteriori analysis on a trained generative model [9, 1, 29, 3] using different means. One possible advantage of these methods compared to ours is that different attribute manipulations can be devised after the training of the generative model. However, these procedures are still costly, which precludes real-time applications where a user provides on the fly the attributes they would like to modify. One of these approaches [3] consists in using the trained decoder and performing gradient descent on an objective which accounts for the constraints or the change of the attributes. Another related approach, proposed in [9], consists in training a Generative Adversarial Network which learns to move in the vicinity of a given point in the latent space so that the decoded output enforces some constraints. The major difference between these two approaches and our work is that their movements are done in a unique latent space, while we consider separate template and attribute spaces. More importantly, these approaches implicitly assume that the variation of interest lies in a neighborhood of the provided input. In [1], the authors introduce an additional latent space, called an interpretable lens, used to interpret the latent space of a generative model. This space shares similarities with our VarNet trained on VAE representations. The authors also propose a joint optimization for their model, where the encoder-decoder architecture and the interpretable lens are learned jointly. The difference with our approach is that the authors optimize an "interpretability" loss which requires labels, and still need to perform a posteriori analysis to find relevant directions in the latent space.
6 Conclusion and future work
We presented the Variation Network, a generative model able to vary high-level attributes of a given input. The novelty is that these attributes can be fixed or learned, and have a sound probabilistic interpretation. Many sampling schemes have been presented together with a detailed discussion and examples. We hope that the flexibility in the design of the attribute function, and the ease, from an implementation point of view, of transforming existing encoder-decoder architectures (it suffices to provide the encoder and decoder networks), will be of interest in many applications.
We saw that our architecture is indeed capable of decoupling templates from learned attributes and that we have three ways of controlling which free attributes are learned: by modifying the hyperparameters of the objective Eq. (6), by carefully devising the attribute functions, or by working on different input representations with different encoder/decoder architectures.
For future work, we would like to extend our approach in two different directions: being able to deal with partially-given fixed attributes, and handling discrete free attributes as in [14]. We also want to investigate the use of stochastic attribute functions. Indeed, it appeared to us that using deterministic attribute functions was crucial, and we would like to deepen our understanding of the interplay between all VarNet components.
References
 [1] (2018) Discovering interpretable representations for both deep generative and discriminative models. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 50–59. External Links: Link Cited by: §2, §5.
 [2] (201804) Understanding disentangling in VAE. ArXiv eprints. External Links: 1804.03599 Cited by: §1.
 [3] (201809) Deep LearningBased Decoding for Constrained Sequence Codes. ArXiv eprints. External Links: 1809.01859 Cited by: §5.
 [4] (2016) Variational lossy autoencoder. CoRR abs/1611.02731. External Links: Link, 1611.02731 Cited by: §3.2.
 [5] (2018) Deep learning for classical Japanese literature. ArXiv e-prints. External Links: cs.CV/1812.01718 Cited by: §4.1.
 [6] (2019) Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789. Cited by: §4.1, §4.4.
 [7] (201606) Adversarially Learned Inference. ArXiv eprints. External Links: 1606.00704 Cited by: §2.2, §2.
 [8] (2018) Featurewise transformations. Distill 3 (7), pp. e11. Cited by: §4.1.
 [9] (2017) Latent constraints: learning to generate conditionally from unconditional generative models. CoRR abs/1711.05772. External Links: Link, 1711.05772 Cited by: §5.
 [10] (2017) GLSRVAE: geodesic latent space regularization for variational autoencoder architectures. CoRR abs/1707.04588. External Links: Link, 1707.04588 Cited by: §5.
 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
 [12] (2016) beta-VAE: learning basic visual concepts with a constrained variational framework. Cited by: §1, §4.1.
 [13] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
 [14] (2019) Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning, pp. 3091–3099. Cited by: §6.
 [15] (2018) RedNet: residual encoder-decoder network for indoor RGB-D semantic segmentation. CoRR abs/1806.01054. External Links: Link, 1806.01054 Cited by: §4.1.
 [16] (201312) Autoencoding variational Bayes. ArXiv eprints. External Links: 1312.6114 Cited by: 1st item, §2, §5.
 [17] (2018) Featurebased metrics for exploring the latent space of generative models. External Links: Link Cited by: §5.
 [18] (2017) Fader networks: manipulating images by sliding attributes. CoRR abs/1706.00409. External Links: Link, 1706.00409 Cited by: 1st item, §1, §2, §5.
 [19] (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.1.
 [20] (201512) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.1.
 [21] (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359. Cited by: §1, §4.4.
 [22] (201511) The Variational Fair Autoencoder. ArXiv eprints. External Links: 1511.00830 Cited by: §1, §3.1.
 [23] (2017) dSprites: disentanglement testing sprites dataset. Note: https://github.com/deepmind/dspritesdataset/ Cited by: §4.1.
 [24] (2018) Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pp. 9084–9093. Cited by: §3.1.
 [25] (2018) Film: visual reasoning with a general conditioning layer. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §4.1.
 [26] (201802) On the Latent Space of Wasserstein AutoEncoders. ArXiv eprints. External Links: 1802.03761 Cited by: 1st item, §2, §5.
 [27] (2018) Learning controllable fair representations. CoRR abs/1812.04218. External Links: Link, 1812.04218 Cited by: 1st item, §1, §1, §3.1, §3.1, §3.1, §3.
 [28] (201711) Wasserstein AutoEncoders. ArXiv eprints. External Links: 1711.01558 Cited by: §2.2, §2, §5.
 [29] (2016) Deep feature interpolation for image content changes. arXiv preprint arXiv:1611.05507. Cited by: §1, §5, §5.
 [30] (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. CoRR abs/1803.09017. External Links: Link, 1803.09017 Cited by: §2.3.1, §2, §5.
 [31] (2016) Sampling generative networks: notes on a few effective techniques. arXiv preprint arXiv:1609.04468. Cited by: §5.
 [32] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv e-prints. External Links: cs.LG/1708.07747 Cited by: §4.1.
 [33] (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pp. 585–596. Cited by: §3.1.
 [34] (2019) Lipschitz generative adversarial nets. CoRR abs/1902.05687. External Links: Link, 1902.05687 Cited by: §2.2.