An Adversarial NeuroTensorial Approach For Learning Disentangled Representations
Abstract
Several factors contribute to the appearance of an object in a visual scene, including pose, illumination, and deformation, among others. Each factor accounts for a source of variability in the data, while the multiplicative interactions of these factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. Disentangling such unobserved factors from visual data is a challenging task, especially when the data have been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label information is not available.
In this paper, we propose the first unsupervised deep learning method (with pseudo-supervision) for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.
1 Introduction
The appearance of visual objects is significantly affected by multiple factors of variability such as, for example, pose, illumination, identity, and expression in the case of faces. Each factor accounts for a source of variability in the data, while their complex interactions give rise to the observed entangled variability. Discovering the modes of variation, or in other words disentangling the latent factors of variation in visual data, is a very important problem at the intersection of statistics, machine learning, and computer vision.
Factor analysis [12] and the closely related Principal Component Analysis (PCA) [16] are probably the most popular statistical methods that find a single mode of variation explaining the data. Nevertheless, visual appearance (e.g., facial appearance) is affected by several modes of variation. Hence, methods such as PCA are not able to identify such multiple factors of variation. For example, when PCA is applied to facial images, the first principal component captures both pose and expression variations.
An early approach for learning different modes of variation in the data is TensorFaces [36]. In particular, TensorFaces is a strictly supervised method as it not only requires the facial data to be labelled (e.g., in terms of expression, identity, illumination etc.) but the data tensor must also contain all samples in all different variations. This is the primary reason that the use of such tensor decompositions is still limited to databases that have been captured in a strictly controlled environment, such as the Weizmann face database [36].
Recent unsupervised tensor decomposition methods [32, 37] automatically discover the modes of variation in unlabelled data. In particular, the most recent one [37] assumes that the original visual data have been produced by a hidden multilinear structure, and the aim of the unsupervised tensor decomposition is to discover both the underlying multilinear structure and the corresponding weights (coefficients) that best explain the data. Special instances of the unsupervised tensor decomposition are the Shape-from-Shading (SfS) decompositions in [17, 31] and the multilinear decompositions for 3D face description in [37]. In [37], it is shown that the method can indeed be used to learn representations where many modes of variation have been disentangled (e.g., identity, expression and illumination). Nevertheless, the method in [37] is not able to find pose variations and bypasses this problem by operating on faces which have been frontalised with a warping function (e.g., piecewise affine warping [25]).
Another promising line of research for discovering latent representations is unsupervised Deep Neural Networks (DNNs). Unsupervised DNN architectures include AutoEncoders (AE) [1], as well as Generative Adversarial Networks (GANs) [13] and adversarial versions of AE, e.g., Adversarial AutoEncoders (AAE) [23]. Even though GANs, as well as AAEs, provide very elegant frameworks for discovering powerful low-dimensional embeddings without having to align the faces, the complexity of the networks means that all modes of variation are unavoidably multiplexed in the latent representation. Only with the use of labels is it possible to model/learn the manifold over the latent representation, usually as a post-processing step [30].
In this paper, we show that it is possible to learn a disentangled representation of the human face captured in arbitrary recording conditions in an unsupervised manner [Footnote 1: our methodology uses the information produced by an automatic 3D face fitting procedure [4], but it does not make use of any labels in the training set] by imposing a multilinear structure on the latent representation of an AAE [30]. To the best of our knowledge, this is the first time that unsupervised tensor decompositions have been combined with DNNs for learning disentangled representations. We demonstrate the power of the proposed approach by showing expression/pose transfer using only the latent variable that is related to expression/pose. We also demonstrate that the disentangled low-dimensional embeddings are useful for many other applications, such as facial expression, pose, and identity recognition and clustering. An example of the proposed approach is given in Fig. 1. In particular, the left pair of images have been decomposed, using the encoder of the proposed neural network, into many different latent representations, including latent representations for pose, illumination, identity and expression. Since our framework has learned a disentangled representation, we can easily transfer the expression by only changing the latent variable related to expression and passing the latent vector into the decoder of our neural network. Similarly, we can transfer the pose merely by changing the latent variable related to pose.
2 Related Work
Learning disentangled representations that explain multiple factors of variation in the data as disjoint latent dimensions is desirable in several machine learning, computer vision, and graphics tasks.
Indeed, bilinear factor analysis models [33] have been employed for disentangling two factors of variation (e.g., head pose and facial identity) in the data. Identity, expression, pose, and illumination variations are disentangled in [36] by applying Tucker decomposition (also known as multilinear Singular Value Decomposition (SVD) [10]) to a tensor carefully constructed from label information. Interestingly, the modes of variation in well-aligned images can be recovered via a multilinear matrix factorization [37] without any supervision. However, inference in [37] might be ill-posed.
More recently, both supervised and unsupervised deep learning methods have been developed for disentangled representation learning. Transforming autoencoders [15] is among the earliest methods for disentangling latent factors by means of autoencoder capsules. In [11], hidden factors of variation are disentangled via inference in a variant of the restricted Boltzmann machine. Disentangled representations of input images are obtained by the hidden layers of deep networks in [8] and through a higher-order Boltzmann machine in [27]. The Deep Convolutional Inverse Graphics Network [20] learns a representation that is disentangled with respect to transformations such as out-of-plane rotations and lighting variations. Methods in [6, 24, 5, 34, 35] extract disentangled and interpretable visual representations by employing adversarial training. The method in [30] disentangles the latent representations of illumination, surface normals, and albedo of face images using an image rendering pipeline. Trained with pseudo-supervision, [30] undertakes multiple image editing tasks by manipulating the relevant latent representations. Nonetheless, this editing approach still requires expression labelling, as well as sufficient sampling of a specific expression.
Here, the proposed network is able to edit the expression of a face image given another single in-the-wild face image of arbitrary expression. Furthermore, we are able to edit the pose of a face in the image, which is not possible in [30].
3 Proposed Method
In this section, we will introduce the main multilinear models used to describe three different image modalities, namely texture, 3D shape and 3D surface normals. To this end, we assume that for each different modality there is a different core tensor but all modalities share the same latent representation of weights regarding identity and expression. During training all the core tensors inside the network are randomly initialised and learnt end-to-end. In the following, we assume that we have a set of facial images (e.g., in the training batch) and their corresponding 3D facial shape, as well as their normals per pixel (the 3D shape and normals have been produced by fitting a 3D model on the 2D image, e.g., [4]).
3.1 Facial Texture
The main assumption here follows from [37]. That is, the rich structure of visual data is a result of multiplicative interactions of hidden (latent) factors, and hence the underlying multilinear structure, as well as the corresponding weights (coefficients) that best explain the data, can be recovered using the unsupervised tensor decomposition [37]. Indeed, following [37], disentangled representations (e.g., identity, expression, and illumination) can be learnt from frontalised facial images. The frontalisation process is performed by applying a piecewise affine transform using the sparse shape recovered by a face alignment process. Inevitably, this process suffers from warping artifacts. Therefore, rather than applying any warping process, we perform the multilinear decomposition only on near frontal faces, which can be automatically detected during the 3D face fitting stage. In particular, assuming a near frontal facial image rasterised in a vector , given a core tensor [Footnote 2. Tensor notation: tensors (i.e., multidimensional arrays) are denoted by calligraphic letters, e.g., . The mode matricisation of a tensor maps to a matrix . The mode vector product of a tensor with a vector is denoted by . The Kronecker product is denoted by and the Khatri-Rao (i.e., column-wise Kronecker) product is denoted by . More details on tensors and multilinear operators can be found in [18].], this can be decomposed as
(1) 
where and are the weights that correspond to illumination, expression and identity respectively. The equivalent form in case that we have a number of images in the batch stacked in the columns of a matrix is
(2) 
where is a mode-1 matricisation of tensor and , and are the corresponding matrices that gather the weights of the decomposition for all images in the batch. That is, stacks the latent variables of expressions of the images, stacks the latent variables of identity and stacks the latent variables of illumination.
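The relationship between the per-image form (1) and the batch form (2) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the names `core`, `Z_l`, `Z_exp`, `Z_id`, the helper `khatri_rao`, and all sizes are hypothetical, and the ordering of the three factors as well as the row-major unfolding convention are assumptions chosen so that the two forms agree.

```python
import numpy as np

def khatri_rao(*mats):
    """Column-wise Kronecker product of matrices that share a column count."""
    n = mats[0].shape[1]
    out = mats[0]
    for m in mats[1:]:
        # Pair every row of `out` with every row of `m`, column by column.
        out = np.einsum("in,jn->ijn", out, m).reshape(-1, n)
    return out

rng = np.random.default_rng(0)
d, k_l, k_exp, k_id, n = 12, 3, 4, 5, 6            # hypothetical sizes

core = rng.standard_normal((d, k_l, k_exp, k_id))  # stand-in core tensor
Z_l = rng.standard_normal((k_l, n))                # illumination weights, one column per image
Z_exp = rng.standard_normal((k_exp, n))            # expression weights
Z_id = rng.standard_normal((k_id, n))              # identity weights

# Single image: contract modes 2, 3 and 4 of the core tensor with one weight vector each.
x0 = np.einsum("dabc,a,b,c->d", core, Z_l[:, 0], Z_exp[:, 0], Z_id[:, 0])

# Batch: mode-1 matricisation times the Khatri-Rao product of the weight matrices.
core_1 = core.reshape(d, -1)                       # row-major unfolding, matches khatri_rao's ordering
X = core_1 @ khatri_rao(Z_l, Z_exp, Z_id)

print(np.allclose(X[:, 0], x0))
```

Each column of `X` is the synthesis of one image, so the batch form is simply the per-image form applied to every column at once.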
3.2 3D Facial Shape
It is quite common to use a bilinear model for disentangling identity and expression in 3D facial shape [3]. Hence, for 3D shape we assume that there is a different core tensor and each 3D facial shape can be decomposed as:
(3) 
where and are exactly the same weights as in the texture decomposition (2). The tensor decomposition for the images in the batch is therefore written as
(4) 
where is a mode-1 matricisation of tensor .
3.3 Facial Normals
The tensor decomposition we opted to use for facial normals was exactly the same as the texture, hence we can use the same core tensor and weights. The difference is that since facial normals do not depend on illumination parameters (assuming a Lambertian illumination model), we just need to replace the illumination weights with a constant [Footnote 3: this is also the way that normals are computed in [37], up to a scaling factor]. Thus, the decomposition for normals can be written as
(5) 
where is a matrix of ones.
3.4 3D Facial Pose
Finally, we define another latent variable regarding 3D pose. This latent variable represents a 3D rotation. We denote by an image at index . The indexing is denoted in the following by the superscript. The corresponding can be reshaped into a rotation matrix . As proposed in [39], we apply this rotation to the feature of the image created by 2-way synthesis (explained in Section 3.5). This feature vector is the corresponding column of the feature matrix resulting from the 2-way synthesis . We denote this feature vector corresponding to a single image as . Next is reshaped into a matrix and left-multiplied by . After another round of vectorisation, the resulting feature becomes the input of the decoders for normal and albedo. This transformation from feature vector to the rotated feature is called rotation.
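The rotation step described above (reshape, left-multiply, vectorise) can be sketched as follows. This is an illustrative sketch only: the function name `rotate_feature`, the 9-dimensional pose code, and the row-major layout with three rows are assumptions, since the exact feature dimensions are not restated here.

```python
import numpy as np

def rotate_feature(feature, z_pose):
    """Apply a pose code, read as a 3x3 matrix, to a feature vector.

    Assumes len(z_pose) == 9 and len(feature) divisible by 3 (hypothetical layout).
    """
    R = np.asarray(z_pose, dtype=float).reshape(3, 3)      # pose code -> rotation matrix
    F = np.asarray(feature, dtype=float).reshape(3, -1)    # feature vector -> 3 x (k/3) matrix
    return (R @ F).reshape(-1)                             # left-multiply, then vectorise again

# Example: a 30-degree rotation about the z-axis applied to a 12-dimensional feature.
theta = np.pi / 6
R_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
feature = np.arange(12, dtype=float)
rotated = rotate_feature(feature, R_z.reshape(-1))
```

Because a rotation matrix is orthogonal, the operation preserves the norm of the feature and is undone by applying the transposed matrix.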
3.5 Network Architecture
We incorporate the structure imposed by Equations (2), (4) and (5) into an autoencoder network, see Figure 2. For some matrices , we refer to the operation as 2-way synthesis and as 3-way synthesis. The multiplication of a feature matrix by or , mode-1 matricisations of tensors and , is referred to as projection and can be represented by an unbiased fully-connected layer.
Our network follows the architecture of [30]. The encoder receives an input image and the convolutional encoder stack first encodes it into , an intermediate latent variable vector of size . is then transformed into latent codes for background , mask , illumination , pose , identity and expression via fully-connected layers.
(6) 
The decoder takes in the latent codes as input. and ( vectors) are directly passed into convolutional decoder stacks to estimate background and face mask respectively. The remaining latent variables follow 3 streams:

1. ( vector) and ( vector) are joined by 2-way synthesis and projection to estimate facial shape .
2. The result of 2-way synthesis of and is rotated using . The rotated feature is passed into 2 different convolutional decoder stacks: one for normal estimation and another for albedo. Using the estimated normal map, albedo, illumination component , mask and background, we render a reconstructed image .
3. , and are combined by a 3-way synthesis and projection to estimate frontal normal map and a frontal reconstruction of the image.
Streams 1 and 3 drive the disentangling of expression and identity components, while stream 2 focuses on the reconstruction of the image by adding the pose components.
(7) 
Our input images are aligned and cropped facial images from the CelebA database [21] of size , so . , , and . More details on the network such as the convolutional encoder stacks and decoder stacks can be found in the supplementary material.
3.6 Training
We use in-the-wild face images for training. Hence, we only have access to the image itself () while ground truth labelling for pose, illumination, normal, albedo, expression, identity or 3D shape is unavailable. The main loss function is the reconstruction loss of the image :
(8) 
where is the reconstructed image, is the reconstruction loss, and are regularisation weights, represents the adversarial loss and the verification loss. We use the pre-trained verification network [40] to find face embeddings of our images and . As both images are supposed to represent the same person, we minimise the cosine distance between the embeddings: . Simultaneously, a discriminative network is trained to distinguish between the generated and real images [13]. We incorporate the discriminative information by following the autoencoder loss distribution matching approach of [2]. The discriminative network is itself an autoencoder trying to reconstruct the input image so the adversarial loss is . is trained to minimise .
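As an illustration of the verification term, the cosine distance between two face embeddings can be computed as below. This is a minimal sketch under assumptions: the function name `cosine_distance` is hypothetical, the embeddings would in practice come from the verification network [40] (not reproduced here), and the definition "one minus cosine similarity" is the common convention rather than a detail confirmed by the text.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """One minus the cosine similarity of two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)  # guard against zero vectors
    return 1.0 - float(a @ b) / denom

# Stand-in embeddings of an input image and its reconstruction.
emb_input = np.array([0.2, 0.9, -0.4])
emb_recon = np.array([0.25, 0.85, -0.35])
loss_verification = cosine_distance(emb_input, emb_recon)
```

The distance is close to zero when the two embeddings point in the same direction, regardless of their magnitudes, which is why it suits an identity-preservation penalty.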
As fully unsupervised training often results in semantically meaningless latent representations, Shu et al. [30] proposed to train with “pseudo ground truth” values for normals, lighting and 3D facial shape. We adopt here this technique and introduce further “pseudo ground truth” values for pose , expression and identity . , and are obtained by fitting coarse face geometry to every image in the training set using a 3D Morphable Model [4]. We incorporated the constraints used in [30] for illumination, normals and albedo. Hence, the following new objectives are introduced:
(9) 
where is a 3D camera rotation matrix.
(10) 
where fc() is a fully-connected layer and is a “pseudo ground truth” vector representing 3DMM expression components of the image .
(11) 
where fc() is a fully-connected layer and is a “pseudo ground truth” vector representing 3DMM identity components of the image .
Multilinear Losses
Directly applying the above losses as constraints to the latent variables does not result in a well-disentangled representation. To achieve a better performance, we impose a tensor structure on the image using the following losses:
(12) 
where is the 3D facial shape of the fitted model.
(13) 
where is a semi-frontal face image. During training, is only applied on near-frontal face images filtered using .
(14) 
where is a near frontal normal map. During training, the loss is only applied on near frontal normal maps.
The model is trained end-to-end by applying gradient descent to batches of images, where Equations (12), (13) and (14) are written in the following general form:
(15) 
where is the number of modes of variation, is a data matrix, is the mode-1 matricisation of a tensor and are the latent variable matrices.
The partial derivative of (15) with respect to the latent variable is computed as follows. Let be the vectorised , be the vectorised , and ; then (15) is equivalent to:
(16)  
Consequently, the partial derivative of (15) with respect to is obtained by matricising the partial derivative of (16) with respect to , which is easy to compute analytically. The derivation can be found in the supplemental material. To compute the above-mentioned operations efficiently, TensorLy [19] has been employed.
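To make the gradient computation concrete, the sketch below evaluates a multilinear loss of the general form described above for three modes and derives the gradient with respect to one latent matrix via the chain rule, then checks it against a finite difference. It is an illustration under stated assumptions, not the derivation from the supplemental material: the loss is assumed to be a squared Frobenius norm, and the names `B1`, `Z1`, `Z2`, `Z3`, `khatri_rao3`, `multilinear_loss` and `grad_wrt_Z2` are hypothetical. It uses plain NumPy rather than TensorLy.

```python
import numpy as np

def khatri_rao3(Z1, Z2, Z3):
    """Column-wise Kronecker product of three matrices with the same number of columns."""
    n = Z1.shape[1]
    return np.einsum("an,bn,cn->abcn", Z1, Z2, Z3).reshape(-1, n)

def multilinear_loss(X, B1, Z1, Z2, Z3):
    """Squared Frobenius norm of the residual X - B1 (Z1 kr Z2 kr Z3) (assumed loss form)."""
    R = X - B1 @ khatri_rao3(Z1, Z2, Z3)
    return float(np.sum(R ** 2))

def grad_wrt_Z2(X, B1, Z1, Z2, Z3):
    """Analytic gradient of multilinear_loss with respect to Z2."""
    n = X.shape[1]
    R = X - B1 @ khatri_rao3(Z1, Z2, Z3)
    # Gradient with respect to the Khatri-Rao product, folded back into a 4-way array.
    G = (-2.0 * B1.T @ R).reshape(Z1.shape[0], Z2.shape[0], Z3.shape[0], n)
    # Chain rule: entry (b, n) of Z2 is multiplied by Z1[a, n] * Z3[c, n] in the product.
    return np.einsum("abcn,an,cn->bn", G, Z1, Z3)

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 4))
Z1, Z2, Z3 = (rng.standard_normal((k, 4)) for k in (2, 3, 2))
B1 = rng.standard_normal((7, 2 * 3 * 2))

# Central finite difference on a single entry of Z2 as a sanity check.
g = grad_wrt_Z2(X, B1, Z1, Z2, Z3)
E = np.zeros_like(Z2)
E[1, 2] = 1e-6
numeric = (multilinear_loss(X, B1, Z1, Z2 + E, Z3)
           - multilinear_loss(X, B1, Z1, Z2 - E, Z3)) / 2e-6
```

The same pattern gives the gradients for the other latent matrices by changing which factor is left out of the final contraction.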
4 Proof of Concept Experiments
We develop a lighter version of our proposed network, a proof-of-concept network (visualised in Figure 3), to show that our network is able to learn and disentangle pose, expression and identity.
In order to showcase the ability of the network, we leverage our newly proposed 4DFAB database [7], where subjects were invited to attend four sessions at different times in a span of five years. In each experiment session, the subject was asked to articulate 6 different facial expressions (anger, disgust, fear, happiness, sadness, surprise), and we manually select the most expressive mesh (i.e., the apex frame) for this experiment. In total, 1795 facial meshes from 364 recording sessions (with 170 unique identities) are used. We keep 148 identities for training and leave 22 identities for testing. Note that there is no overlap of identities between the two sets. Within the training set, we synthetically augment each facial mesh by generating new facial meshes with 20 randomly selected expressions. Our training set contains in total 35900 meshes. The test set contains 387 meshes. For each mesh, we have the ground truth facial texture as well as expression and identity components of the 3DMM model.
4.1 Disentangling Expression and Identity
We create frontal images of the facial meshes. Hence there is no illumination or pose variation in this training dataset. We train a lighter, proof-of-concept version of our network (visualised in Figure 3), obtained by removing the illumination and pose streams, on this synthetic dataset.
4.1.1 Expression Editing
We show the disentanglement between expression and identity by transferring the expression of one person to another.
For this experiment, we work with unseen data (a hold-out set consisting of 22 unseen identities) and no labels. We first encode both input images and :
(17)  
where is our encoder and and are the latent representations of expression and identity respectively.
Assuming we want to emulate the expression of , we decode on:
(18) 
where is our decoder. The resulting becomes our edited image where has the expression of . Figure 4 shows how the network is able to separate expression and identity. The edited images clearly maintain the identity while expression changes.
4.1.2 3D Reconstruction and Facial Texture
The latent variables and that our network learns are extremely meaningful. Not only can they be used to reconstruct the image in 2D, but also they can be mapped into the expression () and identity () components of a 3DMM model. This mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with and , we are able to reconstruct the 3D mesh of a face given a single input image. We compare these reconstructed meshes against the ground truth 3DMM used to create the input image in Figure 5.
At the same time, the network is able to learn a mapping from to facial texture. Therefore, we can predict the facial texture given a single input image. We compare the reconstructed facial texture with the ground truth facial texture in Figure 6.
4.2 Disentangling Pose, Expression and Identity
Our synthetic training set contains in total 35900 meshes. For each mesh, we have the ground truth facial texture as well as expression and identity components of the 3DMM, from which we create a corresponding image with one of 7 given poses. As there is no illumination variation in this training set, we train a proof-of-concept network (visualised in Figure 3b), obtained by removing the illumination stream, on this synthetic dataset.
4.2.1 Pose Editing
We show the disentanglement between pose, expression and identity by transferring the pose of one person to another. Figure 7 shows how the network is able to separate pose from expression and identity. This experiment highlights the ability of our proposed network to learn large pose variations even from profile to frontal faces.
5 Experiments in-the-wild
We train our network on in-the-wild data and perform several experiments on unseen data to show that our network is indeed able to disentangle illumination, pose, expression and identity.
We edit expression or pose by swapping the latent expression/pose component learnt by the encoder (Eq. (6)) with the latent expression/pose component predicted from another image. We feed the decoder (Eq. (7)) with the modified latent component to retrieve our edited image.
5.1 Expression and Pose Editing in-the-wild
Given two in-the-wild images of faces, we are able to transfer the expression or pose of one person to another. Transferring the expression between two different facial images without fitting a 3D model is a very challenging problem. Generally, it is considered in the context of the same person under an elaborate blending framework [41] or by transferring certain classes of expressions [29].
For this experiment, we work with completely unseen data (a hold-out set of CelebA) and no labels. We first encode both input images and :
(19)  
where is our encoder and , , are the latent representations of expression, identity and pose respectively.
Assuming we want to take on the expression or pose of , we then decode on:
(20)  
where is our decoder.
The resulting then becomes our result image where has the expression of . is the edited image where changed to the pose of .
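The editing procedure of Eqs. (19) and (20) amounts to encoding both images, replacing a single latent code, and decoding. The toy sketch below illustrates only this swapping logic: `toy_encode` and `toy_decode` are hypothetical stand-ins for the trained encoder and decoder (they merely split and concatenate a vector), and the latent sizes are arbitrary.

```python
import numpy as np

K_EXP, K_ID, K_POSE = 3, 4, 3    # hypothetical latent sizes

def toy_encode(image_vec):
    """Stand-in encoder: splits a vector into expression, identity and pose codes."""
    v = np.asarray(image_vec, dtype=float)
    return {"exp": v[:K_EXP].copy(),
            "id": v[K_EXP:K_EXP + K_ID].copy(),
            "pose": v[K_EXP + K_ID:K_EXP + K_ID + K_POSE].copy()}

def toy_decode(codes):
    """Stand-in decoder: concatenates the codes back into a vector."""
    return np.concatenate([codes["exp"], codes["id"], codes["pose"]])

def transfer(codes_a, codes_b, factor):
    """Keep every latent code of image A except `factor`, which is taken from image B."""
    edited = dict(codes_a)            # shallow copy, so codes_a itself is not modified
    edited[factor] = codes_b[factor]
    return edited

codes_a = toy_encode(np.arange(10.0))
codes_b = toy_encode(np.arange(10.0) + 100.0)

expr_edit = toy_decode(transfer(codes_a, codes_b, "exp"))    # A with the expression of B
pose_edit = toy_decode(transfer(codes_a, codes_b, "pose"))   # A with the pose of B
```

With a disentangled representation, only the swapped factor should change in the decoded image, which is the property the editing experiments test.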
As there is currently no prior work for this expression editing experiment without fitting an AAM [9] or 3DMM, we used the image synthesised by the 3DMM fitted models as a baseline, which indeed performs quite well. Compared with our method, other very closely related works [37, 30] are not able to disentangle illumination, pose, expression and identity. In particular, [30] disentangles illumination of an image while [37] disentangles illumination, expression and identity from “frontalised” images. Hence they are not able to disentangle pose. None of these methods can be applied to the expression/pose editing experiments on a dataset that contains pose variations such as CelebA. If [37] is applied directly on our test images, it would not be able to perform expression editing well, as shown by Figure 9.
For the 3DMM baseline, we fit a shape model to both images and extract the expression components of the model. We then generate a new face shape using the expression components of one face and the identity components of another face in the same 3DMM setting. This technique has much higher overhead than our proposed method as it requires time-consuming 3DMM fitting of the images. Our expression editing results and the baseline results are shown in Figure 8. Though the baseline is very strong, it does not change the texture of the face, which can produce unnatural-looking faces that still show the original expression. Also, the baseline method cannot fill in the inner mouth area. Our editing results show more natural-looking faces.
For pose editing, the background is unknown once the pose has changed, thus, for this experiment, we mainly focus on the face region. Figure 10 shows our pose editing results. For the baseline method, we fit a 3DMM to both images and estimate the rotation matrix. We then synthesise with the rotation of . This technique has high overhead as it requires expensive 3DMM fitting of the images.
5.2 Illumination Editing
We transfer illumination by estimating the normals , albedo and illumination components of the source () and target () images. Then we use and to compute the transferred shading and multiply the new shading by to create the relighted image result . In Figure 11 we show the performance of our method and compare against [30] on illumination transfer. We observe that our method outperforms [30] as we obtain more realistic looking results. We include further comparison images with [30] in the supplemental material.
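The relighting recipe above (source normals and target lighting give a new shading, which is multiplied by the source albedo) can be sketched as follows. The lighting model here is a deliberate simplification and an assumption: a four-coefficient vector holding an ambient term and a directional term, whereas the lighting parameterisation used by the network is not restated in this section and may differ. The function names `shading` and `relight` are hypothetical.

```python
import numpy as np

def shading(normals, light):
    """Ambient term plus the dot product of each normal with a light direction.

    normals: (H, W, 3) array of unit normals; light: (4,) = [ambient, lx, ly, lz].
    """
    return light[0] + normals @ light[1:]

def relight(normals_src, albedo_src, light_tgt):
    """Shade the source normals with the target lighting and modulate the source albedo."""
    s = np.clip(shading(normals_src, light_tgt), 0.0, None)   # no negative shading
    return albedo_src * s[..., None]

# Toy inputs: a flat, camera-facing surface with a constant grey albedo.
H, W = 4, 5
normals_src = np.zeros((H, W, 3))
normals_src[..., 2] = 1.0
albedo_src = np.full((H, W, 3), 0.6)
light_tgt = np.array([0.3, 0.0, 0.0, 0.5])

relit = relight(normals_src, albedo_src, light_tgt)
```

Only the lighting code comes from the target image; normals and albedo stay with the source, so identity and geometry are unchanged by the transfer.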
5.3 3D Reconstruction
The latent variables and that our network learns are extremely meaningful. Not only can they be used to reconstruct the image in 2D, they can be mapped into the expression () and identity () components of a 3DMM. This mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with and , we are able to reconstruct the 3D mesh of a face given a single in-the-wild 2D image. We compare these reconstructed meshes against the fitted 3DMM to the input image.
The results of the experiment are visualised in Figure 12. We observe that the reconstruction is very close to the ground truth. Both techniques though do not capture well the identity of the person in the input image due to a known weakness in 3DMM.
5.4 Normal Estimation
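Table 1: Angular error of the estimated surface normals on Photoface.

| Method | Mean ± Std against [38] | <35° | <40° |
|---|---|---|---|
| [37] | 33.37° ± 3.29° | 75.3% | 96.3% |
| [30] | 30.09° ± 4.66° | 84.6% | 98.1% |
| Proposed | 28.67° ± 5.79° | 89.1% | 96.3% |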
We evaluate our method on the surface normal estimation task on the Photoface [42] dataset, which has information about illumination. Taking the normals found using calibrated Photometric Stereo [38] as “ground truth”, we calculate the angular error between our estimated normals and the “ground truth”. Figure 13 and Table 1 quantitatively evaluate our proposed method against prior works [37, 30] in the normal estimation task. We observe that our proposed method performs on par with or outperforms previous methods.
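The angular error metric can be sketched as below. This is an illustrative sketch: the function names `angular_error_deg` and `summarise` are hypothetical, and the way errors are aggregated (here, mean, standard deviation and the percentage of entries below each threshold over one flat array) is an assumption, since the exact aggregation over pixels and images is not spelled out in the text.

```python
import numpy as np

def angular_error_deg(n_est, n_gt, eps=1e-12):
    """Angle in degrees between corresponding normals; inputs have shape (..., 3)."""
    n_est = np.asarray(n_est, dtype=float)
    n_gt = np.asarray(n_gt, dtype=float)
    # Normalise both sets of normals so the dot product is a cosine.
    n_est = n_est / np.maximum(np.linalg.norm(n_est, axis=-1, keepdims=True), eps)
    n_gt = n_gt / np.maximum(np.linalg.norm(n_gt, axis=-1, keepdims=True), eps)
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def summarise(errors, thresholds=(35.0, 40.0)):
    """Mean, standard deviation and percentage of errors below each threshold."""
    errors = np.asarray(errors, dtype=float)
    stats = {"mean": float(errors.mean()), "std": float(errors.std())}
    for t in thresholds:
        stats[f"<{t:g}"] = float((errors < t).mean() * 100.0)
    return stats

# Toy example: one perfectly estimated normal and one that is off by 90 degrees.
est = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
errors = angular_error_deg(est, gt)
stats = summarise(errors)
```

Clipping the cosine to [-1, 1] avoids NaNs from rounding error when two normals are almost identical.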
5.5 Quantitative Evaluation of the Latent Space
We want to test whether our latent space corresponds well to the variation that it is supposed to learn. For our quantitative experiment, we used Multi-PIE [14] as our test dataset. This dataset contains labelled variations in identity, expressions and pose. Disentanglement of variations in Multi-PIE is particularly challenging as its images are captured under laboratory conditions, which are quite different from those of our training images. As a matter of fact, the expressions contained in Multi-PIE do not correspond to the 7 basic expressions and can be easily confused.
We encoded 10368 images of the Multi-PIE dataset with 54 identities, 6 expressions and 7 poses and trained a linear SVM classifier using 90% of the identity labels and the latent variables . We then test on the remaining 10% to check whether they are discriminative for identity classification. We use 10-fold cross-validation to evaluate the accuracy of the learnt classifier. We repeat this experiment for expression with and pose with respectively. Our results in Table 2 show that our latent representation is indeed discriminative. This experiment showcases the discriminative power of our latent representation on a previously unseen dataset. In order to quantitatively compare with [37], we run another experiment on only frontal images of the dataset with 54 identities, 6 expressions and 16 illuminations. The results in Table 3 show that our proposed model outperforms [37] in these classification tasks. Our latent representation has stronger discriminative power than the one learnt by [37].
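Table 2: Classification accuracy on Multi-PIE using the latent variables.

| | Identity | Expression | Pose |
|---|---|---|---|
| Accuracy | 83.85% | 86.07% | 95.73% |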
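Table 3: Classification accuracy on frontal Multi-PIE images, compared against [37].

| Factor | Proposed | [37] |
|---|---|---|
| Identity | 99.33% | 19.18% |
| Expression | 78.92% | 35.49% |
| Illumination | 64.11% | 48.85% |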
We visualise, using t-SNE [22], the latent and encoded from Multi-PIE according to their expression and pose label and compare against the latent representation learnt by an in-house large-scale adversarial autoencoder of similar architecture trained with 2 million faces [23]. Figures 14 and 15 show that even though our encoder has not seen any images of Multi-PIE, it manages to create informative latent representations that cluster expression and pose well (contrary to the representation learned by the tested autoencoder).
6 Conclusion
We proposed the first, to the best of our knowledge, attempt to jointly disentangle modes of variation that correspond to expression, identity, illumination and pose using no explicit labels regarding these attributes. More specifically, we proposed the first, as far as we know, approach that combines a powerful Deep Convolutional Neural Network (DCNN) architecture with unsupervised tensor decompositions. We demonstrate the power of our methodology in expression and pose transfer, as well as discovering powerful features for pose and expression classification.
References
 [1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 [2] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 [3] T. Bolkart and S. Wuhrer. A robust multilinear model learning framework for 3d faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4911–4919, 2016.
 [4] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou. 3d face morphable models “in-the-wild”. arXiv preprint arXiv:1701.05360, 2017.
 [5] C. Wang, C. Wang, C. Xu, and D. Tao. Tag disentangled generative adversarial network for object image re-rendering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2901–2907, 2017.
 [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [7] S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou. 4dfab: A large scale 4d facial expression database for biometric applications. arXiv preprint arXiv:1712.01443, 2017.
 [8] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
 [9] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.
 [10] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
 [11] G. Desjardins, A. Courville, and Y. Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
 [12] L. R. Fabrigar and D. T. Wegener. Exploratory factor analysis. Oxford University Press, 2011.
 [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [14] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
 [15] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming autoencoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
 [16] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
 [17] I. Kemelmacher-Shlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.
 [18] T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, 2008.
 [19] J. Kossaifi, Y. Panagakis, and M. Pantic. TensorLy: Tensor learning in Python. arXiv e-print.
 [20] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
 [21] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [22] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [23] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [24] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
 [25] I. Matthews and S. Baker. Active appearance models revisited. International journal of computer vision, 60(2):135–164, 2004.
 [26] H. Neudecker. Some theorems on matrix differentiation with special reference to kronecker matrix products. Journal of the American Statistical Association, 64(327):953–963, 1969.
 [27] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1431–1439, Beijing, China, 22–24 Jun 2014. PMLR.
 [28] F. Roemer. Advanced algebraic concepts for efficient multichannel signal processing. PhD thesis, Universitätsbibliothek Ilmenau, 2012.
 [29] C. Sagonas, Y. Panagakis, A. Leidinger, S. Zafeiriou, et al. Robust joint and individual variance explained. In Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2017.
 [30] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), 2017.
 [31] P. Snape, Y. Panagakis, and S. Zafeiriou. Automatic construction of robust spherical harmonic subspaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 91–100, 2015.
 [32] Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In International Conference on Machine Learning, pages 163–171, 2013.
 [33] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Comput., 12(6):1247–1283, June 2000.
 [34] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In The IEEE International Conference on Computer Vision (ICCV), 2017.
 [35] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, volume 4, page 7, 2017.
 [36] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer, 2002.
 [37] M. Wang, Y. Panagakis, P. Snape, S. Zafeiriou, et al. Learning the multilinear structure of visual data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4592–4600, 2017.
 [38] R. J. Woodham. Photometric method for determining surface orientation from multiple images, 1980.
 [39] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 [40] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. arXiv preprint arXiv:1511.02683, 2015.
 [41] F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas. Expression flow for 3D-aware face component transfer. In ACM Transactions on Graphics (TOG), volume 30, page 60. ACM, 2011.
 [42] S. Zafeiriou, G. A. Atkinson, M. F. Hansen, W. A. P. Smith, V. Argyriou, M. Petrou, M. L. Smith, and L. N. Smith. Face recognition and verification using photometric stereo: The photoface database and a comprehensive evaluation. IEEE Transactions on Information Forensics and Security, 8(1):121–135, 2013.
Appendix A Network Details
The convolutional encoder stack (Fig. 2) is composed of three convolutions with , and filter sets. Each convolution is followed by max-pooling and a thresholding nonlinearity. We pad the filter responses so that the final output of the convolutional stack is a set of filter responses with size for an input image . The pooling indices of the max-pooling are preserved for the unpooling layers in the decoder stack.
The decoder stacks for the mask and background are strictly symmetric to the encoder stack and have skip connections to the input encoder stack at the corresponding unpooling layers. These skip connections between the encoder and the decoder allow for the details of the background to be preserved.
The other decoder stacks use upsampling and are also strictly symmetric to the encoder stack.
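To make the role of the preserved pooling indices concrete, the following is a minimal sketch in plain Python of max-pooling that records where each maximum came from, together with the matching unpooling step. The 2×2 window with stride 2, the function names `max_pool_with_indices` and `max_unpool`, and the toy feature map are assumptions made for illustration; they are not taken from the network described above.

```python
# Minimal sketch of max-pooling that records argmax positions, and the
# matching unpooling step. The 2x2 window with stride 2 is an assumption
# made for illustration only.

def max_pool_with_indices(x, k=2):
    """Pool a 2D list `x` with non-overlapping k x k windows.

    Returns (pooled, indices), where indices[i][j] is the (row, col)
    position in `x` of the maximum of window (i, j).
    """
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for i in range(0, h - h % k, k):
        pooled_row, index_row = [], []
        for j in range(0, w - w % k, k):
            window = [(x[r][c], (r, c))
                      for r in range(i, i + k)
                      for c in range(j, j + k)]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(pos)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, shape):
    """Scatter pooled values back to their recorded positions; zeros elsewhere."""
    h, w = shape
    out = [[0.0] * w for _ in range(h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


# Toy 4x4 feature map (made-up values).
feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_with_indices(feature_map)
restored = max_unpool(pooled, indices, (4, 4))
```

In this sketch the unpooled map is sparse: each maximum returns to its original location and every other position is zero, which is what lets a symmetric decoder place activations where the encoder found them.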
Appendix B Derivation Details
The model is trained end-to-end by applying gradient descent to batches of images, where (12), (13) and (14) are written in the following general form:
(15) 
where is a data matrix, is the mode-1 matricisation of a tensor and are the latent variable matrices.
The partial derivative of (15) with respect to the latent variable is computed as follows: Let be a vectorisation of , then (15) is equivalent to:
(21)  
as the squared Frobenius norm of a matrix and the squared L2 norm of its vectorisation are both the sum of all squared elements.
(22)  
as the property holds [26].
Using [28] and let and the following holds:
(23)  
Using [28] and let :
(24)  
Let be a vectorisation of , this becomes:
(16)  
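The first two steps of this derivation rest on standard linear-algebra facts that are easy to check numerically. The sketch below, in plain Python, verifies that the squared Frobenius norm of a matrix equals the squared L2 norm of its column-major vectorisation, and that vec(BZ) = (I ⊗ B) vec(Z), a special case of the well-known identity vec(AXC) = (Cᵀ ⊗ A) vec(X). That this is the exact form of the property taken from [26] is an assumption, and the matrices `B` and `Z` are small made-up examples rather than quantities from the model.

```python
# Numerical check of two standard facts of the kind used in the derivation:
#   (a) ||M||_F^2 == ||vec(M)||_2^2
#   (b) vec(B Z) == (I kron B) vec(Z), with column-major vec
# B and Z are small made-up matrices, not quantities from the model.

def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def vec(m):
    """Stack the columns of m into one flat list (column-major order)."""
    return [m[i][j] for j in range(len(m[0])) for i in range(len(m))]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a[i][j] * b[p][q]
             for j in range(len(a[0])) for q in range(len(b[0]))]
            for i in range(len(a)) for p in range(len(b))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


B = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]            # 3 x 2
Z = [[2.0, 0.0, 1.0, 1.0], [1.0, -1.0, 0.0, 2.0]]    # 2 x 4

BZ = matmul(B, Z)                                    # 3 x 4
lhs = vec(BZ)

# (I_4 kron B) is 12 x 8 and acts on vec(Z) written as a column vector.
vec_z_column = [[v] for v in vec(Z)]
rhs = [row[0] for row in matmul(kron(identity(4), B), vec_z_column)]

frob_sq = sum(v * v for row in BZ for v in row)      # ||BZ||_F^2
l2_sq = sum(v * v for v in lhs)                      # ||vec(BZ)||_2^2
```

Here `lhs` and `rhs` agree entry by entry, and `frob_sq` equals `l2_sq`, which is why the matrix objective can be rewritten as a vector least-squares problem before differentiating.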
Appendix C More expression and pose transfer images
Appendix D Interpolation Results
We interpolate the expression or identity component of the input image on the right-hand side to the corresponding component of the target image on the left-hand side. The interpolation is linear, in steps of 0.1. For the interpolation we do not modify the background so the background remains that of image .
For expression interpolation, we expect the identity and pose to stay the same as in the input image and only the expression to change gradually from the expression of the input image to the expression of the target image. Figure 18 shows the expression interpolation. We can clearly see the change in expression while pose and identity remain constant.
For identity interpolation, we expect the expression and pose to stay the same as in the input image and only the identity to change gradually from the identity of the input image to the identity of the target image. Figure 19 shows the identity interpolation. We can clearly observe the change in identity while other variations remain limited.
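The interpolation schedule described in this appendix can be sketched as follows. This is a minimal illustration in plain Python, assuming each latent component is a flat vector of floats; the function name `interpolate` and the toy vectors `z_exp_input` and `z_exp_target` are hypothetical and only stand in for the expression (or identity) components of the input and target images.

```python
def interpolate(z_input, z_target, step=0.1):
    """Return latent vectors moving linearly from z_input to z_target.

    With step=0.1 this yields 11 vectors, for mixing weights 0.0, 0.1, ..., 1.0.
    """
    n_steps = round(1.0 / step)
    path = []
    for k in range(n_steps + 1):
        alpha = k / n_steps  # computed from k to avoid accumulating float drift
        path.append([(1.0 - alpha) * a + alpha * b
                     for a, b in zip(z_input, z_target)])
    return path


# Toy stand-ins for the expression components of the input and target images.
z_exp_input = [0.0, 2.0, -1.0]
z_exp_target = [1.0, 0.0, 1.0]
path = interpolate(z_exp_input, z_exp_target)
```

Each vector in `path` would then be decoded together with the unchanged remaining components of the input image, so that only the interpolated factor varies along the sequence.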
Appendix E Expression Transfer from Video
We conducted another challenging experiment to test the potential of our method: can we transfer facial expressions from an “in-the-wild” video to a given template image (also “in-the-wild”)? For this experiment, we split the input video into frames and extract the expression component of each frame. We then replace the expression component of the template image with that of each video frame and decode the result. The decoded images form a new video sequence in which the person in the template image takes on the expression of the input video at each frame. The result can be seen here: https://youtu.be/tUTRSrY_ON8. The original video is shown on the left side and the template image on the right side. The result of the expression transfer is the 2nd video from the left. We compare against a baseline (3rd video from the left) in which the template image has been warped to the landmarks of the input video. Our method disentangles expression from pose, so the change is only at the expression level, whereas the baseline can only transform expression and pose together. Our result video also displays expressions that are more natural to the person in the template image. To conclude, we are able to animate a template face using the disentangled facial expression components of a video sequence.
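The per-frame procedure described above can be sketched as a short loop. This is only an illustration under stated assumptions: `encode` and `decode` are hypothetical stand-ins for the trained encoder and decoder, latent components are assumed to be stored in a dictionary with an `"expression"` key, and the toy functions below exist solely so that the sketch runs.

```python
def transfer_expression(frames, template, encode, decode):
    """Animate `template` with the per-frame expression of `frames`.

    `encode(image)` is assumed to return a dict of latent components with at
    least an "expression" key; `decode(components)` maps such a dict back to
    an image. Both are hypothetical stand-ins for the trained networks.
    """
    template_code = encode(template)
    output = []
    for frame in frames:
        code = dict(template_code)  # copy: keep the template's other factors
        code["expression"] = encode(frame)["expression"]
        output.append(decode(code))
    return output


# Toy stand-ins so the sketch runs: an "image" is already a dict of factors.
def toy_encode(image):
    return dict(image)


def toy_decode(code):
    return dict(code)


video = [{"identity": "A", "pose": pose, "expression": expr}
         for pose, expr in [("left", "smile"),
                            ("front", "neutral"),
                            ("right", "surprise")]]
template = {"identity": "B", "pose": "front", "expression": "neutral"}
result = transfer_expression(video, template, toy_encode, toy_decode)
```

In the toy run, every output frame keeps the identity and pose of the template while its expression follows the video, which mirrors the behaviour reported for the method as opposed to the landmark-warping baseline.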
Appendix F Relighting
Appendix G Further Expression Editing Comparison
Figure 21 shows further expression editing comparisons with [37]. The method proposed in [37] does not disentangle pose and hence requires a “frontalisation” of the face to work optimally. Our proposed method, on the other hand, is able to edit expressions directly on aligned images. To visualise the difference, we run [37] directly on our aligned test images and compare with our proposed method. As expected, [37] does not perform well in this setup. Given the same input (aligned images from CelebA), our proposed method edits expression directly, whereas [37] requires a further “frontalisation” transformation to obtain good results.