An Adversarial Neuro-Tensorial Approach For Learning Disentangled Representations


Mengjiao Wang   Zhixin Shu   Shiyang Cheng   Yannis Panagakis   Dimitris Samaras   Stefanos Zafeiriou
Imperial College London   Stony Brook University
{m.wang15,shiyang.cheng11,i.panagakis,s.zafeiriou}@imperial.ac.uk {zhshu,samaras}@cs.stonybrook.edu
Abstract

Several factors contribute to the appearance of an object in a visual scene, including pose, illumination, and deformation, among others. Each factor accounts for a source of variability in the data, while the multiplicative interactions of these factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. Disentangling such unobserved factors from visual data is a challenging task, especially when the data have been captured in uncontrolled recording conditions (also referred to as “in-the-wild”) and label information is not available.

In this paper, we propose the first unsupervised deep learning method (with pseudo-supervision) for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.

1 Introduction

(a) Expression Editing
(b) Pose Editing
Figure 1: Given a single in-the-wild image, our network learns disentangled representations for pose, illumination, expression and identity. Using these representations, we are able to manipulate the image and edit the pose or expression.

The appearance of visual objects is significantly affected by multiple factors of variability such as, for example, pose, illumination, identity, and expression in case of faces. Each factor accounts for a source of variability in the data, while their complex interactions give rise to the observed entangled variability. Discovering the modes of variation, or in other words disentangling the latent factors of variations in visual data, is a very important problem in the intersection of statistics, machine learning, and computer vision.

Factor analysis [12] and the closely related Principal Component Analysis (PCA) [16] are probably the most popular statistical methods for finding a single mode of variation that explains the data. Nevertheless, visual appearance (e.g., facial appearance) is affected by several modes of variation. Hence, methods such as PCA are not able to identify such multiple factors of variation. For example, when PCA is applied to facial images, the first principal component captures both pose and expression variations.

An early approach for learning different modes of variation in the data is TensorFaces [36]. In particular, TensorFaces is a strictly supervised method as it not only requires the facial data to be labelled (e.g., in terms of expression, identity, illumination etc.) but the data tensor must also contain all samples in all different variations. This is the primary reason that the use of such tensor decompositions is still limited to databases that have been captured in a strictly controlled environment, such as the Weizmann face database [36].

Recent unsupervised tensor decomposition methods [32, 37] automatically discover the modes of variation in unlabelled data. In particular, the most recent one [37] assumes that the original visual data have been produced by a hidden multilinear structure, and the aim of the unsupervised tensor decomposition is to discover both the underlying multilinear structure and the corresponding weights (coefficients) that best explain the data. Special instances of the unsupervised tensor decomposition are the Shape-from-Shading (SfS) decompositions in [17, 31] and the multilinear decompositions for 3D face description in [37]. In [37], it is shown that the method can indeed be used to learn representations in which many modes of variation have been disentangled (e.g., identity, expression, illumination, etc.). Nevertheless, the method in [37] is not able to handle pose variations and bypasses this problem by operating on faces that have been frontalised with a warping function (e.g., piecewise affine warping [25]).

Another promising line of research for discovering latent representations is unsupervised Deep Neural Networks (DNNs). Unsupervised DNN architectures include Auto-Encoders (AEs) [1], Generative Adversarial Networks (GANs) [13] and adversarial versions of AEs, e.g., the Adversarial Auto-Encoders (AAE) [23]. Even though GANs, as well as AAEs, provide very elegant frameworks for discovering powerful low-dimensional embeddings without having to align the faces, due to the complexity of the networks all modes of variation are unavoidably multiplexed in the latent representation. Only with the use of labels is it possible to model/learn the manifold over the latent representation, usually as a post-processing step [30].

In this paper, we show that it is possible to learn a disentangled representation of the human face captured in arbitrary recording conditions in an unsupervised manner (our methodology uses the information produced by an automatic 3D face fitting procedure [4], but it does not make use of any labels in the training set) by imposing a multilinear structure on the latent representation of an AAE [30]. To the best of our knowledge, this is the first time that unsupervised tensor decompositions have been combined with DNNs for learning disentangled representations. We demonstrate the power of the proposed approach by showing expression/pose transfer using only the latent variable that is related to expression/pose. We also demonstrate that the disentangled low-dimensional embeddings are useful for many other applications, such as facial expression, pose and identity recognition and clustering. An example of the proposed approach is given in Fig. 1. In particular, the left pair of images has been decomposed, using the encoder of the proposed neural network, into many different latent representations, including latent representations for pose, illumination, identity and expression. Since our framework has learned a disentangled representation, we can easily transfer the expression by only changing the latent variable related to expression and passing the latent vector into the decoder of our neural network. Similarly, we can transfer the pose merely by changing the latent variable related to pose.

2 Related Work

Figure 2: Our network is an end-to-end trained auto-encoder. The encoder extracts latent variables corresponding to illumination, pose, expression and identity from the input image. These latent variables are then fed into the decoder to reconstruct the image. We impose a multilinear structure and enforce the disentangling of variations. The grey triangles represent the losses: the adversarial loss and the reconstruction losses.

Learning disentangled representations that explain multiple factors of variation in the data as disjoint latent dimensions is desirable in several machine learning, computer vision, and graphics tasks.

Indeed, bilinear factor analysis models [33] have been employed for disentangling two factors of variation (e.g., head pose and facial identity) in the data. Identity, expression, pose and illumination variations are disentangled in [36] by applying Tucker decomposition (also known as multilinear Singular Value Decomposition (SVD) [10]) to a carefully constructed tensor using label information. Interestingly, the modes of variation in well-aligned images can be recovered via a multilinear matrix factorisation [37] without any supervision. However, inference in [37] might be ill-posed.

More recently, both supervised and unsupervised deep learning methods have been developed for disentangled representation learning. Transforming auto-encoders [15] are among the earliest methods for disentangling latent factors, by means of auto-encoder capsules. In [11], hidden factors of variation are disentangled via inference in a variant of the restricted Boltzmann machine. Disentangled representations of input images are obtained by the hidden layers of deep networks in [8] and through a higher-order Boltzmann machine in [27]. The methods in [6, 24, 5, 34, 35] extract disentangled and interpretable visual representations by employing adversarial training. The Deep Convolutional Inverse Graphics Network [20] learns a representation that is disentangled with respect to transformations such as out-of-plane rotations and lighting variations. The method in [30] disentangles the latent representations of illumination, surface normals and albedo of face images using an image rendering pipeline. Trained with pseudo-supervision, [30] undertakes multiple image editing tasks by manipulating the relevant latent representations. Nonetheless, this editing approach still requires expression labelling, as well as sufficient sampling of a specific expression.

Here, the proposed network is able to edit the expression of a face image given another single in-the-wild face image of arbitrary expression. Furthermore, we are able to edit the pose of a face in the image which is not possible in [30].

3 Proposed Method

In this section, we will introduce the main multilinear models used to describe three different image modalities, namely texture, 3D shape and 3D surface normals. To this end, we assume that for each different modality there is a different core tensor but all modalities share the same latent representation of weights regarding identity and expression. During training all the core tensors inside the network are randomly initialised and learnt end-to-end. In the following, we assume that we have a set of facial images (e.g., in the training batch) and their corresponding 3D facial shape, as well as their normals per pixel (the 3D shape and normals have been produced by fitting a 3D model on the 2D image, e.g., [4]).

3.1 Facial Texture

The main assumption here follows from [37]: the rich structure of visual data is the result of multiplicative interactions of hidden (latent) factors, hence both the underlying multilinear structure and the corresponding weights (coefficients) that best explain the data can be recovered using the unsupervised tensor decomposition [37]. Indeed, following [37], disentangled representations (e.g., identity, expression, illumination, etc.) can be learnt from frontalised facial images. The frontalisation process is performed by applying a piecewise affine transform using the sparse shape recovered by a face alignment process. Inevitably, this process suffers from warping artifacts. Therefore, rather than applying any warping process, we perform the multilinear decomposition only on near-frontal faces, which can be automatically detected during the 3D face fitting stage. In particular, assuming a near-frontal facial image rasterised in a vector x and given a core tensor B (tensor notation: tensors, i.e., multidimensional arrays, are denoted by calligraphic letters; the mode-1 matricisation of a tensor B is denoted by B_(1); the mode-n vector product of a tensor B with a vector z is denoted by B ×_n z; the Kronecker product is denoted by ⊗ and the Khatri-Rao, i.e., column-wise Kronecker, product is denoted by ⊙; more details on tensors and multilinear operators can be found in [18]), the image can be decomposed as

x = B ×_2 z_l ×_3 z_e ×_4 z_i ,    (1)

where z_l, z_e and z_i are the weights that correspond to illumination, expression and identity, respectively. The equivalent form, in case we have a number of images in the batch stacked in the columns of a matrix X, is

X = B_(1) (Z_i ⊙ Z_e ⊙ Z_l) ,    (2)

where B_(1) is the mode-1 matricisation of the tensor B, and Z_l, Z_e and Z_i are the corresponding matrices that gather the weights of the decomposition for all images in the batch. That is, Z_e stacks the latent variables of expression of the images, Z_i stacks the latent variables of identity and Z_l stacks the latent variables of illumination.
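As an illustration, the batch decomposition above can be sketched in a few lines of numpy; all dimensions below are hypothetical stand-ins, and `khatri_rao` is our own helper rather than the paper's implementation:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (p x n) and B (q x n) -> (p*q x n)."""
    n = A.shape[1]
    assert B.shape[1] == n
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(n)]).T

# Hypothetical sizes: d pixels, latent sizes k_l/k_e/k_i, batch of n images.
d, k_l, k_e, k_i, n = 32, 2, 3, 4, 5
rng = np.random.default_rng(0)

B1 = rng.standard_normal((d, k_i * k_e * k_l))  # mode-1 matricised core tensor
Z_l = rng.standard_normal((k_l, n))             # illumination weights
Z_e = rng.standard_normal((k_e, n))             # expression weights
Z_i = rng.standard_normal((k_i, n))             # identity weights

# Each column of X is one near-frontal image generated by the model.
X = B1 @ khatri_rao(khatri_rao(Z_i, Z_e), Z_l)
print(X.shape)  # (32, 5)
```

Each column of the Khatri-Rao product is the Kronecker product of the corresponding per-image weight vectors, so the batch form reduces to a single matrix multiplication.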

3.2 3D Facial Shape

It is quite common to use a bilinear model for disentangling identity and expression in 3D facial shape [3]. Hence, for the 3D shape we assume that there is a different core tensor C, and each 3D facial shape s can be decomposed as:

s = C ×_2 z_e ×_3 z_i ,    (3)

where z_e and z_i are exactly the same weights as in the texture decomposition. The tensor decomposition for the images in the batch is therefore written as

S = C_(1) (Z_i ⊙ Z_e) ,    (4)

where C_(1) is the mode-1 matricisation of tensor C.

3.3 Facial Normals

The tensor decomposition we opted to use for the facial normals is exactly the same as for the texture; hence we can use the same core tensor and weights. The difference is that, since facial normals do not depend on the illumination parameters (assuming a Lambertian illumination model), we just need to replace the illumination weights with a constant (this is also the way that normals are computed in [37], up to a scaling factor). Thus, the decomposition for normals can be written as

N = B_(1) (Z_i ⊙ Z_e ⊙ 1) ,    (5)

where 1 is a matrix of ones.

3.4 3D Facial Pose

Finally, we define another latent variable, z_p, regarding 3D pose; this latent variable represents a 3D rotation. In the following, the index of an image in the batch is denoted by a superscript, e.g., x^(j) is the image at index j. The corresponding z_p^(j) can be reshaped into a rotation matrix R^(j). As proposed in [39], we apply this rotation to the feature of the image created by the 2-way synthesis (explained in Section 3.5): this feature vector f^(j) is the j-th column of the feature matrix resulting from the 2-way synthesis. Next, f^(j) is reshaped into a matrix and left-multiplied by R^(j). After another round of vectorisation, the resulting feature becomes the input of the decoders for normals and albedo. This transformation from the feature vector f^(j) to the rotated feature is called rotation.
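A toy numpy version of this rotation step (the 3x3 rotation and the feature width are illustrative stand-ins for the learnt pose code and the synthesis output, not the paper's actual sizes):

```python
import numpy as np

m = 16                               # hypothetical feature width
rng = np.random.default_rng(1)

# A rotation about the z-axis stands in for the learnt pose code z_p.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
z_p = R.reshape(-1)                  # pose code: a flattened 3x3 rotation

f = rng.standard_normal(3 * m)       # feature from the 2-way synthesis

# Rotation step: reshape the feature, left-multiply by R, vectorise again.
F = f.reshape(3, m)
f_rot = (z_p.reshape(3, 3) @ F).reshape(-1)
print(f_rot.shape)  # (48,)
```

Because R is orthogonal, left-multiplying the rotated feature matrix by R.T recovers the original feature, so no information is lost in this step.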

3.5 Network Architecture

We incorporate the structure imposed by Equations (2), (4) and (5) into an auto-encoder network, see Figure 2. For matrices A, B and C, we refer to the operation A ⊙ B as 2-way synthesis and A ⊙ B ⊙ C as 3-way synthesis. The multiplication of a feature matrix by B_(1) or C_(1), the mode-1 matricisations of tensors B and C, is referred to as projection and can be represented by an unbiased fully-connected layer.

Our network follows the architecture of [30]. The encoder receives an input image x and the convolutional encoder stack first encodes it into z_1, an intermediate latent variable vector. z_1 is then transformed into latent codes for background (z_b), mask (z_m), illumination (z_l), pose (z_p), identity (z_i) and expression (z_e) via fully-connected layers.

z_1 = Enc(x) ,  [z_b, z_m, z_l, z_p, z_i, z_e] = fc(z_1) .    (6)
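The fan-out of the intermediate code into per-factor latent codes amounts to one unbiased fully-connected head per factor. A sketch with made-up dimensions (e.g., 9 for a flattened 3x3 rotation; the true sizes are in the supplementary material):

```python
import numpy as np

rng = np.random.default_rng(2)
dim_z1 = 128                         # hypothetical size of the intermediate code

# Hypothetical output sizes for each latent factor.
heads = {"background": 32, "mask": 32, "illumination": 27,
         "pose": 9, "identity": 40, "expression": 20}

z1 = rng.standard_normal(dim_z1)     # output of the convolutional encoder stack

# One unbiased fully-connected layer (a plain matrix) per factor.
W = {name: rng.standard_normal((out, dim_z1)) for name, out in heads.items()}
codes = {name: W[name] @ z1 for name in heads}
print({name: c.shape for name, c in codes.items()})
```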

The decoder takes the latent codes as input. z_b and z_m are directly passed into convolutional decoder stacks to estimate the background and the face mask, respectively. The remaining latent variables follow 3 streams:

  1. z_e and z_i are joined by 2-way synthesis and projection to estimate the facial shape.

  2. The result of the 2-way synthesis of z_e and z_i is rotated using z_p. The rotated feature is passed into 2 different convolutional decoder stacks: one for normal estimation and another for albedo. Using the estimated normal map, albedo, illumination component z_l, mask and background, we render a reconstructed image x̂.

  3. z_e, z_i and z_l are combined by a 3-way synthesis and projection to estimate a frontal normal map and a frontal reconstruction of the image.

Streams 1 and 3 drive the disentangling of expression and identity components, while stream 2 focuses on the reconstruction of the image by adding the pose components.

x̂ = Dec(z_b, z_m, z_l, z_p, z_i, z_e) .    (7)

Our input images are aligned and cropped facial images from the CelebA database [21]. More details on the network, such as the convolutional encoder and decoder stacks and the exact sizes of the input and latent codes, can be found in the supplementary material.

Figure 3: Our proof-of-concept network is an end-to-end trained auto-encoder. The encoder extracts latent variables corresponding to expression and identity from the input image. These latent variables are then fed into the decoder to reconstruct the image. A separate stream also reconstructs the facial texture from these latent variables. We impose a multilinear structure and enforce the disentanglement of variations. In the extended version (b), the encoder also extracts a latent variable corresponding to pose. The decoder takes in this information and reconstructs an image containing pose variations.

3.6 Training

We use in-the-wild face images for training. Hence, we only have access to the image itself (x), while ground truth labelling for pose, illumination, normals, albedo, expression, identity or 3D shape is unavailable. The main loss function is the reconstruction loss of the image:

E = E_recon + λ_adv E_adv + λ_ver E_ver ,    (8)

where E_recon is the reconstruction loss between the image x and the reconstructed image x̂, λ_adv and λ_ver are regularisation weights, E_adv represents the adversarial loss and E_ver the verification loss. We use the pre-trained verification network of [40] to find face embeddings of our images x and x̂. As both images are supposed to represent the same person, we minimise the cosine distance between the embeddings. Simultaneously, a discriminative network D is trained to distinguish between the generated and real images [13]. We incorporate the discriminative information by following the auto-encoder loss distribution matching approach of [2]: the discriminative network is itself an auto-encoder trying to reconstruct the input image, so the adversarial loss is the reconstruction error of D on the generated image, and D is trained as in [2].
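How the terms combine can be sketched as follows (the weights and embedding sizes are placeholders, and the adversarial term is zeroed out for brevity; this is not the trained configuration):

```python
import numpy as np

def cosine_distance(a, b):
    """Verification term: cosine distance between two face embeddings."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(3)
x, x_hat = rng.random((2, 64))           # image and reconstruction, flattened
emb_x, emb_xhat = rng.random((2, 128))   # embeddings from the verification net

lam_adv, lam_ver = 0.1, 0.05             # hypothetical regularisation weights
E_recon = np.mean((x - x_hat) ** 2)
E_adv = 0.0                              # placeholder for the adversarial term
E_ver = cosine_distance(emb_x, emb_xhat)

E = E_recon + lam_adv * E_adv + lam_ver * E_ver
print(E > 0)  # True
```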

As fully unsupervised training often results in semantically meaningless latent representations, Shu et al. [30] proposed to train with “pseudo ground truth” values for normals, lighting and 3D facial shape. We adopt this technique here and introduce further “pseudo ground truth” values for pose (R̃), expression (ẽ) and identity (ĩ), obtained by fitting coarse face geometry to every image in the training set using a 3D Morphable Model (3DMM) [4]. We incorporate the constraints used in [30] for illumination, normals and albedo. Hence, the following new objectives are introduced:

E_pose = ||R − R̃||²_F ,    (9)

where R is the rotation matrix reshaped from z_p and R̃ is the “pseudo ground truth” 3D camera rotation matrix.

E_exp = ||fc(z_e) − ẽ||²₂ ,    (10)

where fc(·) is a fully-connected layer and ẽ is a “pseudo ground truth” vector representing the 3DMM expression components of the image.

E_id = ||fc(z_i) − ĩ||²₂ ,    (11)

where fc(·) is a fully-connected layer and ĩ is a “pseudo ground truth” vector representing the 3DMM identity components of the image.

[Figure 4 panels: (a) Original Image, (b) Expression, (c) Our Recon, (d) Our Exp Edit, (e) Ground Truth; (f)–(j) the same for a second example.]
Figure 4: Our network is able to transfer the expression from one face to another by disentangling the expression components of the images. The ground truth has been computed using the ground truth texture with synthetic identity and expression components.
Figure 5: Given a single image, we infer meaningful expression and identity components to reconstruct a 3D mesh of the face. We compare the reconstruction (last row) against the ground truth.
Figure 6: Given a single image, we infer the facial texture. We compare the reconstructed facial texture (last row) against the ground truth texture.
[Figure 7 panels: (a) Original Image, (b) Pose, (c) Our Recon, (d) Our Pose Edit, (e) Ground Truth; (f)–(j) the same for a second example.]
Figure 7: Our network is able to transfer the pose from one face to another by disentangling the pose, expression and identity components of the images. The ground truth has been computed using the ground truth texture with synthetic pose, identity and expression components.

Multilinear Losses

Directly applying the above losses as constraints to the latent variables does not result in a well-disentangled representation. To achieve better disentanglement, we impose a tensor structure on the image using the following losses:

E_shape = ||S̃ − C_(1) (Z_i ⊙ Z_e)||²_F ,    (12)

where S̃ stacks the 3D facial shapes of the fitted models.

E_tex = ||X_f − B_(1) (Z_i ⊙ Z_e ⊙ Z_l)||²_F ,    (13)

where X_f stacks semi-frontal face images. During training, E_tex is only applied on near-frontal face images, filtered using the estimated pose.

E_normal = ||N_f − B_(1) (Z_i ⊙ Z_e ⊙ 1)||²_F ,    (14)

where N_f stacks near-frontal normal maps. During training, the loss is only applied on near-frontal normal maps.

The model is trained end-to-end by applying gradient descent to batches of images, where Equations (12), (13) and (14) are written in the following general form:

E = ||X − B_(1) (Z_1 ⊙ Z_2 ⊙ ⋯ ⊙ Z_M)||²_F ,    (15)

where M is the number of modes of variation, X is a data matrix, B_(1) is the mode-1 matricisation of a tensor B and Z_1, …, Z_M are the latent variable matrices.

The partial derivatives of (15) with respect to the latent variables are computed as follows. Let x be the vectorised X, b be the vectorised B_(1), and K = Z_1 ⊙ Z_2 ⊙ ⋯ ⊙ Z_M; then (15) is equivalent to:

E = ||x − (Kᵀ ⊗ I) b||²₂ .    (16)

Consequently, the partial derivative of (15) with respect to each variable is obtained by matricising the partial derivative of (16) with respect to its vectorised counterpart, which is easy to compute analytically. The derivation can be found in the supplementary material. To efficiently compute the above-mentioned operations, TensorLy [19] has been employed.
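To make the gradient computation concrete, the following numpy sketch checks the analytic gradient of the two-mode case of the multilinear loss with respect to the matricised core tensor against a finite difference (the paper uses TensorLy; the Khatri-Rao helper here is our own):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product."""
    n = A.shape[1]
    return np.vstack([np.kron(A[:, j], B[:, j]) for j in range(n)]).T

rng = np.random.default_rng(4)
d, k1, k2, n = 6, 2, 3, 4
X = rng.standard_normal((d, n))
B1 = rng.standard_normal((d, k1 * k2))
Z1 = rng.standard_normal((k1, n))
Z2 = rng.standard_normal((k2, n))

K = khatri_rao(Z1, Z2)               # (k1*k2, n)
res = X - B1 @ K                     # residual of the multilinear loss
E = np.sum(res ** 2)

# Analytic gradient of E with respect to the matricised core tensor.
grad_B1 = -2.0 * res @ K.T

# Finite-difference check on a single entry.
eps = 1e-6
B1p = B1.copy()
B1p[0, 0] += eps
Ep = np.sum((X - B1p @ K) ** 2)
print(abs((Ep - E) / eps - grad_B1[0, 0]) < 1e-3)  # True
```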

[Figure 8 panels: (a) Original Image, (b) Expression, (c) Our Recon, (d) Our Exp Edit, (e) Baseline; (f)–(j) the same for a second example; further example pairs follow.]
Figure 8: Our network is able to transfer the expression from one face to another by disentangling the expression components of the images. We compare our expression editing results with a baseline where a 3DMM has been fit to both input images.
[Figure 9 panels: (a) Original Image, (b) Expression, (c) Our Recon, (d) Our Exp Edit, (e) B & W, (f) [37]; (g)–(l) the same for a second example.]
Figure 9: We compare our expression editing results with [37]. As [37] is not able to disentangle pose, editing expressions from images of different poses returns noisy results.
[Figure 10 panels: (a) Original Image, (b) Pose, (c) Our Recon, (d) Our Pose Edit, (e) Baseline; (f)–(j) the same for a second example.]
Figure 10: Our network is able to transfer the pose of one face to another by disentangling the pose components of the images. We compare our pose editing results with a baseline where a 3DMM has been fit to both input images.
[Figure 11 panels: Source, [30], Target, Reconstruction, Result.]
Figure 11: Using the illumination and normals estimated by our network, we are able to relight target faces using illumination from the source image. The source and target shadings are displayed for comparison against the new transferred shading. We compare against [30].
[Figure 12 panels: (a) Input, (b) Reconstruction, (c) Ground Truth; (d)–(f) the same for a second example.]
Figure 12: Given a single image, we infer meaningful expression and identity components to reconstruct a 3D mesh of the face. We compare the reconstruction against the ground truth provided by 3DMM fitting.

4 Proof of Concept Experiments

We develop a lighter version of our proposed network, a proof-of-concept network (visualised in Figure 3), to show that our network is able to learn and disentangle pose, expression and identity.

In order to showcase the ability of the network, we leverage our newly proposed 4DFAB database [7], in which subjects were invited to attend four sessions at different times over a span of five years. In each experimental session, the subject was asked to articulate 6 different facial expressions (anger, disgust, fear, happiness, sadness, surprise), and we manually selected the most expressive mesh (i.e., the apex frame) for this experiment. In total, 1795 facial meshes from 364 recording sessions (with 170 unique identities) are used. We keep 148 identities for training and leave 22 identities for testing; note that there is no overlap of identities between the two sets. Within the training set, we synthetically augment each facial mesh by generating new facial meshes with 20 randomly selected expressions. Our training set contains in total 35900 meshes, and the test set contains 387 meshes. For each mesh, we have the ground truth facial texture as well as the expression and identity components of the 3DMM model.

4.1 Disentangling Expression and Identity

We create frontal images of the facial meshes; hence, there is no illumination or pose variation in this training dataset. On this synthetic dataset we train a lighter version of our network, the proof-of-concept network visualised in Figure 3, obtained by removing the illumination and pose streams.

4.1.1 Expression Editing

We show the disentanglement between expression and identity by transferring the expression of one person to another.

For this experiment, we work with unseen data (a hold-out set consisting of 22 unseen identities) and no labels. We first encode both input images x^(1) and x^(2):

[z_e^(1), z_i^(1)] = Enc(x^(1)) ,  [z_e^(2), z_i^(2)] = Enc(x^(2)) ,    (17)

where Enc is our encoder and z_e and z_i are the latent representations of expression and identity, respectively.

Assuming we want x^(1) to emulate the expression of x^(2), we decode:

x̂ = Dec(z_e^(2), z_i^(1)) ,    (18)

where Dec is our decoder. The resulting x̂ becomes our edited image, in which x^(1) takes on the expression of x^(2). Figure 4 shows how the network is able to separate expression and identity: the edited images clearly maintain the identity while the expression changes.
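The editing procedure reduces to swapping one latent factor before decoding. A toy sketch, with a random linear map standing in for the trained decoder:

```python
import numpy as np

rng = np.random.default_rng(5)
k_e, k_i, d = 3, 4, 20               # hypothetical latent and image sizes

# A random linear map stands in for the trained decoder.
D = rng.standard_normal((d, k_e + k_i))

def decode(z_e, z_i):
    return D @ np.concatenate([z_e, z_i])

# Latent codes the encoder would produce for two images x1 and x2.
z_e1, z_i1 = rng.standard_normal(k_e), rng.standard_normal(k_i)
z_e2, z_i2 = rng.standard_normal(k_e), rng.standard_normal(k_i)

# Expression editing: keep the identity of x1, take the expression of x2.
x_edit = decode(z_e2, z_i1)
print(x_edit.shape)  # (20,)
```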

4.1.2 3D Reconstruction and Facial Texture

The latent variables z_e and z_i that our network learns are semantically meaningful. Not only can they be used to reconstruct the image in 2D, but they can also be mapped into the expression and identity components of a 3DMM model. This mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with the mapped values, we are able to reconstruct the 3D mesh of a face given a single input image. We compare these reconstructed meshes against the ground truth 3DMM used to create the input image in Figure 5.
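The 3D reconstruction step amounts to evaluating a linear 3DMM with the mapped components; a sketch with random bases standing in for a real model:

```python
import numpy as np

rng = np.random.default_rng(6)
n_vertices, k_i, k_e = 100, 4, 3     # hypothetical mesh and component sizes

# Hypothetical linear 3DMM: mean shape plus identity and expression bases.
mean_shape = rng.standard_normal(3 * n_vertices)
U_id = rng.standard_normal((3 * n_vertices, k_i))
U_exp = rng.standard_normal((3 * n_vertices, k_e))

# w_id / w_exp stand in for the components mapped from the latent codes.
w_id = rng.standard_normal(k_i)
w_exp = rng.standard_normal(k_e)

mesh = (mean_shape + U_id @ w_id + U_exp @ w_exp).reshape(n_vertices, 3)
print(mesh.shape)  # (100, 3)
```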

At the same time, the network learns a mapping from the latent variables to the facial texture. Therefore, we can predict the facial texture given a single input image. We compare the reconstructed facial texture with the ground truth texture in Figure 6.

4.2 Disentangling Pose, Expression and Identity

Our synthetic training set contains in total 35900 meshes. For each mesh, we have the ground truth facial texture as well as the expression and identity components of the 3DMM, from which we create a corresponding image with one of 7 given poses. As there is no illumination variation in this training set, we train a proof-of-concept network, obtained by removing the illumination stream and visualised in Figure 3b, on this synthetic dataset.

4.2.1 Pose Editing

We show the disentanglement between pose, expression and identity by transferring the pose of one person to another. Figure 7 shows how the network is able to separate pose from expression and identity. This experiment highlights the ability of our proposed network to learn large pose variations even from profile to frontal faces.

5 Experiments in-the-wild

We train our network on in-the-wild data and perform several experiments on unseen data to show that our network is indeed able to disentangle illumination, pose, expression and identity.

We edit expression or pose by swapping the latent expression/pose component learnt by the encoder (Eq. (6)) with the latent expression/pose component predicted from another image. We feed the decoder (Eq. (7)) with the modified latent component to retrieve our edited image.

5.1 Expression and Pose Editing in-the-wild

Given two in-the-wild images of faces, we are able to transfer the expression or pose of one person to another. Transferring the expression between two different facial images without fitting a 3D model is a very challenging problem. Generally, it is considered in the context of the same person under an elaborate blending framework [41] or by transferring certain classes of expressions [29].

For this experiment, we work with completely unseen data (a hold-out set of CelebA) and no labels. We first encode both input images x^(1) and x^(2):

[z_e^(j), z_i^(j), z_p^(j)] = Enc(x^(j)) ,  j = 1, 2 ,    (19)

where Enc is our encoder and z_e, z_i and z_p are the latent representations of expression, identity and pose, respectively.

Assuming we want x^(1) to take on the expression or pose of x^(2), we then decode:

x̂_e = Dec(z_e^(2), z_i^(1), z_p^(1)) ,  x̂_p = Dec(z_e^(1), z_i^(1), z_p^(2)) ,    (20)

where Dec is our decoder.

The resulting x̂_e then becomes our result image, in which x^(1) has the expression of x^(2); x̂_p is the edited image in which x^(1) is changed to the pose of x^(2).

As there is currently no prior work on this expression editing experiment that does not fit an AAM [9] or a 3DMM, we use the images synthesised by the fitted 3DMM models as a baseline, which indeed performs quite well. Compared with our method, other very closely related works [37, 30] are not able to disentangle illumination, pose, expression and identity. In particular, [30] disentangles the illumination of an image, while [37] disentangles illumination, expression and identity from “frontalised” images; hence neither is able to disentangle pose. None of these methods can be applied to the expression/pose editing experiments on a dataset that contains pose variations, such as CelebA. If [37] is applied directly to our test images, it is not able to perform expression editing well, as shown in Figure 9.

For the 3DMM baseline, we fit a shape model to both images and extract the expression components of the model. We then generate a new face shape using the expression components of one face and the identity components of the other in the same 3DMM setting. This technique has much higher overhead than our proposed method, as it requires time-consuming 3DMM fitting of the images. Our expression editing results and the baseline results are shown in Figure 8. Though the baseline is very strong, it does not change the texture of the face, which can produce unnatural-looking faces still showing the original expression. Also, the baseline method cannot fill in the inner-mouth area. Our editing results show more natural-looking faces.

For pose editing, the background is unknown once the pose has changed; thus, for this experiment, we mainly focus on the face region. Figure 10 shows our pose editing results. For the baseline method, we fit a 3DMM to both images and estimate the rotation matrices. We then synthesise x^(1) with the rotation of x^(2). This technique has high overhead, as it requires expensive 3DMM fitting of the images.

Figure 13: Comparison of the estimated normals obtained using the proposed model vs the ones obtained by [37] and [30].
Figure 14: Visualisation of our expression latent space and the baseline latent space using t-SNE. Our latent space clusters better with regard to expression than the latent space of an auto-encoder.
Figure 15: Visualisation of our pose latent space and the baseline latent space using t-SNE. It is evident that the proposed disentangled latent space clusters better with regard to pose than the latent space of an auto-encoder.

5.2 Illumination Editing

We transfer illumination by estimating the normals, albedo and illumination components of the source and target images. Then we use the target normals and the source illumination to compute the transferred shading, and multiply the new shading by the target albedo to create the relit image. In Figure 11 we show the performance of our method and compare against [30] on illumination transfer. We observe that our method outperforms [30], as we obtain more realistic-looking results. We include further comparison images with [30] in the supplementary material.
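A minimal Lambertian sketch of this relighting step, with a single directional light standing in for the network's illumination code (the real pipeline uses the estimated normal map, albedo and illumination representation):

```python
import numpy as np

rng = np.random.default_rng(7)
h, w = 8, 8

# Hypothetical per-pixel unit normals and albedo of the target face.
N = rng.standard_normal((h, w, 3))
N /= np.linalg.norm(N, axis=-1, keepdims=True)
albedo = rng.random((h, w))

def shading(normals, light):
    """Lambertian shading for a directional light."""
    return np.clip(normals @ light, 0.0, None)

light_source = np.array([0.5, 0.0, np.sqrt(0.75)])  # light of the source image

# Relighting: multiply the target albedo by the transferred shading.
relit = albedo * shading(N, light_source)
print(relit.shape)  # (8, 8)
```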

5.3 3D Reconstruction

The latent variables that our network learns are highly meaningful. Not only can they be used to reconstruct the image in 2D, they can also be mapped into the expression and identity components of a 3DMM; this mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with these mapped components, we are able to reconstruct the 3D mesh of a face from a single in-the-wild 2D image. We compare the reconstructed meshes against a 3DMM fitted directly to the input image.

The results of the experiment are visualised in Figure 12. We observe that our reconstruction is very close to the ground truth. However, neither technique captures the identity of the person in the input image well, due to a known weakness of 3DMMs.

5.4 Normal Estimation

Method   | Mean ± Std      | <35°  | <40°
[37]     | 33.37° ± 3.29°  | 75.3% | 96.3%
[30]     | 30.09° ± 4.66°  | 84.6% | 98.1%
Proposed | 28.67° ± 5.79°  | 89.1% | 96.3%
Table 1: Angular error (against the normals of [38]) for the various surface normal estimation methods on the Photoface [42] dataset.

We evaluate our method on the surface normal estimation task on the Photoface [42] dataset, which provides illumination information. Taking the normals obtained by calibrated Photometric Stereo [38] as "ground truth", we calculate the angular error between our estimated normals and this "ground truth". Figure 13 and Table 1 quantitatively evaluate our proposed method against prior works [37, 30] on the normal estimation task. We observe that our proposed method performs on par with or outperforms the previous methods.
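The angular-error metric can be sketched in a few lines; the normal maps here are synthetic stand-ins for the estimated and Photometric Stereo normals:

```python
import numpy as np

def mean_angular_error(n_est, n_gt):
    """Mean per-pixel angle (degrees) between two unit-normal maps."""
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

rng = np.random.default_rng(3)
n_gt = rng.normal(size=(16, 16, 3))
n_gt /= np.linalg.norm(n_gt, axis=-1, keepdims=True)

err_same = mean_angular_error(n_gt, n_gt)     # identical normals -> ~0 degrees
err_flip = mean_angular_error(-n_gt, n_gt)    # opposite normals -> 180 degrees
```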

5.5 Quantitative Evaluation of the Latent Space

We want to test whether our latent space corresponds well to the variation it is supposed to capture. For our quantitative experiment, we used Multi-PIE [14] as the test dataset, which contains labelled variations in identity, expression and pose. Disentangling variations in Multi-PIE is particularly challenging, as its images were captured under laboratory conditions, quite different from those of our training images. Moreover, the expressions contained in Multi-PIE do not correspond to the 7 basic expressions and can easily be confused.

We encoded 10368 images of the Multi-PIE dataset, covering 54 identities, 6 expressions and 7 poses, and trained a linear SVM classifier on 90% of the encoded latent variables with their identity labels. We then test on the remaining 10% to check whether the latents are discriminative for identity classification, using 10-fold cross-validation to evaluate the accuracy of the learnt classifier. We repeat this experiment for expression and pose using the corresponding latent variables. Our results in Table 2 show that our latent representation is indeed discriminative, even on a previously unseen dataset. In order to compare quantitatively with [37], we run another experiment on only the frontal images of the dataset, with 54 identities, 6 expressions and 16 illuminations. The results in Table 3 show that our proposed model outperforms [37] in these classification tasks; our latent representation has stronger discriminative power than the one learnt by [37].
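The classification protocol can be sketched with scikit-learn on synthetic latents; the class-dependent cluster offsets below stand in for the encoded Multi-PIE codes:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy stand-in for the encoded latents: 3 classes whose codes are
# separated by class-dependent offsets, plus unit Gaussian noise.
rng = np.random.default_rng(4)
n_per, dim = 30, 16
offsets = rng.normal(scale=4.0, size=(3, dim))
z = np.vstack([offsets[c] + rng.normal(size=(n_per, dim)) for c in range(3)])
y = np.repeat(np.arange(3), n_per)

# Linear SVM with 10-fold cross-validation, as in the evaluation protocol.
scores = cross_val_score(LinearSVC(random_state=0), z, y, cv=10)
mean_acc = scores.mean()
```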

         | Identity | Expression | Pose
Accuracy | 83.85%   | 86.07%     | 95.73%
Table 2: Classification accuracy results: we classify 54 identities, 6 expressions and 7 poses, each from the corresponding latent variable.
             | Proposed | [37]
Identity     | 99.33%   | 19.18%
Expression   | 78.92%   | 35.49%
Illumination | 64.11%   | 48.85%
Table 3: Classification accuracy results in comparison with [37]. As [37] works on frontal images, we only consider frontal images in this experiment, classifying 54 identities, 6 expressions and 16 illuminations from the corresponding latent variables of each method.

We visualise, using t-SNE [22], the latents encoded from Multi-PIE according to their expression and pose labels, and compare against the latent representation learnt by an in-house large-scale adversarial auto-encoder of similar architecture trained with 2 million faces [23]. Figures 14 and 15 show that, even though our encoder has not seen any images of Multi-PIE, it creates informative latent representations that cluster expression and pose well (contrary to the representation learned by the tested auto-encoder).
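The visualisation step can be sketched as follows; the two Gaussian clusters stand in for the latents of two expression classes, and the t-SNE hyperparameters are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
# Toy latents from two well-separated "expression" clusters.
z = np.vstack([rng.normal(0.0, 1.0, size=(40, 32)),
               rng.normal(6.0, 1.0, size=(40, 32))])

# Embed to 2-D for visual inspection of the clustering.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(z)
```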

6 Conclusion

We proposed, to the best of our knowledge, the first attempt to jointly disentangle modes of variation corresponding to expression, identity, illumination and pose using no explicit labels for these attributes. In particular, our approach is the first to combine a powerful Deep Convolutional Neural Network (DCNN) architecture with unsupervised tensor decompositions. We demonstrated the power of our methodology in expression and pose transfer, as well as in discovering powerful features for pose and expression classification.

References

  • [1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [2] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
  • [3] T. Bolkart and S. Wuhrer. A robust multilinear model learning framework for 3d faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4911–4919, 2016.
  • [4] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou. 3D face morphable models "in-the-wild". arXiv preprint arXiv:1701.05360, 2017.
  • [5] C. X. D. T. Chaoyue Wang, Chaohui Wang. Tag disentangled generative adversarial network for object image re-rendering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2901–2907, 2017.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
  • [7] S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou. 4dfab: A large scale 4d facial expression database for biometric applications. arXiv preprint arXiv:1712.01443, 2017.
  • [8] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
  • [9] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.
  • [10] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
  • [11] G. Desjardins, A. Courville, and Y. Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
  • [12] L. R. Fabrigar and D. T. Wegener. Exploratory factor analysis. Oxford University Press, 2011.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [14] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
  • [15] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
  • [16] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • [17] I. Kemelmacher-Shlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.
  • [18] T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, 2008.
  • [19] J. Kossaifi, Y. Panagakis, and M. Pantic. Tensorly: Tensor learning in python. ArXiv e-print.
  • [20] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
  • [21] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [22] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [23] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
  • [24] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
  • [25] I. Matthews and S. Baker. Active appearance models revisited. International journal of computer vision, 60(2):135–164, 2004.
  • [26] H. Neudecker. Some theorems on matrix differentiation with special reference to kronecker matrix products. Journal of the American Statistical Association, 64(327):953–963, 1969.
  • [27] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1431–1439, Bejing, China, 22–24 Jun 2014. PMLR.
  • [28] F. Roemer. Advanced algebraic concepts for efficient multi-channel signal processing. PhD thesis, Universitätsbibliothek Ilmenau, 2012.
  • [29] C. Sagonas, Y. Panagakis, A. Leidinger, S. Zafeiriou, et al. Robust joint and individual variance explained. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2017.
  • [30] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), 2017.
  • [31] P. Snape, Y. Panagakis, and S. Zafeiriou. Automatic construction of robust spherical harmonic subspaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 91–100, 2015.
  • [32] Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In International Conference on Machine Learning, pages 163–171, 2013.
  • [33] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Comput., 12(6):1247–1283, June 2000.
  • [34] A. Tewari, M. Zollöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and T. Christian. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • [35] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, volume 4, page 7, 2017.
  • [36] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer, 2002.
  • [37] M. Wang, Y. Panagakis, P. Snape, S. Zafeiriou, et al. Learning the multilinear structure of visual data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4592–4600, 2017.
  • [38] R. J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980.
  • [39] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [40] X. Wu, R. He, and Z. Sun. A lightened CNN for deep face representation. arXiv preprint arXiv:1511.02683, 2015.
  • [41] F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas. Expression flow for 3d-aware face component transfer. In ACM Transactions on Graphics (TOG), volume 30, page 60. ACM, 2011.
  • [42] S. Zafeiriou, G. A. Atkinson, M. F. Hansen, W. A. P. Smith, V. Argyriou, M. Petrou, M. L. Smith, and L. N. Smith. Face recognition and verification using photometric stereo: The photoface database and a comprehensive evaluation. IEEE Transactions on Information Forensics and Security, 8(1):121–135, 2013.

Appendix A Network Details

The convolutional encoder stack (Fig. 2) is composed of three convolutional layers, each followed by max-pooling and a thresholding nonlinearity. We pad the filter responses so that the final output of the convolutional stack is a set of filter responses with a fixed spatial size relative to the input image. The pooling indices of the max-pooling are preserved for the unpooling layers in the decoder stack.

The decoder stacks for the mask and background are strictly symmetric to the encoder stack and have skip connections to the input encoder stack at the corresponding unpooling layers. These skip connections between the encoder and the decoder allow for the details of the background to be preserved.

The other decoder stacks use upsampling and are also strictly symmetric to the encoder stack.
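The index-preserving pooling/unpooling pair used by the mask and background decoders can be illustrated in NumPy; this is a simplified single-channel sketch, not the network's actual layers:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling that also returns argmax indices for unpooling."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(axis=1)            # position of the max in each block
    return blocks.max(axis=1).reshape(H // 2, W // 2), idx

def max_unpool_2x2(pooled, idx):
    """Place each pooled value back at its recorded argmax position."""
    out_blocks = np.zeros((pooled.size, 4))
    out_blocks[np.arange(pooled.size), idx] = pooled.ravel()
    H, W = pooled.shape
    return out_blocks.reshape(H, W, 2, 2).transpose(0, 2, 1, 3).reshape(H * 2, W * 2)

rng = np.random.default_rng(6)
x = rng.uniform(0.5, 1.0, size=(4, 4))     # toy positive feature map
pooled, idx = max_pool_2x2(x)
recon = max_unpool_2x2(pooled, idx)        # maxima restored, rest zeroed
```

Carrying the indices across the encoder–decoder bottleneck is what lets the decoder put activations back at their original spatial locations, preserving fine detail.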

Figure 16: Expression Editing. Columns: Original Image, Expression, Recon, Our Exp Edit, Baseline.
Figure 17: Pose Editing. Columns: Original Image, Pose, Recon, Our Pose Edit, Baseline.

Appendix B Derivation Details

The model is trained end-to-end by applying gradient descent to batches of images, where (12), (13) and (14) are written in the following general form:

(15)

where the loss involves a data matrix, the mode-1 matricisation of a tensor, and the latent variable matrices.

The partial derivative of (15) with respect to a latent variable matrix is computed as follows. Vectorising the data matrix, (15) is equivalent to:

(21)

as both the Frobenius norm of a matrix and the Euclidean norm of its vectorisation equal the sum of all squared elements.

(22)

as the identity vec(AXB) = (Bᵀ ⊗ A) vec(X) holds [26].
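The vec–Kronecker identity from [26], vec(AXB) = (Bᵀ ⊗ A) vec(X), can be checked numerically; note that it assumes column-major (Fortran-order) vectorisation:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))

def vec(M):
    """Column-major vectorisation, matching the identity's convention."""
    return M.flatten(order="F")

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)   # (B^T kron A) vec(X)
```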

Using [28], the following holds:

(23)

Again using [28]:

(24)

Vectorising the latent variable matrix, this becomes:

(16)

We then compute the partial derivative of (16) with respect to the vectorised latent variable:

(25)

where the remaining matrix factor is constant with respect to the latent variable.

The partial derivative of (15) with respect to the latent variable matrix is obtained by matricising (25).

Appendix C More expression and pose transfer images

Figures 16 and  17 show additional expression and pose editing results.

Figure 18: Expression Interpolation

Figure 19: Identity Interpolation
Figure 20: We relight target faces using illumination from the source image. We compare against results presented in [30]. Columns: Source, [30], Target, Result.
Figure 21: We compare our expression editing results with [37]. As [37] requires frontalisation of the face, applying it to our aligned input data does not achieve good results. Columns: Original Image, Expression, Our Recon, Our Exp Edit, B & W, [37].

Appendix D Interpolation Results

We interpolate the expression/identity components of the input image (shown on the right-hand side) towards those of the target image (shown on the left-hand side). The interpolation is linear, at 0.1 intervals. During interpolation we do not modify the background, so the background remains that of the input image.
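The interpolation itself is a simple convex combination of latent codes; the vectors and dimensionality below are random stand-ins for the encoded components:

```python
import numpy as np

rng = np.random.default_rng(8)
dim = 16
z_in, z_tgt = rng.normal(size=dim), rng.normal(size=dim)  # stand-in latents

# Linear interpolation at 0.1 intervals (11 steps from input to target).
ts = np.linspace(0.0, 1.0, 11)
path = np.stack([(1 - t) * z_in + t * z_tgt for t in ts])
```

Each row of `path` is decoded in turn, while all other latent components (and the background) are held fixed.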

For expression interpolation, we expect the identity and pose to stay the same as in the input image, with only the expression changing gradually from that of the input image to that of the target image. Figure 18 shows the expression interpolation: we can clearly see the change in expression while pose and identity remain constant.

For identity interpolation, we expect the expression and pose to stay the same as in the input image, with only the identity changing gradually from that of the input image to that of the target image. Figure 19 shows the identity interpolation: we can clearly observe the change in identity while the other variations remain limited.

Appendix E Expression Transfer from Video

We conducted another challenging experiment to test the potential of our method: can we transfer facial expressions from an "in-the-wild" video to a given template image (also an "in-the-wild" image)? For this experiment, we split the input video into frames and extract the expression component of each frame. We then replace the expression component of the template image with that of each video frame and decode. The decoded images form a new video sequence in which the person in the template image takes on the expression of the input video at each frame. The result can be seen here: https://youtu.be/tUTRSrY_ON8. The original video is shown on the left, and the template image on the right. The result of the expression transfer is the second video from the left. We compare against a baseline (third video from the left) in which the template image has been warped to the landmarks of the input video. We can clearly see that our method disentangles expression from pose, changing only the expression, whereas the baseline can only transform expression and pose together. Our result video also displays expressions that look more natural on the person in the template image. In conclusion, we are able to animate a template face using the disentangled facial expression components of a video sequence.

Appendix F Relighting

Figure 20 shows more relighting comparison results with [30]. Here we compare directly with images provided by [30] in their paper.

Appendix G Further Expression Editing Comparison

Figure 21 shows further expression editing comparisons with [37]. The method proposed in [37] does not disentangle pose and hence requires "frontalisation" of the face to work optimally, whereas our proposed method can edit expressions directly on aligned images. To visualise the difference, we run [37] directly on our aligned test images and compare with our proposed method. As expected, [37] does not perform well in this setup: given the same input (aligned images from CelebA), our method edits expression directly, whereas [37] requires further "frontalisation" transformations to obtain good results, owing to its inability to disentangle pose.
