Geometry Guided Adversarial Facial Expression Synthesis



Facial expression synthesis has drawn much attention in the field of computer graphics and pattern recognition. It has been widely used in face animation and recognition. However, it is still challenging due to the high-level semantic presence of large and non-linear face geometry variations. This paper proposes a Geometry-Guided Generative Adversarial Network (G2-GAN) for photo-realistic and identity-preserving facial expression synthesis. We employ facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with a specific expression. A pair of generative adversarial subnetworks are jointly trained towards opposite tasks: expression removal and expression synthesis. The paired networks form a mapping cycle between neutral and arbitrary expressions, which also facilitates other applications such as face transfer and expression-invariant face recognition. Experimental results show that our method can generate compelling perceptual results on various facial expression synthesis databases. An expression-invariant face recognition experiment is also performed to further show the advantages of our proposed method.


1Introduction

Facial expression synthesis is a classical graphics problem where the goal is to generate face images with a specific expression for a specified human subject. It has drawn much attention in the fields of computer graphics, computer vision and pattern recognition. Synthesizing photo-realistic facial expression images has been of great value for both academic and industrial communities, and has been widely applied in facial animation, face editing, face data augmentation and face recognition. During the last two decades, many facial expression synthesis methods have been proposed, which can be roughly divided into two categories. The first category mainly resorts to computer graphics techniques to directly warp input faces to target expressions [39] or re-use sample patches of existing images [23], while the other aims to build generative models to synthesize images with predefined attributes [30].

For the first category, a lot of research effort has been devoted to finding correspondence between existing facial textures and target images. Earlier approaches usually generate new expressions by creating fully textured 3D facial models [26], warping face images via feature correspondence [31] and optical flow [35], or compositing face patches from an existing expression dataset [23]. Particularly, Yeh et al. [37] propose to learn the optical flow with a variational autoencoder. Although these methods can usually produce realistic high-resolution images, their elaborate yet complex pipelines often incur expensive computation.

The representative methods in the second category are deep generative models that have recently obtained impressive results in image synthesis applications [40]. However, images generated by such methods sometimes lack fine details and tend to be blurry or of low resolution. Targeted expressional attributes are usually encoded in a latent feature space, where certain directions are aligned with semantic properties. Therefore, these methods can provide better flexibility in semantic-level image generation, but it is hard to take fine-grained control of the synthesized images, e.g., widen the smile or narrow the eyes.

Figure 1: The proposed geometry-guided facial expression generation framework. Face geometry is fed into generators as the condition to guide the processes of expression synthesis and expression removal. In the bottom, we show some examples generated from the same real face image (the center one marked with red box).

In this paper, a deep architecture (G2-GAN) is proposed to synthesize photo-realistic and identity-preserving facial images while remaining operation-friendly. In computer vision, a human face is often assumed to contain geometry and texture information [21], and both geometry and texture attributes can be used to facilitate face recognition and expression classification [19]. Inspired by the face geometry information in active appearance models (AAM), we employ face geometry to control the expression synthesis process. Face geometry is defined via a set of feature points, and is transformed to an image (heatmap) and fed to G2-GAN as a control condition. Figure 1 shows the pipeline of our approach. We generate facial expression images conditioned on both the input face images and geometry attributes. Particularly, expression synthesis and removal are simultaneously considered in our method, constructing a full mapping cycle of expression editing. Then, expression transfer can be performed between arbitrary expressions and subjects. Extensive experiments on two facial expression databases demonstrate the superiority of the proposed facial expression synthesis framework.

The main contributions are summarized as follows,

  • We propose a novel geometry-guided GAN architecture for facial expression synthesis. It can generate photo-realistic images in different expressions from a single image, where target expressions can be easily controlled by various facial geometry inputs.

  • We employ a pair of GANs to simultaneously perform two opposite tasks: removing expression and synthesizing expression. By combining these two models, our method can be used in many applications such as facial expression transfer and cross-expression recognition.

  • We utilize an individual-specific shape model to operate facial geometry, which gives consideration to individual differences when performing expression synthesis. Based on this model, facial expression transfer and interpolation can be easily conducted.

  • Extensive experiments on two facial expression databases demonstrate that the proposed method can synthesize photo-realistic and identity-preserving expression images.

2Related Works

Facial expression synthesis (or editing) is an important task in face editing. In this section, we briefly review some recent advances in facial expression synthesis and its related generative adversarial networks (GAN).

2.1Expression synthesis

As mentioned above, existing expression synthesis methods can be categorized into two classes according to how they manipulate pixels.

Methods in the first category address this problem either with 2D/3D image warping [1], flow mapping [36] or image reordering [17], most of which are morph-based or example-based. For instance, [1] estimates 3D shape from a neutral face, and synthesizes facial expression by 3D rendering. Bolkart et al. [2] propose a groupwise multilinear correspondence optimization to iteratively refine the correspondence between different 3D faces. In [8], an image-based warping strategy is introduced to perform automatic face reenactment, with the facial identity preserving being considered. Thies et al. [32] track expressions based on a statistical facial prior, and then achieve real-time facial reenactment by using deformation transfer in a low-dimensional expression space. Particularly, Olszewski et al. [24] employ a generative adversarial framework to refine 3D texture correspondences and infer details such as wrinkles and inner mouth region. Many works attempt to utilize the optical flow map to perform image warping. In [36], 3D faces of different expressions are constructed, and expression flow is computed by projecting the difference between 3D shapes back to 2D. Recently, neural networks based methods [7] have been presented to manipulate expression flow maps. It is difficult for those warping-based methods to recover unseen facial components, e.g., skin wrinkles and inner mouth area, or synthesize realistic images for new faces.

Example-based methods edit faces by re-using image patches or reordering image samples of a training set, which can synthesize expected expressions as well as generate unseen faces. [23] composites facial patches from a large dataset to synthesize face images with desired expressions. In [14], expression is mapped to a new face by matching images with similar pose and expression from a database of the target person. Li et al. [17] hallucinate face videos by retrieving frames via a carefully-designed expression similarity metric from an existing expression database. Yang et al. [35] reorder input face frames using Dynamic Time Warping, and then apply an additional expression warping to get more realistic results.

Methods in the other category use generative models to deal with facial expression synthesis. In [30], a deep belief net is used to convert high-level descriptions of facial attributes into realistic face images. Reed et al. [28] propose a higher-order Boltzmann machine to model interaction among multiple groups of hidden units, where each unit group encodes distinct variation factors such as pose, morphology and expression in face images. In [5], a regularization term is embedded in an autoencoder to disentangle the variations between identity and expression in an unsupervised manner. Li et al. [18] build a convolutional neural network to generate facial images from given source input images and reference attribute labels. Shu et al. [29] learn a face-specific disentangled representation of intrinsic face properties via GAN, and generate new faces by changing the latent representations. Recently, Ding et al. [6] propose ExprGAN, which can synthesize facial expressions with controllable intensity via a learned expression controller network. ExprGAN is the most similar work to ours as far as we know. However, ExprGAN generates images conditioned on expression labels and intensity values, while we employ the face geometry as the control condition, which is not limited to certain expression styles.

2.2Generative adversarial networks

Our work is also related to generative adversarial networks (GAN) [9], which provide a simple yet efficient way to train powerful models via a min-max two-player game between generator and discriminator. Many modified architectures of GAN have been proposed to deal with different tasks. For example, CGAN [22] introduces a conditional version of GAN to guide the image synthesis process by adding supervised information to both generator and discriminator. CycleGAN [40], DualGAN [38] and DiscoGAN [15] share the same idea of employing a cycle structure to handle the unpaired image-to-image translation problem. GAN and its variants have achieved great success in numerous image generation tasks such as image synthesis [27], image super-resolution [16], image style transfer [40] and face synthesis [12]. Motivated by this, we develop our facial expression synthesis framework based on GAN, aiming at generating photo-realistic images with high-quality local details.


3Method

In this section, we present a novel framework for the facial expression synthesis problem based on generative adversarial networks. We first describe the geometry-guided facial expression synthesis in detail, and then propose geometry manipulation methods for face transfer and expression interpolation.

3.1Geometry Guided Facial Expression Synthesis

The outstanding performance of GAN in fitting data distribution has significantly promoted many computer vision applications such as image style transfer [40]. Motivated by its remarkable success, we employ GAN to perform the facial expression synthesis.

Only limited expression styles are supported by existing deep learning-based facial expression synthesis methods, which are usually semantic properties such as smiling and anger. Many works can transform a neutral face to a smiling face, but can hardly control how strong the smile is. Even though one can construct an intensity-sensitive model by using training data with emotion intensity annotations, many expressions are still difficult to encode with the limited semantic properties. For example, it is hard to describe “a lopsided grin with one eye open” using normal semantic properties. To address this problem, we employ the face geometry to guide the generation.

As in AAM, face geometry is defined via a set of fiducial points [21]. A heatmap is used to encode the locations of these facial fiducial points, as widely used in human pose estimation [33] and face alignment [11]. The heatmap provides a per-pixel likelihood for fiducial point locations. Given the heatmaps of target facial expressions and frontal-looking faces without expression (hereafter termed expressionless faces), new face images (expressioned faces) are synthesized accordingly.
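The heatmap construction described above can be sketched as follows. This is a minimal sketch: the function name, image size and Gaussian bandwidth `sigma` are illustrative assumptions, not values specified by the paper.

```python
# Sketch: encode K fiducial points as a K-channel Gaussian heatmap,
# giving a per-pixel likelihood for each point's location.
import numpy as np

def landmarks_to_heatmap(points, size, sigma=2.0):
    """points: (K, 2) array of (x, y) coordinates; returns (K, size, size) maps in [0, 1]."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = np.zeros((len(points), size, size))
    for k, (px, py) in enumerate(points):
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        maps[k] = np.exp(-d2 / (2.0 * sigma ** 2))  # peak value 1 at the point
    return maps
```

Each channel can then be concatenated with the face image as a generator input.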

As illustrated in Figure 1, a pair of generators G_E and G_N are introduced, in which I^N is an expressionless face, I^E is an expressioned face and H is the heatmap corresponding to I^E. Associated with these two generators, two discriminators D_E and D_N are involved, aiming to distinguish between real triplets (I_in, H, I_gt) and generated triplets (I_in, H, G(I_in, H)) correspondingly. Here I_in and I_gt are images of expressionless and expressioned faces, or vice versa, depending on the generator.

It is worth noting that H plays different roles in these two face editing models, i.e., a control measure in expression synthesis and an auxiliary annotation in expression removal. In the expression synthesis process, H is used to specify the target expression so that G_E can transform a neutral expression into the desired expression. As for the expression removal process, H is in charge of indicating the state of I^E so as to facilitate the recovery of I^N.

Adversarial Loss

Generators and discriminators are trained alternately towards adversarial goals, following the pioneering work of [9]. Since the proposed face editing models generate results conditioned on the input face images and heatmaps, we apply GAN in the conditional setting as in [22]. The adversarial losses for the generator and discriminator are shown in Equation 1 and Equation 2 respectively:

L_adv^G = E[log(1 - D(I_in, H, G(I_in, H)))],    (1)

L_adv^D = -E[log D(I_in, H, I_gt)] - E[log(1 - D(I_in, H, G(I_in, H)))].    (2)

Pixel Loss

The generator is tasked to not only fool the discriminator, but also synthesize images as similar to the target ground truths as possible. The pixel-wise loss L_pix enforces the transformed face image to have a small distance to the ground truth in the raw-pixel space. L_pix takes the form:

L_pix = E[ || G(I_in, H) - I_gt ||_1 ],    (3)

where the L1 distance is used to encourage less blurry output. (I_in, I_gt) is one of the combinations (I^N, I^E) and (I^E, I^N), depending on the generator.

Cycle-Consistency Loss

The generators G_E and G_N construct a full mapping cycle between neutral expression faces and expressioned faces. If we transform a face image from neutral expression to angry and then transform it back to neutral expression, the same face image should be obtained in the ideal situation. Therefore, we introduce an extra cycle consistency loss L_cyc to guarantee the consistency between source images and the reconstructed images, e.g., I^N vs. G_N(G_E(I^N, H), H) and I^E vs. G_E(G_N(I^E, H), H). L_cyc is calculated as

L_cyc = E[ || G'(G(I_in, H), H) - I_in ||_1 ],    (4)

where G' is the opposite generator to G. In our case, if G_E is used to transform neutral expression into the expression specified by the face geometry heatmap H, then G_N is used to recover the neutral expression with the assistance of H.

Identity Preserving Loss

A fundamental principle of facial expression editing is that face identity should be preserved after expression synthesis as well as removal. Thus, an identity-preserving term L_ip is adopted in our framework to enforce identity consistency:

L_ip = || f(G(I_in, H)) - f(I_in) ||_2^2,    (5)

where f is a feature extractor for face recognition. We employ model-B of the Light CNN [34] as our feature extraction network, which includes 9 convolution layers, 4 max-pooling layers and one fully-connected layer. The Light CNN is pre-trained as a classifier to distinguish between tens of thousands of identities, so it is able to capture the most prominent features for face identity discrimination. Therefore, we can leverage this loss to enforce face identity preservation through the face editing processes.

To sum up, the final full objective for the generators is a weighted sum of all the losses defined above: L_adv^G to remove the modality gap between real and generated samples, L_pix to force pixel-wise correctness, L_cyc to guarantee cycle consistency between the reconstructed image and the source image, and L_ip to preserve identity characteristics through the mapping process:

L_G = L_adv^G + λ_pix L_pix + λ_cyc L_cyc + λ_ip L_ip,    (6)

where λ_pix, λ_cyc and λ_ip are loss weight coefficients.
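The weighted generator objective above can be sketched numerically. This is a minimal sketch under stated assumptions: `generator_loss`, its stand-in array inputs and the squared-L2 identity term are illustrative; the source only fixes the weight values (Section 4.2) and the L1 pixel/cycle distances.

```python
# Sketch: combining the pixel, cycle-consistency and identity-preserving
# losses into one weighted generator objective.
import numpy as np

def l1(a, b):
    return np.mean(np.abs(a - b))

def generator_loss(fake, target, recon, source, feat_fake, feat_src,
                   adv_term, lam_pix=10.0, lam_cyc=5.0, lam_ip=0.1):
    pix = l1(fake, target)                     # pixel loss vs. ground truth
    cyc = l1(recon, source)                    # cycle consistency: G'(G(x)) vs. x
    ip = np.mean((feat_fake - feat_src) ** 2)  # identity features should match
    return adv_term + lam_pix * pix + lam_cyc * cyc + lam_ip * ip
```

In a real training loop, `adv_term` would come from the discriminator and the feature vectors from a pre-trained recognition network.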

3.2Facial Geometry Manipulation

As mentioned above, geometric positions of a set of fiducial points are employed to guide facial expression editing in our framework. Face geometry is largely affected by facial expression, and is a useful cue for expression recognition [19]. Its usage provides a more intuitive yet efficient way for specifying target facial expression. This is because face geometry can not only visually represent the locations and shapes of facial organs, but also be adjusted continuously to obtain expressions with different intensities.

Human faces have unique physiological structure characteristics, resulting in strong correlation between the locations of fiducial points. Hence, the variance of facial geometry should be constrained to avoid unreasonable settings, e.g., eyebrows under the eyes, or square-shaped eyes or noses. Taking prior knowledge of the distribution of faces into account, a parametric shape model is built to serve as a geometry generator.

We adopt a method similar to [21] to learn a basic shape model from labelled training images. Firstly, faces are normalized to the same scale and rotated to horizontal according to the locations of the two eyes. Then, Principal Component Analysis (PCA) is applied to get a basic shape model of the fiducial point locations:

s = s̄ + Pp,    (7)

where s ∈ R^{2n}, p ∈ R^m and P ∈ R^{2n×m}. The base shape s̄ is the mean shape over all training images, and the columns of P are the eigenvectors corresponding to the m largest eigenvalues. Different facial geometries can be obtained by changing the values of the shape parameters p.
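The PCA shape model of Equation 7 can be sketched as follows. The function names and the SVD route to the eigenvectors are implementation choices of this sketch, not the paper's code; training shapes are assumed to be stacked as rows of (2n,) vectors.

```python
# Sketch: learn the basic shape model s = s_mean + P p via PCA, then
# encode/decode shapes to and from shape parameters p.
import numpy as np

def fit_shape_model(shapes, m):
    """shapes: (N, 2n) matrix of training shapes. Returns mean (2n,) and basis P (2n, m)."""
    s_mean = shapes.mean(axis=0)
    centered = shapes - s_mean
    # SVD of the centered data yields principal directions as rows of vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return s_mean, vt[:m].T

def decode(s_mean, P, p):
    return s_mean + P @ p       # a new geometry from shape parameters p

def encode(s_mean, P, s):
    return P.T @ (s - s_mean)   # least-squares parameters (P has orthonormal columns)
```

Because the columns of P are orthonormal, `encode` is the closed-form least-squares solution.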

However, facial geometry is not only correlated with facial expression, but also related to face identity to a great extent. Facial geometry varies across individuals even under the same expression. For example, the distance between the eyes and the length of the nose depend largely on face identity rather than expression. Considering these individual differences, we propose an individual-specific shape model based on Equation 7, which can be derived by replacing the mean shape s̄ with the neutral shape s_0 of each individual. The individual-specific shape model is given by

s = s_0 + Pp,    (8)

where s_0 accounts for variation related to identity, while Pp accounts for changes caused by facial expression.

Facial Expression Transfer

The proposed framework can be easily applied to facial expression transfer. Given two expressioned faces I^E_A and I^E_B with detected facial landmarks s_A and s_B, the expression removal model is firstly employed to recover expressionless faces as

I^N_A = G_N(I^E_A, H_A),  I^N_B = G_N(I^E_B, H_B),    (9)

where I^N_A and I^N_B denote the neutral expression faces of the two subjects respectively. The neutral shapes s_{0,A} and s_{0,B} can then be acquired via facial landmark detection.

Then, the shape parameters are derived by solving the following least squares regression problem:

p_X = argmin_p || s_X - (s_{0,X} + Pp) ||_2^2,  X ∈ {A, B}.    (10)

We exchange the shape parameters so as to get the transferred locations of the fiducial points:

s'_A = s_{0,A} + P p_B,  s'_B = s_{0,B} + P p_A.    (11)

Heatmaps are transformed according to these transferred shapes, and concatenated with the corresponding expressionless faces as inputs for expression synthesis. Finally, results of facial expression transfer can be obtained by using our expression synthesis model as in Equation 12:

I^E'_A = G_E(I^N_A, H'_A),  I^E'_B = G_E(I^N_B, H'_B).    (12)
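Under the individual-specific model, the parameter estimation and exchange steps can be sketched as below. This is a sketch under stated assumptions: `transfer_shapes` is a hypothetical helper, and the closed-form least-squares solution relies on P having orthonormal columns (as produced by PCA).

```python
# Sketch: estimate each subject's expression parameters relative to their
# own neutral shape, then swap parameters to transfer the expression.
import numpy as np

def transfer_shapes(s_a, s0_a, s_b, s0_b, P):
    """s_*: observed shapes; s0_*: neutral shapes; P: (2n, m) orthonormal basis."""
    p_a = P.T @ (s_a - s0_a)   # least-squares solution of the regression problem
    p_b = P.T @ (s_b - s0_b)
    # Exchange parameters: each subject keeps their identity shape s0,
    # but takes the other subject's expression deformation P p.
    return s0_a + P @ p_b, s0_b + P @ p_a
```

The transferred shapes are then rendered as heatmaps and fed to the synthesis generator.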

Facial Expression Synthesis and Interpolation

As mentioned above, our method is able to synthesize different expressions from a single image. The only requirement is a neutral expression face image and shape parameters for the target expression. Benefiting from the proposed expression removal model, a neutral expression face is easy to obtain. The shape parameters for a specific expression can be learnt via the basic shape model (see Equation 7) from an annotated training dataset. Once the values of the shape parameters are associated with certain semantic properties, such as fear and surprise, we can use them to synthesize unseen facial expressions of the desired semantic types. Besides, facial expression interpolation can be conducted by linearly adjusting the values of the shape parameters.


4Experiments

In this section, we evaluate the proposed approach on two commonly used facial expression databases. The databases and testing protocols are introduced first. Then, the implementation details are presented. Finally, we provide qualitative and quantitative experimental results for single-image editing, face transfer, expression interpolation and expression-invariant face recognition.

4.1Datasets and Protocols

The CK+ database

[20]. The CK+ database includes 593 sequences from 123 subjects, in which seven kinds of emotions are labeled. The first frame of each sequence is always neutral while the last frame has the peak expression. In each expression video sequence, the first frame is selected as the neutral expression, while the last half of the frames are used as target expressions. Training and testing subsets are divided by identity, with 100 subjects for training and 23 for testing. Locations of 68 fiducial points are provided for each frame, and we use them to create heatmaps. Because almost all of the videos in the CK+ database are grayscale, grayscale images are used in our experiments.

The Oulu-CASIA NIR-VIS facial expression database [4]

Videos of 80 subjects with six typical expressions under three different illumination conditions are captured by both NIR and VIS imaging systems in this database. Only images captured by the VIS camera under the strong illumination condition are used in our facial expression editing experiments. Similar to the CK+ database, we take the first frame and the images belonging to the last half of each sequence to make training pairs. The Oulu-CASIA database includes two parts captured from different ethnic groups at different times, where P001 to P050 are Finnish people and the rest, P051 to P080, are Chinese people. We find that these two parts differ a lot in illumination and face structure. Hence, we select training data from both parts. Finally, we get a training subset of 60 subjects consisting of 37 Finns and 23 Chinese, and a testing subset of 13 Finns and 7 Chinese accordingly. We use the 68 fiducial points detected by [3] to create heatmaps.

4.2Implementation Details

Image pre-processing

All the face images are normalized by a similarity transformation using the locations of the two eyes, and then cropped to a fixed size, from which sub-images are selected by random cropping in training and center cropping in testing. In the training stage, we also perform random flipping of the input images to improve generalization. The heatmap is a multi-channel image of the same size as the input face image, where the value of each pixel is the likelihood of a fiducial point location. A 2D Gaussian convolution is applied to each channel to smooth the heatmap. All pixel values, of both face images and heatmaps, are normalized into the range [0, 1].
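The cropping and flipping steps can be sketched as follows; the crop size, RNG handling and `preprocess` name are hypothetical (the paper's exact crop dimensions are not reproduced here), and heatmap smoothing is omitted.

```python
# Sketch: random crop + random horizontal flip for training, center crop for
# testing, with pixel values scaled into [0, 1].
import numpy as np

def preprocess(img, crop, rng, train=True):
    """img: (H, W, C) uint8 array. Returns a (crop, crop, C) float array in [0, 1]."""
    h, w = img.shape[:2]
    if train:
        y = rng.integers(0, h - crop + 1)
        x = rng.integers(0, w - crop + 1)
    else:                       # center crop at test time
        y, x = (h - crop) // 2, (w - crop) // 2
    out = img[y:y + crop, x:x + crop].astype(np.float64) / 255.0
    if train and rng.random() < 0.5:
        out = out[:, ::-1]      # random horizontal flip
    return out
```

The same spatial transform would be applied to the heatmap channels so that image and geometry stay aligned.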

Network architecture

. We adapt our architecture from [13]. The generators take the architecture of U-Net, which is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks. For discriminator networks, the frequently-used PatchGAN model is employed.

We train a separate model for each dataset with a batch size of 5 and an initial learning rate of . In all our experiments, hyper-parameters are set empirically to balance the importance of the different losses. The trade-off parameter λ_pix for the pixel loss is set to 10, and λ_cyc for the cycle consistency loss is set to 5. λ_ip for the identity-preserving loss is set to 0.1 at the beginning, and is gradually increased to 0.5 along with the training process.

4.3Experimental Results

Facial Expression Editing

Figure 2: Results of CK+ database for facial expression synthesis and removal. From top to bottom, input expressionless images (true I^N), input expressioned images (true I^E), expression removal results (fake I^N) and expression synthesis results (fake I^E).

For this experiment, given testing image triplets (I^N, H, I^E), we conduct expression synthesis on I^N and expression removal on I^E simultaneously. Some visual examples are shown in Figure 2 and Figure 3. The first two rows display original expressionless faces and original expressioned faces, and the next two rows are the results of expression removal and expression synthesis respectively. We can see that the proposed G2-GAN is capable of generating compelling identity-preserving faces with the desired expressions on both testing datasets. Since the images in the CK+ database have a higher resolution than those in the Oulu-CASIA database, results for the CK+ database show better low-level image quality, e.g., skin wrinkles. Note that we can synthesize a satisfactory mouth region with even teeth textures, without needing to involve extra manipulations such as recovering the mouth area by retrieving similar frames from an existing database.

In order to measure the correctness of the transformed images, we adopt PSNR (peak signal-to-noise ratio, in dB) and SSIM (structural similarity index) as quantitative metrics, where PSNR is calculated on the luminance channel and SSIM is calculated on the three RGB channels respectively. Table 1 reports quantitative results of the proposed approach under different settings. Both the cycle consistency loss and the identity preserving loss contribute to improved performance, and the best results are acquired by combining them.
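A minimal PSNR implementation consistent with this protocol can be sketched as below, assuming images are scaled to [0, 1] so the peak value is 1.0 (an assumption of this sketch; SSIM is more involved and omitted).

```python
# Sketch: peak signal-to-noise ratio in dB between a reference image and a
# transformed image, both scaled to [0, 1].
import numpy as np

def psnr(ref, test, peak=1.0):
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For the table, this would be evaluated on the luminance channel of each ground-truth/output pair and averaged over the test set.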

Figure 3: Results of Oulu-CASIA database for facial expression synthesis and removal. Images are arranged in the same order as Figure 2.
Table 1: Quantitative results for expression synthesis and expression removal on the CK+ and Oulu-CASIA databases. Each task is reported as an SSIM / PSNR (dB) pair.

Dataset      Configuration       SSIM    PSNR     SSIM    PSNR
CK+          w/o L_cyc, L_ip     0.726   22.655   0.756   23.903
CK+          w/o L_cyc           0.728   22.828   0.754   24.061
CK+          w/o L_ip            0.724   22.516   0.765   24.335
CK+          G2-GAN              0.728   22.968   0.767   24.420
Oulu-CASIA   w/o L_cyc, L_ip     0.902   25.202   0.908   26.206
Oulu-CASIA   w/o L_cyc           0.903   25.270   0.914   26.337
Oulu-CASIA   w/o L_ip            0.904   25.519   0.916   26.677
Oulu-CASIA   G2-GAN              0.910   25.810   0.914   26.588

Facial Expression Transfer

In this part, we demonstrate our model’s ability to transfer expressions between different faces. The procedure for facial expression transfer is introduced in Section 3.2.

Figure 4 and Figure 5 show some example results. The facial expressions are transferred between two subjects in an identity-consistent way. Besides, identity-irrelevant face attributes, e.g., eyeglasses and hair, are well preserved. Individual differences are taken into account in facial expression transfer, resulting in different local deformations for different subjects. For example, when different people hold the same smiling expression, more obvious changes can be observed for people with larger mouths.

Figure 4: Results of CK+ database for facial expression transfer. There are three images for each subject in each example. From left to right: the input images, results of expression removal, and results of facial expression transfer.
Figure 5: Results of Oulu-CASIA database for facial expression transfer. Images are arranged in the same order as Figure 4.

Facial Expression Interpolation

Interpolation of unseen expressions is conducted in this experiment to demonstrate our model’s capability to synthesize expressions of different intensities. It is worth noting that there is no ground truth in this experiment, and the locations of the fiducial points are obtained from the pre-trained shape model as described in Section 3.2.

The generated images are shown in Figure 6 and Figure 7, in which each row contains a new type of expression at different intensities. G2-GAN successfully transforms the input faces to new unseen expressions with fine details. Especially in the results on the CK+ database, the changes of facial texture caused by expression change are well captured, such as glabellar wrinkles under anger and disgust, chin wrinkles when the mouth is shut, and brow lifting when scared. This validates the proposed G2-GAN’s adjustability in generating multiple facial expressions, not limited to pre-determined categories. Besides, these results also demonstrate the operation-friendliness of our method, as we can easily synthesize expressions of desired intensities. An interesting phenomenon is that our model can distinguish the mouth deformations caused by happiness from those caused by surprise, and teeth are only generated when synthesizing a smiling expression.

Figure 6: Results of CK+ database for facial expression interpolation. Images in the left-most column are the source images, and the remainder are synthesized results. Each row shows a different expression with ascending intensity from left to right. Seven expression styles are shown corresponding to the annotated expression classes in CK+ database.
Figure 7: Results of Oulu-CASIA database for facial expression interpolation. Images are arranged in the same order as Figure 6. The six expression styles correspond to the annotated expression classes in the Oulu-CASIA database.
Table 2: Results for expression-invariant face recognition on the CK+ and Oulu-CASIA databases. Images in the probe set are first processed by our expression removal model, and then fed to the face recognition models. We conduct face verification on the transformed probe set and the original gallery set. Results of the ‘original’ configuration are obtained by directly testing on the non-transformed gallery and probe sets.

                                 VGG-FACE                    Light CNN
Dataset      Configuration       Rank-1  FAR=1%  FAR=0.1%    Rank-1  FAR=1%  FAR=0.1%
CK+          original            96.41   92.13   88.11       100.00  97.01   93.33
CK+          w/o L_cyc, L_ip     96.15   93.33   84.94       98.63   96.83   87.77
CK+          w/o L_cyc           96.15   94.27   87.25       100.00  97.60   94.87
CK+          w/o L_ip            96.41   92.90   84.86       99.23   97.43   89.65
CK+          G2-GAN              97.26   96.15   92.22       100.00  97.69   94.95
Oulu-CASIA   original            97.68   94.63   90.91       99.92   95.35   89.02
Oulu-CASIA   w/o L_cyc, L_ip     96.95   94.99   90.46       99.52   95.95   90.10
Oulu-CASIA   w/o L_cyc           97.56   95.80   92.90       99.92   97.60   91.67
Oulu-CASIA   w/o L_ip            96.59   95.35   90.02       99.84   97.04   89.78
Oulu-CASIA   G2-GAN              97.84   96.19   93.19       99.88   97.80   93.31

Expression-Invariant Face Recognition

In this subsection, we apply G2-GAN to expression-invariant face recognition. The expression removal model is employed as a normalization module for face recognition, transforming faces into neutral expression. Face verification is conducted on both the CK+ dataset and the Oulu-CASIA dataset. The gallery set is selected from the first frame of each video sequence, with only one image per subject. The probe set is made up of all remaining images in the testing set. Two released face recognition models are tested: VGG-FACE [25] and the Light CNN [34]. The Rank-1 identification rate and the true accept rates at FAR=1% and FAR=0.1% (TAR@FAR=1%, TAR@FAR=0.1%) are taken as evaluation metrics. In order to validate the effectiveness of L_cyc and L_ip, we report the results of removing each of them respectively.

Results for the expression-invariant face recognition experiment are presented in the table above. Benefiting from the powerful representation ability of deep learning methods, VGG-FACE and Light CNN already obtain high performance on the original images. Nevertheless, the results can be further improved by introducing our expression removal module, especially at the lower FAR. According to the w/o settings, both the cycle consistency loss and the identity preserving loss help improve the recognition performance. Besides, slight drops relative to the original images occur when the identity preserving loss is removed, suggesting its necessity in face editing when the face identity is expected to be preserved.
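For reference, a common way to write the two losses discussed above is an image-space cycle term plus a feature-space identity term. The sketch below is our own hedged illustration using plain L1/L2 distances; the paper's exact formulation, loss weighting, and feature extractor are not reproduced here:

```python
import numpy as np

def cycle_consistency_loss(x, x_cycled):
    """L1 cycle term: synthesizing an expression and then removing it
    should reproduce the original input face."""
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(x_cycled))))

def identity_preserving_loss(feat_in, feat_out):
    """L2 identity term: features from a face recognition network should
    stay close between the input face and the synthesized face."""
    return float(np.mean((np.asarray(feat_in) - np.asarray(feat_out)) ** 2))
```

Dropping the cycle term frees the paired subnetworks from being mutual inverses, while dropping the identity term lets the generator trade identity fidelity for expression realism, which is consistent with the ablation trends reported above.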


This paper has developed a geometry-guided adversarial framework for facial expression synthesis. Facial geometry is employed both to guide photo-realistic face synthesis and to provide an operation-friendly way of specifying the target expression. In addition, a pair of facial editing subnetworks are trained jointly towards two opposite tasks, expression removal and expression synthesis, forming a mapping cycle between neutral and expressive faces. By combining the two subnetworks, our method supports many face-related applications, including facial expression transfer and expression-invariant face recognition. Moreover, we have proposed an individual-specific shape model for manipulating facial geometry that takes individual differences into account. Extensive experimental results demonstrate the effectiveness of the proposed method for facial expression synthesis.


  1. Reanimating faces in images and video.
    V. Blanz, C. Basso, T. Poggio, and T. Vetter. In Computer Graphics Forum, pages 641–650, 2003.
  2. A groupwise multilinear correspondence optimization for 3d faces.
    T. Bolkart and S. Wuhrer. In ICCV, 2015.
  3. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks).
    A. Bulat and G. Tzimiropoulos. In ICCV, 2017.
  4. Learning mappings for face synthesis from near infrared to visual light images.
    J. Chen, D. Yi, J. Yang, G. Zhao, S. Z. Li, and M. Pietikainen. In CVPR, 2009.
  5. Discovering hidden factors of variation in deep networks.
    B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. In ICLRW, 2015.
  6. Exprgan: Facial expression editing with controllable expression intensity.
    H. Ding, K. Sricharan, and R. Chellappa. In AAAI, 2018.
  7. Deepwarp: Photorealistic image resynthesis for gaze manipulation.
    Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky. In ECCV, 2016.
  8. Automatic face reenactment.
    P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormahlen, P. Perez, and C. Theobalt. In CVPR, 2014.
  9. Generative adversarial nets.
    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. In NIPS, 2014.
  10. Multi-pie.
    R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. In FG, pages 1–8, 2008.
  11. Densebox: Unifying landmark localization with end to end object detection.
    L. Huang, Y. Yang, Y. Deng, and Y. Yu. arXiv preprint arXiv:1509.04874, 2015.
  12. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis.
    R. Huang, S. Zhang, T. Li, and R. He. In ICCV, 2017.
  13. Image-to-image translation with conditional adversarial networks.
    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. In CVPR, 2017.
  14. Being john malkovich.
    I. Kemelmacher-Shlizerman, A. Sankar, E. Shechtman, and S. M. Seitz. In ECCV, 2010.
  15. Learning to discover cross-domain relations with generative adversarial networks.
    T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. In ICML, 2017.
  16. Photo-realistic single image super-resolution using a generative adversarial network.
    C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. In CVPR, 2017.
  17. A data-driven approach for facial expression synthesis in video.
    K. Li, F. Xu, J. Wang, Q. Dai, and Y. Liu. In CVPR, 2012.
  18. Deep identity-aware transfer of facial attributes.
    M. Li, W. Zuo, and D. Zhang. arXiv preprint arXiv:1610.05586, 2016.
  19. Expression-invariant face recognition with expression classification.
    X. Li, G. Mori, and H. Zhang. In CRV, 2006.
  20. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression.
    P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. In CVPRW, 2010.
  21. Active appearance models revisited.
    I. Matthews and S. Baker. IJCV, 60(2):135–164, 2004.
  22. Conditional generative adversarial nets.
    M. Mirza and S. Osindero. arXiv preprint arXiv:1411.1784, 2014.
  23. Visio-lization: generating novel facial images.
    U. Mohammed, S. J. D. Prince, and J. Kautz. In ACM SIGGRAPH, 2009.
  24. Realistic dynamic facial textures from a single image using gans.
    K. Olszewski, Z. Li, C. Yang, Y. Zhou, R. Yu, Z. Huang, S. Xiang, S. Saito, P. Kohli, and H. Li. In ICCV, 2017.
  25. Deep face recognition.
    O. M. Parkhi, A. Vedaldi, and A. Zisserman. In BMVC, 2015.
  26. Synthesizing realistic facial expressions from photographs.
    F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. In ACM SIGGRAPH Courses, 2006.
  27. Unsupervised representation learning with deep convolutional generative adversarial networks.
    A. Radford, L. Metz, and S. Chintala. arXiv preprint arXiv:1511.06434, 2015.
  28. Learning to disentangle factors of variation with manifold interaction.
    S. Reed, K. Sohn, Y. Zhang, and H. Lee. In ICML, 2014.
  29. Neural face editing with intrinsic image disentangling.
    Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. In CVPR, 2017.
  30. Generating facial expressions with deep belief nets.
    J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson. In Affective Computing. 2008.
  31. Mapping and manipulating facial expression.
    B.-J. Theobald, I. Matthews, M. Mangini, J. R. Spies, T. R. Brick, J. F. Cohn, and S. M. Boker. Language and speech, 52(2-3):369–386, 2009.
  32. Face2face: Real-time face capture and reenactment of rgb videos.
    J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. In CVPR, 2016.
  33. Joint training of a convolutional network and a graphical model for human pose estimation.
    J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. In NIPS, 2014.
  34. A lightened cnn for deep face representation.
    X. Wu, R. He, and Z. Sun. arXiv preprint arXiv:1511.02683, 2015.
  35. Facial expression editing in video using a temporally-smooth factorization.
    F. Yang, L. Bourdev, E. Shechtman, J. Wang, and D. Metaxas. In CVPR, 2012.
  36. Expression flow for 3d-aware face component transfer.
    F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas. In ACM Transactions on Graphics (TOG), 2011.
  37. Semantic facial expression editing using autoencoded flow.
    R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala. arXiv preprint arXiv:1611.09961, 2016.
  38. Dualgan: Unsupervised dual learning for image-to-image translation.
    Z. Yi, H. Zhang, P. Tan, and M. Gong. In ICCV, 2017.
  39. Geometry-driven photorealistic facial expression synthesis.
    Q. Zhang, Z. Liu, B. Guo, D. Terzopoulos, and H.-Y. Shum. TVCG, 12(1):48–60, 2006.
  40. Unpaired image-to-image translation using cycle-consistent adversarial networks.
    J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. In ICCV, 2017.