Generative Adversarial Networks in Human Emotion Synthesis: A Review
Synthesizing realistic data samples is of great value for both academic and industrial communities. Deep generative models have become an emerging topic in various research areas like computer vision and signal processing. Affective computing, a topic of broad interest in the computer vision community, has been no exception and has benefited from generative models. In fact, affective computing has seen a rapid proliferation of generative models during the last two decades. Applications of such models include, but are not limited to, emotion recognition and classification, unimodal emotion synthesis, and cross-modal emotion synthesis. Accordingly, we conduct a review of recent advances in human emotion synthesis by studying the available databases and the advantages and disadvantages of the generative models, along with the related training strategies, considering two principal human communication modalities, namely audio and video. In this context, facial expression synthesis, speech emotion synthesis, and audio-visual (cross-modal) emotion synthesis are reviewed extensively under different application scenarios. Finally, we discuss open research problems to push the boundaries of this research area in future work.
Keywords: Deep Learning, Generative Adversarial Networks, Human Emotion Synthesis, Speech Emotion Synthesis, Facial Emotion Synthesis, Cross-modal Emotion Synthesis
Deep learning techniques are best known for their success in uncovering the underlying probability distributions over various data types in the field of artificial intelligence, such as videos, images, audio samples, biological signals, and natural language corpora. The success of deep discriminative models owes primarily to the back-propagation algorithm and piecewise linear units (LeCun et al., 1998; Krizhevsky et al., 2012). In contrast, deep generative models (Goodfellow et al., 2014) have been less successful, because methods like maximum likelihood estimation involve probabilistic computations that are intractable and difficult to approximate.
Many reviews have studied the rapidly expanding topic of generative models, and specifically Generative Adversarial Networks (GAN), from various points of view. These range from algorithms, theory, and applications (Gui et al., 2020; Wang et al., 2017) and recent advances and developments (Pan et al., 2019; Zamorski et al., 2019) to comparative studies (Hitawala, 2018), GAN taxonomies (Wang et al., 2019b), and GAN variants (Hong et al., 2019; Creswell et al., 2018; Huang et al., 2018; Kurach et al., 2018). Also, a few review papers discuss the subject based on a specific application like medical imaging (Yi et al., 2019b), audio enhancement and synthesis (Torres-Reyes and Latifi, 2019), image synthesis (Wu et al., 2017), and text synthesis (Agnese et al., 2019). However, none of the existing surveys considers GANs in view of human emotion synthesis.
It is important to note that searching for the phrase "Generative Adversarial Network" on the Web of Science (WOS) and Scopus repositories reports 2538 and 4405 published documents, respectively, from 2014 up to the present. Figures 1(a) and 1(b) show the statistical results obtained from these repositories by searching for the aforementioned phrase. The large number of studies published on this topic within only 6 years inspired us to conduct a comprehensive review of one of the significant applications of GAN models, namely human emotion synthesis.
Synthesizing realistic data samples is of great value for both academic and industrial communities. Affective computing, a topic of broad interest in the computer vision community, benefits from human emotion synthesis and data augmentation. Throughout this paper, we concentrate on recent advances in GANs and their possible adoption in human emotion recognition, which is known to be useful in other research areas like computer-aided diagnosis systems, security and identity verification, multimedia tagging systems, and human-computer and human-robot interactions. Humans communicate through various verbal and nonverbal channels to show their emotional state. All of these communication modalities are important when interpreting the current emotional state of the user. In this paper, we focus on GAN-related works on speech emotion synthesis, face emotion synthesis, and audio-visual (cross-modal) emotion synthesis, because face and speech are known as the primary communication channels among humans (Schirmer and Adolphs, 2017; Ekman et al., 1988; Zuckerman et al., 1981; Mehrabian and Ferris, 1967). Researchers have developed many GAN-based models to address problems such as data augmentation, improvement of the emotion recognition rate, and enhancement of synthesized samples through unimodal (Ding et al., 2018; Choi et al., 2018; Tulyakov et al., 2018; Kervadec et al., 2018; Kim et al., 2018; Pascual et al., 2017; Latif et al., 2017; Gideon et al., 2019; Zhou and Wang, 2017; Wang and Wan, 2018) and cross-modal analysis (Duarte et al., 2019; Karras et al., 2017a; Jamaludin et al., 2019; Chen et al., 2017).
GANs, a specific type of neural network model, were introduced in 2014 by Goodfellow et al. (2014). The model is composed of a generative model pitted against an adversarial model in a two-player minimax framework. The generative model captures the data distribution. Then, given a sample, the adversary, or discriminator, decides whether the sample is drawn from the true data distribution (real) or from the model distribution (fake). The competition continues until the generated samples are indistinguishable from the genuine ones.
This review deals with the GAN-based algorithms, theory, and applications in human emotion synthesis and recognition. The remainder of the paper is organized as follows: Section 2 provides a brief introduction to GANs and their variations. This is followed by a comprehensive review of related works on human emotion synthesis tasks using GANs in section 3. This section covers unimodal and cross-modal GAN-based methods developed using audio/visual modalities. Finally, section 4 summarizes the review, identifies potential applications, and discusses challenges.
In general, generative models can be categorized into explicit density models and implicit density models. While the former use the true data distribution or its parameters to train the generative model, the latter generate sample instances without an explicit parametric assumption or direct estimation of the real data distribution. Examples of explicit density modeling are maximum likelihood estimation and Markov chain methods (Kingma and Welling, 2013; Rezende et al., 2014). GANs can be considered an example of implicit density modeling (Goodfellow et al., 2014).
2.1 Generative Adversarial Networks (GAN)
Goodfellow et al. proposed Generative Adversarial Networks, or the vanilla GAN, in 2014 (Goodfellow et al., 2014). The model is based on a two-player minimax game where one player seeks to maximize a value function and the other seeks to minimize it. The game ends at a saddle point that is a minimum with respect to one player's strategy and a maximum with respect to the other's. This model draws samples directly from the desired distribution without explicitly modeling the underlying probability density function. The general framework of this model consists of two neural network components: a generative model $G$ capturing the data distribution and a discriminative model $D$ estimating the probability that a sample comes from the training samples rather than from $G$.
Let us designate the input sample for $G$ as $z$, where $z$ is a random noise vector sampled from a prior distribution $p_z(z)$. Let us denote a real sample as $x$, taken from the data distribution $p_{data}(x)$. Also, we denote an output sample generated by $G$ as $G(z)$. Then, the idea is to get maximum visual similarity between the two samples. In fact, the generator $G$ learns a nonlinear mapping function parametrized by $\theta_g$ and formulated as $G(z; \theta_g)$. The discriminator $D$ receives both $x$ and $G(z)$ and outputs a single scalar value stating the probability that an input sample is real rather than generated (Goodfellow et al., 2014). It is important to highlight that $D(x; \theta_d)$ is the mapping function learned by $D$ and parametrized by $\theta_d$. The final distribution formed by the generated samples is $p_g$, and it is expected to approximate $p_{data}$ after learning. Figure 2(a) illustrates the general block diagram of the vanilla GAN model.
Given two distributions $p_{data}$ and $p_g$ on the same probability space $\mathcal{X}$, the KL divergence is as follows:
\[ KL(p_{data} \,\|\, p_g) = \int_{\mathcal{X}} p_{data}(x) \log \frac{p_{data}(x)}{p_g(x)} \, d\mu(x), \tag{1} \]
where both $p_{data}$ and $p_g$ are assumed to admit densities with respect to a common measure $\mu$ defined on $\mathcal{X}$. This happens when $p_{data}$ and $p_g$ are absolutely continuous, that is, $p_{data} \ll \mu$ and $p_g \ll \mu$. The KL divergence is asymmetric, i.e., $KL(p_{data} \,\|\, p_g) \neq KL(p_g \,\|\, p_{data})$, and possibly infinite when there are points such that $p_g(x) = 0$ and $p_{data}(x) > 0$ for $KL(p_{data} \,\|\, p_g)$. A more convenient approach for GANs is the Jensen-Shannon (JS) divergence, which may be interpreted as a symmetrical version of the KL divergence and is defined as follows:
\[ JS(p_{data}, p_g) = \tfrac{1}{2} KL(p_{data} \,\|\, p_m) + \tfrac{1}{2} KL(p_g \,\|\, p_m), \qquad p_m = \tfrac{1}{2}(p_{data} + p_g). \tag{2} \]
In other words, a minimax game between $G$ and $D$ continues to obtain a normalized and symmetrical score in terms of the value function $V(D, G)$ as follows:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{3} \]
Here, the parameters of $G$ are adjusted by minimizing $\log(1 - D(G(z)))$. In a similar way, the parameters of $D$ are adjusted by maximizing $\log D(x)$. Minimizing $V(D, G)$ with respect to $G$ is known (Goodfellow et al., 2014) to be equivalent, for an optimal discriminator, to minimizing the JS divergence between $p_{data}$ and $p_g$ as expressed in Eq. (2). The value function $V(D, G)$ determines the payoff of the discriminator. Also, the generator takes the value $-V(D, G)$ as its own payoff. The generator and the discriminator each attempt to maximize their own payoff (Goodfellow et al., 2016) during the learning process.
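The link between the value function and the JS divergence can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the distributions are small discrete vectors chosen by hand, and the function names (`kl`, `js`, `value_at_optimal_d`) are ours. It relies on the result from Goodfellow et al. (2014) that the optimal discriminator is the ratio of the data density to the sum of the data and generator densities, in which case the value function equals twice the JS divergence minus log 4.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p == 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """Value function evaluated at D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    p_data, p_g = np.asarray(p_data, float), np.asarray(p_g, float)
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.5, 0.3, 0.2])  # toy "real" distribution (assumed)
p_g = np.array([0.1, 0.3, 0.6])     # toy generator distribution (assumed)

v = value_at_optimal_d(p_data, p_g)
# Gap between V(D*, G) and 2 * JS - log 4; it should vanish up to rounding.
identity_gap = abs(v - (2.0 * js(p_data, p_g) - np.log(4.0)))
```

When the generator matches the data distribution exactly, the JS term is zero and the value function reaches its minimum of minus log 4.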
2.2 Challenges of GAN Models
The training objective of GAN models is often referred to as a saddle point optimization problem (Yadav et al., 2017), which is solved by gradient-based methods. One challenge here is that $G$ and $D$ should be trained simultaneously so that they advance and converge together. Minimizing the generator's objective is proven to be equivalent to minimizing the JS divergence only if the discriminator is trained to its optimal point before the next update of $G$. This means that minimizing the JS divergence does not guarantee finding the equilibrium point between $G$ and $D$ through the training process. In practice, this normally leads to $D$ outperforming $G$. Consequently, at some point classifying real and fake samples becomes such an easy task that the gradients of $D$ approach zero and it becomes ineffectual in the learning procedure of $G$. Mode collapse is another well-known problem in training GAN models, where $G$ produces a limited set of repetitive samples because it focuses on a few modes of the true data distribution $p_{data}$ while learning the approximating distribution $p_g$. We discuss these problems in more detail in section 4.1.
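The vanishing-gradient effect described above can be made concrete with a small numerical sketch. It assumes a discriminator whose output is a sigmoid of a logit, and the logit values are made up for illustration. For comparison it also evaluates the non-saturating generator loss suggested by Goodfellow et al. (2014), in which the generator maximizes the log of the discriminator output on generated samples instead; that alternative is not discussed in this section and is included only as a reference point.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical discriminator logits on generated samples; D(G(z)) = sigmoid(a).
# More negative logits mean the discriminator rejects the fakes more confidently.
logits = np.array([0.0, -2.0, -5.0, -10.0])
d_fake = sigmoid(logits)

# d/da log(1 - sigmoid(a)) = -sigmoid(a): vanishes as D becomes confident.
grad_saturating = -d_fake
# d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): stays close to -1 in the same regime.
grad_non_saturating = -(1.0 - d_fake)
```

In this toy setting the gradient of the original objective with respect to the logit shrinks towards zero exactly when the generator most needs a learning signal.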
2.3 Variants by Architectures
The GAN model can be extended to a conditional GAN (CGAN) model if both the generator and the discriminator are conditioned on some extra information (Mirza and Osindero, 2014). Figure 2(b) shows the block diagram of the CGAN model. The condition vector $c$ is fed into both the discriminator and the generator through an additional input layer. Here, the latent variable $z$ with prior density $p_z(z)$ and the condition vector $c$ are passed through one perceptron layer to learn the joint hidden representation. Conditioning on $c$ changes the training criterion of Eq. (3) and leads to the following criterion:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid c) \mid c))], \tag{4} \]
where $c$ could be target class labels or auxiliary information from other modalities.
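A minimal sketch of the conditioning step is given below, assuming the simplest variant in which the noise vector and a one-hot class label are concatenated and passed through a single perceptron layer with a ReLU activation. All sizes, the random weights, and the helper names are illustrative assumptions, not the configuration of Mirza and Osindero (2014).

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, num_classes):
    """Encode integer class ids as one-hot rows."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def joint_hidden(z, c, w, b):
    """One perceptron layer over the concatenated [z, c] input, followed by ReLU."""
    h = np.concatenate([z, c], axis=1) @ w + b
    return np.maximum(h, 0.0)

batch, z_dim, num_classes, hidden = 4, 8, 6, 16   # all sizes are illustrative
z = rng.standard_normal((batch, z_dim))           # latent noise drawn from the prior
c = one_hot(np.array([0, 2, 5, 2]), num_classes)  # condition: hypothetical emotion class ids
w = rng.standard_normal((z_dim + num_classes, hidden)) * 0.1
b = np.zeros(hidden)
h = joint_hidden(z, c, w, b)                      # joint hidden representation
```

The same label vector would also be supplied to the discriminator, so that both players see the condition.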
Another type of GAN model is the Laplacian Generative Adversarial Network (LAPGAN) (Denton et al., 2015), which is formed by combining CGAN models progressively within a Laplacian pyramid representation. LAPGAN includes a set of generative convolutional models, say $\{G_0, \dots, G_K\}$. The synthesis procedure consists of two parts: a sampling phase and a training phase. The sampling phase starts with the generator $G_K$, which takes a noise sample $z_K$ and generates the sample $\tilde{I}_K$. The generated sample is upscaled before being passed to the generator of the next level as a conditioning variable. $G_{K-1}$ takes both the upscaled version of $\tilde{I}_K$ and a noise sample $z_{K-1}$ to synthesize a difference sample $\tilde{h}_{K-1}$, which is added to the upscaled version of $\tilde{I}_K$. This process of upsampling and addition repeats across successive levels to yield a final full-resolution sample. Figure 3 illustrates the general block diagram of the LAPGAN model.
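The coarse-to-fine sampling loop can be sketched as follows. This is only a structural illustration: the two generator functions are untrained placeholders of our own, and nearest-neighbour upsampling stands in for the pyramid's expand operator.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(img):
    """Nearest-neighbour 2x upsampling (stand-in for the pyramid's expand operator)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def coarse_generator(z):
    """Placeholder for the coarsest-level generator: maps noise to a small image."""
    return np.tanh(z)

def residual_generator(z, cond):
    """Placeholder conditional generator: returns a difference image given noise and a condition."""
    return 0.1 * np.tanh(z + cond)

def lapgan_sample(levels, base_size):
    img = coarse_generator(rng.standard_normal((base_size, base_size)))
    for _ in range(levels):
        up = upsample(img)                    # conditioning variable for the next level
        z = rng.standard_normal(up.shape)     # fresh noise at this level
        img = up + residual_generator(z, up)  # add the synthesized difference image
    return img

sample = lapgan_sample(levels=3, base_size=4)  # 4x4 -> 8x8 -> 16x16 -> 32x32
```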
SGAN is a second example, formed by top-down stacked GAN models (Huang et al., 2017b) to address the low performance of GAN models in discriminative tasks with large variation in the data distribution. Huang et al. (2017b) employ the hierarchical representation of a discriminatively trained model by stitching GAN models in a top-down framework, forcing the top-most level to take class labels and the bottom-most one to generate images. Alternatively, instead of stacking GANs on top of each other, Karras et al. (2017b) increased the depth of both the generator and the discriminator by adding new layers. All these models are developed under the conditional GAN framework (Denton et al., 2015; Huang et al., 2017b; Karras et al., 2017b).
Other models modify the input to the generator slightly. For instance, in SPADE (Park et al., 2019) a segmentation mask is fed indirectly to the generator through an adaptive normalization layer instead of using the standard input noise vector $z$. Also, StyleGAN (Wang et al., 2018a) first injects $z$ into an intermediate latent space, which helps to avoid entanglement of the input latent space with the probability density of the training data.
In 2015, Radford et al. (2015) proposed the Deep Convolutional Generative Adversarial Network (DCGAN), in which both the generator and the discriminator are formed by a class of architecturally constrained convolutional networks. In this model, fully convolutional downsampling/upsampling layers replaced the fully connected layers of the vanilla GAN, along with other architectural restrictions such as using batch-normalization layers and the LeakyReLU activation function in all layers of the discriminator.
Another advancement in GAN models is the spectral normalization layer, which constrains the feature response by normalizing the weights of the discriminator network (Miyato et al., 2018). Residual connections are another approach brought into GAN models by Gulrajani et al. (2017) and Miyato et al. (2018). While models like CGANs incorporate the conditional information vector simply by concatenation, others remodeled the use of the conditional vector through a projection approach, leading to a significant improvement in the quality of the generated samples (Miyato and Koyama, 2018).
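To illustrate weight normalization by the spectral norm, the sketch below estimates the largest singular value of a weight matrix with power iteration and divides the matrix by it. It is a simplified, stand-alone version under our own assumptions: it runs many iterations from scratch on a random matrix, whereas practical implementations typically run a single iteration per training step and reuse the vector between steps.

```python
import numpy as np

def spectral_normalize(w, n_iters=100, seed=0):
    """Divide w by a power-iteration estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ w @ v)   # estimated spectral norm of w
    return w / sigma, sigma

w = np.random.default_rng(1).standard_normal((6, 4))  # toy discriminator weight matrix
w_sn, sigma = spectral_normalize(w)
```

After normalization the largest singular value of the weight matrix is approximately one, which bounds how strongly the layer can amplify its input.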
The aforementioned GAN models were built on Convolutional Neural Networks (CNN). Further along this line, a whole new research line of GAN models developed based on recent deep learning models called CapsuleNets (CapsNets) (Sabour et al., 2017). Let $v_k$ be the output vector of the final layer of a CapsNet that represents the presence of a visual entity by classifying it into one of the $K$ classes. Sabour et al. (2017) provide an objective function based on the CapsNet margin loss ($L_M$), which can be expressed as follows:
\[ L_M = \sum_{k=1}^{K} T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2, \tag{5} \]
where the margins $m^+$ and $m^-$ and the down-weighting factor $\lambda$ are set to 0.9, 0.1, and 0.5, respectively; the down-weighting stops initial learning from shrinking the lengths of the capsule outputs in the final layer. The length of each capsule in the final layer ($\|v_k\|$) can then be viewed as the probability of the image belonging to a particular class ($k$). Also, $T_k$ denotes the target label.
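The margin loss of Eq. (5) can be written in a few lines. The sketch below is a plain NumPy illustration with the constants 0.9, 0.1, and 0.5 quoted above; the capsule lengths are made-up inputs, and averaging over the batch is our own choice.

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """CapsNet margin loss summed over classes and averaged over the batch.

    lengths: (batch, K) capsule lengths in [0, 1]
    targets: (batch, K) one-hot target labels
    """
    lengths, targets = np.asarray(lengths, float), np.asarray(targets, float)
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.mean(np.sum(present + absent, axis=1)))

targets = np.array([[1, 0, 0]])
confident = margin_loss([[0.95, 0.05, 0.05]], targets)  # inside both margins
uncertain = margin_loss([[0.50, 0.50, 0.50]], targets)  # penalised on every class
```

The first example incurs no loss because the target capsule is longer than 0.9 and the others are shorter than 0.1; the second is penalised on all three classes, with the two absent classes down-weighted by 0.5.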
CapsuleGAN is a GAN model proposed by Jaiswal et al. (2018) based on CapsNet. The authors use a CapsNet in the discriminator as opposed to a conventional CNN. The final layer of this discriminator consists of a single capsule whose length represents the probability of the input being a real or a fake sample. They used the margin loss introduced in Eq. (5) instead of the binary cross-entropy loss for training. The training criterion of CapsuleGAN is then formulated as follows:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[-L_M(D(x), T = 1)] + \mathbb{E}_{z \sim p_z(z)}[-L_M(D(G(z)), T = 0)]. \tag{6} \]
Practically, the generator must be trained to minimize $L_M(D(G(z)), T = 1)$ rather than $-L_M(D(G(z)), T = 0)$. This eliminates the down-weighting factor $\lambda$ in $L_M$ when training the generator, which does not contain any capsules.
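The effect of flipping the label in the generator objective can be seen in a small sketch. The capsule lengths below are hypothetical discriminator outputs rather than results from Jaiswal et al. (2018), and the single-capsule `margin_loss` helper is our own simplification of the margin loss.

```python
import numpy as np

def margin_loss(length, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss for a single output capsule, averaged over the batch."""
    length = np.asarray(length, float)
    present = target * np.maximum(0.0, m_pos - length) ** 2
    absent = lam * (1 - target) * np.maximum(0.0, length - m_neg) ** 2
    return float(np.mean(present + absent))

# Hypothetical capsule lengths produced by the discriminator.
d_real = np.array([0.8, 0.95])  # D(x) on real samples
d_fake = np.array([0.3, 0.6])   # D(G(z)) on generated samples

# Discriminator: real samples labelled 1, generated samples labelled 0.
d_loss = margin_loss(d_real, 1) + margin_loss(d_fake, 0)
# Generator: label flipped to 1, so only the m_pos term is active and lam drops out.
g_loss = margin_loss(d_fake, 1)
```

Because the target is 1 in the generator loss, the term multiplied by the down-weighting factor is always zero, which is why the factor no longer influences the generator update.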
2.4 Variants by Discriminators
Stabilizing the training and avoiding the mode collapse problem can be achieved by employing different loss functions for $D$. An entropy-based loss, called Categorical GAN (CatGAN), is proposed by Springenberg (2015), in which the objective of the discriminator changes from real/fake classification to entropy-based class predictions. WGAN (Arjovsky et al., 2017) and its improved version WGAN-GP (Gulrajani et al., 2017) are two GAN models with a loss function based on the Wasserstein distance used in the discriminator. The Earth-Mover (EM) distance, or Wasserstein-1, is expressed as follows:
\[ W(p_{data}, p_g) = \inf_{\gamma \in \Pi(p_{data}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|], \tag{7} \]
where $\Pi(p_{data}, p_g)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are $p_{data}$ and $p_g$, respectively. Here, $\gamma(x, y)$ describes how much "mass" needs to be transported from $x$ to $y$ in order to transform the distribution $p_{data}$ into $p_g$. The EM distance is then the "cost" of the optimal transport plan.
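A one-dimensional toy example shows why the EM distance can be a more informative training signal than the JS divergence. The sketch below assumes histograms on an integer grid with unit spacing, where the Wasserstein-1 distance reduces to the summed absolute difference of the cumulative distributions; the point masses are hand-picked for illustration.

```python
import numpy as np

def wasserstein_1d(p, q):
    """W1 between two histograms on the integer grid 0, 1, ..., n-1 (unit spacing)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.abs(np.cumsum(p - q))))

def js(p, q):
    """Jensen-Shannon divergence with the convention 0 * log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def point_mass(index, n=8):
    """Distribution that puts all of its mass on a single grid point."""
    d = np.zeros(n)
    d[index] = 1.0
    return d

p = point_mass(0)
distances = [wasserstein_1d(p, point_mass(k)) for k in (1, 3, 5)]
divergences = [js(p, point_mass(k)) for k in (1, 3, 5)]
```

For distributions with disjoint support, the EM distance grows with the size of the shift, while the JS divergence stays at log 2 regardless of how far apart the two distributions are.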
Other models that benefit from a different loss metric are GAN based on Category Information (CIGAN) (Niu et al., 2018), hinge loss (Miyato et al., 2018), least-squares GAN (Mao et al., 2017), and f-divergence GAN (Nowozin et al., 2016). Further developments include replacing the encoder structure of the discriminator with an autoencoder structure. In this case, a new loss objective is defined for the discriminator, which corresponds to the autoencoder loss distribution instead of the data distribution. Examples of such GAN frameworks are Energy-based GAN (EBGAN) (Zhao et al., 2016) and Boundary Equilibrium GAN (BEGAN) (Berthelot et al., 2017). Figure 4 illustrates the block diagrams of GAN models developed by modifying the discriminator.
Another interesting GAN model, proposed by Chen et al. (2016), is the Information Maximizing Generative Adversarial Net (InfoGAN), which modifies the discriminator to output both the real/fake classification result and the semantic features of the input sample, as illustrated in Figure 4(c). The discriminator performs real/fake prediction while maximizing the mutual information between $G(z, c)$ and the conditional vector $c$. Other models like CIGAN (Niu et al., 2018) and ACGAN (Odena et al., 2017) focus on improving the quality of the generated samples by employing the class labels during synthesis and then impelling $D$ to provide entropy loss information as well as class probabilities. Figure 4(d) shows the structure of ACGAN.
2.5 Variants by Generators
The objective of the generator is to transform the noise input vector $z$ into a sample $G(z)$. In the standard vanilla GAN, this objective is achieved by successively improving the state of the generated sample. The procedure stops when the desired quality is reached. The Variational AutoEncoder GAN (VAEGAN) (Larsen et al., 2015) is arguably the most popular GAN model obtained by varying the generator architecture. The VAEGAN computes the reconstruction loss in a pixel-wise manner. The decoder network of the VAE outputs patterns resembling the true samples (see Figure 5(b)).
One challenge in designing GAN models is controlling the attributes of the generated data, known as a mode of the data. Using supplemental information leads to sample generation with control over the modification of the selected properties. The generator output then becomes $G(z, c)$, where $c$ denotes the supplemental information. GANs also lack the capability of interpreting the underlying latent space that encodes the input sample. ALI (Dumoulin et al., 2016) and BiGAN (Donahue et al., 2016) are proposed to resolve this problem by embedding an encoder network $E$ in the generator, as shown in Figure 5(a). Here, the discriminator performs real/fake prediction by distinguishing between the tuples $(x, E(x))$ and $(G(z), z)$. For this reason, the model can be categorized as a discriminator variant as well.
Other researchers developed generators to solve specific tasks. Isola et al. (2017) designed pix2pix as an image-to-image translation network to study relations between two visual domains, and Milletari et al. (2016) proposed VNet with a Dice loss for image segmentation. The disadvantage of such networks is that training requires aligned, paired samples. In 2017, Zhu et al. and Kim et al. found solutions for unpaired image-to-image translation using a cycle consistency loss and cross-domain relations, respectively. Here, the idea was to join two generators together to perform translation between sets of unpaired samples. Figures 6(a) and 6(b) show block diagrams of CycleGAN and pix2pix, respectively.
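The cycle consistency idea can be illustrated with a toy sketch: a sample translated to the other domain and back should return to itself. The two "generators" below are simple hand-written functions standing in for trained networks, and the L1 form of the reconstruction error is assumed here for illustration.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 reconstruction error after translating to the other domain and back."""
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))
    return float(forward + backward)

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))        # unpaired samples from domain X
y = rng.random((4, 8, 8)) * 2.0  # unpaired samples from domain Y

# Toy "generators": exact inverses give zero loss, mismatched ones do not.
loss_consistent = cycle_consistency_loss(x, y, lambda a: 2.0 * a, lambda b: 0.5 * b)
loss_inconsistent = cycle_consistency_loss(x, y, lambda a: 2.0 * a, lambda b: b)
```

No pairing between the samples in `x` and `y` is needed to evaluate this loss, which is what makes training on unpaired data possible.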
CycleGAN (Zhu et al., 2017) and UNIT are successful examples derived from the VAEGAN model. Figure 6(c) illustrates the layout of the UNIT framework. It is important to highlight that, considering the generators, the conditional input may vary from class labels (Mirza and Osindero, 2014) and text descriptions (Reed et al., 2016; Xu et al., 2018) to object location and encoded audio features or cross-modal correlations.
3 Applications in view of Human Emotion
In this section, we discuss applications of GAN models in human emotion synthesis. We categorize related works into unimodal and cross-modal studies based on the audio and video modalities to help the reader discover applications of interest without difficulty. Also, we explain each method in terms of the proposed algorithm and its advantages and disadvantages. Generally, applications of GANs for human emotion synthesis focus on two issues. The first is data augmentation, which obviates the tedious job of collecting and labeling large-scale databases, and the second is improving the performance of emotion recognition.
3.1 Facial Expression Synthesis
Facial expression synthesis using conventional methods confronts several important problems. First, most methods require paired training data, and second, the generated faces are of low resolution. Moreover, the diversity of the generated faces is limited. The works reviewed in this section are taken from computer-vision-related studies that focus on facial expression synthesis.
One of the foremost works on facial expression synthesis was the study by Susskind et al. (2008), which could embed constraints like "raised eyebrows" on generated samples. The authors build their framework upon a Deep Belief Network (DBN) that starts with two hidden layers of 500 units. The output of the second hidden layer is concatenated with the identity and a vector of the Facial Action Coding System (FACS) (Ekman and Friesen, 1978) to learn a joint model of them through a Restricted Boltzmann Machine (RBM) with 1000 logistic hidden units. The trained DBN model is then used to generate faces with different identities and facial Action Units (AU).
Later, with the advent of GAN models, DyadGAN is designed specifically for face generation and it can generate facial images of an interviewer conditioned on the facial expressions of their dyadic conversation partner. ExprGAN (Ding et al., 2018) is another model designed to solve the problems mentioned above. ExprGAN has the ability to control both the target class and the intensity of the generated expression from weak to strong without a need for training data with intensity values. This is achieved by using an expression controller module that encodes complex information like expression intensity to a real-valued vector and by introducing an identity preserving loss function.
Methods proposed before ExprGAN could synthesize facial expressions either by manipulating facial components in the input image (Yang et al., 2011; Mohammed et al., 2009; Yeh et al., 2016) or by using the target expression as a piece of auxiliary information (Susskind et al., 2008; Reed et al., 2014; Cheung et al., 2014). In 2017, Shu et al. and Zhou and Shi proposed two GAN-based models. Shu et al. learn a disentangled representation of inherent facial attributes by manipulating facial appearance, and Zhou and Shi synthesize the facial appearance of unseen subjects using AUs and a conditional adversarial autoencoder.
| Reference | Based on | Model | Loss | Database | Metric | Result | Purpose |
|---|---|---|---|---|---|---|---|
| Huang and Khan | CGAN | DyadGAN | L | D | M | - | P, P, P, P |
| Ding et al. | CGAN | ExprGAN | L, L, L, L, L | D | M | 84.72 | P, P, P, P |
| Song et al. | CGAN | G2GAN | L, L, L, L, L | D | M | 58.94 | P, P, P, P |
| Choi et al. | CycleGAN | StarGAN | L, L, L | D | M | 52.20 | P, P, P |
| Vielzeuf et al. | StarGAN | - | L, L, L | D | M | 3.4 | P, P, P |
| Lai and Lai | VAEGAN | - | L, L, L, L, L | D | M | 87.08 | P, P, P |
| Caramihale et al. | CycleGAN | - | L, L | D | M | 98.30 | P, P |
| Zhu et al. | CycleGAN | - | L, L | D | M | 94.71 | P |
| Lindt et al. | VAEGAN | - | L, L, L, L | D | M | 6.07 | P, P, P |
| Lu et al. | CycleGAN | AttGGAN | L, L, L | D | M | 0.92 | P, P, P |
| Zhang et al. | CGAN | FaceID | L, L, L | D | M | 97.01 | P, P, P |
| Peng and Yin | CycleGAN | ApprGAN | L, L, L | D | M | 0.95 | P |
| Shen et al. | Pix2Pix | TVGAN | L, L | D | M | 50.90 | P, P, P |
| He et al. | VAEGAN IcGAN | AttGAN* | L, L, L | D | M | 88.20 | P, P, P, P |
| Wang et al. | GAN | CompGAN* | L, L, L, L, L | D | M | 74.92 | P, P, P |
| Bozorgtabar et al. | CGAN | ExprADA* | L, L, L, L | D | M | 73.20 | P, P, P, P, P |
| Wang et al. | ACGAN | UNet GAN | L, L, L | D | M | 43.33 | P, P |
| Lee et al. | CycleGAN | CollaGAN | L, L, L, L | D | M | - | P, P, P, P, P |
| Shen et al. | StackGAN | FaceFeat | L, L, L, L | D | M | 97.62 | P, P |
| Deng et al. | GAN | UVGAN | L, L, L | D | M | 0.89 | P, P, P |
| Li et al. | GAN | - | L, L | D | M | 0.84 | P, P |
| Cheng et al. | BEGAN ChebNet CoMA | MeshGAN* | L | D | M | 1.43 | P, P |
Table 1 compares the reviewed publications based on the metrics, databases, loss functions, and purposes used by the researchers. Among the many variations of facial expression synthesis that followed those models, the GAN-based model proposed by Song et al. (2018), called G2GAN, was one of the most interesting early ones. G2GAN generates photo-realistic and identity-preserving images. Furthermore, it provides fine-grained control over the target expression and facial attributes of the generated images, like widening the smile of the subject or narrowing the eyes. The idea here is to feed the face geometry into the generator as a condition vector that guides the expression synthesis procedure. The model uses a pair of GANs: one removes the expression and the other synthesizes it. This design enables training on unpaired data.
| ID | Database | Subjects | Samples | Classes |
|---|---|---|---|---|
| D | Oulu-CASIA (Zhao et al.) | 80 | 2,880 | 6 |
| D | MULTI-PIE (Gross et al.) | 337 | 755,370 | 6 |
| D | CK+ (Lucey et al.) | 123 | 593 | 6 |
| D | BU-3DFE (Yin et al.) | 100 | 2,500 | 6 |
| D | RaFD (Langner et al.) | 67 | 1,068 | 6 |
| D | CelebA (Liu et al.) | 10,177 | 202,599 | 45 |
| D | EmotioNet (Quiroz et al.) | N/A | 1,000,000 | 23 |
| D | AffectNet (Mollahosseini et al.) | N/A | 450,000 | 6 |
| D | FER2013 (Goodfellow et al.) | N/A | 35,887 | 6 |
| D | SFEW (Dhall et al.) | N/A | 1,766 | 6 |
| D | JAFFE (Lyons et al.) | 10 | 213 | 6 |
| D | LFW (Huang et al.) | 5,749 | 13,233 | - |
| D | F2ED (Wang et al.) | 119 | 219,719 | 54 |
| D | MUG (Aifanti et al.) | 52 | 204,242 | 6 |
| D | Dyadic Dataset (Huang and Khan) | 31 | - | 8 |
| D | IRIS Dataset (Kong et al.) | 29 | 4,228 | 3 |
| D | MMI (Pantic et al.) | 31 | 236 | 6 |
| D | Driver emotion (Bozorgtabar et al.) | 26 | N/A | 6 |
| D | KDEF (Lundqvist et al.) | 70 | N/A | 6 |
| D | UVDB (Deng et al.) | 5,793 | 77,302 | - |
| D | 3dMD (Cheng et al.) | 12,000 | N/A | - |
| D | 4DFAB (Cheng et al.) | 180 | 1,800,000 | 6 |
D stands for Database; A: Audio, V: Visual, A-V: Audio-Visual
6: 6 basic expressions including angry, disgust, fear, happiness, sadness, and surprise
6: 6 classes including smile, surprised, squint, disgust, scream, and neutral
6: 6 + neutral and contempt
6: 6 + neutral
23: 6 basic expressions + 17 compound emotions
45: 5 landmark locations, 40 binary attributes
54: 54 emotion types (categories are not mentioned clearly in the source paper)
6: 6 + neutral; landmark point annotations are also provided
8: joy, anger, surprise, fear, contempt, disgust, sadness, and neutral
3: surprised, laughing, angry
StarGAN (Choi et al., 2018) is the first approach with a scalable solution for multi-domain image-to-image translation using a unified GAN model (i.e., only a single generator and discriminator). In this model, a domain is defined as a set of images sharing the same attribute, and attributes are facial features like hair color, gender, and age, which can be modified based on the desired value. For example, one can set the hair color to blond or brown and the gender to male or female. Likewise, Attribute editing GAN (AttGAN) (He et al., 2019) provides a GAN framework that can edit any attribute among a set of attributes for face images by employing an adversarial loss, a reconstruction loss, and attribute classification constraints. Also, DIAT (Li et al., 2016), CycleGAN (Zhu et al., 2017), and IcGAN (Perarnau et al., 2016) can be used as baseline models for comparison.
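One common way to let a single generator handle several domains is to tile the target attribute vector over the image grid and append it to the image as extra channels. The sketch below illustrates this input construction only; the attribute order, image size, and function name are hypothetical and are not taken from Choi et al. (2018).

```python
import numpy as np

def concat_domain_label(images, labels):
    """Tile each label vector over the spatial grid and append it as extra channels.

    images: (batch, channels, height, width)
    labels: (batch, num_attributes)
    """
    b, _, h, w = images.shape
    label_maps = np.broadcast_to(labels[:, :, None, None], (b, labels.shape[1], h, w))
    return np.concatenate([images, label_maps], axis=1)

images = np.zeros((2, 3, 16, 16))  # two RGB images (placeholder content)
# Hypothetical attribute order: [black hair, blond hair, brown hair, male, young]
labels = np.array([[0, 1, 0, 0, 1],
                   [0, 0, 1, 1, 1]], dtype=float)
generator_input = concat_domain_label(images, labels)  # shape (2, 8, 16, 16)
```

Changing the label vector while keeping the image fixed then requests a different target domain from the same generator.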
In 2018, G2GAN (Song et al.) was extended by Qiao et al. The authors derived a model based on VAEGANs to synthesize facial expressions given a single image and several landmarks through some transferring stages. Unlike ExprGAN, their model does not require the target class label of the generated image. Also, unlike G2GAN, it does not require the neutral expression of a specific subject as an intermediate level in the facial expression transfer procedure. While G2GAN and its extension focus on geometrical features to guide the expression synthesis procedure, Pumarola et al. (2018) use facial AUs as a one-hot vector to perform unsupervised expression synthesis while guaranteeing smooth transitions and training on unpaired samples.
Another VAEGAN-based model is the work of Lai and Lai (2018), where a novel optimization loss called symmetric loss is introduced. The symmetric loss helps preserve the symmetry of the face while translating from various head poses to the frontal view of the face. Similar to Lai and Lai is FaceID-GAN (Shen et al., 2018a), where, in addition to the two players of vanilla GANs and the symmetry information, a classifier of face identity is employed as a third player that competes with the generator by distinguishing the identities of the real and synthesized faces.
| Symbol | Description |
|---|---|
| L | adversarial loss presented by the discriminator (see section 2.4) |
| L | pixel-wise image reconstruction loss |
| L | identity preserving loss |
| L | loss of a regularizer |
| L | total variation regularizer loss |
| L | classification loss (expression) |
| L | feature matching loss |
| L | image reconstruction loss |
| L | symmetry loss to preserve the symmetrical property of the face |
| L | cross-entropy loss used to ensure correct pose of the face |
| L | bidirectional loss to avoid mode collapse |
| L | triplet loss to minimize the similarity between and |
| L | Structural Similarity Index loss that measures the image quality |
| L | learn the shape feature by minimizing the weighted distance |
| L | loss for a recurrent temporal predictor to predict future samples |
| L | 3D Morphable Model loss to ensure correct pose and expression |
| L | motion loss consisting of a VAE loss, a video reconstruction loss, and a KL divergence between the prior and posterior motion latent distributions |
| L | content loss consisting of a reconstruction loss for the current frame and a KL divergence between the prior and posterior content distributions |
Lai and Lai (2018) used a GAN to learn emotion-preserving representations. In the proposed approach, the generator transforms non-frontal facial images into frontal ones while preserving the identity and the emotional expression. Moreover, a recent publication (Vielzeuf et al., 2019) relies on a two-step GAN framework. The first component maps images to a 3D vector space. This vector is produced by a neural network and represents the emotion of the corresponding image. Then, a second component, a standard image-to-image translator, uses the 3D points obtained in the first step to generate different expressions. The proposed model provides fine-grained control over the synthesized discrete expressions through the continuous vector space representing arousal, valence, and dominance.
|M||ground truth, costly, not scalable|
|M||expression classification (accuracy)||downstream task|
|M||expression classification (error)||downstream task|
|M||identity classification (accuracy)||downstream task|
|M||Structural Similarity Index Measure (SSIM)||measures image quality degradation|
|M||Peak Signal to Noise Ratio (PSNR)||measures quality of representation|
|M||visual representation||downstream task|
|M||real/fake classification (accuracy)||downstream task|
|M||real/fake classification (error)||downstream task|
|M||attribute classification (accuracy)||downstream task|
|M||attribute classification (error)||downstream task|
|M||Average Content Distance (ACD)||content consistency of a generated video|
|M||Motion Control Score (MCS)||capability in motion generation control|
|M||Inception Score (IS)||measures quality and diversity of the generated samples|
|M||texture similarity score||measures texture similarity|
|M||identity classification (error)||downstream task|
It should be noted that a series of GAN models focus on 3D object/face generation. Examples of these models are the Convolutional Mesh Autoencoder (CoMA) (Ranjan et al., 2018), MeshGAN (Cheng et al., 2019), UVGAN (Deng et al., 2018), and MeshVAE (Litany et al., 2018). Despite the successful performance of GANs in image synthesis, they still fall short when dealing with 3D objects, and particularly with human face synthesis. Here, we compare synthesized images of the aforementioned methods qualitatively in Figures 7 and 8. Images are taken from the corresponding papers. As the images show, most of the generated samples suffer from blurring.
In addition to GAN-based models that synthesize single images, there are models that can generate an image sequence or a video/animation. Video GAN (VGAN) (Vondrick et al., 2016) and Temporal GAN (TGAN) (Saito et al., 2017) were the first two models in this research line. Although these models could learn a semantic representation of unlabeled videos, they could only produce fixed-length video clips. To address this limitation, Tulyakov et al. proposed MoCoGAN. MoCoGAN is composed of four sub-networks: a recurrent neural network, an image generator, an image discriminator, and a video discriminator. The image generator generates a video clip by sequentially mapping a sequence of vectors to a sequence of images.
|Reference||Base model||Method||Loss||Database||M||RS||RM|
|Pumarola et al.||CycleGAN||GANimation||L, L, L, L, L||D||M||-||P, P, P|
|Tulyakov et al.||CGAN||MoCoGAN*||L, L||D||M||0.201||P, P, P|
|Qiao et al.||VAEGAN||-||L, L, L||D||M||0.69||P, P, P, P|
|Geng et al.||CGAN||wg-GAN/ Elor et al.||L||D||M||62.00||P, P, P, P|
|Nakahira and Kawamoto||CGAN||DCVGAN||L||D||M||6.68||P|
|Yang et al.||-||PS/SCGAN*||L, L, L||D||M||1.92||P, P|
|Kim et al.||GAN||DVP/VDub||L||-||M||51.25||P, P, P|
|Bansal et al.||CycleGAN||RecycleGAN||L||-||M||76.00||P, P, P, P|
|Sun et al.||CGAN||2SVAN*||L, L||D||M||5.48||P, P|
|*/MoCoGAN||L, L||M||88.00||P, P|
- M: Metric, RS: Results, RM: Remarks
- *: denotes the method proposed by the authors; the other listed methods were implemented by the authors for the sake of comparison
- the results reported for expression classification accuracy (M) are obtained on the synthesized image datasets
- All papers provide a visual representation of the synthesized images (M)
While MoCoGAN uses content and motion, Depth Conditional Video generation (DCVGAN), proposed by Nakahira and Kawamoto (2019), utilizes both the optical information and the 3D geometric information to generate accurate videos using the scene dynamics. DCVGAN addresses two problems observed in MoCoGAN: the unnatural appearance of moving objects and the assimilation of objects into the background. Other methods like Warp-guided GAN (Geng et al., 2018) generate real-time facial animations from a single photo. The method instantly fuses facial details like wrinkles and creases to achieve a high-fidelity facial expression.
Recently, Yang et al. (2018) proposed a pose-guided method to synthesize human videos. This method relies on two components. First, a Pose Sequence Generative Adversarial Network (PSGAN) learns various motion patterns by conditioning on action labels. Second, a Semantic Consistent Generative Adversarial Network (SCGAN) generates image sequences (video) given the pose sequence generated by the PSGAN. The semantic consistency reduces the effect of noisy or abnormal poses, that is, of discrepancies between the generated and ground-truth poses. We refer to this method as PS/SCGAN in Table 5. It is worth mentioning that two recent and successful methods in video generation are MetaPix (Lee et al., 2019b) and MoCycleGAN (Chen et al., 2019), which use motion and temporal information for realistic video synthesis. However, these methods have not been tested for facial expression generation. Table 5 lists the models developed for video or animation generation.
One of the main goals of synthesis is to increase the number of available samples. Zhu et al. (2018) used GAN-based data augmentation to compensate for an imbalanced class distribution. The discriminator of the model is a CNN and the generator is based on CycleGAN. They report an increase of up to 10% in classification accuracy (M) when GAN-based data augmentation is used.
The objective functions, or optimization losses, fall into two groups: synthesis losses and classification losses. Although the definitions provided by the authors are not always clear, we tried to list all the different losses used by the authors and we propose a symbolic name for each to harmonize the literature. The losses are used in a general sense. That is, marking different papers with the classification loss (L7) in Table 1 does not necessarily mean that the exact same loss function is used. Rather, it shows that a classification loss contributes to the objective in some way. A comprehensive list of these functions is given in Table 3. Additionally, we compared some of the video synthesis models in Figure 9.
Evaluation metrics for generative models differ from one study to another for several reasons (Hitawala, 2018). First, the quality of a synthesized sample is a perceptual concept and, as a result, it cannot be expressed accurately. Researchers usually provide their best synthesized samples for visual comparison, and thus problems like mode drop are not revealed by qualitative inspection. Second, employing human annotators to judge the visual quality can cover only a limited number of data samples. Specifically, in topics such as human emotion, experts are required for accurate annotation with the least possible labeling error. Hence, approaches like Amazon Mechanical Turk are less reliable when classification depends on those labels. Third, general metrics like photometric error, geometric error, and inception score are not reported in all publications (Salimans et al., 2016). These problems make the comparison among papers either unfair or impossible.
The Inception Score (IS) can be computed as follows:
$$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{KL}\big(p(y \mid x) \,\|\, p(y)\big)\right]\right)$$
where $x$ denotes the generated sample drawn from the model distribution $p_g$, $y$ is the label predicted by an arbitrary classifier, and $D_{KL}$ is the KL divergence that measures the distance between probability distributions, as defined in Eq. (1). Based on this score, an ideal model produces samples that are as close as possible to real data samples. In fact, KL divergence is the de facto standard for training and evaluating generative models.
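As an illustration of the Inception Score computation, the following minimal sketch takes the classifier outputs $p(y \mid x)$ for a set of generated samples and returns the exponentiated mean KL divergence to the marginal $p(y)$. The function name `inception_score` and the toy probability tables are assumptions made for this example; in practice the probabilities would come from a pretrained classifier, which is not modeled here.

```python
import math

def inception_score(probs):
    """probs: list of rows, each row holding p(y|x) for one generated sample."""
    n = len(probs)
    k = len(probs[0])
    # Marginal label distribution p(y), averaged over the generated samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Mean KL(p(y|x) || p(y)); entries with p(y|x) = 0 contribute nothing.
    mean_kl = sum(
        p * math.log(p / marginal[j])
        for row in probs
        for j, p in enumerate(row)
        if p > 0
    ) / n
    return math.exp(mean_kl)

# Toy inputs (hypothetical): confident and diverse predictions vs. uninformative ones.
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_diverse))
print(inception_score(uninformative))
```

With two classes, the score ranges from 1 (every sample receives the same prediction as the marginal) to 2 (confident predictions spread evenly over both classes).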
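Other widely used evaluation metrics are the Structural Similarity Index Measure (SSIM) and the Peak Signal to Noise Ratio (PSNR). For two images $x$ and $y$, SSIM is expressed as follows:
$$\mathrm{SSIM}(x, y) = l(x, y)\, c(x, y)\, s(x, y)$$
where $l$, $c$, and $s$ are luminance, contrast, and structure, and they can be formulated as:
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$
Here $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ denote the means and standard deviations of pixel intensity in a local image patch, where the patch is superimposed so that its center coincides with the center of the image. Typically, a patch is considered as a square neighborhood of $n \times n$ pixels. Also, $\sigma_{xy}$ is the sample covariance between corresponding pixels in that patch, so that $s$ behaves like a correlation coefficient. $C_1$, $C_2$, and $C_3$ are small constants added for numerical stability.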
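The three SSIM factors can be sketched for a single patch as follows. This is a simplified illustration, not a full windowed implementation: it treats the whole input as one patch, and the constants follow a commonly used convention ($C_1 = (0.01 \cdot \mathrm{MAX})^2$, $C_2 = (0.03 \cdot \mathrm{MAX})^2$, $C_3 = C_2 / 2$), which is an assumption not stated in the text. The function name `ssim_patch` is hypothetical.

```python
import math

def ssim_patch(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM for two equally sized patches given as flat lists of intensities."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased sample variances and covariance over the patch.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    # Stabilizing constants (assumed convention, see lead-in).
    c1 = (k1 * max_val) ** 2
    c2 = (k2 * max_val) ** 2
    c3 = c2 / 2.0
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure

patch = [0.0, 50.0, 100.0, 150.0]
print(ssim_patch(patch, patch))        # identical patches give a score of (about) 1
print(ssim_patch(patch, patch[::-1]))  # reversed structure gives a lower score
```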
PSNR, the peak signal-to-noise ratio, assesses the quality between two monochrome images $x$ and $y$. Let $x$ and $y$ be the generated image and the real image, respectively. Then, PSNR is:
$$\mathrm{PSNR}(x, y) = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)}\right)$$
where MAX is the maximum possible pixel value of the image and MSE stands for Mean Square Error. PSNR is measured in dB, and generated images of better quality yield a higher PSNR.
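A minimal sketch of the PSNR computation for two monochrome images stored as flat lists is given below. The function name `psnr` and the default `max_val=255.0` (8-bit images) are assumptions for this example.

```python
import math

def psnr(generated, real, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((g - r) ** 2 for g, r in zip(generated, real)) / len(generated)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

real = [10.0, 20.0, 30.0, 40.0]
slightly_off = [11.0, 21.0, 31.0, 41.0]   # every pixel differs by 1
print(psnr(slightly_off, real))
```

Because the MSE appears in the denominator, a smaller reconstruction error translates directly into a higher PSNR.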
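In addition to the metrics that evaluate the generated image, the Generative Adversarial Metric (GAM) proposed by Im et al. (2016) compares two GAN models by engaging them in a rivalry. In this metric, GAN models $M_1$ and $M_2$ are first trained. Then, model $M_1$ competes with model $M_2$ in a test phase by having $G_1$ try to fool the discriminator of $M_2$, and vice versa. In the end, two ratios are calculated using the discriminative scores of these models as follows:
$$r_{test} = \frac{\epsilon\big(D_1(x_{test})\big)}{\epsilon\big(D_2(x_{test})\big)}, \qquad r_{samples} = \frac{\epsilon\big(D_1(G_2(z))\big)}{\epsilon\big(D_2(G_1(z))\big)}$$
where $G_1$, $D_1$, $G_2$, and $D_2$ are the generators and the discriminators of $M_1$ and $M_2$, respectively. In Eq. (12), $\epsilon(\cdot)$ outputs the classification error rate. The test ratio $r_{test}$ shows which model generalizes better because it discriminates based on the test data $x_{test}$. The sample ratio $r_{samples}$ shows which model fools the other more easily because the discriminators classify based on the synthesized samples of the opponent. The sample ratio and the test ratio can be used to decide the winning model:
$$\text{winner} = \begin{cases} M_1 & \text{if } r_{samples} < 1 \text{ and } r_{test} \simeq 1 \\ M_2 & \text{if } r_{samples} > 1 \text{ and } r_{test} \simeq 1 \\ \text{tie} & \text{otherwise} \end{cases}$$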
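The GAM decision can be sketched as a small function over four classification error rates. The rule used here follows the description attributed to Im et al. (2016): the test ratio must be close to 1 for the comparison to be meaningful, and the sample ratio then picks the winner. The function name `gam_winner`, its argument names, and the tolerance `tol` that operationalizes "close to 1" are assumptions for this example.

```python
def gam_winner(err_d1_test, err_d2_test, err_d1_on_g2, err_d2_on_g1, tol=0.1):
    """Decide the GAM battle from discriminator error rates (all assumed non-zero).

    err_d1_test / err_d2_test : error of each discriminator on held-out real data
    err_d1_on_g2              : error of D1 on samples generated by G2
    err_d2_on_g1              : error of D2 on samples generated by G1
    """
    r_test = err_d1_test / err_d2_test
    r_samples = err_d1_on_g2 / err_d2_on_g1
    # If the discriminators generalize very differently, the comparison is unfair.
    if abs(r_test - 1.0) > tol:
        return "tie"
    if r_samples < 1.0:
        return "M1"   # G1 fools D2 more often than G2 fools D1
    if r_samples > 1.0:
        return "M2"
    return "tie"

print(gam_winner(0.10, 0.10, 0.05, 0.20))
```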
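To measure the texture similarity, Peng and Yin (2019) simply calculated the correlation coefficient between $T_s$ and $T_g$, which are the texture of the synthesized image and the texture of its corresponding ground truth, respectively. Let $\rho$ be the texture similarity score. Then, the mathematical representation is as follows:
$$\rho = \frac{\sum_{i,j}\big(T_s(i,j) - \mu_s\big)\big(T_g(i,j) - \mu_g\big)}{\sqrt{\sum_{i,j}\big(T_s(i,j) - \mu_s\big)^2 \, \sum_{i,j}\big(T_g(i,j) - \mu_g\big)^2}}$$
where $(i, j)$ specifies pixel coordinates in the texture images, and $\mu_s$ and $\mu_g$ are the mean values of $T_s$ and $T_g$, respectively.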
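Since the texture similarity score is described as a correlation coefficient between two texture images, it can be sketched as a Pearson correlation over all pixel coordinates. The function name `texture_similarity` and the nested-list image representation are assumptions for this example; a constant texture (zero variance) is not handled.

```python
import math

def texture_similarity(tex_synth, tex_truth):
    """Pearson correlation between two textures given as 2D lists of intensities."""
    a = [v for row in tex_synth for v in row]
    b = [v for row in tex_truth for v in row]
    mu_a = sum(a) / len(a)
    mu_b = sum(b) / len(b)
    num = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    den = math.sqrt(
        sum((p - mu_a) ** 2 for p in a) * sum((q - mu_b) ** 2 for q in b)
    )
    return num / den

truth = [[1.0, 2.0], [3.0, 4.0]]
brighter = [[11.0, 12.0], [13.0, 14.0]]   # same texture with an intensity offset
print(texture_similarity(brighter, truth))
```

The score is invariant to a global brightness offset or positive scaling of either texture, so it compares spatial patterns rather than absolute intensity.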
Other important metrics include the Fréchet Inception Distance (FID), Maximum Mean Discrepancy (MMD), the Wasserstein Critic, Tournament Win Rate and Skill Rating, and the Geometry Score. FID works by embedding the set of synthesized samples into a new feature space using a certain layer of a CNN architecture. Then, the mean and covariance are estimated for both the synthesized and the real data distributions, based on the assumption that the embedded features follow a continuous multivariate Gaussian distribution. Finally, the FID, or Wasserstein-2 distance between these Gaussians, is used to quantify the quality of the generated samples:
$$\mathrm{FID}(r, g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
Here, $(\mu_g, \Sigma_g)$ and $(\mu_r, \Sigma_r)$ represent the mean and covariance of the generated and real data distributions, respectively. A lower FID score indicates a smaller distance between the two distributions. MMD focuses on the dissimilarity between the two probability distributions by taking samples from each distribution independently. The kernel MMD is expressed as follows:
$$\mathrm{MMD}^2(p_r, p_g) = \mathbb{E}_{x, x' \sim p_r}\left[k(x, x')\right] - 2\,\mathbb{E}_{x \sim p_r,\, y \sim p_g}\left[k(x, y)\right] + \mathbb{E}_{y, y' \sim p_g}\left[k(y, y')\right]$$
where $k$ is some fixed characteristic kernel function, like the Gaussian kernel $k(x, x') = \exp\left(-\|x - x'\|^2\right)$, that is used to measure the MMD dissimilarity between the generated and real data distributions. Also, $x$ and $x'$ are randomly drawn samples from the real data distribution, i.e., $x, x' \sim p_r$. Similarly, $y$ and $y'$ are randomly drawn from the model distribution, i.e., $y, y' \sim p_g$.
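To make the FID and kernel MMD definitions concrete, the sketch below evaluates both on scalar features. Two simplifications are assumed: the features are one-dimensional, so the matrix square root in FID reduces to a product of standard deviations, and the MMD is estimated with a simple biased estimator that averages the kernel over all sample pairs. The function names, the bandwidth parameter `bandwidth`, and the toy samples are hypothetical.

```python
import math

def fid_1d(real, generated):
    """Frechet distance between two 1-D Gaussians fitted to scalar features."""
    mu_r = sum(real) / len(real)
    mu_g = sum(generated) / len(generated)
    var_r = sum((v - mu_r) ** 2 for v in real) / len(real)
    var_g = sum((v - mu_g) ** 2 for v in generated) / len(generated)
    # In 1-D, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) = var_r + var_g - 2 sd_r sd_g.
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(real, generated, kernel=gaussian_kernel):
    """Biased estimate of squared kernel MMD between two scalar sample sets."""
    def mean_kernel(xs, ys):
        return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
    return (mean_kernel(real, real)
            - 2.0 * mean_kernel(real, generated)
            + mean_kernel(generated, generated))

real = [0.0, 2.0]
shifted = [1.0, 3.0]          # same spread, mean shifted by 1
print(fid_1d(real, shifted))
print(mmd_squared(real, shifted))
```

Both quantities are zero when the two sample sets coincide and grow as the distributions move apart.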
The Wasserstein Critic provides an approximation of the Wasserstein distance between the model distribution and the real data distribution. Let $p_r$ and $p_g$ be the real data and the model distributions, then:
$$W(p_r, p_g) \propto \max_{f} \; \mathbb{E}_{x \sim p_r}\left[f(x)\right] - \mathbb{E}_{x \sim p_g}\left[f(x)\right]$$
where $f: \mathbb{R}^D \rightarrow \mathbb{R}$ is a Lipschitz continuous function. In practice, the critic $f$ is a neural network with clipped weights and bounded derivatives (Borji, 2019). The distance is then approximated by training the critic to achieve high values for real samples and low values for generated ones:
$$\hat{W}(x_{test}, x_g) = \frac{1}{N}\sum_{i=1}^{N} \hat{f}\big(x_{test}[i]\big) - \frac{1}{N}\sum_{i=1}^{N} \hat{f}\big(x_g[i]\big)$$
where $x_{test}$ is a batch of testing samples, $x_g$ is a batch of generated samples, both of size $N$, and $\hat{f}$ is the independent critic. An alternative version of this score is known as the Sliced Wasserstein Distance (SWD), which estimates the Wasserstein-1 distance (see Eq. (7)) between real and generated images. SWD computes the statistical similarity between local image patches extracted from Laplacian pyramid representations of the images (Karras et al., 2017b).
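The batch estimate of the Wasserstein critic score amounts to a difference of two means of critic outputs. The sketch below assumes an already trained critic passed in as a plain Python callable; the clipped identity function used as `toy_critic` is only a stand-in that happens to be 1-Lipschitz, not a trained network, and all names are hypothetical.

```python
def wasserstein_critic_score(critic, test_batch, generated_batch):
    """Mean critic value on held-out real samples minus mean on generated ones."""
    mean_real = sum(critic(x) for x in test_batch) / len(test_batch)
    mean_fake = sum(critic(x) for x in generated_batch) / len(generated_batch)
    return mean_real - mean_fake

def toy_critic(x):
    # Stand-in for a trained critic: identity clipped to [-1, 1] (1-Lipschitz).
    return max(-1.0, min(1.0, x))

print(wasserstein_critic_score(toy_critic, [1.0, 1.0], [-1.0, 0.0]))
```

A larger score means the critic separates held-out real samples from generated ones more easily, which suggests a larger gap between the two distributions.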
For video generation, content consistency is evaluated with the Average Content Distance (ACD), defined as the average pairwise distance between the per-frame average feature vectors. In addition, the Motion Control Score (MCS) is suggested for assessing the motion generation ability of the model. Here, a spatio-temporal CNN is first trained on a training dataset. Then, this model classifies the generated videos to verify whether each generated video contains the required motion (e.g., action/expression).
Other metrics include but are not limited to identification classification, true/false acceptance rate (Song et al., 2018), expression classification accuracy/error (Ding et al., 2018), real/fake classification accuracy/error (Ding et al., 2018), attribute editing accuracy/error (He et al., 2019), and Fully Convolutional Networks. A list of the evaluation metrics used in the reviewed publications is given in Table 4. For a comprehensive list of evaluation metrics for GAN models, we invite the reader to study "Pros and Cons of GAN Evaluation Measures" by Borji (2019).
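The ACD computation can be sketched as an average over all frame pairs of a video. The choice of Euclidean distance and the function name `average_content_distance` are assumptions for this example; the per-frame feature vectors would come from a feature extractor that is not modeled here.

```python
import math
from itertools import combinations

def average_content_distance(frame_features):
    """Average pairwise Euclidean distance between per-frame feature vectors."""
    pairs = list(combinations(frame_features, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical per-frame feature vectors for a three-frame clip.
frames = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
print(average_content_distance(frames))
```

A value of zero means every frame has the same content features, so lower values indicate better content consistency across the clip.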
Synthesis models are proposed with different aims and purposes. Texture synthesis, image super-resolution, and image in-painting are some applications. For face synthesis, the most important goal is data augmentation for improved recognition performance. A complete list of such purposes and the model properties is given in Table 6.
|purpose or characteristic|
|P||is tested for data augmentation|
|P||preserves the identity of the subject|
|P||controls the expression intensity|
|P||generates images with high quality (photo-realistic)|
|P||employs geometrical features as the conditional vector|
|P||employs facial action units as the conditional vector|
|P||employs a unified framework for multi-domain tasks|
|P||generates facial expressions by combining appearance and geometric features|
|P||provides a smooth transition of facial expression|
|P||supports training with unpaired samples|
|P||employs arousal-valence and dominance-like inputs|
|P||preserves the emotion of the subject|
|P||employs arbitrary head poses and applies face frontalization|
|P||modifies facial attributes based on a desired value|
|P||generates image sequences (video)|
|P||employs multiple inputs|
|P||modifies image attributes like illumination and background|
|P||is tested for data imputation|
|P||generates video animation using a single image|
|P||generates visible faces from thermal face image|
|P||employs temporal/motion information for video generation|
|P||supports unsupervised learning|
Despite the numerous publications on image and video synthesis, some problems have not yet been solved thoroughly. For example, generating high-resolution samples is an open research problem. The output is usually blurry or impaired by checkerboard artifacts. Results obtained for video generation or for the synthesis of 3D samples are still far from realistic. Also, it is important to highlight that more publications focus on expression classification than on identity recognition.
3.2 Speech Emotion Synthesis
Research efforts focusing on synthesizing speech with emotional effect have continued for more than a decade now. One application of GAN models in speech synthesis is speech enhancement. A pioneering GAN-based model developed for raw speech generation and enhancement is the Speech Enhancement GAN (SEGAN) proposed by Pascual et al. (2017). SEGAN provides a quick non-recursive framework that works End-to-End (E2E) with raw audio. Learning from different speakers and noise types and incorporating that information into a shared parameterization is another contribution of the proposed model. Similar to SEGAN, Macartney and Weyde (2018) propose a model for speech enhancement based on a CNN architecture called Wave-UNet. The Wave-UNet has been used successfully for audio source separation in music and for speech de-reverberation. Similar to section 3.1, we compare the results of the reviewed papers in Table 7. Additionally, Tables 8 to 11 present the databases, loss functions, assessment metrics, and characteristics used in speech synthesis.
Sahu et al. (2018) made a two-fold contribution. First, they train a simple GAN model to learn a high-dimensional feature vector through the distribution of a lower-dimensional representation. Second, a cGAN is used to learn the distribution of the high-dimensional feature vectors by conditioning on the emotional label of the target class. Eventually, the generated feature vectors are used to assess the improvement in emotion recognition. They report that adding synthesized samples generated by the cGAN to the training set is helpful. They also conclude, from experiments that use synthesized samples in the test set, that estimating a lower-dimensional distribution is easier than estimating a complex high-dimensional one. Employing the synthesized feature vectors from the IEMOCAP database in a cross-corpus experiment on emotion classification of the MSP-IMPROV database is reported to be successful.
Mic2Mic (Mathur et al., 2019) is another example of a GAN-based model for speech enhancement. This model addresses a challenging problem called microphone variability. The Mic2Mic model disentangles the variability problem from the downstream speech recognition task and minimizes the need for training data. Another advantage is that it works with unlabeled and unpaired samples from various microphones. The model treats microphone variability as a data translation problem from one microphone to another, with the goal of reducing the domain shift between the training and the test data. It is built on CycleGAN to ensure that an audio sample from microphone A is translated to a corresponding sample from microphone B (Mathur et al., 2019).
|Reference||Base model||Method||Loss||Database||M||RS||RM|
|Pascual et al.||CGAN||SEGAN||L||D||M||2.16||P|
|Macartney and Weyde||WaveUNet||-||L||D||M||2.41||P|
|Mathur et al.||CycleGAN||Mic2Mic||L, L, L||D||M||89.00||P|
|Hsu et al.||WGAN CVAE GAN||VAWGAN||L||D||M||3.00||P|
|Latif et al.||GAN||-||L||D||M||47.87||P, P|
|Kameoka et al.||CVAE||CVAE-VC||L, L, L||D||M||92.00||P|
|Kameoka et al.||StarGAN CycleGAN||StarGAN- VC||L, L, L, L||D||P|
|Kaneko and Kameoka||CycleGAN||CycleGAN- VC||L, L, L||D||M||2.4||P|
|Kaneko et al.||CycleGAN||CycleGAN- VC2||L, L, L||D||M||3.1||P, P|
|Tanaka et al.||CycleGAN SEGAN||Wave- CycleGAN- VC||L, L||D||M||4.18||P|
|Tanaka et al.||CycleGAN||Wave- CycleGAN- VC2||L, L, L||D||M||4.29||P, P|
|Kameoka et al.||-||ConvS2S||L, L, L|