Deep Facial Expression Recognition: A Survey


Shan Li and Weihong Deng

The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China. E-mail: {ls1995, whdeng}

With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic problems. First, we describe the standard pipeline of a deep FER system with the related background knowledge and suggestions of applicable implementations for each stage. We then introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for these datasets. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.

Facial Expression Recognition, Affect, Deep Learning, Survey.

1 Introduction

Facial expression is one of the most powerful, natural and universal signals for human beings to convey their emotional states and intentions [1, 2]. Numerous studies have been conducted on automatic facial expression analysis (AFEA) because of its practical importance in sociable robotics, medical treatment, driver fatigue surveillance, and many other human-computer interaction systems. As early as the twentieth century, Ekman and Friesen [3, 4] defined six basic emotions based on a cross-cultural study. These prototypical facial expressions are anger, disgust, fear, happiness, sadness, and surprise. Contempt was subsequently added as one of the basic emotions [5]. Due to these pioneering investigations and the direct and intuitive definition of facial expressions, this categorical model is still the most popular perspective for AFEA.

FER systems can be divided into two main categories according to the feature representations: static image FER and dynamic sequence FER. In static-based methods, the feature representation is encoded with only spatial information from the current single image, whereas dynamic-based methods consider the temporal relation among contiguous frames in the input facial expression sequence. Based on these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal systems to assist the recognition of expression.

The majority of the traditional methods have used handcrafted features or shallow learning (e.g., local binary patterns (LBP) [6], LBP on three orthogonal planes (LBP-TOP) [7], non-negative matrix factorization (NMF) [8] and sparse learning [9]) for FER. However, since 2013, emotion recognition competitions such as FER2013 [10] and Emotion Recognition in the Wild (EmotiW) [11, 12, 13, 14, 15] have collected relatively sufficient training data from challenging real-world scenarios, which implicitly promote the transition of FER from lab-controlled to in-the-wild settings. Conventional handcrafted features have been reported to be incapable of addressing the great diversity of factors unrelated to facial expressions. Recently, due to the dramatically increased chip processing abilities (e.g., GPU units) and well-designed network architecture, studies in various fields have begun to transfer to deep learning methods, which have achieved the state-of-the-art recognition accuracy and exceeded previous results by a large margin (e.g., [16, 17, 18, 19]). Likewise, deep learning techniques have increasingly been implemented to handle the challenging factors for emotion recognition in the wild.

Fig. 1: The evolution of facial expression recognition in terms of datasets and methods.

Exhaustive surveys on automatic expression analysis have been published in recent years [20, 21, 22, 23]. These surveys have established a set of standard algorithmic pipelines for FER. However, they focus on traditional methods, and deep learning has rarely been reviewed. Therefore, in this paper, we focus our research on deep learning technology for facial emotion recognition tasks based on both static images and videos (image sequences). We aim to give a newcomer to this field an overview of the systematic framework and prime skills for deep FER.

Despite the powerful feature learning ability of deep learning, problems remain when it is applied to FER. First, deep neural networks require a large amount of training data to avoid overfitting. However, the existing facial expression databases are not sufficient to train the well-known neural networks with deep architectures that have achieved the most promising results in object recognition tasks. Additionally, high inter-subject variations exist due to different personal attributes, such as age, gender, ethnic background and level of expressiveness [24]. In addition to subject identity bias, variations in pose, illumination and occlusions are common in unconstrained facial expression scenarios. These factors are nonlinearly coupled with facial expressions and therefore strengthen the requirement of deep networks to address the large intra-class variability and to learn effective expression-specific representations.

In this paper, we introduce recent advances in research on solving the above problems for deep FER. We examine the state-of-the-art results that have not been reviewed in previous survey papers. The rest of this paper is organized as follows. Section 2 identifies three main steps required in a deep expression recognition system and describes the related background. Frequently used expression databases are introduced in Section 3. Section 4 provides a detailed review of novel neural network architectures and special network training tricks designed for FER based on static images and dynamic image sequences. We then cover additional related issues and other practical scenarios in Section 5. Section 6 discusses some of the challenges and opportunities in this field and identifies potential future directions.

2 Deep facial expression recognition

In this section, we describe the three main steps that are common in automatic deep FER, i.e., pre-processing, deep feature learning and deep feature classification. We briefly summarize the widely used algorithms for each step and recommend the existing state-of-the-art best practice implementations according to the referenced papers.

Fig. 2: The general pipeline of deep facial expression recognition systems.

2.1 Pre-processing

Variations that are irrelevant to facial expressions, such as different backgrounds, illuminations and head poses, are fairly common in unconstrained scenarios. Therefore, before training the deep neural network to learn meaningful features, pre-processing is required to align and normalize the visual semantic information conveyed by the face.

2.1.1 Face alignment

Face alignment is a traditional pre-processing step in many face-related recognition tasks. We list some well-known approaches and publicly available implementations that are used in deep FER. For more details, readers can refer to [25], which provides specific principles of face detection, facial landmark localization and face registration.

Given a series of training data, the first step is to detect the face and then to remove background and non-face areas. The Viola-Jones (V&J) face detector [26] is a classic and widely employed method for face detection that is publicly available in many implementations (e.g., OpenCV and Matlab). This face detector is robust and computationally simple for detecting near-frontal faces. Given the detected face bounding box, raw images can be cropped to obtain the face area. Although face detection is the only indispensable procedure to enable feature learning, further face alignment can substantially enhance the FER performance [27]. Based on the coordinates of localized landmarks, faces can be registered into a predefined uniform template with an affine transformation. This step is crucial because it can reduce the variation in face scale and in-plane rotation. The most widely used implementation for facial alignment is IntraFace [28], which has been employed in many deep facial expression recognition systems (e.g., [29, 30, 27, 31]). The software applies a regression-based facial landmark localization method, i.e., the supervised descent method (SDM) [32], and provides 49 accurate facial landmark points covering the two eyes, the nose, the mouth, and the two eyebrows. Many other studies (e.g., [33, 34, 35]) have used Zhu and Ramanan’s mixtures of trees (MoT) structured models [36] for landmark localization. Other effective open-source algorithms, such as discriminative response map fitting (DRMF) [37] and the Dlib C++ library [38], have also been adopted (e.g., [39, 40]). Recently, deep learning methods such as multitask cascaded convolutional networks (MTCNN) [41], DenseReg [42] and Tiny Faces [43] have achieved superior performance on many face alignment benchmarks. These methods are robust and can maintain real-time performance for FER in challenging unconstrained environments.
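
As a concrete illustration of the registration step, the transform that maps two detected eye centers onto fixed template positions can be estimated in closed form. The sketch below covers only rotation, scale and translation (a similarity transform, a common special case of the affine registration described above); the template coordinates are hypothetical:

```python
import numpy as np

def eye_align_transform(left_eye, right_eye,
                        tmpl_left=(30.0, 45.0), tmpl_right=(70.0, 45.0)):
    """Similarity transform (rotation + scale + translation) that maps the
    detected eye centers onto fixed template positions (hypothetical values)."""
    src = np.array(right_eye, dtype=float) - np.array(left_eye, dtype=float)
    dst = np.array(tmpl_right) - np.array(tmpl_left)
    # scale and rotation that carry the inter-ocular vector onto the template
    scale = np.linalg.norm(dst) / np.linalg.norm(src)
    angle = np.arctan2(dst[1], dst[0]) - np.arctan2(src[1], src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    A = np.array([[c, -s], [s, c]])
    t = np.array(tmpl_left) - A @ np.array(left_eye, dtype=float)
    return A, t

def warp_points(points, A, t):
    """Apply the estimated transform to an (N, 2) array of landmark points."""
    return points @ A.T + t

# Example: eyes detected at (40, 60) and (80, 50) in the raw image
A, t = eye_align_transform((40, 60), (80, 50))
aligned = warp_points(np.array([[40.0, 60.0], [80.0, 50.0]]), A, t)
```

After this warp, the eye centers land exactly on the template coordinates, which normalizes face scale and in-plane rotation before feature learning.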

In contrast to using a single model for face alignment, some methods combine multiple models for better landmark estimation when processing faces in challenging unconstrained environments. Yu et al. [44] combined three detectors (the JDA detector [45], the DCNN detector [46] and the MoT detector [36]) to complement each other. Kim et al. [47] considered four pipelines with different inputs (the original image and a histogram-equalized image) and different face detection models (V&J [26] and MoT [36]), and then selected the landmark set with the highest confidence provided by IntraFace.

2.1.2 Data augmentation

Deep neural networks require sufficient valid training data to ensure generalizability to a given recognition task. However, publicly available databases for FER do not have a sufficient quantity of images for training. Therefore, data augmentation is a vital step for deep FER. Data augmentation techniques can be divided into two groups: offline data augmentation and on-the-fly data augmentation.

Simard et al. [48] proposed the generation of synthetic samples for each original image to increase the database. Inspired by this technique, various data augmentation operations have been added offline for deep FER. The most frequently used methods include random perturbations and transforms, e.g., rotation, translation, horizontal flips, scaling and shear. These operations generate additional unseen training samples and therefore make the network more robust to deviated and rotated faces. In [49], noise such as salt & pepper and speckle was added to augment the training data. In [50], changes in brightness and saturation were considered for data augmentation. In [51] and [52], a 2D Gaussian distribution with standard deviation was used to randomly add noise to the centers of the eyes. After correction for rotation, [52] also applied a skew procedure to distort the corners of the image to enlarge the training set. In [53], the authors applied five image appearance filters (disk, average, Gaussian, unsharp and motion filters) and six affine transform matrices that were formalized by adding slight geometric transformations to the identity matrix. In [44], a more comprehensive affine transform matrix was proposed to randomly generate images that varied in terms of rotation, skew and scale. Furthermore, a synthetic data generation system with a 3D convolutional neural network (CNN) was created in [54] to confidently create faces with different levels of saturation in expression and accurate movement in action units. Moreover, the generative adversarial network (GAN) [55] can also be applied to data augmentation by generating various expressive appearances.

In addition to offline data augmentation, on-the-fly data augmentation is often embedded in deep learning toolkits to alleviate overfitting. During the training step, the input samples are randomly cropped from the four corners and center of the image and then flipped horizontally, which can result in a dataset that is ten times larger than the original training data. Two common prediction methods are used for testing: only the center patch of the face is used for prediction (e.g., [56, 40]) or the prediction value is averaged over all ten crops (e.g., [57, 47]).
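
The ten-crop scheme described above can be sketched as follows; this is a minimal NumPy illustration rather than the exact routine of any particular toolkit:

```python
import numpy as np

def ten_crop(image, crop_size):
    """Crop the four corners and the center of an (H, W) or (H, W, C) image,
    then add a horizontal flip of each crop, yielding ten patches."""
    h, w = image.shape[:2]
    ch, cw = crop_size
    offsets = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]  # four corners + center
    crops = [image[y:y + ch, x:x + cw] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips
    return np.stack(crops)

# A 48x48 face image yields ten 40x40 training patches
patches = ten_crop(np.zeros((48, 48)), (40, 40))
```

At test time, averaging the predictions over all ten patches corresponds to the second evaluation protocol mentioned above, while using only the center crop corresponds to the first.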

2.1.3 Face normalization

Variations in illumination and head poses can introduce large changes in images and hence impair the FER performance. Therefore, we introduce two typical face normalization methods to ameliorate these variations: illumination normalization and pose normalization (frontalization).

Illumination normalization: Illumination and contrast can vary in different images even from the same person with the same expression, especially in unconstrained environments, which can result in large intra-class variances. The INFace toolbox is the most commonly used implementation for illumination invariance. Several studies have shown that histogram equalization combined with illumination normalization techniques results in better face recognition performance than that achieved using illumination normalization techniques on their own [58]. Many studies in the literature of deep FER (e.g., [44, 59, 60, 49]) have employed histogram equalization to increase the global contrast of images for pre-processing. This method is effective when the brightness of the background and foreground are similar. To reduce illumination variation, the authors in [61] employed homomorphic filtering based normalization, which yields the most consistent results among all other techniques [62]. A method adapted from a bio-inspired technique described in [63], called contrastive equalization, was used in [51] for intensity normalization. Besides histogram equalization, three frequently used illumination normalization methods, namely, isotropic diffusion (IS)-based normalization, discrete cosine transform (DCT)-based normalization [64] and difference of Gaussian (DoG), were evaluated in [39]. The experimental results suggested that histogram equalization achieved the most reliable performance for all the network models. In [49], the authors compared three different methods: global contrast normalization (GCN), local normalization, and histogram equalization. GCN and histogram equalization achieved the best accuracy for the training and testing steps, respectively.
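
Histogram equalization itself is a simple lookup-table operation; a minimal NumPy sketch for 8-bit grayscale input (assuming a non-constant image, since a flat image has no contrast to stretch) is:

```python
import numpy as np

def histogram_equalize(image):
    """Map gray levels through the normalized cumulative histogram so that
    intensities spread over the full [0, 255] range (global contrast boost)."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first occupied gray level
    # classic equalization formula; clip keeps the lookup table in [0, 255]
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[image]

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
out = histogram_equalize(img)
```

Library implementations (e.g., OpenCV's `equalizeHist`) perform essentially this mapping, optionally in local or contrast-limited variants.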

Pose normalization: Considerable pose variation is another common and intractable problem in unconstrained settings. Some studies have employed pose normalization techniques to yield frontal facial views for FER (e.g., [65, 66]), among which the most popular was proposed by Hassner et al. [67]. Specifically, after localizing facial landmarks, a 3D texture reference model generic to all faces is generated to efficiently estimate visible facial components. Then, the initial frontalized face is synthesized by back-projecting each input face image to the reference coordinate system. Alternatively, Sagonas et al. [68] proposed an effective statistical model to simultaneously localize landmarks and convert facial poses using only frontal faces. Very recently, a series of GAN-based deep models were proposed for frontal view synthesis (e.g., FF-GAN [69], TP-GAN [70] and DR-GAN [71]).

2.2 Deep networks for feature learning

Deep learning has recently become a hot research topic and has achieved state-of-the-art performance for a variety of applications [72]. Deep learning attempts to capture high-level abstractions through hierarchical architectures of multiple nonlinear transformations and representations. In this section, we briefly introduce some deep learning techniques that have been applied for FER. The traditional architectures of these deep neural networks are shown in Fig. 2.

2.2.1 Convolutional neural network (CNN)

The convolutional neural network (CNN) is one of the most successful deep models for image analysis. The CNN has been extensively used in diverse computer vision applications, including FER. At the beginning of the 21st century, several studies in the FER literature [73, 74] found that the CNN is robust to face location changes and scale variations and behaves better than the multilayer perceptron (MLP) in the case of previously unseen face pose variations. [75] employed the CNN to address the problems of subject independence as well as translation, rotation, and scale invariance in the recognition of facial expressions.

A CNN consists of three types of heterogeneous layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer has a set of learnable filters to convolve through the whole input image and produce various specific types of activation feature maps. The convolution operation is associated with three main benefits: local connectivity, which learns correlations among neighboring pixels; weight sharing in the same feature map, which greatly reduces the number of the parameters to be learned; and shift-invariance to the location of the object. The pooling layer follows the convolutional layer and is used to reduce the spatial size of the feature maps and the computational cost of the network. Average pooling and max pooling are the two most commonly used nonlinear down-sampling strategies for translation invariance. The fully connected layer is usually included at the end of the network to ensure that all neurons in the layer are fully connected to activations in the previous layer and to enable the 2D feature maps to be converted into 1D feature maps for further feature representation and classification.
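
The interplay of these layers can be illustrated with a minimal forward pass. The sketch below shows one convolution (followed by ReLU) and non-overlapping max pooling, written with plain NumPy loops for clarity rather than efficiency:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2D convolution (cross-correlation, as in deep learning libraries):
    slide the kernel over the input and take windowed dot products."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: downsample by keeping local maxima."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)          # toy 6x6 "image"
fmap = np.maximum(conv2d(x, np.ones((3, 3)) / 9.0), 0.0)  # conv + ReLU: (4, 4)
pooled = max_pool(fmap)                                # (4, 4) -> (2, 2)
```

The same weight-sharing kernel sweeps the whole input (local connectivity), and pooling halves each spatial dimension, which is exactly the parameter reduction and translation tolerance described above.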

In addition to these three typical layers, various follow-up strategies are widely used in CNNs. The rectified linear unit (ReLU) [16], which increases the nonlinear properties of the network without suffering from the vanishing gradient problem, is the most common activation function. Its extension, the parametric rectified linear unit (PReLU) [76], was proposed to improve model fitting and generalizability when given limited training data. Dropout, proposed by Hinton et al. [77, 78], is another universal strategy to prevent complex co-adaptations on the training data and to reduce overfitting by randomly omitting a portion of the feature detectors. Furthermore, batch normalization (BN) [79] was proposed to reduce internal covariate shift, to regularize the model and to improve the convergence speed.

We list the configurations and characteristics of some well-known CNN models that have been applied for FER in Table I. Besides these networks, several well-known derived frameworks also exist. In [80, 35], region-based CNN (R-CNN) [81] was used to learn features for FER. In [82], Faster R-CNN [83] was used to identify facial expressions by generating high-quality region proposals. Moreover, Ji et al. proposed 3D CNN [84] to capture motion information encoded in multiple adjacent frames for action recognition via 3D convolutions. Tran et al. [85] proposed the well-designed C3D, which exploits 3D convolutions on large-scale supervised training datasets to learn spatio-temporal features. Many related studies (e.g., [86, 87]) have employed this network for FER involving image sequences.

                AlexNet [16]   VGGNet [17]   GoogleNet [18]   ResNet [19]
Year            2012           2014          2014             2015
# of layers     5+3            13/16 + 3     21+1             151+1
Kernel size     11, 5, 3       3             7, 1, 3, 5       7, 1, 3, 5

  • # of layers: number of convolutional layers + fully connected layers

  • Kernel size: sizes of the convolution kernels

TABLE I: Comparison of CNN models that have been applied for FER.

2.2.2 Deep belief network (DBN)

The deep belief network (DBN) proposed by Hinton et al. [88] is a graphical model that learns to extract a deep hierarchical representation of the training data. The traditional DBN is built with a stack of restricted Boltzmann machines (RBMs) [89], which are two-layer generative stochastic models composed of a visible-unit layer and a hidden-unit layer. These two layers in an RBM must form a bipartite graph without lateral connections. In a DBN, the units in higher layers are trained to learn the conditional dependencies among the units in the adjacent lower layers, except the top two layers, which have undirected connections. The training of a DBN contains two phases: pre-training and fine-tuning [90]. First, an efficient layer-by-layer greedy learning strategy [91] is used to initialize the deep network in an unsupervised manner, which can prevent poor local optimal results to some extent without the requirement of a large amount of labeled data. During this procedure, contrastive divergence [92] is used to train RBMs in the DBN to estimate the approximation gradient of the log-likelihood. Then, the parameters of the network and the desired output are fine-tuned with a simple gradient descent under supervision.
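
A single CD-1 update for a binary RBM can be sketched as follows. This is a one-sample illustration of the contrastive divergence step mentioned above; the learning rate and sampling details are simplifications of practical implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM:
    a single Gibbs step approximates the log-likelihood gradient."""
    # positive phase: hidden probabilities given the data vector
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hiddens
    # negative phase: reconstruct visibles, then re-infer hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # approximate gradient: <v h>_data - <v h>_reconstruction
    W += lr * (v0[:, None] * p_h0[None, :] - p_v1[:, None] * p_h1[None, :])
    b_vis += lr * (v0 - p_v1)
    b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid

# Toy RBM with 4 visible and 3 hidden units, one binary training vector
W = np.zeros((4, 3))
b_vis, b_hid = np.zeros(4), np.zeros(3)
W, b_vis, b_hid = cd1_step(np.array([1.0, 0.0, 1.0, 0.0]), W, b_vis, b_hid)
```

Stacking such RBMs and running this update layer by layer is the greedy unsupervised pre-training phase; supervised gradient descent then fine-tunes the whole stack.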

2.2.3 Deep autoencoder (DAE)

The deep autoencoder (DAE) was first introduced in [93] to learn efficient codings for dimensionality reduction. In contrast to the previously mentioned networks, which are trained to predict target values, the DAE is optimized to reconstruct its inputs by minimizing the reconstruction error. Variations of the DAE exist, such as the denoising autoencoder [94], which recovers the original undistorted input from partially corrupted data; the sparse autoencoder network [95], which enforces sparsity on the learned feature representation; the contractive autoencoder [96], which adds an activity dependent regularization to induce locally invariant features; and the convolutional autoencoder [97], which uses convolutional (and optionally pooling) layers for the hidden layers in the network.
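
The reconstruction objective can be made concrete with a tied-weight denoising autoencoder; the masking-noise corruption and gradient step below follow the standard formulation, with hyperparameters chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dae_step(x, W, b_enc, b_dec, lr=0.01, corruption=0.3):
    """One gradient step of a tied-weight denoising autoencoder: corrupt the
    input with masking noise, encode, decode, and minimize the squared
    reconstruction error against the *clean* input."""
    x_noisy = x * (rng.random(x.shape) >= corruption)  # zero out ~30% of dims
    h = np.tanh(x_noisy @ W + b_enc)                   # encoder
    x_hat = h @ W.T + b_dec                            # linear decoder (tied W)
    err = x_hat - x                                    # compare to clean input
    # backprop through decoder and encoder; tied W collects both contributions
    grad_h = (err @ W) * (1.0 - h ** 2)
    W -= lr * (np.outer(x_noisy, grad_h) + np.outer(err, h))
    b_enc -= lr * grad_h
    b_dec -= lr * err
    return 0.5 * np.sum(err ** 2)

# Toy usage: repeatedly denoise a fixed 4-d input (hypothetical values)
x = np.array([0.5, -0.2, 0.8, 0.1])
W = rng.normal(0.0, 0.1, (4, 3))
b_enc, b_dec = np.zeros(3), np.zeros(4)
for _ in range(100):
    loss = dae_step(x, W, b_enc, b_dec)
```

Because the target is the uncorrupted input, the learned code must capture structure that survives the corruption, which is what makes the denoising variant more robust than a plain autoencoder.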

2.2.4 Recurrent neural network (RNN)

A recurrent neural network (RNN) is a connectionist model that captures temporal information and is more suitable for sequential data prediction with arbitrary lengths. In contrast to training a deep neural network in a single feed-forward manner, RNNs include recurrent edges that span adjacent time steps and share the same parameters across all steps. The classic back-propagation through time (BPTT) [98] is used to train the RNN. Long short-term memory (LSTM), introduced by Hochreiter and Schmidhuber [99], is a special form of the traditional RNN that is used to address the gradient vanishing and exploding problems that are common in training RNNs. The cell state in an LSTM is regulated and controlled by three gates: an input gate that allows or blocks alteration of the cell state by the input signal, an output gate that enables or prevents the cell state from affecting other neurons, and a forget gate that modulates the cell’s self-recurrent connection to accumulate or forget its previous state. By combining these three gates, LSTM can model long-term dependencies in a sequence and has been widely employed for video-based expression recognition tasks.
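
In one standard formulation (the symbols follow common convention and are not taken from the original text), the three gates and the state updates read:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of the cell state \(c_t\), gated by \(f_t\) and \(i_t\), is what lets gradients flow across many time steps without vanishing, matching the description above.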

2.3 Facial expression classification

After learning the deep features, the final step of FER is to classify the given face into one of the basic emotion categories. In contrast to the traditional methods, where the feature extraction step and the feature classification step are independent, the deep neural network can perform FER in an end-to-end manner. Specifically, a loss layer is added to the end of the network to regulate the back-propagation error; then, the prediction probability of each sample can be directly output by the network. Another alternative is to employ a deep neural network (particularly a CNN) as a feature extraction tool and then apply additional classifiers, such as a support vector machine (SVM) or random forest, to the extracted image representations [100, 101].
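
The loss-layer computation in the end-to-end setting is typically a softmax followed by cross-entropy; a minimal sketch, with hypothetical class scores for the six basic expressions plus neutral, is:

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Loss-layer computation: softmax turns the network's raw output scores
    into class probabilities; cross-entropy penalizes the true class' score."""
    z = logits - logits.max()               # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    loss = -np.log(probs[label])
    grad = probs.copy()
    grad[label] -= 1.0                      # gradient of the loss w.r.t. logits
    return probs, loss, grad

# Hypothetical scores over 7 classes (6 basic expressions + neutral)
probs, loss, grad = softmax_cross_entropy(
    np.array([2.0, 0.1, 0.3, 3.5, 0.2, 0.1, 1.0]), label=3)
```

The simple form of the gradient (predicted probabilities minus the one-hot target) is what makes this the default choice for regulating the back-propagation error.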

3 Facial expression databases

Having sufficient labeled training data that include as many variations of the populations and environments as possible is important for the design of a deep expression recognition system. In this section, we discuss the publicly available databases that contain basic expressions and that are widely used in our reviewed papers for the evaluation of deep learning algorithms. We also introduce newly released databases that contain a large amount of affective images collected from the real world to benefit the training of deep neural networks. Table II provides an overview of these datasets, including the main reference, number of subjects, number of image or video samples, collection environment, expression distribution and additional information.

Database Samples Subjects Condit. Elicit. Expression distribution
CK+ [102] 593 image sequences 123 Lab P & S 6 basic expressions plus contempt and neutral
MMI [103, 104] 740 images and 2,900 videos 25 Lab P 6 basic expressions plus neutral
JAFFE [105] 213 images 10 Lab P 6 basic expressions plus neutral
TFD [106] 112,234 images N/A Lab P 6 basic expressions plus neutral
FER-2013 [10] 35,887 images N/A Web P & S 6 basic expressions plus neutral
AFEW 7.0 [15] 1,809 videos N/A Movie P & S 6 basic expressions plus neutral
SFEW 2.0 [13] 1,766 images N/A Movie P & S 6 basic expressions plus neutral
Multi-PIE [107] 755,370 images 337 Lab P Smile, surprised, squint, disgust, scream and neutral
BU-3DFE [108] 2,500 images 100 Lab P 6 basic expressions plus neutral
Oulu-CASIA [109] 2,880 image sequences 80 Lab P 6 basic expressions
RaFD [110] 1,608 images 67 Lab P 6 basic expressions plus contempt and neutral
KDEF [111] 4,900 images 70 Lab P 6 basic expressions plus neutral
EmotioNet [112] 1,000,000 images N/A Web P & S 23 basic expressions or compound expressions
RAF-DB [113] 29,672 images N/A Web P & S 6 basic expressions plus neutral and 12 compound expressions
AffectNet [114] 450,000 images (labeled) N/A Web P & S 6 basic expressions plus neutral
  • The number of samples and number of subjects for each dataset are taken from the reference paper and may be different from the actual data that are available for basic emotion recognition. See text for details.

TABLE II: An overview of the facial expression datasets. P = posed; S = spontaneous; Condit. = Collection condition; Elicit. = Elicitation method.

CK+ [102]: The Extended Cohn–Kanade (CK+) database is the most extensively used laboratory-controlled database for evaluating FER systems. CK+ contains 593 video sequences from 123 subjects. The sequences vary in duration from 10 to 60 frames and show a shift from a neutral facial expression to the peak expression. Among these videos, 327 sequences from 118 subjects are labeled with seven basic expression labels (anger, contempt, disgust, fear, happiness, sadness, and surprise) based on the Facial Action Coding System (FACS). Because CK+ does not provide specified training, validation and test sets, the algorithms evaluated on this database are not uniform. For static-based methods, the most common data selection method is to extract the last one to three frames with peak formation and the first frame (neutral face) of each sequence. Then, the subjects are divided into groups for person-independent N-fold cross-validation experiments, where commonly selected values of N are 5, 8 and 10.

MMI [103, 104]: The MMI database is laboratory-controlled and includes 326 sequences from 32 subjects. A total of 213 sequences are labeled with six basic expressions (without “contempt”), and 205 sequences are captured in frontal view. In contrast to CK+, sequences in MMI are onset-apex-offset labeled, i.e., the sequence begins with a neutral expression and reaches peak near the middle before returning to the neutral expression. Furthermore, MMI has more challenging conditions, i.e., there are large inter-personal variations because subjects perform the same expression non-uniformly and many of them wear accessories (e.g., glasses, mustache). For experiments, the most common method is to choose the first frame (neutral face) and the three peak frames in each frontal sequence to conduct person-independent 10-fold cross-validation.

JAFFE [105]: The Japanese Female Facial Expression (JAFFE) database is a laboratory-controlled image database that contains 213 samples of posed expressions from 10 Japanese females. Each person has 3~4 images with each of six basic facial expressions (anger, disgust, fear, happiness, sadness, and surprise) and one image with a neutral expression. The database is challenging because it contains few examples per subject/expression. Typically, all the images are used for the leave-one-subject-out experiment.

TFD [106]: The Toronto Face Database (TFD) is an amalgamation of several facial expression datasets. TFD contains 112,234 images, 4,178 of which are annotated with one of seven expression labels: anger, disgust, fear, happiness, sadness, surprise and neutral. The faces have already been detected and normalized to a size of 48×48 pixels such that all the subjects’ eyes are the same distance apart and have the same vertical coordinates. Five official folds are provided in TFD; each fold contains a training, validation, and test set consisting of 70%, 10%, and 20% of the images, respectively.

FER2013 [10]: The FER2013 database was introduced during the ICML 2013 Challenges in Representation Learning. FER2013 is a large-scale and unconstrained database collected automatically by the Google image search API. All images in the dataset have been registered and resized to 48×48 pixels after rejecting wrongfully labeled frames and adjusting the cropped region. FER2013 contains 28,709 training images, 3,589 validation images and 3,589 test images with seven expression labels (anger, disgust, fear, happiness, sadness, surprise and neutral).

AFEW [115]: The Acted Facial Expressions in the Wild (AFEW) database was first established and introduced in [116] and [115] and has served as an evaluation platform for the annual Emotion Recognition In The Wild Challenge (EmotiW) since 2013. AFEW contains video clips collected from different movies with spontaneous expressions, various head poses, occlusions and illuminations. AFEW is a temporal and multimodal database that provides vastly different environmental conditions in both audio and video. Samples in AFEW are labeled with seven expressions: anger, disgust, fear, happiness, sadness, surprise and neutral. The annotation of expressions has been continuously updated, and reality TV show data have been continuously added. The latest AFEW 7.0 in EmotiW 2017 [15] is divided into three data partitions that are independent in terms of subject and movie/TV source: Train (773 samples), Val (383 samples) and Test (653 samples), which ensures that data in the three sets belong to mutually exclusive movies and actors.

SFEW [117]: The Static Facial Expressions in the Wild (SFEW) was created by selecting static frames from the AFEW database by computing key frames based on facial point clustering. The most commonly used version, SFEW 2.0, was the benchmarking data for the SReco sub-challenge in EmotiW 2015 [13]. SFEW 2.0 has been divided into three sets: Train (958 samples), Val (436 samples) and Test (372 samples). Each of the images is assigned to one of seven expression categories, i.e., anger, disgust, fear, neutral, happiness, sadness, and surprise. The expression labels of the training and validation sets are publicly available, whereas those of the testing set are held back by the challenge organizer.

Multi-PIE [107]: The CMU Multi-PIE database contains 755,370 images from 337 subjects under 15 viewpoints and 19 illumination conditions in up to four recording sessions. Each facial image is labeled with one of six expressions: disgust, neutral, scream, smile, squint and surprise. This dataset is typically used for multiview facial expression analysis.

BU-3DFE [108]: The Binghamton University 3D Facial Expression (BU-3DFE) database contains 606 facial expression sequences captured from 100 people. For each subject, the six universal facial expressions (anger, disgust, fear, happiness, sadness and surprise) were elicited in various manners with multiple intensities. Similar to Multi-PIE, this dataset is typically used for multiview 3D facial expression analysis.

Oulu-CASIA [109]: The Oulu-CASIA database includes 2,880 image sequences collected from 80 subjects labeled with six basic emotion labels: anger, disgust, fear, happiness, sadness, and surprise. Each of the videos is captured with one of two imaging systems, i.e., near-infrared (NIR) or visible light (VIS), under three different illumination conditions. Similar to CK+, the first frame is neutral and the last frame has the peak expression. Typically, only the last three peak frames and the first frame (neutral face) from the 480 videos collected by the VIS System under normal indoor illumination are employed for 10-fold cross-validation experiments.

RaFD [110]: The Radboud Faces Database (RaFD) is laboratory-controlled and has a total of 1,608 images from 67 subjects with three different gaze directions, i.e., front, left and right. Each sample is labeled with one of eight expressions: anger, contempt, disgust, fear, happiness, sadness, surprise and neutral.

KDEF [111]: The laboratory-controlled Karolinska Directed Emotional Faces (KDEF) database was originally developed for use in psychological and medical research. KDEF consists of images of 70 actors photographed from five different angles and labeled with the six basic facial expressions plus neutral.

In addition to these commonly used datasets for basic emotion recognition, several well-established and large-scale publicly available facial expression databases collected from the Internet that are suitable for training deep neural networks have emerged in the last two years.

EmotioNet [112]: EmotioNet is a large-scale database with one million facial expression images collected from the Internet. A total of 950,000 images were annotated by the automatic action unit (AU) detection model described in [112], and the remaining 25,000 images were manually annotated with 11 AUs. The second track of the EmotioNet Challenge [118] provides six basic expressions and ten compound expressions [119], and 2,478 images with expression labels are available.

RAF-DB [113]: The Real-world Affective Face Database (RAF-DB) is a real-world database that contains 29,672 highly diverse facial images downloaded from the Internet. With manually crowd-sourced annotation and reliable estimation, seven basic and eleven compound emotion labels are provided for the samples. Specifically, 15,339 images from the basic emotion set are divided into two groups (12,271 training samples and 3,068 testing samples) for evaluation.

AffectNet [114]: AffectNet contains more than one million images from the Internet that were obtained by querying different search engines using emotion-related tags. AffectNet is by far the largest database that provides facial expressions in two different emotion models (categorical model and dimensional model), of which 450,000 images have manually annotated labels for eight basic expressions.

4 The state of the art

In this section, we review the existing novel deep neural networks designed for FER and the related training strategies proposed to address expression-specific problems. We divide the works presented in the literature into two main groups depending on the type of data: deep FER networks for static images and deep FER networks for dynamic image sequences. We then provide an overview of the current deep FER systems with respect to the network architecture and performance. Because some of the evaluated datasets do not provide explicit data groups for training, validation and testing, and the relevant studies may conduct experiments under different experimental conditions with different data, we summarize the expression recognition performance along with information about the data selection and grouping methods.

Dataset | Method | Network | Data selection | Data group | Classifier | Performance (%)
------- | ------ | ------- | -------------- | ---------- | ---------- | ----------------
CK+ | Ouellet 14 [120] | AlexNet | the last frame | LOSO | SVM | 7 classes: (94.4)
CK+ | Li et al. 15 [61] | DBM | | | | 6 classes: 96.8; 7 classes: 91.7 (86.76)
CK+ | Liu et al. 14 [121] | cascaded network | the last three frames and the first frame | 8 folds | AdaBoost | 6 classes: 96.7
CK+ | Liu et al. 13 [122] | cascaded network | | 10 folds | SVM | 8 classes: 92.05 (87.67)
CK+ | Liu et al. 15 [123] | cascaded network | | 10 folds | SVM | 7 classes: 93.70
CK+ | Khorrami et al. 15 [124] | zero-bias CNN | | 10 folds | | 6 classes: 98.3; 8 classes: 96.4
CK+ | Ding et al. 17 [31] | FaceNet2ExpNet | | 10 folds | | 6 classes: (98.6); 8 classes: (96.8)
CK+ | Zeng et al. 18 [125] | DSAE | the last four frames and the first frame | | | 7 classes: 95.79 (93.78); 8 classes: 89.84 (86.82)
CK+ | Cai et al. 17 [126] | Island loss | the last three frames | 10 folds | | 7 classes: 94.35 (90.66)
CK+ | Meng et al. 17 [40] | multitask network | | 8 folds | | 7 classes: 95.37 (95.51)
CK+ | Liu et al. 17 [56] | clusters loss | | 8 folds | | 7 classes: 97.1 (96.1)
CK+ | Zhang et al. 18 [127] | fine-tune | | 10 folds | | 6 classes: 98.9
JAFFE | Liu et al. 14 [121] | cascaded network | 213 images | LOSO | AdaBoost | 7 classes: 91.8
JAFFE | Hamester et al. 15 [128] | network ensemble | | LOSO / 10 folds | | LOSO: (95.8); 10 folds: 94.1 (91.6)
MMI | Liu et al. 13 [122] | cascaded network | the middle three frames and the first frame | 10 folds | SVM | 7 classes: 74.76 (71.73)
MMI | Liu et al. 15 [123] | cascaded network | | 10 folds | SVM | 7 classes: 75.85
MMI | Mollahosseini et al. 16 [27] | Inception | images from each sequence | 5 folds | | 6 classes: 77.9
MMI | Cai et al. 17 [126] | Island loss | the middle three frames | 10 folds | | 6 classes: 70.67 (69.60)
MMI | Liu et al. 17 [56] | clusters loss | | 10 folds | | 6 classes: 78.53 (73.50)
MMI | Li et al. 17 [113] | LP loss | | 5 folds | SVM | 6 classes: 78.46
TFD | Rifai et al. 12 [129] | cascaded network | all images | 5 official folds | SVM | Test: 85.06
TFD | Reed et al. 14 [130] | multitask network | 4,178 emotion labeled; 3,874 identity labeled | | SVM | Test: 85.43
TFD | Devries et al. 14 [34] | multitask network | 4,178 labeled images | | | Validation: 87.80; Test: 85.13 (48.29)
TFD | Reed et al. 14 [131] | label bootstrap | | | | Test: 86.8
TFD | Khorrami et al. 15 [124] | zero-bias CNN | | | | Test: 89.8
TFD | Ding et al. 17 [31] | FaceNet2ExpNet | | | | Test: 88.9 (87.7)
FER2013 | Tang 13 [132] | L2-SVM loss | Training: 28,709; Validation: 3,589; Test: 3,589 | | | Test: 71.2
FER2013 | Devries et al. 14 [34] | multitask network | | | | Validation+Test: 67.21
FER2013 | Zhang et al. 15 [133] | bridging layer | | | | Test: 75.10
FER2013 | Guo et al. 16 [134] | triplet-based loss | | | | Test: 71.33
FER2013 | Kim et al. 16 [135] | network ensemble | | | | Test: 73.73
FER2013 | Pramerdorfer et al. 16 [136] | network ensemble | | | | Test: 75.2
FER2013 | Connie et al. 17 [137] | network ensemble | | | | Test: 73.4
SFEW 2.0 | Levi et al. 15 [57] | network ensemble | 891 training, 431 validation, and 372 test | | | Validation: 51.75; Test: 54.56
SFEW 2.0 | Ng et al. 15 [29] | pre-training and fine-tuning | 921 training, ? validation, and 372 test | | | Validation: 48.5 (39.63); Test: 55.6 (42.69)
SFEW 2.0 | Li et al. 17 [113] | DLP loss | 921 training, 427 validation | | SVM | Validation: 51.05
SFEW 2.0 | Ding et al. 17 [31] | FaceNet2ExpNet | 891 training, 425 validation | | | Validation: 55.15 (46.6)
SFEW 2.0 | Pons et al. 18 [138] | multitask network | 958 training, 436 validation | | | Validation: 45.9
SFEW 2.0 | Liu et al. 17 [56] | clusters loss | | | | Validation: 54.19 (47.97)
SFEW 2.0 | Cai et al. 17 [126] | Island loss | 958 training, 436 validation, and 372 test | | | Validation: 52.52 (43.41); Test: 59.41 (48.29)
SFEW 2.0 | Meng et al. 17 [40] | multitask network | | | | Validation: 50.98 (42.57); Test: 54.30 (44.77)
SFEW 2.0 | Kim et al. 15 [47] | network ensemble | | | | Validation: 53.9; Test: 61.6
SFEW 2.0 | Yu et al. 15 [44] | network ensemble | | | | Validation: 55.96 (47.31); Test: 61.29 (51.27)

  • The value in parentheses is the mean accuracy, which is calculated from the confusion matrix given by the authors.

  • 7 classes in CK+: anger, contempt, disgust, fear, happiness, sadness, and surprise.

  • 7 classes in JAFFE, MMI and SFEW 2.0: anger, disgust, fear, happiness, neutral, sadness, and surprise.

TABLE III: Performance summary of representative methods for static-based deep facial expression recognition on the most widely evaluated datasets. LOSO = leave-one-subject-out.

4.1 Deep FER networks for static images

A large volume of the existing studies conducted expression recognition tasks based on static images without considering temporal information due to the convenience of data processing and the availability of the relevant training and test material. We first introduce specific pre-training and fine-tuning skills for FER, then review the novel deep neural networks in this field. For each of the most frequently evaluated datasets, Table III shows the current state-of-the-art methods in the field that are explicitly conducted in a person-independent protocol (subjects in the training and testing sets are separated).
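The person-independent protocols in Table III (e.g., LOSO and subject-exclusive k-fold) all reduce to grouping samples by subject before splitting. A minimal sketch of leave-one-subject-out splitting is given below; the sample representation is illustrative and not taken from any surveyed implementation:

```python
from collections import defaultdict

def leave_one_subject_out(samples):
    """Yield (subject, train_idx, test_idx) splits with no subject overlap.

    `samples` is a list of (subject_id, image) pairs; grouping by subject
    before splitting is what makes the protocol person-independent.
    """
    by_subject = defaultdict(list)
    for idx, (subject, _img) in enumerate(samples):
        by_subject[subject].append(idx)
    for subject, test_idx in by_subject.items():
        train_idx = [i for s, ids in by_subject.items() if s != subject
                     for i in ids]
        yield subject, train_idx, test_idx
```

The same grouping step underlies the k-fold variants: folds are drawn over subjects, not over images, so that no identity appears in both training and testing sets.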

4.1.1 Pre-training and fine-tuning

As mentioned before, direct training of deep networks on relatively small facial expression datasets is prone to overfitting. To mitigate this problem, many studies used additional task-oriented data to pre-train their self-built networks from scratch or fine-tuned well-known pre-trained models (e.g., AlexNet [16], VGG [17], VGG-face [139] and GoogleNet [18]). Kahou et al. [33, 140] indicated that the use of additional data can help to obtain models with high capacity without overfitting, thereby enhancing the FER performance.

To select appropriate auxiliary data, large-scale face recognition (FR) datasets (e.g., CASIA WebFace [141], Celebrity Face in the Wild (CFW) [142] and the FaceScrub dataset [143]) or relatively large FER datasets (FER2013 [10] and TFD [106]) are suitable. As suggested by the preliminary FER experiments of Kaya et al. [144], VGG-Face, which was trained for FR, clearly outperformed models trained on ImageNet, which was developed for object recognition. Another interesting observation by Knyazev et al. [145] is that models pre-trained on much larger FR data can achieve better emotion recognition performance after fine-tuning on FER2013, even when their FR accuracy is worse. Therefore, pre-training on larger FR data positively affects emotion recognition accuracy, and further fine-tuning with related facial expression datasets can help improve the performance.

Fig. 3: Flowchart of the different fine-tuning combinations used in [29]. Here, “FER28” and “FER32” indicate different parts of the FER2013 datasets. “EmotiW” is the target dataset. The proposed two-stage fine-tuning strategy (Submission 3) exhibited the best performance.
Fig. 4: Two-stage training flowchart in [31]. In stage (a), the deeper face net is frozen and provides the feature-level regularization that pushes the convolutional features of the expression net to be close to the face net by using the proposed distribution function. Then, in stage (b), to further improve the discriminativeness of the learned features, randomly initialized fully connected layers are added and jointly trained with the whole expression net using the expression label information.

Instead of directly using the pre-trained or fine-tuned models to extract learned features on the target dataset, Ng et al. [29] proposed a multistage fine-tuning: after the first-stage fine-tuning using FER2013 on pre-trained models, a second-stage fine-tuning based on the training part of the target dataset (EmotiW) is employed to refine the models to adapt to a more specific dataset (i.e., the target dataset). This strategy achieved the best performance (see “Submission 3” in Fig. 3) among all the tested fine-tuning methods. In [31], Ding et al. found that face-dominated information remains in the fine-tuned FR network because of the large gap between the FR and FER datasets, which weakens the network’s ability to represent different expressions. The authors presented a novel training algorithm, called FaceNet2ExpNet, to eliminate this effect and further incorporate face domain knowledge learned from the FR net to regularize the training of the target FER net. The training is conducted in two stages (see Fig. 4 for details). Because the fine-tuned face net already achieves competitive performance on the expression dataset, it could serve as a good initialization for the expression net. Moreover, because fully connected layers generally capture more domain-specific semantics, the face net is used to guide the learning of the convolutional layers only, and the fully connected layers are trained from scratch with expression information.

4.1.2 Diverse network input

Traditional practices commonly use the whole aligned RGB face image as the network input to learn features for FER. However, raw pixel data lack important information, such as homogeneous or regular textures, and are not invariant to image scaling, rotation, occlusion and illumination, all of which may act as confounding factors for FER. Some methods have therefore employed diverse well-established handcrafted features, and extensions of them, as the network input to alleviate this problem.

Fig. 5: Image intensities (left) and LBP codes (middle). [57] proposed mapping these values to a 3D metric space (right) as the input of CNNs.

Levi et al. [57] proposed a novel mapped LBP that transforms image intensities into illumination-invariant 3D spaces (see Fig. 5). By setting a series of different radius parameters on the mapped LBP codes, the training data size can be efficiently enlarged, and the CNNs can focus on information related to the FER task rather than on illumination variations. Zhang et al. [146] employed scale-invariant feature transform (SIFT) [147] features as robust local appearance descriptors for the network input, which is beneficial for multiview FER tasks in which the facial poses are diverse. Luo et al. [148] proposed a 3D Angle+Gradient+Edge (AGE) feature to combine the outline, texture and geometrical information of faces and experimentally found that the AGE features produce better FER results than raw data. Mavani et al. [149] multiplied the raw image by its saliency map, computed with a deep multilevel CNN [150], which emphasizes the parts of the image that demand visual attention. Wu et al. [151] applied the neighbor-center difference vector (NCDV) [152] to obtain features with more intrinsic information. Zeng et al. [125] extracted three different descriptors (LBP, histogram of oriented gradients (HOG) and gray values) from patches centered at 51 landmarks to compose a high-dimensional feature that, after dimension reduction via principal component analysis (PCA), is input to a deep sparse autoencoder (DSAE). Chen et al. [153] removed noncritical parts of the image and extracted only three regions of interest (ROIs), i.e., the eyebrows, eyes and mouth, which are the facial regions most strongly related to expression, as the input of the DSAE.
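The mapped LBP of [57] builds on standard LBP codes. As a point of reference, a minimal 8-neighbour LBP in NumPy is sketched below; the 3D metric mapping of [57] itself is omitted, and this basic variant uses a fixed radius of 1:

```python
import numpy as np

def lbp_codes(gray):
    """Basic 8-neighbour LBP: each interior pixel receives an 8-bit code,
    one bit per neighbour whose intensity is >= the centre pixel."""
    gray = np.asarray(gray, dtype=np.int32)
    c = gray[1:-1, 1:-1]
    # neighbour offsets, clockwise starting from the top-left pixel
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    h, w = gray.shape
    for bit, (dy, dx) in enumerate(shifts):
        n = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (n >= c).astype(np.int32) << bit
    return code.astype(np.uint8)
```

Varying the neighbourhood radius, as done in [57], yields multiple code maps from the same image and thereby enlarges the effective training data.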

4.1.3 Auxiliary blocks & layers

(a) Three different supervised blocks in [66]. SS_Block for shallow-layer supervision, IS_Block for intermediate-layer supervision, and DS_Block for deep-layer supervision.
(b) Island loss layer in [126]. The island loss calculated at the feature extraction layer and the softmax loss calculated at the decision layer are combined to supervise the CNN training.
(c) (N+M)-tuple clusters loss layer in [56]. During training, the identity-aware hard-negative mining and online positive mining schemes are used to decrease the inter-identity variation in the same expression class.
Fig. 6: Representative functional layers or blocks that are specifically designed for deep facial expression recognition.

Based on the foundation architecture of CNN, several studies have proposed the addition of well-designed auxiliary blocks or layers to enhance the expression-related representation capability of learned features.

Yao et al. [65] constructed a deep, yet computationally efficient, CNN architecture, named HoloNet, for FER with three improvements: (1) the concatenated rectified linear unit (CReLU) [154] replaced the popular ReLU to reduce redundant filters and to improve the non-saturated nonlinearity in the lower convolutional layers, (2) the powerful residual structure [19] was further combined with CReLU to achieve high accuracy from the considerably increased depth without reducing efficiency, and (3) an inception-residual block [155, 156] was uniquely designed for emotion recognition to learn multiscale features that can explicitly capture variations in emotion.

Hu et al. [66] embedded three types of supervised blocks in the early hidden layers of the mainstream CNN architecture for shallow, intermediate and deep supervision. These blocks were designed according to the layer-wise feature description ability of the original network. Then, the class-wise scoring activations of each block were aggregated in the scoring connection layer for second-level supervision (see Fig. 6(a) for details).

The traditional softmax loss layer in CNNs simply forces features of different classes to remain apart, but FER in real-world scenarios suffers from not only high inter-class similarity but also high intra-class variation. Therefore, several works have proposed novel loss layers for FER. Inspired by the center loss [157], which penalizes the distance between deep features and their corresponding class centers, two variations were proposed to assist the supervision of the softmax loss for more discriminative features for FER: (1) island loss [126] was formalized to further increase the pairwise distances between different class centers, and (2) locality-preserving loss (LP loss) [113] was formalized to pull the locally neighboring features of the same class together so that the intra-class local clusters of each class are compact. Based on the triplet loss [158], which requires one positive example to be closer to the anchor than one negative example with a fixed gap, two variations were proposed to replace or assist the supervision of the softmax loss: (1) exponential triplet-based loss [134] was formalized to give difficult samples more weight when updating the network, and (2) (N+M)-tuples cluster loss [56] was formalized to alleviate the difficulty of anchor selection and threshold validation in the triplet loss for identity-invariant FER (see Fig. 6(c) for details).
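To make the center-loss family concrete, the sketch below computes a center loss and an island-style penalty over class centers in NumPy. This is a simplified illustration, not the exact formulation of [126] (which also defines gradient updates for the centers and specific weighting terms):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Mean squared distance between each feature and its class centre;
    penalizing it encourages intra-class compactness."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2) / len(features)

def island_penalty(centers, lam=1.0):
    """Sum of (cosine similarity + 1) over all ordered pairs of distinct
    class centres; minimizing it pushes the centres apart
    (the term is 0 only when two centres point in opposite directions)."""
    normed = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = normed @ normed.T
    k = len(centers)
    off_diag = sims[~np.eye(k, dtype=bool)]
    return lam * np.sum(off_diag + 1.0)

def island_loss(features, labels, centers, lam=0.5):
    """Center loss plus the pairwise centre-separation penalty."""
    return center_loss(features, labels, centers) + island_penalty(centers, lam)
```

In training, such a term is added to the softmax loss, so the network is supervised toward both inter-class separability and intra-class compactness.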

4.1.4 Network ensemble

Previous research suggested that assemblies of multiple networks can outperform an individual network [159]. Two key factors should be considered when implementing network ensembles: (1) sufficient diversity of the networks to ensure complementarity, and (2) an appropriate ensemble method that can effectively aggregate the committee networks.

In terms of the first factor, different training data sets and various network architectures or parameters are considered to generate diverse committees. Connie et al.[137] merged SIFT features with CNN features learned from raw images to improve the FER performance. Kim et al. [135] used several preprocessing methods, such as deformation and normalization, to obtain different training data from both aligned and non-aligned faces. In [47, 160], different sizes of filters and different numbers of neurons in the fully connected layer were used to construct various CNNs. In [47], the authors applied multiple random seeds for weight initialization on each CNN. Hamester et al. [128] proposed a novel multi-channel CNN in which one channel is trained using the traditional CNN in a supervised manner and the other is trained separately using a convolutional autoencoder (CAE) in an unsupervised manner.

Method | Definition | Used in (example)
------ | ---------- | -----------------
Majority voting | determines the class with the most votes, using the predicted label yielded by each individual | [47, 161, 135]
Simple average | determines the class with the highest mean score, using the posterior class probabilities yielded by each individual with the same weight | [47, 161, 135]
Weighted average | determines the class with the highest weighted mean score, using the posterior class probabilities yielded by each individual with different weights | [33, 57, 136, 144]

TABLE IV: Three primary ensemble methods at the decision level.

For the second factor, each member of the committee networks can be assembled at two different levels: the feature level and the decision level. For feature-level ensembles, the most commonly adopted strategy is to concatenate features learned from different networks to create a final feature vector that represents the input image. In [162], three subnets that differ in the number of convolutional layers are concatenated into a fully connected layer. Similarly, Bargal et al. [60] concatenated features learned from different networks to obtain a single feature vector that describes the input image (see Fig. 7(a)). Three widely used rules are applied for decision-level ensembles [163, 164]: majority voting, simple average and weighted average. A summary of these three methods is provided in Table IV. Because the weighted average rule considers the importance and confidence of each individual, many methods have been proposed to find an optimal set of weights for decision-level ensembles. Kahou et al. [33] proposed a random search method to formulate the re-weighting of per-model and per-emotion predictions by weighting the model predictions for each emotion type. Yu et al. [44] proposed two different optimization frameworks by using the log-likelihood loss and hinge loss to adaptively assign different weights to each network. Kim et al. [47] proposed an exponentially weighted average based on the validation accuracy to emphasize qualified individuals and then constructed a hierarchical architecture of committees by implementing majority voting or simple average in the higher levels (see Fig. 7(b)). Pons et al. [160] used a CNN to learn weights for each individual model with all emotions.
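The three decision-level rules of Table IV can be sketched directly; the functions below are a minimal NumPy illustration (the weight-learning schemes of [33, 44, 47] go beyond this fixed-weight version):

```python
import numpy as np

def majority_voting(probs):
    """probs: (n_models, n_classes) posterior probabilities per model.
    Each model votes for its argmax class; the most-voted class wins."""
    votes = np.argmax(probs, axis=1)
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())

def simple_average(probs):
    """Class with the highest unweighted mean posterior."""
    return int(probs.mean(axis=0).argmax())

def weighted_average(probs, weights):
    """Class with the highest weighted mean posterior; `weights` encode
    each model's importance (e.g., its validation accuracy)."""
    w = np.asarray(weights, dtype=float)
    return int((w[:, None] * probs).sum(axis=0).argmax())
```

Note that the rules can disagree: a confident minority model can flip the simple average but not the majority vote, which is one motivation for the hierarchical hybrid fusion of [47].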

(a) Feature-level ensemble in [60]. Three different features (fc5 of VGG13 + fc7 of VGG16 + pool of Resnet) after normalization are concatenated to create a single feature vector (FV) that describes the input frame.
(b) Decision-level ensemble in [47]. A 3-level hierarchical committee architecture with hybrid decision-level fusions was proposed to obtain sufficient decision diversity.
Fig. 7: Representative network ensemble systems at the feature level and decision level.

4.1.5 Multitask networks

Many existing networks for FER focus on a single task and learn features that are sensitive to expressions without considering interactions among other latent factors. However, in the real world, FER is intertwined with various factors, such as head pose, illumination, and subject identity (facial morphology). To solve this problem, multitask learning is introduced to transfer knowledge from other relevant tasks and to disentangle nuisance factors.

Fig. 8: Representative multitask network for FER. In the proposed MSCNN [165], a pair of images is sent into the MSCNN during training. The expression recognition task with cross-entropy loss, which learns features with large between-expression variation, and the face verification task with contrastive loss, which reduces the variation in within-expression features, are combined to train the MSCNN.

Reed et al. [130] constructed a higher-order Boltzmann machine (disBM) to learn manifold coordinates for the relevant factors of expressions and further proposed training strategies for efficient disentangling so that the expression-related hidden units are relatively invariant to face morphology. Devries et al. [34] jointly explored two synergistic tasks, i.e., FER and facial landmark localization, and demonstrated that learning features to predict facial geometry while recognizing expressions can improve FER performance. Pons et al. [138] focused on multitask training of emotion recognition and facial AUs [166, 167]. Because not all the data in the target dataset are labeled with respect to both tasks and because AU recognition is a multilabel problem, the selective joint multitask (SJMT) approach, which defines a novel dataset-wise selective loss function to address multitask, multilabel and multidomain problems, was proposed. Meng et al. [40] proposed a novel identity-aware CNN (IACNN) that contains two identical sub-CNNs: one stream uses an expression-sensitive contrastive loss to learn expression-discriminative features, and the other uses an identity-sensitive contrastive loss to learn identity-related features for identity-invariant expression recognition. Similarly, in [165], a multisignal CNN (MSCNN), trained under the supervision of both the expression recognition and face verification tasks with corresponding loss functions, was proposed to force the model to focus on expression information (see Fig. 8 for details).
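The contrastive losses used in IACNN [40] and MSCNN [165] follow the standard pairwise form: matching pairs are pulled together while non-matching pairs are pushed beyond a margin. A minimal NumPy sketch of this generic loss (not the authors' exact weighting) is:

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Standard contrastive loss over a batch of feature pairs.

    f1, f2: (n, d) paired features; same: (n,) entries are 1 if a pair
    shares the attribute of interest (expression or identity), else 0.
    Matching pairs are penalized by their squared distance; non-matching
    pairs only contribute while closer than `margin`."""
    d = np.linalg.norm(f1 - f2, axis=1)
    loss = same * d ** 2 + (1 - same) * np.maximum(0.0, margin - d) ** 2
    return loss.mean()
```

Depending on how `same` is defined (same expression vs. same identity), the identical loss form yields either the expression-sensitive or the identity-sensitive stream described above.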

4.1.6 Cascaded networks

Fig. 9: Representative cascaded network for FER. The proposed AU-aware deep network (AUDN) [122] is composed of three sequential modules: in the first module, a 2-layer CNN is trained to generate an over-complete representation encoding all expression-specific appearance variations over all possible locations; in the second module, an AU-aware receptive field layer is designed to search subsets of the over-complete representation; in the last module, a multilayer RBM is exploited to learn hierarchical features.

In a cascaded network, various modules that handle different tasks are combined sequentially to design a deeper network, where the outputs of the former modules are utilized by the latter modules. Related studies have proposed novel combinations of different structures to learn a hierarchy of features through which factors of variation that are unrelated with expressions can be gradually filtered out.

In [168], DBNs were trained to first detect faces and then to hierarchically detect expression-related areas. These parsed face components were then classified by a stacked autoencoder for expression recognition. In [169], multiple color channels were first coded by a CNN and then sequenced by t-SNE [170] based on their discriminability. Finally, the sequenced spatial features were input to an LSTM to learn expression information along the color channels. In [129], a multiscale contractive convolutional network (CCNET) was proposed to obtain local-translation-invariant (LTI) representations. Then, contractive discriminant analysis (CDA) was conducted to hierarchically separate the emotion-related factors from subject identity and pose. In [122, 123], over-complete representations were first learned and then filtered by an AU-aware feature selection scheme. Then, a multilayer RBM was exploited to learn higher-level features for FER (see Fig. 9 for details). Instead of simply concatenating different networks, Liu et al. [121] presented a novel boosted DBN (BDBN) that iteratively performed feature representation, feature selection and classifier construction in a unified loop framework. The discriminative ability for FER was increased by alternately iterating bottom-up unsupervised feature learning (BU-UFL) and boosted top-down supervised feature strengthening (BTD-SFS) until convergence.

4.1.7 Discussion

The existing well-constructed deep FER systems focus on two key issues: the lack of plentiful diverse data and expression-unrelated variation, such as illumination, head pose and identity. Pre-training and fine-tuning have become mainstream in deep FER to solve the problems of insufficient training data and overfitting. A technique that proved particularly useful is pre-training and fine-tuning the network in multiple stages using auxiliary data, moving from large-scale object or face recognition datasets to small-scale FER datasets, i.e., from large to small and from general to specific. In addition to the raw image data directly input into the deep network, diverse pre-designed features are recommended to strengthen the network's robustness to common distractions (e.g., illumination, head pose and occlusion) and to force the network to focus more on facial areas with expressive information. Moreover, the use of heterogeneous input is an alternative way to enlarge the data size. Beyond the popular network architectures, various structures and loss layers have been specifically designed to supervise the learning of more powerful features with discriminative inter-class separability and intra-class compactness.

Training a deep and wide network with a large number of hidden layers and flexible filters is acknowledged as an effective way to learn deep high-level features for a given task. However, this process is vulnerable to overfitting in FER when limited training data are available. Integrating multiple relatively small networks in parallel or in series is a natural research direction to overcome this problem. Network ensembles integrate diverse networks at the feature or decision level to combine their advantages. Furthermore, multitask networks jointly train multiple networks with consideration of interactions between the target FER task and other secondary tasks, such as facial landmark localization, facial AU recognition and face verification. Alternatively, cascaded networks sequentially train multiple networks in a hierarchical approach, in which case the discriminative ability of the learned features is continuously strengthened. In general, these integration methods can alleviate the overfitting problem while progressively disentangling the nuisance factors that are irrelevant to facial expression.

4.2 Deep FER networks for dynamic image sequences

Although most of the previous models focus on static images, facial expression analysis can benefit from the temporal correlations of consecutive frames in a sequence, which capture subtle appearance changes. We first introduce existing frame aggregation techniques that strategically combine deep features learned by static-based FER networks. Considering that in a video stream people usually display the same expression with different intensities, we then review methods that use images in different expression intensity states for intensity-invariant FER. Finally, we introduce deep FER networks that account for spatio-temporal motion patterns in video frames and learn features from the temporal structure. For each of the most frequently evaluated datasets, Table V shows the current state-of-the-art methods in the field conducted in the person-independent protocol.
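The simplest decision-level frame aggregation averages the per-frame class posteriors produced by a static FER network over the whole sequence. A minimal sketch of this baseline (the surveyed methods also use more elaborate feature-level statistics) is:

```python
import numpy as np

def aggregate_video(frame_probs):
    """Decision-level frame aggregation: average the per-frame class
    posteriors of a static FER network over the sequence and predict
    the class with the highest mean score.

    frame_probs: (n_frames, n_classes) array of per-frame posteriors.
    Returns (predicted_class, mean_posteriors)."""
    mean_probs = np.asarray(frame_probs, dtype=float).mean(axis=0)
    return int(mean_probs.argmax()), mean_probs
```

Feature-level variants instead pool per-frame deep features (e.g., by mean or standard deviation) into a fixed-length vector before classification, which preserves more information than averaging decisions.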

Dataset | Method | Network | Training data selection (1) | Testing data selection (1) | Data group | Performance (2) (%)
------- | ------ | ------- | --------------------------- | -------------------------- | ---------- | -------------------
CK+ | Zhao et al. 16 [171] | PPDN | from the 7th frame to the last (3) | the last frame | 10 folds | 6 classes: 99.3
CK+ | Yu et al. 17 [50] | DCPN | from the 7th frame to the last (3) | the peak expression | 10 folds | 6 classes: 99.6
CK+ | Sun et al. 17 [172] | network ensemble | S: emotional; T: neutral to emotional | the same as the training data (4) | 10 folds | 6 classes: 97.28
CK+ | Jung et al. 15 [30] | DTAGN | fixed number of frames | | 10 folds | 7 classes: 97.25 (95.22)
CK+ | Zhang et al. 17 [165] | network ensemble | S: the last frame; T: all frames | | 10 folds | 7 classes: 98.50 (97.78)
MMI | Kim et al. 17 [173] | CNN-LSTM | frames of five intensities | the same as the training data (4) | LOSO | 6 classes: 78.61 (78.00)
MMI | Hasani et al. 17 [174] | + landmark | ten frames | | 5 folds | 6 classes: 77.50 (74.50)
MMI | Hasani et al. 17 [174] | CNN-CRF | static frames | | 5 folds | 6 classes: 78.68
MMI | Zhang et al. 17 [165] | network ensemble | S: the middle frame; T: all frames | | 10 folds | 6 classes: 81.18 (79.30)
MMI | Sun et al. 17 [172] | network ensemble | S: emotional; T: neutral to emotional | | 10 folds | 6 classes: 91.46
Oulu-CASIA | Zhao et al. 16 [171] | PPDN | from the 7th frame to the last (3) | the last frame | 10 folds | 6 classes: 84.59
Oulu-CASIA | Yu et al. 17 [50] | DCPN | from the 7th frame to the last (3) | the peak expression | 10 folds | 6 classes: 86.23
Oulu-CASIA | Jung et al. 15 [30] | DTAGN | fixed number of frames | the same as the training data (4) | 10 folds | 6 classes: 81.46 (81.49)
Oulu-CASIA | Zhang et al. 17 [165] | network ensemble | S: the last frame; T: all frames | | 10 folds | 6 classes: 86.25 (86.25)
AFEW 5.0 (5) | Ebrahimi et al. [59] | CNN-IRNN | Training: 723; Validation: 383; Test: 539 | | | Validation: 39.6 (36.56)
AFEW 5.0 (5) | Ebrahimi et al. [59] | fusion | | | | Test: 52.88
AFEW 6.0 (5) | Yan et al. [175] | VGG-BRNN | 40 frames | | 3 folds | 7 classes: 44.46
AFEW 6.0 (5) | Yan et al. [175] | trajectory | 30 frames | | 3 folds | 7 classes: 37.37
AFEW 6.0 (5) | Fan et al. [86] | VGG-LSTM | 16 features for LSTM | | | Validation: 45.43 (38.96)
AFEW 6.0 (5) | Fan et al. [86] | C3D | several windows of 16 consecutive frames | | | Validation: 39.69 (38.55)
AFEW 6.0 (5) | Yan et al. [175] | fusion | Training: 773; Validation: 383; Test: 593 | | | Test: 56.66 (40.81)
AFEW 6.0 (5) | Fan et al. 16 [86] | fusion | Training: 774; Validation: 383; Test: 593 | | | Test: 59.02 (44.94)
AFEW 7.0 (5) | Ouyang et al. [176] | VGG-LSTM | 16 frames | | | Validation: 47.4
AFEW 7.0 (5) | Ouyang et al. [176] | ResNet-LSTM | 16 frames | | | Validation: 46.4
AFEW 7.0 (5) | Ouyang et al. [176] | C3D | 16 frames | | | Validation: 35.2
AFEW 7.0 (5) | Ouyang et al. [176] | fusion | Training: 773; Validation: 373; Test: 653 | | | Test: 57.2
AFEW 7.0 (5) | Vielzeuf et al. [177] | C3D-LSTM | detected face frames | | | Validation: 43.2
AFEW 7.0 (5) | Vielzeuf et al. [177] | VGG-LSTM | several windows of 16 consecutive frames | | | Validation: 48.6
AFEW 7.0 (5) | Vielzeuf et al. [177] | fusion | Training: 773; Validation: 383; Test: 653 | | | Test: 58.81 (43.23)
AFEW 7.0 (5) | Kim et al. [173] | fusion | | | | Test: 57.12 (43.85)
AFEW 7.0 (5) | Wang et al. [178] | fusion | | | | Test: 58.5 (44.71)

  • (1) Training and testing data selection refer to the frames selected from each sequence.

  • (2) The value in parentheses is the mean accuracy calculated from the confusion matrix given by the authors.

  • (3) A pair of images (peak and non-peak expression) is chosen for training each time.

  • (4) The network can directly output the prediction label for the input sequence.

  • (5) For AFEW, we include the result of a single spatio-temporal network and also the best result after fusion with both video and audio modalities.

  • 7 classes in CK+: anger, contempt, disgust, fear, happiness, sadness, and surprise.

  • 7 classes in AFEW: anger, disgust, fear, happiness, neutral, sadness, and surprise.

TABLE V: Performance summary of representative methods for dynamic-based deep facial expression recognition on the most widely evaluated datasets. LOSO = leave-one-subject-out.

TABLE V: Performance summary of representative methods for dynamic-based deep facial expression recognition on the most widely evaluated datasets. S = Spatial network; T = Temporal network; LOSO = leave-one-subject-out.

4.2.1 Frame aggregation

Because the frames in a given video clip may vary in intensity of the corresponding expression, directly measuring per-frame error in the target dataset does not yield satisfactory performance. Various methods have been introduced to further aggregate the network output for frames in each sequence to substantially improve the FER performance. We divide these methods into two groups: decision-level frame aggregation and feature-level frame aggregation.

For decision-level frame aggregation, Kahou et al. [33, 179] concatenated the class-probability vectors of the frames in a sequence to form a fixed-length feature vector for subsequent classification. Because sequences may have different numbers of frames, two aggregation approaches were proposed: frame averaging and frame expansion (see Fig. 10 for details). The authors also proposed a statistical approach that does not require a fixed number of frames; the average, max, average of square, average of winner-take-all, and average of maximum suppression vectors were used to summarize the per-frame probabilities in each sequence into a fixed-length representation.

(a) Frame averaging
(b) Frame expansion
Fig. 10: Frame aggregation in [33]. The flowchart is top-down. (a) For sequences with more than 10 frames, the probability vectors of 10 independent groups of frames taken uniformly along time are averaged. (b) For sequences with fewer than 10 frames, frames are repeated uniformly to obtain 10 frames in total.
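As a concrete illustration, the two aggregation schemes of Fig. 10 can be sketched in a few lines of NumPy; the 10-frame target and uniform grouping follow the figure, while the function name and exact index selection are our own simplifications:

```python
import numpy as np

def aggregate_probs(frame_probs, n_frames=10):
    """Decision-level frame aggregation (sketch of the scheme in Fig. 10).

    frame_probs: (T, C) array of per-frame class probabilities.
    Returns a fixed-length vector of size n_frames * C.
    """
    T, C = frame_probs.shape
    if T >= n_frames:
        # Frame averaging: split the frames into n_frames groups taken
        # uniformly along time and average the probabilities in each group.
        groups = np.array_split(np.arange(T), n_frames)
        pooled = np.stack([frame_probs[g].mean(axis=0) for g in groups])
    else:
        # Frame expansion: repeat frames uniformly to reach n_frames.
        idx = np.linspace(0, T - 1, n_frames).round().astype(int)
        pooled = frame_probs[idx]
    return pooled.reshape(-1)  # fixed-length (n_frames * C,) vector
```

Either branch yields the same output dimensionality, so sequences of any length map to one feature vector for the downstream classifier.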

For feature-level frame aggregation, [180] and [181] extracted a set of image features for a given sequence and applied three models: eigenvector (linear subspace), covariance matrix and multi-dimensional Gaussian distribution. Bargal et al. [60] proposed a statistical (STAT) encoding module that computed and concatenated the mean, variance, minimum, and maximum of the feature dimensions over all frames (shown in the posterior part of Fig. 7(a)). To improve this STAT encoding, Knyazev et al. [145] first performed an ablation study to remove the max features, then computed and averaged the features of transformed versions of each frame after frame-based augmentation, and finally added spectral features by computing and averaging the 1-dimensional Fourier transform (FFT) for each neuron. Kaya et al. [144] further used three polynomial functions: curvature, slope and offset. Moreover, Gaussian mixture models (GMM) with diagonal covariances were used for Fisher vector encoding of low-level descriptors. In addition, Xu et al. [182] proposed an image transfer encoding (ITE) method that treated a video as a bag of frames and used a bag-of-words scheme to encode each video, where the cluster centers were computed from an auxiliary image dataset.
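The core of the STAT module is simple enough to state directly; a minimal sketch, assuming per-frame features are stacked row-wise (the helper name is ours, not from [60]):

```python
import numpy as np

def stat_encode(features):
    """STAT encoding (sketch of the module in [60]): concatenate the mean,
    variance, minimum and maximum of each feature dimension over all frames.

    features: (T, D) array of per-frame deep features.
    Returns a (4 * D,) sequence-level representation.
    """
    return np.concatenate([
        features.mean(axis=0),
        features.var(axis=0),
        features.min(axis=0),
        features.max(axis=0),
    ])
```

The encoding is order-free, which is why it discards temporal structure; the spatio-temporal networks of Section 4.2.3 address exactly this limitation.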

4.2.2 Expression Intensity network

Fig. 11: The proposed PPDN in [171]. During training, PPDN is trained by jointly optimizing the L2-norm loss and the cross-entropy losses of two expression images. During testing, the PPDN takes one still image as input for probability prediction.

Most methods focus on recognizing the peak, high-intensity expression and ignore subtle, lower-intensity expressions. In this section, we introduce several deep networks that take training samples of different intensities as input to exploit the intrinsic correlations among expressions of the same subject that vary in intensity within a sequence.

Zhao et al. [171] proposed a peak-piloted deep network (PPDN) for intensity-invariant expression recognition. Specifically, the PPDN takes a pair of peak and non-peak expression images of the same type and from the same subject as input and utilizes an L2-norm loss to minimize the distance between the two images. Moreover, a back-propagation method called peak gradient suppression (PGS) was proposed to drive the learned feature of the non-peak expression towards that of the peak expression while avoiding the inverse, by ignoring the gradient information of the peak expression in the L2-norm minimization. Based on the PPDN, Yu et al. [50] proposed a deeper cascaded peak-piloted network (DCPN) that used a deeper and larger architecture to enhance the discriminative ability of the learned features and employed an integration training method called cascade fine-tuning to avoid overfitting. To assist the subsequent temporal learning, Kim et al. [173] utilized five intensity states (i.e., onset, onset-to-apex transition, apex, apex-to-offset transition and offset) from a sequence and adopted five loss functions to regulate the learning of the spatial feature representation. Specifically, two loss functions were devised to increase the expression class separability, namely, minimizing the expression classification error and minimizing the intra-class expression variation. To model subsequent temporal changes, two additional loss functions were designed to increase the expression intensity separability, namely, minimizing the intensity classification error and minimizing the intra-intensity variation. Hence, each expression class contains distinct clusters of different intensities. To further preserve the intensity continuity, a fifth loss function was devised to encode intermediate intensities. Instead of directly using samples with different expression intensities provided by the dataset, Kim et al. [183] proposed a convolutional encoder-decoder network to generate a reference (expressionless) face for each sample and then combined a contrastive metric loss and a reconstruction loss to jointly filter out information that is irrelevant or of negligible use for discriminative purposes.
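The PPDN objective of [171] can be summarized numerically: cross-entropy on both the peak and non-peak images plus an L2 term pulling the non-peak feature toward the peak feature. The sketch below computes only the loss value in NumPy; in the actual network the features and logits come from two weight-sharing branches, and PGS acts during back-propagation (e.g., by treating the peak feature as a constant), which plain NumPy cannot express:

```python
import numpy as np

def ppdn_loss(feat_peak, feat_nonpeak, logits_peak, logits_nonpeak,
              label, lam=1.0):
    """Sketch of the PPDN objective [171]. With peak gradient suppression
    (PGS), the gradient of the L2 term is propagated only through the
    non-peak branch, i.e. feat_peak is held fixed during that update.
    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    def cross_entropy(logits, y):
        # Numerically stable softmax followed by negative log-likelihood.
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return -np.log(p[y])

    l2 = np.sum((feat_nonpeak - feat_peak) ** 2)
    return cross_entropy(logits_peak, label) \
        + cross_entropy(logits_nonpeak, label) + lam * l2
```

Minimizing the L2 term under PGS moves the non-peak representation into the neighborhood of its peak counterpart, which is the source of the intensity invariance.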

4.2.3 Deep spatio-temporal FER network

Although the aforementioned frame aggregation can integrate the learned features of frames to produce a single feature vector representing the entire video sequence, the crucial temporal dependency is not explicitly exploited. By contrast, the spatio-temporal FER network takes a range of frames in a temporal window as a single input without prior knowledge of the expression intensity and utilizes both textural information and temporal dependencies in the image sequence for more subtle expression recognition.

RNN and C3D: RNN can robustly derive information from sequences by exploiting the fact that feature vectors for successive data are connected semantically and are therefore interdependent. Its improved version, LSTM, flexibly handles variable-length sequential data at lower computational cost. Graves et al. [184] applied both bidirectional LSTM [185] and unidirectional LSTM to features extracted from the deformable 3D wire-frame Candide-3 face model [186] for facial expression classification. Compared with RNN, CNN is more suitable for computer vision applications; hence, its derivative C3D [85], which uses 3D convolutional kernels with shared weights along the time axis instead of the traditional 2D kernels, has been widely used for dynamic-based FER (e.g., [187, 86, 176, 54]) to capture spatio-temporal features. Liu et al. [188] combined 3D CNN with DPM-inspired [189] deformable facial action part constraints to simultaneously encode dynamic motion and discriminative part-based representations (see Fig. 12 for details). Jung et al. [30] proposed a deep temporal appearance network (DTAN) that employed 3D filters without weight sharing along the time axis; hence, each filter can vary in importance over time. Vielzeuf et al. [177] proposed a weighted C3D, where several windows of consecutive frames were extracted from each sequence and weighted based on their prediction scores. Then, each video, considered as a bag of windows with a single label, was trained on C3D by multiple instance learning (MIL) [190] for the final prediction. Instead of directly using C3D for classification, Nguyen et al. [87] employed C3D for spatio-temporal feature extraction and then cascaded it with a DBN for training and prediction, including a layer-wise pre-training stage and a fine-tuning stage. In [191], C3D was also used as a feature extractor, followed by a NetVLAD layer [192] that aggregates the temporal information of the motion features by learning cluster centers.
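To make the difference from per-frame 2D filtering concrete, here is a minimal single-channel, valid-padding 3D convolution in NumPy. A real C3D layer additionally has multiple channels, padding, strides, nonlinearities and learned weights; this sketch only shows how one kernel with weights shared along the time axis slides over a clip:

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Minimal C3D-style 3D convolution (valid padding, single channel).

    clip:   (T, H, W) grayscale video clip.
    kernel: (t, h, w) 3D filter whose weights are shared along the time
            axis -- the key difference from applying a 2D filter per frame.
    """
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value pools appearance AND motion over a
                # small spatio-temporal window.
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```

Because each response integrates several consecutive frames, the learned filters can respond to motion patterns (e.g., a mouth-corner lift unfolding over time) rather than to static appearance alone.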

Fig. 12: The proposed 3DCNN-DAP [188]. The input frame sequence is convolved with 3D filters; then, part filters corresponding to manually defined facial parts are used to convolve the feature maps into facial action part detection maps for the expression classes.

Facial landmark trajectory: Related psychological studies have shown that expressions are invoked by dynamic motions of certain facial parts (e.g., eyes, nose and mouth) that contain the most descriptive information for representing expressions. To obtain more accurate facial actions for FER, facial landmark trajectory models have been proposed to capture the dynamic variations of facial components from consecutive frames. Jung et al. [30] proposed a deep temporal geometry network (DTGN) that first alternately concatenated the x-coordinates and y-coordinates of the facial landmark points from each frame after normalization and then concatenated these normalized points over time into a one-dimensional trajectory signal for each sequence. Inspired by this method, Yan et al. [175] extracted an image-like map by stretching all the normalized trajectory features of the frames in a sequence together as the input of a CNN. Kim et al. [193] designed a 2D landmark feature by computing the L2-norm distance between each pair of landmarks and then subtracting it between adjacent frames. Instead of directly using the trajectory features as the input layer of a CNN, Hasani et al. [174] incorporated the trajectory features by replacing the shortcut in the residual unit of the original 3D Inception-ResNet with an element-wise multiplication of the facial landmarks and the input tensor of the residual unit. Zhang et al. [165] proposed a part-based hierarchical bidirectional recurrent neural network (PHRNN) to capture the temporal information from consecutive frames. Specifically, facial landmarks were divided into four parts based on the facial physical structure and were then separately fed into BRNNs [194], where local features were concatenated along the feature extraction cascade and the global high-level features were formed in the upper layers (see Fig. 13 for details).
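The DTGN input encoding can be sketched as follows. Note that the exact normalization in [30] (the reference point and scaling) differs in detail; the version below centers each frame's landmarks on their mean and scales by their standard deviation, which is an assumption made for illustration:

```python
import numpy as np

def trajectory_signal(landmarks):
    """Sketch of the DTGN input encoding [30]: per frame, the landmarks
    are normalized, the x- and y-coordinates are interleaved, and the
    frames are concatenated into one 1-D trajectory signal per sequence.

    landmarks: (T, N, 2) array of N (x, y) landmark points per frame.
    Returns a (T * N * 2,) one-dimensional signal.
    """
    frames = []
    for pts in landmarks:
        # Remove translation and scale per frame (illustrative choice).
        pts = (pts - pts.mean(axis=0)) / (pts.std() + 1e-8)
        frames.append(pts.reshape(-1))       # x1, y1, x2, y2, ...
    return np.concatenate(frames)
```

The resulting vector discards appearance entirely and encodes only geometry, which is why such trajectory features are typically fused with an appearance stream, as in the PHRNN-MSCNN ensemble of [165].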

Cascaded networks: By combining the powerful perceptual vision representations learned from CNNs with the strength of LSTM in handling variable-length inputs and outputs, Donahue et al. [195] proposed a model that is deep in both space and time, called the long-term recurrent convolutional network (LRCN), which cascades the outputs of CNNs with LSTMs for various vision tasks involving time-varying inputs and outputs. Similar to this hybrid network, many cascaded networks have been proposed for FER (e.g., [86, 173, 196]). Baccouche et al. [197] proposed a convolutional sparse autoencoder to learn sparse and shift-invariant features in an unsupervised manner; an LSTM classifier was then trained on the temporal evolution of the learned features. Ouyang et al. [176] employed a more flexible network called ResNet-LSTM, which allows nodes in lower CNN layers to connect directly to LSTMs to capture spatio-temporal information. Hasani et al. [174] constructed 3D Inception-ResNet (3DIR) layers for feature maps and then cascaded them with an LSTM unit. Vielzeuf et al. [177] employed the proposed weighted C3D for feature extraction and further chose LSTM to better capture the temporal dependencies of variable-length inputs. In addition to concatenating an LSTM with the fully connected layer of a CNN, a hypercolumn-based system proposed in [198] extracted the last convolutional-layer features as the input of the LSTM to model longer-range dependencies without losing global coherence. Instead of LSTM, Ebrahimi et al. [59] used IRNNs [199], which provide a simpler mechanism for addressing the vanishing and exploding gradient problems. Moreover, Yan et al. [175] employed a bidirectional RNN (BRNN) [194] to learn the temporal relations in both the original and reversed directions. Hasani et al. [200] replaced the LSTM with a conditional random fields (CRFs) model [201], which assigns the most probable sequence of labels given the whole input and was trained on the whole training set over several iterations.

Network ensemble:

Fig. 13: The spatio-temporal network proposed in [165]. The temporal network PHRNN for landmark trajectory and the spatial network MSCNN for identity-invariant features are trained separately. Then, the predicted probabilities from the two networks are fused together for spatio-temporal FER.
Fig. 14: The joint fine-tuning method for DTAGN proposed in [30]. To integrate the DTGN and DTAN, the weight values in the gray boxes are frozen and the top layers in the green boxes are retrained. The logit values of the green boxes are used by Softmax3 to supervise the integrated network. During training, three softmax loss functions are combined; for prediction, only Softmax3 is used.

A two-stream CNN for action recognition in videos, which trained one stream of the CNN on the multi-frame dense optical flow for temporal information and the other stream on still images for appearance features and then fused the outputs of the two streams, was introduced by Simonyan et al. [202]. Several network ensemble models inspired by this architecture have been proposed for FER. Sun et al. [172] proposed a multi-channel network that extracted the spatial information from emotion-expressing faces and the temporal information (optical flow) from the changes between emotional and neutral faces, and investigated three feature fusion strategies: score average fusion, SVM-based fusion and neural-network-based fusion. Zhang et al. [165] fused the temporal network PHRNN (discussed in the "Facial landmark trajectory" section) and the spatial network MSCNN (discussed in section 4.1.5) to extract the partial-whole, geometry-appearance, and static-dynamic information for FER (see Fig. 13). Instead of network fusion, Jung et al. [30] proposed a joint fine-tuning method that jointly trained the DTAN (discussed in the "RNN and C3D" section), the DTGN (discussed in the "Facial landmark trajectory" section) and the integrated network (see Fig. 14 for details).
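Of the fusion strategies examined in [172], score-average fusion is the simplest; a minimal sketch, assuming each stream produces class logits (the weight w is a hypothetical tuning parameter, not a value from the paper):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

def score_average_fusion(spatial_logits, temporal_logits, w=0.5):
    """Score-average fusion of a two-stream ensemble: convert each
    stream's logits to class probabilities and take a weighted average.
    """
    return w * softmax(spatial_logits) + (1 - w) * softmax(temporal_logits)
```

SVM-based and neural-network-based fusion replace this fixed average with a learned combiner trained on the concatenated stream outputs.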

4.2.4 Discussion

In the real world, people display facial expressions in a dynamic process, e.g., from subtle to obvious, and it has become a trend to conduct FER on sequence/video data. Frame aggregation is widely employed to statistically combine the learned feature or prediction probability of each frame for a sequence-level result. Instead of directly using peak expression images for training, the expression intensity network considers images with subtle non-peak expression and further exploits the dynamic correlations between peak and non-peak expressions to improve FER performance.

Despite the advantages of the above-mentioned methods, frame aggregation handles frames in a sequence without consideration of temporal information, and expression intensity networks require prior knowledge of expression intensity. Deep spatio-temporal networks are designed to encode temporal dependencies in consecutive frames and have been shown to benefit from learning spatial features in conjunction with temporal features. RNN and its variations (e.g., LSTM, IRNN and BRNN) and C3D are foundational networks for learning spatio-temporal features. Alternatively, facial landmark trajectory methods extract shape features based on the physical structures of facial morphological variations and the dynamic evolutionary properties of facial expression, and then apply deep networks on these features to capture dynamic facial component variations in consecutive frames. Cascaded networks were proposed to first extract discriminative representations for facial expression images and then input these features to sequential networks (e.g., LSTM) to reinforce the temporal information encoding. Network ensemble is utilized to train multiple networks for both spatial and temporal representations and then to fuse the network outputs in the final stage.

Table III and Table V demonstrate the powerful capability and popularity of deep spatio-temporal networks. For instance, comparison results on widely evaluated benchmarks (e.g., CK+ and MMI) illustrate that training networks on sequence data and analyzing the temporal dependency between frames can further improve performance. In the EmotiW challenge 2015, only one system employed deep spatio-temporal networks for FER, whereas 5 of the 7 systems reviewed from the EmotiW challenge 2017 relied on such networks.

5 Additional Related Issues

In addition to the most popular basic expression classification task reviewed above, we further introduce a few related issues that depend on deep neural networks and prototypical expression-related knowledge.

Occlusion and non-frontal head pose, which may change the visual appearance of the original facial expression, are two major obstacles for automatic FER, especially in real-world scenarios. Ranzato et al. [203, 204] proposed a deep generative model that used mPoT [205] as the first layer of DBNs to model pixel-level representations and then trained the DBNs to fit an appropriate distribution to their inputs. Thus, the occluded pixels in images could be filled in by reconstructing the top-layer representation using the sequence of conditional distributions. Cheng et al. [206] employed multilayer RBMs with a pre-training and fine-tuning process on Gabor features to compress features from the occluded facial parts. Xu et al. [207] concatenated high-level learned features transferred from two CNNs with the same structure but pre-trained on different data: the original MSRA-CFW database and the MSRA-CFW database with occluded samples added. For head-pose-invariant FER, Zhang et al. [146] introduced a projection layer into the CNN that learned discriminative facial features by weighting different facial landmark points within 2D SIFT feature matrices without requiring facial pose estimation.

Although RGB data are the current standard in deep FER, these data are vulnerable to ambient lighting conditions and lack depth information for different facial parts. He et al. [208] employed thermal infrared images, which capture emitted heat patterns and are not sensitive to illumination variations, for FER. Specifically, a DBM model consisting of a Gaussian-binary RBM and a binary RBM was trained by layer-wise pre-training and joint training and was then fine-tuned on thermal infrared images to learn thermal features for FER. Wu et al. [209] proposed a three-stream 3D CNN to fuse local and global spatio-temporal features on illumination-invariant near-infrared (NIR) images for FER. Depth images and videos record the intensity of facial pixels based on their distance from the depth camera; they contain critical information about facial geometric relations and are also tolerant to illumination variation. [210] employed a CNN on unregistered facial depth images to recognize facial expressions based on depth information. Uddin et al. [211, 212] extracted a series of salient features from depth videos and further combined them with deep networks (i.e., CNN and RBM) for FER. Oyedotun et al. [213] employed a CNN to jointly learn facial expression features from both RGB and depth-map latent modalities. Li et al. [214] employed 3D face models, which are not only robust to large lighting and pose variations but can also capture subtle facial deformations, to describe 3D facial expressions. Specifically, six types of 2D facial attribute maps were first extracted from the textured 3D face scans and were then jointly fed into the feature extraction and feature fusion subnets of the proposed deep fusion CNN (DF-CNN) to learn a highly concentrated facial representation for FER.

Realistic facial expression synthesis, which can generate various facial expressions for interactive interfaces, is a hot topic. Susskind et al. [215] demonstrated that DBN has the capacity to capture the large range of variation in expressive appearance and can be trained on large but sparsely labeled datasets. In light of this work, [216, 203, 204] employed DBN with unsupervised learning to construct facial expression synthesis systems. Kaneko et al. [140] proposed a multitask deep network with state recognition and key-point localization to adaptively generate visual feedback to improve facial expression recognition. With the recent success of deep generative models, such as the variational autoencoder (VAE), the adversarial autoencoder (AAE), and the generative adversarial network (GAN), a series of facial expression synthesis systems have been developed based on these models (e.g., [217], [218], [219], [220], [221] and [222]). Facial expression synthesis can also be applied to data augmentation without manually collecting and labeling huge datasets. Masi et al. [223] employed a CNN to synthesize new face images by increasing face-specific appearance variation, such as expressions, within the 3D textured face model.

In addition to utilizing CNN for FER, several works (e.g., [124, 224, 225]) employed visualization techniques [226] on the learned CNN features to qualitatively analyze how the CNN contributes to the appearance-based learning process of FER and to qualitatively decipher which portions of the face yield the most discriminative information. The deconvolutional results all indicated that the activations of some particular filters on the learned features have strong correlations with the face regions that correspond to facial AUs.

Several novel issues have been approached on the basis of the prototypical expression categories: dominant and complementary emotion recognition challenge [227] and the Real versus Fake expressed emotions challenge [228]. Furthermore, deep learning techniques have been thoroughly applied by the participants of these two challenges (e.g., [229, 230, 231, 232]). Additional related real-world applications, such as the Real-time FER App for smartphones [233, 234], Eyemotion (FER using eye-tracking cameras) [235], privacy-preserving mobile analytics [236], and Unfelt emotions [237], have also been developed.

6 Challenges and Opportunities

As the FER literature shifts its main focus to the challenging in-the-wild environmental conditions, many researchers have committed to employing deep learning technologies to handle difficulties, such as illumination variation, occlusions, non-frontal head poses, identity bias and the recognition of low-intensity expressions. Given that FER is a data-driven task and that training a sufficiently deep network to capture subtle expression-related deformations requires a large amount of training data, the major challenge that deep FER systems face is the lack of training data in terms of both quality and quantity.

Because people of different age ranges, cultures and genders display and interpret facial expressions in different ways, an ideal facial expression dataset is expected to include abundant sample images with precise face attribute labels, not just expression but also other attributes such as age, gender and ethnicity, which would facilitate related research on cross-age-range, cross-gender and cross-cultural FER using deep learning techniques, such as multitask deep networks and transfer learning. In addition, although occlusion and multipose problems have received relatively wide interest in the field of deep face recognition, occlusion-robust and pose-invariant issues have received less attention in deep FER. One of the main reasons is the lack of a large-scale facial expression dataset with occlusion-type and head-pose annotations. On the other hand, accurately annotating a large volume of image data with the large variation and complexity of natural scenarios is an obvious impediment to the construction of expression datasets. A reasonable approach is to employ crowd-sourcing models [238, 113, 114] under the guidance of expert annotators. Alternatively, a fully automatic labeling tool [112] refined by experts can provide approximate but efficient annotations. In both cases, a subsequent reliable estimation or label-learning process is necessary to filter out noisy annotations. Notably, a few comparatively large-scale datasets that consider real-world scenarios and contain a wide range of facial expressions have recently become publicly available, i.e., EmotioNet [112], RAF-DB [113] and AffectNet [114], and we anticipate that, with advances in technology and the widespread reach of the Internet, more complementary facial expression datasets will be constructed to promote the development of deep FER.

Another major issue that requires consideration is that while FER within the categorical model is widely acknowledged and researched, the definition of the prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behaviors for realistic interactions. Two additional models were developed to describe a larger range of emotional landscape: the FACS model [166, 167], where various facial muscle AUs are combined to describe the visible appearance changes of facial expressions, and the dimensional model [239], where two continuous-valued variables, namely, valence and arousal, are proposed to continuously encode small changes in the intensity of emotions. Another novel definition, i.e., compound expression, was proposed by Du et al. [119], who argued that some facial expressions are actually combinations of more than one basic emotion. These works improve the characterization of facial expressions and, to some extent, can complement FER. For instance, as discussed above, the visualization results of CNNs have demonstrated a certain congruity between the learned representations and the facial areas defined by AUs. Thus, we can design filters of the deep neural networks to distribute different weights according to the importance degree of different facial muscle action parts.

Biases among different databases and the imbalanced distribution of expression classes are two additional problems to resolve in the field of deep FER. Researchers commonly evaluate their algorithms within a specific dataset and can achieve satisfactory performance. However, early cross-database experiments have indicated that discrepancies between databases exist due to different collection environments and construction criteria; hence, algorithms evaluated via intra-database protocols lack generalizability to unseen test data, and performance in cross-dataset settings deteriorates greatly. Deep domain adaptation and knowledge distillation are alternatives for addressing this bias. Another common problem in facial expression datasets is class imbalance, a result of the practicalities of data acquisition: eliciting and annotating a smile is easy, but capturing information for disgust, anger and other less common expressions can be very challenging. As shown in Table III and Table V, performance assessed in terms of mean accuracy, which assigns equal weights to all classes, decreases when compared with the accuracy criterion, and this decline is especially evident in real-world datasets (e.g., SFEW 2.0 and AFEW). One solution is to balance the class distribution during the pre-processing stage using data augmentation and synthesis. Another alternative is to develop a cost-sensitive loss layer for deep networks during training.
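A cost-sensitive loss layer of the kind mentioned above can be as simple as a class-weighted cross-entropy. The sketch below uses inverse-frequency weighting, which is one common choice rather than a scheme prescribed by the surveyed papers, and the class counts are hypothetical:

```python
import numpy as np

def weighted_cross_entropy(probs, label, class_weights):
    """Cost-sensitive cross-entropy for imbalanced expression classes:
    per-class weights (e.g., inversely proportional to class frequency)
    make rare expressions such as disgust contribute more to the loss
    than abundant ones such as happiness.

    probs: predicted class probabilities; label: true class index.
    """
    return -class_weights[label] * np.log(probs[label])

# Example: weights from (hypothetical) training-set class counts for the
# 7 basic expressions; rare classes receive larger weights.
counts = np.array([5000, 300, 400, 4000, 1200, 900, 800], dtype=float)
weights = counts.sum() / (len(counts) * counts)   # inverse frequency
```

With uniform weights this reduces to the standard cross-entropy, so the weighting can be introduced into an existing training pipeline without other changes.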

Last but not least, human expressive behaviors in realistic applications involve encodings from different perspectives, and the facial expression is only one modality. Although pure expression recognition based on visible face images can achieve promising results, incorporating other modalities into a high-level framework can provide complementary information and further enhance robustness. For example, participants in the EmotiW challenges and the Audio Video Emotion Challenges (AVEC) [240, 241] considered the audio modality to be the second most important element and employed various fusion techniques for multimodal FER. Additionally, the fusion of other modalities, such as depth information, thermal infrared images, 3D face models and physiological data, is becoming a promising research direction due to their large complementarity with facial expressions.


  • [1] C. Darwin and P. Prodger, The expression of the emotions in man and animals.   Oxford University Press, USA, 1998.
  • [2] Y.-I. Tian, T. Kanade, and J. F. Cohn, “Recognizing action units for facial expression analysis,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 2, pp. 97–115, 2001.
  • [3] P. Ekman, “Pictures of facial affect,” Consulting Psychologists Press, 1976.
  • [4] ——, “Facial expression and emotion.” American psychologist, vol. 48, no. 4, p. 384, 1993.
  • [5] D. Matsumoto, “More evidence for the universality of a contempt expression,” Motivation and Emotion, vol. 16, no. 4, pp. 363–368, 1992.
  • [6] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
  • [7] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.
  • [8] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 38–52, 2011.
  • [9] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, “Learning active facial patches for expression analysis,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2562–2569.
  • [10] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al., “Challenges in representation learning: A report on three machine learning contests,” in International Conference on Neural Information Processing.   Springer, 2013, pp. 117–124.
  • [11] A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon, “Emotion recognition in the wild challenge 2013,” in Proceedings of the 15th ACM on International conference on multimodal interaction.   ACM, 2013, pp. 509–516.
  • [12] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, “Emotion recognition in the wild challenge 2014: Baseline, data and protocol,” in Proceedings of the 16th International Conference on Multimodal Interaction.   ACM, 2014, pp. 461–466.
  • [13] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, “Video and image based emotion recognition challenges in the wild: Emotiw 2015,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.   ACM, 2015, pp. 423–426.
  • [14] A. Dhall, R. Goecke, J. Joshi, J. Hoey, and T. Gedeon, “Emotiw 2016: Video and group-level emotion recognition challenges,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 427–432.
  • [15] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 524–528.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [20] M. Pantic and L. J. M. Rothkrantz, “Automatic analysis of facial expressions: The state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1424–1445, 2000.
  • [21] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” Pattern Recognition, vol. 36, no. 1, pp. 259–275, 2003.
  • [22] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
  • [23] E. Sariyanidi, H. Gunes, and A. Cavallaro, “Automatic analysis of facial affect: A survey of registration, representation, and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 6, pp. 1113–1133, 2015.
  • [24] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer, “Meta-analysis of the first facial expression recognition challenge,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 4, pp. 966–979, 2012.
  • [25] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic, “Automatic analysis of facial actions: A survey,” IEEE Transactions on Affective Computing, 2017.
  • [26] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1.   IEEE, 2001, pp. I–I.
  • [27] A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on.   IEEE, 2016, pp. 1–10.
  • [28] F. De la Torre, W.-S. Chu, X. Xiong, F. Vicente, X. Ding, and J. F. Cohn, “Intraface,” in IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2015.
  • [29] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning for emotion recognition on small datasets using transfer learning,” in Proceedings of the 2015 ACM on international conference on multimodal interaction.   ACM, 2015, pp. 443–449.
  • [30] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in Computer Vision (ICCV), 2015 IEEE International Conference on.   IEEE, 2015, pp. 2983–2991.
  • [31] H. Ding, S. K. Zhou, and R. Chellappa, “Facenet2expnet: Regularizing a deep face recognition net for expression recognition,” in Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on.   IEEE, 2017, pp. 118–126.
  • [32] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.   IEEE, 2013, pp. 532–539.
  • [33] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari et al., “Combining modality specific deep neural networks for emotion recognition in video,” in Proceedings of the 15th ACM on International conference on multimodal interaction.   ACM, 2013, pp. 543–550.
  • [34] T. Devries, K. Biswaranjan, and G. W. Taylor, “Multi-task learning of facial landmarks and expression,” in Computer and Robot Vision (CRV), 2014 Canadian Conference on.   IEEE, 2014, pp. 98–103.
  • [35] B. Sun, L. Li, G. Zhou, and J. He, “Facial expression recognition in the wild based on multimodal texture features,” Journal of Electronic Imaging, vol. 25, no. 6, p. 061407, 2016.
  • [36] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2879–2886.
  • [37] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Robust discriminative response map fitting with constrained local models,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.   IEEE, 2013, pp. 3444–3451.
  • [38] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.
  • [39] M. Shin, M. Kim, and D.-S. Kwon, “Baseline cnn structure analysis for facial expression recognition,” in Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on.   IEEE, 2016, pp. 724–729.
  • [40] Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convolutional neural network for facial expression recognition,” in Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on.   IEEE, 2017, pp. 558–565.
  • [41] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [42] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos, “Densereg: Fully convolutional dense shape regression in-the-wild,” in Proc. CVPR, vol. 2, no. 3, 2017.
  • [43] P. Hu and D. Ramanan, “Finding tiny faces,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2017, pp. 1522–1530.
  • [44] Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.   ACM, 2015, pp. 435–442.
  • [45] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, “Joint cascade face detection and alignment,” in European Conference on Computer Vision.   Springer, 2014, pp. 109–122.
  • [46] C. Zhang and Z. Zhang, “Improving multiview face detection with multi-task deep convolutional neural networks,” in Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on.   IEEE, 2014, pp. 1036–1041.
  • [47] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee, “Hierarchical committee of deep cnns with exponentially-weighted decision fusion for static facial expression recognition,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.   ACM, 2015, pp. 427–434.
  • [48] P. Y. Simard, D. Steinkraus, J. C. Platt et al., “Best practices for convolutional neural networks applied to visual document analysis.” in ICDAR, vol. 3, 2003, pp. 958–962.
  • [49] D. A. Pitaloka, A. Wulandari, T. Basaruddin, and D. Y. Liliana, “Enhancing cnn with preprocessing stage in automatic emotion recognition,” Procedia Computer Science, vol. 116, pp. 523–529, 2017.
  • [50] Z. Yu, Q. Liu, and G. Liu, “Deeper cascaded peak-piloted network for weak expression recognition,” The Visual Computer, pp. 1–9, 2017.
  • [51] A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, “Facial expression recognition with convolutional neural networks: coping with few data and the training sample order,” Pattern Recognition, vol. 61, pp. 610–628, 2017.
  • [52] M. V. Zavarez, R. F. Berriel, and T. Oliveira-Santos, “Cross-database facial expression recognition based on fine-tuned deep convolutional network,” in Graphics, Patterns and Images (SIBGRAPI), 2017 30th SIBGRAPI Conference on.   IEEE, 2017, pp. 405–412.
  • [53] W. Li, M. Li, Z. Su, and Z. Zhu, “A deep-learning approach to facial expression recognition with candid images,” in Machine Vision Applications (MVA), 2015 14th IAPR International Conference on.   IEEE, 2015, pp. 279–282.
  • [54] I. Abbasnejad, S. Sridharan, D. Nguyen, S. Denman, C. Fookes, and S. Lucey, “Using synthetic data to improve facial expression analysis with 3d convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1609–1618.
  • [55] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [56] X. Liu, B. Kumar, J. You, and P. Jia, “Adaptive deep metric learning for identity-aware facial expression recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2017, pp. 522–531.
  • [57] G. Levi and T. Hassner, “Emotion recognition in the wild via convolutional neural networks and mapped binary patterns,” in Proceedings of the 2015 ACM on international conference on multimodal interaction.   ACM, 2015, pp. 503–510.
  • [58] V. Štruc and N. Pavešić, Photometric normalization techniques for illumination invariance.   IGI-Global, 2011, pp. 279–300.
  • [59] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.   ACM, 2015, pp. 467–474.
  • [60] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang, “Emotion recognition in the wild from videos using images,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 433–436.
  • [61] J. Li and E. Y. Lam, “Facial expression recognition using deep neural networks,” in Imaging Systems and Techniques (IST), 2015 IEEE International Conference on.   IEEE, 2015, pp. 1–6.
  • [62] J. Short, J. Kittler, and K. Messer, “A comparison of photometric normalisation algorithms for face verification,” in Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on.   IEEE, 2004, pp. 254–259.
  • [63] B. A. Wandell, Foundations of vision.   Sinauer Associates, 1995.
  • [64] W. Chen, M. J. Er, and S. Wu, “Illumination compensation and normalization for robust face recognition using discrete cosine transform in logarithm domain,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 2, pp. 458–466, 2006.
  • [65] A. Yao, D. Cai, P. Hu, S. Wang, L. Sha, and Y. Chen, “Holonet: towards robust emotion recognition in the wild,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 472–478.
  • [66] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen, “Learning supervised scoring ensemble for emotion recognition in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 553–560.
  • [67] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in unconstrained images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4295–4304.
  • [68] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust statistical face frontalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3871–3879.
  • [69] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Towards large-pose face frontalization in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3990–3999.
  • [70] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2439–2448.
  • [71] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning gan for pose-invariant face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1415–1424.
  • [72] L. Deng, D. Yu et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
  • [73] B. Fasel, “Robust face analysis using convolutional neural networks,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 2.   IEEE, 2002, pp. 40–43.
  • [74] ——, “Head-pose invariant facial expression recognition using convolutional neural networks,” in Proceedings of the 4th IEEE International Conference on Multimodal Interfaces.   IEEE Computer Society, 2002, p. 529.
  • [75] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, “Subject independent facial expression recognition with robust face detection using a convolutional neural network,” Neural Networks, vol. 16, no. 5-6, pp. 555–559, 2003.
  • [76] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [77] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
  • [78] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [79] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [80] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, and Q. Wei, “Combining multimodal features within a fusion network for emotion recognition in the wild,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.   ACM, 2015, pp. 497–502.
  • [81] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • [82] J. Li, D. Zhang, J. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan, and L. Xun, “Facial expression recognition with faster r-cnn,” Procedia Computer Science, vol. 107, pp. 135–140, 2017.
  • [83] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [84] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
  • [85] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Computer Vision (ICCV), 2015 IEEE International Conference on.   IEEE, 2015, pp. 4489–4497.
  • [86] Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 445–450.
  • [87] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and C. Fookes, “Deep spatio-temporal features for multimodal emotion recognition,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on.   IEEE, 2017, pp. 1215–1223.
  • [88] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [89] G. E. Hinton and T. J. Sejnowski, “Learning and relearning in Boltzmann machines,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 282–317, 1986.
  • [90] G. E. Hinton, “A practical guide to training restricted boltzmann machines,” in Neural networks: Tricks of the trade.   Springer, 2012, pp. 599–619.
  • [91] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
  • [92] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
  • [93] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [94] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
  • [95] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.   IEEE, 2013, pp. 8595–8598.
  • [96] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proceedings of the 28th International Conference on International Conference on Machine Learning.   Omnipress, 2011, pp. 833–840.
  • [97] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional auto-encoders for hierarchical feature extraction,” in International Conference on Artificial Neural Networks.   Springer, 2011, pp. 52–59.
  • [98] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
  • [99] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [100] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in International conference on machine learning, 2014, pp. 647–655.
  • [101] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on.   IEEE, 2014, pp. 512–519.
  • [102] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on.   IEEE, 2010, pp. 94–101.
  • [103] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on.   IEEE, 2005, 5 pp.
  • [104] M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 2010, p. 65.
  • [105] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, “The japanese female facial expression (jaffe) database,” 1998.
  • [106] J. M. Susskind, A. K. Anderson, and G. E. Hinton, “The toronto face database,” Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep, vol. 3, 2010.
  • [107] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.
  • [108] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on.   IEEE, 2006, pp. 211–216.
  • [109] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial expression recognition from near-infrared videos,” Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.
  • [110] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. van Knippenberg, “Presentation and validation of the radboud faces database,” Cognition and Emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
  • [111] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed emotional faces (kdef),” CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet, no. 1998, 1998.
  • [112] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
  • [113] S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2017, pp. 2584–2593.
  • [114] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. PP, no. 99, pp. 1–1, 2017.
  • [115] A. Dhall et al., “Collecting large, richly annotated facial-expression databases from movies,” IEEE MultiMedia, vol. 19, no. 3, pp. 34–41, 2012.
  • [116] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Acted facial expressions in the wild database,” Australian National University, Canberra, Australia, Technical Report TR-CS-11, vol. 2, p. 1, 2011.
  • [117] ——, “Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on.   IEEE, 2011, pp. 2106–2112.
  • [118] C. F. Benitez-Quiroz, R. Srinivasan, Q. Feng, Y. Wang, and A. M. Martinez, “Emotionet challenge: Recognition of facial expressions of emotion in the wild,” arXiv preprint arXiv:1703.01210, 2017.
  • [119] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
  • [120] S. Ouellet, “Real-time emotion recognition for gaming using deep convolutional network features,” arXiv preprint arXiv:1408.3750, 2014.
  • [121] P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via a boosted deep belief network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
  • [122] M. Liu, S. Li, S. Shan, and X. Chen, “Au-aware deep networks for facial expression recognition,” in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on.   IEEE, 2013, pp. 1–6.
  • [123] ——, “Au-inspired deep networks for facial expression feature learning,” Neurocomputing, vol. 159, pp. 126–136, 2015.
  • [124] P. Khorrami, T. Paine, and T. Huang, “Do deep neural networks learn facial action units when doing expression recognition?” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 19–27.
  • [125] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, “Facial expression recognition via learning deep sparse autoencoders,” Neurocomputing, vol. 273, pp. 643–649, 2018.
  • [126] J. Cai, Z. Meng, A. S. Khan, Z. Li, and Y. Tong, “Island loss for learning discriminative features in facial expression recognition,” arXiv preprint arXiv:1710.03144, 2017.
  • [127] Z. Zhang, P. Luo, C. L. Chen, and X. Tang, “From facial expression recognition to interpersonal relation prediction,” International Journal of Computer Vision, vol. 126, no. 5, pp. 1–20, 2018.
  • [128] D. Hamester, P. Barros, and S. Wermter, “Face expression recognition with a 2-channel convolutional neural network,” in Neural Networks (IJCNN), 2015 International Joint Conference on.   IEEE, 2015, pp. 1–8.
  • [129] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza, “Disentangling factors of variation for facial expression recognition,” in European Conference on Computer Vision.   Springer, 2012, pp. 808–822.
  • [130] S. Reed, K. Sohn, Y. Zhang, and H. Lee, “Learning to disentangle factors of variation with manifold interaction,” in International Conference on Machine Learning, 2014, pp. 1431–1439.
  • [131] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” arXiv preprint arXiv:1412.6596, 2014.
  • [132] Y. Tang, “Deep learning using linear support vector machines,” arXiv preprint arXiv:1306.0239, 2013.
  • [133] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning social relation traits from face images,” in IEEE International Conference on Computer Vision, 2015, pp. 3631–3639.
  • [134] Y. Guo, D. Tao, J. Yu, H. Xiong, Y. Li, and D. Tao, “Deep neural networks with relativity learning for facial expression recognition,” in Multimedia & Expo Workshops (ICMEW), 2016 IEEE International Conference on.   IEEE, 2016, pp. 1–6.
  • [135] B.-K. Kim, S.-Y. Dong, J. Roh, G. Kim, and S.-Y. Lee, “Fusing aligned and non-aligned face information for automatic affect recognition in the wild: A deep learning approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 48–57.
  • [136] C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: State of the art,” arXiv preprint arXiv:1612.02903, 2016.
  • [137] T. Connie, M. Al-Shabi, W. P. Cheah, and M. Goh, “Facial expression recognition using a hybrid cnn–sift aggregator,” in International Workshop on Multi-disciplinary Trends in Artificial Intelligence.   Springer, 2017, pp. 139–149.
  • [138] G. Pons and D. Masip, “Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition,” arXiv preprint arXiv:1802.06664, 2018.
  • [139] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
  • [140] T. Kaneko, K. Hiramatsu, and K. Kashino, “Adaptive visual feedback generation for facial expression improvement with multi-task deep neural networks,” in Proceedings of the 2016 ACM on Multimedia Conference.   ACM, 2016, pp. 327–331.
  • [141] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  • [142] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum, “Finding celebrities in billions of web images,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 995–1007, 2012.
  • [143] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning large face datasets,” in Image Processing (ICIP), 2014 IEEE International Conference on.   IEEE, 2014, pp. 343–347.
  • [144] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition in the wild using deep transfer learning and score fusion,” Image and Vision Computing, vol. 65, pp. 66–75, 2017.
  • [145] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko, “Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video,” arXiv preprint arXiv:1711.04598, 2017.
  • [146] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, “A deep neural network-driven feature learning method for multi-view facial expression recognition,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2528–2536, 2016.
  • [147] D. G. Lowe, “Object recognition from local scale-invariant features,” in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2.   IEEE, 1999, pp. 1150–1157.
  • [148] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, “Facial expression recognition with deep age,” in Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on.   IEEE, 2017, pp. 657–662.
  • [149] V. Mavani, S. Raman, and K. P. Miyapuram, “Facial expression recognition using visual saliency and deep learning,” arXiv preprint arXiv:1708.08016, 2017.
  • [150] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “A deep multi-level network for saliency prediction,” in Pattern Recognition (ICPR), 2016 23rd International Conference on.   IEEE, 2016, pp. 3488–3493.
  • [151] B.-F. Wu and C.-H. Lin, “Adaptive feature mapping for customizing deep learning based facial expression recognition model,” IEEE Access, 2018.
  • [152] J. Lu, V. E. Liong, and J. Zhou, “Cost-sensitive local binary feature learning for facial age estimation,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5356–5368, 2015.
  • [153] L. Chen, M. Zhou, W. Su, M. Wu, J. She, and K. Hirota, “Softmax regression based deep sparse autoencoder network for facial emotion recognition in human-robot interaction,” Information Sciences, vol. 428, pp. 49–61, 2018.
  • [154] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in International Conference on Machine Learning, 2016, pp. 2217–2225.
  • [155] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [156] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning.” in AAAI, vol. 4, 2017, p. 12.
  • [157] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 499–515.
  • [158] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [159] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer vision and pattern recognition (CVPR), 2012 IEEE conference on.   IEEE, 2012, pp. 3642–3649.
  • [160] G. Pons and D. Masip, “Supervised committee of convolutional neural networks in automated facial expression analysis,” IEEE Transactions on Affective Computing, 2017.
  • [161] B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, “Hierarchical committee of deep convolutional neural networks for robust facial expression recognition,” Journal on Multimodal User Interfaces, vol. 10, no. 2, pp. 173–189, 2016.
  • [162] K. Liu, M. Zhang, and Z. Pan, “Facial expression recognition with CNN ensemble,” in International Conference on Cyberworlds (CW), 2016, pp. 163–166.
  • [163] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, “On combining classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
  • [164] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006.
  • [165] K. Zhang, Y. Huang, Y. Du, and L. Wang, “Facial expression recognition based on deep evolutional spatial-temporal networks,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4193–4203, 2017.
  • [166] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS).   Oxford University Press, USA, 1997.
  • [167] P. Ekman, “Facial Action Coding System (FACS),” A Human Face, 2002.
  • [168] Y. Lv, Z. Feng, and C. Xu, “Facial expression recognition via deep learning,” in International Conference on Smart Computing (SMARTCOMP), 2014, pp. 303–308.
  • [169] J. Jang, D. H. Kim, H.-I. Kim, and Y. M. Ro, “Color channel-wise recurrent learning for facial expression recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 1233–1237.
  • [170] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  • [171] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan, “Peak-piloted deep network for facial expression recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 425–442.
  • [172] N. Sun, Q. Li, R. Huan, J. Liu, and G. Han, “Deep spatial-temporal feature fusion for facial expression recognition in static images,” Pattern Recognition Letters, 2017.
  • [173] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro, “Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition,” IEEE Transactions on Affective Computing, 2017.
  • [174] B. Hasani and M. H. Mahoor, “Facial expression recognition using enhanced deep 3D convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 2278–2288.
  • [175] J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, Y. Zong, and N. Sun, “Multi-clue fusion for emotion recognition in the wild,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 458–463.
  • [176] X. Ouyang, S. Kawaai, E. G. H. Goh, S. Shen, W. Ding, H. Ming, and D.-Y. Huang, “Audio-visual emotion recognition using deep transfer learning and multiple temporal models,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 577–582.
  • [177] V. Vielzeuf, S. Pateux, and F. Jurie, “Temporal multimodal fusion for video emotion classification in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 569–576.
  • [178] S. Wang, W. Wang, J. Zhao, S. Chen, Q. Jin, S. Zhang, and Y. Qin, “Emotion recognition with multimodal features and temporal models,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 598–602.
  • [179] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski et al., “EmoNets: Multimodal deep learning approaches for emotion recognition in video,” Journal on Multimodal User Interfaces, vol. 10, no. 2, pp. 99–111, 2016.
  • [180] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild,” in Proceedings of the 16th International Conference on Multimodal Interaction.   ACM, 2014, pp. 494–501.
  • [181] W. Ding, M. Xu, D. Huang, W. Lin, M. Dong, X. Yu, and H. Li, “Audio and face video emotion recognition in the wild using deep neural networks and small datasets,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 506–513.
  • [182] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Video emotion recognition with transferred deep feature encodings,” in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval.   ACM, 2016, pp. 15–22.
  • [183] Y. Kim, B. Yoo, Y. Kwak, C. Choi, and J. Kim, “Deep generative-contrastive networks for facial expression recognition,” arXiv preprint arXiv:1703.07140, 2017.
  • [184] A. Graves, C. Mayer, M. Wimmer, J. Schmidhuber, and B. Radig, “Facial expression recognition with recurrent neural networks,” in Proceedings of the International Workshop on Cognition for Technical Systems, 2008.
  • [185] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
  • [186] J. Ahlberg, “Candide-3: An updated parameterised face,” Tech. Rep., Linköping University, 2001.
  • [187] P. Barros and S. Wermter, “Developing crossmodal expression recognition based on a deep neural model,” Adaptive Behavior, vol. 24, no. 5, pp. 373–396, 2016.
  • [188] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen, “Deeply learning deformable facial action parts model for dynamic expression analysis,” in Asian Conference on Computer Vision.   Springer, 2014, pp. 143–157.
  • [189] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [190] S. Ali and M. Shah, “Human action recognition in videos using kinematic features and multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 288–303, 2010.
  • [191] S. Pini, O. B. Ahmed, M. Cornia, L. Baraldi, R. Cucchiara, and B. Huet, “Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 536–543.
  • [192] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
  • [193] D. H. Kim, M. K. Lee, D. Y. Choi, and B. C. Song, “Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction.   ACM, 2017, pp. 529–535.
  • [194] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [195] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
  • [196] D. K. Jain, Z. Zhang, and K. Huang, “Multi angle optimal pattern-based deep learning for automatic facial expression recognition,” Pattern Recognition Letters, 2017.
  • [197] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Spatio-temporal convolutional sparse auto-encoder for sequence classification.” in BMVC, 2012, pp. 1–12.
  • [198] S. Kankanamge, C. Fookes, and S. Sridharan, “Facial analysis in the wild with LSTM networks,” in IEEE International Conference on Image Processing (ICIP), 2017, pp. 1052–1056.
  • [199] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to initialize recurrent networks of rectified linear units,” arXiv preprint arXiv:1504.00941, 2015.
  • [200] B. Hasani and M. H. Mahoor, “Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 790–795.
  • [201] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.
  • [202] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [203] J. Susskind, V. Mnih, G. Hinton et al., “On deep generative models with applications to recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2857–2864.
  • [204] V. Mnih, J. M. Susskind, G. E. Hinton et al., “Modeling natural images using gated MRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2206–2222, 2013.
  • [205] V. Mnih, G. E. Hinton et al., “Generating more realistic images using gated MRFs,” in Advances in Neural Information Processing Systems, 2010, pp. 2002–2010.
  • [206] Y. Cheng, B. Jiang, and K. Jia, “A deep structure for facial expression recognition under partial occlusion,” in International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2014, pp. 211–214.
  • [207] M. Xu, W. Cheng, Q. Zhao, L. Ma, and F. Xu, “Facial expression recognition based on transfer learning from deep convolutional networks,” in International Conference on Natural Computation (ICNC), 2015, pp. 702–708.
  • [208] S. He, S. Wang, W. Lan, H. Fu, and Q. Ji, “Facial expression recognition using deep boltzmann machine from thermal infrared images,” in Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2013, pp. 239–244.
  • [209] Z. Wu, T. Chen, Y. Chen, Z. Zhang, and G. Liu, “NIRExpNet: Three-stream 3D convolutional neural network for near infrared facial expression recognition,” Applied Sciences, vol. 7, no. 11, p. 1184, 2017.
  • [210] E. P. Ijjina and C. K. Mohan, “Facial expression recognition using Kinect depth sensor and convolutional neural networks,” in International Conference on Machine Learning and Applications (ICMLA), 2014, pp. 392–396.
  • [211] M. Z. Uddin, M. M. Hassan, A. Almogren, M. Zuair, G. Fortino, and J. Torresen, “A facial expression recognition system using robust face features from depth videos and deep learning,” Computers & Electrical Engineering, vol. 63, pp. 114–125, 2017.
  • [212] M. Z. Uddin, W. Khaksar, and J. Torresen, “Facial expression recognition using salient features and convolutional neural network,” IEEE Access, vol. 5, pp. 26146–26161, 2017.
  • [213] O. K. Oyedotun, G. Demisse, A. E. R. Shabayek, D. Aouada, and B. Ottersten, “Facial expression recognition via joint deep learning of RGB-depth map latent representations,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2017.
  • [214] H. Li, J. Sun, Z. Xu, and L. Chen, “Multimodal 2D+3D facial expression recognition with deep fusion convolutional neural network,” IEEE Transactions on Multimedia, vol. 19, no. 12, pp. 2816–2831, 2017.
  • [215] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, “Generating facial expressions with deep belief nets,” in Affective Computing.   InTech, 2008.
  • [216] M. Sabzevari, S. Toosizadeh, S. R. Quchani, and V. Abrishami, “A fast and accurate facial expression synthesis system for color face images using face graph and deep belief network,” in International Conference on Electronics and Information Engineering (ICEIE), vol. 2, 2010, pp. V2-354.
  • [217] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala, “Semantic facial expression editing using autoencoded flow,” arXiv preprint arXiv:1611.09961, 2016.
  • [218] Y. Zhou and B. E. Shi, “Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder,” arXiv preprint arXiv:1708.09126, 2017.
  • [219] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversarial facial expression synthesis,” arXiv preprint arXiv:1712.03474, 2017.
  • [220] G. Gu, S. T. Kim, K. Kim, W. J. Baddar, and Y. M. Ro, “Differential generative adversarial networks: Synthesizing non-linear facial variations with limited number of training data,” arXiv preprint arXiv:1711.10267, 2017.
  • [221] H. Ding, K. Sricharan, and R. Chellappa, “ExprGAN: Facial expression editing with controllable expression intensity,” arXiv preprint arXiv:1709.03842, 2017.
  • [222] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang, “Geometry-contrastive generative adversarial network for facial expression synthesis,” arXiv preprint arXiv:1802.01822, 2018.
  • [223] I. Masi, A. T. Trần, T. Hassner, J. T. Leksut, and G. Medioni, “Do we really need to collect millions of faces for effective face recognition?” in European Conference on Computer Vision.   Springer, 2016, pp. 579–596.
  • [224] N. Mousavi, H. Siqueira, P. Barros, B. Fernandes, and S. Wermter, “Understanding how deep neural networks learn face expressions,” in International Joint Conference on Neural Networks (IJCNN), 2016, pp. 227–234.
  • [225] R. Breuer and R. Kimmel, “A deep learning perspective on the origin of facial expressions,” arXiv preprint arXiv:1705.01842, 2017.
  • [226] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision.   Springer, 2014, pp. 818–833.
  • [227] I. Lüsi, J. C. J. Junior, J. Gorbova, X. Baró, S. Escalera, H. Demirel, J. Allik, C. Ozcinar, and G. Anbarjafari, “Joint challenge on dominant and complementary emotion recognition using micro emotion features and head-pose estimation: Databases,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 809–813.
  • [228] J. Wan, S. Escalera, X. Baro, H. J. Escalante, I. Guyon, M. Madadi, J. Allik, J. Gorbova, and G. Anbarjafari, “Results and analysis of chalearn lap multi-modal isolated and continuous gesture recognition, and real versus fake expressed emotions challenges,” in ChaLearn LaP, Action, Gesture, and Emotion Recognition Workshop and Competitions: Large Scale Multimodal Gesture Recognition and Real versus Fake expressed emotions, ICCV, vol. 4, no. 6, 2017.
  • [229] Y.-G. Kim and X.-P. Huynh, “Discrimination between genuine versus fake emotion using long-short term memory with parametric bias and facial landmarks,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2017, pp. 3065–3072.
  • [230] S. Ozkan and G. B. Akar, “Relaxed spatio-temporal deep feature aggregation for real-fake expression prediction,” arXiv preprint arXiv:1708.07335, 2017.
  • [231] L. Li, T. Baltrusaitis, B. Sun, and L.-P. Morency, “Combining sequential geometry and texture features for distinguishing genuine and deceptive emotions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3147–3153.
  • [232] J. Guo, S. Zhou, J. Wu, J. Wan, X. Zhu, Z. Lei, and S. Z. Li, “Multi-modality network with visual and geometrical information for micro emotion recognition,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 814–819.
  • [233] I. Song, H.-J. Kim, and P. B. Jeon, “Deep learning for real-time robust facial expression recognition on a smartphone,” in IEEE International Conference on Consumer Electronics (ICCE), 2014, pp. 564–567.
  • [234] S. Bazrafkan, T. Nedelcu, P. Filipczuk, and P. Corcoran, “Deep learning for facial expression recognition: A step closer to a smartphone that knows your moods,” in IEEE International Conference on Consumer Electronics (ICCE), 2017, pp. 217–220.
  • [235] S. Hickson, N. Dufour, A. Sud, V. Kwatra, and I. Essa, “Eyemotion: Classifying facial expressions in VR using eye-tracking cameras,” arXiv preprint arXiv:1707.07204, 2017.
  • [236] S. A. Ossia, A. S. Shamsabadi, A. Taheri, H. R. Rabiee, N. Lane, and H. Haddadi, “A hybrid deep learning architecture for privacy-preserving mobile analytics,” arXiv preprint arXiv:1703.02952, 2017.
  • [237] K. Kulkarni, C. A. Corneanu, I. Ofodile, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari, “Automatic recognition of facial displays of unfelt emotions,” arXiv preprint arXiv:1707.04061, 2017.
  • [238] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction.   ACM, 2016, pp. 279–283.
  • [239] J. A. Russell, “A circumplex model of affect.” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
  • [240] M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, “Avec 2016: Depression, mood, and emotion recognition workshop and challenge,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge.   ACM, 2016, pp. 3–10.
  • [241] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017: Real-life depression, and affect recognition workshop and challenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge.   ACM, 2017, pp. 3–9.