Suppressing Uncertainties for Large-Scale Facial Expression Recognition


Annotating a high-quality large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators. These uncertainties pose a key challenge for large-scale Facial Expression Recognition (FER) in the deep learning era. To address this problem, this paper proposes a simple yet efficient Self-Cure Network (SCN) which suppresses the uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over each mini-batch that weights each training sample with a ranking regularization, and 2) a careful relabeling mechanism that modifies the labels of samples in the lowest-ranked group. Experiments on synthetic FER datasets and our collected WebEmotion dataset validate the effectiveness of our method. Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus. The code will be available at


1 Introduction

Facial expression is one of the most natural, powerful and universal signals for human beings to convey their emotional states and intentions [7, 38]. Automatically recognizing facial expressions is also important for helping computers understand human behavior and interact with humans. In the past decades, researchers have made significant progress on facial expression recognition (FER) with algorithms and large-scale datasets, where datasets can be collected in the laboratory or in the wild, such as CK+ [29], MMI [39], Oulu-CASIA [47], SFEW/AFEW [10], FERPlus [4], AffectNet [32], EmotioNet [11], RAF-DB [22], etc.

Figure 1: Illustration of uncertainties on real-world facial images from RAF-DB. The samples on the right are extremely difficult for machines and even for humans, and are better suppressed in training.

However, for large-scale FER datasets collected from the Internet, it is extremely difficult to annotate with high quality due to the uncertainties caused by the subjectiveness of annotators as well as ambiguous in-the-wild facial images. As illustrated in Figure 1, the uncertainties increase from high-quality and evident facial expressions to low-quality and micro expressions. These uncertainties usually lead to inconsistent labels and incorrect labels, which hinder the progress of large-scale Facial Expression Recognition (FER), especially for data-driven deep learning based FER. Generally, training with uncertainties of FER may lead to the following problems. First, it may result in over-fitting on the uncertain samples, which may be mislabeled. Second, it hinders a model from learning useful facial expression features. Third, a high ratio of incorrect labels can even prevent the model from converging in the early stage of optimization.

To address these issues, we propose a simple yet efficient method, termed Self-Cure Network (SCN), to suppress the uncertainties for large-scale facial expression recognition. The SCN consists of three crucial modules: self-attention importance weighting, ranking regularization, and noise relabeling. Given a batch of images, a backbone CNN is first used to extract facial features. Then the self-attention importance weighting module learns a weight for each image to capture the sample importance for loss weighting. It is expected that uncertain facial images are assigned low importance weights. Further, the ranking regularization module ranks these weights in descending order, splits them into two groups (i.e., high importance weights and low importance weights), and regularizes the two groups by enforcing a margin between the average weights of the two groups. This regularization is implemented with a loss function, termed Rank Regularization loss (RR-Loss). The ranking regularization module ensures that the first module learns meaningful weights to highlight certain samples (e.g., reliable annotations) and to suppress uncertain samples (e.g., ambiguous annotations). The last module is a careful relabeling module that attempts to relabel samples from the bottom group by comparing the maximum predicted probabilities to the probabilities of the given labels. A sample is assigned a pseudo label if the maximum predicted probability exceeds that of the given label by a margin threshold. In addition, since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN with extreme uncertainties.

Overall, our contributions can be summarized as follows,

  • We innovatively pose the uncertainty problem in facial expression recognition, and propose a Self-Cure Network to reduce the impact of uncertainties.

  • We elaborately design a rank regularization to supervise the SCN to learn meaningful importance weights, which also provides a reference for the relabeling module.

  • We extensively validate our SCN on synthetic FER data and a new real-world uncertain emotion dataset (WebEmotion) collected from the Internet. Our SCN also achieves 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus, setting new records on all three.

2 Related Work

2.1 Facial Expression Recognition

Generally, a FER system mainly consists of three stages, namely face detection, feature extraction, and expression recognition. In the face detection stage, face detectors such as MTCNN [44] and Dlib [2] are used to locate faces in complex scenes. The detected faces can optionally be further aligned. For feature extraction, various methods are designed to capture the facial geometry and appearance features caused by facial expressions. According to the feature type, they can be grouped into engineered features and learning-based features. The engineered features can be further divided into texture-based local features, geometry-based global features, and hybrid features. The texture-based features mainly include SIFT [34], HOG [6], histograms of LBP [35], Gabor wavelet coefficients [26], etc. The geometry-based global features are mainly based on the landmark points around noses, eyes, and mouths. Combining two or more of the engineered features is referred to as hybrid feature extraction, which can further enrich the representation. For the learned features, Fasel [12] finds that a shallow CNN is robust to face poses and scales. Tang [37] and Kahou et al. [21] utilize deep CNNs for feature extraction, and win the FER2013 and Emotiw2013 challenges, respectively. Liu et al. [27] propose a Facial Action Units based CNN architecture for expression recognition. Recently, both Li et al. [23] and Wang et al. [42] have designed region-based attention networks for pose- and occlusion-aware FER, where the regions are cropped either from landmark points or from fixed positions.

2.2 Learning with Uncertainties

Uncertainties in the FER task mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e., noisy labels). In particular, learning with noisy labels is extensively studied in the computer vision community, while the other aspects are rarely explored. In order to handle noisy labels, one intuitive idea is to leverage a small set of clean data that can be used to assess the quality of the labels during the training process [41, 25, 8], or to estimate the noise distribution [36], or to train the feature extractors [3]. Li et al. [25] propose a unified distillation framework using ‘side’ information from a small clean dataset and label relations in a knowledge graph, to ‘hedge the risk’ of learning from noisy labels. Veit et al. [40] use a multi-task network that jointly learns to clean noisy annotations and to classify images. Azadi et al. [3] select reliable images by an auxiliary image regularization for deep CNNs with noisy labels. Other methods do not need a small clean dataset, but they may assume extra constraints or distributions on the noisy samples [31], such as a specific loss for randomly flipped labels [33], regularizing the deep networks on corrupted labels by a MentorNet [20], and other approaches that model the noise with a softmax layer by connecting the latent correct labels to the noisy ones [13, 43]. For the FER task, Zeng et al. [43] first consider the inconsistent annotation problem among different FER datasets, and propose to leverage these uncertainties to improve FER. In contrast, our work focuses on suppressing these uncertainties to learn better facial expression features.

3 Self-Cure Network

Figure 2: The pipeline of our Self-Cure Network. Face images are first fed into a backbone CNN for feature extraction. The self-attention importance weighting module learns sample weights from facial features for loss weighting. The rank regularization module takes the sample weights as input and constrains them with a ranking operation and a margin-based loss function. The relabeling module hunts reliable samples by comparing maximum predicted probabilities to the probabilities of given labels. Mislabeled samples are marked with red solid rectangles and ambiguous samples with green dashed ones. It is worth noting that SCN mainly resorts to the re-weighting operation to suppress these uncertainties and only modifies some of the uncertain samples.

To learn robust facial expression features with uncertainties, we propose a simple yet efficient Self-Cure Network (SCN). In this section, we first provide an overview of the SCN, and then present its three modules. We finally present the detailed implementation of SCN.

3.1 Overview of Self-Cure Network

Our SCN is built upon traditional CNNs and consists of three crucial modules: i) self-attention importance weighting, ii) ranking regularization, and iii) relabeling, as shown in Figure 2.

Given a batch of face images with some uncertain samples, we first extract the deep features by a backbone network. The self-attention importance weighting module assigns an importance weight to each image using a fully-connected (FC) layer and the sigmoid function. These weights are multiplied by the logits to form a sample re-weighting scheme. To explicitly reduce the importance of uncertain samples, a rank regularization module is further introduced to regularize the attention weights. In the rank regularization module, we first rank the learned attention weights and then split them into two groups, i.e., high and low importance groups. We then add a constraint between the mean weights of these groups by a margin-based loss, which is called rank regularization loss (RR-Loss). To further improve our SCN, the relabeling module is added to modify some of the uncertain samples in the low importance group. This relabeling operation aims to hunt more clean samples and thereby enhance the final model. The whole SCN can be trained in an end-to-end manner and easily added to any CNN backbone.

3.2 Self-Attention Importance Weighting

We introduce the self-attention importance weighting module to capture the contributions of samples for training. It is expected that certain samples have high importance weights while uncertain ones have low importance weights. Let $\mathbf{F} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$ denote the facial features of $N$ images. The self-attention importance weighting module takes $\mathbf{F}$ as input and outputs an importance weight for each feature. Specifically, the module is comprised of a linear fully-connected (FC) layer and a sigmoid activation function, which can be formulated as

$$\alpha_i = \sigma(\mathbf{W}_a^{\top} \mathbf{x}_i),$$

where $\alpha_i$ is the importance weight of the $i$-th sample, $\mathbf{W}_a$ denotes the parameters of the FC layer used for attention, and $\sigma$ is the sigmoid function. This module also provides a reference for the other two modules.
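As a rough illustration of this module, the following pure-Python sketch computes one importance weight per feature vector with a single linear layer followed by a sigmoid. The names `importance_weights` and `w_a` and the toy two-dimensional features are assumptions made for this example; in the SCN itself the layer operates on CNN features and is trained end-to-end.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def importance_weights(features, w_a):
    """Return one importance weight in (0, 1) per feature vector.

    features: list of N feature vectors, each a list of D floats.
    w_a:      list of D floats standing in for the attention FC parameters
              (hypothetical values; the real ones are learned).
    """
    weights = []
    for x in features:
        score = sum(w * v for w, v in zip(w_a, x))  # linear FC layer, no bias
        weights.append(sigmoid(score))
    return weights

# Toy features: the first scores high, the second is neutral, the third low.
feats = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
alphas = importance_weights(feats, w_a=[1.0, 0.5])
```

Because of the sigmoid, every weight stays strictly between 0 and 1, which is what allows the later modules to rank and compare them.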

Logit-Weighted Cross-Entropy Loss. With the attention weights, we have two simple choices to perform loss weighting, inspired by [17]. The first choice is to multiply the weight of each sample by the sample loss. In our case, since the weights are optimized in an end-to-end manner and are learned from the CNN features, they would collapse to zero because this trivial solution yields zero loss. MentorNet [20] and other self-paced learning methods [19, 30] solve this problem by alternating minimization, i.e., optimizing one set of variables at a time while the other is held fixed. In this paper, we choose the logit-weighted variant of [17], which is shown to be more efficient. For a multi-class Cross-Entropy loss, we call our weighted loss the Logit-Weighted Cross-Entropy loss (WCE-Loss), which is formulated as

$$\mathcal{L}_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_i \mathbf{W}_{y_i}^{\top} \mathbf{x}_i}}{\sum_{j=1}^{C} e^{\alpha_i \mathbf{W}_j^{\top} \mathbf{x}_i}},$$

where $\mathbf{x}_i$, $y_i$, and $\alpha_i$ are the feature, the given label, and the importance weight of the $i$-th of $N$ samples, $C$ is the number of classes, and $\mathbf{W}_j$ is the $j$-th classifier. As suggested in [28], $\mathcal{L}_{WCE}$ has a positive correlation with $\alpha$.
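The difference between weighting the loss and weighting the logits can be checked numerically. The sketch below is a minimal pure-Python version of a logit-weighted cross-entropy, written under the assumption that each sample's logits are simply scaled by its importance weight before the softmax; the function name and the toy logits are illustrative only.

```python
import math

def logit_weighted_ce(logits, labels, alphas):
    """Mean cross-entropy after scaling each sample's logits by its weight."""
    total = 0.0
    for z, y, a in zip(logits, labels, alphas):
        scaled = [a * v for v in z]
        m = max(scaled)  # shift by the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]  # equals -log softmax(scaled)[y]
    return total / len(logits)

# One confidently scored sample whose given label (class 1) disagrees
# with the prediction (class 0), i.e. a likely mislabeled sample.
logits, labels = [[4.0, 0.0, 0.0]], [1]
loss_full = logit_weighted_ce(logits, labels, [1.0])  # weight 1: large loss
loss_low = logit_weighted_ce(logits, labels, [0.1])   # small weight: smaller loss
loss_zero = logit_weighted_ce(logits, labels, [0.0])  # weight 0: log(3), not 0
```

With a zero weight the softmax becomes uniform and the loss equals the logarithm of the number of classes rather than zero, so driving all weights to zero is no longer a free way to remove the loss, unlike multiplying the per-sample loss by the weight.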

3.3 Rank Regularization

The self-attention weights in the above module can be arbitrary in (0, 1). To explicitly constrain the importance of uncertain samples, we design a rank regularization module to regularize the attention weights. In the rank regularization module, we first rank the learned attention weights in descending order and then split them into two groups with a ratio $\beta$. The rank regularization ensures that the mean attention weight of the high-importance group is higher than that of the low-importance group by a margin. Formally, we define a rank regularization loss (RR-Loss) for this purpose as follows,

$$\mathcal{L}_{RR} = \max\{0,\ \delta_1 - (\alpha_H - \alpha_L)\},$$

$$\alpha_H = \frac{1}{M} \sum_{i=1}^{M} \alpha_i, \qquad \alpha_L = \frac{1}{N-M} \sum_{i=M+1}^{N} \alpha_i,$$

where $\alpha_i$ is the $i$-th attention weight after ranking, $\delta_1$ is a margin which can be a fixed hyper-parameter or a learnable parameter, and $\alpha_H$ and $\alpha_L$ are the mean values of the high-importance group with $\beta N = M$ samples and the low-importance group with $N - M$ samples, respectively. In training, the total loss function is $\mathcal{L}_{all} = \gamma \mathcal{L}_{RR} + (1 - \gamma) \mathcal{L}_{WCE}$, where $\mathcal{L}_{WCE}$ is the WCE-Loss and $\gamma$ is a trade-off ratio.
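The ranking, splitting, and margin steps can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' code: the rounding used to turn the ratio into a group size, the function names, and the convex-combination form of the total loss (a trade-off ratio `gamma` on the RR-Loss and `1 - gamma` on the WCE-Loss) are assumptions chosen to match the 70%/30% split, the 0.15 margin, and the 1:1 loss ratio reported in the implementation details.

```python
def rank_regularization_loss(alphas, beta=0.7, margin=0.15):
    """Hinge penalty on the gap between high- and low-importance group means."""
    ranked = sorted(alphas, reverse=True)  # descending importance
    n = len(ranked)
    # Size of the high-importance group; rounding and clamping are
    # assumptions of this sketch so that both groups are non-empty.
    m = min(max(round(beta * n), 1), n - 1)
    mean_high = sum(ranked[:m]) / m
    mean_low = sum(ranked[m:]) / (n - m)
    return max(0.0, margin - (mean_high - mean_low))

def total_loss(rr_loss, wce_loss, gamma=0.5):
    """Assumed convex combination of the two losses (1:1 at gamma = 0.5)."""
    return gamma * rr_loss + (1.0 - gamma) * wce_loss

separated = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.3, 0.2, 0.1]
collapsed = [0.5] * 10
rr_separated = rank_regularization_loss(separated)  # groups already far apart
rr_collapsed = rank_regularization_loss(collapsed)  # no gap: full margin penalty
```

When the two group means are already separated by more than the margin the penalty vanishes, whereas identical weights incur the full margin as loss, which pushes the weighting module away from assigning every sample the same importance.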

Category   Happy   Sad     Surprise  Fear    Angry   Disgust  Contempt  Neutral  Total
# Videos   4,231   5,670   4,573     5,328   5,668   5,197    5,266     5,406    41,339
# Clips    27,854  29,667  27,418    29,822  31,483  20,764   6,454     26,687   200,149
Table 1: The statistics of our WebEmotion.

3.4 Relabeling

In the rank regularization module, each mini-batch is divided into two groups, i.e. the high-importance and the low-importance groups. We experimentally find that the uncertain samples usually have low importance weights, thus an intuitive idea is to design a strategy to relabel these samples. The main challenge to modify these annotations is to know which annotation is incorrect.

Specifically, our relabeling module only considers the samples in the low-importance group and is performed on the Softmax probabilities. For each sample, we compare the maximum predicted probability to the probability of the given label. A sample is assigned a new pseudo label if the maximum predicted probability exceeds that of the given label by a threshold. Formally, the relabeling module can be defined as

$$y' = \begin{cases} l_{max} & \text{if } P_{max} - P_{gtInd} > \delta_2, \\ l_{org} & \text{otherwise}, \end{cases}$$

where $y'$ denotes the new label, $\delta_2$ is a threshold, $P_{max}$ is the maximum predicted probability, and $P_{gtInd}$ is the predicted probability of the given label. $l_{org}$ and $l_{max}$ are the original given label and the index of the maximum prediction, respectively.
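The relabeling rule amounts to a single comparison per sample. The sketch below is a minimal pure-Python version with a hypothetical function name and toy probability vectors; it uses the default threshold of 0.2 from the implementation details and is meant to be applied only to samples in the low-importance group.

```python
def relabel(probs, given_label, threshold=0.2):
    """Return the predicted class if it beats the given label by the threshold.

    probs:       softmax probabilities of one sample (list of floats).
    given_label: index of the original annotation.
    """
    predicted = max(range(len(probs)), key=lambda c: probs[c])
    if probs[predicted] - probs[given_label] > threshold:
        return predicted  # confident disagreement: adopt the pseudo label
    return given_label    # otherwise keep the original annotation

confident_flip = relabel([0.7, 0.2, 0.1], given_label=1)     # gap 0.5
small_gap_kept = relabel([0.45, 0.35, 0.2], given_label=1)   # gap 0.1
agreeing_kept = relabel([0.6, 0.3, 0.1], given_label=0)      # gap 0.0
```

Only the first sample is relabeled (to class 0). The second keeps its label because the gap stays below the threshold, which is how the threshold guards against relabeling on weak evidence.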

In our system, uncertain samples are expected to obtain low importance weights, which reduces their negative impact through re-weighting. They then fall into the low-importance group and may finally be corrected into certain samples by relabeling. Those corrected samples may obtain high importance weights in the next epoch. We expect that the network can cure itself with either re-weighting or relabeling, which is why we call our method the Self-Cure Network.

3.5 Implementation

Pre-processing and facial features. In our SCN, face images are detected and aligned by MTCNN [45] and further resized to 224 × 224 pixels. The SCN is implemented with the PyTorch toolbox and the backbone network is ResNet-18 [16]. By default, the ResNet-18 is pre-trained on the MS-Celeb-1M face recognition dataset and the facial features are extracted from its last pooling layer.

Training. We train our SCN in an end-to-end manner with 8 Nvidia Titan 2080ti GPUs, and set the batch size to 1024. In each iteration, the training images are divided into two groups, with 70% high-importance samples and 30% low-importance samples by default. The margin between the mean values of the high- and low-importance groups can be either set to 0.15 by default or designed as a learnable parameter. Both strategies are evaluated in the Experiments section. The whole network is jointly optimized with RR-Loss and WCE-Loss. The ratio of the two losses is empirically set to 1:1, and its influence is studied in the ablation study. The learning rate is initialized as 0.1 and is divided by 10 after 15 epochs and again after 30 epochs. Training stops at 40 epochs. The relabeling module is included in the optimization from the 10th epoch, where the relabeling margin is set to 0.2 by default.
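The schedule described in this paragraph can be written down as two small helper functions. This is a sketch under an assumed convention that epochs are counted from 1, so that the rate drops for epochs 16 and 31 onward and relabeling is active from epoch 10; the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(15, 30), factor=0.1):
    """Step schedule: divide the rate by 10 once each milestone has passed.

    epoch is 1-based (an assumption of this sketch), so epochs 1..15 use
    base_lr, epochs 16..30 use base_lr / 10, and epochs 31..40 use base_lr / 100.
    """
    lr = base_lr
    for milestone in milestones:
        if epoch > milestone:
            lr *= factor
    return lr

def relabeling_active(epoch, start_epoch=10):
    """Relabeling joins the optimization from the 10th epoch onward."""
    return epoch >= start_epoch

TOTAL_EPOCHS = 40
schedule = [(e, learning_rate(e), relabeling_active(e))
            for e in range(1, TOTAL_EPOCHS + 1)]
```

Delaying relabeling until the weights and predictions have had some epochs to settle is consistent with the ablation result that naive relabeling from an untrained baseline can hurt accuracy.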

4 Experiments

In this section, we first describe three public datasets and our WebEmotion dataset. We then demonstrate the robustness of our SCN under uncertainties of both synthetic and real-world noisy facial expression annotations. Further, we conduct ablation studies with qualitative and quantitative results to show the effectiveness of each module in SCN. Finally, we compare our SCN to the state-of-the-art methods on public datasets.

4.1 Datasets

RAF-DB [22] contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In our experiments, only images with the six basic expressions (happiness, surprise, sadness, anger, disgust, fear) and the neutral expression are used, which leads to 12,271 images for training and 3,068 images for testing. The overall sample accuracy is used for measurement.

FERPlus [4] is extended from FER2013 as used in the ICML 2013 Challenges. It is a large-scale dataset collected by the Google search engine. It consists of 28,709 training images, 3,589 validation images and 3,589 test images, all of which are resized to 48 × 48 pixels. Contempt is included, which leads to 8 classes in this dataset. The overall sample accuracy is used for measurement.

AffectNet [32] is by far the largest dataset that provides both categorical and Valence-Arousal annotations. It contains more than one million images from the Internet by querying expression-related keywords in three search engines, of which 450,000 images are manually annotated with eight expression labels as in FERPlus. It has imbalanced training and test sets as well as a balanced validation set. The mean class accuracy on the validation set is used for measurement.
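The two evaluation measures used for these datasets differ when classes are imbalanced. The following pure-Python sketch, with hypothetical function names and a toy two-class example, contrasts the overall sample accuracy with the mean class accuracy.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    hits, counts = {}, {}
    for t, p in zip(y_true, y_pred):
        counts[t] = counts.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

# Imbalanced toy set: four samples of class 0, one of class 1,
# and a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
overall = overall_accuracy(y_true, y_pred)
per_class_mean = mean_class_accuracy(y_true, y_pred)
```

Here the overall accuracy is 0.8 while the mean class accuracy is only 0.5, since the minority class is never recognized.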

The collected WebEmotion. Since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN with extreme uncertainties. WebEmotion is a video dataset (though we use it as image data by assigning labels to frames) downloaded from YouTube with a set of keywords including 40 emotion-related words, 45 countries from Asia, Europe, Africa, and America, and 6 age-related words (i.e., baby, lady, woman, man, old man, old woman). It consists of the same 8 classes as FERPlus, where each class is connected to several emotion-related keywords, e.g., Happy is connected to the keywords happy, funny, ecstatic, smug, and kawaii. To obtain a meaningful correlation between the keywords and the searched videos, only the top 20 crawled videos shorter than 4 minutes are selected. This leads to around 41,000 videos, which are further segmented into 200,000 video clips with the constraint that a face (detected by MTCNN) appears for at least 5 seconds. For evaluation, we only use WebEmotion for pretraining since annotating it is extremely difficult. Table 1 shows the statistics of WebEmotion. The meta videos and video clips will be made public to the research community.

Figure 3: Visualization of the learned importance weights in our SCN. We show these weights on randomly selected images with original labels (1st row) and with synthetic noisy labels before and after relabeling (2nd and 3rd rows).

4.2 Evaluation of SCN on Synthetic Uncertainties

The uncertainties of FER mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e., noisy labels). Considering that only noisy labels can be analyzed quantitatively, we explore the robustness of SCN with three levels of label noise, with ratios of 10%, 20%, and 30%, on the RAF-DB, FERPlus, and AffectNet datasets. Specifically, we randomly choose 10%, 20%, and 30% of the training data for each category and randomly change their labels to other categories. In Table 2, we use ResNet-18 as the CNN backbone and compare our SCN to the baseline (traditional CNN training without considering label noise) with two training schemes: i) training from scratch and ii) fine-tuning from a model pretrained on MS-Celeb-1M [15]. We also compare our SCN to two state-of-the-art noise-tolerant methods on RAF-DB, namely CurriculumNet [14] and MetaCleaner [46].
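The synthetic corruption procedure can be sketched as follows. This is an illustrative pure-Python version, not the authors' script: the function name, the rounding of the per-class count, and the fixed random seed are assumptions of this example.

```python
import random

def inject_label_noise(labels, noise_ratio, num_classes, seed=0):
    """Flip a fixed fraction of the labels in every class to a different class."""
    rng = random.Random(seed)  # seeded so that the corruption is reproducible
    noisy = list(labels)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y, idxs in by_class.items():
        k = round(noise_ratio * len(idxs))  # number of samples to corrupt
        others = [c for c in range(num_classes) if c != y]
        for idx in rng.sample(idxs, k):
            noisy[idx] = rng.choice(others)  # never keeps the original label
    return noisy

# Toy balanced set: 7 classes with 100 samples each, 30% noise per class.
clean = [c for c in range(7) for _ in range(100)]
noisy = inject_label_noise(clean, noise_ratio=0.3, num_classes=7, seed=1)
num_flipped = sum(a != b for a, b in zip(clean, noisy))
```

Sampling within each category keeps the noise ratio identical across classes, so rare expressions are corrupted at the same rate as frequent ones.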

As shown in Table 2, our SCN consistently improves the baseline by a large margin. For scheme i) with noise ratio 30%, our SCN outperforms the baseline by 13.80%, 1.07%, and 1.91% on RAF-DB, AffectNet, and FERPlus, respectively. For scheme ii) with noise ratio 30%, our SCN still gains improvements of 2.20%, 2.47%, and 3.13% on these datasets, even though the baseline performance is already relatively high. For both schemes, the benefit from SCN becomes more obvious as the noise ratio increases. CurriculumNet designs a training curriculum by measuring data complexity using cluster density, which can avoid training on noisy-labeled data in early stages. MetaCleaner aggregates the features of several samples in each class into a weighted mean feature for classification, which can also weaken the noisy-labeled samples. Both CurriculumNet and MetaCleaner improve the baseline substantially but are still inferior to the SCN, which is simpler. Another interesting finding is that the improvement of SCN on RAF-DB is much higher than on the other datasets. This may be explained as follows. On the one hand, RAF-DB contains compound facial expressions and is annotated by 40 people through crowdsourcing, which makes the annotations more inconsistent. Thus, our SCN may also gain improvement on the original RAF-DB without synthetic label noise. On the other hand, AffectNet and FERPlus are annotated by experts, so fewer inconsistent labels are involved, leading to smaller improvements on these two datasets.

Method              Pretrain  Noise(%)  RAF-DB  AffectNet  FERPlus
CurriculumNet [14]  -         10        68.5    -          -
MetaCleaner [46]    -         10        68.45   -          -
Baseline            ✗         10        61.43   44.68      77.15
SCN                 ✗         10        70.26   45.23      78.53
CurriculumNet [14]  -         20        61.23   -          -
MetaCleaner [46]    -         20        61.35   -          -
Baseline            ✗         20        55.5    41.00      71.88
SCN                 ✗         20        63.50   41.63      72.46
CurriculumNet [14]  -         30        57.52   -          -
MetaCleaner [46]    -         30        58.89   -          -
Baseline            ✗         30        46.81   38.35      68.54
SCN                 ✗         30        60.61   39.42      70.45
Baseline            ✓         10        80.81   57.18      83.39
SCN                 ✓         10        82.18   58.58      84.28
Baseline            ✓         20        78.18   56.15      82.24
SCN                 ✓         20        80.10   57.25      83.17
Baseline            ✓         30        75.26   52.58      79.34
SCN                 ✓         30        77.46   55.05      82.47
Table 2: The evaluation of SCN on synthetic noisy FER datasets. ‘Pretrain’ means we use a pretrained model from face recognition, otherwise we train from scratch.

Visualization of $\alpha$ in SCN. To further investigate the effectiveness of our SCN under noisy annotations, we visualize the importance weight $\alpha$ during the training phase of SCN on RAF-DB with noise ratio 10%. In Figure 3, the first row shows the importance weights when SCN is trained with the original labels. The images of the second row are annotated with synthetic corrupted labels, and we use SCN (without the relabeling module) to train on the synthetic noisy dataset. Indeed, the SCN regards those label-corrupted images as noise and automatically suppresses their weights. After sufficient training epochs, the relabeling module is added into SCN, and these noisy-labeled images are relabeled (many others may not be relabeled because of the relabeling constraint). After several more epochs, their importance weights become high (the 3rd row), which demonstrates that our SCN can ‘self-cure’ the corrupted labels. It is worth noting that the new labels from the relabeling module may be inconsistent with the “ground-truth” labels (see the 1st, 4th, and 6th columns), but they are also visually reasonable.

WebEmotion  SCN  RAF-DB  AffectNet  FERPlus
✗           ✗    72.00   46.58      82.4
w/o SCN     ✗    78.97   56.43      84.20
w/o SCN     ✓    80.42   57.23      85.13
SCN         ✓    82.45   58.45      85.97
Table 3: The effect of SCN on WebEmotion for pretraining. The 2nd column indicates finetuning with or without SCN.

4.3 Exploring SCN on Real-World Uncertainties

Synthetic noisy data proves the effectiveness of the ‘self-curing’ ability of SCN. In this section, we apply our SCN to real-world FER datasets which can include all types of uncertainties.

SCN on WebEmotion for pretraining. Our collected WebEmotion dataset contains massive noise since the search keywords are regarded as labels. To better validate the effect of SCN on real-world noisy data, we apply our SCN to WebEmotion for pretraining and then fine-tune the model on the target datasets. We show the comparison experiments in Table 3. From the 1st and the 2nd rows, we can see that pretraining on WebEmotion without SCN improves the baseline by 6.97%, 9.85%, and 1.80% on RAF-DB, AffectNet, and FERPlus, respectively. Fine-tuning with SCN on the target datasets obtains gains ranging from 1% to 2%. Pretraining on WebEmotion with SCN further boosts the performance from 80.42% to 82.45% on RAF-DB. This suggests that SCN learns robust features on WebEmotion, which are better suited for further fine-tuning.

Figure 4: Ten examples from RAF-DB (w/o synthetic noisy labels) with low importance weights. Each column corresponds to a basic emotion. Readers can guess their labels; the ground-truth labels from RAF-DB are given in the text.
Method              Pretrain  RAF-DB  AffectNet  FERPlus
Baseline            ✗         72.00   46.58      82.4
SCN                 ✗         78.31   47.28      83.42
CurriculumNet [14]  -         74.67   -          -
MetaCleaner [46]    -         77.18   -          -
Baseline            ✓         84.20   58.5       86.80
SCN                 ✓         87.03   60.23      88.01
Table 4: SCN on real-world FER datasets. The improvements of SCN suggest that these public datasets suffer from uncertainties to some extent.

SCN on Original FER datasets. We further conduct experiments on the original FER datasets to evaluate our SCN, since these datasets inevitably suffer from uncertainties such as ambiguous facial expressions, low-quality facial images, etc. Results are shown in Table 4. When training from scratch, our proposed SCN improves the baseline consistently with gains of 6.31%, 0.7%, and 1.02% on RAF-DB, AffectNet, and FERPlus, respectively. MetaCleaner also boosts the baseline on RAF-DB but is slightly worse than our SCN. With pretraining, we still obtain improvements of 2.83%, 1.73%, and 1.21% on these datasets. The improvements of SCN and MetaCleaner suggest that uncertainties indeed exist in those datasets. To validate this speculation, we rank the importance weights of RAF-DB, and show some examples with low importance weights in Figure 4. The ground-truth labels from top-left to bottom-right are surprise, neutral, neutral, sad, surprise, surprise, neutral, surprise, neutral, surprise. We find that images with low quality and occlusion are difficult to annotate and are more likely to have low importance weights in SCN.
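Selecting the examples with the lowest importance weights is a plain sort over the learned weights. The helper below is a small pure-Python sketch with a hypothetical name and made-up weights, not the procedure used to produce Figure 4.

```python
def lowest_weight_indices(alphas, k):
    """Return the indices of the k samples with the smallest importance weights."""
    order = sorted(range(len(alphas)), key=lambda i: alphas[i])  # ascending
    return order[:k]

# Made-up weights for four samples; samples 3 and 1 are the least trusted.
weights = [0.9, 0.1, 0.5, 0.05]
suspects = lowest_weight_indices(weights, k=2)
```

Inspecting the returned samples by hand is one way to check whether low weights coincide with occlusion, low image quality, or questionable labels.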

Weight  Rank  Relabel  RAF-DB  RAF-DB (pretrain)
✗       ✗     ✗        72.00   84.20
✗       ✗     ✓        71.25   83.78
✗       ✓     ✗        74.15   85.14
✓       ✗     ✗        76.26   86.09
✓       ✓     ✗        76.57   86.63
✓       ✓     ✓        78.31   87.03
Table 5: Evaluation of the three modules in SCN.
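Figure 5: Evaluation of the margins $\delta_1$ (rank regularization) and $\delta_2$ (relabeling), and the ratio $\beta$ of high-importance samples, on the RAF-DB dataset.
RR-Loss weight $\gamma$  0.2     0.3     0.5     0.6     0.8
Accuracy                 76.12%  76.35%  78.31%  76.57%  71.75%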
Table 6: Evaluation of the ratio between RR-Loss and WCE-Loss.
Method Acc.
DLP-CNN [22] 84.22
IPA2LT [43] 86.77
gaCNN [24] 85.07
RAN [42] 86.90
Our SCN (ResNet18) 87.03
Our SCN (ResNet18) 88.14
(a) Comparison on RAF-DB.
Method mean Acc.
Upsample [32] 47.00
Weighted loss [32] 58.00
IPA2LT [43] (7 cls) 55.71
RAN [42] 52.97
RAN [42] 59.5
Our SCN(ResNet18) 60.23
(b) Comparison on AffectNet.
Method Acc.
PLD [5] 85.1
ResNet+VGG [18] 87.4
SeNet50 [1] 88.8
RAN [42] 88.55
RAN-VGG16 [42] 89.16
Our SCN (ResNet18/IR50) 88.01/89.35
(c) Comparison on FERPlus
Table 7: Comparison to the state-of-the-art results. These results are trained using label distributions. Oversampling is used since AffectNet is imbalanced. RAF-DB and AffectNet are jointly used for training. Note that IPA2LT tests with 7 classes on AffectNet.

4.4 Ablation Studies

Evaluation of the three modules in SCN. To evaluate the effect of each module of SCN, we design an ablation study to investigate the WCE-Loss, RR-Loss, and relabeling modules on RAF-DB. We show the experimental results in Table 5, from which several observations can be drawn. First, for both training schemes, a naive relabeling module (the 2nd row) added to the baseline (1st row) slightly degrades performance. This may be because many relabeling operations made by the baseline model are wrong. It indirectly indicates that our relabeling within the low-importance group, guided by rank regularization, is more effective. Second, when adding a single module, we obtain the highest improvement with WCE-Loss, which improves the baseline from 72% to 76.26% on RAF-DB. This suggests that re-weighting is the module that contributes most to our SCN. Third, the RR-Loss and the relabeling module further boost WCE-Loss by 2.05% on RAF-DB (from 76.26% to 78.31%).

Evaluation of the ratio $\gamma$. In Table 6, we evaluate the effect of different ratios $\gamma$ between the RR-Loss and WCE-Loss. We find that setting an equal weight for each loss achieves the best results. Increasing the weight of RR-Loss from 0.5 to 0.8 dramatically degrades performance, which suggests that WCE-Loss is more important.

Evaluation of $\delta_1$ and $\delta_2$. $\delta_1$ is a margin parameter that controls the mean margin between the high- and low-importance groups. For the fixed setting, we evaluate it from 0 to 0.30. Figure 5 (left) shows the results for both fixed and learned $\delta_1$. The default $\delta_1$ = 0.15 obtains the best performance, which shows that the margin should be an appropriate value. We also design a learnable paradigm for $\delta_1$, and initialize it to 0.15. The learnable $\delta_1$ converges to and the performances are 77.76% and 69.45% on the original and noisy RAF-DB datasets, respectively.

$\delta_2$ is a margin that determines when to relabel a sample. The default $\delta_2$ is 0.2. We evaluate $\delta_2$ from 0 to 0.5 on the original RAF-DB, and show the results in Figure 5 (middle). $\delta_2 = 0$ means we relabel a sample whenever the maximum predicted probability is larger than the probability of the given label. A small $\delta_2$ leads to many incorrect relabeling operations, which may hurt performance significantly. A large $\delta_2$ leads to few relabeling operations, which converges to no relabeling. We get the best performance at 0.2.

Evaluation of the ratio $\beta$. $\beta$ is the ratio of high-importance samples in a mini-batch. We study different ratios from 0.9 to 0.5 on both the synthetic noisy and the original RAF-DB datasets. The results are shown in Figure 5 (right). Our default ratio of 0.7 achieves the best performance. A large $\beta$ degrades the ability of SCN since it assumes that little of the data is uncertain. A small $\beta$ leads to over-consideration of uncertainties, which decreases the training loss unreasonably.

4.5 Comparison to the State of the Art

Table 7 compares our method to several state-of-the-art methods on RAF-DB, AffectNet, and FERPlus. IPA2LT [43] introduces the latent ground-truth idea for training with inconsistent annotations across different FER datasets. gaCNN [24] leverages a patch-based attention network and a global network. RAN [42] utilizes face regions and the original face with a cascade attention network. gaCNN and RAN are time-consuming due to the cropped patches and regions. Our proposed SCN adds no cost at inference time. Our SCN outperforms these recent state-of-the-art methods with 88.14%, 60.23%, and 89.35% (with IR50 [9]) on RAF-DB, AffectNet, and FERPlus, respectively.

5 Conclusion

This paper presents a Self-Cure Network (SCN) to suppress the uncertainties of facial expression data and thereby learn robust features for FER. The SCN consists of three novel modules: self-attention importance weighting, ranking regularization, and relabeling. The first module learns a weight for each facial image with self-attention to capture the sample importance for training, and this weight is used for loss weighting. The ranking regularization ensures that the first module learns meaningful weights to highlight certain samples and suppress uncertain samples. The relabeling module attempts to identify mislabeled samples and modify their labels. Extensive experiments on three public datasets and our collected WebEmotion show that our SCN achieves state-of-the-art results and can handle both synthetic and real-world uncertainties effectively.

References

  1. S. Albanie, A. Nagrani, A. Vedaldi and A. Zisserman (2018) Emotion recognition in speech using cross-modal transfer in the wild. arXiv preprint arXiv:1808.05561. Cited by: 7(c).
  2. B. Amos, B. Ludwiczuk and M. Satyanarayanan (2016) OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: §2.1.
  3. S. Azadi, J. Feng, S. Jegelka and T. Darrell (2015) Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069. Cited by: §2.2.
  4. E. Barsoum, C. Zhang, C. Canton Ferrer and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM ICMI, Cited by: §1, §4.1.
  5. E. Barsoum, C. Zhang, C. C. Ferrer and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM ICMI, pp. 279–283. Cited by: 7(c).
  6. N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In CVPR, Cited by: §2.1.
  7. C. Darwin and P. Prodger (1998) The expression of the emotions in man and animals. Oxford University Press, USA. Cited by: §1.
  8. M. Dehghani, A. Severyn, S. Rothe and J. Kamps (2017) Avoiding your teacher’s mistakes: training neural networks with controlled weak supervision. arXiv preprint arXiv:1711.00313. Cited by: §2.2.
  9. J. Deng, J. Guo, N. Xue and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699. Cited by: §4.5.
  10. A. Dhall, R. Goecke, S. Lucey and T. Gedeon (2011) Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In ICCV, pp. 2106–2112. Cited by: §1.
  11. C. Fabian Benitez-Quiroz, R. Srinivasan and A. M. Martinez (2016) Emotionet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In CVPR, pp. 5562–5570. Cited by: §1.
  12. B. Fasel (2002) Robust face analysis using convolutional neural networks. In ICPR, pp. 40–43. Cited by: §2.1.
  13. J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §2.2.
  14. S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott and D. Huang (2018-09) CurriculumNet: weakly supervised learning from large-scale web images. In ECCV, Cited by: §4.2, Table 2, Table 4.
  15. Y. Guo, L. Zhang, Y. Hu, X. He and J. Gao (2016) MS-celeb-1m: A dataset and benchmark for large-scale face recognition. CoRR abs/1607.08221. Cited by: §4.2.
  16. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.5.
  17. W. Hu, Y. Huang, F. Zhang and R. Li (2019) Noise-tolerant paradigm for training face recognition cnns. In CVPR, pp. 11887–11896. Cited by: §3.2.
  18. C. Huang (2017) Combining convolutional neural networks for emotion recognition. In 2017 IEEE MIT Undergraduate Research Technology Conference (URTC), pp. 1–4. Cited by: 7(c).
  19. L. Jiang, D. Meng, S. Yu, Z. Lan, S. Shan and A. Hauptmann (2014) Self-paced learning with diversity. In NIPS, pp. 2078–2086. Cited by: §3.2.
  20. L. Jiang, Z. Zhou, T. Leung, L. Li and L. Fei-Fei (2017) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055. Cited by: §2.2, §3.2.
  21. S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, R. Memisevic, P. Vincent, A. Courville, Y. Bengio and R. C. Ferrari (2013) Combining modality specific deep neural networks for emotion recognition in video. In International Conference on Multimodal Interaction, pp. 543–550. Cited by: §2.1.
  22. S. Li, W. Deng and J. Du (2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, pp. 2852–2861. Cited by: §1, §4.1, 7(a).
  23. Y. Li, J. Zeng, S. Shan and X. Chen (2019-05) Occlusion aware facial expression recognition using cnn with attention mechanism. IEEE Transactions on Image Processing 28 (5), pp. 2439–2450. Cited by: §2.1.
  24. Y. Li, J. Zeng, S. Shan and X. Chen (2018) Occlusion aware facial expression recognition using cnn with attention mechanism. TIP 28 (5), pp. 2439–2450. Cited by: §4.5, 7(a).
  25. Y. Li, J. Yang, Y. Song, L. Cao, J. Luo and L. Li (2017) Learning from noisy labels with distillation. In ICCV, pp. 1910–1918. Cited by: §2.2.
  26. C. Liu and H. Wechsler (2002-04) Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11 (4), pp. 467–476. Cited by: §2.1.
  27. M. Liu, S. Li, S. Shan and X. Chen (2015) AU-inspired deep networks for facial expression feature learning. Neurocomputing 159 (C), pp. 126–136. Cited by: §2.1.
  28. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §3.2.
  29. P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In CVPRW, pp. 94–101. Cited by: §1.
  30. F. Ma, D. Meng, Q. Xie, Z. Li and X. Dong (2017) Self-paced co-training. In ICML, pp. 2275–2284. Cited by: §3.2.
  31. V. Mnih and G. E. Hinton (2012) Learning to label aerial images from noisy data. In ICML, pp. 567–574. Cited by: §2.2.
  32. A. Mollahosseini, B. Hasani and M. H. Mahoor (2017) Affectnet: a database for facial expression, valence, and arousal computing in the wild. TAC 10 (1), pp. 18–31. Cited by: §1, §4.1, 7(b).
  33. N. Natarajan, I. S. Dhillon, P. K. Ravikumar and A. Tewari (2013) Learning with noisy labels. In NIPS, pp. 1196–1204. Cited by: §2.2.
  34. P. C. Ng and S. Henikoff (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research 31 (13), pp. 3812–3814. Cited by: §2.1.
  35. C. Shan, S. Gong and P. W. McOwan (2009) Facial expression recognition based on local binary patterns: a comprehensive study. Image and Vision Computing 27 (6), pp. 803–816. Cited by: §2.1.
  36. S. Sukhbaatar and R. Fergus (2014) Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080. Cited by: §2.2.
  37. Y. Tang (2013) Deep learning using linear support vector machines. Computer Science. Cited by: §2.1.
  38. Y. Tian, T. Kanade and J. F. Cohn (2001) Recognizing action units for facial expression analysis. T-PAMI 23 (2), pp. 97–115. Cited by: §1.
  39. M. Valstar and M. Pantic (2010) Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, pp. 65. Cited by: §1.
  40. A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta and S. Belongie (2017-07) Learning from noisy large-scale datasets with minimal supervision. In CVPR, Cited by: §2.2.
  41. A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta and S. Belongie (2017) Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847. Cited by: §2.2.
  42. K. Wang, X. Peng, J. Yang, D. Meng and Y. Qiao (2019) Region attention networks for pose and occlusion robust facial expression recognition. arXiv preprint arXiv:1905.04075. Cited by: §2.1, §4.5, 7(a), 7(b), 7(c).
  43. J. Zeng, S. Shan and X. Chen (2018) Facial expression recognition with inconsistently annotated datasets. In ECCV, pp. 222–237. Cited by: §2.2, §4.5, 7(a), 7(b).
  44. K. Zhang, Z. Zhang, Z. Li and Y. Qiao (2016-10) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §2.1.
  45. K. Zhang, Z. Zhang, Z. Li and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §3.5.
  46. W. Zhang, Y. Wang and Y. Qiao (2019-06) MetaCleaner: learning to hallucinate clean representations for noisy-labeled visual recognition. In CVPR, Cited by: §4.2, Table 2, Table 4.
  47. G. Zhao, X. Huang, M. Taini, S. Z. Li and M. Pietikäinen (2011) Facial expression recognition from near-infrared videos. Image and Vision Computing 29 (9), pp. 607–619. Cited by: §1.