Suppressing Uncertainties for Large-Scale Facial Expression Recognition
Annotating a high-quality large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators. These uncertainties pose a key challenge for large-scale Facial Expression Recognition (FER) in the deep learning era. To address this problem, this paper proposes a simple yet efficient Self-Cure Network (SCN) which suppresses the uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over each mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of the samples in the lowest-ranked group. Experiments on synthetic FER datasets and our collected WebEmotion dataset validate the effectiveness of our method. Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus. The code will be available at https://github.com/kaiwang960112/Self-Cure-Network.
1 Introduction
Facial expression is one of the most natural, powerful and universal signals for human beings to convey their emotional states and intentions [7, 38]. Automatically recognizing facial expressions is also important for helping computers understand human behavior and interact with people. In the past decades, researchers have made significant progress on facial expression recognition (FER) with algorithms and large-scale datasets, where datasets can be collected in the laboratory or in the wild, such as CK+, MMI, Oulu-CASIA, SFEW/AFEW, FERPlus, AffectNet, EmotioNet, RAF-DB, etc.
However, for the large-scale FER datasets collected from the Internet, it is extremely difficult to annotate with high quality due to the uncertainties caused by the subjectiveness of annotators as well as ambiguous in-the-wild facial images. As illustrated in Figure 1, the uncertainties increase from high-quality and evident facial expressions to low-quality and micro expressions. These uncertainties usually lead to inconsistent labels and incorrect labels, which hinder the progress of large-scale Facial Expression Recognition (FER), especially for data-driven, deep-learning-based FER. Generally, training with uncertainties in FER may lead to the following problems. First, it may result in over-fitting on the uncertain samples, which may be mislabeled. Second, it hinders the model from learning useful facial expression features. Third, a high ratio of incorrect labels can even prevent the model from converging in the early stage of optimization.
To address these issues, we propose a simple yet efficient method, termed Self-Cure Network (SCN), to suppress the uncertainties for large-scale facial expression recognition. The SCN consists of three crucial modules: self-attention importance weighting, ranking regularization, and noise relabeling. Given a batch of images, a backbone CNN is first used to extract facial features. Then the self-attention importance weighting module learns a weight for each image to capture the sample importance for loss weighting. It is expected that uncertain facial images are assigned low importance weights. Further, the ranking regularization module ranks these weights in descending order, splits them into two groups (i.e., high importance weights and low importance weights), and regularizes the two groups by enforcing a margin between the average weights of the two groups. This regularization is implemented with a loss function, termed Rank Regularization loss (RR-Loss). The ranking regularization module ensures that the first module learns meaningful weights to highlight certain samples (e.g., reliable annotations) and to suppress uncertain samples (e.g., ambiguous annotations). The last module is a careful relabeling module that attempts to relabel the samples from the bottom group by comparing the maximum predicted probabilities to the probabilities of the given labels. A sample is assigned a pseudo label if the maximum predicted probability exceeds the probability of the given label by a margin threshold. In addition, since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN under extreme uncertainties.
Overall, our contributions can be summarized as follows,
We innovatively pose the uncertainty problem in facial expression recognition, and propose a Self-Cure Network to reduce the impact of uncertainties.
We elaborately design a rank regularization to supervise the SCN to learn meaningful importance weights, which also provides a reference for the relabeling module.
We extensively validate our SCN on synthetic FER data and a new real-world uncertain emotion dataset (WebEmotion) collected from the Internet. Our SCN also achieves 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus, which set new records on these benchmarks.
2 Related Work
2.1 Facial Expression Recognition
Generally, a FER system mainly consists of three stages, namely face detection, feature extraction, and expression recognition. In the face detection stage, several face detectors (e.g., MTCNN and Dlib) are used to locate faces in complex scenes. The detected faces can optionally be further aligned. For feature extraction, various methods are designed to capture facial geometry and appearance features caused by facial expressions. According to the feature type, they can be grouped into engineered features and learning-based features. The engineered features can be further divided into texture-based local features, geometry-based global features, and hybrid features. The texture-based features mainly include SIFT, HOG, histograms of LBP, Gabor wavelet coefficients, etc. The geometry-based global features are mainly based on the landmark points around noses, eyes, and mouths. Hybrid feature extraction combines two or more of the engineered features, which can further enrich the representation. For the learned features, Fasel finds that a shallow CNN is robust to face poses and scales. Tang and Kahou et al. utilize deep CNNs for feature extraction, and win the FER2013 and EmotiW 2013 challenges, respectively. Liu et al. propose a Facial Action Units based CNN architecture for expression recognition. Recently, both Li et al. and Wang et al. have designed region-based attention networks for pose and occlusion aware FER, where the regions are either cropped from landmark points or fixed positions.
2.2 Learning with Uncertainties
Uncertainties in the FER task mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e., noisy labels). In particular, learning with noisy labels is extensively studied in the computer vision community, while the other aspects are rarely explored. In order to handle noisy labels, one intuitive idea is to leverage a small set of clean data that can be used to assess the quality of the labels during the training process [41, 25, 8], to estimate the noise distribution, or to train the feature extractors. Li et al. propose a unified distillation framework using ‘side’ information from a small clean dataset and label relations in a knowledge graph, to ‘hedge the risk’ of learning from noisy labels. Veit et al. use a multi-task network that jointly learns to clean noisy annotations and to classify images. Azadi et al. select reliable images by an auxiliary image regularization for deep CNNs with noisy labels. Other methods do not need a small clean dataset but may assume extra constraints or distributions on the noisy samples, such as a specific loss for randomly flipped labels, regularizing the deep networks on corrupted labels by a MentorNet, and other approaches that model the noise with a softmax layer by connecting the latent correct labels to the noisy ones [13, 43]. For the FER task, Zeng et al. first consider the inconsistent annotation problem among different FER datasets, and propose to leverage these uncertainties to improve FER. In contrast, our work focuses on suppressing these uncertainties to learn better facial expression features.
3 Self-Cure Network
To learn robust facial expression features with uncertainties, we propose a simple yet efficient Self-Cure Network (SCN). In this section, we first provide an overview of the SCN, and then present its three modules. We finally present the detailed implementation of SCN.
3.1 Overview of Self-Cure Network
Our SCN is built upon traditional CNNs and consists of three crucial modules: i) self-attention importance weighting, ii) ranking regularization, and iii) relabeling, as shown in Figure 2.
Given a batch of face images with some uncertain samples, we first extract the deep features by a backbone network. The self-attention importance weighting module assigns an importance weight for each image using a fully-connected (FC) layer and the sigmoid function. These weights are multiplied by the logits for a sample re-weighting scheme. To explicitly reduce the importance of uncertain samples, a rank regularization module is further introduced to regularize the attention weights. In the rank regularization module, we first rank the learned attention weights and then split them into two groups, i.e. high and low importance groups. We then add a constraint between the mean weights of these groups by a margin-based loss, which is called rank regularization loss (RR-Loss). To further improve our SCN, the relabeling module is added to modify some of the uncertain samples in the low importance group. This relabeling operation aims to hunt more clean samples and then to enhance the final model. The whole SCN can be trained in an end-to-end manner and easily added into any CNN backbones.
3.2 Self-Attention Importance Weighting
We introduce the self-attention importance weighting module to capture the contributions of samples for training. It is expected that certain samples have high importance weights while uncertain ones have low importance weights. Let $\mathbf{F} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ denote the facial features of $N$ images. The self-attention importance weighting module takes $\mathbf{F}$ as input and outputs an importance weight for each feature. Specifically, the module is comprised of a linear fully-connected (FC) layer and a sigmoid activation function, which can be formulated as
$$\alpha_i = \sigma(\mathbf{W}_a^{\top} \mathbf{x}_i),$$
where $\alpha_i$ is the importance weight of the i-th sample, $\mathbf{W}_a$ denotes the parameters of the FC layer used for attention, and $\sigma$ is the sigmoid function. This module also provides a reference for the other two modules.
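The weighting step can be illustrated with a minimal sketch in plain Python (used here instead of PyTorch only to keep the example self-contained). It assumes each feature is a short list of floats and that the FC layer has a single output without a bias term; the names `importance_weights`, `features`, and `w_a` are illustrative and not taken from the released code.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def importance_weights(features, w_a):
    """One weight per feature vector: sigmoid of the dot product with w_a."""
    return [sigmoid(sum(w * x for w, x in zip(w_a, x_i))) for x_i in features]

# Three toy 2-D features and hypothetical FC parameters (assumed values).
features = [[2.0, 1.0], [0.0, 0.0], [-2.0, -1.0]]
w_a = [1.0, 1.0]
alphas = importance_weights(features, w_a)
```

Because of the sigmoid, every weight lies strictly between 0 and 1, which is the range the later rank regularization operates on.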
Logit-Weighted Cross-Entropy Loss. With the attention weights, we have two simple choices to perform loss weighting, inspired by prior work. The first choice is to multiply the weight of each sample by the sample loss. In our case, since the weights are optimized in an end-to-end manner and are learned from the CNN features, they would collapse to zero, as this trivial solution yields zero loss. MentorNet and other self-paced learning methods [19, 30] solve this problem by alternating minimization, i.e., optimizing one while the other is held fixed. In this paper, we choose the logit-weighted variant, which has been shown to be more efficient. For a multi-class Cross-Entropy loss, we call our weighted loss the Logit-Weighted Cross-Entropy loss (WCE-Loss), which is formulated as
$$\mathcal{L}_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_i \mathbf{W}_{y_i}^{\top} \mathbf{x}_i}}{\sum_{j=1}^{C} e^{\alpha_i \mathbf{W}_j^{\top} \mathbf{x}_i}},$$
where $\mathbf{W}_j$ is the j-th classifier, $y_i$ is the given label of the i-th sample, and $C$ is the number of classes. As suggested in prior work, $\mathcal{L}_{WCE}$ has a positive correlation with $\alpha$.
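A small numerical sketch may help show why logit weighting avoids the all-zero solution. It assumes the classifier outputs (logits) are already computed, and scales them by the sample weight before a standard softmax cross-entropy; the function name `logit_weighted_ce` and the toy logits are hypothetical.

```python
import math

def logit_weighted_ce(logits, labels, alphas):
    """Mean cross-entropy where each sample's logits are scaled by its weight."""
    total = 0.0
    for z, y, a in zip(logits, labels, alphas):
        scaled = [a * v for v in z]
        m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_sum - scaled[y]  # negative log-probability of the label
    return total / len(logits)

logits = [[5.0, 1.0, 0.0]]  # toy logits that favor class 0 (assumed values)

# Label agrees with the prediction: a larger weight gives a smaller loss.
agree_high = logit_weighted_ce(logits, [0], [1.0])
agree_low = logit_weighted_ce(logits, [0], [0.1])

# Label disagrees with the prediction: a smaller weight gives a smaller loss.
clash_high = logit_weighted_ce(logits, [1], [1.0])
clash_low = logit_weighted_ce(logits, [1], [0.1])
```

In this toy case a zero weight gives a loss of log(3) rather than zero, so driving all weights to zero is no longer a trivial minimizer.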
3.3 Rank Regularization
The self-attention weights in the above module can be arbitrary in (0, 1). To explicitly constrain the importance of uncertain samples, we design a rank regularization module to regularize the attention weights. In the rank regularization module, we first rank the learned attention weights in descending order and then split them into two groups with a ratio $\beta$. The rank regularization ensures that the mean attention weight of the high-importance group is higher than that of the low-importance group by a margin. Formally, we define a rank regularization loss (RR-Loss) for this purpose as follows,
$$\mathcal{L}_{RR} = \max\{0, \delta_1 - (\alpha_H - \alpha_L)\},$$
with
$$\alpha_H = \frac{1}{M} \sum_{i=1}^{M} \alpha_i, \qquad \alpha_L = \frac{1}{N - M} \sum_{i=M+1}^{N} \alpha_i,$$
where the weights $\alpha_i$ are indexed in descending order, $\delta_1$ is a margin which can be a fixed hyper-parameter or a learnable parameter, and $\alpha_H$ and $\alpha_L$ are the mean values of the high-importance group with $M = \beta N$ samples and the low-importance group with $N - M$ samples, respectively. In training, the total loss function is $\mathcal{L}_{all} = \gamma \mathcal{L}_{RR} + (1 - \gamma) \mathcal{L}_{WCE}$, where $\gamma$ is a trade-off ratio.
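The ranking, splitting, and margin steps can be sketched as follows. This is an illustrative plain-Python version rather than the authors' implementation: it treats the margin as a fixed value, rounds the group size to the nearest integer, and keeps at least one sample in each group, all of which are assumptions. The helper `total_loss` combines the two losses with a trade-off ratio in the way described above.

```python
def rank_regularization_loss(alphas, beta=0.7, margin=0.15):
    """Hinge loss on the gap between mean weights of the two ranked groups."""
    ranked = sorted(alphas, reverse=True)  # descending importance
    n = len(ranked)
    m = min(max(round(beta * n), 1), n - 1)  # size of the high-importance group
    mean_high = sum(ranked[:m]) / m
    mean_low = sum(ranked[m:]) / (n - m)
    return max(0.0, margin - (mean_high - mean_low))

def total_loss(rr_loss, wce_loss, gamma=0.5):
    """Weighted sum of the two losses with trade-off ratio gamma."""
    return gamma * rr_loss + (1.0 - gamma) * wce_loss

separated = rank_regularization_loss([0.9] * 7 + [0.2] * 3)  # groups far apart
collapsed = rank_regularization_loss([0.5] * 10)             # no gap at all
```

When the two groups are already separated by more than the margin the loss is zero, and when all weights are equal the loss equals the margin, which pushes the two group means apart.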
3.4 Relabeling
In the rank regularization module, each mini-batch is divided into two groups, i.e., the high-importance and the low-importance groups. We experimentally find that the uncertain samples usually have low importance weights; thus, an intuitive idea is to design a strategy to relabel these samples. The main challenge in modifying these annotations is to know which annotations are incorrect.
Specifically, our relabeling module only considers the samples in the low-importance group and is performed on the Softmax probabilities. For each sample, we compare the maximum predicted probability to the probability of the given label. A sample is assigned a new pseudo label if the maximum predicted probability exceeds the probability of the given label by a threshold. Formally, the relabeling module can be defined as
$$y' = \begin{cases} l_{max} & \text{if } P_{max} - P_{gtInd} > \delta_2, \\ l_{org} & \text{otherwise}, \end{cases}$$
where $y'$ denotes the new label, $\delta_2$ is a threshold, $P_{max}$ is the maximum predicted probability, and $P_{gtInd}$ is the predicted probability of the given label. $l_{org}$ and $l_{max}$ are the original given label and the index of the maximum prediction, respectively.
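The relabeling rule for a single sample can be sketched as below. The function name `relabel` and the toy probability vectors are illustrative; the default threshold of 0.2 follows the default relabeling margin reported in the implementation details, and the rule is meant to be applied only to samples in the low-importance group.

```python
def relabel(probs, given_label, threshold=0.2):
    """Return the arg-max class if it beats the given label's probability
    by more than the threshold; otherwise keep the given label."""
    max_index = max(range(len(probs)), key=probs.__getitem__)
    if probs[max_index] - probs[given_label] > threshold:
        return max_index
    return given_label

flipped = relabel([0.10, 0.70, 0.20], given_label=0)  # gap 0.60 exceeds 0.2
kept = relabel([0.40, 0.45, 0.15], given_label=0)     # gap 0.05 is too small
same = relabel([0.10, 0.70, 0.20], given_label=1)     # label is already arg-max
```

The threshold keeps labels unchanged when the model is only marginally more confident in another class, which limits wrong relabeling early in training.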
In our system, uncertain samples are expected to obtain low importance weights, which reduces their negative impact through re-weighting; they then fall into the low-importance group and may finally be corrected into certain samples by relabeling. Those corrected samples may obtain high importance weights in the next epoch. We expect that the network can cure itself with either re-weighting or relabeling, which is why we call our method Self-Cure Network.
3.5 Implementation
Pre-processing and facial features. In our SCN, face images are detected and aligned by MTCNN and further resized to 224 × 224 pixels. The SCN is implemented with the PyTorch toolbox and the backbone network is ResNet-18. By default, the ResNet-18 is pre-trained on the MS-Celeb-1M face recognition dataset and the facial features are extracted from its last pooling layer.
Training. We train our SCN in an end-to-end manner with 8 Nvidia Titan 2080ti GPUs, and set the batch size to 1024. In each iteration, the training images are divided into two groups, with 70% high-importance samples and 30% low-importance samples by default. The margin between the mean values of the high- and low-importance groups can be either set to 0.15 by default or designed as a learnable parameter. Both strategies will be evaluated in the ensuing Experiments. The whole network is jointly optimized with RR-Loss and WCE-Loss. The ratio of the two losses is empirically set to 1:1, and its influence will be studied in the ensuing ablation study of Experiments. The learning rate is initialized as 0.1 and is divided by 10 after 15 epochs and again after 30 epochs. The training stops at 40 epochs. The relabeling module is included for optimization from the 10th epoch, where the relabeling margin is set to 0.2 by default.
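The step schedule described above can be written down as a short sketch. It assumes a zero-based epoch counter, that the decay takes effect from epochs 15 and 30 onward, and that relabeling is active from epoch index 10; the exact indexing convention is not stated in the text, so these are assumptions, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(15, 30), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr

def relabeling_enabled(epoch, start_epoch=10):
    """Relabeling is switched on once training reaches `start_epoch`."""
    return epoch >= start_epoch

# Rate and relabeling flag for each of the 40 training epochs.
schedule = [(e, learning_rate(e), relabeling_enabled(e)) for e in range(40)]
```

In PyTorch, the same decay pattern is typically expressed with `torch.optim.lr_scheduler.MultiStepLR` using milestones 15 and 30 and a decay factor of 0.1.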
4 Experiments
In this section, we first describe three public datasets and our WebEmotion dataset. We then demonstrate the robustness of our SCN under uncertainties of both synthetic and real-world noisy facial expression annotations. Further, we conduct ablation studies with qualitative and quantitative results to show the effectiveness of each module in SCN. Finally, we compare our SCN to the state-of-the-art methods on public datasets.
4.1 Datasets
RAF-DB contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In our experiment, only images with the six basic expressions (happiness, surprise, sadness, anger, disgust, fear) and the neutral expression are used, which leads to 12,271 images for training and 3,068 images for testing. The overall sample accuracy is used for measurement.
FERPlus is extended from FER2013 as used in the ICML 2013 Challenges. It is a large-scale dataset collected by the Google search engine. It consists of 28,709 training images, 3,589 validation images and 3,589 test images, all of which are resized to 48 × 48 pixels. Contempt is included, which leads to 8 classes in this dataset. The overall sample accuracy is used for measurement.
AffectNet  is by far the largest dataset that provides both categorical and Valence-Arousal annotations. It contains more than one million images from the Internet by querying expression-related keywords in three search engines, of which 450,000 images are manually annotated with eight expression labels as in FERPlus. It has imbalanced training and test sets as well as a balanced validation set. The mean class accuracy on the validation set is used for measurement.
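The two evaluation measures named in these dataset descriptions differ when classes are imbalanced. The sketch below shows both on a toy prediction list; it takes "mean class accuracy" to be the unweighted average of per-class accuracies, which is the usual reading but an assumption here, and the function names are illustrative.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_class_accuracy(y_true, y_pred):
    """Unweighted average of the accuracy computed within each true class."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)

# Toy imbalanced case: four samples of class 0, one of class 1,
# and a model that always predicts class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
overall = overall_accuracy(y_true, y_pred)
per_class = mean_class_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is 0.8 while the mean class accuracy is 0.5, which illustrates how the latter penalizes ignoring a minority class.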
The collected WebEmotion. Since the main evidence of uncertainties is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the effect of SCN under extreme uncertainties. WebEmotion is a video dataset (though we use it as image data by assigning labels to frames) downloaded from YouTube with a set of keywords including 40 emotion-related words, 45 countries from Asia, Europe, Africa, America, and 6 age-related words (i.e., baby, lady, woman, man, old man, old woman). It consists of the same 8 classes as FERPlus, where each class is connected to several emotion-related keywords, e.g., Happy is connected to the keywords happy, funny, ecstatic, smug, and kawaii. To obtain meaningful correlation between the keywords and the searched videos, only the top 20 crawled videos shorter than 4 minutes are selected. This leads to around 41,000 videos, which are further segmented into 200,000 video clips with the constraint that a face (detected by MTCNN) appears for at least 5 seconds. For evaluation, we only use WebEmotion for pretraining since annotating it is extremely difficult. Table 1 shows the statistics of WebEmotion. The meta videos and video clips will be made public to the research community.
4.2 Evaluation of SCN on Synthetic Uncertainties
The uncertainties of FER mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e., noisy labels). Considering that only noisy labels can be analyzed quantitatively, we explore the robustness of SCN with three levels of label noise, with ratios of 10%, 20%, and 30%, on the RAF-DB, FERPlus, and AffectNet datasets. Specifically, we randomly choose 10%, 20%, and 30% of the training data for each category and randomly change their labels to others. In Table 2, we use ResNet-18 as the CNN backbone and compare our SCN to the baseline (traditional CNN training without considering label noise) with two training schemes: i) training from scratch and ii) fine-tuning with a model pretrained on MS-Celeb-1M. We also compare our SCN to two state-of-the-art noise-tolerant methods on RAF-DB, namely CurriculumNet and MetaCleaner.
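The per-category label corruption described above can be sketched as follows. This is an illustrative version under stated assumptions: labels are integers in `range(num_classes)`, the number of corrupted samples per class is rounded to the nearest integer, and the replacement label is drawn uniformly from the other classes. The function name `inject_label_noise` and the seed handling are hypothetical.

```python
import random

def inject_label_noise(labels, num_classes, noise_ratio, seed=0):
    """Flip a fixed fraction of labels within each class to a different class."""
    rng = random.Random(seed)  # local generator for reproducibility
    noisy = list(labels)
    for c in range(num_classes):
        indices = [i for i, y in enumerate(labels) if y == c]
        k = round(noise_ratio * len(indices))  # samples to corrupt in class c
        for i in rng.sample(indices, k):
            others = [o for o in range(num_classes) if o != c]
            noisy[i] = rng.choice(others)
    return noisy

clean = [c for c in range(7) for _ in range(10)]  # 7 classes, 10 samples each
noisy = inject_label_noise(clean, num_classes=7, noise_ratio=0.3)
num_flipped = sum(a != b for a, b in zip(clean, noisy))
```

Because the replacement is always drawn from the other classes, every selected sample really receives a wrong label, so the realized noise ratio matches the requested one up to rounding.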
As shown in Table 2, our SCN consistently improves the baseline by a large margin. For scheme i) with noise ratio 30%, our SCN outperforms the baseline by 13.80%, 1.07%, and 1.91% on RAF-DB, FERPlus, and AffectNet, respectively. For scheme ii) with noise ratio 30%, our SCN still gains improvements of 2.20%, 2.47%, and 3.12% on these datasets, even though the baseline performance is already relatively high. For both schemes, the benefit from SCN becomes more obvious as the noise ratio increases. CurriculumNet designs a training curriculum by measuring data complexity using cluster density, which can avoid training on noisy-labeled data in early stages. MetaCleaner aggregates the features of several samples in each class into a weighted mean feature for classification, which can also weaken the noisy-labeled samples. Both CurriculumNet and MetaCleaner improve the baseline considerably but are still inferior to the SCN, which is simpler. Another interesting finding is that the improvement of SCN on RAF-DB is much higher than on other datasets. It may be explained by the following reasons. On the one hand, RAF-DB consists of compound facial expressions and is annotated by 40 people with crowdsourcing, which makes the data annotations more inconsistent. Thus, our SCN may also gain improvement on the original RAF-DB without synthetic label noise. On the other hand, AffectNet and FERPlus are annotated by experts, so fewer inconsistent labels are involved, leading to smaller improvements than on RAF-DB.
Visualization of $\alpha$ in SCN. To further investigate the effectiveness of our SCN under noisy annotations, we visualize the importance weight $\alpha$ during the training phase of SCN on RAF-DB with noise ratio 10%. In Figure 3, the first row indicates the importance weights when SCN is trained with the original labels. The images in the second row are annotated with synthetically corrupted labels, and we use SCN (without the Relabel module) to train on the synthetic noisy dataset. Indeed, the SCN regards those label-corrupted images as noise and automatically suppresses their weights. After sufficient training epochs, the relabeling module is added into SCN, and these noisy-labeled images are relabeled (of course, many others may not be relabeled because of the relabeling constraint). After several more epochs, their importance weights become high (the 3rd row), which demonstrates that our SCN can ‘self-cure’ the corrupted labels. It is worth noting that the new labels from the relabeling module may be inconsistent with the “ground-truth” labels (see the 1st, 4th, and 6th columns), but they also appear visually reasonable.
4.3 Exploring SCN on Real-World Uncertainties
Synthetic noisy data proves the effectiveness of the ‘self-curing’ ability of SCN. In this section, we apply our SCN to real-world FER datasets which can include all types of uncertainties.
SCN on WebEmotion for pretraining. Our collected WebEmotion dataset contains massive noise since the search keywords are regarded as labels. To better validate the effect of SCN on real-world noisy data, we apply our SCN to WebEmotion for pretraining and then fine-tune the model on the target datasets. We show the comparison experiments in Table 3. From the 1st and the 2nd rows, we can see that pretraining on WebEmotion without SCN improves the baseline by 6.97%, 9.85%, and 1.80% on RAF-DB, FERPlus and AffectNet, respectively. Fine-tuning with SCN on the target datasets obtains gains ranging from 1% to 2%. Pretraining on WebEmotion with SCN further boosts the performance from 80.42% to 82.45% on RAF-DB. This suggests that SCN learns robust features on WebEmotion, which are better for further fine-tuning.
SCN on Original FER datasets. We further conduct experiments on the original FER datasets to evaluate our SCN, since these datasets inevitably suffer from uncertainties such as ambiguous facial expressions, low-quality facial images, etc. Results are shown in Table 4. When training from scratch, our proposed SCN improves the baseline consistently with gains of 6.31%, 0.7%, and 1.02% on RAF-DB, AffectNet, and FERPlus, respectively. MetaCleaner also boosts the baseline on RAF-DB but is slightly worse than our SCN. With pretraining, we still obtain improvements of 2.83%, 1.73%, and 1.21% on these datasets. The improvement of SCN and MetaCleaner suggests that uncertainties indeed exist in those datasets. To validate our speculation, we rank the importance weights of RAF-DB, and show some examples with low importance weights in Figure 4. The ground-truth labels from top-left to bottom-right are surprise, neutral, neutral, sad, surprise, surprise, neutral, surprise, neutral, surprise. We find that images with low quality and occlusion are difficult to annotate and are more likely to have low importance weights in SCN.
4.4 Ablation Studies
Evaluation of the three modules in SCN. To evaluate the effect of each module of SCN, we design an ablation study to investigate the WCE-Loss, RR-Loss and Relabel modules on RAF-DB. We show the experimental results in Table 5. Several observations can be made. First, for both training schemes, a naive relabeling module (the 2nd row) added into the baseline (1st row) can degrade performance slightly. This may be explained by the fact that many relabeling operations made by the baseline model are wrong. It indirectly indicates that our relabeling, which is restricted to the low-importance group obtained with rank regularization, is more effective. Second, when adding one module, we obtain the highest improvement with WCE-Loss, which improves the baseline from 72% to 76.26% on RAF-DB. This suggests that re-weighting is the module that contributes most to our SCN. Third, the RR-Loss and the relabeling module can further boost WCE-Loss by 2.15% on RAF-DB.
Evaluation of the ratio $\gamma$. In Table 6, we evaluate the effect of different trade-off ratios $\gamma$ between the RR-Loss and WCE-Loss. We find that setting equal weight for each loss achieves the best results. Increasing the weight of RR-Loss from 0.5 to 0.8 dramatically degrades performance, which suggests that WCE-Loss is more important.
Evaluation of $\delta_1$ and $\delta_2$. $\delta_1$ is a margin parameter that controls the mean margin between the high- and low-importance groups. For the fixed setting, we evaluate it from 0 to 0.30. Figure 5 (left) shows the results for both fixed and learned $\delta_1$. The default $\delta_1$ = 0.15 obtains the best performance, which shows that the margin should be an appropriate value. We also design a learnable paradigm for $\delta_1$, and initialize it to 0.15. The learnable $\delta_1$ converges during training, and the resulting performances are 77.76% and 69.45% on the original and noisy RAF-DB datasets, respectively.
$\delta_2$ is a margin that determines when to relabel a sample. The default $\delta_2$ is 0.2. We evaluate $\delta_2$ from 0 to 0.5 on the original RAF-DB, and show the results in Figure 5 (middle). $\delta_2$ = 0 means we relabel a sample whenever the maximum predicted probability is larger than the probability of the given label. A small $\delta_2$ leads to many incorrect relabeling operations, which may hurt performance significantly. A large $\delta_2$ leads to few relabeling operations, which converges to no relabeling. We get the best performance at $\delta_2$ = 0.2.
Evaluation of $\beta$. $\beta$ is the ratio of high-importance samples in a mini-batch. We study different ratios from 0.9 to 0.5 on both the synthetic noisy and original RAF-DB datasets. The results are shown in Figure 5 (right). Our default ratio of 0.7 achieves the best performance. A large $\beta$ degrades the ability of SCN since it treats only a small part of the data as uncertain. A small $\beta$ leads to over-consideration of uncertainties, which decreases the training loss unreasonably.
4.5 Comparison to the State of the Art
Table 7 compares our method to several state-of-the-art methods on RAF-DB, AffectNet, and FERPlus. IPA2LT introduces the latent ground-truth idea for training with inconsistent annotations across different FER datasets. gaCNN leverages a patch-based attention network and a global network. RAN utilizes face regions and the original face with a cascade attention network. gaCNN and RAN are time-consuming due to the cropped patches and regions. Our proposed SCN does not add any cost at inference. Our SCN outperforms these recent state-of-the-art methods with 88.14%, 60.23%, and 89.35% (with IR50) on RAF-DB, AffectNet, and FERPlus, respectively.
5 Conclusion
This paper presents a Self-Cure Network (SCN) to suppress the uncertainties of facial expression data and thus learn robust features for FER. The SCN consists of three novel modules including self-attention importance weighting, ranking regularization, and relabeling. The first module learns a weight for each facial image with self-attention to capture the sample importance for training and is used for loss weighting. The ranking regularization ensures that the first module learns meaningful weights to highlight certain samples and suppress uncertain samples. The relabeling module attempts to identify mislabeled samples and modify their labels. Extensive experiments on three public datasets and our collected WebEmotion show that our SCN achieves state-of-the-art results and can handle both synthetic and real-world uncertainties effectively.
References
- (2018) Emotion recognition in speech using cross-modal transfer in the wild. arXiv preprint arXiv:1808.05561. Cited by: 7(c).
- (2016) OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: §2.1.
- (2015) Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint:1511.07069. Cited by: §2.2.
- (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM ICMI, Cited by: §1, §4.1.
- (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM ICMI, pp. 279–283. Cited by: 7(c).
- (2005) Histograms of oriented gradients for human detection. In CVPR, Cited by: §2.1.
- (1998) The expression of the emotions in man and animals. Oxford University Press, USA. Cited by: §1.
- (2017) Avoiding your teacher’s mistakes: training neural networks with controlled weak supervision. arXiv preprint 1711.00313. Cited by: §2.2.
- (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699. Cited by: §4.5.
- (2011) Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark. In ICCV, pp. 2106–2112. Cited by: §1.
- (2016) Emotionet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In CVPR, pp. 5562–5570. Cited by: §1.
- (2002) Robust face analysis using convolutional neural networks. In ICPR, pp. 40–43. Cited by: §2.1.
- (2016) Training deep neural-networks using a noise adaptation layer. Cited by: §2.2.
- (2018-09) CurriculumNet: weakly supervised learning from large-scale web images. In ECCV, Cited by: §4.2, Table 2, Table 4.
- (2016) MS-celeb-1m: A dataset and benchmark for large-scale face recognition. CoRR abs/1607.08221. External Links: Cited by: §4.2.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.5.
- (2019) Noise-tolerant paradigm for training face recognition cnns. In CVPR, pp. 11887–11896. Cited by: §3.2.
- (2017) Combining convolutional neural networks for emotion recognition. In 2017 IEEE MIT Undergraduate Research Technology Conference (URTC), pp. 1–4. Cited by: 7(c).
- (2014) Self-paced learning with diversity. In NIPS, pp. 2078–2086. Cited by: §3.2.
- (2017) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint:1712.05055. Cited by: §2.2, §3.2.
- (2013) Combining modality specific deep neural networks for emotion recognition in video. In International Conference on Multimodal Interaction, pp. 543–550. Cited by: §2.1.
- (2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR, pp. 2852–2861. Cited by: §1, §4.1, 7(a).
- (2019-05) Occlusion aware facial expression recognition using cnn with attention mechanism. IEEE Transactions on Image Processing 28 (5), pp. 2439–2450. External Links: Cited by: §2.1.
- (2018) Occlusion aware facial expression recognition using cnn with attention mechanism. TIP 28 (5), pp. 2439–2450. Cited by: §4.5, 7(a).
- (2017) Learning from noisy labels with distillation. In ICCV, pp. 1910–1918. Cited by: §2.2.
- (2002-04) Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11 (4), pp. 467–476. External Links: Cited by: §2.1.
- (2015) AU-inspired deep networks for facial expression feature learning. Neurocomputing 159 (C), pp. 126–136. Cited by: §2.1.
- (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §3.2.
- (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In CVPRW, pp. 94–101. Cited by: §1.
- (2017) Self-paced co-training. In ICML, pp. 2275–2284. Cited by: §3.2.
- (2012) Learning to label aerial images from noisy data. In ICML, pp. 567–574. Cited by: §2.2.
- (2017) Affectnet: a database for facial expression, valence, and arousal computing in the wild. TAC 10 (1), pp. 18–31. Cited by: §1, §4.1, 7(b).
- (2013) Learning with noisy labels. In NIPS, pp. 1196–1204. Cited by: §2.2.
- (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research 31 (13), pp. 3812–3814. External Links: Cited by: §2.1.
- (2009) Facial expression recognition based on local binary patterns: a comprehensive study. Image and Vision Computing 27 (6), pp. 803 – 816. External Links: Cited by: §2.1.
- (2014) Learning from noisy labels with deep neural networks. arXiv preprint:1406.2080 2 (3), pp. 4. Cited by: §2.2.
- (2013) Deep learning using linear support vector machines. Computer Science. Cited by: §2.1.
- (2001) Recognizing action units for facial expression analysis. T-PAMI 23 (2), pp. 97–115. Cited by: §1.
- (2010) Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, pp. 65. Cited by: §1.
- (2017-07) Learning from noisy large-scale datasets with minimal supervision. In CVPR, Cited by: §2.2.
- (2017) Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847. Cited by: §2.2.
- (2019) Region attention networks for pose and occlusion robust facial expression recognition. arXiv preprint:1905.04075. Cited by: §2.1, §4.5, 7(a), 7(b), 7(c).
- (2018) Facial expression recognition with inconsistently annotated datasets. In ECCV, pp. 222–237. Cited by: §2.2, §4.5, 7(a), 7(b).
- (2016-10) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. External Links: Cited by: §2.1.
- (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letter 23 (10), pp. 1499–1503. Cited by: §3.5.
- (2019-06) MetaCleaner: learning to hallucinate clean representations for noisy-labeled visual recognition. In CVPR, Cited by: §4.2, Table 2, Table 4.
- (2011) Facial expression recognition from near-infrared videos. Image and Vision Computing 29 (9), pp. 607–619. Cited by: §1.