Local Learning with Deep and Handcrafted Features for Facial Expression Recognition
We present an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition. In order to obtain automatic features, we experiment with multiple CNN architectures, pre-trained models and training procedures, e.g. Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest neighbors model is applied for selecting the nearest training samples for an input test image. Second, a one-versus-all Support Vector Machines (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used for predicting the class label only for the test image it was trained for. Although local learning has been used before in combination with handcrafted features, to the best of our knowledge, it has never been employed in combination with deep features. The experiments on the 2013 Facial Expression Recognition (FER) Challenge data set and the FER+ data set demonstrate that our approach achieves state-of-the-art results. With a top accuracy of on the FER 2013 data set and on the FER+ data set, we surpass all competition by nearly on both data sets.
Mariana-Iuliana Georgescu (email@example.com), Radu Tudor Ionescu (firstname.lastname@example.org), Marius Popescu (email@example.com)
Department of Computer Science, University of Bucharest, 14 Academiei, Bucharest, Romania
SecurifAI, 24 Mircea Vodă, Bucharest, Romania
1 Introduction
Automatic facial expression recognition is an active research topic in computer vision, with many applications including human behavior understanding, detection of mental disorders and human-computer interaction. In the past few years, most works [? ? ? ? ? ? ? ? ? ? ? ? ? ] have focused on building and training deep neural networks in order to achieve state-of-the-art results. Engineered models based on handcrafted features [? ? ? ? ] have drawn very little attention, since such models usually yield less accurate results than deep learning models. In this paper, we show that we can surpass the current state-of-the-art systems by combining automatic features learned by convolutional neural networks (CNN) with handcrafted features computed by the bag-of-visual-words (BOVW) model, especially when we employ local learning in the training phase. In order to obtain automatic features, we experiment with multiple CNN architectures, such as VGG-face [? ], VGG-f [? ] and VGG-13 [? ], some of which are pre-trained on other computer vision tasks such as object class recognition [? ] or face recognition [? ]. We also fine-tune these CNN models using standard training procedures as well as Dense-Sparse-Dense (DSD) training [? ]. To our knowledge, we are the first to successfully apply DSD to train CNN models for facial expression recognition. In order to obtain handcrafted features, we use a standard BOVW model based on a variant of dense Scale-Invariant Feature Transform (SIFT) features [? ] extracted at multiple scales, known as Pyramid Histogram of Visual Words (PHOW) [? ]. We use the automatic and handcrafted features both independently and together. For the independent models, we use either softmax (for the fine-tuned CNN models) or Support Vector Machines (SVM) based on the one-versus-all scheme.
The one-versus-all SVM is used both as a global learning method (trained on all training samples) and as a local learning method (trained on a subset of training samples, selected specifically for each test sample using a nearest neighbors scheme). We combine the automatic and handcrafted features by concatenating the corresponding feature vectors before the learning stage. For the combined models, we explore only the global and local SVM alternatives. We perform a thorough experimental study on the 2013 Facial Expression Recognition (FER) Challenge data set [? ] and the FER+ data set [? ], comparing various deep, engineered and combined models. Our best results are obtained when automatic and handcrafted features are combined and local SVM is employed in the learning phase. With a top accuracy of on the FER 2013 data set, we surpass the state-of-the-art accuracy [? ] by . We also surpass the best method [? ] on the FER+ data set by , reaching the best accuracy of . Although automatic and handcrafted features have been combined before in the context of facial expression recognition [? ? ], we provide a more extensive evaluation that includes various CNN architectures, and we employ a local learning strategy that leads to superior results. To the best of our knowledge, local learning has been used only once for facial expression recognition, and only in combination with the BOVW model [? ]. We are the first to combine local learning with automatic features learned by deep CNN models. Compared to the best accuracy of Ionescu et al. [? ], which is , we report an improvement of almost . In summary, our contributions consist of successfully training CNN models for facial expression recognition using Dense-Sparse-Dense training [? ], successfully combining automatic and handcrafted features with local learning, conducting an extensive empirical evaluation of various deep, engineered and combined facial expression recognition models, and reporting state-of-the-art results on two benchmark data sets.
The rest of the paper is organized as follows. We present recent related works in Section 2. We describe the automatic and handcrafted features, as well as the learning methods, in Section 3. We present the experiments on facial expression recognition in Section 4. Finally, we draw our conclusions in Section 5.
2 Related Work
The early works on facial expression recognition are mostly based on handcrafted features [? ]. After the success of the AlexNet [? ] deep neural network in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [? ], deep learning has been widely adopted in the computer vision community. Perhaps some of the first works to propose deep learning approaches for facial expression recognition were presented at the 2013 Facial Expression Recognition (FER) Challenge [? ]. Interestingly, the top-scoring system in the 2013 FER Challenge is a deep convolutional neural network [? ], while the best handcrafted model ranked only fourth [? ]. With only a few exceptions [? ? ? ], most of the recent works on facial expression recognition are based on deep learning [? ? ? ? ? ? ? ? ? ? ? ? ]. Some of these recent works [? ? ? ? ] propose to train an ensemble of convolutional neural networks for improved performance, while others [? ? ] combine deep features with handcrafted features such as SIFT [? ] or Histograms of Oriented Gradients (HOG) [? ]. While most works study facial expression recognition from static images, some works approach facial expression recognition in video [? ? ]. Hasani et al. [? ] propose a network architecture that consists of 3D convolutional layers followed by a Long Short-Term Memory (LSTM) unit, which together extract the spatial relations within facial images and the temporal relations between different frames in the video. Different from other approaches, Meng et al. [? ] and Liu et al. [? ] present identity-aware facial expression recognition models. Meng et al. [? ] propose to jointly estimate expression and identity features through a neural architecture composed of two identical CNN streams, in order to alleviate inter-subject variations introduced by personal attributes and to achieve better facial expression recognition performance. Liu et al. [? ] employ deep metric learning and jointly optimize a deep metric loss and the softmax loss. They obtain an identity-invariant model by using an identity-aware hard-negative mining and online positive mining scheme. Li et al. [? ] train a CNN model using a modified back-propagation algorithm that creates a locality-preserving loss, aiming to pull the locally neighboring faces of the same class together. Closer to our work are methods [? ? ] that combine deep and handcrafted features or that employ local learning [? ] for facial expression recognition. While Ionescu et al. [? ] use local learning to improve the performance of a handcrafted model, we show that local learning can also improve performance when deep features are used alone or in combination with handcrafted features. Remarkably, our top accuracy is almost better than the accuracy reported in [? ]. Works that combine deep and handcrafted features usually employ a single CNN model and various handcrafted features, e.g. Connie et al. [? ] employ SIFT and dense SIFT [? ], while Kaya et al. [? ] employ SIFT, HOG and Local Gabor Binary Patterns (LGBP). On the other hand, we employ a single type of handcrafted features and include various CNN architectures in the combination. Another important difference from the works [? ? ] that combine deep and handcrafted features is that we employ local learning in the training stage. With these key changes, the empirical results indicate that our approach achieves better performance than the approach of Connie et al. [? ]. We do not compare with Kaya et al. [? ], since our approach is designed to work on static images, while their approach is designed to work on video.
3 Method
3.1 Deep Models
We employ three CNN models in this work, namely VGG-face [? ], VGG-f [? ] and VGG-13 [? ]. Among these three models, only VGG-13 is trained from scratch. For the other two CNN models, we use pre-trained as well as fine-tuned versions. In order to train or fine-tune the models, we use stochastic gradient descent with mini-batches of images and the momentum rate set to . All models are trained using data augmentation, which is based on including horizontally flipped images.
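The flip-based augmentation described above can be sketched as follows; the function name and the NumPy formulation are our own, not the authors' implementation:

```python
import numpy as np

def augment_with_flips(images, labels):
    """Double the training set by adding horizontally flipped copies.

    images: array of shape (n, height, width) or (n, height, width, channels).
    Flipping reverses the width axis; labels are simply duplicated, since a
    mirrored face shows the same expression.
    """
    flipped = images[:, :, ::-1]  # reverse the width axis
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])
```

Since flipping preserves the expression label, this augmentation exactly doubles the number of training samples without any manual relabeling.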
VGG-face. With layers, VGG-face [? ] is the deepest network that we fine-tune. Since VGG-face is pre-trained on a closely related task (face recognition), we freeze the weights in the convolutional (conv) layers and train only the fully-connected (fc) layers to adapt the network to our task (facial expression recognition). We replace the original softmax layer with a softmax layer of seven units, since the FER 2013 data set [? ] contains seven classes of emotion. We randomly initialize the weights in this layer, using a Gaussian distribution with zero mean and standard deviation. We add a dropout layer after the first fc layer, with the dropout rate set to . We set the learning rate to and we decrease it by a factor of when the validation error stagnates for more than epochs. We fine-tune VGG-face using DSD training [? ]. On FER 2013, we train the network for epochs in the first dense phase. In the sparse phase, we carry on training for another epochs, with the sparsity rate set to for all fc layers. In the second dense phase, we train the network for epochs. Finally, we train the network for another epochs during a second sparse phase, without changing the sparsity rate. In total, the network is trained for epochs on FER 2013. We also train VGG-face on the FER+ data set [? ], by fine-tuning only the fc layers. For this data set, we replace the softmax layer with a softmax layer of eight units, since there are eight classes of emotion in FER+ instead of seven, as in FER 2013. We train the network for epochs in the dense phase, then we switch to the sparse phase for another epochs. We continue the training for another epochs during a dense phase. Finally, we carry on training the VGG-face model for epochs in a second sparse phase. In total, the network is trained for epochs on FER+.
VGG-f. We also fine-tune the VGG-f [? ] network with layers, which is pre-trained on ILSVRC [? ]. Since VGG-f is pre-trained on a distantly related task (object class recognition), we fine-tune all of its layers. We set the learning rate to and we decrease it by a factor of when the validation error stagnates for more than epochs. In the end, the learning rate drops to . After each fc layer, we add a dropout layer with the dropout rate set to . We also add dropout layers after the last two conv layers, setting their dropout rates to . In total, there are four dropout layers. As for VGG-face, we use the DSD training method to fine-tune the VGG-f model. However, we refrain from pruning the weights of the first two conv layers during the sparse phase, since these layers have a higher negative impact on the validation accuracy of the network, as illustrated in Figure 1. Based on the sensitivity analysis presented in Figure 1, we choose, for each layer, the highest sparsity rate in the set that does not affect the validation accuracy by more than . On FER 2013, we train this network for a total of epochs using DSD, with a dense phase of epochs, a sparse phase of epochs, another dense phase of epochs, followed by a sparse phase of epochs, a dense phase of epochs, another sparse phase of epochs, and, finally, a dense phase of epochs. On FER+, we set the learning rate to and decrease it by a factor of to a final rate of . Once again, we use DSD training without applying weight pruning on the first two conv layers. We keep the dropout rates for the two dropout layers added after the fc layers, but we decrease the dropout rates to for the dropout layers added after the last conv layers. We train the network for a total of epochs, starting with a dense phase of epochs, followed by a sparse phase of epochs, and a final dense phase of epochs.
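As a rough illustration of the sparse phase of DSD training, the following sketch computes a binary mask that prunes the smallest-magnitude weights of a layer; the remaining weights keep training while the masked ones are held at zero. The helper name and the thresholding details are our own simplification of the procedure:

```python
import numpy as np

def sparse_phase_mask(weights, sparsity):
    """Binary mask for a DSD sparse phase.

    Prunes the fraction `sparsity` of the layer's weights with the
    smallest absolute value (mask entry 0); all other entries are 1.
    During the sparse phase, the layer's gradient update is multiplied
    by this mask, so pruned weights stay at zero.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)
```

In the subsequent dense phase, the mask is simply dropped and all weights are trained again, which is what distinguishes DSD from one-shot pruning.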
VGG-13. The VGG-13 architecture was specifically designed by Barsoum et al. [? ] for the FER+ data set. Since the images in FER 2013 are of the same size, we consider that VGG-13 is an excellent choice for a model that we train from scratch. The weights are randomly initialized, by drawing them from a Gaussian distribution with zero mean and standard deviation. We use the same dropout rates as in the original paper [? ]. We set the initial learning rate to and we decrease it by a factor of whenever the validation error stops decreasing. The last learning rate that we use is . On the FER 2013 data set, we train VGG-13 for epochs. On the FER+ data set, we reach the best validation accuracy after the same number of epochs. For VGG-13, the DSD training strategy does not seem to improve the validation accuracy. Hence, we do not employ DSD for VGG-13.
3.2 Handcrafted Model
The BOVW model proposed for facial expression recognition is divided into two pipelines, one for training and one for testing. In the training pipeline, we build the feature representation by extracting dense SIFT descriptors [? ? ] from all training images and then quantizing the extracted descriptors into visual words using k-means clustering [? ]. The visual words are then stored in a randomized forest of k-d trees [? ] to reduce the search cost. After building the vocabulary of visual words, the training and testing pipelines become equivalent. For each image in the training or test set, we record the presence or absence of each visual word in a binary feature vector. The standard BOVW model described so far ignores spatial relationships among visual words, but we can achieve better performance by including spatial information. Perhaps the most popular and straightforward approach to including spatial information is the spatial pyramid [? ]. The spatial pyramid representation is obtained by dividing the image into increasingly fine sub-regions (bins) and computing the binary feature vector corresponding to each bin. The final representation is a concatenation of all the binary feature vectors. It is reasonable to think that dividing an image representing a face into bins is a good choice, since most features, such as the contraction of the muscles at the corner of the eyes, are only visible in a certain region of the face.
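A minimal sketch of the binary spatial-pyramid BOVW representation described above, for a single pyramid level. The function and parameter names are hypothetical; the actual system quantizes dense SIFT descriptors with a k-d tree forest, whereas here the visual word assignments are assumed to be given:

```python
import numpy as np

def spatial_pyramid_bovw(word_ids, xs, ys, width, height,
                         vocab_size, grid=(2, 2)):
    """Binary bag-of-visual-words over one spatial grid level.

    word_ids: visual word index of each dense descriptor (already quantized).
    xs, ys:   pixel coordinates where each descriptor was extracted.
    Each cell of `grid` gets its own binary presence vector; the final
    representation concatenates the vectors of all cells.
    """
    gx, gy = grid
    feature = np.zeros(gx * gy * vocab_size, dtype=np.float32)
    for w, x, y in zip(word_ids, xs, ys):
        cx = min(int(x * gx / width), gx - 1)   # column of the spatial bin
        cy = min(int(y * gy / height), gy - 1)  # row of the spatial bin
        feature[(cy * gx + cx) * vocab_size + w] = 1.0  # presence, not count
    return feature
```

The full pyramid representation is obtained by calling this for each grid level (with its own vocabulary size) and concatenating the results.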
3.3 Model Fusion and Learning
Model fusion. We combine the deep and handcrafted models before the learning stage, by concatenating the corresponding features. To extract deep features from the pre-trained or fine-tuned CNN models, we remove the softmax classification layer and consider the activation map of the last remaining fc layer as the deep feature vector corresponding to the image provided as input to the network. The deep feature vectors are normalized using the L2-norm. The bag-of-visual-words representation is the only kind of handcrafted features that we employ. The BOVW feature vectors are also normalized using the L2-norm.
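The fusion step can be sketched as follows, assuming L2 normalization of each feature block before concatenation; the function name is our own:

```python
import numpy as np

def fuse_features(deep_feats, bovw_feats, eps=1e-12):
    """Concatenate L2-normalized deep and handcrafted feature matrices.

    Each matrix has one row per image. Normalizing each block before
    concatenation keeps one representation from dominating the other
    purely because of its scale.
    """
    def l2_normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    return np.hstack([l2_normalize(deep_feats), l2_normalize(bovw_feats)])
```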
Global learning. In the case of binary classification problems, linear classification methods look for a linear discriminant function, i.e. a function that assigns +1 to examples that belong to one class and -1 to examples that belong to the other class. Various linear classifiers differ in the way they find the vector of weights and the bias term that define the discriminant function. Support Vector Machines (SVM) [? ] try to find the vector of weights and the bias term defining the hyperplane that maximally separates the feature vectors of the training examples belonging to the two classes. To extend the linear SVM classifier to our multi-class facial expression recognition problem, we employ the one-versus-all scheme.
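A minimal illustration of the one-versus-all scheme with linear SVMs, using scikit-learn's `LinearSVC` as a stand-in for the LibSVM-based implementation used in the paper; the function name is our own:

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_versus_all_predict(X_train, y_train, X_test, C=1.0):
    """One-versus-all linear SVM: one binary classifier per class.

    For each class c, a LinearSVC is trained to separate class c from
    all other classes; a test sample receives the label of the class
    whose classifier produces the highest decision score.
    """
    classes = np.unique(y_train)
    scores = np.empty((len(X_test), len(classes)))
    for i, c in enumerate(classes):
        clf = LinearSVC(C=C).fit(X_train, y_train == c)  # binary sub-problem
        scores[:, i] = clf.decision_function(X_test)
    return classes[np.argmax(scores, axis=1)]
```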
Local learning. Local learning methods attempt to locally adjust the performance of the training system to the properties of the training set in each area of the input space. A local learning algorithm essentially works by selecting a few training samples located in the vicinity of a given test sample, training a classifier with only these few examples, and finally, applying the classifier to predict the class label of the test sample. It is interesting to note that the k-nearest neighbors (k-NN) model can be included in the family of local learning algorithms. Actually, the k-NN model is the simplest formulation of local learning, since the discriminant function is constant (there is no learning involved). Moreover, almost any other classifier can be employed in the local learning paradigm. In our case, we employ the linear SVM classifier for the local classification problem. It is important to mention that, besides the classifier, a similarity or distance measure is also required to determine the neighbors located in the vicinity of a test sample. In our case, we use the cosine similarity. An interesting remark is that a linear classifier such as SVM, put in the local learning framework, becomes non-linear. In the standard approach, a single linear classifier trained at the global level (on the entire training set) produces a linear discriminant function. On the other hand, the discriminant function for a set of test samples is no longer linear in the local learning framework, since each prediction is given by a different linear classifier, specifically trained for a single test sample. Moreover, the discriminant function cannot be determined without having the test samples beforehand, yet the local learning paradigm is able to rectify some limitations of linear classifiers, as illustrated in Figure 2. Local learning has a few advantages over standard learning methods. First, it divides a hard classification problem into simpler sub-problems. Second, it reduces the variety of samples in the training set, by selecting the samples that are most similar to the test one.
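The three-step local learning procedure can be sketched as follows. This is a simplified version with our own function names, using scikit-learn's `LinearSVC` (one-versus-all by default) in place of the paper's LibSVM-based SVM:

```python
import numpy as np
from sklearn.svm import LinearSVC

def local_svm_predict(X_train, y_train, X_test, k=200, C=1.0):
    """Local learning: a fresh SVM per test sample, trained on its neighbors.

    For each test sample: (1) rank the training samples by cosine
    similarity to the test sample, (2) train a one-versus-all linear SVM
    on the k nearest ones, (3) use that classifier to predict the label
    of this single test sample only.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    Xtr_n, Xte_n = normalize(X_train), normalize(X_test)
    preds = []
    for i in range(len(X_test)):
        sims = Xtr_n @ Xte_n[i]        # cosine similarity to all training samples
        nn = np.argsort(-sims)[:k]     # indices of the k nearest neighbors
        if len(np.unique(y_train[nn])) == 1:
            preds.append(y_train[nn][0])  # all neighbors agree: no SVM needed
        else:
            clf = LinearSVC(C=C).fit(X_train[nn], y_train[nn])
            preds.append(clf.predict(X_test[i:i + 1])[0])
    return np.array(preds)
```

Note that each test sample gets its own classifier, which is why the resulting global decision function is non-linear even though every local classifier is linear.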
4 Experiments
4.1 Data Sets
We conduct experiments on the FER 2013 [? ] and FER+ [? ] data sets. The FER 2013 data set contains training images, validation (public test) images and another (private) test images. The images belong to seven classes of emotion: anger, disgust, fear, happiness, neutral, sadness and surprise. The FER+ data set is a curated version of FER 2013 in which some of the original images are relabeled, while other images, e.g. those not containing faces, are completely removed. Interestingly, Barsoum et al. [? ] add contempt as the eighth class of emotion. The FER+ data set contains training images, validation images and another test images. Images in both data sets are 48×48 pixels in size.
4.2 Implementation Details
The input images of 48×48 pixels are upscaled to 224×224 pixels for VGG-face and VGG-f, and to 64×64 pixels for VGG-13. We use MatConvNet [? ] to train the CNN models. To implement the BOVW model, we use functions from VLFeat [? ]. To generate the spatial pyramid representation, we divide the images into , , and bins. At each level of the pyramid, we use vocabularies of , , and words, respectively. In the training phase, we employ the SVM implementation from LibSVM [? ]. We set the regularization parameter of the SVM to for individual models and to for combined models. In the local learning approach, we employ the cosine similarity to choose the neighbors. We use neighbors for individual models and only neighbors for combined models. All parameters are tuned on the validation sets.
Table 1: Accuracy rates on the FER 2013 (private) test set and the FER+ test set, with and without data augmentation (p. = pre-trained, f. = fine-tuned).

| Model | FER | FER (aug.) | FER+ | FER+ (aug.) |
|---|---|---|---|---|
| Ionescu et al. [? ] |  | - | - | - |
| Tang [? ] |  | - | - | - |
| Yu et al. [? ] |  | - | - | - |
| Zhang et al. [? ] |  | - | - | - |
| Kim et al. [? ] |  | - | - | - |
| Barsoum et al. [? ] | - | - |  | - |
| Li et al. [? ] |  | - | - | - |
| Connie et al. [? ] |  | - | - | - |
| BOVW + global SVM |  |  |  |  |
| BOVW + local SVM |  |  |  |  |
| p. VGG-face + global SVM |  |  |  |  |
| p. VGG-face + local SVM |  |  |  |  |
| f. VGG-face + softmax | - |  | - |  |
| f. VGG-face + global SVM |  |  |  |  |
| f. VGG-face + local SVM |  |  |  |  |
| p. VGG-f + global SVM |  |  |  |  |
| p. VGG-f + local SVM |  |  |  |  |
| f. VGG-f + softmax | - |  | - |  |
| f. VGG-f + global SVM |  |  |  |  |
| f. VGG-f + local SVM |  |  |  |  |
| VGG-13 + softmax | - |  | - |  |
| VGG-13 + global SVM |  |  |  |  |
| VGG-13 + local SVM |  |  |  |  |
| BOVW + p. VGG-face + global SVM |  |  |  |  |
| BOVW + p. VGG-face + local SVM |  |  |  |  |
| BOVW + f. VGG-face + global SVM |  |  |  |  |
| BOVW + f. VGG-face + local SVM |  |  |  |  |
| f. VGG-face + VGG-13 + global SVM |  |  |  |  |
| f. VGG-face + VGG-13 + local SVM |  |  |  |  |
| f. VGG-face + f. VGG-f + global SVM |  |  |  |  |
| f. VGG-face + f. VGG-f + local SVM |  |  |  |  |
| BOVW + p. VGG-face + f. VGG-face + global SVM |  |  |  |  |
| BOVW + p. VGG-face + f. VGG-face + local SVM |  |  |  |  |
| BOVW + p. VGG-face + f. VGG-face + f. VGG-f + VGG-13 + global SVM |  |  |  |  |
| BOVW + p. VGG-face + f. VGG-face + f. VGG-f + VGG-13 + local SVM |  |  |  |  |
4.3 Results on FER 2013
Table 1 includes the results of our various models on the FER 2013 (private) test set, with and without data augmentation. Our models are compared with several state-of-the-art approaches [? ? ? ? ? ? ? ? ].
Individual models. The accuracy rates of our BOVW model are about better when we employ the SVM based on local learning. Using data augmentation, our BOVW model achieves an accuracy of , outperforming the BOVW model of Ionescu et al. [? ] by and closing the gap between handcrafted and deep models. Although the pre-trained VGG-face [? ] is trained on a rather complementary task, face recognition, it achieves a good accuracy () when it is combined with the local SVM. Fine-tuning the VGG-face model on the FER 2013 data set using data augmentation improves its accuracy to . Replacing the softmax layer with the global or local SVM seems to alter the performance of the fine-tuned VGG-face, although the accuracy drops by at most . Since VGG-f [? ] is pre-trained on a distantly related task, object class recognition, its performance is much lower than that of the pre-trained VGG-face. When using data augmentation, the global SVM based on pre-trained VGG-f features reaches an accuracy of only . Nevertheless, local learning seems to be able to recover the performance gap. Indeed, the local SVM based on pre-trained VGG-f features reaches an accuracy of , which is better than the accuracy of the global SVM. The fine-tuned VGG-f model reaches an accuracy of . When the softmax layer is replaced by the local SVM without data augmentation, the accuracy of the fine-tuned VGG-f further improves to . The VGG-13 [? ] model, which is trained from scratch on the FER 2013 data set, achieves an accuracy of . Since the input of the VGG-13 architecture is 64×64 pixels in size, it seems to be better suited to the FER 2013 data set, which contains images of 48×48 pixels, than the VGG-face or VGG-f architectures, which take as input images of 224×224 pixels. However, its lower performance compared to VGG-face or VGG-f can be explained by the fact that the other CNN models are pre-trained on related computer vision tasks.
Combined models. Most combined models provide better results than their individual counterparts. We obtain lower results when the combination includes only deep features and the labels are predicted by the local SVM. On the other hand, whenever we add handcrafted features to the combination, the local SVM approach provides better performance than the global SVM. Our best feature combination includes the BOVW representation and the deep features extracted with pre-trained VGG-face, fine-tuned VGG-face, fine-tuned VGG-f and VGG-13. With this combination, the local SVM classifier achieves an accuracy rate of when the training set is augmented with flipped images, and the difference from the global SVM is . We consider that the trade-off between accuracy and speed is acceptable, given that the local SVM finds the nearest neighbors and predicts the test labels in seconds for all test images, while the global SVM predicts the labels in seconds. The running times are measured on a computer with Intel Xeon GHz Processor and GB of RAM, using a single thread. Figure 3 provides a handful of test images that are incorrectly labeled by the global SVM, but correctly labeled by the local SVM. For our best combination of features, we also tried to determine whether applying the SVM locally (on the selected nearest neighbors) is indeed helpful compared with a k-NN model. The k-NN model yields an accuracy of with the same number of neighbors (). We thus conclude that the local SVM approach provides a considerable improvement.
4.4 Results on FER+
Table 1 includes the results of our various models on the FER+ test set, as well as the results of a state-of-the-art approach [? ], to facilitate a direct comparison between models.
Individual models. All models obtain much better results on FER+ than on FER 2013, indicating that the FER+ curation process is indeed helpful. Although the fine-tuned VGG-face obtains better results on FER 2013, it seems that the shallower VGG-f reaches the best performance () among individual models, when it is fine-tuned on FER+. The fine-tuned VGG-f model is already above the state-of-the-art approach of Barsoum et al. [? ]. Local learning seems to improve the accuracy by more than only for the BOVW model and the pre-trained VGG-f.
Combined models. The combined models usually obtain better results than each individual component. However, there is no clear evidence to indicate which of the two learning approaches, global SVM or local SVM, is better in terms of accuracy. Consistent with the top results on FER 2013, we notice that the best accuracy rates on FER+ are obtained by the same combination that includes the BOVW model, the pre-trained VGG-face, the fine-tuned VGG-face, the fine-tuned VGG-f, and the VGG-13 network. This combination of features attains an accuracy of when local SVM is employed in the training phase.
5 Conclusion
In this paper, we have presented a state-of-the-art approach for facial expression recognition, based on combining deep and handcrafted features and on applying local learning in the training phase. With a top accuracy of on the FER 2013 data set and a top accuracy of on the FER+ data set, our approach is able to surpass the best methods on these data sets [? ? ].
Acknowledgments
This research is supported by Novustech Services through Project 115788 funded under the Competitiveness Operational Programme POC-46-2-2.