Baseline CNN structure analysis for facial expression recognition


Minchul Shin, Munsang Kim and Dong-Soo Kwon

*This work was supported by the Industrial Strategic Technology Development Program (10044009, Development of a self-improving bidirectional sustainable HRI technology) funded by the Ministry of Knowledge Economy (MKE, Korea).

Minchul Shin is with the Department of Mechanical Engineering and Human-Robot Interaction Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Republic of Korea (phone: +82-42-350-8212; fax: +82-42-350-8240). min.stellastra@gmail.com

Dong-Soo Kwon is with the Department of Mechanical Engineering and Human-Robot Interaction Research Center, KAIST, Daejeon 305-701, Republic of Korea (phone: +82-42-350-3042; fax: +82-42-350-8240).

We present a baseline convolutional neural network (CNN) structure and an image preprocessing methodology to improve CNN-based facial expression recognition. To identify the most efficient network structure, we investigated four network structures that are known to perform well in facial expression recognition. We also investigated the effect of input image preprocessing methods. Five types of data input (raw, histogram equalization, isotropic smoothing, DCT-based normalization, difference of Gaussian) were tested, and their accuracy was compared. We trained 20 different CNN models (4 networks x 5 data input types) and verified the performance of each network with test images from five different databases. The experimental results showed that a three-layer structure consisting of a simple convolutional and a max pooling layer with histogram equalization image input was the most efficient. We describe the detailed training procedure and analyze the test accuracy results based on these observations.



I Introduction

Human facial expression recognition is considered important in the human-robot interaction field and has been studied with much interest in the past 10 years [1, 2, 3]. However, despite enormous effort, it remains a challenging task for robots. The traditional approach to facial expression recognition consists of two main parts: feature extraction and classification. Features extracted from the training data always play an important role in the recognition problem because the classifier makes its decision based on the combination of extracted features. Handcrafted features such as LBP, HOG, and SIFT have been widely used in the traditional approach owing to their proven performance under specific circumstances and the low computational cost of feature extraction [4, 5, 6]. Shan et al. [4] formulated a boosted-LBP feature and combined it with a support vector machine (SVM) classifier. Their method performed robustly and stably over a useful range of low-resolution facial images. Berretti et al. [5] computed the SIFT descriptor on 3D facial landmarks of depth images and used an SVM for the classification. Albiol et al. [6] proposed an HOG descriptor-based EBGM algorithm, in which the Gabor features of EBGM were replaced by HOG descriptors, achieving good performance. Although these approaches reported good accuracy, handcrafted features have inherent drawbacks. Unintended features that have no effect on classification may be included, or important features that have a great influence on the classification may be omitted. This is because the features are “crafted” by human experts, and the experts may not be able to consider all possible cases and include them in the feature.

With recent advances in deep learning and parallel computing, applying convolutional neural network (CNN)-based deep neural networks into a classification problem has attained impressive successes [7, 8, 9, 10, 11, 12, 13]. Deep learning methods are distinguishable from traditional machine learning algorithms in that they perform the feature extraction and classification process simultaneously. Another advantage of using deep learning methods is that, since they extract features through iterative weight update by backpropagation and error optimization, the classifier could include critical and unforeseen features that humans hardly come up with. This process is called feature learning, and CNN is especially suitable for processing 2D image-based training datasets. CNN can be seen as a special type of multilayer perceptron (MLP), but CNN rather focuses on the local relationships between pixels by using receptive fields. According to Goodfellow et al. [7], the performance of a feature learning algorithm can be proven through the result of a facial expression recognition competition (FER-2013). The top three teams out of 56 teams that participated in the aforementioned competition all used CNNs, and the result showed that the features learned by the CNN are indeed capable of outperforming handcrafted features, although the difference is not extreme.

In this paper, we compare various types of CNN structures and identify the most effective structure for facial expression recognition. Four types of CNN structure were tested, and the one that showed the highest accuracy was selected as a target structure for fine parameter tuning. In addition to the CNN structure selection, we also tested five types of differently preprocessed images (raw, histogram equalization, isotropic diffusion-based normalization, DCT-based normalization, and difference of Gaussian) to determine the most suitable form of training images. We expect that, by establishing a baseline CNN structure and image preprocessing method, the recognition accuracy of other CNN-based deep learning approaches will improve.

II Related Works

Facial expression recognition with CNN-based deep learning was reported in Refs. [8, 9, 10, 11, 12, 13]. Tang, the winner of the ICMLW2013 facial expression recognition challenge, reported that the implementation of a multi-class L2-SVM instead of a softmax layer for loss calculation was actually able to improve the classification accuracy [8]. Since L2-SVM is in quadratic form, which is differentiable, switching softmax to SVM is simple and shows a slight increase in accuracy compared to softmax. Liu et al. [9] used 3D-CNN and a deformable facial action part model to localize facial action parts and learn part-based features for video-based emotion classification. The kernel size can also be an important factor for CNN performance. Fasel [10] implemented an intentionally large receptive field (11x11) for integrating the features found by the early stage convolutional layer. The classifier ensemble technique is frequently used for boosting the classification performance [11, 12, 13]. Yu et al. [11] reported that the random initialization of neural networks not only leads to varying network parameters, but also renders the classification ability of diverse networks. For this reason, the ensemble technique usually shows concrete performance improvement. Kahou et al. [12] combined modality-specific multiple deep neural networks. Their method mixed a number of modalities such as facial image, audio, bag of mouth features with CNN, and deep restricted Boltzmann machine, and the final predictions of each classifier were averaged. Yu et al. [11] independently trained multiple differently initialized CNNs and tested the log-likelihood loss and hinge loss for the ensemble optimization method of output responses. Their method resulted in a slight increase in accuracy compared to the single CNN model. Lastly, Kim et al. [13] proposed a hierarchical committee of deep CNNs with exponentially weighted decision fusion. 
They trained 216 CNN models sharing the same structure but with different weights and ensembled them with their own decision fusion rule called VA-Expo-WA. This approach requires a large amount of computing power, but it performed well, winning the EmotiW2015 facial expression recognition competition. As can be seen from these works, many papers on facial expression recognition used CNN-based deep neural networks, and their baseline structures and image preprocessing methods were all different. The reason is that, in most cases, the CNN structure selection or data preprocessing method was not the main research focus. However, choosing the proper CNN structure and preprocessed input image type is important for improving accuracy. In this paper, we compare five preprocessed image types and four types of CNN structure and suggest the most suitable baseline CNN model and image type for facial expression recognition.

III Data Preparation

In this section, we first describe what type of dataset is used for the experiment and how it is modified. Then, the face detection method and face registration processes will be discussed in sequence.

III-A Dataset

To train a deep convolutional network and verify its performance, we used multiple datasets. For the training dataset, the Facial Expression Recognition 2013 (FER-2013) dataset released for the ICMLW subchallenge and the Static Facial Expression in the Wild (SFEW2.0) dataset released for the EmotiW2015 competition were used. FER-2013 contains 28,709 training faces, 3,589 private test faces, and 3,589 public test faces. We used the training faces and public test faces for training and left the private test faces for the accuracy test. FER-2013 facial images are in a 48x48 grayscale format and are not exactly frontal, including variation in rotation and translation. Since FER-2013 images were collected using the Google Image Search API, they contain a variety of facial expressions existing in real-world conditions. The SFEW2.0 dataset consists of 944 training faces, 422 validation faces, and 372 test faces. The SFEW dataset is a static subset of AFEW, which contains video clips extracted from movies. Although the emotions in movies are not very spontaneous, they provide facial expressions in a much more natural and versatile way than those found in laboratory-controlled datasets [11].

For the performance evaluation, five different datasets were chosen: FER-2013, SFEW2.0, CK+ (extended Cohn-Kanade), KDEF (Karolinska Directed Emotional Faces), and Jaffe [14, 15, 16]. CK+ is the most widely used dataset for facial expression recognition, and it includes seven different facial expression labels. KDEF contains facial images of 70 individuals and 7 different facial expressions from 5 different angles. This dataset was originally developed for use in psychological and medical research purposes, but is also suitable for facial expression recognition. Jaffe contains facial images of Japanese females, and the head pose is almost frontal. Since there are very few datasets containing facial images of Asians, Jaffe can be a good test set for performance evaluation. All datasets used in this work included seven types of the same emotion expression, and every single image was fully labeled.

III-B Face Detection and Registration

Most of the papers on the subject used the face crop technique owing to the robustness of its performance and to its high accuracy. We also cropped faces from each dataset and aligned the faces with respect to the landmark position of the eye. Every single facial image was rotated so that the straight line connecting two eye positions becomes parallel to the horizontal line. For this task, we needed a face detector that could extract the face landmark, and the dlib C++ library provided by King [20] performed very well for this purpose. The face detection algorithm implemented in dlib is based on Ref. [21] with fast response. Kazemi and Sullivan [21] proposed a face alignment method based on the ensemble of regression trees that performs shape invariant feature selection. A simple example code is provided so that anyone can easily try and test the method. Faces were detected and cropped from the facial images in the datasets, and the number of images that were successfully processed is shown in Table 1. Det# stands for the number of successful detections, Manual# for the number of images left after manual filtering by the authors, and Crop# for the number of images after cropping and flipping. The success rate of the face detection differed depending on the dataset, with the highest rate, 0.802, for SFEW2.0 and the lowest rate, 0.607, for KDEF. There are two reasons for the failure of the face detection. First, the face detector could not find faces if only one side of the face is shown in the picture. This is because the ensemble model of the face detector in dlib was trained with the Helen dataset, in which training images are close to frontal face. Since the images in the KDEF dataset were captured at five different angles, plenty of faces were not frontal. Second, the FER-2013 dataset provides cropped facial images from the beginning, where the faces are fully filled in 48x48 pixels. 
For this reason, the face landmark detector often failed to extract the landmark, and failed images were excluded from the training data to eliminate the possible effects from face rotation. Although 68.53% of the faces in total were detected and processed, more than 25K training images were enough to include various features of facial expression, and the result will be discussed later.
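The eye-based alignment described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the eye coordinates are assumed to have already been obtained from a landmark detector, and the function names `eye_alignment_angle` and `rotate_point` are hypothetical.

```python
import math


def eye_alignment_angle(left_eye, right_eye):
    """Angle in degrees between the line joining the eyes and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, centre, angle_deg):
    """Rotate `point` about `centre` by -angle_deg, which levels the eye line."""
    theta = math.radians(-angle_deg)
    x = point[0] - centre[0]
    y = point[1] - centre[1]
    xr = x * math.cos(theta) - y * math.sin(theta)
    yr = x * math.sin(theta) + y * math.cos(theta)
    return (centre[0] + xr, centre[1] + yr)


# Hypothetical eye positions (x, y) in pixel coordinates.
left, right = (10.0, 20.0), (30.0, 30.0)
angle = eye_alignment_angle(left, right)
centre = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
aligned_left = rotate_point(left, centre, angle)
aligned_right = rotate_point(right, centre, angle)
```

After the rotation, both eyes share the same vertical coordinate and their distance is unchanged. Applying the same rotation to every pixel, for example through an image library's affine warp, would produce the registered face image.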

                     FER-2013   SFEW2.0   CK+     KDEF     Jaffe
Face #               35,887     1,366     309     4,898    181
Det #                24,668     1,095     309     2,972    181
Rate (Det#/Face#)    0.687      0.802     1.000   0.607    1.000
Manual #             24,657     832       308     2,961    180
Crop #               246,570    8,320     3,080   29,610   1,800
TABLE I: Number of correct detections for each dataset

Input     Test set (%)   Tang         Yu           Kahou        Caffe-ImageNet
Raw       FER-2013       62.20 (35)   61.87 (35)   58.35 (70)   60.58 (30)
          SFEW2.0        45.96        50.6         45.25        51.22
          CK+            60.98        59.93        56.72        59.02
          KDEF           54.18        53.21        52.28        56.86
          Jaffe          49.56        47.89        50.28        47.78
          Avg.           54.58        54.70        52.58        55.09

Hist-eq   FER-2013       66.67 (30)   64.98 (35)   65.97 (70)   66.99 (30)
          SFEW2.0        64.84        66.65        68.14        68.46
          CK+            65.54        60.92        57.48        63.11
          KDEF           50.66        49.5         50.02        51.12
          Jaffe          49.17        44.89        49.94        46.78
          Avg.           59.38        57.39        58.31        59.29

IS        FER-2013       62.16 (35)   62.82 (32)   61.34 (70)   62.73 (23)
          SFEW2.0        52.9         54.35        58.09        53.90
          CK+            62.26        63.8         61.77        66.49
          KDEF           57.28        58.93        56.12        59.15
          Jaffe          46.22        48.33        47.33        50.61
          Avg.           56.16        57.65        56.93        58.58

DCT       FER-2013       56.09 (40)   58.33 (50)   54.08 (70)   56.15 (22)
          SFEW2.0        46.69        51.02        52.88        45.92
          CK+            54.33        57.48        53.60        53.54
          KDEF           54.92        52.96        48.88        54.30
          Jaffe          45.5         47.83        44.39        43.94
          Avg.           51.50        53.52        50.77        50.77

DoG       FER-2013       58.96 (35)   59.38 (40)   57.84 (60)   58.94 (25)
          SFEW2.0        49.37        50.47        57.82        52.17
          CK+            56.03        55.28        56.75        57.25
          KDEF           55.19        56.3         48.40        54.38
          Jaffe          47.05        41.17        45.61        47.50
          Avg.           53.35        52.52        53.28        54.05

a Avg. indicates the average accuracy over the five test sets. The number in parentheses is the epoch of the model selected through validation.
TABLE II: Accuracy result
network category layer1 layer2 layer3 layer4 layer5 layer6 layer7 layer8 layer9 layer10 layer11 layer12 layer13
Tang layer conv maxp conv maxp conv maxp fc output
kernel 5x5(1,2) 3x3(2,1) 4x4(1,1) 3x3(2,1) 5x5(1,2) 3x3(2,1*) - -
maps 42@32 21@32 20@32 10@32 42@32 42@32 1@3072 1@7
Yu layer conv stochp conv conv stochp conv conv stochp fc fc output
kernel 5x5(1,2) 3x3(2,1) 3x3(1,1) 3x3(1,1) 3x3(2,1) 3x3(1,1) 3x3(1,1) 3x3(2,0) - - -
maps 42@48 21@48 21@48 11@64 11@128 11@128 5@128 1@1024 1@1024 1@7
Kahou layer conv maxp lrn conv avgp lrn conv avgp fc output
kernel 5x5(1,2) 3x3(2,0) - 3x3(1,1) 3x3(2,1) - 3x3(1,1) 3x3(2,1*) - -
maps 42@64 21@64 21@64 20@64 10@64 10@128 5@128 1@3072 1@7
ImageNet layer conv maxp lrn conv maxp lrn conv conv conv maxp conv fc output
kernel 5x5(1,2) 3x3(2,0) - 3x3(1,1) 3x3(2,1) - 3x3(1,1) 3x3(1,1) 3x3(1,1) 3x3(2,1*) 5x5(1,0) - -
maps 42@32 20@32 20@32 20@96 10@96 10@96 10@128 10@128 10@96 5@96 1@1024 1@1024 1@7

a conv, fc, lrn, maxp, avgp, stochp: convolutional, fully-connected, local response norm, max-pooling, average-pooling, stochastic-pooling
b kernel: [kernel size]([stride],[zero-padding]), where padding marked with an asterisk (*) refers to zero-padding in the top and left directions only.
c maps: [size of output maps]@[the number of output maps]
TABLE III: Network structure descriptions

III-C Generating translation variation

Yu et al. [11] reported that random perturbation through cropping of images essentially generates additional unseen training samples and therefore makes the network more robust. A similar image cropping technique was also applied in Refs. [12, 13], where the images were cropped and flipped to introduce translation variation. On the basis of this observation, we also cropped each original 48x48 facial image into five 42x42 crops (center, top left, top right, bottom left, and bottom right) and flipped them horizontally. Consequently, the training data were augmented tenfold. In total, 227,890 face images were generated and used for the training set.
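The five-crop-plus-flip augmentation can be illustrated with a minimal sketch. It assumes the image is a plain list of pixel rows and that the flip is horizontal; the function name `five_crops_with_flips` is hypothetical and this is not the authors' implementation.

```python
def five_crops_with_flips(image, crop=42):
    """Return 10 patches: centre and four corner crops, each with its mirror image.

    `image` is a list of rows, each row a list of pixel values.
    """
    height, width = len(image), len(image[0])
    margin_h, margin_w = height - crop, width - crop
    # (top, left) offsets: centre, top left, top right, bottom left, bottom right
    offsets = [
        (margin_h // 2, margin_w // 2),
        (0, 0),
        (0, margin_w),
        (margin_h, 0),
        (margin_h, margin_w),
    ]
    patches = []
    for top, left in offsets:
        patch = [row[left:left + crop] for row in image[top:top + crop]]
        patches.append(patch)
        patches.append([row[::-1] for row in patch])  # horizontal flip
    return patches


# A synthetic 48x48 "image" whose pixel value encodes its position.
image = [[r * 48 + c for c in range(48)] for r in range(48)]
crops = five_crops_with_flips(image)
```

Each 48x48 input yields ten 42x42 patches, which matches the tenfold relation between the Manual # and Crop # rows of Table I.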

IV Image Preprocessing Method

How illumination and contrast are handled greatly influences the resulting accuracy. In this paper, five frequently used face preprocessing methods (raw, histogram equalization, isotropic diffusion-based normalization, DCT-based normalization, difference of Gaussian) were tested. Histogram equalization (Hist-eq) is a contrast enhancement technique that usually increases the global contrast of images. This method is effective when the background’s and foreground’s brightness are almost the same. Isotropic diffusion-based normalization, so-called isotropic smoothing (IS), is a technique that aims to reduce image noise without removing significant parts of the image content such as edges or lines. The discrete cosine transform (DCT)-based normalization technique was proposed by Maheshkar et al. [17] and has been widely applied to facial images owing to its powerful transform ability in image processing. Lastly, difference of Gaussian (DoG) is the most basic feature enhancement technique, which involves the subtraction of two differently blurred images. DoG is usually very effective for increasing the visibility of edges and for representing texture details. Hist-eq was applied using the OpenCV library, and IS, DCT, and DoG were applied using the transform functions provided by the INface toolbox with default parameter settings.
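To make the Hist-eq step concrete, the following is a minimal plain-Python sketch of standard global histogram equalization for 8-bit grayscale images. It illustrates the general technique only; the experiments used the OpenCV library, whose rounding behavior may differ slightly, and the function name `equalize_histogram` is hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a grayscale image given as a list of rows."""
    flat = [pixel for row in image for pixel in row]
    hist = [0] * levels
    for pixel in flat:
        hist[pixel] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]

    # Map each intensity so that the cumulative distribution becomes roughly linear.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[pixel] for pixel in row] for row in image]


low_contrast = [[10, 20], [30, 40]]
equalized = equalize_histogram(low_contrast)
print(equalized)  # [[0, 85], [170, 255]]
```

The four closely spaced intensities are spread across the full 0 to 255 range, which is the global contrast increase described above.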

V CNN Structure Candidates

As candidates for the baseline CNN structure for facial expression recognition, we chose four different structures based on other researchers’ works. The first one is Tang’s CNN structure [8], which was also used as a baseline structure in the work of Kim et al. [13]. It consists of one input transform and three convolutional and pooling layers, followed by a fully connected two-layer MLP. The next is Yu’s structure [11], which contains five convolutional layers, three stochastic pooling layers, and three fully connected layers. The network has two convolutional layers prior to pooling, except for the first layer. The third one is Kahou’s structure [12]. It consists of three convolutional pooling layers and a two-layer MLP, with local response normalization (spatial batch normalization) and average pooling. The last candidate is the Caffe-ImageNet structure [19]. This structure was designed to classify 1000 classes from the ImageNet dataset, but in this paper, the number of output nodes was reduced to seven. For all four candidates, a rectified linear unit (ReLU) layer and a dropout layer were applied to every convolutional and fully connected layer. It is generally believed that dropout prevents the network from overfitting, and ReLU enables deep neural networks to be trained readily without additional initialization or a normalization process such as an autoencoder or a restricted Boltzmann machine [18]. The detailed parameters of each structure are given in Table 3, where the kernel category is represented as {filter size}({stride},{padding}).
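The {filter size}({stride},{padding}) notation determines the spatial size of each output map. A short sketch of the usual size calculation follows. It assumes symmetric zero-padding and floor rounding, which the paper does not state explicitly, and it does not model the asterisk (top and left only) padding. The names `output_size` and `trace_sizes` are hypothetical.

```python
def output_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (floor rounding assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(input_size, layers):
    """Apply output_size layer by layer; each layer is (kernel, stride, padding)."""
    sizes = []
    for kernel, stride, padding in layers:
        input_size = output_size(input_size, kernel, stride, padding)
        sizes.append(input_size)
    return sizes


# First four layers of Tang's structure as listed in Table III:
# conv 5x5(1,2), maxp 3x3(2,1), conv 4x4(1,1), maxp 3x3(2,1)
tang_first_four = [(5, 1, 2), (3, 2, 1), (4, 1, 1), (3, 2, 1)]
print(trace_sizes(42, tang_first_four))  # [42, 21, 20, 10]
```

Starting from the 42x42 input, the computed sizes agree with the first four map sizes listed for Tang's structure in Table III.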

VI Experiment

VI-A Network classification performance

The classification performance of each network is shown in Table 2. Tests were performed on five test sets (FER-2013, SFEW2.0, CK+, KDEF, and Jaffe) with four different network structures (Tang, Yu, Kahou, and Caffe-ImageNet). As can be seen in Figure 4, the Hist-eq method showed the highest performance for all four network candidates. According to Table 2, the average accuracy of Hist-eq is the highest, followed by IS, except for Yu’s structure. It is interesting to note that the raw method generally showed better validation accuracy than IS did, as shown in Fig. 3, but IS showed better test accuracy. This indicates that the network is sensitive to illumination or contrast variation, such that reducing the influence of illumination or contrast becomes more important for classifying test images that were not included in the training set. In addition, we can see that Hist-eq outperformed the other preprocessing methods, specifically for the SFEW test set. The reason is not clear, but we conjecture that contrast enhancement techniques such as Hist-eq fit well because the SFEW database contains images extracted from movie scenes, which usually have high contrast. From this observation, we concluded that the Hist-eq method is the most reliable for facial expression applications. With Hist-eq, Tang’s model showed the best performance, with an average test accuracy of 59.38%. Although Caffe-ImageNet achieved nearly the same accuracy as Tang’s model, the network complexity is different. Caffe-ImageNet was originally designed for classifying 1000 objects, so it has a much more complex structure and more nodes. In practice, Caffe-ImageNet consumed almost twice as much computing power to complete the training. Despite this difference in complexity, the similar test accuracy of Tang’s model and Caffe-ImageNet suggests that both networks were complex enough to capture the features extractable from a 42x42 input. 
In this respect, choosing Tang’s network seems reasonable, saving more computing power. The best baseline CNN structure can vary depending on the image preprocessing methods. For example, if we use a DCT-based normalization technique, the best structure could be Yu’s structure. We suggest carefully choosing the network structure and the image preprocessing method by referring to the result shown in Table 2.
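The averaged accuracies used in this comparison can be reproduced from the per-dataset values in Table II. The sketch below assumes that the average is the unweighted mean over the five test sets, which is consistent with the reported numbers; the helper name `mean_accuracy` is hypothetical.

```python
def mean_accuracy(per_dataset):
    """Unweighted mean accuracy (%) over the test sets, rounded to two decimals."""
    return round(sum(per_dataset.values()) / len(per_dataset), 2)


# Per-dataset accuracies (%) with Hist-eq input, taken from Table II.
tang_hist_eq = {
    "FER-2013": 66.67, "SFEW2.0": 64.84, "CK+": 65.54, "KDEF": 50.66, "Jaffe": 49.17,
}
caffe_hist_eq = {
    "FER-2013": 66.99, "SFEW2.0": 68.46, "CK+": 63.11, "KDEF": 51.12, "Jaffe": 46.78,
}

print(mean_accuracy(tang_hist_eq))   # 59.38
print(mean_accuracy(caffe_hist_eq))  # 59.29
```

Because the mean is unweighted, the small Jaffe and CK+ sets count as much as FER-2013, which is worth keeping in mind when reading the averaged figures.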

VI-B Train and validation details

We used the Torch7 deep learning library on an NVIDIA GeForce GTX 750 Ti GPU. The network weights were optimized by backpropagation with stochastic gradient descent, using a batch size of 50 and a momentum of 0.9. A fixed learning rate of 0.005 was applied throughout training. No pretraining or fine-tuning was performed, and the weight decay was fixed to 0.00001. To avoid overfitting, we added a dropout layer after every convolutional and fully connected layer, following the ReLU layer. The initial weights were set with the Xavier method, and a softmax layer was used for the output layer. During training, validation was performed in every epoch to monitor overfitting. In every epoch, 10% of the training images were randomly chosen for validation, and the other images were used for training.
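The optimization settings above can be summarized in a small plain-Python sketch. It is not the Torch7 code used in the experiments: the update rule shown is the common momentum form with L2 weight decay, which is an assumption about the optimizer's internals, and the function names are hypothetical.

```python
import random

# Hyperparameters reported in the text.
LEARNING_RATE = 0.005
MOMENTUM = 0.9
WEIGHT_DECAY = 0.00001
BATCH_SIZE = 50
VALIDATION_FRACTION = 0.1


def split_train_val(num_samples, val_fraction=VALIDATION_FRACTION, rng=random):
    """Randomly hold out a fraction of sample indices for validation (redone each epoch)."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    n_val = int(round(num_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


def batches(indices, batch_size=BATCH_SIZE):
    """Yield consecutive mini-batches of indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


def sgd_momentum_step(weights, grads, velocity,
                      lr=LEARNING_RATE, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One SGD step with momentum and L2 weight decay (assumed update form)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w      # L2 penalty added to the gradient
        v = momentum * v + g          # accumulate velocity
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


train_idx, val_idx = split_train_val(1000, rng=random.Random(0))
w, v = sgd_momentum_step([1.0], [0.5], [0.0], weight_decay=0.0)
```

With 1000 samples, the split yields 900 training and 100 validation indices, and the training indices form 18 mini-batches of 50.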

VII Conclusions

In this paper, we investigated efficient network structures and data preprocessing methods in order to establish a baseline structure for facial expression recognition. For the preprocessing of the input image, the Hist-eq method showed the most reliable performance for all the network models. Moreover, we found that Tang’s network achieved reasonably high accuracy with Hist-eq images compared to the other networks, even with less network complexity. On the basis of this observation, we suggest Tang’s simple network with Hist-eq images as a baseline CNN model for further research. We expect that this baseline structure can help researchers choose a reasonable network structure, whether they build CNN-based algorithms that use ensemble techniques such as committee machines or work with a single CNN structure.



  • [1] Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(10), 974-989.
  • [2] Pantic, M., & Rothkrantz, L. J. (2000). Automatic analysis of facial expressions: The state of the art. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(12), 1424-1445.
  • [3] Samal, A., & Iyengar, P. A. (1992). Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern recognition, 25(1), 65-77.
  • [4] Shan, C., Gong, S., & McOwan, P. W. (2009). Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6), 803-816.
  • [5] Berretti, S., Amor, B. B., Daoudi, M., & Del Bimbo, A. (2011). 3D facial expression recognition using SIFT descriptors of automatically detected keypoints. The Visual Computer, 27(11), 1021-1036.
  • [6] Albiol, A., Monzo, D., Martin, A., Sastre, J., & Albiol, A. (2008). Face recognition using HOG-EBGM. Pattern Recognition Letters, 29(10), 1537-1543.
  • [7] Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., … & Zhou, Y. (2013, November). Challenges in representation learning: A report on three machine learning contests. In Neural information processing (pp. 117-124). Springer Berlin Heidelberg.
  • [8] Tang, Y. (2013). Deep learning using support vector machines. CoRR, abs/1306.0239.
  • [9] Liu, M., Li, S., Shan, S., Wang, R., & Chen, X. (2014). Deeply learning deformable facial action parts model for dynamic expression analysis. In Computer Vision–ACCV 2014 (pp. 143-157). Springer International Publishing.
  • [10] Fasel, B. (2002). Head-pose invariant facial expression recognition using convolutional neural networks. In Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on (pp. 529-534). IEEE.
  • [11] Yu, Z., & Zhang, C. (2015, November). Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (pp. 435-442). ACM.
  • [12] Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., … & Mirza, M. (2013, December). Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International conference on multimodal interaction (pp. 543-550). ACM.
  • [13] Kim, B. K., Lee, H., Roh, J., & Lee, S. Y. (2015, November). Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (pp. 427-434). ACM.
  • [14] Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010, June). The extended Cohn-Kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on (pp. 94-101). IEEE.
  • [15] Lyons, M., Akamatsu, S., Kamachi, M., & Gyoba, J. (1998, April). Coding facial expressions with Gabor wavelets. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on (pp. 200-205). IEEE.
  • [16] Goeleven, E., De Raedt, R., Leyman, L., & Verschuere, B. (2008). The Karolinska directed emotional faces: a validation study. Cognition and Emotion, 22(6), 1094-1118.
  • [17] Maheshkar, V., Kamble, S., Agarwal, S., & Srivastava, V. K. (2012). DCT-based reduced face for face recognition. International Journal of Information Technology and Knowledge Management, 5(1), 97-100.
  • [18] Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 1139-1147).
  • [19] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … & Darrell, T. (2014, November). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia (pp. 675-678). ACM.
  • [20] King, D. E. (2009). Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10, 1755-1758.
  • [21] Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1867-1874).