Interpretable Facial Relational Network Using Relational Importance
Human face analysis is an important task in computer vision. According to cognitive-psychological studies, facial dynamics could provide crucial cues for face analysis. In particular, the motion of facial local regions in facial expression is related to the motion of other facial regions. In this paper, a novel deep learning approach which exploits the relations of facial local dynamics has been proposed to estimate facial traits from expression sequence. In order to exploit the relations of facial dynamics in local regions, the proposed network consists of a facial local dynamic feature encoding network and a facial relational network. The facial relational network is designed to be interpretable. Relational importance is automatically encoded and facial traits are estimated by combining relational features based on the relational importance. The relations of facial dynamics for facial trait estimation could be interpreted by using the relational importance. By comparative experiments, the effectiveness of the proposed method has been validated. Experimental results show that the proposed method outperforms the state-of-the-art methods in gender and age estimation.
1 Introduction
Analysis of human face has been an important task in computer vision because it plays a major role in soft biometrics, security, human-computer interaction, and social interactions [31, 7]. Facial behavior is known to benefit perception of the identity [32, 30]. In particular, facial dynamics have crucial roles for improving the accuracy of facial trait estimation such as age estimation or gender classification [9, 6].
With the recent progress of deep learning, convolutional neural networks (CNNs) have shown outstanding performance in many fields of computer vision. Several research efforts have been devoted to developing spatio-temporal feature representations for applications such as action recognition [20, 39, 13, 22] and activity parsing [41, 25]. In particular, a long short-term memory (LSTM) network has been designed on top of CNN features to encode dynamics in video. The LSTM network is a variant of the recurrent neural network (RNN) that is designed to capture long-term temporal information in sequential data. By using the LSTM, the temporal correlation of CNN features was effectively encoded.
Recently, a few research efforts have been made regarding facial dynamic feature encoding for facial analysis [9, 24, 6, 23]. It has been observed that facial dynamics are important for highly discriminative feature representation in authentication. Moreover, it is known that the dynamic features of local regions are valuable for facial trait estimation [9, 6]. Usually, the motion of a facial local region in a facial expression is related to the motion of other facial regions [38, 42]. However, to the best of our knowledge, no studies have utilized the latent relations of dynamic features for facial trait estimation.
In this paper, a novel deep network has been proposed for estimating facial traits by utilizing relations of facial dynamics as summarized in Figure 1. In order to utilize the relations of facial dynamics in local regions, the proposed network consists of a facial local dynamic feature encoding network and a facial relational network. By the facial relational network, the importance of relations in face analysis is learned to estimate facial traits. The main contributions of this study are summarized in following three aspects:
(1) We propose a novel deep network structure which could estimate facial traits with interpretation on relations of facial dynamics.
(2) The relational importance is devised to consider the importance of relational features. The relational importance is encoded from the relational features in unsupervised manner. The facial trait estimation is conducted by combining the relational features based on the relational importance. Moreover, the relational importance is used for interpretation of relations in facial trait estimation.
(3) Validation of the proposed method has been conducted on two facial trait estimation problems (i.e. age estimation and gender classification). By considering the locational information and importance of relations, the proposed method could accurately estimate facial traits compared with the state-of-the-art methods.
The rest of the paper is organized as follows. In section 2, we survey the related work, including facial trait estimation, facial dynamic analysis, and relational networks. In section 3, the proposed interpretable facial relational network is explained. Comparative experiments and results are presented in section 4. The conclusions are drawn in section 5.
2 Related Work
Face based gender and age estimation. Many research efforts have been devoted to the development of automatic age estimation techniques from face images [15, 2, 37, 27, 4, 21, 26, 40]. Discriminative feature extraction is one of the key issues for successful automatic age estimation. For example, biologically-inspired aging features (BIF) and intensity-based features (IEF), the latter developed by a learning-based encoding method, were proposed for age estimation. For gender classification, various approaches such as the scale-invariant feature transform (SIFT), local binary patterns (LBP), and semi-supervised discriminant analysis have been proposed.
Recently, deep learning methods have shown notable potential in various face-related computer vision tasks, such as face verification, face recognition, and face analysis. One of the main focuses of these methods is designing a suitable deep network structure for a specific task. CNN-based face verification has achieved a lower error rate than human performance. Parkhi et al. reported a VGG-style CNN learned from large-scale static face images. Deep learning based age estimation and gender classification methods have also been reported, but they were designed for static face images [21, 26, 40].
Facial dynamic analysis. The temporal dynamics of the face have been largely ignored in both age estimation and gender classification. Recent studies have reported that facial dynamics could be an important cue for facial trait estimation [8, 10, 9, 6]. With aging, the face loses muscle tone and underlying fat tissue, which creates wrinkles and sunken eyes and increases crow's feet around the eyes. As bone mass is reduced, the size of the lower jaw is reduced. It is also observed that the length of the nose increases with cartilage growth. Aging also affects facial dynamics along with appearance. As a person gets older, the elastic fibers of the face show fraying. Therefore, the dynamic features of local facial regions are important cues for age estimation. Volume LBP (VLBP) features were developed to describe spatio-temporal information in videos and to conduct age classification. However, the VLBP features were not powerful enough due to their limitations in dynamic modeling. Dibeklioğlu et al. proposed dynamic descriptors for encoding 3D volume changes via surface patches. Various descriptors such as frequency, facial asymmetry, duration, amplitude, speed, and acceleration were used to describe the dynamic characteristics of each facial local region (eyebrow, eyelid, eye-sides, cheek, mouth-sides, mouth, and chin). In cognitive-psychological studies [5, 17, 34, 1], evidence for gender dimorphism in human expression has been reported. Females express emotions more frequently than males. Males have a tendency to show restricted emotions and to be unwilling to self-disclose intimate feelings. Dantcheva et al. used dynamic descriptors (duration, amplitude, speed, and acceleration) extracted from facial landmarks. However, these studies simply combined the dynamic features without considering the relations of the dynamics. No studies have learned the relations of dynamic features for facial trait estimation.
Relational network. To enable relational reasoning in a neural network, Santoro et al. proposed a relational network for visual question answering (VQA). The authors defined an object as a neuron on the feature map obtained from a CNN and designed a neural network for relational reasoning. However, it was designed for image-based VQA and did not consider relational importance. In this paper, inspired by the fact that faces can be automatically aligned based on landmarks, the locational information is also utilized in the relational network. Moreover, the proposed facial relational network automatically encodes the importance of relations by considering the locational information of object features (facial local dynamic features and locational features). The proposed facial relational network analyzes facial traits based on the relations of object features.
3 Proposed Interpretable Facial Relational Network
Overall structure of the proposed interpretable facial relational network is shown in Figure 2. The aim of the proposed method is to estimate facial traits (i.e. age or gender) by using a series of face images of subject with facial dynamics during facial expression. The proposed method makes a decision by extracting latent relations of facial dynamics from facial local regions. The proposed method largely consists of the facial local dynamic feature encoding network, the facial relational network, and interpretation on relations. The details are described in following subsections.
3.1 Facial Local Dynamic Feature Encoding Network
Given a face sequence, appearance features are computed by a CNN on each frame. For appearance feature extraction, we employ the VGG-face network, which is trained on large-scale face images. The output of a convolutional layer of the VGG-face network is used as the feature map of the facial appearance representation.
Based on the feature map, the face is divided into 9 local regions as shown in Figure 3. Note that each face sequence is automatically aligned based on landmark detection. The local regions on the feature map (i.e. local appearance features) are used for the local dynamic modeling. Let $x_i^t$ denote the local appearance features of the i-th facial local part at the t-th time step. In order to encode local dynamic features, an LSTM network is devised on top of the local appearance features as follows:
$$d_i = f_{\phi_{LSTM}}(x_i^1, x_i^2, \ldots, x_i^T) \quad (1)$$
where $f_{\phi_{LSTM}}$ is a function with learnable parameters $\phi_{LSTM}$. $T$ denotes the length of the face sequence. $d_i$ denotes the facial local dynamic feature of the i-th local part. Because the length of each face sequence is different (i.e. $T$ differs between face sequences), the LSTM is used for facial local dynamic feature encoding, as it can deal with sequences of different lengths. Various dynamics-related features, including variation of appearance, duration, amplitude, speed, and acceleration, could be encoded from the sequence of local appearance features in the LSTM network. The detailed configuration of the network used in the experiments is presented in Section 4.1.
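To illustrate why a recurrent encoder yields a fixed-size feature regardless of the sequence length T, the following is a minimal sketch of a scalar LSTM cell in plain Python. It is not the network used in the paper: the scalar state, the gate weights in `WEIGHTS`, and the toy input sequences are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h + b)

    i = gate("input", sigmoid)    # how much new information to write
    f = gate("forget", sigmoid)   # how much of the old cell state to keep
    o = gate("output", sigmoid)   # how much of the cell state to expose
    g = gate("cell", math.tanh)   # candidate cell update
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def encode_sequence(xs, w):
    """Run the cell over a sequence of any length; return the last hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, w)
    return h


# Hypothetical gate weights (a trained network would learn these).
WEIGHTS = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, -0.1, 0.0),
    "cell": (0.8, 0.3, 0.0),
}

short_seq = [0.1, 0.4, 0.9]                      # T = 3
long_seq = [0.1, 0.2, 0.4, 0.7, 0.9, 0.6, 0.3]   # T = 7
d_short = encode_sequence(short_seq, WEIGHTS)
d_long = encode_sequence(long_seq, WEIGHTS)
```

Both sequences are mapped to a single value of the same size, which is the property the encoding network relies on; in the paper the state is a 1024-dimensional vector rather than a scalar.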
3.2 Facial Relational Network
We extract object features (i.e. dynamic features of facial local regions and locational features) for pairs of objects. The locational features are defined as the central position of the object (i.e. facial local region). In order to provide the location information of the objects to the facial relational network, the dynamic features and locational features are embedded and defined as object features $o_i$. The object feature can be written as
$$o_i = [d_i, p_i, q_i] \quad (2)$$
where $d_i$ denotes the facial local dynamic feature of the i-th object and $(p_i, q_i)$ denotes the normalized central position of the i-th object.
The proposed facial relational network is in line with the idea of relational reasoning proposed by Santoro et al. The design philosophy of the proposed facial relational network is to build a neural network whose functional form captures the core relations for facial trait estimation. The importance of a relation could differ from one pair of object features to another. The proposed facial relational network is therefore designed to consider relational importance in facial trait estimation. Moreover, the importance of the relations could be used for interpreting the relations of facial dynamics in facial trait estimation.
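Let $\alpha_{i,j}$ denote the relational importance between the i-th and j-th object features. The relational feature, which represents the latent relation of two objects for face analysis, can be written as
$$r_{i,j} = g_{\phi_R}(s_{i,j}) \quad (3)$$
where $g_{\phi_R}$ is a function with learnable parameters $\phi_R$. $s_{i,j} = (o_i, o_j)$ is the relation pair from the i-th and j-th facial local parts. $S = \{s_{1,2}, \ldots, s_{i,j}, \ldots, s_{(N_o-1),N_o}\}$ is the set of relation pairs, where $N_o$ denotes the number of objects in the face. $o_i$ and $o_j$ denote the i-th and j-th object features, respectively. The relational importance for the relation of two object features ($o_i$, $o_j$) is encoded as follows:
$$\alpha_{i,j} = h_{\phi_\alpha}(r_{i,j}) \quad (4)$$
where $h_{\phi_\alpha}$ is a function with learnable parameters $\phi_\alpha = \{W, b\}$, the weights and bias of a fully-connected layer. In this paper, $h_{\phi_\alpha}$ is defined as
$$h_{\phi_\alpha}(r_{i,j}) = \frac{\exp(W r_{i,j} + b)}{\sum_{s_{m,n} \in S} \exp(W r_{m,n} + b)} \quad (5)$$
The aggregated relational features are represented by
$$f_{agg} = \sum_{s_{i,j} \in S} \alpha_{i,j} \, r_{i,j} \quad (6)$$
Finally, the facial trait estimation can be performed with
$$\hat{y} = k_{\phi_k}(f_{agg}) \quad (7)$$
where $\hat{y}$ denotes the estimated result and $k_{\phi_k}$ is a function with parameters $\phi_k$. $g_{\phi_R}$ and $k_{\phi_k}$ are implemented by multi-layer perceptrons (MLP).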
Note that the proposed facial relational network becomes interpretable by utilizing the relational importance. As in Eq. (4), the proposed facial relational network adaptively encodes the relational importance of two objects from the given pair of object features. The relations which are important for facial trait estimation are obtained based on the relational importance. In other words, the proposed facial relational network combines the relational features according to the relational importance for facial trait estimation.
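The weighting scheme described in this subsection can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: `relational_feature` is a hypothetical stand-in for the learned MLP, the importance is taken as a softmax over all object pairs (Section 4.1 mentions a fully-connected layer followed by softmax), and the object features and weights are toy values.

```python
import math
from itertools import combinations


def relational_feature(o_i, o_j):
    """Hypothetical stand-in for the learned relational function (an MLP in the paper)."""
    return [a * b for a, b in zip(o_i, o_j)]


def importance_scores(relational_features, weights, bias=0.0):
    """Linear score per pair followed by a softmax over all pairs (assumed form)."""
    logits = [sum(w * r for w, r in zip(weights, feat)) + bias
              for feat in relational_features]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(relational_features, alphas):
    """Importance-weighted sum of the relational features."""
    dim = len(relational_features[0])
    return [sum(a * feat[k] for a, feat in zip(alphas, relational_features))
            for k in range(dim)]


# Three toy object features (the paper uses 9 objects with much larger features).
objects = [[0.2, 0.9, 0.1], [0.7, 0.3, 0.5], [0.4, 0.4, 0.8]]
pairs = list(combinations(range(len(objects)), 2))
features = [relational_feature(objects[i], objects[j]) for i, j in pairs]
alphas = importance_scores(features, weights=[1.0, -0.5, 2.0])
f_agg = aggregate(features, alphas)
```

Because the importance values are positive and sum to one, the aggregated feature is a convex combination of the pairwise relational features, and each value in `alphas` can be read directly as the share that a pair contributes.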
3.3 Interpretation on Relations
The proposed method is useful for interpreting the relations in dynamic face analysis. The relational importance calculated in Eq. (4) is utilized to interpret the relations. Note that high relational importance values mean that the relational features of the corresponding facial local parts are important for estimating facial traits. Basically, the relational importance represents the importance of the relation of two objects. It can be extended to interpreting the importance of relations among more than two objects by combining the relational importance values of pairs of objects. For example, the importance of the relation of "A, B, and C" is represented by combining the relational importance values for the relation of "A and B", the relation of "B and C", and the relation of "A and C". Therefore, the relational importance of two objects could be used for interpreting relations of multiple objects. The pseudocode for calculating the relational importance of multiple objects is given in Algorithm 1. By analyzing the relational importance, important relations for estimating facial traits could be explained. In Sections 4.2.2 and 4.3.2, we discuss the important relations for age estimation and gender classification, respectively.
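The extension from pairs to groups of objects can be sketched as follows. The text does not state how the pairwise values are combined, so this sketch assumes a plain sum; the function names and the toy importance values are hypothetical.

```python
from itertools import combinations


def group_importance(pair_importance, members):
    """Combine the pairwise importance of every pair inside the group (sum is an assumption)."""
    return sum(pair_importance[frozenset(p)] for p in combinations(members, 2))


def rank_groups(pair_importance, num_objects, group_size=3):
    """Score every group of the given size and sort from most to least important."""
    scored = [(group_importance(pair_importance, group), group)
              for group in combinations(range(num_objects), group_size)]
    return sorted(scored, reverse=True)


num_objects = 9  # the paper divides the face into 9 local regions
pairs = list(combinations(range(num_objects), 2))

# Toy values: uniform importance, with one pair made artificially more important.
pair_importance = {frozenset(p): 1.0 / len(pairs) for p in pairs}
pair_importance[frozenset((0, 1))] += 0.1

ranked = rank_groups(pair_importance, num_objects)
top_score, top_group = ranked[0]
```

With 9 objects there are 36 pairs and 84 groups of three, which matches the "84 relations of three objects" reported in the experiments.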
4 Experiments
4.1 Experimental Settings
Database. To evaluate the effectiveness of the proposed facial relational network for facial trait estimation, two kinds of experiments were conducted. The ability to estimate age and to classify gender was assessed because age and gender are known to be representative facial traits. The public UvA-NEMO Smile database was used for both tasks. The UvA-NEMO Smile database is known as the largest smile database and was collected to analyze the dynamic characteristics of smiles for different ages. The database consists of 1,240 smile videos collected from 400 subjects. Among the 400 subjects, 185 are female and the remaining 215 are male. The ages of the subjects range from 8 to 76 years. For evaluating the performance of age estimation, we followed the experimental protocol defined in previous work on this database. A 10-fold cross-validation scheme was used to evaluate the performance of the proposed method. The folds were divided such that there was no subject overlap between them. The parameters of the deep network were trained on 9 folds and the remaining fold was used only as a test set for evaluating the performance. For evaluating the performance of gender classification, we likewise followed the experimental protocol defined in previous work.
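A subject-disjoint fold assignment of the kind described above can be sketched as follows. The actual fold membership is defined by the cited protocol; the round-robin assignment, the function name, and the synthetic subject identifiers here are assumptions for illustration only.

```python
def subject_disjoint_folds(video_subjects, num_folds=10):
    """Assign videos to folds so that all videos of a subject share one fold."""
    subjects = sorted(set(video_subjects))
    # Round-robin over subjects (an assumption; any subject-level split works).
    fold_of_subject = {s: idx % num_folds for idx, s in enumerate(subjects)}
    folds = [[] for _ in range(num_folds)]
    for video_idx, subject in enumerate(video_subjects):
        folds[fold_of_subject[subject]].append(video_idx)
    return folds


# Synthetic example: 124 videos recorded from 40 hypothetical subjects.
video_subjects = [f"subject_{v % 40:02d}" for v in range(124)]
folds = subject_disjoint_folds(video_subjects)

# Each fold serves once as the test set; the other nine are used for training.
test_fold = folds[0]
train_videos = [v for fold in folds[1:] for v in fold]
```

Splitting at the subject level rather than the video level prevents the network from being tested on a person it has already seen during training.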
Evaluation metric. For age estimation, the mean absolute error (MAE) was utilized for evaluation. The MAE measures the error between the predicted age and the ground-truth age. The MAE was computed as follows:
$$\epsilon = \frac{1}{N_{test}} \sum_{n=1}^{N_{test}} |\hat{y}_n - y_n|$$
where $\hat{y}_n$ and $y_n$ denote the predicted age and the ground-truth age of the n-th test sample, respectively. $N_{test}$ denotes the number of test samples. For gender classification, classification accuracy was used for evaluation. We report the MAE and classification accuracy averaged over all test folds.
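As a small worked example of the metric, the MAE can be computed as below. The predicted and ground-truth ages are made-up values, not results from the paper.

```python
def mean_absolute_error(predicted, truth):
    """Average absolute difference between predicted and ground-truth values."""
    if not predicted or len(predicted) != len(truth):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(predicted)


def average_over_folds(fold_scores):
    """Average a per-fold metric over all test folds."""
    return sum(fold_scores) / len(fold_scores)


# Made-up predictions and ground-truth ages for four test samples.
predicted_ages = [23.5, 41.0, 9.0, 70.0]
true_ages = [25, 38, 10, 66]
mae = mean_absolute_error(predicted_ages, true_ages)  # (1.5 + 3 + 1 + 4) / 4 = 2.375
```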
|Holistic dynamic approach|-|-|4.41|3.64|4.02|
|Facial relational network|-|-|4.25|3.87|4.06|
Implementation details. The face images used in the experiments were automatically aligned based on the two eye locations detected by facial landmark detection. The face images were cropped and resized to 96×96 pixels. For the appearance representation, the first 10 convolutional layers and 4 max-pooling layers of the VGG-face network were used. As a result, a feature map of size 6×6×512 was obtained from each face image. Each facial local region was defined on the feature map with a size of 2×2×512 as shown in Figure 3 (b). In other words, there were 9 objects in each face sequence. A fully-connected layer with 1024 units and stacked LSTM layers were used for the facial local dynamic feature encoding function. We stacked two LSTMs and each LSTM had 1024 memory cells. A two-layer MLP consisting of 4096 units per layer (with dropout and ReLU) was used for the relational feature function. The relational importance function was implemented by a fully-connected layer and a softmax function. A two-layer MLP consisting of 2048 and 1024 units (with 30% dropout, ReLU, and batch normalization) and a fully-connected layer (consisting of 1 neuron for age estimation and 2 neurons for gender classification) were used for the facial trait estimation function. The mean squared error was used for training the deep network in age estimation. The cross-entropy loss was used for training the deep network in gender classification.
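The region layout implied by these sizes (a 6×6 feature map split into 2×2 regions, giving 9 objects) can be sketched as follows, together with the normalized central positions used as locational features. The non-overlapping tiling, the row-major ordering, and the dictionary layout are assumptions; the mapping of regions to named face parts follows Figure 3 and is not reproduced here.

```python
def local_regions(height=6, width=6, size=2):
    """Tile a feature map into non-overlapping size x size regions (assumed layout)."""
    regions = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            # Central position normalized to [0, 1] along each axis.
            center = ((top + size / 2) / height, (left + size / 2) / width)
            regions.append({
                "rows": (top, top + size),
                "cols": (left, left + size),
                "center": center,
            })
    return regions


regions = local_regions()
middle = regions[4]  # the region at the centre of the 3 x 3 grid
```

Each entry gives the slice of the feature map that feeds one local LSTM, and its `center` is the kind of normalized position that would be appended to the dynamic feature to form an object feature.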
4.2 Age Estimation
|Method|MAE (years)|
|BIF + Dynamics|5.03|
|IEF + Dynamics (General)|4.33|
|IEF + Dynamics (Spontaneity-specific)| |
|Holistic dynamic approach|4.02|
|Proposed facial relational network|3.87|
4.2.1 Assessment of Facial Relational Network for Age Estimation
First, we evaluated the effectiveness of the relational importance and locational features for age estimation. Table 1 shows the MAE of the facial relational network with locational features and relational importance. In the holistic dynamic approach, appearance features were extracted by the same VGG-face network used in the proposed method, and the dynamic features were encoded on the holistic appearance features without dividing the face into local parts. As shown in the table, the locational features improved the performance of age estimation by informing the facial relational network of the locations of the object pairs. The locational features of the objects were meaningful because the objects of the face sequence were automatically aligned by facial landmark detection. Moreover, by utilizing both the relational importance and the locational features, the proposed facial relational network achieved the lowest MAE of 3.87 over the whole test set. This was mainly because the importance of relations for age estimation differs between relations, so considering the importance of the relational features improved the accuracy of age estimation. Interestingly, age estimation was more accurate for female subjects. A likely reason is that beards made age estimation more difficult for male subjects.
In order to assess the effectiveness of the proposed facial relational network (with locational features and relational importance), the MAE of the proposed method was compared with state-of-the-art methods (see Table 2). The VLBP, displacement, BIF, BIF with dynamics, IEF, IEF with dynamics, spontaneity-specific IEF with dynamics, and holistic approaches were compared. In the spontaneity-specific IEF with dynamics, the spontaneity of the smile was first classified, and separate regressors were trained for spontaneous and posed smiles. As shown in the table, the proposed method achieved the lowest MAE without considering spontaneity. This was mainly attributed to the fact that the proposed method encoded the latent relational features from the object features and effectively combined the relational features based on the relational importance.
4.2.2 Interpreting Facial Relational Network in Age Estimation
In order to understand the mechanism of the proposed facial relational network in age estimation, the relational importance calculated from each sequence was analyzed. Figure 4 shows the important relations, i.e. the pairs with high relational importance values. We show how the important regions differ over ages by presenting the important relations of each age group. Ages were divided into five age groups (8-12, 13-19, 20-36, 37-65, and 66+) following previous work. To interpret the facial relational network, the relational importance values encoded from the test set were averaged within each age group. Four groups are visualized in the figure with example face images (note that there was no subject in the age group of [8-12] whose images were permitted for reporting). As shown in the figure, when estimating the age group of [66+], the relation of the two eye regions was important. The relation of the two eye regions could represent dynamic features related to crow's feet and sunken eyes, which could be important factors for estimating the ages of older people. In addition, when considering three objects, the relation of left eye, right eye, and left cheek had the highest relational importance in the age group of [66+]. The relational importance tended to be symmetric. For example, the relation of left eye, right eye, and right cheek was among the top-5 relations with the highest relational importance out of 84 relations in the age group of [66+].
In addition, to verify the effect of important relations, we perturbed the dynamic features as shown in Figure 5. For the sequence of a 17-year-old subject, we replaced the local dynamic features of the left cheek region with those of a 73-year-old subject. Note that the cheek formed important pairs for estimating the age group of [13-19] as shown in Figure 4 (a). With this perturbation, the absolute error changed from 0.41 to 2.38. In the same way, we replaced the dynamic features of two other regions (left eye and right eye) one at a time. These two regions formed relatively less important relations and resulted in absolute errors of 1.40 and 1.89 (left eye and right eye, respectively). The increase of the absolute error was smaller than in the case of the perturbation on the left cheek. This shows that, in the age group of [13-19], relations with the left cheek were more important for estimating age than relations with the left eye and right eye.
|Perturbation location|MAE (standard error)|
|Not contaminated|5.05 (0.28)|
|Less important parts (left eye, forehead, right eye)|7.56 (0.35)|
|Important part (right cheek)|9.00 (0.41)|
For the same sequence, the facial relational network without the use of relational importance was also analyzed. For the facial relational network without the use of relational importance, the absolute error of the estimated age was increased by perturbation on the local dynamic features of the left cheek from 1.20 to 7.45. When conducting perturbation on the left eye and the right eye, the absolute errors were 1.87 and 4.21, respectively. The increase of absolute error became much larger when conducting perturbation on the left cheek. Moreover, the increase of error was larger when the facial relational network did not use relational importance. In other words, the facial relational network with the relational importance was more robust to feature contamination because it adaptively encoded the relational importance from the relational features as in Eq. (4).
In order to statistically analyze the effect of contaminated features in the proposed facial relational network, we also evaluated the MAE when the dynamic features of individual facial local parts were replaced with zero values, as shown in Figure 6. For 402 subjects who were included in age group of [37-66] in the UvA-NEMO database, the MAE was calculated as shown in Table 3. As shown in the table, the perturbation on the most important facial region (i.e. right cheek in the age group of [37-66]) influenced the accuracy of age estimation more than the perturbation on less important parts (i.e. left eye, forehead, and right eye in the age group of [37-66]). The difference in MAE between the perturbation on the important part and the perturbation on the less important parts was statistically significant (p<0.05).
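The zero-value perturbation and the "MAE (standard error)" figures of Table 3 can be illustrated with the sketch below. The helper names and the toy numbers are hypothetical, and the standard error is assumed to be the sample standard deviation of the absolute errors divided by the square root of the sample count.

```python
import math


def zero_region(region_features, region_index):
    """Return a copy of the per-region features with one region set to zeros."""
    perturbed = [list(features) for features in region_features]
    perturbed[region_index] = [0.0] * len(perturbed[region_index])
    return perturbed


def mae_with_standard_error(abs_errors):
    """Mean of the absolute errors and the standard error of that mean."""
    n = len(abs_errors)
    if n < 2:
        raise ValueError("at least two samples are needed for a standard error")
    mean = sum(abs_errors) / n
    variance = sum((e - mean) ** 2 for e in abs_errors) / (n - 1)
    return mean, math.sqrt(variance / n)


# Toy dynamic features for three regions of one sequence (hypothetical values).
clean = [[0.3, 0.8], [0.5, 0.1], [0.9, 0.4]]
contaminated = zero_region(clean, region_index=1)

# Toy absolute errors of three test samples.
mae, standard_error = mae_with_standard_error([1.0, 2.0, 3.0])
```

In an actual experiment, the perturbed features would be passed through the trained network and the absolute errors collected per test sample before computing the two statistics.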
4.3 Gender Classification
|Classification Accuracy (%)|
|Holistic dynamic approach|-|-|87.10|
|Facial relational network|-|-|88.79|
4.3.1 Assessment of Facial Relational Network for Gender Classification
We also evaluated the effectiveness of the relational importance and locational features for gender classification. The classification accuracy of the facial relational network with relational importance and locational features is summarized in Table 4. In the holistic dynamic approach, appearance features were extracted by the same VGG-face network used in the proposed method, and the dynamic features were encoded on the holistic appearance features. As shown in the table, the facial relational network methods achieved higher accuracy than the holistic dynamic approach. In addition, the relational network achieved the highest accuracy when using both locational features and relational importance. The locational features and the relational importance in the facial relational network were therefore also important for gender classification.
|Classification Accuracy (%)|
|how-old.net + dynamics (Tree)|60.80|93.46|N/A|N/A|N/A|
|how-old.net + dynamics (SVM)|N/A|N/A|60.80|92.89|N/A|
|COTS + dynamics (Tree)|76.92|93.00|N/A|N/A|N/A|
|COTS + dynamics (Bagged Trees, PCA)|N/A|N/A|76.92|92.89|N/A|
|Holistic dynamic approach|74.38|93.52|77.51|93.91|87.10|
|Proposed facial relational network|80.17|94.65|85.14|95.18|90.08|
Table 5 shows the classification accuracy of the proposed facial relational network compared with other methods. Two appearance-based approaches, "how-old.net" and "commercial off-the-shelf (COTS)", were combined with a hand-crafted dynamic approach for gender classification. How-old.net was a website (http://how-old.net/) launched by Microsoft for online age and gender recognition. Images could be uploaded, and age and gender labels were provided as output. COTS was a commercial face detection and recognition software, which included gender classification. The dynamic approach calculated dynamic descriptors of facial local regions, such as amplitude, speed, and acceleration. Note that these methods and the holistic dynamic approach did not consider the relations of the dynamic features of facial local regions. By exploiting the relations of local dynamic features, the proposed method achieved the highest accuracy on both spontaneous and posed sequences over all age ranges.
4.3.2 Interpreting Facial Relational Network in Gender Classification
In order to understand the mechanism of the facial relational network in gender classification, the relational importance values encoded from each sequence were analyzed. Figure 7 shows the relations with high relational importance values when classifying gender from a face sequence. As shown in the figure, the relation of forehead, nose, and mouth side was important for recognizing males. Note that the relational importance again tended to be symmetric. For recognizing males, the relation of forehead, nose, and right mouth side and the relation of forehead, nose, and left mouth side were the top-2 important relations among the 84 relations of three objects. For females, the relation of forehead, nose, and cheek was important. This could be related to the observation that females express emotions more frequently than males, while males have a tendency to show restricted emotions. In other words, females tend to make bigger smiles than males by using the muscles of the cheek regions. Therefore, the relations of the cheek and other face parts were important for recognizing females.
5 Conclusions
According to cognitive-psychological studies, facial dynamics could provide crucial cues for face analysis. The motion of a facial local region during a facial expression is known to be related to the motion of other facial regions. In this paper, a novel deep learning approach which exploits the relations of facial local dynamics was proposed to estimate facial traits from the smile expression. In order to utilize the relations of facial dynamics in local regions, the proposed network was designed to consist of the facial local dynamic feature encoding network and the facial relational network. To consider the different importance of relations in face analysis, the proposed relational network adaptively encoded the relational importance and effectively combined the relational features with respect to the relational importance. By comparative experiments, the effectiveness of the proposed method was validated for facial trait estimation. The proposed method estimated facial traits (age and gender) more accurately than the state-of-the-art methods. Moreover, the relations of facial dynamics were interpreted by using the relational importance.
- [1] R. B. Adams Jr, U. Hess, and R. E. Kleck. The intersection of gender-related facial appearance and facial displays of emotion. Emotion Review, 7(1):5–13, 2015.
- [2] F. Alnajar, C. Shan, T. Gevers, and J.-M. Geusebroek. Learning-based encoding with soft assignment for age estimation under unconstrained imaging conditions. Image and Vision Computing, 30(12):946–953, 2012.
- [3] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1866, 2014.
- [4] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela. Robust gender recognition by exploiting facial attributes dependencies. Pattern Recognition Letters, 36:228–234, 2014.
- [5] E. Cashdan. Smiles, speech, and body posture: How women and men display sociometric status and power. Journal of Nonverbal Behavior, 22(4):209–228, 1998.
- [6] A. Dantcheva and F. Brémond. Gender estimation based on smile-dynamics. IEEE Transactions on Information Forensics and Security, 12(3):719–729, 2017.
- [7] A. Dantcheva, P. Elia, and A. Ross. What else does your biometric data reveal? A survey on soft biometrics. IEEE Transactions on Information Forensics and Security, 11(3):441–467, 2016.
- [8] M. Demirkus, M. Toews, J. J. Clark, and T. Arbel. Gender classification from unconstrained video sequences. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 55–62. IEEE, 2010.
- [9] H. Dibeklioğlu, F. Alnajar, A. A. Salah, and T. Gevers. Combining facial dynamics with appearance for age estimation. IEEE Transactions on Image Processing, 24(6):1928–1943, 2015.
- [10] H. Dibeklioğlu, T. Gevers, A. A. Salah, and R. Valenti. A smile can reveal your age: Enabling facial dynamics in age estimation. In Proceedings of the 20th ACM International Conference on Multimedia, pages 209–218. ACM, 2012.
- [11] H. Dibeklioğlu, A. A. Salah, and T. Gevers. Are you really smiling at me? Spontaneous versus posed enjoyment smiles. In European Conference on Computer Vision, pages 525–538. Springer, 2012.
- [12] H. Dibeklioğlu, A. A. Salah, and T. Gevers. Recognition of genuine smiles. IEEE Transactions on Multimedia, 17(3):279–294, 2015.
- [13] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
- [14] A. C. Gallagher and T. Chen. Understanding images of groups of people. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 256–263. IEEE, 2009.
- [15] G. Guo, G. Mu, Y. Fu, and T. S. Huang. Human age estimation using bio-inspired features. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 112–119. IEEE, 2009.
- [16] A. Hadid. Analyzing facial behavioral features from videos. Human Behavior Understanding, pages 52–61, 2011.
- [17] U. Hess, R. B. Adams Jr, and R. E. Kleck. Facial appearance, gender, and emotion expression. Emotion, 4(4):378, 2004.
- [18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- [20] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
- [21] F. Juefei-Xu, E. Verma, P. Goel, A. Cherodian, and M. Savvides. Deepgender: Occlusion and low resolution robust facial gender classification via progressively trained convolutional neural networks with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 68–77, 2016.
- [22] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
- [23] D. H. Kim, W. Baddar, J. Jang, and Y. M. Ro. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Transactions on Affective Computing, 2017.
- [24] S. T. Kim, D. H. Kim, and Y. M. Ro. Facial dynamic modelling using long short-term memory network: Analysis and application to face authentication. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on, pages 1–6. IEEE, 2016.
-  C. Lea, A. Reiter, R. Vidal, and G. D. Hager. Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pages 36–52. Springer, 2016.
-  S. Li, J. Xing, Z. Niu, S. Shan, and S. Yan. Shape driven kernel adaptation in convolutional neural network for robust facial traits recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 222–230, 2015.
-  E. Makinen and R. Raisamo. Evaluation of gender classification methods with automatically detected and aligned faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):541–547, 2008.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
-  K. S. Pilz, I. M. Thornton, and H. H. Bülthoff. A search advantage for faces learned in motion. Experimental Brain Research, 171(4):436–447, 2006.
-  D. Reid, S. Samangooei, C. Chen, M. Nixon, and A. Ross. Soft biometrics for surveillance: an overview. Machine learning: theory and applications. Elsevier, pages 327–352, 2013.
-  D. A. Roark, S. E. Barrett, M. J. Spence, H. Abdi, and A. J. O’Toole. Psychological and neural perspectives on the role of motion in face recognition. Behavioral and cognitive neuroscience reviews, 2(1):15–46, 2003.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.
-  R. W. Simon and L. E. Nath. Gender and emotion in the United States: Do men and women differ in self-reports of feelings and expressive behavior? American journal of sociology, 109(5):1137–1176, 2004.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
-  M. Toews and T. Arbel. Detection, localization, and sex classification of faces from arbitrary viewpoints and under occlusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1567–1581, 2009.
-  Y. Tong, W. Liao, and Q. Ji. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE transactions on pattern analysis and machine intelligence, 29(10), 2007.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
-  M. Uřičař, R. Timofte, R. Rothe, J. Matas, et al. Structured output svm prediction of apparent age, gender and smile from deep features. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW 2016), pages 730–738. IEEE, 2016.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
-  K. Zhao, W.-S. Chu, F. De la Torre, J. F. Cohn, and H. Zhang. Joint patch and multi-label learning for facial action unit and holistic expression recognition. IEEE Transactions on Image Processing, 25(8):3931–3946, 2016.
Appendix A
A.1 Structure of Facial Local Dynamic Feature Encoding Network
The facial local dynamic feature encoding network consists of the facial appearance feature representation network and the dynamic feature encoding network. The detailed structures of the networks are provided in Table 6 and Table 7. As shown in Table 6, the facial appearance feature representation network consists of 10 convolutional layers and 4 max-pooling layers. We used rectified linear units (ReLU) as the non-linear activation. The parameters of the 10 convolutional layers were transferred from the VGG-face network. As a result, a feature map was obtained from each face image. Each facial local region was defined on the feature map and used as input to the dynamic feature encoding network.
|Structure|Input size|Output size|
|max-pool, stride 2|
|max-pool, stride 2|
|max-pool, stride 2|
|max-pool, stride 2|
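The layer counts above (10 convolutional layers and 4 max-pooling layers) can be traced with a small shape calculator. This is only a sketch: it assumes the transferred layers follow the standard VGG-16 layout (groups of two, two, three and three 3×3 convolutions with padding 1, each group followed by a 2×2 max-pool with stride 2), and the 224×224 input resolution is a placeholder, not the face image size used in this work. The names `LAYERS`, `conv_out`, `pool_out` and `trace_shapes` are hypothetical.

```python
# Shape walk-through of a VGG-16-style stack truncated after the fourth max-pool.
# Assumptions (not from the text): 3x3 convolutions with padding 1 and stride 1,
# 2x2 max-pooling with stride 2, VGG-16 channel widths, hypothetical input size.

# Each entry is ("conv", out_channels) or ("pool",): 10 convolutions, 4 max-pools.
LAYERS = (
    [("conv", 64)] * 2 + [("pool",)]
    + [("conv", 128)] * 2 + [("pool",)]
    + [("conv", 256)] * 3 + [("pool",)]
    + [("conv", 512)] * 3 + [("pool",)]
)


def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pool along one dimension."""
    return (size - kernel) // stride + 1


def trace_shapes(height, width, channels=3):
    """Return the (channels, height, width) shape after every layer."""
    shapes = []
    for layer in LAYERS:
        if layer[0] == "conv":
            channels = layer[1]
            height, width = conv_out(height), conv_out(width)
        else:
            height, width = pool_out(height), pool_out(width)
        shapes.append((channels, height, width))
    return shapes


if __name__ == "__main__":
    # 224x224 is the usual VGG input size, used here only as a placeholder.
    for layer, shape in zip(LAYERS, trace_shapes(224, 224)):
        print(layer[0], shape)
```

Under these assumptions each max-pool halves the spatial resolution, so the final feature map is 16 times smaller than the input along each side; facial local regions would then be sub-windows of that map.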
For the dynamic feature representation, we made our LSTM network deep along the temporal dimension, with temporal recurrence of the hidden variables. The deep LSTM network was constructed by stacking multiple LSTM layers on top of each other, as shown in Table 7. The output of the second LSTM layer at the last time step was used as the facial local dynamic feature. Dropout of 50% was used for the fully-connected layer, and dropout of 20% was used for each LSTM layer.
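The stacked LSTM described above can be sketched as a plain NumPy forward pass: two LSTM layers are applied in sequence and the top layer's output at the last time step is returned as the local dynamic feature. Several details are assumptions rather than the authors' settings: the feature and hidden sizes are placeholders, the weights are random, and the 20% dropout is applied to each layer's output (the text does not say where inside the layer it is applied). The fully-connected layer with 50% dropout is omitted, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_size, hidden_size, rng):
    """Random weights for one LSTM layer; gate order is input, forget, cell, output."""
    scale = 1.0 / np.sqrt(hidden_size)
    return {
        "W": rng.uniform(-scale, scale, (4 * hidden_size, input_size)),
        "U": rng.uniform(-scale, scale, (4 * hidden_size, hidden_size)),
        "b": np.zeros(4 * hidden_size),
    }


def lstm_layer(xs, params):
    """Run one LSTM layer over xs of shape (T, input_size); return (T, hidden_size)."""
    hidden_size = params["U"].shape[1]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in xs:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
        h = sigmoid(o) * np.tanh(c)                   # update the hidden state
        outputs.append(h)
    return np.stack(outputs)


def dropout(x, rate, rng, training):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def encode_local_dynamics(xs, layers, rng, training=False, rate=0.2):
    """Stack LSTM layers and return the top layer's output at the last time step."""
    for params in layers:
        # Assumption: dropout is applied to the output sequence of each layer.
        xs = dropout(lstm_layer(xs, params), rate, rng, training)
    return xs[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, feat_dim, hidden = 10, 32, 16  # hypothetical sizes
    layers = [init_lstm(feat_dim, hidden, rng), init_lstm(hidden, hidden, rng)]
    frames = rng.standard_normal((T, feat_dim))  # stand-in for per-frame local features
    print(encode_local_dynamics(frames, layers, rng).shape)
```

Taking only the last time step means the whole expression sequence is summarized by a single vector per facial local region, which is the form a relational module operating on pairs of regions would need.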
|Input size||Output size|