Multi-term and Multi-task Affect Analysis in the Wild
Human affect recognition is an important factor in human-computer interaction. However, methods for in-the-wild data remain far from sufficient. In this paper, we introduce the affect recognition method that we submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2020 Contest. Because affective behaviors have features at different time scales, our approach generates features and averaged labels using short-term, medium-term, and long-term time windows over the video frames. We then build an affect recognition model for each time window and ensemble the models. In addition, we fuse the VA and EXP models, taking into account that Valence, Arousal, and Expression are closely related. The features were trained by gradient boosting, using the mean, standard deviation, maximum change width, and slope in each time window. We achieved a valence-arousal score of 0.495 and an expression score of 0.464 on the validation set.
Human emotion recognition is an important factor in human-computer interaction and is expected to contribute to a wide range of fields such as healthcare and learning. Many ways of representing human emotions have been studied, of which "categorical emotion classification" and "Valence-Arousal" are the most commonly used. For emotion categories, the six basic emotional expressions proposed by Ekman and Friesen are popular; Ekman et al. classify emotions as "anger, disgust, fear, happiness, sadness, surprise". Another way to represent emotions is the circumplex model developed by Russell. In the circumplex model, human emotions are mapped onto a two-dimensional plane spanned by two orthogonal axes, valence and arousal.
Recently, D. Kollias et al. have provided a large-scale in-the-wild dataset, Aff-Wild2, an extended version of Aff-Wild. The dataset consists of real videos with a wide range of content (different ages, ethnicities, lighting conditions, locations, image qualities, etc.) collected from YouTube. Multiple labels, such as 7 emotion classifications (6 basic emotional expressions + Neutral), Valence-Arousal, and Action Units (based on the Facial Action Coding System, FACS), have been annotated on the videos.
In this paper, we propose a fusion model that uses multiple time-scale features and different recognition tasks. Fig. 1 shows the framework of the fusion model. Given the videos, facial and pose features are extracted. These features are then converted into multi-term features computed over short-term, medium-term, and long-term time windows. A model for a single recognition task (Valence, Arousal, or Expression) is constructed by ensembling the models built from each set of term features. Finally, the predictive model is generated by fusing the models of the other recognition tasks.
II Related Work
Estimating not only the occurrence of emotions but also their intensity has been studied for many years. Recently, Van Thong Huynh et al. proposed a method that accurately estimates engagement, which is strongly related to emotion, as a regression problem by ensembling Action Unit features obtained from OpenFace and image features obtained from ResNet50 in the Emotion Recognition in the Wild Challenge (EmotiW 2019). Similarly, Zhiguang Zhou et al. accurately estimate the engagement regression problem by ensembling Action Unit features from OpenFace and posture features from OpenPose. Ensembling weak models is one of the effective ways to estimate emotions.
Also, Nigel Bosch, Sidney D'Mello et al. investigated the relationship between time windows and classification performance in emotion classification, and showed that different emotions are best recognized with different time windows (e.g., "Delighted" performs well with a short time window, while "Confused" performs well with a long time window).
III Proposed Method
In this section, we introduce our proposed method, which combines multiple time scales and multiple recognition tasks. The method consists of pre-processing, a multi-term model, and multi-task model fusion.
III-A Visual Data Pre-processing
First, as shown in Fig. 2, facial expression features and posture features are extracted from each video. There are two types of facial features: one obtained from OpenFace and the other obtained from ResNet50 or EfficientNet. From OpenFace, 49 dimensions consisting of Action Unit intensity (17 dimensions), Action Unit occurrence (18 dimensions), head pose (6 dimensions), and gaze features (8 dimensions) are acquired as the features F1. From ResNet50, after acquiring the 2048-dimensional image features F2, the features reduced to 200 dimensions by principal component analysis (PCA) are obtained as F2'. From EfficientNet, after acquiring the 2048-dimensional image features F2, the features reduced to 300 dimensions by PCA are obtained as F2'. The posture features are obtained from OpenPose: 25 keypoints x 3 axes = 75-dimensional skeleton features F3.
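As an illustration of the F2 → F2' step, the sketch below reduces 2048-dimensional image features to 200 dimensions with scikit-learn's PCA. The random input array is a stand-in for per-frame ResNet50 embeddings; the data and the use of scikit-learn are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for per-frame ResNet50 embeddings (hypothetical data):
# n_frames x 2048, matching the description of feature F2.
f2 = rng.normal(size=(500, 2048))

# Reduce F2 to 200 dimensions (F2'), as described for the ResNet50 branch.
pca = PCA(n_components=200)
f2_reduced = pca.fit_transform(f2)
print(f2_reduced.shape)  # (500, 200)
```

For the EfficientNet branch, the same reduction would target 300 components instead.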
Next, features over multiple time windows are computed from F1, F2', and F3, since we consider that changes in emotions and facial expressions are characterized by different time windows (for example, opening the mouth in a yawn is characterized by a long time window, while raising an eyebrow in surprise is characterized by a short time window). There are three types of time windows: short-term, middle-term, and long-term. The features of each time window (Fs, Fm, Fl) consist of the following:
Mean
Standard deviation
Maximum change width (maximum value - minimum value)
Slope (using the least squares method)
Similarly, for each time window, a label (Ls, Lm, Ll) is generated from the annotations. The label generation method depends on the target, as follows:
Valence, Arousal: average value of the annotations
Expression: mode of the annotations
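The per-window statistics and labels above can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names, and it uses non-overlapping windows for simplicity (the paper's actual windows are generated with a 0.2-second shift).

```python
import numpy as np

def window_features(x, win):
    """Per-window statistics used in the paper: mean, std, max-range, slope."""
    feats = []
    for start in range(0, len(x) - win + 1, win):
        seg = x[start:start + win]
        slope = np.polyfit(np.arange(win), seg, 1)[0]  # least-squares slope
        feats.append([seg.mean(), seg.std(), seg.max() - seg.min(), slope])
    return np.array(feats)

def window_labels_va(y, win):
    """Valence/arousal label per window: mean of the frame annotations."""
    return np.array([y[s:s + win].mean() for s in range(0, len(y) - win + 1, win)])

def window_labels_expr(y, win):
    """Expression label per window: mode of the frame annotations."""
    return np.array([np.bincount(y[s:s + win]).argmax()
                     for s in range(0, len(y) - win + 1, win)])

signal = np.array([0.0, 1.0, 2.0, 3.0])
print(window_features(signal, 2))                          # mean, std, range, slope
print(window_labels_va(np.array([0.2, 0.4, 0.6, 0.8]), 2)) # [0.3, 0.7]
print(window_labels_expr(np.array([0, 0, 1, 1, 1, 2]), 3)) # [0, 1]
```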
III-B Data Balancing
It is important to address the data imbalance problem. In the Expression annotations of the Aff-Wild2 dataset, over 60% of frames are Neutral, while Anger and Fear each account for only about 1%. In Valence-Arousal, more than 23% of the data is concentrated in the range Valence: 0 to 0.25 and Arousal: 0 to 0.25. We therefore balanced the data. Fig. 3 shows the Expression balancing results, and Fig. 4 shows the Valence-Arousal balancing results. For Expression, the data is balanced by halving the number of Neutral samples and duplicating the samples of the other emotions. For Valence-Arousal, after dividing the space into a total of 64 areas (8 Valence divisions x 8 Arousal divisions), the data is balanced by halving the data in the central area and duplicating the data in the other areas.
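The Expression balancing rule (halve Neutral, duplicate the other classes) can be sketched with pandas on hypothetical data; the class proportions below are invented for illustration, and the Valence-Arousal case would follow the same pattern over the 8x8 grid cells.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical windowed samples with expression labels (0 = Neutral).
df = pd.DataFrame({
    "label": rng.choice(7, size=1000,
                        p=[0.6, 0.01, 0.01, 0.2, 0.1, 0.05, 0.03])
})

neutral = df[df.label == 0].sample(frac=0.5, random_state=0)  # halve Neutral
others = df[df.label != 0]
balanced = pd.concat([neutral, others, others])               # duplicate the rest
print(balanced.label.value_counts(normalize=True))
```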
III-C Multi-term Model
The structure of the multi-term model is shown in Fig. 5. First, a single-term model is generated using Ft and Lt, the time-window features and label described in the previous section (t denotes the target time window). In the single-term model, the feature Ft is divided into the AU feature Ft-au, head-pose feature Ft-head, gaze feature Ft-gaze, OpenPose feature Ft-pose, and ResNet50 feature Ft-rnet or EfficientNet feature Ft-enet, and an estimation model is generated for each. This is because generating a model for each feature type with different characteristics and then ensembling the models improves the final performance, especially in emotion estimation. The labels for Valence and Arousal use the values averaged over each time window; in other words, each single-term model estimates the short-term, middle-term, or long-term trend of Valence and Arousal. Then, Msingle-task, the estimation model for a single task, is generated by ensembling the single-term models (Ms, Mm, Ml) for short-term, middle-term, and long-term. Here, "task" denotes one of the three recognition tasks in this paper: Valence, Arousal, and Expression. The label uses the short-term value, because it is comparable to the frame-level data.
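One common way to realize this kind of ensembling is stacking: a second-stage regressor is trained on the predictions of the term-level models. The sketch below illustrates the idea on synthetic data with scikit-learn's Ridge as a stand-in second stage; the paper itself trains its models with LightGBM, and the noise levels here are invented.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 400
# Hypothetical per-window targets (e.g. averaged valence labels).
y = rng.uniform(-1, 1, n)
# Hypothetical outputs of the short/middle/long-term models (Ms, Mm, Ml).
pred_s = y + rng.normal(0, 0.3, n)
pred_m = y + rng.normal(0, 0.4, n)
pred_l = y + rng.normal(0, 0.5, n)

# Ensemble: learn a second-stage model over the three term-level predictions.
stack = np.column_stack([pred_s, pred_m, pred_l])
m_single_task = Ridge().fit(stack, y)
print(m_single_task.predict(stack[:3]))
```

The stacked model typically beats any single term-level model because it can weight each time scale by how informative it is.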
III-D Model Fusion of Multi-task
It has been reported that the estimation performance of a target task is improved by using features of different tasks. Therefore, as shown in Fig. 6, a fusion model is generated by incorporating the estimated values of the other recognition tasks as features into the multi-term model. The fusion model uses the multi-term models (Msingle-valence, Msingle-arousal, Msingle-expression) generated for each task. The estimated value of the target task is produced by combining the estimated values of the three single-term models for the target task with the estimated values of the multi-term models for the non-target tasks. For example, when Valence is the target, the estimates of the short-term, middle-term, and long-term models for Valence are combined with the estimates of the multi-term models for Arousal and Expression to generate the final Valence estimation model.
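The feature assembly for the fusion step can be sketched as follows: for a target task (Valence in this example), the three term-level predictions are concatenated with the multi-term predictions of the other tasks. The function name and the random stand-in predictions are illustrative assumptions.

```python
import numpy as np

def fusion_features(target_term_preds, other_task_preds):
    """Concatenate the target task's short/middle/long-term predictions
    with the multi-term predictions of the other recognition tasks."""
    return np.column_stack(list(target_term_preds) + list(other_task_preds))

rng = np.random.default_rng(0)
n = 100
# Hypothetical predictions when Valence is the target task.
val_s, val_m, val_l = (rng.uniform(-1, 1, n) for _ in range(3))
arousal_multi = rng.uniform(-1, 1, n)       # multi-term Arousal estimate
expr_multi = rng.integers(0, 7, n).astype(float)  # multi-term Expression estimate

x_fusion = fusion_features([val_s, val_m, val_l], [arousal_multi, expr_multi])
print(x_fusion.shape)  # (100, 5): 3 Valence terms + 2 non-target tasks
```

The final Valence model is then trained on this 5-column matrix.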
IV-A Implementation and Setup
We used the Aff-Wild2 dataset, which contains 548 videos with multiple annotations (Valence-Arousal, Expression, etc.) added per frame. It is currently the largest in-the-wild dataset with audiovisual annotations. In this challenge, the following training, validation, and test subjects were provided from the data annotated with Valence-Arousal and Expression:
Valence-Arousal: 351, 71, and 139 subjects in the training, validation, and test sets
Expression: 253, 70, and 223 subjects in the training, validation, and test sets
However, some videos contain multiple subjects in a frame. Since it is difficult to separate subjects in such videos, only videos without this issue were used for training and validation. As a result, 341 training subjects and 65 validation subjects were used for the Valence-Arousal task, and 244 training subjects and 68 validation subjects for the Expression task. For testing, the multi-subject videos were divided manually, and all test subjects were used.
For Challenge-Track 1: Valence-Arousal estimation, the ABAW Challenge used the Concordance Correlation Coefficient (CCC) metric:

CCC = 2 * s_xy / (s_x^2 + s_y^2 + (x̄ - ȳ)^2)    (1)

where x and y are the valence/arousal annotations and predicted values, s_x^2 and s_y^2 are their variances, s_xy is their covariance, and x̄ and ȳ are their mean values. The total score of track 1 is the mean of the CCC for valence and arousal:

Score_VA = (CCC_valence + CCC_arousal) / 2    (2)
For Challenge-Track 2: 7 Basic Expression Classification, the ABAW Challenge used accuracy and the F1 score, and the score of track 2 is calculated as:

Score_EXP = 0.67 * F1 + 0.33 * Accuracy    (3)
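The challenge metrics can be computed directly in NumPy; the following sketch follows Eq. (1)-(3), with the 0.67/0.33 weights taken as stated for the challenge.

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient, Eq. (1)."""
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def track1_score(ccc_valence, ccc_arousal):
    """Track 1 total score, Eq. (2): mean of valence and arousal CCC."""
    return 0.5 * (ccc_valence + ccc_arousal)

def track2_score(f1, accuracy):
    """Track 2 score, Eq. (3)."""
    return 0.67 * f1 + 0.33 * accuracy

x = np.array([0.1, 0.2, 0.3, 0.4])
print(ccc(x, x))  # identical sequences -> 1.0
```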
Our framework was implemented in JupyterLab. For feature extraction in pre-processing, OpenFace 2.2.0, ResNet50 or EfficientNet, and OpenPose 1.5.1 were used, and each feature was standardized. For short-term, middle-term, and long-term, we used time windows of 1 second, 6 seconds, and 12 seconds, respectively, and training and validation data were generated with a 0.2-second shift. As a result of pre-processing, 285,260 training samples and 46,398 validation samples were used for Valence-Arousal, and 285,260 training samples and 46,398 validation samples for Expression, for each time window. We used LightGBM to generate the learning models. For Valence-Arousal, regression was used as the objective, and the following custom function was used as the metric; it balances CCC, the evaluation metric of this task, against MSE to minimize the error:
metric function = 2 * CCC - MSE
CCC: Concordance Correlation Coefficient
MSE: Mean Squared Error
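A custom evaluation function of this form can be supplied to LightGBM via its `feval` mechanism, which expects a `(name, value, is_higher_better)` tuple. The sketch below is an assumed implementation of the 2*CCC - MSE metric; to keep it self-contained, a minimal stand-in class mimics the `get_label()` method of `lgb.Dataset`.

```python
import numpy as np

def ccc_mse_metric(preds, train_data):
    """LightGBM-style custom eval: 2*CCC - MSE (higher is better).
    `train_data` is any object exposing get_label(), as lgb.Dataset does."""
    y = train_data.get_label()
    sxy = np.mean((preds - preds.mean()) * (y - y.mean()))
    ccc = 2 * sxy / (preds.var() + y.var() + (preds.mean() - y.mean()) ** 2)
    mse = np.mean((preds - y) ** 2)
    return "2ccc-mse", 2 * ccc - mse, True

class _FakeDataset:  # minimal stand-in for lgb.Dataset in this sketch
    def __init__(self, y):
        self._y = y
    def get_label(self):
        return self._y

y = np.array([0.0, 0.5, 1.0])
name, value, higher_better = ccc_mse_metric(y.copy(), _FakeDataset(y))
print(name, value)  # perfect predictions: 2*1 - 0 = 2.0
```

With LightGBM installed, the same function would be passed as `feval=ccc_mse_metric` to `lgb.train`.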
For Expression, multiclass (7 classes) was used as the objective, and Eq. (3), the evaluation metric of this task, was used as a custom metric function. Among the other parameters, num_leaves, learning_rate, max_depth, and min_child_samples were tuned by grid search. This tuning was performed for each of the AU, head-pose, gaze, OpenPose, and ResNet50 or EfficientNet models, as well as the single-term, multi-term, and multi-task models, to generate the final model. In addition, since up to 7 submissions are allowed in this challenge, we generated and validated models with the following patterns:
Using ResNet50 or EfficientNet
Data balancing, or not
Add 3-second window data to the 1-second, 6-second, and 12-second window data, or not
Feature extraction (reduction of the features to 50% based on LightGBM importances), or not
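The importance-based reduction in the last pattern can be sketched as follows; the helper name and the importance values are hypothetical, standing in for what `booster.feature_importance()` would return from a trained LightGBM model.

```python
import numpy as np

def select_top_half(x, importances):
    """Keep the top 50% of features ranked by model importances."""
    k = x.shape[1] // 2
    keep = np.sort(np.argsort(importances)[::-1][:k])
    return x[:, keep], keep

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
importances = np.array([5, 1, 9, 0, 7, 3, 8, 2])  # hypothetical importances
x_sel, kept = select_top_half(x, importances)
print(kept)  # indices of the 4 most important features
```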
IV-B Results and Discussion
First, Table I compares, on the validation set, models trained using only the features of a single time window (short-term, medium-term, long-term), the ensemble of multiple time windows (multi-term), and the fusion of different recognition tasks (multi-task). These results were obtained without data balancing, without the 3-second window, and without feature extraction. The Expression score is calculated based on Eq. (3), and the Valence-Arousal score based on Eq. (1) and Eq. (2). The validation confirms that the multi-term score is higher than the single-term score, and that the multi-task score is higher than the multi-term score. In particular, the score improved significantly with the multi-term model. We believe this is because gestures expressing emotions have features at different time scales, such as yawning versus raising the eyebrows in surprise, and the model incorporates each of them effectively. Next, Table II shows the validation results for the various multi-task patterns. "Submit" in the table is the submission number for this challenge.
Abbreviations in Table II: enet.: using EfficientNet; bal.: data balancing; ext.: feature extraction; 3s.: using the 3-second time window.
V Conclusions and Future Work
This paper describes the Multiple time window & Multitask Model and data balancing for estimating emotion classifications and valence-arousal intensity using the Aff-Wild2 dataset.
Our model has achieved significantly higher performance than baseline on tracks 1 and 2 of the ABAW Challenge.
In the future, we will investigate effective data augmentation, including pseudo-labeling and time-series data augmentation.
- P. Ekman, E. R. Sorenson and W. V. Friesen, "Pan-cultural elements in facial displays of emotion", Science, vol. 164, 1969, pp. 86-88.
- P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion", Journal of Personality and Social Psychology, vol. 17, 1971, pp. 124-129.
- J. A. Russell, "A circumplex model of affect", Journal of Personality and Social Psychology, vol. 39, no. 6, 1980, p. 1161.
- D. Kollias and S. Zafeiriou, ”Aff-wild2: Extending the aff-wild database for affect recognition”, arXiv preprint, arXiv:1811.07770, 2018.
- D. Kollias and S. Zafeiriou, ”Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface”, arXiv preprint, arXiv:1910.04855, 2019.
- S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao and I. Kotsia, ”Aff-Wild: Valence and Arousal ’In-the-Wild’ Challenge”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2017-July, 2017, pp 1980-1987.
- P. Ekman and W. V. Friesen, ”Facial action coding system: A technique for the measurement of facial movement”, Consulting Psychologists Press, 1978.
- Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim and Louis-Philippe Morency, "OpenFace 2.0: Facial Behavior Analysis Toolkit", IEEE International Conference on Automatic Face and Gesture Recognition, 2018.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, ”Deep residual learning for image recognition”, IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp 770-778.
- Van Thong Huynh, Soo-Hyung Kim, Guee-Sang Lee and Hyung-Jeong Yang, "Engagement Intensity Prediction with Facial Behavior Features", International Conference on Multimodal Interaction, 2019, pp. 567-571.
- Zhe Cao, Tomas Simon, Shih-En Wei and Yaser Sheikh, ”Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Jianming Wu, Zhiguang Zhou, Yanan Wang, Yi Li, Xin Xu and Yusuke Uchida, ”Multi-feature and Multi-instance Learning with Anti-overfitting Strategy for Engagement Intensity Prediction”, International Conference on Multimodal Interaction, 2019, pp 582-588.
- D. Kollias, V. Sharmanska, and S. Zafeiriou, ”Face behavior la carte: Expressions, affect and action units in a single network”, arXiv preprint, arXiv:1910.11111, 2019.
- N. Bosch, S. D’Mello, R. Baker, J. Ocumpaugh, V. Shute, M. Ventura and W. Zhao, ”Automatic detection of learning-centered affective states in the wild”, International Conference on Intelligent User Interfaces, 2015, pp 379-388.
- Wei-Yi Chang, Shih-Huan Hsu and Jen-Hsien Chien, "FATAUVA-Net: An Integrated Deep Learning Framework for Facial Attribute Recognition, Action Unit Detection, and Valence-Arousal Estimation", IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
- Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, ”LightGBM: A Highly Efficient Gradient Boosting Decision Tree”, Neural Information Processing Systems conference, 2017, pp 3146-3154.
- D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou, ”Analysing affective behavior in the first abaw 2020 competition”, arXiv preprint, arXiv:2001.11409, 2020.
- Jianfei Yang, Kai Wang, Xiaojiang Peng and Yu Qiao, ”Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction”, International Conference on Multimodal Interaction, 2018, pp 594-598.
- Kai Wang, Jianfei Yang, Da Guo, Kaipeng Zhang, Xiaojiang Peng and Yu Qiao, ”Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression”, International Conference on Multimodal Interaction, 2019, pp 551-556.
- Mingxing Tan and Quoc V. Le, ”EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, International Conference on Machine Learning, 2019.