Multi-term and Multi-task Affect Analysis in the Wild


Human affect recognition is an important factor in human-computer interaction. However, methods for in-the-wild data are still far from mature. In this paper, we introduce the affect recognition method that we submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2020 Contest. Because we consider that affective behaviors are characterized over different time windows, we generated features and averaged labels from video images using short-term, middle-term, and long-term time windows. We then built an affect recognition model for each time window and ensembled these models. In addition, we fused the valence-arousal (VA) and expression (EXP) models, taking into account that valence, arousal, and expression are closely related. The models were trained by gradient boosting on the mean, standard deviation, max-range, and slope computed in each time window. We achieved a valence-arousal score of 0.495 and an expression score of 0.464 on the validation set.

Action Unit, Valence-Arousal, Emotional Expression, multi-term, multi-task

I Introduction

Human emotion recognition is an important factor in human-computer interaction. It is expected to contribute to a wide range of fields such as healthcare and learning. Many ways of representing human emotions have been studied, of which “categorical emotion classification” and “Valence-Arousal” are the most commonly used. For categorical emotions, the six basic emotional expressions[1][2] proposed by Ekman and Friesen are popular. Ekman et al. classify emotions as “anger, disgust, fear, happiness, sadness, surprise”. Another way to represent emotions is the circumplex model of affect[3] developed by Russell. In the circumplex model, human emotions are mapped onto a two-dimensional plane using two orthogonal axes, the valence axis and the arousal axis.

Recently, D. Kollias et al. have provided a large-scale in-the-wild dataset, Aff-Wild2[4][5]. Aff-Wild2 is an extended version of Aff-Wild[6]. The dataset consists of real videos collected from YouTube that cover a wide range of content (different ages, ethnicities, lighting conditions, locations, image quality, etc.). The videos are annotated with multiple labels: 7 emotion classes (6 basic emotional expressions + Neutral), Valence-Arousal, and Action Units (based on the Facial Action Coding System (FACS)[7]).

In this paper, we propose a fusion model that uses features at multiple time scales and different recognition tasks. Fig. 1 shows the framework of the fusion model. Given the videos, facial and pose features are extracted. These features are then converted into multi-term features calculated over short-term, middle-term, and long-term time windows. A model for a single recognition task (Valence, Arousal, or Expression) is constructed by ensembling the models built on each of the multi-term features. Finally, the predictive model is generated by fusing the models of the other recognition tasks.

Fig. 1: Overview of the proposed method for predicting Valence, Arousal, Expression.

II Related Work

Estimating not only the occurrence of emotions but also their intensity has been studied for many years. Recently, Van Thong Huynh et al. proposed a method that estimates engagement intensity, which is strongly related to emotions, with high accuracy by ensembling Action Unit features obtained from OpenFace[8] and image features obtained from ResNet50[9] in the Emotion Recognition in the Wild Challenge (EmotiW 2019)[10]. Similarly, Zhiguang Zhou et al. estimate engagement intensity with high accuracy by ensembling Action Unit features obtained from OpenFace and posture features obtained from OpenPose[11][12]. Ensembling weak models is thus one of the effective ways to estimate emotions.

In addition, Nigel Bosch, Sidney D’Mello et al. investigated the relationship between time window length and classification performance in emotion classification, and showed that different emotions are classified best with different time windows (e.g., “Delighted” performs well with a short time window, and “Confused” with a long time window)[13].

Regarding the impact of combining recognition tasks, D. Kollias et al. have shown that combining the tasks of action unit detection, emotion classification, and valence-arousal estimation improves the performance of each task[14][15].

III Methodology

In this section, we introduce our proposed method, which combines multiple time scales and multiple recognition tasks. The method consists of pre-processing, a multi-term model, and multi-task model fusion.

III-A Visual Data Pre-processing

First, as shown in Fig. 2, facial expression features and posture features are extracted from each video. There are two types of facial features: one obtained from OpenFace, and the other obtained from ResNet50[9] or EfficientNet[20]. From OpenFace, 49 dimensions consisting of Action Unit intensity (17 dimensions), Action Unit occurrence (18 dimensions), head pose (6 dimensions), and gaze features (8 dimensions) are acquired as the features F1. From ResNet50, the 2048-dimensional image features F2 are acquired and then reduced to 200 dimensions by principal component analysis (PCA) to obtain F2’. When EfficientNet is used instead, the 2048-dimensional image features F2 are reduced to 300 dimensions by PCA to obtain F2’. The posture features are obtained from OpenPose: the skeleton features F3, with 25 keypoints x 3 axes = 75 dimensions, are used.

Next, features over multiple time windows are computed from F1, F2’, and F3, since we think that changes in emotions and facial expressions are characterized over different time windows (for example, opening the mouth in a yawn is characterized over a long time window, and raising an eyebrow in surprise over a short time window). There are three types of time windows: short-term, middle-term, and long-term. The features of each time window (Fs, Fm, Fl) consist of the following.

  • Average value

  • Standard deviation

  • Maximum change width (maximum value - minimum value)

  • Slope (using least squares method)
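The four statistics listed above can be sketched as follows for a single feature inside a single time window. This is only an illustration: the function names are hypothetical, the slope is expressed per frame, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import fmean, pstdev


def least_squares_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        return 0.0
    t_mean = (n - 1) / 2
    v_mean = fmean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def window_features(values):
    """Return (mean, std, max-range, slope) of one feature in one window."""
    return (
        fmean(values),
        pstdev(values),              # population std (assumption)
        max(values) - min(values),   # maximum change width
        least_squares_slope(values),
    )


# Hypothetical Action Unit intensity over five consecutive frames
feats = window_features([0.0, 1.0, 2.0, 3.0, 4.0])
```

Applying `window_features` to every column of F1, F2’, and F3 would give four values per original feature and per time window.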

Similarly, for each time window, a label (Ls, Lm, Ll) is generated using annotations. The label generation method differs depending on the target and is as follows.

  • Valence, Arousal: Average value of annotations

  • Expression: Mode of annotations
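The label rule above can be illustrated with a short sketch. The function name and the task strings are hypothetical, and the tie-breaking behavior for the mode (the value seen first wins) is an assumption that the text does not specify.

```python
from collections import Counter
from statistics import fmean


def window_label(annotations, task):
    """Aggregate the frame-level annotations of one window into one label."""
    if task in ("valence", "arousal"):
        return fmean(annotations)            # average value of annotations
    if task == "expression":
        # Mode of annotations; ties go to the value encountered first (assumption)
        return Counter(annotations).most_common(1)[0][0]
    raise ValueError(f"unknown task: {task}")


valence_label = window_label([0.2, 0.4, 0.6], "valence")
expression_label = window_label([0, 0, 3, 3, 3, 1], "expression")
```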

Fig. 2: Pre-processing: feature engineering of multi model and multi time-scale data

III-B Data Balancing

It is important to address the data imbalance problem. In the Expression annotations of the Aff-Wild2 dataset, over 60% of the data are Neutral, while Anger and Fear account for only about 1%. In the Valence-Arousal annotations, more than 23% of the data fall in the range Valence: 0 to 0.25 and Arousal: 0 to 0.25. Therefore, we balanced the data. Fig. 3 shows the Expression balancing results, and Fig. 4 shows the Valence-Arousal balancing results. For Expression, the data are balanced by halving the number of Neutral samples and duplicating the samples of the other emotions. For Valence-Arousal, the plane is divided into 64 areas (8 divisions of Valence x 8 divisions of Arousal), and the data are balanced by halving the data in the central areas and duplicating the data in the other areas.
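A minimal sketch of this balancing scheme is given below. Several details are assumptions rather than facts from the paper: Neutral is encoded as label 0, halving keeps one of every two samples in order of appearance, valence and arousal lie in [-1, 1], and the set of "central" grid cells is supplied by the caller.

```python
def balance_expression(samples, neutral_label=0):
    """Halve Neutral samples and duplicate every other sample.

    samples: list of (features, label) pairs.
    neutral_label=0 is an assumption about the label encoding.
    """
    balanced, neutral_seen = [], 0
    for features, label in samples:
        if label == neutral_label:
            if neutral_seen % 2 == 0:      # keep one of every two Neutral samples
                balanced.append((features, label))
            neutral_seen += 1
        else:
            balanced.extend([(features, label)] * 2)
    return balanced


def va_cell(valence, arousal, bins=8):
    """Map (valence, arousal), each in [-1, 1], to a cell of a bins x bins grid."""
    def index(value):
        return min(int((value + 1.0) / 2.0 * bins), bins - 1)
    return index(valence), index(arousal)


def balance_valence_arousal(samples, central_cells, bins=8):
    """Halve samples falling in central_cells and duplicate all others.

    samples: list of (features, valence, arousal) triples.
    central_cells: set of grid cells treated as over-represented (hypothetical).
    """
    balanced, central_seen = [], 0
    for sample in samples:
        _, valence, arousal = sample
        if va_cell(valence, arousal, bins) in central_cells:
            if central_seen % 2 == 0:
                balanced.append(sample)
            central_seen += 1
        else:
            balanced.extend([sample] * 2)
    return balanced


expr_balanced = balance_expression([("a", 0), ("b", 0), ("c", 1), ("d", 0)])
va_balanced = balance_valence_arousal(
    [("a", 0.1, 0.1), ("b", 0.1, 0.1), ("c", -0.9, 0.9)],
    central_cells={(4, 4)},
)
```

With 8 divisions over [-1, 1], each cell is 0.25 wide, so the dense region Valence: 0 to 0.25, Arousal: 0 to 0.25 corresponds to cell (4, 4) in this indexing.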

Fig. 3: Expression data distributions

Fig. 4: Valence-Arousal data distributions

III-C Multi-term model

The structure of the multi-term model is shown in Fig. 5. First, a single-term model is generated using Ft and Lt, the time window features and labels described in Section III-A (t is the target time window). In the single-term model, the feature Ft is divided into the AU feature Ft-au, the head-pose feature Ft-head, the gaze feature Ft-gaze, the OpenPose feature Ft-pose, and the ResNet50 feature Ft-rnet or the EfficientNet feature Ft-enet, and an estimation model is generated for each. This is because, especially in emotion estimation, the final performance is improved by generating a model for each feature group with different characteristics and then ensembling the models[10][12][18][19]. The labels for Valence and Arousal are the values averaged over each time window. In other words, a single-term model estimates the short-term, middle-term, or long-term trend of Valence and Arousal. Then, Msingle-task, the estimation model for a single task, is generated by ensembling the single-term models (Ms, Mm, Ml) of the short-term, middle-term, and long-term windows. Here, task denotes a recognition task, which in this paper is one of Valence, Arousal, and Expression. The short-term label is used as the label of this model, because its values are comparable to frame-level data.

Fig. 5: Multi-Term Model: ensembled short-term, middle-term, long-term model
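As a rough illustration of the multi-term ensemble, the sketch below aligns the outputs of the three single-term models (Ms, Mm, Ml) into one input row per sample. The paper does not specify the combiner in detail, so the unweighted average used here is only a stand-in for the trained ensemble model, and all function names are hypothetical.

```python
def stack_single_term(pred_short, pred_middle, pred_long):
    """Align the three single-term predictions into one row per sample."""
    if not (len(pred_short) == len(pred_middle) == len(pred_long)):
        raise ValueError("predictions must refer to the same samples")
    return [list(row) for row in zip(pred_short, pred_middle, pred_long)]


def average_ensemble(rows):
    """Stand-in combiner: unweighted mean of the single-term predictions."""
    return [sum(row) / len(row) for row in rows]


# Hypothetical valence predictions of Ms, Mm, Ml for two samples
rows = stack_single_term([0.2, 0.4], [0.4, 0.6], [0.6, 0.8])
multi_term_pred = average_ensemble(rows)
```

In a trained version, `rows` would be the features of a second-stage model fitted against the short-term label, in place of `average_ensemble`.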

III-D Model fusion of multi-task

It has been reported that the estimation performance of a target task is improved by using features of different tasks[14][15]. Therefore, as shown in Fig. 6, a fusion model is generated by incorporating the estimated values for the other recognition tasks as features into the multi-term model. The fusion model uses the multi-term models (Msingle-valence, Msingle-arousal, Msingle-expression) generated for each task. The estimated value of the target task is generated by combining the estimated values of the three single-term models for the target task with the estimated values of the multi-term models for the non-target tasks. For example, when Valence is the target, the estimates of the short-term, middle-term, and long-term models for Valence, the estimate of the multi-term model for Arousal, and the estimate of the multi-term model for Expression are combined to generate the final Valence estimation model.

Fig. 6: Multi Task Model: fusion of the multi-term model for the target task and the models for the other tasks
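The input of the fusion model described in this section could be assembled as in the sketch below. The dictionary layout and the function name are hypothetical, and representing the Expression estimate as a single value per sample (rather than, say, seven class probabilities) is an assumption.

```python
def fusion_features(target, single_term_preds, multi_term_preds):
    """Build one fusion-model input row per sample.

    single_term_preds: {task: {"short": [...], "middle": [...], "long": [...]}}
    multi_term_preds:  {task: [...]} with one estimate per sample
    """
    terms = ("short", "middle", "long")
    # Three single-term estimates for the target task
    columns = [single_term_preds[target][term] for term in terms]
    # Multi-term estimates for every non-target task
    columns += [
        multi_term_preds[task]
        for task in sorted(multi_term_preds)
        if task != target
    ]
    return [list(row) for row in zip(*columns)]


single = {"valence": {"short": [0.1, 0.2], "middle": [0.3, 0.4], "long": [0.5, 0.6]}}
multi = {"valence": [0.3, 0.4], "arousal": [0.7, 0.8], "expression": [0, 3]}
fusion_rows = fusion_features("valence", single, multi)
```

Each row holds the short-term, middle-term, and long-term Valence estimates followed by the multi-term Arousal and Expression estimates, which matches the Valence example given in the text.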

IV Experiments

IV-A Implementation and Setup


We used the Aff-Wild2 dataset[4][5]. It contains 548 videos, and multiple annotations (Valence-Arousal, Expression, etc.) are provided at the frame level. It is currently the largest annotated audiovisual in-the-wild dataset. For this challenge, the following training, validation, and test subjects were provided from the data annotated with Valence-Arousal and Expression.

  • Valence-Arousal: 351, 71, and 139 subjects in the training, validation, and test sets

  • Expression: 253, 70, and 223 subjects in the training, validation, and test sets

However, some videos have multiple subjects in the frame. Since it is difficult to separate the subjects in these videos, only videos without multiple subjects were used for training and validation. As a result, 341 training subjects and 65 validation subjects were used for the Valence-Arousal task, and 244 training subjects and 68 validation subjects were used for the Expression task. For the test set, the subjects were separated manually and all of them were used.

[Evaluation Metric]

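For Challenge-Track 1: Valence-Arousal estimation, the ABAW Challenge used the Concordance Correlation Coefficient (CCC) metric, defined as follows:

$$\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2} \tag{1}$$

where $x$ and $y$ are the valence/arousal annotations and predicted values, $s_x^2$ and $s_y^2$ are their variances, $s_{xy}$ is the covariance, and $\bar{x}$ and $\bar{y}$ are the mean values. The total score of Track 1 is the mean value of the CCC for valence and arousal:

$$\mathrm{Score}_{\mathrm{VA}} = \frac{\mathrm{CCC}_{\mathrm{valence}} + \mathrm{CCC}_{\mathrm{arousal}}}{2} \tag{2}$$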
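For reference, the CCC and the Track 1 score can be computed as in the following sketch. It uses population (1/N) variances and covariance and returns 0 when the denominator is zero; both choices are assumptions, as are the function names.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between annotations and predictions."""
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    return 2.0 * cov / denom if denom else 0.0   # zero-denominator guard (assumption)


def track1_score(ccc_valence, ccc_arousal):
    """Mean of the valence CCC and the arousal CCC."""
    return (ccc_valence + ccc_arousal) / 2.0


perfect = ccc([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted = ccc([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The second example shows that a constant offset lowers the CCC even though the Pearson correlation of the two sequences is still 1, because the squared difference of the means enters the denominator.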


For Challenge-Track 2: 7 Basic Expression Classification, the ABAW Challenge used the accuracy and the F1 score, and the score of Track 2 is calculated as follows:

$$\mathrm{Score}_{\mathrm{EXPR}} = 0.67 \times F_1 + 0.33 \times \mathrm{Accuracy} \tag{3}$$
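A small sketch of the Track 2 score is shown below. It assumes the challenge weighting of 0.67 for the F1 score and 0.33 for the accuracy, an F1 score that is averaged over the classes without weighting, and an F1 of 0 for a class that never appears; the function name is hypothetical.

```python
def expression_score(y_true, y_pred, num_classes=7):
    """Weighted combination of macro F1 and accuracy (weights are an assumption)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    f1_per_class = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_per_class) / num_classes

    return 0.67 * macro_f1 + 0.33 * accuracy


# Two-class toy example: accuracy 0.75, per-class F1 of 2/3 and 4/5
toy_score = expression_score([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```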



Our framework was implemented in JupyterLab. For feature extraction in pre-processing, OpenFace 2.2.0, ResNet50 or EfficientNet, and OpenPose 1.5.1 were used, and each feature was standardized. For short-term, middle-term, and long-term, we used time windows of 1 second, 6 seconds, and 12 seconds, respectively, and training and validation data were generated with a 0.2-second shift. As a result of pre-processing, 285,260 training samples and 46,398 validation samples were used for Valence-Arousal, and 285,260 training samples and 46,398 validation samples were used for Expression, for each time window. We used LightGBM[16] to train the models. For Valence-Arousal, the regression objective was used, together with the following custom metric function, which balances CCC, the evaluation metric of this task, against MSE so that the error is also minimized.

  • metric function = 2 * CCC - MSE

    • CCC: Concordance Correlation Coefficient

    • MSE: Mean Squared Error
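The metric function above might be written as in the sketch below. The wrapper `lgb_feval` follows the shape that LightGBM expects for a custom evaluation function passed to training (it receives the predictions and the evaluation dataset and returns a name, a value, and a higher-is-better flag); how the authors actually wired it in is not stated, so this part and all names are assumptions.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient with population statistics."""
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    return 2.0 * cov / denom if denom else 0.0


def mse(y_true, y_pred):
    """Mean squared error."""
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def ccc_mse_metric(y_true, y_pred):
    """metric function = 2 * CCC - MSE (higher is better)."""
    return 2.0 * ccc(y_true, y_pred) - mse(y_true, y_pred)


def lgb_feval(preds, eval_data):
    """Custom metric wrapper; eval_data must provide get_label()."""
    labels = list(eval_data.get_label())
    return "2ccc_minus_mse", ccc_mse_metric(labels, list(preds)), True
```

The metric reaches its maximum of 2 for a perfect prediction (CCC = 1, MSE = 0), and the MSE term penalizes predictions that correlate well with the labels but are far from them in absolute value.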

For Expression, the multiclass objective (7 classes) was used, and Eq. (3), the evaluation metric of this task, was used as a custom metric function. Among the other parameters, num_leaves, learning_rate, max_depth, and min_child_samples were tuned by grid search. This tuning was performed for each of the AU, head-pose, gaze, OpenPose, and ResNet50 or EfficientNet models, the single-term models, the multi-term model, and the multi-task model, and the final model was generated. In addition, since up to 7 submissions are allowed in this challenge, we generated and validated models with the following patterns.

  • Using ResNet50 or EfficientNet

  • Data balancing, or not

  • Add 3-second window data to the 1-second, 6-second, and 12-second window data, or not

  • Feature extraction (reducing the features to 50% based on LightGBM importances), or not
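The importance-based reduction in the last pattern can be sketched as follows. The importances are assumed to have been read from an already trained LightGBM model, and both the function name and the rounding rule (floor of half, at least one feature) are hypothetical.

```python
def select_top_half(feature_names, importances):
    """Keep the 50% of features with the highest importance."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    keep = max(1, len(ranked) // 2)   # rounding rule is an assumption
    return [name for name, _ in ranked[:keep]]


# Hypothetical feature names and importance values
selected = select_top_half(
    ["au_mean", "au_std", "gaze_slope", "pose_range"],
    [5, 1, 9, 3],
)
```

The model would then be retrained on the selected columns only.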

IV-B Results and Discussion

First, Table I compares, on the validation set, models trained using only the features of a single time window (short-term, middle-term, long-term), the ensemble over multiple time windows (multi-term), and the fusion of different recognition tasks (multi-task). The comparison was made with no data balancing, no 3-second window, and no feature extraction. The Expression score is calculated with Eq. (3), and the Valence-Arousal score with Eq. (1) and Eq. (2). The validation confirmed that the Valence-Arousal score of Multi-term is higher than that of every Single-term model (for Expression, Multi-term is higher than short-term and long-term but slightly below middle-term), and that the scores of Multi-task are higher than those of Multi-term. In particular, the Valence-Arousal score improved markedly with the Multi-term model. We think that this is because gestures expressing emotions are characterized over different time windows, such as yawning and raising the eyebrows in surprise, and the model incorporates each of these features effectively. Next, Table II shows the validation results for the various patterns of the Multi-task model. “Submit” in the table is the submission number for this challenge.

Table I: Comparison of single-term, multi-term, and multi-task models on the validation set

                 EXPR     Valence-Arousal
Method           Score    Val.     Aro.     Score
Baseline [17]    0.360    0.140    0.240    0.190
Short-term       0.364    0.327    0.417    0.402
Middle-term      0.435    0.361    0.430    0.396
Long-term        0.374    0.351    0.380    0.366
Multi-term       0.426    0.455    0.504    0.480
Multi-task       0.432    0.455    0.508    0.482

Table II: Validation results for the patterns of the Multi-task model

             Pattern                       EXPR     Valence-Arousal
Submit       enet.   bal.   ext.   3s.     Score    Val.     Aro.     Score
Base [17]                                  0.360    0.140    0.240    0.190
submit. 1                                  0.432    0.455    0.508    0.482
submit. 2                                  0.435    0.429    0.502    0.466
submit. 3                                  0.462    0.500    0.489    0.495
submit. 4                                  0.471    0.480    0.477    0.479
submit. 5                                  0.462    0.500    0.489    0.495
submit. 6                                  0.471    0.480    0.477    0.479

enet.) using EfficientNet, bal.) data balancing
ext.) feature extraction, 3s.) using 3s time window

V Conclusions and Future Works

This paper describes a multi-term (multiple time window) and multi-task model, together with data balancing, for estimating emotion classes and valence-arousal intensity on the Aff-Wild2 dataset. On the validation set, our model achieved significantly higher performance than the baseline on Tracks 1 and 2 of the ABAW Challenge.
In the future, we will investigate effective data augmentation, including pseudo-labeling and time-series data augmentation.


References

  1. P. Ekman, E. R. Sorenson and W. V. Friesen, ”Pan-cultural elements in facial displays of emotions”, Science, vol. 164, 1969, pp 86-88.
  2. P. Ekman and W. V. Friesen, ”Constants across cultures in the face and emotion”, Journal of Personality and Social Psychology, vol. 17, 1971, pp 124-129.
  3. J. A. Russell, ”A circumplex model of affect”, Journal of Personality and Social Psychology, vol. 39, no. 6, 1980, pp 1161.
  4. D. Kollias and S. Zafeiriou, ”Aff-wild2: Extending the aff-wild database for affect recognition”, arXiv preprint, arXiv:1811.07770, 2018.
  5. D. Kollias and S. Zafeiriou, ”Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface”, arXiv preprint, arXiv:1910.04855, 2019.
  6. S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao and I. Kotsia, ”Aff-Wild: Valence and Arousal ’In-the-Wild’ Challenge”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2017-July, 2017, pp 1980-1987.
  7. P. Ekman and W. V. Friesen, ”Facial action coding system: A technique for the measurement of facial movement”, Consulting Psychologists Press, 1978.
  8. Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim and Louis-Philippe Morency, ”OpenFace 2.0: Facial Behavior Analysis Toolkit”, IEEE International Conference on Automatic Face and Gesture Recognition, 2018.
  9. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, ”Deep residual learning for image recognition”, IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp 770-778.
  10. Van Thong Huynh, Soo-Hyung Kim, Guee-Sang Lee and Hyung-Jeong Yang, ”Engagement Intensity Prediction with Facial Behavior Features”, International Conference on Multimodal Interaction, 2019, pp 567-571.
  11. Zhe Cao, Tomas Simon, Shih-En Wei and Yaser Sheikh, ”Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  12. Jianming Wu, Zhiguang Zhou, Yanan Wang, Yi Li, Xin Xu and Yusuke Uchida, ”Multi-feature and Multi-instance Learning with Anti-overfitting Strategy for Engagement Intensity Prediction”, International Conference on Multimodal Interaction, 2019, pp 582-588.
  13. N. Bosch, S. D’Mello, R. Baker, J. Ocumpaugh, V. Shute, M. Ventura and W. Zhao, ”Automatic detection of learning-centered affective states in the wild”, International Conference on Intelligent User Interfaces, 2015, pp 379-388.
  14. D. Kollias, V. Sharmanska, and S. Zafeiriou, ”Face behavior à la carte: Expressions, affect and action units in a single network”, arXiv preprint, arXiv:1910.11111, 2019.
  15. Wei-Yi Chang, Shih-Huan Hsu, Jen-Hsien Chien, ”FATAUVA-Net: An Integrated Deep Learning Framework for Facial Attribute Recognition, Action Unit Detection, and Valence-Arousal Estimation”, IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020.
  16. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, ”LightGBM: A Highly Efficient Gradient Boosting Decision Tree”, Neural Information Processing Systems conference, 2017, pp 3146-3154.
  17. D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou, ”Analysing affective behavior in the first abaw 2020 competition”, arXiv preprint, arXiv:2001.11409, 2020.
  18. Jianfei Yang, Kai Wang, Xiaojiang Peng and Yu Qiao, ”Deep Recurrent Multi-instance Learning with Spatio-temporal Features for Engagement Intensity Prediction”, International Conference on Multimodal Interaction, 2018, pp 594-598.
  19. Kai Wang, Jianfei Yang, Da Guo, Kaipeng Zhang, Xiaojiang Peng and Yu Qiao, ”Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression”, International Conference on Multimodal Interaction, 2019, pp 551-556.
  20. Mingxing Tan and Quoc V. Le, ”EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, International Conference on Machine Learning, 2019.