Non-Volume Preserving-based Feature Fusion Approach
to Group-Level Expression Recognition on Crowd Videos
Group-level emotion recognition (ER) is a growing research area, as the demand for assessing crowds of all sizes is becoming of interest in both the security arena and social media. This work investigates group-level expression recognition on crowd videos, where information is aggregated not only across a variable-length sequence of frames but also over the set of faces within each frame to produce aggregated recognition results. In this paper, we propose an effective deep feature-level fusion mechanism, named Non-Volume Preserving Fusion (NVPF), to model the spatial-temporal information in crowd videos. Furthermore, we extend the proposed NVPF mechanism to a Temporal NVPF (TNVPF) approach to learn the temporal information between frames. In order to demonstrate the robustness and effectiveness of each component in the proposed approach, three experiments were conducted: (i) evaluation on the AffectNet database to benchmark the proposed emoNet for recognizing facial expression; (ii) evaluation on EmotiW2018 to benchmark the proposed deep feature-level fusion mechanism NVPF; and (iii) evaluation of the proposed TNVPF on a new Group-level Emotion on Crowd Videos (GECV) dataset composed of 627 videos collected from social media. The GECV dataset, which will be made publicly available, is a collection of videos ranging in duration from 10 to 20 seconds, each showing a crowd of twenty (20) or more subjects and labeled as positive, negative, or neutral.
1 Introduction
Emotion recognition (ER) based on human facial expressions via facial action units (the Facial Action Coding System, FACS), i.e., movements of facial muscles, has been studied for years in the fields of affective computing, e-learning, health care, virtual reality entertainment, and human-computer interaction (HCI). ER approaches can be technically categorized into two groups: (i) individual ER and (ii) group-level ER. While the studies in individual ER are quite mature, the research in group-level ER is still in its infancy. A challenge of group-level ER is detecting all faces in the group and aggregating the emotional content of the group across the scene (image or video), as shown in Figure 1.
Traditional approaches to ER are based on hand-designed features, as illustrated by [28, 19]. However, with the emergence of deep learning, copious large-scale datasets, and the compute power of graphical processors, computer vision tasks have seen enormous performance gains, and this is also true for individual (traditional) ER. Compared to traditional hand-crafted models, a well-optimized deep learning model is capable of extracting deeper discriminative features. These deep feature-based ER solutions have proven capable not only on images but also on videos for individual ER [20, 24, 8, 2, 11, 17, 22, 10], and there have been some inroads into classifying group-level emotions on single images [29, 30, 12, 26, 1].
Unlike prior work that tackles ER on videos, this work examines the ER responses of a group (crowd) instead of a single person in a video. To accomplish ER fidelity across a crowd of 20 or more in a video, the ER responses are categorized as positive, negative, or neutral. Furthermore, a new approach to facial feature-based group-level ER has been developed to improve on the simplified approaches presented to date, in which the final decision is based on the group of faces as represented by some form of averaging or a winner-take-all voting paradigm.
This work introduces a new deep feature-based fusion mechanism termed Non-volume Preserving Fusion (NVPF), which is demonstrated to better model the spatial relationship between facial emotions among the group within an image or still frame. With the proposed NVPF mechanism, we also address the crowd problem in which multiple emotions are present. On top of that, this mechanism is a remedy for unclear emotion due to the resolution of the face, i.e., the face is too small to register an emotion, as shown in Figure 2. The contributions of our proposed deep feature-level fusion approach to group-level ER on crowd videos can be summarized as follows:
- To the best of our knowledge, this is the first work to address group-level emotion on crowd videos with multiple emotions across the crowd and variable face resolution, i.e., (i) multiple emotions are present within a frame and (ii) faces are not well detected due to face resolution.
- We propose a high-performance and low-cost deep network for facial expression recognition, named emoNet, to robustly extract facial expression features.
- We present a novel deep learning-based fusion mechanism, named Non-volume Preserving Fusion (NVPF), to model the feature-level spatial relationship between facial expressions within a group.
- The presented framework is then extended to a new end-to-end deep network, Temporal Non-volume Preserving Fusion (TNVPF), to tackle temporal-spatial fusion on videos.
- Unlike previous work, which presents only one emotion status for an entire image, the proposed method is able to cluster multiple emotion regions in images or videos, as shown in Fig. 2.
- Finally, a new dataset, GECV, is introduced for the problem of group-level ER on crowd videos.
2 Related Work
2.1 Group-level ER
Previous work on this task [29, 30, 12, 26, 1, 13, 21] has focused on extracting scene features from the entire image as a global representation and facial features from the faces in the given image as a local representation. Most state-of-the-art approaches use "naive" mechanisms such as averaging [1, 29], concatenating, or weighting [12, 29, 26] to merge the global and local representations. Averaging, as used in the work referenced above, is nothing more than a voting or majority-selection scheme. The concatenating or weighting approach introduced by Guo et al. utilized seven different CNN-based models trained on different parts of the scene (background, faces, and skeletons), whose combination is optimized over the predictions. Tan et al. built three CNN models for aligned faces, non-aligned faces, and entire images, respectively. Each CNN produces scores for each class, which are then combined via an averaging strategy to obtain the final class score. By contrast, Wei et al. modelled the spatial relationship between faces with an LSTM network. The local information of each face is represented by VGGFace-lstm and DCNN-lstm, while the global information is extracted by Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN, and VGG features. The local and global features are fused by score fusion. The approach of Rassadin et al. involved extracting feature vectors of detected faces using CNNs trained for the face identification task. Random Forest classifiers were employed to predict the emotion score.
Different from other works on score fusion by weighting or averaging, Abbas et al. utilized a densely connected network to merge a 1x3 score vector from the scene and a 1x3 score vector from the facial features. Gupta et al. proposed different weighted fusion mechanisms for both local and global information. Their attention model operates at either the feature level or the score level. Khan et al. proposed applying ResNet18 and ResNet34 to both small and big faces, in a design organized as a four-stream hybrid network.
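The naive fusion schemes discussed above can be sketched in a few lines. The following is a minimal illustration, not the implementation of any cited work: the function names and the per-face scores are hypothetical, and each face is assumed to already have a three-way score vector (positive, negative, neutral).

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")

def average_fusion(face_scores):
    """Score-level fusion: average per-face class scores, then take the arg-max."""
    n = len(face_scores)
    mean = [sum(s[c] for s in face_scores) / n for c in range(len(CLASSES))]
    return CLASSES[mean.index(max(mean))], mean

def majority_vote(face_scores):
    """Each face votes for its own arg-max class; the most common class wins."""
    votes = [CLASSES[s.index(max(s))] for s in face_scores]
    return Counter(votes).most_common(1)[0][0]

def concat_fusion(face_features):
    """Feature-level fusion by concatenation: one long vector per image."""
    return [value for feature in face_features for value in feature]

# Hypothetical scores for three faces in one image (each row sums to 1).
scores = [
    [0.90, 0.05, 0.05],  # one very confident "positive" face
    [0.45, 0.50, 0.05],  # two mildly "negative" faces
    [0.45, 0.50, 0.05],
]
label_avg, mean_scores = average_fusion(scores)
label_vote = majority_vote(scores)
print(label_avg, label_vote)
```

In this toy input, averaging lets the single confident face outweigh the two mildly confident ones, whereas voting does not. Neither scheme models any relationship between the faces, which is the gap a learned fusion mechanism is meant to fill.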
2.2 ER on Videos
Kahou et al. combined multiple deep neural networks, including a deep CNN, a deep belief net, a deep autoencoder, and a shallow network, for different data modalities on EmotiW2013. This approach won the competition. The temporal information between frames is fused by averaging the score decisions. A year later, the winner of EmotiW2014, Liu et al., investigated three types of image set models (linear subspace, covariance matrix, and Gaussian distribution) and three classifiers (including logistic regression and partial least squares) on the video sets. As in the preceding approach, the temporal information between frames is fused by averaging the score decisions. Instead of averaging, the winner of EmotiW 2015 utilized RNNs to model the temporal information. In this approach, a Multilayer Perceptron (MLP) with separate hidden layers for each modality is used, and the resulting representations are then concatenated.
Bargal et al. used a spatial approach to video classification, with a feature encoding module based on Signed Square Root (SSR) and normalization, which concatenates FC5 of VGG13, FC7 of VGG16, and the pooling layer of ResNet, followed by an SVM classification module. Fan et al. present a video-based ER system whose core module is a hybrid network that combines RNNs and 3D CNNs. The 3D CNNs encode appearance and motion information in different ways, whereas the RNNs encode the motion later. Recently, Hu et al. presented the Supervised Scoring Ensemble (SSE), which adds supervision not only to deep layers but also to intermediate and shallow layers. They also introduce a new fusion structure in which class-wise scoring activations at diverse complementary feature layers are concatenated and further used as the inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture.
From the aforementioned literature review, none of the prior work is able to tackle both group-level ER and ER on videos in a single framework. Furthermore, most previous work that makes use of facial features can neither handle cases in which human faces are not well detected nor deal with scenarios in which multiple emotions exist within an image. Crowd images and videos, where most human faces occupy a tiny portion of the frame (low resolution) and are captured under varied conditions, are a typical instance. Table 1 shows the comparison on facial feature-based expression recognition between our proposed framework and other state-of-the-art methods.
3 Our Proposed Approach
In this section, we describe our proposed end-to-end deep learning-based approach to handle the problem of group-level emotion recognition on crowd videos in the wild. Figure 3 shows the overall structure of our proposed approach to model the spatial representation of groups of people in a single image. The temporal relationship between video frames is then further exploited in the structure of Temporal NVPF, as presented in Figure 5. Unlike previous fusion methods, our proposed NVPF approach can handle fusion in both well-detected face windows and non-detected face windows. For detected face windows, deep facial expression features are extracted using the proposed emoNet network. These features are vectorized and structured as inputs to the NVPF module. Meanwhile, in non-detected face regions, pixel cropping is adopted for the fusion process.
The proposed network consists of three main components: (i) our newly designed CNN framework, named emoNet, to extract facial emotion features; (ii) a novel Non-Volume Preserving Fusion mechanism to model the spatial representation; and (iii) temporal relationship embedding with the Temporal NVPF structure.
3.1 The Proposed emoNet
In this section, we propose a novel lightweight and high-performance deep neural network design, named emoNet, to efficiently and accurately recognize group-level emotion. For the group-level ER problem, due to the large number of faces to be processed within one image, extracting their representations in feature space using a very deep network (e.g., ResNet101, DenseNet, etc.) could be very costly. Therefore, in our framework, we propose the emoNet structure such that the information flow during the expression embedding process can be maximized while maintaining a relatively low computational cost. The design of emoNet is motivated by three main strategies: (1) performing convolution faster and more memory-efficiently via depthwise separable convolutional layers; (2) increasing the network capacity for embedding emotion features via bottleneck blocks with residual connections; and (3) quickly reducing the spatial dimension in the first few layers while expanding the number of channels. Following those strategies, we propose the main architecture of our emoNet, containing convolutional layers, depthwise separable convolutional layers, a sequence of bottleneck blocks with and without residual connections, and fully connected (FC) layers (see Table 2 for more details). The input of emoNet is a face image that is cropped and aligned in order to remove information unnecessary for emotion recognition, such as background and hair.
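A bottleneck block in our emoNet is composed of three main components: (1) a $1 \times 1$ convolution layer with ReLU activation, $\mathcal{B}_1$; (2) a $3 \times 3$ depthwise convolution layer with stride $s$ and ReLU activation, $\mathcal{B}_2$; and (3) a $1 \times 1$ linear convolution layer, $\mathcal{B}_3$. Given an input $x$ of size $h \times w \times k$, the bottleneck block operator can be mathematically defined as

$$\mathcal{B}(x) = [\mathcal{B}_3 \circ \mathcal{B}_2 \circ \mathcal{B}_1](x)$$

where $\mathcal{B}_1: \mathbb{R}^{h \times w \times k} \to \mathbb{R}^{h \times w \times k_e}$, $\mathcal{B}_2: \mathbb{R}^{h \times w \times k_e} \to \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times k_e}$, and $\mathcal{B}_3: \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times k_e} \to \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times k_o}$, with $k_e$ the number of expanded channels and $k_o$ the number of output channels. The bottleneck blocks (BBlock) with and without residual connections differ in the stride: $s = 1$ in a BBlock with a residual connection, while $s = 2$ in a BBlock without one.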
Table 2: Architecture of the proposed emoNet.
|Input||Repeats||Operator||Stride||Output channels||Residual|
|112×112×3||1||Conv 3×3||2||64||✗|
|56×56×64||1||DWconv 3×3||1||64||✗|
|56×56×64||2||Conv 1×1||1||128||✗|
| || ||Conv 1×1, Linear||1||64|| |
|28×28×64||4||Conv 1×1||1||128||✗|
| || ||Conv 1×1, Linear||1||128|| |
|14×14×128||2||Conv 1×1||1||256||✓|
| || ||Conv 1×1, Linear||1||128|| |
|14×14×128||4||Conv 1×1||1||256||✗|
| || ||Conv 1×1, Linear||1||128|| |
|7×7×128||2||Conv 1×1||1||256||✓|
| || ||Conv 1×1, Linear||1||128|| |
|7×7×128||1||Conv 1×1||1||512||✗|
|7×7×512||1||512-d FC||–||512||✗|
|1×1×512||1||M-d FC||–||M||✗|
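To make the bottleneck design concrete, the sketch below traces tensor shapes through one block and compares its multiply-accumulate count with that of a plain 3×3 convolution. It is only an illustration under stated assumptions: a 1×1 expansion, a 3×3 depthwise convolution with padding 1, and a 1×1 linear projection. All function names are hypothetical, and the example sizes are taken from the table rows for illustration.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution (floor division)."""
    return (size + 2 * pad - kernel) // stride + 1

def bottleneck_shapes(h, w, k, expand_ch, out_ch, stride):
    """Return (height, width, channels) after each of the three layers."""
    s1 = (h, w, expand_ch)                      # 1x1 conv + ReLU: expand channels
    h2, w2 = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
    s2 = (h2, w2, expand_ch)                    # 3x3 depthwise conv + ReLU
    s3 = (h2, w2, out_ch)                       # 1x1 linear projection
    return [s1, s2, s3]

def can_use_residual(k, out_ch, stride):
    """A skip connection needs identical input and output shapes."""
    return stride == 1 and k == out_ch

def bottleneck_macs(h, w, k, expand_ch, out_ch, stride):
    """Multiply-accumulates of the three layers (biases ignored)."""
    (h1, w1, c1), (h2, w2, c2), (h3, w3, c3) = bottleneck_shapes(
        h, w, k, expand_ch, out_ch, stride)
    expand = h1 * w1 * k * c1        # each output value mixes k input channels
    depthwise = h2 * w2 * c2 * 9     # one 3x3 filter per channel
    project = h3 * w3 * c2 * c3
    return expand + depthwise + project

def standard_conv_macs(h, w, k, out_ch, stride):
    """Multiply-accumulates of an ordinary 3x3 convolution, for comparison."""
    h2, w2 = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
    return h2 * w2 * k * out_ch * 9

# Stride-2 block without residual, then a stride-1 block with residual.
print(bottleneck_shapes(56, 56, 64, 128, 64, 2))
print(bottleneck_shapes(14, 14, 128, 256, 128, 1))
print(bottleneck_macs(14, 14, 128, 256, 128, 1),
      standard_conv_macs(14, 14, 128, 128, 1))
```

The shape trace also shows why only the stride-1 blocks can carry a skip connection: a stride of 2 halves the spatial size, so the input and output can no longer be added element-wise.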
3.2 Non-volume Preserving Fusion (NVPF)
In this section, we present a novel fusion mechanism named Non-volume Preserving Fusion (NVPF), where a set of faces in a group is efficiently fused via a non-linear process with multiple-level CNN-based fusion units. The end goal of this structure is to obtain a group-level feature in the form of probability density distributions for emotion recognition. In this way, rather than simply concatenating features or applying a weighted linear combination, the separate facial features of the subjects can be naturally embedded into a unified group-level feature in NVPF, thereby boosting the performance of emotion recognition in later steps.
Formally, given a set of $N$ faces $\{f_1, f_2, \dots, f_N\}$ of $N$ subjects in a group, we first extract their representations in latent space using the emoNet structure as $x_i = \mathcal{E}(f_i)$, $i = 1, \dots, N$. These features are then stacked into a grouped feature $S$ as follows:

$$S = \mathcal{G}(x_1, x_2, \dots, x_N)$$

where $\mathcal{G}$ denotes a grouping function. Notably, there are many choices for $\mathcal{G}$, and stacking the emotion features into a matrix is one of them; any other choice can be easily adopted in this structure. Moreover, since the grouping operator still treats each $x_i$ independently, directly using $S$ for emotion recognition is equivalent to the trivial solution in which no relationship between the faces of a group is exploited. Therefore, in order to efficiently take this kind of relationship into account, we propose to model $S$ in the form of density distributions in a higher-level feature domain $\mathcal{H}$. In this way, not only is each feature $x_i$ modeled, but the relationships among the features are also naturally embedded in the distributions presented in $\mathcal{H}$. We define this mapping from the feature domain of $S$ to $\mathcal{H}$ as the fusion process; $S$ and $H \in \mathcal{H}$ can be considered as subject-level and group-level features, respectively. Let $\mathcal{F}$ be a non-linear function, with parameters $\theta$, that performs the mapping from $S$ to $H$:

$$H = \mathcal{F}(S; \theta)$$
The probability distribution of $S$ can be formulated via the change-of-variables formula as

$$p_S(S; \theta) = p_H(H) \left| \det \frac{\partial \mathcal{F}(S; \theta)}{\partial S} \right|, \qquad H = \mathcal{F}(S; \theta)$$

where $p_S$ and $p_H$ denote the densities of $S$ and $H$, and $\mathcal{F}$ is the mapping function with parameters $\theta$. Thanks to this formulation, computing the density function of $S$ is equivalent to estimating the density distribution of $H$ with an associated Jacobian matrix. By learning such a mapping function $\mathcal{F}$, we can transform the subject-level feature $S$ into an embedding $H$ with a density $p_H(H)$. It follows that if we consider $p_H(H)$ as a prior density distribution and choose a Gaussian distribution for $p_H(H)$, then $\mathcal{F}$ naturally becomes a mapping function from $S$ to a latent variable $H$ that is distributed as a Gaussian. Consequently, via $\mathcal{F}$, the subject-level features $x_i$ can be fused into a unique Gaussian-distributed feature that embeds all the information present in each $x_i$ as well as among all pairs $x_i$ and $x_j$ in $S$.
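The change-of-variables idea can be checked numerically in one dimension. The sketch below is a toy illustration, not the NVPF mapping itself: it assumes a hypothetical invertible map `f(s) = 2s + 1` and a standard Gaussian prior on the latent variable, and verifies that the induced density over `s` integrates to one.

```python
import math

def gaussian_pdf(h, mean=0.0, std=1.0):
    """Density of a 1-D Gaussian."""
    z = (h - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def f(s):
    """Toy invertible map from feature space to latent space (an assumption)."""
    return 2.0 * s + 1.0

def df_ds(s):
    """Derivative of f; in 1-D this plays the role of the Jacobian."""
    return 2.0

def density_s(s):
    """Change of variables: p_S(s) = p_H(f(s)) * |df/ds|."""
    return gaussian_pdf(f(s)) * abs(df_ds(s))

def integrate(fn, lo, hi, steps=20000):
    """Midpoint-rule numerical integration."""
    width = (hi - lo) / steps
    return sum(fn(lo + (i + 0.5) * width) for i in range(steps)) * width

total = integrate(density_s, -10.0, 10.0)
print(round(total, 6))
```

Because `f` maps `s` to a standard Gaussian, `s` itself follows a Gaussian with mean -0.5 and standard deviation 0.5. Without the Jacobian factor the induced density would integrate to 0.5 instead of 1.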
In order to enforce the non-linear property with more levels of abstraction during the information flow of the mapping $\mathcal{F}$, we construct $\mathcal{F}$ as a composition of $n$ non-linear units $\{\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_n\}$, each of which exploits different relationships between facial features within the group of subjects:

$$\mathcal{F} = \mathcal{F}_1 \circ \mathcal{F}_2 \circ \cdots \circ \mathcal{F}_n \tag{5}$$

As illustrated in Fig. 4, by representing $S$ as a feature map, the convolutional operation is very effective in exploiting the spatial relationship between the features $x_i$ in $S$. Moreover, longer-range relationships, i.e., between $x_i$ and a distant $x_j$, can be easily extracted by stacking multiple convolutional layers. Therefore, we propose to construct each mapping unit $\mathcal{F}_i$ as a composition of multiple convolution layers. As a result, $\mathcal{F}$ becomes a deep CNN with the capability of capturing the non-linear relationships embedded between faces in the group. Notice that, unlike other types of CNNs, our NVPF network is formulated and optimized based on the likelihood of $S$, and the output is the fused group-level feature $H$. Furthermore, in order to make the determinant of each unit $\mathcal{F}_i$ easy to compute, we adopt an existing structure of non-linear units, as follows:

$$Y = b \odot X + (1 - b) \odot \left[ X \odot \exp\big(\mathcal{S}(b \odot X)\big) + \mathcal{T}(b \odot X) \right]$$

where $Y$ is the output of the fusion unit $\mathcal{F}_i$, $X$ is its input, and $b$ is a binary mask whose first half is all ones and whose remaining entries are zero. $\odot$ denotes the Hadamard product. We adopt the scale and the translation as the transformations $\mathcal{S}$ and $\mathcal{T}$, respectively. In practice, the functions $\mathcal{S}$ and $\mathcal{T}$ can be implemented by a residual block with skip connections, similar to the building block of Residual Networks (ResNet). Then, by stacking fusion units together, the output $Y$ of one unit becomes the input of the next fusion unit, and so on. Finally, we have the mapping function $\mathcal{F}$ as defined in Eqn. (5).
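A fusion unit of this kind, with a binary mask, a scale and a translation, can be illustrated with the affine coupling form common in non-volume-preserving flows. The sketch below assumes that form. `scale_fn` and `translate_fn` are fixed toy functions standing in for learned networks, and every name is hypothetical. It shows that the unit is exactly invertible and that its log-determinant is just a sum of scale outputs.

```python
import math

def scale_fn(v):
    """Stand-in for the learned scale network (a fixed toy function)."""
    return [math.tanh(0.5 * sum(v))] * len(v)

def translate_fn(v):
    """Stand-in for the learned translation network (a fixed toy function)."""
    return [0.1 * sum(v)] * len(v)

def coupling_forward(x, b):
    """y = b*x + (1-b) * (x*exp(s(b*x)) + t(b*x)), plus log|det Jacobian|."""
    masked = [bi * xi for bi, xi in zip(b, x)]
    s, t = scale_fn(masked), translate_fn(masked)
    y = [bi * xi + (1 - bi) * (xi * math.exp(si) + ti)
         for bi, xi, si, ti in zip(b, x, s, t)]
    # The Jacobian is triangular, so its log-determinant is a plain sum.
    log_det = sum((1 - bi) * si for bi, si in zip(b, s))
    return y, log_det

def coupling_inverse(y, b):
    """Masked entries pass through unchanged, so s and t can be recomputed."""
    masked = [bi * yi for bi, yi in zip(b, y)]
    s, t = scale_fn(masked), translate_fn(masked)
    return [bi * yi + (1 - bi) * (yi - ti) * math.exp(-si)
            for bi, yi, si, ti in zip(b, y, s, t)]

x = [0.3, -1.2, 0.7, 2.0]
b = [1, 1, 0, 0]            # first half ones, second half zeros
y, log_det = coupling_forward(x, b)
x_rec = coupling_inverse(y, b)
print(y, log_det, x_rec)
```

Stacking several such units, with the mask alternated between units so that every entry is eventually transformed, gives a deep invertible map whose total log-determinant is the sum of the per-unit values.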
Model Learning. The parameters $\theta$ of NVPF can be learned by maximizing the log-likelihood or, equivalently, minimizing the negative log-likelihood as follows:

$$\theta^{*} = \arg\min_{\theta} \; -\log p_S(S; \theta) = \arg\min_{\theta} \; -\left[ \log p_H(H) + \log \left| \det \frac{\partial \mathcal{F}(S; \theta)}{\partial S} \right| \right]$$

In order to further enhance the discriminative property of the features $H$, during training we choose a different Gaussian distribution (i.e., a different mean and standard deviation) for each emotion class. After optimizing the parameters $\theta$, $\mathcal{F}$ is capable of both transforming subject-level features into a group-level feature and enforcing that feature to follow the distribution of the corresponding emotion class.
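The role of class-specific Gaussian priors can be sketched as follows. This is a minimal illustration under stated assumptions: the means and standard deviations in `CLASS_PRIORS` are hypothetical (the text does not list the values used), the Gaussians are taken to be isotropic, and `log_det` is assumed to be the log-determinant returned by the fusion mapping.

```python
import math

def log_gaussian(h, mean, std):
    """Log-density of an isotropic Gaussian evaluated at vector h."""
    return sum(
        -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)
        for x in h
    )

# Hypothetical (mean, std) per emotion class; not values from the paper.
CLASS_PRIORS = {
    "positive": (2.0, 1.0),
    "negative": (-2.0, 1.0),
    "neutral": (0.0, 1.0),
}

def negative_log_likelihood(h, log_det, label):
    """Training loss for one sample: -(log p_H(h | label) + log|det J|)."""
    mean, std = CLASS_PRIORS[label]
    return -(log_gaussian(h, mean, std) + log_det)

def predict(h):
    """Pick the class whose prior assigns the fused feature the highest density."""
    return max(CLASS_PRIORS, key=lambda c: log_gaussian(h, *CLASS_PRIORS[c]))

h = [1.8, 2.3, 1.9]         # a fused feature that landed near the "positive" prior
print(predict(h))
print(negative_log_likelihood(h, 0.0, "positive"),
      negative_log_likelihood(h, 0.0, "negative"))
```

Minimizing this loss with the true label pulls each fused feature toward the Gaussian of its own class, which is what makes the fused features discriminative.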
3.3 Temporal Non-volume Preserving Fusion (TNVPF)
In this section, we describe how to extend the NVPF proposed in sub-section 3.2 to a temporal-spatial fusion framework named Temporal NVPF (TNVPF), which handles videos instead of images while preserving the temporal information of the input videos. The main idea is to propagate the fused information from preceding frames. Thus, we reformulate Gated Recurrent Units (GRUs) in such a way that we can perform end-to-end training with the NVPF framework. GRUs are widely used in time-series problems such as speech recognition, video segmentation, and scene parsing and prediction.
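Unlike those approaches, our TNVPF unit is defined as a connecting block of NVPF together with memory and hidden units/states. The TNVPF structure is defined as

$$
\begin{aligned}
z_t &= \sigma(U_z H_t + W_z h_{t-1}) \\
r_t &= \sigma(U_r H_t + W_r h_{t-1}) \\
\tilde{h}_t &= \tanh\big(U_h H_t + W_h (r_t \odot h_{t-1})\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \\
\hat{y}_t &= \mathrm{softmax}(W_o h_t + b_o)
\end{aligned}
$$

where each $U$ is an input-to-hidden weight matrix and each $W$ is a state-to-state recurrent weight matrix. The input of TNVPF is the fused feature $H_t$ produced by the proposed NVPF given the input $S_t$ at frame $t$. At timestep $t$, each TNVPF has a reset gate $r_t$, an update gate $z_t$, the activation state $h_t$, and the new candidate memory content $\tilde{h}_t$. TNVPF gives an output $\hat{y}_t$, from which the label (positive, negative, or neutral) of the current frame is obtained based on the fused features of the current frame and the hidden state of the previous frame. Fig. 5 shows the overall end-to-end TNVPF framework for group-level ER on videos. TNVPF can be optimized by minimizing the negative log-likelihood of the training sequences as

$$\mathcal{L}(\theta_1, \theta_2) = -\sum_{t=1}^{T} \sum_{c=1}^{C} \mathbb{1}[y_t = c] \, \log \hat{y}_{t,c}$$

where $C$ is the number of classes ($C = 3$), $\theta_1$ and $\theta_2$ are the parameters of the TNVPF, $y_t$ is the emotion label of the $t$-th video frame, and $W_o$ and $b_o$ are the weight and bias for the hidden-to-output connections of TNVPF.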
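The recurrent update and the sequence loss can be sketched with a standard GRU step. The code below is a minimal illustration, assuming the usual GRU gate equations without bias terms and a softmax output layer. The tiny weight matrices in `PARAMS` and the two input frames are hypothetical, and each frame is assumed to be an already-fused feature vector.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def gru_step(H_t, h_prev, p):
    """One recurrent update from the fused feature H_t and previous state h_prev."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Uz"], H_t), matvec(p["Wz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Ur"], H_t), matvec(p["Wr"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]          # reset gate applied
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Uh"], H_t), matvec(p["Wh"], gated))]
    # Update gate blends the old state with the candidate memory content.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]

def sequence_nll(frames, labels, p):
    """Sum of per-frame cross-entropy terms over a video."""
    h = [0.0] * len(p["Wz"])
    loss = 0.0
    for H_t, y in zip(frames, labels):
        h = gru_step(H_t, h, p)
        logits = [a + b for a, b in zip(matvec(p["Wo"], h), p["bo"])]
        loss -= math.log(softmax(logits)[y])
    return loss

# Hypothetical toy parameters: input size 2, hidden size 2, three classes.
PARAMS = {
    "Uz": [[0.5, -0.3], [0.1, 0.2]], "Wz": [[0.2, 0.0], [0.0, 0.2]],
    "Ur": [[-0.4, 0.3], [0.2, 0.1]], "Wr": [[0.1, 0.0], [0.0, 0.1]],
    "Uh": [[0.6, 0.1], [-0.2, 0.5]], "Wh": [[0.3, 0.0], [0.0, 0.3]],
    "Wo": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], "bo": [0.0, 0.0, 0.0],
}
frames = [[0.2, -0.1], [0.4, 0.3]]
labels = [0, 2]             # class indices, e.g. 0 = positive, 2 = neutral
loss = sequence_nll(frames, labels, PARAMS)
print(loss)
```

In an end-to-end setting the gradient of this loss would flow through the recurrent weights and back into the fusion mapping that produced each frame feature. The sketch only evaluates the forward pass.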
Table 3: Comparison of the proposed GECV dataset with other emotion databases.
|Emotion Databases||Data Type||Group-type||No. Images/Videos||Condition||No. Emotion Classes|
|AffectNet||Images||Individual||1.5M images||in-the-wild||8 emotion categories|
|EmotioNet||Images||Individual||1M images||in-the-wild||23 emotion categories|
|EmotiW-Group||Images||Group||17k images||in-the-wild||3 classes|
|EmotiW-Video||Videos||Individual||1426 short videos||in-the-wild||7 emotion categories|
|Our GECV||Videos||Group (20 or more subjects)||627 videos (10 to 20 s)||in-the-wild||3 classes|
4 Experimental Results
In this section, we first introduce our newly collected GECV dataset for ER on crowd videos in sub-section 4.1. Then, the proposed emoNet is benchmarked and compared against prior ER methods on the AffectNet database in sub-section 4.2. The proposed NVPF approach is evaluated and compared against established methods on the EmotiW2017 and EmotiW2018 challenges in sub-section 4.3. Finally, our proposed TNVPF framework is evaluated on the GECV crowd video dataset in sub-section 4.4.
4.1 GECV Dataset
In this section, we introduce our newly collected database, named Group-level Emotion on Crowd Videos (GECV), to study group-level ER in crowd videos. The GECV dataset contains 627 videos in total, with 204 positive videos, 202 negative videos, and 221 neutral videos. Each video has about 300 frames and ranges in duration from 10 to 20 seconds. Each video frame contains 20 people or more, which we define as the minimum for a crowd. The facial emotions in these videos are grouped into three emotion states: positive, negative, or neutral. The ground truth for these videos is manually labeled.
To the best of our knowledge, the proposed GECV is the first video database that contains video footage and annotations for group-level ER on videos. A comparison between the properties of this database and others is presented in Table 3. All videos have been collected by using search engines such as Google and YouTube to locate videos that may contain crowds as defined above. Search keywords such as festival, marching, wedding party, parade, funeral, game shows, sport, stadium, and congress meeting are used to find candidate videos. To create diversity among videos, we translate the keywords into different languages to obtain videos from various places. All chosen videos are of high quality, i.e., more than 480p in resolution.
4.2 Benchmarking the proposed emoNet on Single Subject Emotion
To demonstrate the effectiveness of the proposed emoNet in recognizing the facial expression of a single subject, we use the AffectNet dataset to benchmark the proposed network and compare it against other state-of-the-art work, including AlexNet (reported baseline), ResNet-18, ResNet-34, ResNet-101, DenseNet-121, MobileNetV1, and MobileFaceNet. The AffectNet database is organized into 415,000 images for training and 5,500 images for validation. All the images are manually annotated with seven facial expression categories. However, the training set of this database is highly imbalanced: for example, the "happy" class has about 100K images, whereas other classes such as fear or disgust have only a few thousand images. Fig. 6 shows the performance of our proposed emoNet compared against other networks on the AffectNet database. While emoNet recognizes emotion with high accuracy, its model size remains small.
Table 4: State-of-the-art approaches on the EmotiW2017 and EmotiW2018 group-level challenges.
|Model||EmotiW||Network / Feature||Fusion scheme||Fusion Stage||mAC||UAR||F1||Neu||Pos||Neg|
|Tan et al.||2017||SphereFace||Averaging||Score||74.1%||–||–||–||–||–|
|Wei et al.||2017||VGG-Face||LSTM||Feature||74.14%||–||–||–||–||–|
|Rassadin et al.||2017||VGG-Face||Median||Feature||70.11%||–||–||–||–||–|
|Khan et al.||2018||ResNet18||Averaging||Score||69.72%||–||–||–||–||–|
|Gupta et al.||2018||SphereFace||Averaging||Feature||73.03%||–||–||–||–||–|
|Gupta et al.||2018||SphereFace||Attention||Feature||74.38%||–||–||–||–||–|
|Guo et al.||2018||VGG-Face||Concat||Feature||74%||0.74||–||66.48%||75.68%||81.38%|

Table 5: Performance of FeC with different temporal models on the GECV dataset.
|FeC + RNN||59.68%||65%||50%||63.64%|
|FeC + LSTM||69.35%||70%||50%||86.36%|
4.3 Benchmarking the proposed NVPF on Group-level Emotion
In this section, the group-level datasets from both the EmotiW 2017 and 2018 challenges are used to benchmark the proposed NVPF fusion mechanism and compare it against other recent works on group-level ER with different fusion strategies. EmotiW 2017 contains 3,630 training, 2,068 validation, and 772 testing images. EmotiW 2018 is an extension of EmotiW 2017 with 9,815 images for training, 4,346 images for validation, and 3,011 for testing. In order to evaluate only the proposed NVPF component and compare it against other fusion mechanisms, we conduct experiments on emoNet (Sec. 3.1) using different fusion strategies: averaging score fusion, concatenating feature fusion, and NVPF. We name (i) Fused emoNetA (FeA) the framework in which emoNet is used for facial expression extraction together with score-level fusion by averaging; (ii) Fused emoNetB (FeB) the framework in which emoNet is used for facial expression extraction together with feature-level fusion by concatenation; and (iii) Fused emoNetC (FeC) the framework in which emoNet is used for facial expression extraction together with feature-level fusion by the proposed NVPF. The performance of the three frameworks FeA, FeB, and FeC is evaluated on the EmotiW2018 challenge. Overall/mean accuracy, per-class accuracy, mean F1, and Unweighted Average Recall (UAR) are reported in this experiment. Table 4 summarizes the state-of-the-art approaches on the EmotiW2017 and EmotiW2018 challenges and the performance of our emoNet model with different fusion schemes (FeA, FeB, FeC) on EmotiW 2018. As we can see from Table 4, our model, which uses the emoNet network to extract features and the NVPF scheme to fuse those features, gives the best results among all group-level ER approaches on EmotiW2018.
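The reported metrics can all be derived from a confusion matrix. The sketch below is a generic illustration with a hypothetical three-class matrix, not results from the paper: rows are taken to be true classes and columns predicted classes, and "mean F1" is assumed to mean the macro-averaged F1 score.

```python
def per_class_recall(cm):
    """Per-class accuracy (recall): correct predictions over true samples per class."""
    return [cm[i][i] / sum(cm[i]) if sum(cm[i]) else 0.0 for i in range(len(cm))]

def overall_accuracy(cm):
    """Fraction of all samples that were classified correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def unweighted_average_recall(cm):
    """UAR: mean of per-class recalls, so every class counts equally."""
    recalls = per_class_recall(cm)
    return sum(recalls) / len(recalls)

def macro_f1(cm):
    """Mean of per-class F1 scores (macro average)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))   # column sum
        actual = sum(cm[i])                           # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / n

# Hypothetical confusion matrix for (positive, negative, neutral).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
print(overall_accuracy(cm), unweighted_average_recall(cm), macro_f1(cm))
```

On a balanced test set overall accuracy and UAR coincide, as they do here. They diverge when class sizes differ, which is why both are worth reporting.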
4.4 Benchmarking the proposed TNVPF on Group-level Emotion on Crowd Videos
In this section, we use the GECV dataset to benchmark the proposed TNVPF for recognizing group-level emotion on crowd videos. The GECV dataset contains 627 crowd videos, which are partitioned into 90% for training and 10% for testing (565 videos for training and 62 videos for testing). In addition to the proposed TNVPF, we also examine the performance of NVPF with other temporal models, namely vanilla RNNs and Long Short-Term Memory (LSTM). As shown in the previous experiment, FeC with the proposed NVPF fusion mechanism gives the best performance on EmotiW; thus, we choose FeC for further evaluation in this section. Table 5 shows the performance of FeC with different temporal models (FeC+RNNs, FeC+LSTM, and TNVPF); the proposed TNVPF, which is built upon the FeC model and a derivation of GRUs, obtains the best performance. Fig. 7 illustrates the confusion matrices of those three approaches (FeC+RNNs, FeC+LSTM, and TNVPF).
5 Conclusion
This paper has first presented a high-performance and low-computation network, named emoNet, for robustly extracting facial expression features. Then, a new fusion mechanism, NVPF, is proposed to deal with group-level emotion in crowds where multiple emotions may occur within a frame and human faces are not always clearly identified, e.g., large sports scenes in which faces naturally appear at low resolution. The proposed NVPF is extended to TNVPF in order to model the temporal information between frames in crowd videos. To demonstrate the robustness and effectiveness of each component, three different experiments have been conducted: the proposed emoNet is benchmarked and compared against other recent work on the AffectNet database, the presented NVPF fusion mechanism is evaluated on the EmotiW 2018 database, and the proposed TNVPF is examined on our novel facial expression dataset GECV, collected from social media. Our GECV dataset contains 627 crowd videos, which are labeled as positive, negative, or neutral and range in duration from 10 to 20 seconds, with twenty or more people in each frame.
References
- [1] A. Abbas and S. K. Chalup. Group emotion recognition in the wild by combining deep neural networks for facial expression classification and scene-context analysis. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI 2017, pages 561–568, 2017.
- [2] S. A. Bargal, E. Barsoum, C. C. Ferrer, and C. Zhang. Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 433–436, 2016.
- [3] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573, 2018.
- [4] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
- [5] A. Dhall, J. Joshi, K. Sikka, R. Goecke, and N. Sebe. The more the merrier: Analysing the affect of a group of people in images. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.
- [6] A. Dhall, A. Kaur, R. Goecke, and T. Gedeon. Emotiw 2018: Audio-video, student engagement and group-level affect prediction. In Proceedings of the 2018 on International Conference on Multimodal Interaction, pages 653–656. ACM, 2018.
- [7] C. N. Duong, K. G. Quach, K. Luu, T. H. N. Le, and M. Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2017.
- [8] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal. Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 467–474, 2015.
- [9] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5562–5570, 2016.
- [10] Y. Fan, J. C. Lam, and V. O. Li. Video-based emotion recognition using deeply-supervised neural networks. In Proceedings of the 2018 on International Conference on Multimodal Interaction, pages 584–588. ACM, 2018.
- [11] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recognition using cnn-rnn and c3d hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 445–450. ACM, 2016.
- [12] X. Guo, L. F. Polanía, and K. E. Barner. Group-level emotion recognition using deep models on image scene, faces, and skeletons. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 603–608. ACM, 2017.
- [13] A. Gupta, D. Agrawal, H. Chauhan, J. Dolz, and M. Pedersoli. An attention model for group-level emotion recognition. In Proceedings of the 2018 on International Conference on Multimodal Interaction, ICMI '18, pages 611–615, New York, NY, USA, 2018. ACM.
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
- [17] P. Hu, D. Cai, S. Wang, A. Yao, and Y. Chen. Learning supervised scoring ensemble for emotion recognition in the wild. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 553–560. ACM, 2017.
- [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
- [19] S. E. Kahou, P. Froumenty, and C. Pal. Facial expression analysis based on high dimensional binary features. In Computer Vision - ECCV 2014 Workshops, pages 135–147, 2015.
- [20] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, M. Mirza, S. Jean, P.-L. Carrier, Y. Dauphin, N. Boulanger-Lewandowski, A. Aggarwal, J. Zumer, P. Lamblin, J.-P. Raymond, G. Desjardins, R. Pascanu, D. Warde-Farley, A. Torabi, A. Sharma, E. Bengio, M. Côté, K. R. Konda, and Z. Wu. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, pages 543–550, 2013.
- [21] A. S. Khan, Z. Li, J. Cai, Z. Meng, J. O'Reilly, and Y. Tong. Group-level emotion recognition using deep models with a four-stream hybrid network. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 623–629, 2018.
- [22] B. Knyazev, R. Shvetsov, N. Efremova, and A. Kuharenko. Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint arXiv:1711.04598, 2017.
- [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [24] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen. Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 494–501, 2014.
- [25] A. Mollahosseini, B. Hasani, and M. H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, pages 1–1, 2018.
- [26] A. Rassadin, A. Gruzdev, and A. Savchenko. Group-level emotion recognition using transfer learning from face identification. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 544–548. ACM, 2017.
- [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- [28] C. Shan, S. Gong, and P. W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vision Comput., 27(6):803–816, May 2009.
- [29] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao. Group emotion recognition with individual facial emotion cnns and global image based cnns. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 549–552. ACM, 2017.
- [30] Q. Wei, Y. Zhao, Q. Xu, L. Li, J. He, L. Yu, and B. Sun. A new deep-learning framework for group emotion recognition. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 587–592. ACM, 2017.