Deep Affect Prediction in-the-wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond
Automatic understanding of human affect using visual signals is of great importance in everyday human-machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the activation of the emotion) constitute the most popular and effective affect representations. Nevertheless, the majority of datasets collected thus far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network (CNN-RNN) layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the Aff-Wild database for learning features, which can be used as priors for achieving the best performance in both dimensional and categorical emotion recognition on the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal.
Keywords: deep, convolutional, recurrent, Aff-Wild, database, facial, dimensional, categorical, emotion recognition, valence, arousal, AffWildNet
Current research in automatic analysis of facial affect aims at developing systems, such as robots and virtual humans, that will interact with humans in a naturalistic way under real-world settings. To this end, such systems should automatically sense and interpret facial signals relevant to emotions, appraisals and intentions. Moreover, since real-world settings entail uncontrolled conditions, where subjects operate in a diversity of contexts and environments, systems that perform automatic analysis of human behavior should be robust to video recording conditions, the diversity of contexts and the timing of display. (It is well known that the interpretation of a facial expression may depend on its dynamics, e.g., posed vs. spontaneous expressions zeng2009survey.)
For the past twenty years research in automatic analysis of facial behavior was mainly limited to posed behavior which was captured in highly controlled recording conditions pantic2005web; valstar2010induced; tian2001recognizing; lucey2010extended. Some representative datasets, which are still used in many recent works jung2015joint, are the Cohn-Kanade database tian2001recognizing; lucey2010extended, MMI database pantic2005web; valstar2010induced, Multi-PIE database gross2010multi and the BU-3D and BU-4D databases yin20063d; yin2008high.
Nevertheless, it is now accepted by the community that the facial expressions of naturalistic behaviors can be radically different from the posed ones corneanu2016survey; sariyanidi2015automatic; zeng2009survey. Hence, efforts have been made to collect recordings of subjects displaying naturalistic behavior. Examples include the recently collected EmoPain Emopain and UNBC-McMaster lucey2011painful databases for analysis of pain, the RU-FACS database of subjects participating in a false opinion scenario bartlett2006fully and the SEMAINE corpus mckeown2012semaine, which contains recordings of subjects interacting with a Sensitive Artificial Listener (SAL) in controlled conditions. All the above databases have been captured in well-controlled recording conditions and mainly under a strictly defined scenario (e.g., eliciting pain).
Representing human emotions has been a basic topic of research in psychology. The most frequently used emotion representation is the categorical one, including the seven basic categories, i.e., Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral dalgleish2000handbook; cowie2003describing. It is, however, the dimensional emotion representation whissel1989dictionary; russell1978evidence which is more appropriate for representing subtle, i.e., not only extreme, emotions appearing in everyday human-computer interactions. To this end, the 2-D Valence and Arousal Space is the most common dimensional emotion representation. Figure 1 shows the 2-D Emotion Wheel plutchik1980emotion, with valence ranging from very positive to very negative and arousal ranging from very active to very passive.
Some emotion recognition databases exist in the literature that utilize dimensional emotion representation. Examples are the SAL douglas2008sensitive, SEMAINE mckeown2012semaine, MAHNOB-HCI soleymani2012multimodal, Belfast naturalistic (https://belfast-naturalistic-db.sspnet.eu/), Belfast induced sneddon2012belfast, DEAP koelstra2012deap, RECOLA ringeval2013introducing and AFEW-VA kossaifi2017afew databases.
Currently, there are many challenges (competitions) in the behavior analysis domain. One such example is the Audio/Visual Emotion Challenges (AVEC) series valstar2013avec; valstar2014avec; ringeval2015avec; valstar2016avec; ringeval2017avec which started in 2011. The first challenge schuller2011avec (2011) used the SEMAINE database for classification purposes by binarizing its continuous values, while the second challenge schuller2012avec (2012) used the same database but with its original values. The last challenge (2017) ringeval2017avec utilized the SEWA database. Before this and for two consecutive years (2015 ringeval2015avec, 2016 valstar2016avec) the RECOLA database was used.
However, these databases have some of the following limitations, as shown in Table 1:
they contain data recorded in laboratory or controlled environments;
their diversity is limited due to the small total number of subjects they contain, the limited head pose variations and occlusions, and the static background or uniform illumination;
they contain only a small amount of data.
|27||20||34.9 - 117 secs||controlled|
|AFEW-VA||600||600||0.5 - 4 secs||in-the-wild|
|125||298||10 - 60 secs||controlled|
|37||37||5 - 30 secs||controlled|
To tackle the aforementioned limitations, we collected the first, to the best of our knowledge, large-scale database captured in-the-wild and annotated it in terms of valence and arousal. To do so, we capitalised on the abundance of data available in video-sharing websites, such as YouTube youtube2011youtube, and selected videos that display the affective behavior of people, for example videos that display the behavior of people when watching a trailer, a movie or a disturbing clip, or their reactions to pranks. (The collection has been conducted under the scrutiny and approval of Imperial College Ethical Committee (ICREC). The majority of the chosen videos were under Creative Commons License (CCL). For those videos that were not under CCL, we have contacted the person who created them and asked for their approval to be used in this research.)
To this end we have collected 298 videos displaying reactions of 200 subjects. This database has been annotated by 6-8 lay experts with regard to two continuous emotion dimensions, i.e. valence and arousal. We then organized the Aff-Wild Challenge (https://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge/), which was based on the Aff-Wild database. The participating teams submitted their results to the challenge, outperforming the provided baseline. However, as described later in this paper, the achieved performances were rather low.
For this reason, we capitalized on the Aff-Wild database to build deep neural CNN and CNN plus RNN architectures shown to achieve excellent performance on this database, outperforming all previous participants' performances. We have conducted extensive experiments, testing different structures for combining convolutional and GRU/B-LSTM recurrent neural networks in these architectures. We have used the concordance correlation coefficient (CCC) as a loss function, also comparing it with the usual mean squared error (MSE) criterion. Additionally, we appropriately fused, within the network structures, two types of inputs, the 2-D facial images - presented at the input of the end-to-end architecture - and the 2-D facial landmark positions - presented at the 1st Fully Connected layer of the architecture.
We have also investigated the use of the created CNN-RNN architecture for Valence-Arousal estimation in other datasets, focusing on RECOLA and AFEW-VA. Last but not least, taking into consideration the large size and in-the-wild nature of this database, we show that our network can also be used for other emotion recognition tasks, such as classification into the universal expressions. The only challenge that uses 'in-the-wild' data is the EmotiW series dhall2013emotion; dhall2014emotion; dhall2015video; dhall2016emotiw; dhall2017individual. It uses the AFEW dataset, whose samples come from movies, TV shows and series. To the best of our knowledge, this is the first time that a dimensional database, and features extracted from it, are used as priors for categorical emotion recognition in-the-wild, exploiting the EmotiW Challenge dataset.
To summarize, there exist several databases for dimensional emotion recognition. However, they have limitations, mostly due to the fact that they are not captured in-the-wild (i.e., not in uncontrolled conditions). This urged us to create the benchmark Aff-Wild database and organize the Aff-Wild Challenge. The results acquired will be presented later in full detail. We proceeded to conduct experiments and build CNN and CNN plus RNN architectures, including the AffWildNet, producing state-of-the-art results. Upon acceptance of this article, the AffWildNet's weights will be made publicly available. We also used the Aff-Wild database as a prior for dimensional and categorical emotion recognition, again producing state-of-the-art results.
The rest of the paper is organized as follows. Section 2 presents the databases generated and used in the presented experiments. Section 3 describes the pre-processing and annotation methodologies that we used. Section 4 describes the Aff-Wild Challenge that was organized, the baseline method, the methodologies of the participating teams and their results. It also presents the end-to-end deep neural architectures which we developed and evaluated and the experimental studies and results which illustrate all theoretical developments. Section 5 describes how the AffWildNet can be used as a prior for other emotion recognition problems yielding state-of-the-art results. Finally, Section 6 presents the conclusions and future work following the reported developments.
2 Existing Databases
We briefly present the currently available databases for the emotion recognition task and mention their limitations, which led to the creation of Aff-Wild.
2.1 RECOLA Dataset
The REmote COLlaborative and Affective (RECOLA) database was introduced by Ringeval et al. ringeval2013introducing. It contains natural and spontaneous emotions annotated in the continuous domain (arousal and valence) on audio-visual data. The corpus includes four modalities: audio, video, electro-dermal activity (EDA) and electro-cardiogram (ECG). It consists of recordings of 46 French-speaking subjects, amounting to 9.5 h in total. The recordings were annotated for 5 minutes each by 6 French-speaking annotators (three male, three female). The dataset is divided into three parts, namely, train (16 subjects), validation (15 subjects) and test (15 subjects), in such a way that gender, age and mother tongue are stratified (i.e., balanced). The limitations of this dataset therefore include the small number of subjects, as well as the tightly controlled laboratory environment (constant lighting conditions).
2.2 The EmotiW Datasets
The series of EmotiW challenges dhall2013emotion; dhall2014emotion; dhall2015video; dhall2016emotiw; dhall2017individual make use of the data from the Acted Facial Expression In The Wild (AFEW) dataset dhall2012collecting. This dataset is a dynamic temporal facial expressions data corpus consisting of close to real world scenes extracted from movies and reality TV shows. In total it contains 1809 videos. The whole dataset is split into three sets: training set (773 video clips), validation set (383 video clips) and test set (653 video clips). It should be emphasized that both training and validation sets are mainly composed of real movie records, however 114 out of 653 video clips in the test set are real TV clips, thus increasing the difficulty of the challenge. The number of subjects is more than 330, aged 1-77 years. The annotation is according to 7 facial expressions (Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise). These challenges focus on audio-video classification of each clip into the seven basic emotion categories. The limitations of this dataset include its small size and being restricted to only seven basic emotion categories, some of which are quite imbalanced (fear, disgust, surprise).
2.3 The AFEW-VA Database
Very recently, a part of the AFEW dataset of the series of EmotiW challenges has been annotated in terms of valence and arousal, thus creating the so-called AFEW-VA kossaifi2017afew database. In total, it contains 600 video clips that were extracted from feature films and simulate real-world conditions, i.e., occlusions, different illumination conditions and free movements from subjects. The videos range from short (around 10 frames) to longer clips (more than 120 frames). This database provides accurate per-frame annotations of valence and arousal. In total, more than 30,000 frames were annotated for dimensional affect prediction of arousal and valence, using discrete values in the range of $[-10, 10]$. The database's limitations include its small size and the use of discrete values for valence and arousal.
2.4 The Aff-Wild Database
We created a database consisting of 298 videos, with a total length of more than 30 hours. The aim was to collect spontaneous facial behavior in arbitrary recording conditions. To this end, the videos were collected using the YouTube video sharing website. The main keyword that was used to retrieve the videos was "reaction". The database displays subjects reacting to a variety of stimuli (e.g., from watching a video to tasting something hot or disgusting). Examples include subjects reacting to an unexpected plot twist of a movie or series, a trailer of a highly anticipated movie, etc. The subjects display both positive and negative emotions (or combinations of them). In other cases, subjects display emotions while performing an activity (e.g., riding a roller coaster). In some videos, subjects react to a practical joke, or to positive surprises (e.g., a gift). The videos contain subjects of different genders and ethnicities with high variations in head pose and lighting.
Most of the videos were in YUV 4:2:0 format, with some of them being in AVI format. Six to eight annotators have annotated the videos in terms of valence and arousal, following a methodology similar to the one proposed in cowie2000feeltrace. An online annotation procedure was used, according to which annotators watched each video and provided their annotations through a joystick. Valence and arousal range continuously in $[-1, 1]$. All the subjects present in a video have been annotated. The total number of subjects is 200, with 130 of them being male and 70 of them female. Table 2 shows attributes of the Aff-Wild database. Figure 2 shows some frames from the Aff-Wild database.
|Total no of videos||252(train)+46(test) = 298|
|Length of videos||0.10-14.47 min|
|Video format||AVI , MP4|
|Average Image Resolution (AIR)||607 x 359|
|Standard deviation of AIR||85 x 11|
|Median Image Resolution||640 x 360|
|No of annotators||6-8|
In Figures 3, 4 and 5 we present three characteristic examples of facial images taken from three different videos, with their respective video frame number, and the valence and arousal annotation for each of them. We also present a visual representation of these values on the 2-D emotion space, showing the change of the reactions/behavior of the person among these time instances of the video. Time evolution is indicated by using a larger size for the more recent frames and a smaller size for the older ones. Figure 6 provides a histogram of the annotated values for valence and arousal in the generated database.
3 Annotation and data processing
3.1 Annotation tool
For data annotation, we developed our own application that builds on other existing ones, like Feeltrace cowie2000feeltrace and Gtrace cowie2012tracing. A time-continuous annotation is performed for each affective dimension, with the annotation process being as follows: (a) the user selects whether to annotate valence or arousal, (b) a login screen appears, where the user enters an identifier (e.g. his/her name) and selects an appropriate joystick, (c) the screen is split into two parts: a scrolling list of all videos is given on the left side and a scrolling list of all annotated videos on the right side, (d) the user selects a video to annotate and a screen appears that shows the video and a slider of annotation values, (e) the video can be annotated by moving the joystick either up or down. At the same time, our application samples the annotations at a variable time rate. Figure 7 shows the graphical interface of our tool when annotating valence (the interface for arousal is similar).
3.2 Annotation guidelines
Each annotator was orally instructed and also received a multi-page document explaining in detail the procedure to follow for the annotation task. This document included a short list of some well-identified emotional cues for both arousal and valence, in order to provide a common introduction on emotions to the annotators, even though they were instructed to rely on their own appraisal of the subject's emotional state when annotating. (All annotators were computer scientists who were working on face analysis problems and all had a working understanding of facial expressions.) Before starting the annotation of the data, each annotator watched the whole video so as to know what to expect regarding all emotions being depicted in the video.
3.3 Data pre-processing
VirtualDub lee2002welcome was used in order to trim the raw YouTube videos, mainly at the beginning and end-points, so as to remove useless content (e.g., an advertisement). Then another pre-processing step was applied in order to locate the faces in all frames of the videos. In more detail, we extracted a total of 1,180,000 frames using the Menpo software menpo14. From each frame, we detected the faces using the method described in mathias2014face. Next, we extracted facial landmarks for all frames using the best performing method as indicated in chrysos2016comprehensive. We removed frames in which the bounding box or landmark detection failed. In Figures 8(a) and 8(b), we illustrate examples of tracked landmarks from the same subject in a particular video and from different subjects in many videos, respectively.
In addition, a matching process was developed between the annotation time stamps and the cropped faces' time instances. More particularly, for each frame time instance, we searched for the nearest neighbor in the annotation time stamp sequence and then linked the corresponding valence and arousal annotation values to the frame time stamp. In cases where two annotation time stamps were at the same time distance from a frame, we computed the average of the annotation values at those two time stamps and attributed this value to the frame.
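The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool used for the database: the function name `match_annotations` and the array-based interface are hypothetical, the annotation time stamps are assumed to be sorted, and a tie is resolved by averaging the two annotation values.

```python
import numpy as np

def match_annotations(frame_times, ann_times, ann_values):
    """Give each frame the annotation value at the nearest annotation time stamp.

    ann_times is assumed to be sorted in increasing order. If two annotation
    time stamps are equally close to a frame, their values are averaged.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    ann_times = np.asarray(ann_times, dtype=float)
    ann_values = np.asarray(ann_values, dtype=float)

    # Index of the first annotation time stamp >= each frame time.
    right = np.searchsorted(ann_times, frame_times, side="left")
    right = np.clip(right, 0, len(ann_times) - 1)
    left = np.clip(right - 1, 0, len(ann_times) - 1)

    d_left = np.abs(frame_times - ann_times[left])
    d_right = np.abs(ann_times[right] - frame_times)

    # Nearest neighbour: take the left value only if it is strictly closer.
    matched = np.where(d_left < d_right, ann_values[left], ann_values[right])

    # Exact ties between two distinct time stamps: average the two values.
    tie = np.isclose(d_left, d_right) & (left != right)
    matched[tie] = 0.5 * (ann_values[left[tie]] + ann_values[right[tie]])
    return matched

# Hypothetical example: three annotation samples, four frames.
valence_per_frame = match_annotations(
    frame_times=[0.1, 0.5, 1.6, 3.0],
    ann_times=[0.0, 1.0, 2.0],
    ann_values=[0.2, 0.4, -0.6],
)
```

Frames that fall before the first or after the last annotation sample simply receive the first or last annotation value in this sketch; how such frames were handled in practice is not stated in the text.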
Since many of the pre-trained networks, such as the VGG series of networks, operate on images with a resolution of $224 \times 224$, we have resized the facial images to this resolution. We also experimented with smaller sizes. The images' intensity values were normalized.
3.4 Annotation Post-processing
We further extended our annotation tool to display the annotated valence and arousal values while playing a specific video. Every expert-annotator (i.e., the annotators that are working directly on the problem of valence and arousal estimation) watched all videos for a second time, in order to verify that the recorded annotations were in accordance with the videos, that is, satisfactorily depicting the emotion expressed at all times. In this way, a further validation of annotations was achieved. Using this procedure some frames were dropped, especially at the end of the videos.
After the annotations had been validated, cross correlations between all annotators were computed for every video. Subsequently, two more experts watched all videos and, for every video, validated the most correlated annotations (between 2 and 4 annotations). We then computed the mean of these annotations. Figure 9 shows a small part of a video (2000 frames) and the 4 most highly correlated annotations for valence.
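A rough sketch of this fusion step is given below. It assumes that all annotation traces of a video have already been aligned to the same frames, and it replaces the expert validation with a purely automatic rule (keep the `k` annotators with the highest mean Pearson correlation to the others); that rule, the function name `fuse_annotations` and the synthetic traces are illustrative assumptions, not the procedure actually followed.

```python
import numpy as np

def fuse_annotations(annotations, k=4):
    """annotations: array of shape (n_annotators, n_frames).

    Returns the indices of the k annotators that agree most with the others
    and the per-frame mean of their traces.
    """
    annotations = np.asarray(annotations, dtype=float)
    corr = np.corrcoef(annotations)        # pairwise Pearson correlations
    np.fill_diagonal(corr, np.nan)         # ignore self-correlation
    agreement = np.nanmean(corr, axis=1)   # mean correlation with the others
    keep = np.sort(np.argsort(agreement)[::-1][:k])
    return keep, annotations[keep].mean(axis=0)

# Synthetic example: three annotators who agree and one who disagrees.
t = np.linspace(0.0, 2.0 * np.pi, 50)
traces = np.stack([
    np.sin(t),
    np.sin(t) + 0.05 * np.cos(3.0 * t),
    0.9 * np.sin(t),
    -np.sin(t),
])
kept, fused = fuse_annotations(traces, k=3)
```

In this toy example the fourth trace is anti-correlated with the rest, so it is the one left out of the mean.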
4 Experimental results
In this section, we first describe the First Aff-Wild Challenge that was organized using the Aff-Wild database. We proceed with providing information and results for the baseline algorithm, which was based on the CNN-M network chatfield2014return, a deep convolutional neural network. We also provide short descriptions and results of the algorithms of the teams that participated in this challenge. Those results are promising, but leave much room for improvement. For this reason we developed our own CNN and CNN plus RNN architectures based on the Aff-Wild database. We first describe these architectures and then provide our results, which are currently the state-of-the-art. Finally, in order to indicate the power of the Aff-Wild benchmark, we extracted features from the above architectures and used them for dimensional and categorical emotion recognition. We conducted more experiments based on the RECOLA, the AFEW-VA and the EmotiW datasets. We report our findings and draw conclusions regarding the superiority of our networks.
4.1 The Aff-Wild Challenge
The training data (i.e., videos and annotations) of the Aff-Wild challenge were made publicly available on the 30th of January 2017, followed by the release of the test videos (without annotations). The participants could split the training data into train and validation sets and were free to use any other datasets. The participants could submit up to 3 entries to the challenge. Table 3 shows the number of subjects in the training and test sets of the Aff-Wild challenge. Figure 10 demonstrates the histogram in the 2-D Valence & Arousal Space of annotations for the challenge training set.
|Set||no of males||no of females|
Ten different research groups downloaded the Aff-Wild database. Six of them ran experiments and submitted their results to the workshop portal. Based on the performance they obtained on the test data, three of them were selected to present their results at the workshop.
Two criteria were considered for evaluating the performance of the networks. The first one is the Concordance Correlation Coefficient (CCC), which can be defined as follows:
$$\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$
where $s_x^2$ and $s_y^2$ are the variances of the ground truth and predicted values respectively, $\bar{x}$ and $\bar{y}$ are the corresponding mean values and $s_{xy}$ is the respective covariance value.
The second criterion is the Mean Squared Error (MSE), defined as follows:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$
where $x_i$ and $y_i$ are the ground truth and predicted values respectively and $N$ is the total number of samples.
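For illustration, both criteria can be computed with a few lines of NumPy. The sketch below uses population (biased) variances and covariance; the function names and the toy values are assumptions made for the example and do not come from the challenge evaluation code.

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between ground truth x and predictions y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    return 2.0 * cov_xy / (x.var() + y.var() + (mean_x - mean_y) ** 2)

def mse(x, y):
    """Mean Squared Error between ground truth x and predictions y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - y) ** 2)

# Toy example: predictions that follow the ground truth but are shifted by 0.5.
truth = np.array([0.1, 0.4, -0.2, 0.3])
shifted = truth + 0.5
ccc_shifted = ccc(truth, shifted)
mse_shifted = mse(truth, shifted)
```

The shifted predictions are perfectly correlated with the ground truth in the Pearson sense, yet their CCC is well below 1, because the CCC also penalises the difference between the two means.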
4.1.1 Baseline Architecture
The baseline architecture for the challenge was based on the structure of the CNN-M chatfield2014return network. The platform used for this implementation was Tensorflow tensorflow2015-whitepaper. We used the CNN-M network pre-trained on the FaceValue dataset kolliasalbanie16learning as a starting structure and transferred its convolutional and pooling parts to our designed network.
More particularly, we used two fully connected layers, the second being the output layer providing the valence and arousal predictions. We either retained the CNN part of the network and performed fine-tuning on the weights of the fully connected layers, or performed fine-tuning on the whole network’s weights. The exact structure of the network is shown in Table 4. Note that the activation function in the convolutional and batch normalisation layers is the ReLU one; this is also the case in the first fully connected layer. The activation function of the second fully connected layer is a linear one.
|Layer||filter||ksize||stride||padding||no of units|
|conv 1||[7, 7, 3, 96]|| ||[1, 2, 2, 1]||'VALID'|| |
|max pooling|| ||[1, 3, 3, 1]||[1, 2, 2, 1]||'VALID'|| |
|conv 2||[5, 5, 96, 256]|| ||[1, 2, 2, 1]||'SAME'|| |
|max pooling|| ||[1, 3, 3, 1]||[1, 2, 2, 1]||'SAME'|| |
|conv 3||[3, 3, 256, 512]|| ||[1, 1, 1, 1]||'SAME'|| |
|conv 4||[3, 3, 512, 512]|| ||[1, 1, 1, 1]||'SAME'|| |
|conv 5||[3, 3, 512, 512]|| ||[1, 1, 1, 1]||'SAME'|| |
|max pooling|| ||[1, 2, 2, 1]||[1, 2, 2, 1]||'SAME'|| |
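To make the table easier to read, the following sketch traces the spatial size of the feature maps through the listed convolutional and pooling layers, using the output-size rules of TensorFlow's 'VALID' and 'SAME' padding modes. The input resolution of 224 x 224 x 3 is an assumption made for illustration, and the helper names are hypothetical.

```python
import math

def out_size(size, kernel, stride, padding):
    """Spatial output size under TensorFlow's 'VALID' / 'SAME' padding rules."""
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    if padding == "SAME":
        return math.ceil(size / stride)
    raise ValueError(f"unknown padding: {padding}")

# (name, kernel size, stride, padding, output channels; None means pooling)
LAYERS = [
    ("conv 1", 7, 2, "VALID", 96),
    ("max pooling", 3, 2, "VALID", None),
    ("conv 2", 5, 2, "SAME", 256),
    ("max pooling", 3, 2, "SAME", None),
    ("conv 3", 3, 1, "SAME", 512),
    ("conv 4", 3, 1, "SAME", 512),
    ("conv 5", 3, 1, "SAME", 512),
    ("max pooling", 2, 2, "SAME", None),
]

def trace_shapes(input_size=224, channels=3):
    """Return (layer name, height, width, channels) after each layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, padding, out_channels in LAYERS:
        size = out_size(size, kernel, stride, padding)
        if out_channels is not None:  # pooling keeps the channel count
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes

for name, height, width, depth in trace_shapes():
    print(f"{name:12s} -> {height} x {width} x {depth}")
```

Under the assumed input size, the last pooling layer yields a 7 x 7 x 512 tensor, i.e. 25,088 values entering the first fully connected layer; a different input resolution changes these numbers accordingly.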
In order to train the network (in mini batches) we utilized the Adam optimizer algorithm, that provided slightly better performance overall in comparison to other methods, such as stochastic gradient descent. The Mean Squared Error (MSE) was used as the error/cost function. The hyper-parameters used were: the batch size equal to 80, the constant learning rate equal to 0.001 and the number of hidden units in the first fully connected layer equal to 4096. We also used biases in the fully connected layers. The weights of the fully connected layers were initialised from a Truncated Normal distribution with a zero mean and variance equal to 0.1 and the biases were initialised to 1. Training was performed on a single GeForce GTX TITAN X GPU and the training time was about 4-5 days.
Table 5 summarizes the MSE and Concordance Correlation Coefficient (CCC) values obtained by our baseline network. From the results we deduced that the task is very challenging and requires meticulously designed deep learning architectures in order to be tackled.
4.1.2 Participating Teams’ Algorithms
The three papers accepted to this challenge are briefly reported below, while Table 6 compares the results (in terms of CCC and MSE) obtained by all three methods. As one can see, the best results were obtained by FATAUVA-Net weichi.
We should note that after the end of the challenge, some more groups enquired about the Aff-Wild database and sent results for evaluation, but here we report only on the teams that participated in the challenge.
In jianshu (Method MM-Net), a variation of the deep convolutional residual neural network is first presented for affective level estimation of facial expressions. Then multiple memory networks are used to model temporal relations between the video frames. Finally, ensemble models are used to combine the predictions of the multiple memory networks, showing that the latter steps improve the initially obtained performance, as far as MSE is concerned, by more than 10%.
In weichi (Method FATAUVA-Net), a deep learning framework is presented, in which a core layer, an attribute layer, an AU layer and a V-A layer are trained sequentially. The facial part-based response is firstly learned through attribute recognition Convolutional Neural Networks, and then these layers are applied to supervise the learning of AUs. Finally, AUs are employed as mid-level representations to estimate the intensity of valence and arousal.
In hasani (Method DRC-Net), three neural network-based methods which are based on Inception-ResNet modules redesigned specifically for the task of facial affect estimation are presented and compared. These methods are: Shallow Inception-ResNet, Deep Inception-ResNet, and Inception-ResNet with LSTMs. Facial features are extracted in different scales and both the valence and arousal are simultaneously estimated in each frame. Best results are obtained by the Deep Inception-ResNet method.
All participants applied deep learning methods to the problem of visual analysis of the video inputs. The following conclusions can be drawn from the reported results. First, the CCC of arousal predictions is very low for all three methods. Second, the MSE of valence predictions is high for all three methods and the CCC is low, except for the winning method. This illustrates the difficulty in recognizing emotion in-the-wild, where, for instance, illumination conditions differ, occlusions are present and head poses vary.
4.2 CNN plus RNN architectures & Ablation Studies
Here we provide extensive experiments and ablation studies regarding DCNN and DCNN plus RNN architectures in Aff-Wild. We present our own architecture, AffWildNet, which is a CNN plus RNN network that produced the best results in the database. In summary, we have tested three main architectures, ResNet he2016deep, VGG Face parkhi2015deep, and VGG-16 simonyan2014very for feature extraction.
We considered two network settings:
(1) the models were trained in an end-to-end manner, i.e., using raw intensity pixels, to produce 2-D predictions of valence and arousal,
(2) an RNN is stacked on top of the models to capture temporal information in the data, before predicting the affect dimensions.
To further boost the performance of our models we also experimented with the use of facial landmarks. To summarize, the following two scenarios were tested:
(1) The network is applied directly on cropped facial video frames of the generated database.
(2) The network is trained on both the facial video frames and the facial landmarks corresponding to the same frame.
For our RNN model we experimented with both the Long Short Term Memory (LSTM) hochreiter1997long and the Gated Recurrent Unit (GRU) chung2014empirical, using either one or two recurrent layers. We note that all deep learning architectures have been implemented in the Tensorflow platform tensorflow2015-whitepaper.
In order to have a more balanced dataset for training, we performed data augmentation, mainly through oversampling by re-sampling and duplicating more2016survey some data from the Aff-Wild database. In particular, we re-sampled/duplicated consecutive video frames that had negative valence and arousal values, as well as positive valence and negative arousal values. As a consequence, the training set consisted of about 43% of positive valence and arousal values, 24% of negative valence and positive arousal values, 19% of positive valence and negative arousal values and 14% of negative valence and arousal values.
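The oversampling idea can be illustrated as follows. This is only a sketch: the duplication factors, the treatment of zero values as positive and the function names are assumptions, since the text does not state how many times each frame was repeated.

```python
import numpy as np

def quadrant(valence, arousal):
    """Quadrant index per frame: 0 = (+V, +A), 1 = (-V, +A), 2 = (+V, -A), 3 = (-V, -A)."""
    valence = np.asarray(valence, dtype=float)
    arousal = np.asarray(arousal, dtype=float)
    return (valence < 0).astype(int) + 2 * (arousal < 0).astype(int)

def oversample(valence, arousal, factors):
    """Return frame indices in which frames of quadrant q appear factors[q] times.

    Quadrants missing from `factors` are kept once. Duplicates stay adjacent,
    so the temporal order of the frames is preserved.
    """
    quadrants = quadrant(valence, arousal)
    repeats = np.array([factors.get(int(q), 1) for q in quadrants])
    return np.repeat(np.arange(len(quadrants)), repeats)

def quadrant_shares(valence, arousal):
    """Fraction of frames falling in each of the four quadrants."""
    counts = np.bincount(quadrant(valence, arousal), minlength=4)
    return counts / counts.sum()

# Toy example with hypothetical duplication factors for the two
# negative-arousal quadrants.
valence = np.array([0.5, 0.4, -0.3, 0.2, -0.1])
arousal = np.array([0.6, 0.2, 0.1, -0.4, -0.5])
indices = oversample(valence, arousal, factors={2: 2, 3: 3})
shares_after = quadrant_shares(valence[indices], arousal[indices])
```

Comparing `quadrant_shares` before and after oversampling shows how the share of the negative-arousal quadrants grows; in practice the factors would be tuned until the desired distribution is reached.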
Our loss function was based on the Concordance Correlation Coefficient (CCC) metric that has been shown to provide better insight on whether the prediction follows the structure of the ground truth annotation e2e_multimodal; trigeorgis2016adieu. During training we also kept track of the Mean Squared Error (MSE).
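Hence, our loss was defined as $\mathcal{L}_c = 1 - \rho_c$, where $\rho_c$ denotes the CCC. In our case, we trained our models to predict both arousal and valence. Our total loss was defined to be:
$$\mathcal{L}_{total} = \frac{\mathcal{L}_a + \mathcal{L}_v}{2}$$
where $\mathcal{L}_a$ and $\mathcal{L}_v$ are the losses for arousal and valence, respectively. For evaluation, our metrics were the CCC and MSE.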
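A NumPy sketch of such a CCC-based loss for joint valence and arousal prediction is given below. It assumes that the per-dimension loss is one minus the CCC and that the total loss averages the two per-dimension losses; the column order (valence, arousal) and the function names are likewise illustrative assumptions. An actual training loop would express the same computation with differentiable tensor operations.

```python
import numpy as np

def ccc_columns(y_true, y_pred):
    """CCC computed independently for each column of two (n_samples, n_dims) arrays."""
    mean_t, mean_p = y_true.mean(axis=0), y_pred.mean(axis=0)
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean(axis=0)
    return 2.0 * cov / (y_true.var(axis=0) + y_pred.var(axis=0) + (mean_t - mean_p) ** 2)

def total_ccc_loss(y_true, y_pred):
    """Average over the output dimensions of (1 - CCC)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(1.0 - ccc_columns(y_true, y_pred)))

# Toy batch: columns are (valence, arousal).
targets = np.array([[0.3, -0.5], [0.1, 0.5], [-0.4, -0.2], [0.6, 0.2]])
perfect = targets.copy()
flipped_arousal = targets * np.array([1.0, -1.0])

loss_perfect = total_ccc_loss(targets, perfect)
loss_flipped = total_ccc_loss(targets, flipped_arousal)
```

A perfect prediction gives a loss of 0, whereas predicting arousal with the wrong sign drives the arousal term towards its maximum of 2, so the averaged loss rises accordingly.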
We also experimented with the initialization of our models by initializing the weight values either (i) randomly or (ii) using pre-trained weights from networks that have been pre-trained on large databases. For the second approach we used transfer learning ng2015deep, especially of the convolutional and pooling part of the pre-trained networks.
In particular, we utilized the ResNet-50 and VGG-16 networks, which have been pre-trained on the ImageNet deng2009imagenet dataset for object recognition tasks, as well as the VGG-Face network, which has been pre-trained for face recognition on the VGG-Face dataset parkhi2015deep. After extensive experimentation, VGG-Face proved to provide the best results.
It should be noted that when utilizing pre-trained networks, we experimented following two approaches: either performing fine-tuning, i.e., training the entire architecture with a relatively small learning rate, or freezing the pre-trained part of the architecture and retraining the rest (i.e., the fully connected layers of the CNN, as well as the hidden layers of the RNN). In general, the procedure of freezing a part of the network and fine-tuning jung2015joint the rest can prove very useful, in particular when the given dataset is incremented with more videos. This increases the flexibility of the architecture, as fine-tuning can be performed by simply considering only the new videos.
4.2.1 Selecting Best Architectures
In the following we provide specific information on the selected structure and parameters in the developed end-to-end neural architectures, with reference to the results obtained for each case.
Extensive experiments have been performed by selecting different network parameter values, including (1) the number of neurons in the CNN fully connected layers, (2) the batch size used for network parameter updating, (3) the value of the learning rate and the strategy for reducing it during training (e.g. exponential decay in fixed number of epochs), (4) the weight decay parameter value, and finally (5) the dropout probability value.
With respect to parameter selection in the CNN architectures, we used a batch size in the range 10-100 and a constant learning rate value in the range 0.0001-0.001. The best results have been obtained with batch size equal to 50 and learning rate equal to 0.001. The dropout probability value was 0.5. The number of neurons per layer per CNN type is described in the next subsections.
4.2.1.1 Exploiting Residual Networks
The first architecture we utilize is a deep residual network (ResNet) of 50 layers he2016deep. Residual learning is adopted in these models by stacking multiple blocks of the form:
$$\mathbf{y}_k = \mathcal{F}(\mathbf{x}_k, \mathbf{W}_k) + h(\mathbf{x}_k),$$
where $\mathbf{x}_k$, $\mathbf{W}_k$ and $\mathbf{y}_k$ indicate the input, the weights, and the output of layer $k$, respectively. $\mathcal{F}$ indicates the residual function that is learned and $h$ is the identity mapping between the residual function and the input.
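The residual block can be sketched in a few lines: the block output is the learned residual function applied to the input, plus the input itself passed through the identity shortcut. The toy residual function below (two small dense layers with a ReLU in between) and the helper names are assumptions for illustration only; ResNet-50 blocks use convolutions instead.

```python
def relu(v):
    """Element-wise rectification."""
    return [max(0.0, a) for a in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Return F(x, W) + h(x), with h the identity and F = W2 * relu(W1 * x)."""
    fx = matvec(W2, relu(matvec(W1, x)))    # learned residual function F
    return [f + a for f, a in zip(fx, x)]   # add the identity shortcut h(x) = x

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]
print(residual_block(x, zeros, zeros))      # zero residual: the input passes through
print(residual_block(x, identity, identity))
```

With all weights at zero the block reduces to the identity, which is the property that makes very deep stacks of such blocks easier to optimize.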
The first layer of the ResNet-50 model is comprised of a convolutional layer with 64 feature maps, followed by a max pooling layer of size $3 \times 3$. Next, there are 4 bottleneck blocks, where after each block a shortcut connection is added. Each of these blocks is comprised of 3 convolutional layers of sizes $1 \times 1$, $3 \times 3$, and $1 \times 1$ with different numbers of feature maps.
In addition to the use of ResNet-50, we stack on top a 2-layer fully connected (FC) network. For the first FC layer, after experimenting with the number of units in the range 1000-2000, we found that the best results were obtained with 1500 units. For the second FC layer, we experimented in the range 200-500 and found that 256 units provided the best results. The architecture of the network is depicted in Figure 11.
When landmarks are used (scenario 2), they are provided as input to the FC network, along with the features extracted from the ResNet-50, so that both can be mapped to the same feature space before performing the prediction.
4.2.1.2 Exploiting VGG Face/VGG-16 networks
Table 7 shows the configuration of the CNN architecture based on VGG Face or VGG-16. It is composed of 8 blocks. For each convolutional layer the parameters are denoted as (channels, kernel, stride) and for the max pooling layer as (kernel, stride). The output number of units is also shown in the Table. The use of 2 Fully-Connected (FC) layers, before the final output layer, was found to provide the best results.
|Block||Layer||Parameters / number of units|
|block 1||conv layer||(64, 3×3, 1×1)|
| ||max pooling||(2×2, 2×2)|
|block 2||conv layer||(128, 3×3, 1×1)|
| ||max pooling||(2×2, 2×2)|
|block 3||conv layer||(256, 3×3, 1×1)|
| ||max pooling||(2×2, 2×2)|
|block 4||conv layer||(512, 3×3, 1×1)|
| ||max pooling||(2×2, 2×2)|
|block 5||conv layer||(512, 3×3, 1×1)|
| ||max pooling||(2×2, 2×2)|
|block 6||fully connected 1||4096|
|block 7||fully connected 2||2048|
|block 8||fully connected 3||2|
Table 7 also refers to the second scenario. In this case, however, the best results were obtained when the 68 landmark 2-D positions ($68 \times 2$ values) were provided, together with the outputs of the last pooling layer of the CNN, as inputs to the first FC layer of the architecture. In scenario 1, the outputs of the first FC layer of the CNN were the only inputs to the second fully connected layer of our architecture. A linear activation function was used in the last FC layer, providing the final estimates. All units in the remaining FC layers were equipped with the rectified linear unit (ReLU) non-linearity. The architecture of the network is depicted in Figure 12.
4.2.1.3 Developing CNN plus RNN architectures
In order to consider the contextual information in the data, we developed a CNN-RNN architecture, in which the RNN part was fed with the outputs of either the first, or the second fully connected layer of the respective CNN networks.
The RNN structures that we examined consisted of one or two hidden layers, with various numbers of units per layer, following either the LSTM neuron model allowing peephole connections, or the GRU neuron model. Using one fully connected layer in the CNN part and two hidden layers in the RNN part was found to provide the best results.
|block 2||fully connected 1||4096 or 1500|
|block 3||RNN layer 1||128|
|block 4||RNN layer 2||128|
|block 5||fully connected 2||2|
Table 8 shows the configuration of the CNN-RNN architecture. The CNN part of this architecture is based on the convolutional and pooling layers of the CNN architectures described above (VGG Face and ResNet-50). It is followed by a fully connected layer. Note that in the case of the second scenario, both the outputs of the last pooling layer of the CNN and the 68 landmark 2-D positions ($68 \times 2$ values) were provided as inputs to this fully connected layer. For the RNN and fully connected layers, Table 8 shows the respective number of units. We call this CNN plus RNN architecture AffWildNet, and illustrate it in Figure 13.
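How the landmarks enter the network in the second scenario can be sketched as a simple concatenation: the 68 two-dimensional landmark positions are flattened into 136 values and appended to the pooled CNN features before the fully connected layer. The function names and the feature length of 512 used below are hypothetical, and any landmark normalization is left out because the text does not describe it.

```python
def flatten_landmarks(landmarks):
    """Turn 68 (x, y) pairs into one flat list of 136 values."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    return [float(coord) for point in landmarks for coord in point]

def fc_input(pooled_features, landmarks):
    """Concatenate pooled CNN features with the flattened landmark positions."""
    return list(pooled_features) + flatten_landmarks(landmarks)

# Toy inputs (hypothetical sizes and values)
pooled = [0.0] * 512
landmarks = [(i, i + 1) for i in range(68)]
vector = fc_input(pooled, landmarks)
print(len(vector))
```

The resulting vector is what a fully connected layer would receive, so that appearance features and facial geometry are mapped into a common feature space.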
Extensive evaluation has been performed by selecting different network parameter values. These parameters included the batch size used for network parameter updating, the value of the learning rate and the dropout probability value. Final selection of these parameters was similar to the CNN cases, apart from the batch size which was selected in the range - . Best results have been obtained with batch size .
4.2.2 Experimental Results
In the following we present the results obtained for the above derived CNN-only and CNN plus RNN architectures.
4.2.2.1 CNN-only architectures
Table 9 summarizes the obtained CCC and MSE values on the test set of Aff-Wild using each of the three afore-mentioned CNN structures as pre-trained networks. The best results have been obtained using the VGG Face pre-trained CNN for initialization. Moreover, Table 10 shows that there is a significant improvement in the performance, when we also use the 68 2-D landmark positions as input data (case with landmarks).
|With Landmarks||Without Landmarks|
Furthermore, we have trained the networks with two different annotations. The first is the annotation provided by the Aff-Wild database, which is the average over the most correlated annotations. The second is the annotation produced by only one annotator (the one with the highest correlation to the landmarks). Annotations coming from a single annotator are generally less smooth than the average over annotators and are, hence, more difficult to model. The results are summarized in Table 11. As expected, it is better to train on the annotation provided by the Aff-Wild database.
|1 Annotator||Mean of Annotators|
4.2.2.2 CNN plus RNN architectures
Regarding the application of the CNN plus RNN end-to-end neural architecture on Aff-Wild, we first perform a comparison between two different units that can be used in an RNN network, i.e., LSTM vs. GRU. Table 12 summarises the CCC and MSE values when using LSTM and GRU. It can be seen that the best results have been obtained when the GRU model was used. All results reported in the following are, therefore, based on the GRU model.
Table 13 shows the improvement in the CCC and MSE values obtained when using the best CNN-RNN end-to-end neural architecture compared to the best CNN-only one. This improvement indicates the ability of the CNN-RNN architecture to better capture the temporal dynamics of the data.
|1 Hidden Layer||2 Hidden Layers|
We have tested various numbers of hidden layers and hidden units per layer when training and testing the CNN-RNN network. Some characteristic selections and their corresponding CNN-RNN performances are shown in Table 14.
In Figures 14(a) and 14(b), we qualitatively illustrate some of the obtained results, comparing a segment of the obtained valence/arousal predictions to the ground truth values, in over 6000 consecutive frames of test data.
It can be easily verified that the results obtained by our method (see for example Table 13) greatly outperform the achieved performances in the Challenge.
5 Feature Learning from Aff-Wild
When it comes to dimensional emotion recognition, there exists great variability between different databases, especially those containing emotions in-the-wild. In particular, the annotators and the range of the annotations are different and the labels can be either discrete or continuous. To tackle the problems caused by this variability, we take advantage of the fact that the Aff-Wild is a powerful database that can be exploited for learning features, which may then be used as priors for dimensional emotion recognition. In the following we show that it can be used as prior for the RECOLA and AFEW-VA databases that are annotated for valence and arousal, just like Aff-Wild. In addition to this, we use it as a prior for categorical emotion recognition, on the EmotiW dataset, which is annotated in terms of the seven basic emotions. Experiments have been conducted on these databases yielding state-of-the-art results and thus verifying the strength of Aff-Wild for affect recognition.
5.1 Prior for Valence and Arousal prediction
5.1.1 Experimental Results for the Aff-Wild and RECOLA database
In this subsection, we demonstrate the benefit of using our database as a means of pre-training a model. More particularly, we fine-tune the AffWildNet on RECOLA and, for comparison purposes, we train an architecture comprised of a ResNet-50 and a 2-layer GRU stacked on top (let us call it ResNet-GRU network). Table 15 shows the results only for the CCC score, since our loss function was based on this metric. It is clear that the performance, on both arousal and valence, of the model pre-trained on the Aff-Wild database and then fine-tuned is significantly higher than the performance of the ResNet-GRU model.
To further demonstrate the benefits of our model when it comes to automatic prediction of arousal and valence, we show a histogram in the 2-D Valence & Arousal Space of the annotations (Figure 16(a)) and of the predictions of the fine-tuned AffWildNet (Figure 16(b)) for the whole test set of RECOLA. Finally, we also illustrate in Figures 17(a) and 17(b) the prediction and ground truth for one test video of RECOLA, for the valence and arousal dimensions, respectively.
5.1.2 Experimental Results for the AFEW-VA database
In this subsection we focus on emotion recognition of the data in the AFEW-VA database. The annotation of this database is somewhat different from the annotation of the Aff-Wild database. In particular, the labels of the AFEW-VA database are in the range [$-10$, $10$], while the labels of the Aff-Wild database are in the range [$-1$, $1$]. To tackle this problem, we scaled the range of the AFEW-VA labels to [$-1$, $1$]. However, differences remained, due to the fact that the labels of AFEW-VA are discrete while the labels of Aff-Wild are continuous. Figure 19 shows the discrete valence and arousal values of the annotations in AFEW-VA database, whereas Figure 19 shows the corresponding histogram in the 2-D Valence & Arousal Space.
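The label rescaling can be written as a linear map between two intervals. The sketch below assumes the AFEW-VA labels are integers in [-10, 10] and that the target range is [-1, 1]; the function name `rescale` is hypothetical.

```python
def rescale(value, src=(-10.0, 10.0), dst=(-1.0, 1.0)):
    """Linearly map `value` from the interval `src` to the interval `dst`."""
    (s_lo, s_hi), (d_lo, d_hi) = src, dst
    return d_lo + (value - s_lo) * (d_hi - d_lo) / (s_hi - s_lo)

# ASSUMPTION: discrete source labels, one per integer in [-10, 10]
discrete_labels = list(range(-10, 11))
scaled = [rescale(v) for v in discrete_labels]
print(scaled[0], scaled[10], scaled[-1])
print(len(set(scaled)))  # number of distinct label values after scaling
```

Rescaling aligns the ranges but not the granularity: the scaled labels still take only as many distinct values as the discrete source labels, which is the remaining mismatch with continuous annotations noted above.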
We then performed fine-tuning of the AffWildNet to the AFEW-VA database and tested the performance of the generated network. Similarly to kossaifi2017afew, we used a 5-fold person-independent cross-validation strategy. Table 16 shows a comparison of the performance of the fine-tuned AffWildNet with the best results reported in kossaifi2017afew. Those results are in terms of the Pearson Correlation Coefficient criterion (Pearson CC), defined as follows:
$$\rho_{xy} = \frac{s_{xy}}{s_x s_y},$$
where $s_x^2$ and $s_y^2$ are the variances of the ground truth and predicted values respectively and $s_{xy}$ is the respective covariance value.
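For reference, the Pearson CC can be computed directly from its definition as the covariance divided by the product of the two standard deviations. The following is a minimal sketch in plain Python; the function name is hypothetical and the toy sequences are made up for illustration.

```python
from math import sqrt
from statistics import fmean

def pearson_cc(truth, pred):
    """Pearson Correlation Coefficient of two equally long sequences."""
    n = len(truth)
    mx, my = fmean(truth), fmean(pred)
    sx2 = sum((x - mx) ** 2 for x in truth) / n   # variance of the ground truth
    sy2 = sum((y - my) ** 2 for y in pred) / n    # variance of the predictions
    sxy = sum((x - mx) * (y - my) for x, y in zip(truth, pred)) / n  # covariance
    return sxy / sqrt(sx2 * sy2)

truth = [0.0, 1.0, 2.0]
print(pearson_cc(truth, [1.0, 2.0, 3.0]))   # prediction offset by a constant
print(pearson_cc(truth, [3.0, 2.0, 1.0]))   # prediction with reversed trend
```

Unlike the CCC, this criterion is insensitive to a constant offset or a positive rescaling of the predictions, so it only measures how well the predictions follow the trend of the ground truth.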
|best of kossaifi2017afew||0.407||0.445|
It can be easily seen that the fine-tuned AffWildNet greatly outperforms the best method reported in kossaifi2017afew.
For comparison purposes, we also trained a CNN network on the AFEW-VA database. This network’s architecture was based on the convolution and pooling layers of VGG Face followed by 2 fully connected layers with 4096 and 2048 hidden units, respectively. As shown in Table 17, the performance of the fine-tuned AffWildNet, in terms of CCC, greatly outperforms this network as well.
|VGG-16 + RNN||0.431||0.559||0.026||0.07||0.444||0.259||0.044||0.293|
|ResNet + RNN||0.431||0.237||0.077||0.07||0.587||0.155||0.089||0.261|
|VGG-Face + RNN||0.552||0.593||0.026||0.047||0.794||0.259||0.111||0.384|
Overall, these results verify that our network can be used as a pre-trained model that yields strong results across different dimensional databases.
5.2 Prior for Categorical Emotion Recognition
5.2.1 Experimental Results for the EmotiW dataset
To further show the strength of the AffWildNet, we used it (although it was trained for valence and arousal prediction) in a very different problem, that of categorical in-the-wild emotion recognition, focusing on the EmotiW 2017 Grand Challenge. To tackle categorical emotion recognition, we modified the AffWildNet’s output layer to include 7 neurons (one for each basic emotion category) and performed fine-tuning on the AFEW 5.0 dataset.
In the presented experiments, we compare the fine-tuned AffWildNet’s performance with that of other state-of-the-art CNN and CNN-RNN networks, the CNN part of which is based on the ResNet 50, VGG-16 and VGG-Face architectures (which were described in section 4.2), trained on the same AFEW 5.0 dataset. The accuracies of all networks on the validation set of the EmotiW 2017 Grand Challenge are shown in Table 18. We can easily see that the AffWildNet outperforms all those other networks in terms of total accuracy.
We should note that:
(i) the AffWildNet was trained to classify only video frames (and not audio), and video classification was then performed based on frame aggregation;
(ii) only the cropped faces provided by the challenge were used (and not our own detection and/or normalization procedure);
(iii) no data augmentation, post-processing of the results or ensemble methodology was used.
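Frame aggregation can take several forms and the text does not state which one was used. As one plausible illustration, the sketch below averages the per-frame class probabilities of a video and picks the class with the highest mean; the averaging rule, the function name and the ordering of the seven class labels are all assumptions.

```python
# ASSUMPTION: label order is arbitrary; only the set of seven classes matters here
EMOTIONS = ("angry", "disgust", "fear", "happy", "neutral", "sad", "surprise")

def aggregate_video(frame_probs):
    """Average per-frame class probabilities and return (label, mean_probs)."""
    if not frame_probs:
        raise ValueError("video has no frames")
    n = len(frame_probs)
    mean_probs = [
        sum(frame[i] for frame in frame_probs) / n for i in range(len(EMOTIONS))
    ]
    best = max(range(len(EMOTIONS)), key=mean_probs.__getitem__)
    return EMOTIONS[best], mean_probs

# Toy per-frame outputs (hypothetical) for a three-frame video
video = [
    [0.1, 0.0, 0.0, 0.6, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.2, 0.6, 0.0, 0.0],
]
label, mean_probs = aggregate_video(video)
print(label)
```

Majority voting over per-frame decisions would be an equally simple alternative; averaging probabilities keeps the confidence of each frame in play.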
It should also be mentioned that the fine-tuned AffWildNet’s performance is:
(i) much higher than the baseline of 0.3881 reported in dhall2017individual;
(ii) better than the performances of all vanilla architectures reported by the three winning methods in the audio-video emotion recognition EmotiW 2017 Grand Challenge hu2017learning knyazev2017convolutional vielzeuf2017temporal;
(iii) comparable to, and in some cases better than, the rest of the results obtained by the three winning methods hu2017learning knyazev2017convolutional vielzeuf2017temporal.
The above are shown in Table 19. Those results verify that the AffWildNet can be appropriately fine-tuned and successfully used for dimensional, as well as for categorical emotion recognition.
6 Conclusions and Future Work
Deep learning and deep neural networks have been successfully used in the past years for facial expression and emotion recognition based on still image and video frame analysis. Recent research focuses on in-the-wild facial analysis and refers, either to categorical emotion recognition, targeting recognition of the seven basic emotion categories, or to dimensional emotion recognition, analyzing the valence-arousal (V-A) representation space.
In this paper we introduce Aff-Wild, a new, large in-the-wild database that consists of 298 videos of 200 subjects, with a total length of more than 30 hours. We also present the Aff-Wild Challenge that was organized on Aff-Wild. We report on the results of the Challenge, and the pitfalls and challenges in terms of predicting valence and arousal in-the-wild. Furthermore, we design a deep convolutional and recurrent neural architecture and perform extensive experimentation with the Aff-Wild database. We show that the generated AffWildNet provides the best performance in terms of valence and arousal estimation on the Aff-Wild dataset, both in terms of the concordance correlation coefficient and the mean squared error criteria, when compared with other deep learning networks trained on the same database.
Subsequently, we show that the AffWildNet and the Aff-Wild database constitute tools that can be used for facial expression and emotion recognition on other datasets. Using appropriate fine-tuning and retraining methodologies, we show that the best results can be obtained by applying the AffWildNet to other dimensional databases, such as RECOLA and AFEW-VA, and by comparing the obtained performances with other state-of-the-art pretrained and fine-tuned networks.
Furthermore, we show that fine-tuning of the AffWildNet can produce state-of-the-art performance, not only for dimensional, but also for categorical emotion recognition. We use this approach to tackle the facial expression and emotion recognition parts of the EmotiW 2017 Grand Challenge, referring to recognition of the seven basic emotion categories, showing that we produce comparable, or better, results when compared with the winners of this contest.
It should be stressed that, to the best of our knowledge, this is the first time that the same deep architecture has been used for both dimensional and categorical emotion analysis. To achieve this, the AffWildNet has been effectively trained with the largest existing in-the-wild database for continuous V-A recognition (a regression problem) and then used for tackling the discrete seven basic emotion recognition (a classification problem).
The proposed procedure for fine-tuning the AffWildNet can be used to further extend its use in the analysis of other new visual emotion recognition datasets. This includes our current work on extending the Aff-Wild with new in-the-wild audio-visual information, as well as a means for unifying the approaches for facial expression and emotion recognition, including dimensional emotion representations, basic and compound emotion categories, facial action unit representations, as well as specific emotion categories met in different contexts, such as negative emotions, emotions in games, in social groups and other human machine (or robot) interactions.