Deep Learning for Domain Adaption: Engagement Recognition

Omid Mohamad Nezami, Len Hamey, Deborah Richards, Mark Dras
Department of Computing, Macquarie University, Sydney, Australia

Engagement is a key indicator of the quality of learning experience, and one that plays a major role in developing intelligent educational interfaces. Any such interface requires the ability to recognise the level of engagement in order to respond appropriately; however, there is very little existing data to learn from, and new data is expensive and difficult to acquire. This paper presents a deep learning model to improve engagement recognition from face images captured ‘in the wild’ that overcomes the data sparsity challenge by pre-training on readily available basic facial expression data, before training on specialised engagement data. In the first of two steps, a state-of-the-art facial expression recognition model is trained to provide a rich face representation using deep learning. In the second step, we use the model’s weights to initialize our deep learning based model to recognize engagement; we term this the Transfer model. We train the model on our new engagement recognition (ER) dataset with 4627 engaged and disengaged samples. We find that our Transfer architecture outperforms standard deep learning architectures that we apply for the first time to engagement recognition, as well as approaches using HOG features and SVMs. The model achieves a classification accuracy of 72.38%, which is 6.1 percentage points better than the best baseline model on the test set of the ER dataset. Using the F1 measure and the area under the ROC curve, our Transfer model achieves 73.90% and 73.74%, exceeding the VGGnet baseline (the most accurate baseline on the test set) by 3.49 and 5.33 percentage points respectively.

Keywords: Engagement, Facial Expression Recognition, Deep Learning, Convolutional Neural Networks (CNNs)

1 Introduction

Engagement is a significant aspect of human-technology interactions and is defined differently for a variety of applications such as search engines, online gaming platforms, and mobile health applications (O’Brien, 2016). According to Monkaresi et al. (2017), most definitions describe engagement as attentional and emotional involvement in a task.

This paper deals with engagement during learning via technology. Investigating engagement is vital for designing intelligent educational interfaces in different learning settings including educational games, massively open online courses (MOOCs), and intelligent tutoring systems (ITSs). For instance, if students feel frustrated and become disengaged (see disengaged samples in Fig. 1), the system should intervene in order to bring them back to the learning process. However, if students are engaged and enjoying their tasks (see engaged samples in Fig. 1), they should not be interrupted even if they are making some mistakes (Kapoor et al., 2001). In order for the learning system to adapt the learning setting and provide proper responses to students, we first need to automatically measure engagement. This can be done by, for example, using context performance (Alyuz et al., 2016), facial expression (Whitehill et al., 2014) and heart rate (Monkaresi et al., 2017) data.

This paper aims at quantifying and characterizing engagement using facial expressions extracted from images ‘in the wild’. In this domain, engagement detection models usually use typical facial features which are designed for general purposes, such as Gabor features (Whitehill et al., 2014), histogram of oriented gradients (Kamath et al., 2016) and facial action units (Bosch et al., 2015). To the best of the authors’ knowledge, there is no work in the literature investigating the design of specific and high-level features for engagement. Therefore, providing a rich engagement representation model to distinguish engaged and disengaged samples remains an open problem (Challenge 1). Training such a rich model requires a large amount of data which means extensive effort, time, and expense would be required for collecting and annotating data due to the complexities (Bosch, 2016) and ambiguities (O’Brien, 2016) of the engagement concept (Challenge 2).

To address the aforementioned challenges, we design a deep learning model which includes two essential steps: basic facial expression recognition, and engagement recognition. In the first step, a convolutional neural network (CNN) is trained on the dataset of the Facial Expression Recognition Challenge 2013 (FER-2013) to provide a rich facial representation model, achieving state-of-the-art performance. In the next step, the model is applied to initialize our engagement recognition model, designed using a separate CNN, learned on our newly collected dataset in the engagement recognition domain. As a solution to Challenge 1, we train a deep learning-based model that provides our representation model specifically for engagement recognition. As a solution to Challenge 2, we use the FER-2013 dataset, which is around eight times larger than our collected dataset, as external data to pre-train our engagement recognition model and compensate for the shortage of engagement data. The contributions of this work are threefold:

  • To the authors’ knowledge, the work in this paper is the first time a rich face representation model has been used to capture basic facial expressions and initialize an engagement recognition model, resulting in positive outcomes. This shows the effectiveness of applying basic facial expression data in order to recognize engagement.

  • We have collected a new dataset, which we call the Engagement Recognition (ER) dataset, to facilitate research on engagement recognition from face images. To handle the complexity and ambiguity of the engagement concept, our data is annotated in two steps, separating the behavioral and emotional dimensions of engagement. The final engagement label in the ER dataset is the combination of the two dimensions.

  • To the authors’ knowledge, this is the first study which models engagement using deep learning techniques. The proposed model outperforms an extensive range of baseline approaches on the ER dataset.

The rest of this paper is organized as follows. In Sec. 2, related work in facial expression recognition and engagement recognition is described. In Sec. 3 and Sec. 4, we explain our models for recognizing facial expression and engagement, respectively; we apply the facial expression recognition model to initialize the engagement recognition model. Sec. 5 describes our experimental setup and the evaluation results and Sec. 6 offers the paper’s conclusion.

Figure 1: Engaged (left) and disengaged (right) samples collected in our studies. We blurred the children’s eyes for ethical reasons, even though we had their parents’ consent at the time.

2 Related Work

2.1 Facial Expression Recognition

As a form of non-verbal communication, facial expressions convey attitudes, affects, and intentions of people. They are the result of movements of muscles and facial features (Fasel and Luettin, 2003). Study of facial expressions was started more than a century ago by Charles Darwin (Ekman, 2006), leading to a large body of work in recognizing basic facial expressions (Fasel and Luettin, 2003; Sariyanidi et al., 2015). Much of the work uses a framework of six ‘universal’ emotions (Ekman, 1999): sadness, happiness, fear, anger, surprise and disgust, with a further neutral category.

Recently, facial expression recognition (FER) using deep learning based methods has been successful in automatically recognizing facial expressions in images (Liu et al., 2014; Jung et al., 2015; Yu and Zhang, 2015; Zhang et al., 2015; Mollahosseini et al., 2016; Zhang et al., 2017; Rodriguez et al., 2017). These approaches learn hierarchical structures from low- to high-level feature representations thanks to the complex, multi-layered architectures of neural networks. Convolutional Neural Networks (CNNs), among the classes of deep models, have been the most successful in the FER domain. For example, Kahou et al. (2013) applied CNNs to recognize facial expressions and won the 2013 Emotion Recognition in the Wild Challenge. Another CNN model, followed by a linear support vector machine, was trained to recognize facial expressions by Tang (2013); this won the 2013 FER challenge (Goodfellow et al., 2013). In FER tasks, CNNs can also be applied for feature extraction and transfer learning. For instance, Kahou et al. (2016) applied CNNs for extracting visual features accompanied by audio features in a multi-modal data representation. Nezami et al. (2018) used a CNN model to recognize facial expressions, where the learned representation is used in an image captioning model; the model embedded the recognized facial expressions to generate more human-like captions for images including human faces.

Yu and Zhang (2015) employed a CNN model that was pre-trained on the FER-2013 dataset (Goodfellow et al., 2013) and fine-tuned on the Static Facial Expression in the Wild (SFEW) dataset (Dhall et al., 2011). They applied a face detection method to detect faces and remove noise in their target data samples. Mollahosseini et al. (2016) trained CNN models across different well-known FER datasets to enhance the generalizability of recognizing facial expressions. They applied face registration processes to extract and align faces to achieve better performance. Kim et al. (2016) measured the impact of combining registered and unregistered face samples on FER tasks. They used the unregistered samples when the facial landmarks of the samples were not detectable. Zhang et al. (2017) applied CNNs to capture spatial information from video frames. The spatial information was combined with temporal information to recognize facial expressions.

In the FER domain, models typically use CNNs with fairly standard deep architectures to achieve good performance on the FER-2013 dataset, a large dataset collected ‘in the wild’. Pramerdorfer and Kampel (2016) instead employed a combination of modern deep architectures, such as VGGnet (Simonyan and Zisserman, 2014), and achieved the state-of-the-art result on the FER-2013 dataset.

We similarly first train a facial expression recognition model that is able to recognize facial expressions in the wild and achieves state-of-the-art performance on the FER-2013 dataset. Then, the model is used to initialize our engagement recognition model.

2.2 Engagement Recognition

Engagement can be detected in three different time scales: the entire video of a learning session, 10-second video clips, and static images. In the first category, Grafsgaard et al. (2013) studied the relation between facial action units (AUs) and engagement in learning contexts. They collected videos of web-based learning sessions between students and tutors. After finishing the sessions, they requested each student to fill in an engagement survey to annotate the student’s engagement for the entire session of learning. Their work also used linear regression methods to detect different levels of engagement. However, this approach does not characterize engagement in fine-grained time intervals which would be required for making an adaptive educational interface.

As an attempt to solve this issue, using 10-second video clips, Whitehill et al. (2014) applied linear SVMs and Gabor features, as the best approach in this work, to classify four engagement levels: not engaged at all, nominally engaged, engaged in task, and very engaged. In this work, the dataset includes 10-second captured videos annotated into the four levels of engagement by observers, who are analyzing the videos. Monkaresi et al. (2017) used heart rate features in addition to facial features to detect engagement. They used Kinect SDK’s face tracking engine to extract facial features and WEKA (a classification toolbox) to classify the features into engaged or not engaged classes. They annotated their dataset, including 10-second videos, using self-reported data collected from students during and after their tasks. In the wild condition, Bosch et al. (2015) detected engagement by AUs and Bayesian classifiers. The generalizability of the model was also investigated across different times, days, ethnicities and genders (Bosch et al., 2016). Furthermore, in interacting with ITSs, engagement was investigated based on a personalized model including appearance and context features (Alyuz et al., 2016). Engagement was also considered in learning with MOOCs as an e-learning environment (D’Cunha et al., 2016). In such settings, data are usually annotated by observing video clips or filling self-reports. However, the engagement levels of students can change during 10-second video clips, so assigning a label to the entire clip is difficult and sometimes inaccurate.

In the third category, HOG features and SVMs have been applied to classify static images according to three levels of engagement: not engaged, nominally engaged and very engaged (Kamath et al., 2016). This work is based on the experimental results of Whitehill et al. (2014) in preparing engagement samples. Whitehill et al. (2014) showed that engagement patterns are mostly recorded in static images. Bosch et al. (2015) also confirmed that video clips could not provide extra information because they reported similar performances using different lengths of video clips in detecting engagement. The dataset of Kamath et al. (2016) includes static images annotated into the three levels of engagement using crowdsourcing platforms. However, competitive performances are not reported in this category.

We focus on this third category, static images, in this work. In order to detect engagement from images, we need an effective data annotation procedure and an effective engagement recognition model. Because analyzing student engagement is complex, we collected a new dataset annotated by Psychology students, who can potentially better recognize the psychological phenomena of engagement. To assist them with recognition, brief training was provided prior to commencing the task and delivered in a consistent manner via online examples and descriptions. Unlike Kamath et al. (2016), we did not use crowdsourced labels, which result in less effective outcomes, to annotate our dataset. We also captured more effective labels by following an annotation process that simplifies the engagement concept into the behavioral and the emotional dimensions. We requested annotators to label the two dimensions for each image, and we derive the overall annotation label by combining them. Our aim is for this dataset to be useful to other researchers interested in detecting engagement from images.

Given this dataset, we introduce a novel model to recognize engagement using deep learning. Our model includes two important phases. First, we train a deep model to recognize basic facial expressions, adapted from some existing state-of-the-art architectures. Second, the model is applied to initialize the weights of our engagement recognition model trained using our newly collected dataset.

3 Facial Expression Recognition from Face Images

3.1 Facial Expression Recognition Dataset

To recognize facial expressions, the facial expression recognition 2013 (FER-2013) dataset (Goodfellow et al., 2013) is used. The dataset includes examples in the wild, which are labeled happiness, anger, sadness, surprise, fear, disgust, and neutral. It contains 35,887 samples (28,709 for the training set, 3589 for the public test set and 3589 for the private test set), collected by the Google search API. The samples are in grayscale at the size of 48-by-48 pixels (Fig. 2).

We split the training set into two parts after removing 11 completely black samples: 3589 for validating and 25,109 for training our facial expression recognition model. To compare with related work (Kim et al., 2016; Pramerdorfer and Kampel, 2016; Yu and Zhang, 2015), we do not use the public test set for training or validation, but use the private test set for performance evaluation of our facial expression recognition model.

Figure 2: Examples from the FER-2013 dataset of seven basic facial expressions.

3.2 Facial Expression Recognition using Deep Learning

We train the VGG-B model (Simonyan and Zisserman, 2014) on the FER-2013 dataset, with one fewer Convolutional (Conv.) block, as shown in Fig. 3. This results in eight Conv. and three fully connected layers. We also have a max pooling layer with stride 2 after each Conv. block. We normalize each FER-2013 example so that the sample has zero mean and a norm of 100 (Tang, 2013). Moreover, for each pixel position, the pixel values are normalized to zero mean and unit standard deviation using all FER-2013 training samples. The implementation details of our model are similar to the work of Pramerdorfer and Kampel (2016), which is state-of-the-art on the FER-2013 dataset, and our replication has similar performance. The model’s output layer (softmax layer) consists of seven neurons, corresponding to the categorical distribution probabilities over the facial expression classes in FER-2013. In the next section, we use the model’s weights as an initial step to train our engagement recognition model to recognize engaged and disengaged samples.

Figure 3: The architecture of our facial expression recognition model adapted from VGG-B framework (Simonyan and Zisserman, 2014). Each rectangle is a Conv. block including two Conv. layers. The max pooling layers are not shown for simplicity.

4 Engagement Recognition from Face Images

4.1 Engagement Recognition Dataset

Data Collection

To recognize engagement from face images, we construct a new dataset that we call the Engagement Recognition (ER) dataset. The data samples are extracted from videos of students, who are learning scientific knowledge and research skills using a virtual world named Omosa (Jacobson et al., 2016). Samples are taken at a fixed rate instead of random selections, making our dataset samples representative, spread across both subjects and time. In the interaction with Omosa, the goal of students is to determine why a certain animal kind is dying out by talking to characters, observing the animals and collecting relevant information (Fig. 4 (top)). After collecting notes and evidence, students are required to complete a workbook (Fig. 4 (bottom)).

Figure 4: The example interactions of students with Omosa (Jacobson et al., 2016), which are captured in our studies.

The videos of students were captured from our studies in two public secondary schools in Australia involving twenty students (11 girls and 9 boys) from Years 9 and 10 (aged 14–16), whose parents agreed to their participation in our ethics-approved studies. We collected the videos from twenty individual sessions of students recorded at 20 frames per second (fps), resulting in twenty videos and totalling around 20 hours.

After extracting video samples, we applied a face detection algorithm (King, 2009) to select samples including detectable faces; the most recent version (2018) of the Dlib library is used. The face detection algorithm cannot detect faces in a small number of samples (less than 1%) due to their high face occlusion (Fig. 5). We removed the occluded samples from the ER dataset.

Figure 5: Samples without detectable faces because of high face occlusions.

Data Annotation

We designed custom annotation software to request annotators to independently label 100 samples each. The samples are randomly selected from our collected data and are displayed in different orders for different annotators. Each sample is annotated by at least six annotators; the Fleiss’ kappa of the six annotators is 0.59, indicating a high inter-coder agreement. Following ethics approval, we recruited undergraduate Psychology students to undertake the annotation task, who received course credit for their participation.
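For readers who want to reproduce an agreement statistic of this kind, the following is a minimal sketch of Fleiss’ kappa in plain Python. It assumes the ratings have been arranged as a count table with the same number of ratings per sample; the function name and the toy ratings are illustrative assumptions and are not taken from the ER annotation data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same total number of ratings."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    return (p_bar - p_e) / (1.0 - p_e)


# Made-up ratings: 4 images, 6 annotators, columns = [engaged, disengaged].
toy_counts = [[6, 0], [5, 1], [1, 5], [0, 6]]
toy_kappa = fleiss_kappa(toy_counts)
print(round(toy_kappa, 3))
```

The sketch does not handle the degenerate case in which every rating falls into a single category, where chance agreement equals one and kappa is undefined.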

Before starting the annotation process, annotators were provided with definitions of behavioral and emotional dimensions of engagement, which are defined in the following paragraphs, inspired by the work of Aslan et al. (2017).

Behavioral dimension:

  • On-Task: The student is looking towards the screen or looking down to the keyboard below the screen.

  • Off-Task: The student is looking everywhere else or eyes completely closed, or head turned away.

  • Can’t Decide: If you cannot decide on the behavioral state.

Emotional dimension:

  • Satisfied: If the student is not having any emotional problems during the learning task. This can include all positive states of the student from being neutral to being excited during the learning task.

  • Confused: If the student is getting confused during the learning task. In some cases, this state might include some other negative states such as frustration.

  • Bored: If the student is feeling bored during the learning task.

  • Can’t Decide: If you cannot decide on the emotional state.

During the annotation process, we show each data sample followed by two questions indicating the engagement’s dimensions. The behavioral dimension can be chosen among on-task, off-task, and can’t decide options and the emotional dimension can be selected among satisfied, confused, bored, and can’t decide options. In each annotation phase, annotators have access to the definitions to label each dimension. A sample of the annotation software is shown in Fig. 6. In the next step, each sample is categorized as engaged or disengaged by combining the dimensions’ labels (Table 1). For example, if a particular annotator labels an image as on-task and satisfied, the category for this image from this annotator is engaged. Then, for each image we use the majority of the engaged and disengaged labels to specify the final overall annotation.

If a sample receives the label of can’t decide more than twice (either for the emotional or behavioral dimensions) from different annotators, it is removed from ER dataset. Labeling this kind of sample is a difficult task for annotators, notwithstanding the good level of agreement that was achieved, and finding solutions to reduce the difficulty remains as a future direction of our work.

Using this approach, we have created the ER dataset consisting of 4627 annotated examples including 2290 engaged and 2337 disengaged.

Figure 6: An example of our annotation software where the annotator is requested to specify the behavioral and emotional dimensions of the displayed sample.

Behavioral   Emotional   Engagement
On-task      Satisfied   Engaged
On-task      Confused    Engaged
On-task      Bored       Disengaged
Off-task     Satisfied   Disengaged
Off-task     Confused    Disengaged
Off-task     Bored       Disengaged
Table 1: The adapted relationship between the behavioral and emotional dimensions from Woolf et al. (2009) and Aslan et al. (2017).
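The labeling procedure described in this section (per-annotator mapping through Table 1, removal of samples with too many can’t decide labels, then a majority vote) can be sketched as follows. This is an illustration rather than the annotation software itself: the function name is hypothetical, and two details the text does not specify are assumptions here, namely that an annotation containing can’t decide casts no vote and that a tied vote leaves the sample unresolved.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Table 1 as a lookup: (behavioral, emotional) -> engagement category.
ENGAGEMENT = {
    ("on-task", "satisfied"): "engaged",
    ("on-task", "confused"): "engaged",
    ("on-task", "bored"): "disengaged",
    ("off-task", "satisfied"): "disengaged",
    ("off-task", "confused"): "disengaged",
    ("off-task", "bored"): "disengaged",
}
CANT_DECIDE = "can't decide"


def final_label(annotations: List[Tuple[str, str]]) -> Optional[str]:
    """Combine per-annotator (behavioral, emotional) labels into one label.
    Returns None when the sample is dropped or unresolved."""
    # Removal rule from the text: more than two annotations with "can't decide"
    # in either dimension.
    undecided = sum(1 for b, e in annotations if CANT_DECIDE in (b, e))
    if undecided > 2:
        return None
    # Assumption: undecided annotations simply do not vote.
    votes = Counter(
        ENGAGEMENT[(b, e)] for b, e in annotations if CANT_DECIDE not in (b, e)
    )
    ranked = votes.most_common(2)
    # Assumption: a tie (or no votes at all) is treated as unresolved.
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


# Six hypothetical annotations for one image: four map to engaged, two to disengaged.
sample = [("on-task", "satisfied")] * 4 + [("on-task", "bored"), ("off-task", "confused")]
print(final_label(sample))
```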

Dataset Preparation

We apply the CNN-based face detection algorithm to detect the face of each ER sample. If there is more than one face in a sample, we choose the face with the biggest size. Then, the face is transformed to grayscale and resized into 48-by-48 pixels, which is an effective resolution for engagement detection (Whitehill et al., 2014). Fig. 7 shows some examples of the ER dataset. We split the ER dataset into training (3224), validation (715), and testing (688) sets, which are subject-independent (the samples in these three sets are from different subjects). Table 2 demonstrates the statistics of these three sets.

Figure 7: Randomly selected examples of ER dataset including engaged and disengaged samples.

State        ER dataset   Training   Validation   Testing
Engaged      2290         1589       392          309
Disengaged   2337         1635       323          379
Total        4627         3224       715          688
Table 2: The statistics of the ER dataset and its partitions.

4.2 Engagement Recognition

The basic deep learning architecture we will use is a Convolutional Neural Network (CNN). We define two of these as baselines, one simple architecture and one that is similar in structure to VGGnet (Simonyan and Zisserman, 2014). The key model of interest in this paper is a version of the latter baseline that incorporates facial emotion recognition. For completeness, we also include another baseline that is not based on deep learning, but rather uses classical machine learning with histogram of oriented gradients (HOG) features (Dalal and Triggs, 2005).

For all the models, every sample of the ER dataset is normalized so that it has a zero mean and a norm equal to 100. Furthermore, for each pixel location, the pixel values are normalized to mean zero and standard deviation one using all ER training data.
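The two normalization steps can be sketched in plain Python as follows. The sketch assumes that the norm in question is the Euclidean (L2) norm and that the per-pixel standard deviation is the population standard deviation; neither choice is stated in the text, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Image = List[float]  # a flattened grayscale image, e.g. 48 * 48 values


def normalize_sample(img: Image, target_norm: float = 100.0) -> Image:
    """Shift one image to zero mean, then rescale it to a fixed L2 norm."""
    mean = sum(img) / len(img)
    centered = [p - mean for p in img]
    norm = math.sqrt(sum(p * p for p in centered))
    if norm == 0.0:  # constant image: nothing to rescale
        return centered
    return [p * target_norm / norm for p in centered]


def fit_pixel_stats(train: List[Image]) -> Tuple[List[float], List[float]]:
    """Per-pixel mean and standard deviation over the training images only."""
    n = len(train)
    columns = list(zip(*train))  # one tuple of values per pixel position
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds


def standardize(img: Image, means: List[float], stds: List[float],
                eps: float = 1e-8) -> Image:
    """Apply the training-set statistics to any image (train, validation or test)."""
    return [(p - m) / (s + eps) for p, m, s in zip(img, means, stds)]


demo = normalize_sample([0.0, 1.0, 2.0, 3.0])
```

Fitting the per-pixel statistics on the training partition only, and reusing them for validation and test images, matches the statement that the statistics come from the ER training data.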


We trained a linear support vector machine (SVM) on histogram of oriented gradients (HOG) features extracted from ER samples, which we call the HOG+SVM model. The model is similar to that of Kamath et al. (2016) for recognizing engagement from static images and is used as a baseline model in this work. HOG (Dalal and Triggs, 2005) uses gradient directions or edge orientations to describe objects in local regions of images. For example, in facial expression recognition tasks, HOG features can represent the wrinkling of the forehead by horizontal edges. A linear SVM is usually used to classify HOG features. In our work, the SVM penalty parameter C, which trades off the misclassification of training samples against the SVM objective function, is tuned using the validation set of the ER dataset.
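To make the role of edge orientations concrete, the following simplified sketch computes a magnitude-weighted orientation histogram for a single image cell. It is not a full HOG implementation (there is no block normalization or interpolation between bins, and the linear SVM is not shown); the nine-bin setting and the function name are assumptions for illustration.

```python
import math
from typing import List


def cell_orientation_histogram(cell: List[List[float]], n_bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations
    (0 to 180 degrees) for one cell, using central differences on the
    interior pixels of the cell."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal intensity change
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical intensity change
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // bin_width), n_bins - 1)] += magnitude
    return hist


# A horizontal edge (dark rows above bright rows), like a forehead wrinkle.
edge_cell = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]
edge_hist = cell_orientation_histogram(edge_cell)
```

For the horizontal edge in this toy cell, the gradient points vertically, so all of the histogram mass falls in the bin that covers 90 degrees.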

Convolutional Neural Network

As Convolutional Neural Networks (CNNs) are standard tools in image recognition, we use the training and validation sets of the ER dataset to train a CNN for this task from scratch (the CNN model); this constitutes another of the baseline models in this paper. The model’s architecture is shown in Fig. 8. The model contains two convolutional (Conv.) layers, followed by two max pooling (Max.) layers with stride 2, and two fully connected (FC) layers. A rectified linear unit (ReLU) activation function (Nair and Hinton, 2010) is applied after all Conv. and FC layers. The last step of the CNN model is a softmax layer with two neurons, indicating the engaged and disengaged classes, followed by a cross-entropy loss. To overcome model over-fitting, we apply a dropout layer (Srivastava et al., 2014) after every Conv. layer with rate 0.8 and after the hidden FC layer with rate 0.5. Local response normalization (Krizhevsky et al., 2012) is used after the first Conv. layer. As the optimizer, stochastic gradient descent with mini-batching and a momentum of 0.9 is used. Using Equation 1 (Abadi et al., 2016), the learning rate ($\eta$) is decayed by the rate ($\gamma$) of 0.8 in the decay step ($s$) of 500; the total number of iterations from the beginning of the training phase is the global step ($g$):

$$\eta = \eta_0 \cdot \gamma^{\,g / s} \qquad (1)$$

where $\eta_0$ is the initial learning rate.
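The learning rate schedule can be illustrated with a short sketch. It assumes the schedule is the exponential decay commonly provided by TensorFlow, in which the initial rate is multiplied by the decay rate raised to the power of global step over decay step; whether the authors used the continuous or the staircase variant is not stated, so both are shown, and the function name is hypothetical.

```python
def decayed_learning_rate(initial_lr: float, global_step: int,
                          decay_rate: float = 0.8, decay_steps: int = 500,
                          staircase: bool = False) -> float:
    """Exponential decay: initial_lr * decay_rate ** (global_step / decay_steps).

    With staircase=True the exponent is truncated to an integer, so the rate
    drops in discrete jumps every `decay_steps` iterations (an assumption;
    the text does not say which variant was used)."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps
    return initial_lr * decay_rate ** exponent


# Example with an initial rate of 0.02 and the decay settings from the text.
for step in (0, 500, 1000, 2000):
    print(step, decayed_learning_rate(0.02, step))
```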

Figure 8: The architecture of the CNN Model. We denote convolutional, max-pooling, and fully-connected layers with “Conv”, “Max”, and “FC”, respectively.

Very Deep Convolutional Neural Network

We similarly use the training and validation partitions of the ER dataset to train a very deep model which has eight Conv. and three FC layers similar to VGG-B architecture (Simonyan and Zisserman, 2014), but with two fewer Conv. layers. The architecture is trained using two different scenarios. Under the first scenario, the model is learned from scratch initialized with random weights; we call this the VGGnet model (Fig. 9), and this constitutes the second of our deep learning baseline models.

Under the second scenario, which uses the same basic architecture, the model’s layers, except the softmax layer, are initialized by the trained model of Sec. 3.2, the goal of which is to recognize basic facial expressions; we call this the Transfer model (Fig. 10), and this is the key model of interest in our paper. In this model, all layers’ weights are updated and fine-tuned to recognize the engaged and disengaged classes of the ER dataset.
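The initialization step of the Transfer model can be sketched independently of any deep learning framework. In the sketch, weights are represented as plain lists keyed by layer name; the layer names, sizes and the function name are hypothetical stand-ins for real tensors, chosen only to show that every layer except the softmax layer is copied from the facial expression model.

```python
import copy
import random
from typing import Dict, List

Weights = Dict[str, List[float]]


def init_transfer(fer_weights: Weights, er_weights: Weights,
                  skip: str = "softmax") -> Weights:
    """Copy every pre-trained layer into the new model except the output
    layer, which keeps its fresh initialization (7 classes vs. 2 classes)."""
    initialized = copy.deepcopy(er_weights)
    for name, values in fer_weights.items():
        if name == skip:
            continue
        if name not in initialized or len(initialized[name]) != len(values):
            raise ValueError(f"layer {name!r} does not match between the models")
        initialized[name] = list(values)
    return initialized


# Toy stand-ins for real tensors; layer names and sizes are hypothetical.
fer = {"conv1": [0.1, 0.2], "fc1": [0.3, 0.4, 0.5], "softmax": [0.0] * 7}
rng = random.Random(0)
er = {
    "conv1": [rng.gauss(0, 0.01) for _ in range(2)],
    "fc1": [rng.gauss(0, 0.01) for _ in range(3)],
    "softmax": [rng.gauss(0, 0.01) for _ in range(2)],
}
transfer = init_transfer(fer, er)
```

After this initialization, all layers (copied and fresh alike) remain trainable, which corresponds to fine-tuning the whole network on the ER dataset.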

For both VGGnet and Transfer models, after each Conv. block, we have a max pooling layer with stride 2. In the models, we set the number of output units in the softmax layer to two (engaged versus disengaged), followed by a cross-entropy loss. Similar to the CNN model, we apply a rectified linear unit (ReLU) activation function (Nair and Hinton, 2010) after all Conv. and hidden FC layers. Here, we use a dropout (Srivastava et al., 2014) layer after every Conv. block with rate 0.8 and hidden FC layer with rate 0.5. Furthermore, we apply local response normalization after the first Conv. block. We use the same approaches to optimization and learning rate decay as in the CNN model.

Figure 9: The architecture of the VGGnet model.
Figure 10: Our facial expression recognition model on FER-2013 dataset (left). The Transfer model on ER dataset (right).

5 Experiments

5.1 Evaluation Metrics and Testing

In this paper, the performance of all models is reported on both the validation and test splits of the ER dataset. We use three performance metrics: classification accuracy, F1 measure and the area under the ROC (receiver operating characteristic) curve (AUC). In this work, classification accuracy is the number of positive (engaged) and negative (disengaged) samples that are correctly classified, divided by the number of all testing samples (Equation 2).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

where $TP$, $TN$, $FP$, and $FN$ are true positive, true negative, false positive, and false negative, respectively. F1 measure is calculated using Equation 3.

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R} \qquad (3)$$

where $P$ is precision defined as $\frac{TP}{TP + FP}$ and $R$ is recall defined as $\frac{TP}{TP + FN}$. AUC is a popular metric in engagement recognition tasks (Monkaresi et al., 2017; Whitehill et al., 2014; Bosch et al., 2015); it is an unbiased assessment of the area under the ROC curve. An AUC score of 0.5 corresponds to chance performance by the classifier, and an AUC of 1.0 represents the best possible result.
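The three metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions of accuracy and F1 from confusion counts, and computes AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one; the function names, scores and labels are made up for illustration.

```python
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: predicted probability of "engaged" and the true label (1 = engaged).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
# Thresholding the scores at 0.5 gives TP=2, FN=1, FP=1, TN=2.
print(accuracy(2, 2, 1, 1), f1_measure(2, 1, 1), auc(toy_scores, toy_labels))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a test set of a few hundred images; a rank-based computation would be preferable for much larger sets.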

5.2 Implementation Details

In the training phase, for data augmentation, input examples are randomly flipped along their width and randomly cropped to 48-by-48 pixels (after applying zero-padding, because the samples are already of this size). Furthermore, they are randomly rotated by up to a specific max angle. To prevent overfitting, we applied regularization with a small amount of weight decay. The max angle, weight decay, number of training iterations, learning rate, and batch size are separately tuned for each deep model using ER’s validation set. The optimum values of these hyperparameters are presented in Table 3. For every model in this work, the version that performs best on the validation set is used to estimate the performance on the test partition of the ER dataset.
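The flip and pad-then-crop augmentations can be sketched on images stored as nested lists. The padding width of four pixels and the flip probability of one half are assumptions, since the text does not give them, and random rotation is left out because it requires interpolation that would normally be delegated to an image library.

```python
import random
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right (a flip along its width)."""
    return [row[::-1] for row in img]


def pad_and_random_crop(img: Image, pad: int, rng: random.Random) -> Image:
    """Zero-pad by `pad` pixels on every side, then crop back to the
    original size at a random offset."""
    h, w = len(img), len(img[0])
    padded_w = w + 2 * pad
    padded = (
        [[0.0] * padded_w for _ in range(pad)]
        + [[0.0] * pad + list(row) + [0.0] * pad for row in img]
        + [[0.0] * padded_w for _ in range(pad)]
    )
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]


def augment(img: Image, rng: random.Random, pad: int = 4) -> Image:
    """Randomly flip, then randomly translate through pad-and-crop."""
    if rng.random() < 0.5:  # assumed flip probability
        img = flip_horizontal(img)
    return pad_and_random_crop(img, pad, rng)


# A synthetic 48-by-48 image whose pixel values encode their position.
face = [[float(r * 48 + c) for c in range(48)] for r in range(48)]
augmented = augment(face, random.Random(0))
```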

Method     Iterations   Learning rate   Batch size   Max angle   Weight decay
CNN        2320         0.02            28           8.0         0.0001
VGGnet     2320         0.01            28           8.0         0.001
Transfer   2020         0.02            32           16.0        0.001
Table 3: The optimum hyperparameters for the different deep models.

5.3 Results

Overall Metrics

We summarize the experimental results on the validation partition of the ER dataset in Table 4 and on the test partition in Table 5. On both the validation and test sets, the Transfer model substantially outperforms all baseline models on all evaluation metrics, showing the effectiveness of using a model trained on basic facial expression data to initialize an engagement recognition model. On the test set, all deep models (CNN, VGGnet, and Transfer) perform better than the HOG+SVM method, showing the benefit of applying deep learning to recognize engagement. On the test set, the Transfer model achieves 72.38% classification accuracy, which outperforms VGGnet by 6.10 percentage points and the CNN model by 6.68 points; it is also 12.50 points better than the HOG+SVM approach. The Transfer model achieves an F1 measure of 73.90%, which is an improvement of around 3 points over the deep baseline models and 6.52 points over the HOG+SVM model. Using the AUC metric, the most popular metric in engagement recognition tasks, the Transfer model achieves 73.74%, which improves on the CNN and VGGnet models by more than 5 points and is around 11 points better than the HOG+SVM method. There are similar improvements on the validation set.

Method     Accuracy   F1      AUC
HOG+SVM    67.69      75.40   65.50
CNN        72.03      74.94   71.56
VGGnet     68.11      70.69   67.85
Transfer   77.76      81.18   76.77
Table 4: The results of subject-independent engagement models (%) on the validation set of the ER dataset.

Method     Accuracy   F1      AUC
HOG+SVM    59.88      67.38   62.87
CNN        65.70      71.01   68.27
VGGnet     66.28      70.41   68.41
Transfer   72.38      73.90   73.74
Table 5: The results of subject-independent engagement models (%) on the test set of the ER dataset.
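The three metrics reported in these tables can be computed as in the following sketch. It uses the standard definitions only and is not the evaluation code used for the paper; treating the engaged class as the positive class (label `1`) is an assumption, and the toy labels and scores are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Area under the ROC curve, computed as the probability that a random
    positive sample receives a higher score than a random negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = engaged, 0 = disengaged.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]          # predicted probability of "engaged"
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

Accuracy and F1 depend on the decision threshold applied to the scores, whereas AUC depends only on how the scores rank the samples, which is why the three metrics can order models differently.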

Confusion Matrices

We show the confusion matrices of the HOG+SVM, CNN, VGGnet, and Transfer models on the ER test set in Tables 6, 7, 8, and 9, respectively. The tables show the proportions of predicted classes with respect to the actual classes (each row sums to 100%), allowing performance to be examined per class.

Interestingly, the advantage of the deep models over the HOG+SVM model comes from their ability to recognize disengaged samples. Disengaged samples have a wider variety of body postures and facial expressions than engaged samples (Fig. 11). Because of their more complex structure, deep learning models are better able to capture this wider variation. The VGGnet model, which has a more complex architecture than the CNN model, also detects a higher proportion of disengaged samples (47.49% versus 43.01%). Since we pre-trained the Transfer model on basic facial expression data, which includes considerable variation between samples, this model is the most effective at recognizing disengaged samples: it correctly classifies 60.42% of them, around 27 percentage points more than the HOG+SVM model (33.51%).

Actual \ Predicted   Engaged   Disengaged
Engaged              92.23     7.77
Disengaged           66.49     33.51
Table 6: Confusion matrix for the HOG+SVM model (%).

Actual \ Predicted   Engaged   Disengaged
Engaged              93.53     6.47
Disengaged           56.99     43.01
Table 7: Confusion matrix for the CNN model (%).

Actual \ Predicted   Engaged   Disengaged
Engaged              89.32     10.68
Disengaged           52.51     47.49
Table 8: Confusion matrix for the VGGnet model (%).

Actual \ Predicted   Engaged   Disengaged
Engaged              87.06     12.94
Disengaged           39.58     60.42
Table 9: Confusion matrix for the Transfer model (%).
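A confusion matrix normalized in this way, with each actual class summing to 100%, can be produced as in the sketch below. The function name, the string labels, and the toy predictions are hypothetical and serve only to illustrate the normalization.

```python
def row_normalized_confusion(y_true, y_pred, labels=("engaged", "disengaged")):
    """Return {actual: {predicted: percentage}} where each actual class sums to 100."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1
    matrix = {}
    for a in labels:
        total = sum(counts[a].values())
        matrix[a] = {
            p: (100.0 * counts[a][p] / total if total else 0.0) for p in labels
        }
    return matrix


# Toy example (hypothetical labels): four samples of each actual class.
y_true = ["engaged"] * 4 + ["disengaged"] * 4
y_pred = ["engaged", "engaged", "engaged", "disengaged",
          "engaged", "disengaged", "disengaged", "disengaged"]
matrix = row_normalized_confusion(y_true, y_pred)
```

The diagonal entries of such a matrix give, for each class, the share of its samples that the model classifies correctly, so they can be compared across models regardless of how many samples each class contains.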

Qualitative Results

Figs. 11 and 12 show a number of examples that the Transfer model correctly predicts as engaged or disengaged. In Fig. 11, the probability that the softmax layer assigns to the correct class is much larger than the probability it assigns to the wrong class. We characterize these examples as confidently predicted ones. In Fig. 12, however, the difference between the probabilities of the correct and incorrect classes is smaller. The engaged and disengaged samples in Fig. 12 are more visually alike than the confidently predicted samples, which makes the recognition task more difficult.

Fig. 13 presents some examples that are wrongly predicted by the Transfer model. The first two rows present samples in which the incorrect class has a probability comparable to that of the correct class. The last row shows confidently incorrect predictions, where the probability of the incorrect class is considerably larger than that of the correct class, illustrating some of the challenging examples in engagement recognition tasks.

Figure 11: Representative engaged (left) and disengaged samples (right) that are confidently predicted using the Transfer model.
Figure 12: Representative engaged (left) and disengaged samples (right) which are correctly but less confidently predicted using the Transfer model.
Figure 13: These samples are wrongly predicted as engaged (left) and disengaged (right) using the Transfer model. The ground truth labels of the left samples are disengaged and the right samples are engaged.
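The distinction between confident and less confident predictions drawn in these figures can be illustrated with the softmax probabilities of the two classes. The sketch below is an assumption-laden example: the paper does not state a numeric cut-off, so `margin_threshold=0.5` is hypothetical, as are the function names and the example logits.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits, true_label,
                        labels=("engaged", "disengaged"),
                        margin_threshold=0.5):
    """Return (predicted label, probability margin, description).

    The margin is the gap between the highest and second-highest class
    probabilities; `margin_threshold` is a hypothetical cut-off.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    runner_up = sorted(probs)[-2]
    margin = probs[best] - runner_up
    outcome = "correct" if labels[best] == true_label else "wrong"
    confidence = "confident" if margin >= margin_threshold else "less confident"
    return labels[best], margin, f"{confidence}, {outcome}"


# Hypothetical logits in the order (engaged, disengaged).
clear_case = describe_prediction([2.0, 0.0], "engaged")
borderline_case = describe_prediction([0.2, 0.0], "disengaged")
```

Under these assumptions, the first example corresponds to the kind of sample shown in Fig. 11 and the second to a wrongly predicted sample whose two class probabilities are close, as in the first rows of Fig. 13.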

6 Conclusion

Reliable models that can recognize engagement during a learning session, particularly in contexts where there is no instructor present, play a key role in allowing learning systems to intelligently adapt to facilitate the learner. There is a shortage of data for training systems to do this; the first contribution of the paper is a new dataset, labelled by annotators with expertise in psychology, that we hope will facilitate research on engagement recognition from visual data.

In this paper we have used this dataset to train models for the task of automatic engagement recognition in the wild, including, for the first time, deep learning models. The key contribution has been the development of a model that addresses the shortage of engagement data needed to train a reliable deep learning model. This Transfer model has two key steps. First, we pre-train the model’s weights using basic facial expression data, which is relatively abundant. Second, we train the model to produce a rich deep learning based representation for engagement, instead of relying on the features and classification methods commonly used in this domain. We have evaluated this model against a comprehensive range of baseline models to demonstrate its effectiveness, and have shown that it yields a considerable improvement over both classical approaches and standard deep learning models on all standard evaluation metrics.

We would like to thank the participants, Meredith Taylor and John Porte, Charlotte Taylor and Louise Sutherland for running the classroom study and facilitating data collection. This work has in part been supported by Australian Research Council Discovery Project DP150102144.

