Image Captioning using Facial Expression and Attention
Benefiting from advances in machine vision and natural language processing techniques, current image captioning systems are able to generate detailed visual descriptions. For the most part, these descriptions represent an objective characterisation of the image, although some models do incorporate subjective aspects related to the observer’s view of the image, such as sentiment; current models, however, usually do not consider the emotional content of images during the caption generation process. This paper addresses this issue by proposing novel image captioning models which use facial expression features to generate image captions. The models generate image captions using long short-term memory networks applying facial features in addition to other visual features at different time steps. We compare a comprehensive collection of image captioning models with and without facial features using all standard evaluation metrics. The evaluation metrics indicate that applying facial features with an attention mechanism achieves the best performance, showing more expressive and more correlated image captions, on an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the generated captions finds that, perhaps unexpectedly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.
Image captioning systems aim to describe the content of an image using Computer Vision and Natural Language Processing approaches, which have led to important and practical applications such as helping visually impaired individuals. This is a challenging task because we have to capture not only the objects but also their relations and the activities displayed in the image in order to generate a meaningful description. The impressive progress in deep neural networks and large image captioning datasets has recently resulted in a considerable improvement in generating automatic image captions [63, 65, 22, 68, 50, 3, 37, 2].
However, current image captioning methods often overlook the emotional aspects of the image, which play an important role in generating captions that are more semantically correlated with the visual content. For example, Figure 1 shows three images with their corresponding human-generated captions including emotional content. The first image at left has the caption “a dad smiling and laughing with his child”, using “smiling” and “laughing” to describe the emotional content of the image. In a similar fashion, “angry” and “happy” are applied in the second and the third images, respectively. These examples demonstrate how image captioning systems that recognize emotions and apply them can generate richer, more expressive and more human-like captions. This desideratum of incorporating emotional content is one that is general to intelligent systems, which researchers like Lisetti (1998) have identified as necessary to generate more effective and adaptive outcomes. In this work, we seek to demonstrate that this desideratum also holds for image captioning systems. Although detecting emotions from visual data has been an active area of research in recent years [12, 53], designing an effective image captioning system to employ emotions in describing an image is still an open and challenging problem.
A few models have incorporated sentiment or other non-factual information into image captions [15, 40, 67]; they typically require the collection of a supplementary dataset, from which a sentiment vocabulary is derived, drawing on work in Natural Language Processing where sentiment is usually characterized as one of positive, neutral or negative. Mathews et al. (2016), for instance, constructed a sentiment image-caption dataset via crowdsourcing, where annotators were asked to include either positive sentiment (e.g. a cuddly cat) or negative sentiment (e.g. a sinister cat) using a fixed vocabulary; their model was trained on both this and a standard set of factual captions. These kinds of approaches typically embody descriptions of an image that represent an observer’s view towards the image (e.g. a cuddly cat for a positive view of an image, versus a sinister cat for a negative one); they do not aim to capture the emotional content of the image, as in Figure 1.
To capture the emotional content of the image, we propose two groups of models: Face-Cap and Face-Attend.
The main contributions of the paper are highlighted as follows:
We propose Face-Cap and Face-Attend models to effectively employ facial expression features with general visual content to generate image captions. To the authors’ knowledge, this is the first study to apply facial expression analyses in image captioning tasks.
Our generated captions are evaluated by all standard image captioning metrics. The results show the effectiveness of the models compared to a comprehensive list of image captioning models on the FlickrFace11K dataset, the subset of images from the Flickr 30K dataset that include human faces.
We further assess the quality of the generated captions in terms of the characteristics of the language used, such as variety of expression. Our analysis suggests that the generated captions by our models improve over other image captioning models by better describing the actions performed in the image.
2 Previous Work
In the following sections, we review image captioning and facial expression recognition models as they are the key parts of our work.
2.1 Image Captioning
There are three main types of image captioning systems: template-based models, retrieval-based models and deep-learning based models. Template-based ones first detect visual objects, their attributes and relations and then fill a pre-defined template’s blank slots. Retrieval-based ones generate captions using the available captions corresponding to similar images in their corresponding datasets. These classical image captioning models have some limitations. For example, template-based ones cannot generate a wide variety of captions with different lengths, and retrieval-based ones are not able to generate specifically-designed captions for different images. Moreover, classical models do not incorporate the detection and generation steps using an end-to-end training approach. Because of these limitations, modern image captioning models using deep learning are currently the most popular.
Modern image captioning models usually use an encoder-decoder paradigm [29, 63, 65]. They apply a top-down approach where a Convolutional Neural Network (CNN) model learns the image content (encoding), followed by a Long Short-Term Memory (LSTM) generating the image caption (decoding). This follows the paradigm employed in machine translation tasks, using deep neural networks, to translate an image into a caption. This top-down mechanism directly converts the extracted visual features into image captions [5, 8, 22, 25, 39]. However, attending to fine-grained and important fragments of visual data, required to provide a better image description, is usually difficult using a top-down paradigm. To solve this problem, a combination of a top-down approach and a bottom-up approach, inspired by the classical image captioning models, was proposed by You et al. (2016). The bottom-up approach overcomes this limitation by generating the relevant words and phrases, which can be detected from visual data with any image resolution, and combining them to form an image caption [10, 11, 32, 33].
To attend to fine-grained fragments, attention-based image captioning models have been recently proposed. These kinds of approaches usually analyze different regions of an image in different time steps of a caption generation process, in comparison to the initial encoder-decoder image captioning systems which consider only the whole image as an initial state for generating image captions. They can also take the spatial information of an image into account to generate the relevant words and phrases in the image caption. The current state-of-the-art models in image captioning are attention-based systems [2, 50, 65, 68], explained in the next section, similar to our attention-based image captioning systems.
Image Captioning with Attention
Visual attention is an important aspect of the visual processing system of humans [30, 6, 55, 51]. It dynamically attends to salient spatial locations in an image with special properties or attributes which are relevant to particular objects. It is different from dealing with the whole image as a set of static extracted features, and helps humans to concentrate more on a targeted object or region at each time step. Although visual attention has been extensively studied in Psychology and Neuroscience, it has only more recently received attention in artificial intelligence fields including machine learning, computer vision and natural language processing.
The first image captioning model with attention was proposed by Xu et al. (2015). The model uses visual content extracted from the convolutional layers of CNNs, referred to as spatial features, as the input of a spatial attention mechanism to selectively attend to different parts of an image at every time step in generating an image caption. This work is inspired by the work of Bahdanau et al. (2014), since extended by Vaswani et al. (2017), who employed attention in the task of machine translation; by Mnih et al. (2014); and by Ba et al. (2014), who applied attention in the task of object recognition. Image captioning with attention differs from previous encoder-decoder image captioning models by concentrating on the salient parts of an input image to generate its equivalent words or phrases simultaneously. Xu et al. (2015) proposed two types of attention: a hard (stochastic) mechanism and a soft (deterministic) mechanism. In the soft attention mechanism, a weighted matrix is calculated to weight a particular part of an image as the input to the decoder (interpreted as a probability value for considering the particular part of the image). The hard attention mechanism, in contrast, picks a sampled annotation vector corresponding to a particular part of an image at each time step as the input to the decoder.
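The difference between the two mechanisms can be illustrated with a small NumPy sketch. It is not the implementation of Xu et al.; the function names, the toy feature sizes and the way the attention scores are supplied (as a fixed vector rather than the output of a learned network) are assumptions made only for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def soft_attention(features, scores):
    """Deterministic: return the attention-weighted average of all regions."""
    weights = softmax(scores)
    return weights @ features, weights

def hard_attention(features, scores, rng):
    """Stochastic: sample a single region from the attention distribution."""
    weights = softmax(scores)
    idx = rng.choice(len(weights), p=weights)
    return features[idx], idx

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))         # 4 image regions, 3-dim annotation vectors
scores = np.array([2.0, 0.5, 0.1, -1.0])   # hypothetical unnormalized scores

context_soft, weights = soft_attention(features, scores)
context_hard, picked = hard_attention(features, scores, rng)
```

In the soft case every region contributes to the decoder input in proportion to its weight, so the operation is differentiable; in the hard case only the sampled region is passed on, which is why training requires a sampling-based estimator.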
Rennie et al. (2017) extended the work of Xu et al. by using the CIDEr metric, a standard performance metric for image captioning, to optimize their caption generator instead of optimizing maximum likelihood estimation loss. Their approach was inspired by a Reinforcement Learning approach [64, 57] called self-critical sequence training, which involves normalizing the reward signals calculated using the CIDEr metric at test time.
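As a rough sketch of the idea, the reward of a sampled caption can be normalized by subtracting the reward of the caption the model would produce with its test-time (greedy) decoding. The function below and the reward values are hypothetical; real systems compute the rewards with a CIDEr implementation and back-propagate through the log-probabilities.

```python
import numpy as np

def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Policy-gradient style loss with the greedy caption's reward as baseline.

    sample_log_probs: log-probabilities of the words in the sampled caption.
    sample_reward / greedy_reward: scores (e.g. CIDEr) of the sampled and
    greedily decoded captions; the values used below are made up.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * np.sum(sample_log_probs)

log_probs = np.log(np.array([0.5, 0.4, 0.8]))
loss_better = self_critical_loss(log_probs, sample_reward=1.2, greedy_reward=0.9)
loss_worse = self_critical_loss(log_probs, sample_reward=0.6, greedy_reward=0.9)
```

Minimizing this loss raises the probability of sampled captions that beat the greedy baseline and lowers the probability of those that fall below it.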
Yu et al. (2017) and You et al. (2016) applied a notion of semantic attention to detected visual attributes, learned in an end-to-end fashion, where bottom-up approaches were combined with top-down ones to take advantage of both paradigms. For instance, they acquired a list of semantic concepts or attributes, regarded as a bottom-up mechanism, and used the list with visual features, as an instance of top-down information, to generate an image caption. Semantic attention is used to attend to semantic concepts detected from various parts of a given image. Here, the visual content was only used in the initial time step. In other time steps, semantic attention was used to select the extracted semantic concepts. That is, semantic attention differs from spatial attention, which attends to spatial features in every time step, and does not preserve the spatial information of the detected concepts.
To preserve spatial information, salient regions can be localized using spatial transformer networks, which get the spatial features as inputs. This is similar to Faster R-CNN’s generation of bounding boxes, but it is trained in an end-to-end fashion using bilinear interpolation instead of a Region of Interest pooling mechanism, as proposed by Johnson et al. (2016). Drawing on this idea, Anderson et al. (2018) applied spatial features to image captioning by using a pre-trained Faster R-CNN and an attention mechanism to discriminate among different visual-based regions regarding the spatial features. Specifically, they combined bottom-up and top-down approaches where a pre-trained Faster R-CNN is used to extract the salient regions from images, instead of using the detected objects as high-level semantic concepts as in the work of You et al. (2016); and an attention mechanism is used to generate spatial attention weights over the convolutional feature maps representing the regions. Faster R-CNN, as an object detection model, is pre-trained on the Visual Genome dataset; this pre-training on a large dataset is analogous to pre-training a classification model on the ImageNet dataset. Jin et al. (2015) previously used salient regions with different scales which are extracted by applying selective search instead of applying Faster R-CNN. Then, they made the input of their spatial attention mechanism by resizing and encoding the regions in the task of image captioning.
In our image captioning systems, we use an attention mechanism weighting visual features as a top-down approach. We also use another attention mechanism to attend to facial expression features as a bottom-up approach. This combination allows our image captioning models to generate captions which are highly correlated with visual content and facial features. To do so, we train a state-of-the-art facial expression recognition model to extract the features. Then, we use the features, attended using the attention mechanism at each time step, to enrich image captions by targeting emotional values.
Image Captioning with Style
Most image captioning systems concentrate on describing objective visual content without adding any extra information, giving rise to factual linguistic descriptions. However, there are also stylistic aspects of language which play an essential role in enriching written communication and engaging users during interactions. Style helps in clearly conveying visual content, and making the content more attractive [15, 4]. It also conveys personality-based and emotion-based attributes which can affect decision making. Incorporating style into the description of an image is effective in boosting the engagement level of humans in visually-grounded chatbot platforms and in interacting with automatically-generated comments for photos and videos in social media platforms.
There are a few models that have incorporated style or other non-factual characteristics into the generated captions [40, 15]. In addition to describing the visual content, these models learn to generate different forms or styles of captions. For instance, Mathews et al. (2016) proposed the Senti-Cap system to generate sentiment-bearing captions. Here, the notion of sentiment is drawn from Natural Language Processing, with sentiment either negative or positive. The Senti-Cap system of Mathews et al. (2016) is a full switching architecture incorporating both factual and sentiment caption paths. In comparison, the work of Gan et al. (2017) consists of a Factored-LSTM learning the stylistic information in addition to the factual information of the input captions. Chen et al. (2018) subsequently applied a mechanism to weight the stylistic and the factual information using Factored-LSTM. All these approaches need two-stage training: training on factual image captions and training on sentiment-bearing image captions. Therefore, they do not support end-to-end training.
To address this issue, You et al. (2018) designed two new schemes, Direct Inject and Sentiment Flow, to better employ sentiment in generating image captions. For Direct Inject, an additional dimension was added to the input of a recurrent neural network (RNN) to express sentiment,
All of the above work is focused on subjective descriptions of images using a given sentiment vocabulary, rather than representing the emotional content of the image, as we do in this work. In order to target content-based emotions using visual data, we propose Face-Cap and Face-Attend models employing attention mechanisms to attend to visual features. We aim to apply the emotional content, recognized using a facial expression analysis, of images themselves during a caption generation process. We use the emotional content to generate image captions without any extra style-based or sentiment-bearing vocabulary: our goal is to see whether, given the existing vocabulary, incorporating the emotional content can produce better captions.
2.2 Facial Expression Recognition
Facial expression is a form of non-verbal communication conveying attitudes, affects, and intentions of individuals. It happens as the result of changes over time in facial features and muscles. It is also one of the most important means of communication for showing emotions and transferring attitudes in human interactions. Indeed, research on facial expressions started more than a century ago when Darwin published his book titled “The expression of the emotions in man and animals”. Since then a large body of work has emerged on recognizing facial expressions, usually using a purportedly universal framework of a small number of standard emotions (happiness, sadness, fear, surprise, anger, and disgust) or this set including a neutral expression [13, 24, 12, 66, 14, 53] or more fine-grained facial features such as facial action units, defined as the deformations of facial muscles. Recently, recognizing facial expressions has received special attention because of its practical applications in different domains such as education, health-care and virtual reality [72, 12]. It is worth mentioning that the automatic recognition of facial expressions is a difficult task because different people express their attitudes in different ways and there are close similarities among various types of facial expressions, as shown in Figure 2.
To find effective representations, deep learning based methods have been recently successful in this domain. Due to their complex architectures including multiple layers, they can capture hierarchical structures from low- to high-level representations of facial expression data. Tang (2013), the winner of the 2013 Facial Expression Recognition (FER) challenge, trained a Convolutional Neural Network (CNN) with a linear support vector machine (SVM) to detect facial expressions. He replaced the softmax layer, used to generate a probability distribution across multiple classes, with a linear SVM and showed a consistent improvement compared to the previous work. Instead of cross-entropy loss, his approach optimizes a margin-based loss to maximize margins among data points belonging to diverse classes.
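To make the contrast between the two losses concrete, the following sketch compares cross-entropy with a one-vs-rest squared hinge loss on the same class scores. The squared hinge form and the toy scores are assumptions chosen for illustration, not a reproduction of the cited training setup.

```python
import numpy as np

def cross_entropy(scores, label):
    """Softmax cross-entropy for a single example."""
    z = scores - scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def one_vs_rest_squared_hinge(scores, label):
    """Margin-based loss: each class is treated as a +1/-1 target."""
    targets = -np.ones_like(scores)
    targets[label] = 1.0
    margins = np.maximum(0.0, 1.0 - targets * scores)
    return np.sum(margins ** 2)

scores = np.array([2.5, -1.5, -2.0])   # hypothetical class scores from the last layer

ce_correct = cross_entropy(scores, 0)
hinge_correct = one_vs_rest_squared_hinge(scores, 0)
hinge_wrong = one_vs_rest_squared_hinge(scores, 1)
```

When the true class is 0, every margin is already satisfied, so the hinge loss is exactly zero while the cross-entropy remains positive; the margin-based objective stops pushing once examples are separated by the required margin.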
CNNs are also used for feature extraction and transfer learning in this domain. Kahou et al. (2016) applied a CNN model to recognize facial expressions and won the 2013 Emotion Recognition in the Wild (EmotiW) Challenge. Their approach uses a combination of deep neural networks to learn from diverse data modalities including video frames, audio data and spatio-temporal information. The CNN model, as the best model in this work, aims to recognize emotions from static video frames. Then the recognized emotions are combined across a video clip by a frame aggregation technique and classified using an SVM with a radial basis function kernel. Yu and Zhang (2015) used an ensemble of CNNs to detect facial expressions in a transfer learning framework. On their target samples, they applied a set of face detection approaches to optimally detect faces and remove irrelevant data. They used a multiple neural network training framework to learn a set of weights assigned to the responses of the CNNs in addition to averaging and voting over the responses. Kim et al. (2016) combined aligned and non-aligned faces to enhance the recognition performance of facial expressions, where they automatically detected facial landmarks from faces to rotate and align faces. Then, they trained a CNN model using this combination of faces. Zhang et al. (2015) proposed a CNN-based method to recognize social relation traits (e.g. friendly, competitive and dominant) from detected faces in an image. The method includes a CNN model to recognize facial expressions projected into a shared representation space. The space combines the extracted features from two detected faces in an image and generates the predictions of social traits.
The models mentioned above usually use conventional CNN architectures to report the performance on different facial expression recognition datasets including the FER-2013 dataset, which is a publicly available dataset with a large number of human faces collected in the wild. Pramerdorfer and Kampel (2016) instead used an ensemble of very deep CNN architectures such as VGGnet, Inception and ResNet, identifying the bottlenecks of the previous state-of-the-art facial expression recognition models on the FER-2013 dataset and achieving a new state-of-the-art result on the dataset. The quality of these recent models is high: it is at least as good as human performance. The idea of applying VGGnet in facial expression recognition tasks motivates our work to build a facial expression recognition module reproducing the state-of-the-art result on the FER-2013 dataset. We use the module to extract facial features from human faces to apply in our image captioning models.
In this section, we describe Face-Cap and Face-Attend, our proposed models for generating image captions using facial expression analyses. The models are inspired by two popular image captioning models, specifically Show-Attend-Tell and Up-Down-Captioner.
Show-Attend-Tell is a well-known and widely used image captioning system that incorporates an attention mechanism to attend to spatial visual features. It demonstrates a significant improvement over earlier image captioning models that do not have an attention mechanism. From this starting point, we propose the Face-Cap model which similarly attends to visual features and additionally uses facial expression analyses in generating image captions. Face-Cap incorporates a one-hot encoding vector as a representation of the facial expression analysis, similar to the representations used for sentiment by Hu et al. (2017) and You et al. (2018).
Up-Down-Captioner is the current state-of-the-art image captioning model, defining a new architecture to incorporate attended visual features in generating image captions. In this model, the features directly relate to the objects in the image and two LSTMs (one for generating attention weights and another one for a language model) are used to generate image captions. We propose Face-Attend based on this kind of architecture, as we can apply more fine-grained facial expression features and use two LSTMs to attend to the features in addition to the general visual features. Because Up-Down-Captioner already incorporates attention on objects in the image, our models derived from this allow us to examine the effectiveness of the facial expression features beyond just recognition of the face as an object.
In what follows, we describe our datasets and our facial expression recognition model that are used by Face-Cap and Face-Attend. We then explain Face-Cap in Section 3.3.1 and Face-Attend in Section 3.3.2.
Facial Expression Recognition. To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset. It includes images labeled with standard facial expression categories (happiness, sadness, fear, surprise, anger, disgust and neutral). It consists of 35,887 examples (28,709 for training, 3,589 for the public test set and 3,589 for the private test set), collected by means of the Google search API. The examples are grayscale images of 48-by-48 pixels. After removing 11 completely black examples, we split the training set of FER-2013 into two sections: 25,109 examples for training and 3,589 for validating the model. Similar to other work in this domain [26, 48, 70], we use the private test set of FER-2013 for the performance evaluation of the model after the training phase. To compare with the related work, we do not use the public test set either for training or for validating the model.
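The split sizes quoted above are mutually consistent, which can be checked with a few lines of arithmetic (variable names are ours):

```python
total_examples = 35_887
official_train, public_test, private_test = 28_709, 3_589, 3_589

# The three official partitions account for every example.
assert official_train + public_test + private_test == total_examples

black_examples = 11
usable_train = official_train - black_examples   # 28,698 examples remain

train_split, validation_split = 25_109, 3_589
# The training/validation split uses exactly the remaining examples.
assert train_split + validation_split == usable_train
```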
Image Captioning. To train Face-Cap and Face-Attend, we have extracted a subset of the Flickr 30K dataset with image captions that we name FlickrFace11K. It contains 11,696 images including human faces detected using a convolutional neural network-based face detector.
3.2 Facial Expression Recognition Model
We train a facial expression recognition (FER) model using the VGG-B architecture, but we remove the last convolutional block, including two convolutional layers, and the last max pooling layer from the architecture. We use kernel sizes for all remaining convolutional layers. We use a batch normalization layer after every remaining convolutional block. Our FER model gives a similar performance to the state-of-the-art under a similar experimental setting, as described in Pramerdorfer and Kampel (2016); this is higher than reported human performance. The framework of our FER model is shown in Figure 3.
From the FER model, we extract two classes of facial expression features to use in our image captioning models. The first class of features is the output of the final softmax layer of our FER model, , representing the probability distribution of the facial expression classes for the th face in the image. For the image as a whole, we construct a vector of facial expression features used in our image captioning model as in Equation 1.
where is the number of faces in the image. That is, is a one-hot encoding, which we refer to as the facial encoding vector, of the aggregate facial expressions of the image.
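One way to realize such an aggregate one-hot vector is sketched below. Because Equation 1 is not reproduced here, the aggregation rule (summing the per-face probability distributions and taking the class with the largest total) is an assumption, as are the class ordering and the toy probabilities.

```python
import numpy as np

# Illustrative class order; the order used by the FER model is an assumption.
EXPRESSIONS = ["anger", "disgust", "fear", "happiness",
               "sadness", "surprise", "neutral"]

def facial_encoding(face_probs):
    """Aggregate per-face softmax outputs into one one-hot vector.

    face_probs: array of shape (n_faces, n_classes), one distribution per face.
    Assumed rule: sum over faces, then mark the class with the largest total.
    """
    face_probs = np.asarray(face_probs, dtype=float)
    totals = face_probs.sum(axis=0)
    encoding = np.zeros(face_probs.shape[1])
    encoding[np.argmax(totals)] = 1.0
    return encoding

probs = [
    [0.05, 0.00, 0.05, 0.70, 0.05, 0.05, 0.10],   # face 1: mostly happiness
    [0.10, 0.00, 0.10, 0.30, 0.10, 0.00, 0.40],   # face 2: mostly neutral
]
encoding = facial_encoding(probs)
dominant = EXPRESSIONS[int(np.argmax(encoding))]
```

In this example the summed happiness probability (1.0) exceeds the summed neutral probability (0.5), so the image as a whole is encoded as happiness.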
The second class of features consist of convolutional features extracted from the FER model, giving a more fine-grained representation of the faces in the image. For each face in an image, we extract the last convolutional layer of the model, giving features. We convert these into a representation for each face. We restrict ourselves to a maximum of three faces: in our FlickrFace11K dataset, of the images have at most three faces. If one image has more than three faces, we select the three faces with the biggest bounding box sizes. We then concatenate the features of the three faces leading to dimensions, , where is and is ; we refer to these as facial features. If a sample includes fewer than three faces, we fill in dimensions with zero values.
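The selection and padding steps described above can be sketched as follows. The bounding-box format `(x, y, width, height)`, the helper names and the small feature sizes are hypothetical; only the procedure (keep the three largest faces, concatenate, zero-pad) follows the text.

```python
import numpy as np

MAX_FACES = 3

def build_facial_features(faces, regions_per_face, channels):
    """Concatenate features of the largest faces, zero-padding missing ones.

    faces: list of (bbox, feats) pairs, where bbox = (x, y, width, height)
    and feats has shape (regions_per_face, channels).
    """
    # Keep the faces with the biggest bounding-box areas.
    largest = sorted(faces, key=lambda f: f[0][2] * f[0][3], reverse=True)[:MAX_FACES]
    out = np.zeros((MAX_FACES * regions_per_face, channels))
    for i, (_, feats) in enumerate(largest):
        out[i * regions_per_face:(i + 1) * regions_per_face] = feats
    return out

def make_face(width, height, value, regions=4, channels=5):
    """Toy face whose features are filled with a constant, for inspection."""
    return ((0, 0, width, height), np.full((regions, channels), float(value)))

faces = [make_face(2, 5, 10), make_face(8, 5, 40),
         make_face(4, 5, 20), make_face(6, 5, 30)]
features = build_facial_features(faces, regions_per_face=4, channels=5)
single = build_facial_features([make_face(3, 3, 9)], regions_per_face=4, channels=5)
```

With four candidate faces, the smallest one is dropped and the rest appear in order of decreasing area; with a single face, the last two blocks stay at zero.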
3.3 Image Captioning Models
Our image captioning models aim to generate an image caption, , where is a word and is the length of the caption, using facial expression analyses. As a representation of the image, all our models use the last convolutional layer of VGG-E architecture . In addition to our proposed facial features, the VGG-E network trained on ImageNet  produces a feature map. We convert this into a representation, , where is and is ; we refer to this as the visual features. The specifics of the image captioning models are explained below.
3.3.1 Face-Cap
As in the Show-Attend-Tell model of \citeAxu2015show, we use a long short-term memory (LSTM) network as our caption generator. The LSTM incorporates the emotional content of the image in the form of the facial encoding vector defined in Equation (1). We propose two variants, Face-Cap-Repeat and Face-Cap-Memory, that differ in terms of how the facial encoding vector is incorporated.
Face-Cap-Repeat In Face-Cap-Repeat, in each time step (), the LSTM uses the previous word embedded in dimensions ( selected from an embedding matrix learned without pre-training from random initial values), the previous hidden state (), the attention-based features (), and the facial encoding vector () to calculate input gate (), forget gate (), output gate (), input modulation gate (), memory cell (), and hidden state ().
where , and are learned weights and biases and is the logistic sigmoid activation function. From now on, we show this LSTM equation as a short style (Equation 3).
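A minimal NumPy sketch of one such LSTM step is given below. It follows the standard LSTM formulation with the four inputs named in the text concatenated into a single vector; the stacking of all gate parameters into one matrix, the variable names and the dimensions are assumptions, not the exact parameterization of the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(word_emb, h_prev, c_prev, attended, face_vec, W, b):
    """One LSTM step over [previous word, previous hidden state,
    attention-based features, facial encoding vector]."""
    x = np.concatenate([word_emb, h_prev, attended, face_vec])
    n = h_prev.shape[0]
    pre = W @ x + b                    # all four gate pre-activations, shape (4n,)
    i = sigmoid(pre[0:n])              # input gate
    f = sigmoid(pre[n:2 * n])          # forget gate
    o = sigmoid(pre[2 * n:3 * n])      # output gate
    g = np.tanh(pre[3 * n:4 * n])      # input modulation gate
    c = f * c_prev + i * g             # memory cell
    h = o * np.tanh(c)                 # hidden state
    return h, c

rng = np.random.default_rng(0)
emb_dim, hidden_dim, att_dim, n_expr = 8, 6, 5, 7   # toy sizes
input_dim = emb_dim + hidden_dim + att_dim + n_expr
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
b = np.zeros(4 * hidden_dim)
face_vec = np.eye(n_expr)[3]                        # one-hot facial encoding

h, c = lstm_step(rng.normal(size=emb_dim), np.zeros(hidden_dim),
                 np.zeros(hidden_dim), rng.normal(size=att_dim),
                 face_vec, W, b)
```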
where are unnormalized weights for the visual features () and are the normalized weights using a softmax layer at time step . Our trained weights are represented by . Finally, our attention-based features () are calculated using:
To initialize the LSTM’s hidden state (), we feed the facial features through a standard multilayer perceptron, shown in Equation (6).
We use the current hidden state () to calculate the negative log-likelihood of in each time step (Equation (7)); we call this the face objective function.
where a multilayer perceptron generates , which is the categorical probability distribution of the current hidden state across the facial expression classes. (We adapt this from \citeAhu2017toward and \citeAyou2018image, who use this objective function for injecting ternary-valued sentiment (positive, neutral, negative) into captions.) This loss is estimated and averaged, over all steps, during the training phase.
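The face objective can be sketched as a cross-entropy between the one-hot facial encoding vector and a distribution predicted from each hidden state, averaged over time steps. In the sketch below a single linear layer followed by a softmax stands in for the multilayer perceptron; that simplification and all names and sizes are assumptions.

```python
import numpy as np

def face_objective(hidden_states, face_vec, W, b):
    """Average negative log-likelihood of the facial encoding vector.

    hidden_states: (T, hidden_dim) LSTM hidden states for one caption.
    face_vec: one-hot vector over facial expression classes.
    W, b: parameters of a linear layer standing in for the MLP.
    """
    logits = hidden_states @ W.T + b                      # (T, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(log_probs @ face_vec).mean()

T, hidden_dim, n_expr = 5, 6, 7
hidden_states = np.random.default_rng(1).normal(size=(T, hidden_dim))
face_vec = np.eye(n_expr)[3]

# With zero parameters the prediction is uniform, so the loss equals log(7).
loss_uniform = face_objective(hidden_states, face_vec,
                              np.zeros((n_expr, hidden_dim)), np.zeros(n_expr))
```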
The general objective function of Face-Cap-Repeat is defined as:
A multilayer perceptron and a softmax layer are used to calculate , the probability of the next generated word:
where the learned weights and bias are given by and . The last term in Equation (8) is to encourage Face-Cap-Repeat to equally pay attention to different sets of when a caption generation process is finished. is a regularization constant.
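A regularizer of this kind can be sketched as follows. The exact form of the last term of Equation (8) is not shown here; the sketch assumes the doubly stochastic penalty used in Show-Attend-Tell, which pushes the attention each region receives, summed over all time steps, towards one.

```python
import numpy as np

def attention_regularizer(alphas, lam):
    """Penalize uneven total attention across regions.

    alphas: (T, K) attention weights, one row per time step (rows sum to 1).
    lam: regularization constant.
    """
    coverage = alphas.sum(axis=0)            # total attention per region
    return lam * np.sum((1.0 - coverage) ** 2)

even = np.eye(3)                             # each region attended exactly once
uneven = np.array([[1.0, 0.0, 0.0]] * 3)     # only the first region is attended

penalty_even = attention_regularizer(even, lam=0.5)
penalty_uneven = attention_regularizer(uneven, lam=0.5)
```

The evenly spread attention incurs no penalty, while attending to a single region at every step gives 0.5 × ((1 − 3)² + 1² + 1²) = 3.0.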
Face-Cap-Memory. The above Face-Cap-Repeat model feeds in the facial encoding vector at the initial step (Equation (6)) and at each time step (Equation (3)), shown in Figure 4 (top). The LSTM uses the vector for generating every word because the vector is fed at each time step. Since not all words in the ground truth captions will be related to the vector (for example in Figure 1, where the majority of words are not directly related to the facial expressions), this mechanism could lead to an overemphasis on these features.
Our second variant of the model, Face-Cap-Memory, is as above except that the term is removed from Equation (3): we do not apply the facial encoding vector at each time step (Figure 4 (bottom)) and rely only on Equation (7) to memorize this facial expression information. Using this mechanism, the LSTM can use the information when generating image captions and ignore it when it is irrelevant. To handle an analogous issue for sentiment, You et al. (2018) implemented a sentiment cell, working similarly to the memory cell in the LSTM, initialized by the ternary sentiment. They then fed the visual features to initialize the memory cell and hidden state of the LSTM. Similarly, Face-Cap-Memory uses the facial features to initialize the memory cell and hidden state. Using the attention mechanism, our model applies the visual features in generating every caption word.
3.3.2 Face-Attend
In this section, we apply two LSTMs to attend to our more fine-grained facial features () explained in Section 3.2, in addition to our visual features (). We propose two variant architectures for combining these features, Dual-Face-Att and Joint-Face-Att, explained below.
The framework of Dual-Face-Att is shown in Figure 5. To generate image captions, Dual-Face-Att includes two LSTMs: one, called F-LSTM, to attend to facial features and another one, called C-LSTM, to attend to visual content. Both LSTMs are defined as in Equation (10), but with separate training parameters.
In both LSTMs, to calculate at each time step (), features (the facial features () for F-LSTM and the visual features () for C-LSTM) are weighted using a soft attention mechanism, but with separately learned parameters.
where and are unnormalized weights for features , and normalized weights using a softmax layer, respectively. Our trained weights are . Finally, our attention-based features () are calculated using:
is for F-LSTM and for C-LSTM. The initial LSTM’s hidden state () is computed using a standard multilayer perceptron:
The objective function of Dual-Face-Att is defined using Equation (14).
where a multilayer perceptron and a softmax layer, for each LSTM, are used to calculate and (the probabilities of the next generated word on the basis of facial expression features and visual features, respectively):
The last two terms in Equation 14 are to encourage Face-Attend to equally pay attention to different sets of and when a caption generation process is finished. is a regularization constant. The ultimate probability of the next generated word is:
The above Dual-Face-Att model uses two LSTMs: one for attending to visual features and another one for attending to facial features. In the model, both LSTMs also play the role of language models (Equation (16)) and directly affect the prediction of the next generated word. However, the recent state-of-the-art image captioning model of Anderson et al. (2018) achieved better performance by using two LSTMs with differentiated roles: one for attending only to visual features and a second one purely as a language model. Inspired by this, we define our Joint-Face-Att variant to use one LSTM, which we call A-LSTM, to attend to image-based features, both facial and visual; and a second one, which we call L-LSTM, to generate language (Figure 6). Here, we calculate the hidden state of A-LSTM using:
where is the mean-pooled visual features and is the previous hidden state of L-LSTM. We also calculate the hidden state of L-LSTM using:
where and are the attended facial features and visual features, respectively. They are defined analogously to Equation (11) and (12), but with different sets of trainable parameters. and are similarly initialized as follows using two standard multilayer perceptrons:
The objective function of Joint-Face-Att is:
where is a balancing parameter and is the probability of the next generated word calculated as follows:
where and are trainable weights and bias, respectively.
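The soft attention step shared by the attending LSTMs above (unnormalized weights, softmax normalization, then a weighted sum of the features) can be illustrated with a minimal sketch. The additive tanh scoring function and the names `W_a`, `W_h` and `w` are assumptions made for illustration; they are not taken from the paper's equations.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def soft_attention(features, hidden, W_a, W_h, w):
    """Return (normalized weights, attended feature vector).

    features: list of k feature vectors (facial or visual), each of size d
    hidden:   previous LSTM hidden state
    W_a, W_h, w: hypothetical trainable parameters of an additive scorer
    """
    proj_h = matvec(W_h, hidden)
    # Unnormalized weight for each feature vector.
    scores = []
    for a in features:
        proj_a = matvec(W_a, a)
        scores.append(dot(w, [math.tanh(x + y) for x, y in zip(proj_a, proj_h)]))
    # Softmax normalization (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-based feature: weighted sum of the input features.
    dim = len(features[0])
    context = [sum(wt * a[j] for wt, a in zip(weights, features)) for j in range(dim)]
    return weights, context
```

In a Dual-Face-Att-style setup, such a function would be called twice per time step, once with the facial features and once with the visual features, each with its own parameter set.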
4.1 Evaluation Metrics
Following previous work, we evaluate our image captioning models using standard evaluation metrics including BLEU, ROUGE, METEOR, CIDEr, and SPICE. Higher values indicate better results for all metrics. BLEU calculates a weighted average for n-grams with different sizes as a precision metric. ROUGE is a recall-oriented metric that calculates F-measures using the matched n-grams between the generated captions and their corresponding reference captions. METEOR uses a weighted F-measure matching synonyms and stems in addition to standard n-gram matching. CIDEr uses n-gram matching, calculated using the cosine similarity, between the generated captions and the consensus of the reference captions. Finally, SPICE calculates an F-score for semantic tuples derived from scene graphs.
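As an illustration of the precision-oriented matching behind BLEU, the following sketch computes the clipped (modified) n-gram precision of one candidate caption against several references. It is only one ingredient of BLEU: the brevity penalty and the combination over n-gram sizes are omitted, and the function names are ours rather than those of any evaluation toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference captions."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

Clipping prevents a caption that repeats a common word from receiving full credit for every repetition.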
4.2 Systems for Comparison
The starting points for our Face-Cap and Face-Attend models are Show-Attend-Tell and Up-Down-Captioner, respectively. We therefore use these models, trained on the FlickrFace11K dataset, as baselines to examine the effect of adding facial expression information. We call these baseline models Show-Att-Tell and Up-Down. (Moreover, Anderson et al. (2018) report the state-of-the-art results for image captioning.)
We further look at two additional models to investigate the impact of the face loss function in using the facial encoding in different schemes. We train the Face-Cap-Repeat model, which uses the facial encoding in every time step, without calculating the face loss function (Equation (7)); we refer to this (following the terminology of Hu et al. (2017) and You et al. (2018)) as the Step-Inject model. The Face-Cap-Memory model, which applies the facial encoding in the initial time step, is also modified in the same way; we refer to this as the Init-Flow model.
4.3 Implementation Details
The size of the word embedding layer, initialized via a uniform distribution, is set to except for Up-Down and Joint-Face-Att, for which it is set to 512. We fixed dimensions for the memory cell and the hidden state in this work. We use the mini-batch size of and the initial learning rate of to train each image captioning model except Up-Down and Joint-Face-Att, where we set the mini-batch size to 64 and the initial learning rate to . We used different parameters for Up-Down and Joint-Face-Att in comparison with other models because using similar parameters led to worse results for all models. The Adam optimization algorithm is used for optimizing all models. During the training phase, if the model does not improve its METEOR score on the validation set in two successive epochs, we divide the learning rate by two (the minimum learning rate is set to ) and the previously trained model with the best METEOR is reloaded. This method of learning rate decay is inspired by Wilson et al. (2017), who advocated tuning the learning rate decay for Adam. In addition to learning rate decay, METEOR is applied to select the best model on the validation set because of a reasonable correlation between METEOR and human judgments. Although SPICE can have higher correlations with human judgments, METEOR is quicker to calculate than SPICE, which requires dependency parsing, and so is more suitable as a training criterion. The epoch limit is set to 30. We use the same vocabulary size and visual features for all models. in Equation (8) is empirically set to . and are also set to and in Equation (14) and (20), respectively. Multilayer perceptrons in Equation (6), (13) and (19) use as an activation function.
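The learning rate decay rule described above can be sketched as a small scheduler. This is an illustrative sketch rather than the released training code: the class name is hypothetical, the initial and minimum learning rates are left as parameters, and we assume that the count of non-improving epochs restarts after each decay.

```python
class MeteorPlateauSchedule:
    """Halve the learning rate after `patience` epochs without a METEOR gain."""

    def __init__(self, initial_lr, min_lr, patience=2):
        self.lr = initial_lr
        self.min_lr = min_lr          # floor for the learning rate
        self.patience = patience      # successive non-improving epochs tolerated
        self.best_score = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def step(self, epoch, meteor):
        """Record a validation METEOR score.

        Returns True when the caller should reload the checkpoint saved at
        `best_epoch` and continue training with the reduced learning rate.
        """
        if meteor > self.best_score:
            self.best_score = meteor
            self.best_epoch = epoch
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs < self.patience:
            return False
        # No improvement for `patience` successive epochs: decay and reload.
        self.lr = max(self.lr / 2.0, self.min_lr)
        self.stale_epochs = 0  # assumption: restart the count after a decay
        return True
```

A training loop would call `step` once per epoch after validation, save a checkpoint whenever `best_epoch` changes, and stop at the epoch limit.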
4.4 Experimental Results
Quantitative Analysis: Performance Metrics The FlickrFace11K splits are used for training and evaluating all image captioning models in this paper. Table 1 summarizes the results on the FlickrFace11K test set. Dual-Face-Att and Joint-Face-Att outperform other image captioning models using all the evaluation metrics. For example, Dual-Face-Att achieves 17.6 for BLEU-4, which is 1.9 and 0.4 points better than Show-Att-Tell (the first baseline model) and Face-Cap-Memory (the best of the Face-Cap models), respectively. Joint-Face-Att also achieves a BLEU-4 score of 17.7, which is 0.4 better than Up-Down, the baseline model it builds on, and 0.5 better than Face-Cap-Memory. Dual-Face-Att and Joint-Face-Att show very close results, with Dual-Face-Att demonstrating a couple of larger gaps in performance, in the BLEU-1 and ROUGE-L metrics. Among the Face-Cap models, Face-Cap-Memory is clearly the best.
Quantitative Analysis: Entropy, Top and Ranking of Generated Verbs To analyze what it is about the captions themselves that differs under the various models, with respect to our aim of injecting information about emotional states of the faces in images, we first extracted all generated adjectives, which are tagged using the Stanford part-of-speech tagger software. Perhaps surprisingly, emotions do not manifest themselves in the adjectives in our models: the adjectives used by all systems are essentially the same.
To investigate this further, we took the NRC emotion lexicon
Among the reference captions, as noted above the most frequent word from the emotion lexicon was young, followed by white, blue and black; all of these presumably have some emotional association, but do not generally embody an emotion. The first word embodying the expression of an emotion is the verb smiling, at rank 8, with other similar verbs following closely (e.g. laughing, enjoying). The highest ranked emotion-embodying adjective is happy at rank 26, with a frequency of around 15% of that of smiling; other adjectives were much further behind. It is clear that verbs form a more significant expression of emotion in this particular dataset than do adjectives.
To come up with an overall quantification of the different linguistic properties of the generated captions under the models, we therefore focussed our investigation on the differences in distributions of the generated verbs. To do this, we calculated three measures. The first is entropy (in the information-theoretic sense), which can indicate which distributions are closer to deterministic and which are more spread out (with a higher score indicating more spread out): in our context, it will indicate the amount of variety in selecting verbs. We calculated entropy using the standard Equation (22).
where indicates the number of the unique generated verbs and is the probability of each generated verb (), estimated as the Maximum Likelihood Estimate from the sample.
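The entropy measure can be computed directly from the list of verbs generated by a model. The sketch below uses relative frequencies as the Maximum Likelihood Estimate of each verb's probability; the base-2 logarithm is an assumption, and the function name is ours.

```python
import math
from collections import Counter


def verb_entropy(verbs):
    """Entropy (in bits) of the empirical distribution over generated verbs."""
    counts = Counter(verbs)
    total = sum(counts.values())
    # p(v) is estimated as count(v) / total, the Maximum Likelihood Estimate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A model that always generates the same verb has entropy zero, and the value grows as the probability mass is spread over more verbs.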
As a second measure, we looked at the four most frequent verbs (Top), which are the same for all models (is, sitting, are, standing) — these are verbs with relatively little semantic content, and for the most part act as syntactic props for the content words of the sentence. The amount of probability mass left beyond those four verbs is another indicator of variety in verb expression.
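The Top measure is the share of all generated verb tokens taken up by the four most frequent verbs. A minimal sketch, with a hypothetical function name, selects the `k` most frequent verbs by count, which coincides with the fixed set above when those four verbs are the most frequent ones:

```python
from collections import Counter


def top_k_mass(verbs, k=4):
    """Fraction of verb tokens accounted for by the k most frequent verbs."""
    counts = Counter(verbs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total
```

The remaining mass, `1 - top_k_mass(verbs)`, is what is left for more content-bearing verbs.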
Table 2 shows that Dual-Face-Att can generate the most diverse distribution of the verbs compared to other models because it has the highest Entropy. It also shows that Dual-Face-Att has the lowest (best) proportion of the probability mass taken up by Top, leaving more for other verbs. In contrast to the results of the standard image captioning metrics shown in Table 1, Dual-Face-Att and Joint-Face-Att show very different behaviour: Dual-Face-Att is clearly superior. Among the Face-Cap models, as for the overall metrics, Face-Cap-Memory is the best, and is in fact better than Joint-Face-Att. (As a comparison, we also show Entropy and Top for all reference captions (5 human-generated captions per image): human-generated captions are still much more diverse than the best models.)
The two measures above are concerned only with variety of verb choice and not with verbs linked specifically to emotions or facial expressions. For a third measure, therefore, we look at selected individual verbs linked to actions that relate to facial emotion expression, either direct or indirect. Our measure is the rank of the selected verb among all those chosen by a model; higher (i.e. lower-numbered) ranked verbs mean that the model more strongly prefers this verb. Our selected verbs are among those that ranked highly in the reference captions and also appeared in the emotion lexicon.
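The rank of a selected verb can be obtained by sorting a model's generated verbs by frequency. In this sketch, ties are broken by order of first occurrence, which is an assumption of ours, since the text does not specify how ties are ranked.

```python
from collections import Counter


def verb_rank(verbs, target):
    """1-based frequency rank of `target` among generated verbs, or None."""
    # most_common() sorts by descending count; equal counts keep insertion order.
    ranked = [verb for verb, _ in Counter(verbs).most_common()]
    if target not in ranked:
        return None  # the model never generated this verb
    return ranked.index(target) + 1
```

A return value of `None` corresponds to a model that does not use the verb at all.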
Table 3 shows a sample of those verbs such as singing, reading and laughing. The baseline Show-Att-Tell model ranks all of those relatively low, where our other baseline Up-Down and our models incorporating facial expressions do better. Only Face-Cap-Memory (the best of our Face-Cap models by overall metrics) and our Face-Attend models manage to use verbs like laughing and reading.
Qualitative Analysis: Example Generated Captions In Figure 7, we compare captions generated by different image captioning models for four representative images. The first one shows that Dual-Face-Att correctly uses smiling and laughing to capture the emotional content of the image. Step-Inject, Init-Flow, Face-Cap-Repeat and Face-Cap-Memory are also successful in generating smiling for the image. For the second sample, Dual-Face-Att and Joint-Face-Att use the relevant verb singing to describe the image, while other models cannot generate the verb. Similarly, Dual-Face-Att generates the verb reading for the third image. Moreover, most models can correctly generate smiling for the fourth image, except Show-Att-Tell and Up-Down, which do not use the facial information. Init-Flow also cannot generate smiling because it uses the facial information only at the initial step, which provides a weak emotional signal for the model. Here, Dual-Face-Att can generate the most accurate caption (“A man and a woman are smiling at the camera”) for the image, while other models generate some errors. For example, Face-Cap-Memory generates “A woman and a young girl are smiling”, which does not describe the man in the image.
Figure 8 shows two examples including some improper words and phrases. For the first image, Dual-Face-Att generates “Two women are sitting at a table with a laptop and a laptop”. This caption wrongly includes laptop and two women. Here, other models are more successful in generating relevant image captions. For the second image, Joint-Face-Att incorrectly generates “holding a small child” and Face-Cap-Memory wrongly generates “a dog”.
Qualitative Analysis: Visualizing Attention To help qualitatively analyze the attention weights learned by the different models, Figure 9 shows the attended pixels in the regions of the detected faces in an example image. (We only highlight the attention weights in these regions.) We choose the face regions to compare the models because our models use extra information corresponding to these regions. As the figure indicates, because of this extra information, our models attend to the face regions less than Show-Att-Tell and Up-Down do. For example, Dual-Face-Att barely attends to the face regions for any generated word. This shows that the extra information is more representative than the faces themselves, so the models mostly describe the image without needing to attend to the faces (they use the extra information instead).
In this work, we have presented several image captioning models incorporating information from facial features. The joint image captioning models, Dual-Face-Att and Joint-Face-Att, learned to apply both facial features and visual content to generate image captions, and achieve the highest scores on the standard metrics on the FlickrFace11K dataset. They use attention mechanisms to adaptively take into account the facial expressions present in images and so generate more descriptive image captions. The example generated captions show that the models can generate more diverse image captions, in addition to being better able to employ facial expression features to describe images.
There is other recent work that explores other aspects of emotional content in images; we note specifically the dataset of You et al. (2016). In future work, we are interested in exploring this broader emotional content of images, which is reflected in the NRC Emotion Lexicon we used in our linguistic analysis of captions.
- An earlier version of Face-Cap has already been published.
- Our source code and trained models are publicly available: https://github.com/omidmnezami/Face-Attend
- Our dataset splits and labels are publicly available: https://github.com/omidmnezami/Face-Cap
- A related idea was earlier proposed by Radford et al. (2017), who identified a sentiment unit in an RNN-based system.
- The new version (2018) of the Dlib library is used.
- (2016) Spice: semantic propositional image caption evaluation. In ECCV, pp. 382–398. Cited by: §4.1, §4.3.
- (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Vol. 3, pp. 6. Cited by: §1, §2.1, §3, §4.2.
- (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6298–6306. Cited by: §1.
- (2018) “Factual” or “emotional”: stylized image captioning with adaptive learning and attention. arXiv preprint arXiv:1807.03871. Cited by: §2.1.2.
- (2015) Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, pp. 2422–2431. Cited by: §2.1.
- (2002) Control of goal-directed and stimulus-driven attention in the brain. Nature reviews neuroscience 3 (3), pp. 201. Cited by: §2.1.1.
- (2014) Meteor universal: language specific translation evaluation for any target language. In WMT, pp. 376–380. Cited by: §4.1.
- (2015) Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pp. 2625–2634. Cited by: §2.1.
- (2006) Darwin and facial expression: a century of research in review. Ishk. Cited by: §2.2.
- (2013) Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1292–1302. Cited by: §2.1.
- (2010) Every picture tells a story: generating sentences from images. In ECCV, pp. 15–29. Cited by: §2.1, §2.1.
- (2003) Automatic facial expression analysis: a survey. Pattern recognition 36 (1), pp. 259–275. Cited by: §1, §2.2.
- (1982) Discrimination and imitation of facial expression by neonates. Science 218 (4568), pp. 179–181. Cited by: §2.2.
- (2014) Human facial expression: an evolutionary view. Academic Press. Cited by: §2.2.
- (2017) Stylenet: generating attractive visual captions with styles. In CVPR, Cited by: §1, §2.1.2, §2.1.2.
- (2013) Challenges in representation learning: a report on three machine learning contests. In ICONIP, pp. 117–124. Cited by: Figure 2, §2.2, §2.2, §3.1, §3.2.
- (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §2.1.
- (2019) A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51 (6), pp. 118. Cited by: §2.1.
- (2018) Emotional dialogue generation using image-grounded language models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 277. Cited by: §2.1.2.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.2.
- (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §2.1.1.
- (2016) Densecap: fully convolutional localization networks for dense captioning. In CVPR, pp. 4565–4574. Cited by: §1, §2.1.
- (2013) Combining modality specific deep neural networks for emotion recognition in video. In ICMI, pp. 543–550. Cited by: §2.2.
- (2000) Comprehensive database for facial expression analysis. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), pp. 46–53. Cited by: §2.2.
- (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137. Cited by: §2.1.
- (2016) Fusing aligned and non-aligned face information for automatic affect recognition in the wild: a deep learning approach. In CVPR Workshops, pp. 48–57. Cited by: §3.1.
- (2009) Dlib-ml: a machine learning toolkit. Journal of Machine Learning Research 10 (Jul), pp. 1755–1758. Cited by: §3.1.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
- (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §2.1.
- (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pp. 115–141. Cited by: §2.1.1.
- (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §2.1.1.
- (2013) Baby talk: understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2891–2903. Cited by: §2.1.
- (2012) Collective generation of natural image descriptions. In ACL, pp. 359–368. Cited by: §2.1.
- (2016) Share-and-chat: achieving human-level video commenting by search and multi-view embedding. In Proceedings of the 24th ACM international conference on Multimedia, pp. 928–937. Cited by: §2.1.2.
- (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §4.1.
- (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.1.
- (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In CVPR, Vol. 6, pp. 2. Cited by: §1.
- (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 94–101. Cited by: §1.
- (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632. Cited by: §2.1.
- (2016) SentiCap: generating image descriptions with sentiments.. In AAAI, pp. 3574–3580. Cited by: §1, §2.1.2, §2.1.2.
- (2018) SemStyle: learning to generate stylised image captions using unaligned text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8591–8600. Cited by: §2.1.2.
- (2013) Crowdsourcing a word-emotion association lexicon. 29 (3), pp. 436–465. Cited by: §4.4.
- (2018) Face-cap: image captioning using facial expression analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 226–240. Cited by: footnote 1.
- (2018) Automatic recognition of student engagement using deep learning and facial expression. arXiv preprint arXiv:1808.02324. Cited by: §2.2.
- (2008-01) Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2 (1-2), pp. 1–135. Cited by: §1, §2.1.2.
- (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §4.1.
- (1999) Linguistic styles: language use as an individual difference.. Journal of personality and social psychology 77 (6), pp. 1296. Cited by: §2.1.2.
- (2016) Facial expression recognition using convolutional neural networks: state of the art. arXiv preprint arXiv:1612.02903. Cited by: §3.1.
- (2017) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis & Machine Intelligence (6), pp. 1137–1149. Cited by: §2.1.1.
- (2017) Self-critical sequence training for image captioning. In CVPR, Vol. 1, pp. 3. Cited by: §1, §2.1.
- (2000) The dynamic representation of scenes. Visual cognition 7 (1-3), pp. 17–42. Cited by: §2.1.1.
- (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §2.1.1, §3.3.
- (2015) Automatic analysis of facial affect: a survey of registration, representation, and recognition. IEEE transactions on pattern analysis and machine intelligence 37 (6), pp. 1113–1133. Cited by: §1, §2.2.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2, §3.3.
- (2004) A feedback model of visual attention. Journal of cognitive neuroscience 16 (2), pp. 219–237. Cited by: §2.1.1.
- (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §2.1.
- (1998) Introduction to reinforcement learning. Vol. 135, MIT press Cambridge. Cited by: §2.1.1.
- (2001) Recognizing action units for facial expression analysis. IEEE Transactions on pattern analysis and machine intelligence 23 (2), pp. 97–115. Cited by: §2.2.
- (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL HLT, pp. 173–180. Cited by: §4.4.
- (2013) Selective search for object recognition. International journal of computer vision 104 (2), pp. 154–171. Cited by: §2.1.1.
- (2015) Cider: consensus-based image description evaluation. In CVPR, pp. 4566–4575. Cited by: §2.1.1, §4.1.
- (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §2.1.
- (2015) Show and tell: a neural image caption generator. In CVPR, pp. 3156–3164. Cited by: §1, §2.1.
- (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.1.1.
- (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, pp. 2048–2057. Cited by: §1, §2.1, §2.1, §3, §4.2.
- (2006) A 3d facial expression database for facial behavior research. In 7th international conference on automatic face and gesture recognition (FGR06), pp. 211–216. Cited by: §2.2.
- (2018) Image captioning at will: a versatile scheme for effectively injecting sentiments into image descriptions. arXiv preprint arXiv:1801.10121. Cited by: §1.
- (2016) Image captioning with semantic attention. In CVPR, pp. 4651–4659. Cited by: §1, §2.1.
- (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: Figure 1, 2nd item, §3.1.
- (2015) Image based static facial expression recognition with multiple deep network learning. In ICMI, pp. 435–442. Cited by: §3.1.
- (2018) Facial expression recognition via learning deep sparse autoencoders. Neurocomputing 273, pp. 643–649. Cited by: §2.2.
- (2008) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE transactions on pattern analysis and machine intelligence 31 (1), pp. 39–58. Cited by: §2.2.