Face-Cap: Image Captioning using Facial Expression Analysis


Omid Mohamad Nezami, Mark Dras, Peter Anderson, Len Hamey
Department of Computing, Macquarie University, Sydney, Australia
Peter Anderson: also The Australian National University, Canberra, Australia
omid.mohamad-nezami@hdr.mq.edu.au, {mark.dras,len.hamey}@mq.edu.au, peter.anderson@anu.edu.au
Abstract

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these aspects, we use facial expression features extracted from images containing human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. On an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces, our Face-Cap models outperform a state-of-the-art baseline model on all standard evaluation metrics. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.

Keywords:
Image captioning · Facial expression recognition · Sentiment analysis · Deep learning

1 Introduction

Image captioning systems aim to describe the content of an image using computer vision and natural language processing. This is a challenging task in computer vision because, to generate a meaningful description, we have to capture not only the objects but also their relations and the activities displayed in the image. Most of the state-of-the-art methods, including deep neural networks, generate captions that reflect the factual aspects of an image [3, 8, 12, 16, 20, 35, 37]; the emotional aspects, which can provide richer and more attractive image captions, are usually ignored in this process. Emotional properties, including recognizing and expressing emotions, are required in designing intelligent systems to produce intelligent, adaptive, and effective results [22]. Designing an image captioning system that can recognize emotions and apply them to describe images is still a challenge.

A few models have incorporated sentiment or other non-factual information into image captions [10, 23, 38]; they typically require the collection of a supplementary dataset, with a sentiment vocabulary derived from that, drawing from work in Natural Language Processing [25] where sentiment is usually characterized as one of positive, neutral or negative. Mathews et al. [23], for instance, constructed a sentiment image-caption dataset via crowdsourcing, where annotators were asked to include either positive sentiment (e.g. a cuddly cat) or negative sentiment (e.g. a sinister cat) using a fixed vocabulary; their model was trained on both this and a standard set of factual captions. Gan et al. [10] proposed a captioning model called StyleNet to add styles, which could include sentiments, to factual captions; they specified a predefined set of styles, such as humorous or romantic.

These kinds of models typically embody descriptions of an image that represent an observer’s sentiment towards the image (e.g. a cuddly cat for a positive view of an image, versus a sinister cat for a negative one); they do not aim to capture the emotional content of the image, as in Fig. 1. This distinction has been recognized in the sentiment analysis literature: the early work of [24], for instance, proposed a graph-theoretical method for predicting sentiment expressed by a text’s author by first removing text snippets that are positive or negative in terms of the actual content of the text (e.g. “The protagonist tries to protect her good name” as part of the description of a movie plot, where good has positive sentiment) and leaving only the sentiment-bearing text that reflects the writer’s subjective view (e.g. “bold, imaginative, and impossible to resist”). We are interested in precisely this notion of content-related sentiment, in the context of an image.

In this paper, therefore, we introduce an image captioning model we term Face-Cap to incorporate emotional content from the images themselves: we automatically detect emotions from human faces, and apply the derived facial expression features in generating image captions. We introduce two variants of Face-Cap, which employ the features in different ways to generate the captions. The contributions of our work are:

  1. Face-Cap models that generate captions incorporating facial expression features and emotional content, using neither sentiment image-caption paired data nor sentiment caption data, which is difficult to collect. To the authors’ knowledge, this is the first study to apply facial expression analysis in image captioning tasks.

  2. A set of experiments that demonstrate that these Face-Cap models outperform a state-of-the-art baseline model on all standard evaluation metrics. An analysis of the generated captions suggests that they improve over the baseline by better describing the actions performed in the image.

  3. An image caption dataset that includes human faces, which we have extracted from the Flickr 30K dataset [39] and term FlickrFace11K. It is publicly available at https://github.com/omidmn/Face-Cap to facilitate future research in this domain.

The rest of the paper is organized as follows. In Sec. 2, related work in image captioning and facial expression recognition is described. In Sec. 3, we explain our models to caption an image using facial expression analysis. To generate sentimentally human-like captions, we show how facial expression features are detected and applied in our image captioning models. Sec. 4 presents our experimental setup and the evaluation results. The paper concludes in Sec. 5.

2 Related Work

In the following subsections, we review image captioning and facial expression recognition models as they are the key parts of our work.

2.1 Image Captioning

Recent image captioning models apply a CNN to learn the image content (encoding), followed by an LSTM to generate the image caption (decoding). This follows the paradigm employed in neural machine translation, using deep neural networks [31] to translate an image into a caption. In terms of encoding, these models are divided into two categories: global encoding and fragment-level encoding [15]. The global approach encodes an image into a single feature vector, while the fragment-level one encodes image fragments into separate feature vectors.

As a global encoding technique, Kiros et al. [20] applied a CNN and an LSTM to capture the image and caption information separately. They constructed a joint multi-modal space to encode this information, and a multi-modal log-bilinear model (in the form of a language model) to generate new captions. In comparison, Vinyals et al. [35] encoded image content using a CNN and applied an LSTM to generate a caption in an end-to-end neural network model. In general, global encoding approaches generate captions according to the objects detected in an image; however, when the test samples differ significantly from the training samples in terms of object locations and interactions, they often fail to generalize and produce appropriate captions.

With respect to fragment-level encoding, Fang et al. [8] detected words from visual regions and used a maximum entropy language model to generate candidate captions. Instead of using LSTMs, they utilized a re-ranking method called deep multi-modal similarity to select the captions. Karpathy and Fei-Fei [16] applied a region-based image captioning model consisting of two separate models to detect an image region and generate its corresponding caption. Johnson et al. [12], building on the work of Ren et al. [28] on detecting image regions, combined the detection and generation tasks in an end-to-end training setup. Attention mechanisms (either hard or soft) were applied by Xu et al. [37] to detect salient regions and generate their related words; at each time step, the model dynamically used the regional features as inputs to the LSTM. Fragment-level encoding methods detect objects and their corresponding regions in an image. However, they usually neglect fine-grained but significant aspects of the data, such as emotions. The work that we describe next has recognized this: human captions, such as those in Fig. 1, do include sentiment, and image captioning systems should therefore also aim to do this.

There are a few models that have incorporated sentiment into image captions [10, 23, 38]. However, this has typically required the construction of a new dataset, and the notion of sentiment is realized via a sentiment lexicon. Mathews et al. [23] applied a model called SentiCap to describe images using predefined positive and negative sentiments. The model used a full switching method with two parallel systems, each of which includes a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM). The first system was used to generate factual image captions and the second one to add word-level sentiments. The latter required a specifically constructed dataset, where crowdsourced workers rewrote thousands of factual captions to incorporate terms from a list of sentiment-bearing adjective-noun pairs. You et al. [38] presented two optimum schemes to employ predefined sentiments in generating image descriptions. Their approach is still focused on subjective descriptions of images using a given sentiment vocabulary, rather than representing the emotional content of the image.

The StyleNet system of Gan et al. [10], which we noted in Sec. 1, adds styles, including sentiment values, to factual captions; these styles include humorous and romantic. Once more, these reflect the attitude of the viewer to the image, and it is in principle possible to generate captions that do not accord with the content of the image: for instance, while happy faces of babies can be properly described using positive sentiment, it is difficult to apply negative sentiment in this context.

Figure 1: Examples from the Flickr 30K dataset [39] that include sentiment: a man in a suit and tie with a sad look on his face (left), and a man on a sidewalk is playing the accordion while happy people pass by (right).

In contrast to this work, we focus on images including human faces and recognize relevant emotions, using facial expression analyses, to generate image captions. Furthermore, we do not use any specific sentiment vocabulary or dataset to train our models: our goal is to see whether, given the existing vocabulary, incorporating facial emotion can produce better captions.

2.2 Facial Expression Recognition

Facial expression is a form of non-verbal communication which conveys attitudes, affect, and intentions of individuals. Changes in facial features and muscles over time produce facial expressions [9]. Darwin began the research leading to the study of facial expressions more than a century ago [7]. There is now a large body of work on recognizing basic facial expressions [9, 29], most often using the framework of six purportedly universal emotions [6] of happiness, sadness, fear, surprise, anger, and disgust, plus neutral expressions. Recently, to find effective representations, deep learning based methods have been successfully applied to facial expression recognition (FER) tasks. They are able to capture hierarchical structure from low- to high-level data representations thanks to their complex architectures with multiple layers. Among deep models, Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in this domain. Kahou et al. [14], in the winning submission to the 2013 Emotion Recognition in the Wild Challenge, used CNNs to recognize facial expressions. CNNs and linear support vector machines were trained to detect basic facial expressions by Tang [32], who won the 2013 FER challenge [11]. In FER tasks, CNNs can also be used for transfer learning and feature extraction. Yu and Zhang [40] used CNNs, in addition to a face detection approach, to recognize facial expressions using transfer learning; the face detection approach was applied to detect face regions and remove irrelevant noise in the target samples. Kahou et al. [13] also used CNNs for extracting visual features, together with audio features, in a multi-modal framework.

As is apparent, these models usually employ CNNs with a fairly standard deep architecture to produce good results on the FER-2013 dataset [11], a large dataset collected 'in the wild'. Pramerdorfer et al. [27], instead, applied a combination of modern deep architectures, including VGGnet [30], to the dataset, achieving the state-of-the-art result in this domain. We similarly aim to train a facial expression recognition model that can recognize facial expressions in the wild and produce state-of-the-art performance on the FER-2013 dataset. We then use the model as a feature extractor on the images of FlickrFace11K, our dataset extracted from Flickr 30K [39]; these features are applied as part of our image captioning models in this work.

3 Describing an Image using Facial Expression Analysis

In this section, we describe our image captioning models, which we term Face-Cap, that generate image captions using facial expression analysis. We use a facial expression recognition model to extract the facial expression features from an image; the Face-Cap models in turn apply these features to generate image descriptions. In the following subsections, we first describe the datasets used in this work. Second, the face pre-processing step is explained, which detects faces in our image caption data and makes them consistent with our facial expression recognition data. Third, the faces are fed into our facial expression recognition model to extract facial expression features. Finally, we describe the Face-Cap models, which are image captioning systems trained by leveraging the facial expression features together with image-caption paired data.

3.1 Datasets

To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset [11]. It includes in-the-wild samples labeled happiness, sadness, fear, surprise, anger, disgust, and neutral. It consists of 35,887 examples (28,709 for training, 3,589 for the public test set and 3,589 for the private test set), collected by means of the Google search API. The examples are grayscale images of 48-by-48 pixels. After removing 11 completely black examples, we split the training set of FER-2013 into two sections: 25,109 examples for training and 3,589 for validating the model. Similar to other work in this domain [17, 27, 40], we use the private test set of FER-2013 for the performance evaluation of the model after the training phase. For comparability with related work, we do not apply the public test set either for training or for validating the model.

To train our image captioning models, we have extracted a subset of the Flickr 30K dataset with image captions [39], which we term FlickrFace11K. It contains 11,696 examples including human faces, which are detected using a CNN-based face detection algorithm [18] (we apply the 2018 version of the Dlib library). We choose the Flickr 30K dataset as the source for our dataset because it has a larger portion of samples that include human faces, in comparison with other image caption datasets such as the COCO dataset [4]. We split the FlickrFace11K samples into 8,696 for training, 2,000 for validation and 1,000 for testing, and make them publicly available at https://github.com/omidmn/Face-Cap. To extract the facial features of the samples, we use a face pre-processing step and a facial expression recognition model as follows.

3.2 Face Pre-processing

Since we aim to train a facial expression recognition model on FER-2013 and use it as a facial expression feature extractor on the samples of FlickrFace11K, we need to make the samples consistent with the FER-2013 data. To this end, a face detector is used to pre-process the faces of FlickrFace11K. The faces are detected by the CNN-based face detection algorithm and cropped from each sample. Then, we transform each face to grayscale and resize it to 48-by-48 pixels, exactly matching the FER-2013 format.
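To make this step concrete, the following is a minimal sketch of the pre-processing pipeline, assuming the standard dlib CNN face detector weights file ("mmod_human_face_detector.dat") is available locally; the exact detector settings used for FlickrFace11K are not specified here.

```python
# A minimal sketch of the face pre-processing step (Sec. 3.2): detect faces,
# crop them, convert to grayscale, and resize to the 48x48 FER-2013 format.
import cv2
import dlib

# Assumption: the dlib CNN face detector weights are available locally.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def extract_faces(image_path, size=48):
    """Return a list of 48x48 grayscale face crops from one image."""
    img = cv2.imread(image_path)                      # BGR image
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)        # dlib expects RGB
    faces = []
    for det in detector(rgb, 1):                      # 1 = upsample once for small faces
        r = det.rect
        crop = img[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        faces.append(cv2.resize(gray, (size, size)))  # match FER-2013 format
    return faces
```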

3.3 Facial Expression Recognition Model

In this section, using the FER-2013 dataset, we train a VGGnet model [30] to recognize facial expressions. The model's architecture is similar to recent work [27] that is state-of-the-art in this domain, and our replication gives similar performance: its classification accuracy, the standard performance metric for FER-2013, on the private test set exceeds the reported human accuracy on that set [11]. The output layer of the model, generated using a softmax function, includes seven neurons, corresponding to the categorical probability distribution over the emotion classes in FER-2013 (happiness, sadness, fear, surprise, anger, disgust, and neutral); we refer to this output for a single face by the vector $a$.
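For illustration, a tf.keras sketch of such a classifier is given below. It shows a VGG-style convolutional stack over 48x48 grayscale faces with a seven-way softmax output; the layer widths and dropout rates here are illustrative assumptions, not the exact configuration of [27] or of our trained model.

```python
# A VGG-style FER classifier sketch: 48x48x1 input, softmax over 7 emotions.
import tensorflow as tf
from tensorflow.keras import layers

def build_fer_model(num_classes=7):
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(input_shape=(48, 48, 1)))
    for filters in (64, 128, 256):       # three VGG-style conv blocks (illustrative)
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.Dropout(0.5))
    # Softmax over the seven FER-2013 emotion classes; this output vector is
    # the per-face emotion probability vector used in Sec. 3.3.
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```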

We use the network to extract the probabilities of each emotion from all faces, as detected in the pre-processing step of Sec. 3.2, in each FlickrFace11K sample.

For each image, we construct a vector of facial emotion features used in the Face-Cap models as in Eq. 1:

$$ s \;=\; \text{one-hot}\Big( \operatorname*{arg\,max}_{e \in \{1,\dots,7\}} \; \sum_{k=1}^{F} a_{k,e} \Big) \qquad (1) $$

where $a_k$ is the emotion probability vector of the $k$-th detected face, $a_{k,e}$ is its probability for emotion class $e$, and $F$ is the number of faces in the sample. That is, $s$ is a one-hot encoding of the aggregate facial emotion of the image: the emotion class with the largest total probability over all detected faces.
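The sketch below makes the aggregation of Eq. 1 explicit, assuming `fer_model` is a trained classifier such as the one sketched above and `faces` is the list of pre-processed 48x48 faces for one image; the input scaling is an illustrative assumption.

```python
# Aggregate per-face emotion probabilities into the one-hot image-level vector s (Eq. 1).
import numpy as np

EMOTIONS = ["happiness", "sadness", "fear", "surprise", "anger", "disgust", "neutral"]

def facial_emotion_features(fer_model, faces):
    """Return the one-hot vector s over the seven emotion classes."""
    x = np.stack(faces).astype("float32")[..., np.newaxis] / 255.0  # (F, 48, 48, 1)
    probs = fer_model.predict(x)              # (F, 7): a_k for each detected face k
    summed = probs.sum(axis=0)                # total probability per emotion class
    s = np.zeros(len(EMOTIONS), dtype="float32")
    s[summed.argmax()] = 1.0                  # one-hot of the dominant emotion
    return s
```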

3.4 Training Face-Cap

3.4.1 Face-Cap-F

In order to train the Face-Cap models, we apply a long short-term memory (LSTM) network as our caption generator, adapted from Xu et al. [37]. The LSTM is informed about the emotional content of the image through the facial features $s$, defined in Eq. 1. It also takes the image features, which are extracted by the Oxford VGGnet [30] trained on the ImageNet dataset and weighted using the attention mechanism of [37]. In this mechanism, the attention-based features, encoding the factual content of the image, are selected for each generated word in the LSTM. Using Eq. 2, at each time step $t$, the LSTM uses the previously embedded word ($x_{t-1}$), the previous hidden state ($h_{t-1}$), the attended image features ($z_t$), and the facial features ($s$) to generate the input gate ($i_t$), forget gate ($f_t$), output gate ($o_t$), input modulation gate ($g_t$), memory cell ($c_t$), and hidden state ($h_t$).

$$
\begin{aligned}
i_t &= \sigma(W_i x_{t-1} + U_i h_{t-1} + Z_i z_t + A_i s + b_i)\\
f_t &= \sigma(W_f x_{t-1} + U_f h_{t-1} + Z_f z_t + A_f s + b_f)\\
o_t &= \sigma(W_o x_{t-1} + U_o h_{t-1} + Z_o z_t + A_o s + b_o)\\
g_t &= \tanh(W_g x_{t-1} + U_g h_{t-1} + Z_g z_t + A_g s + b_g)\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\qquad (2)
$$

where $W$, $U$, $Z$, $A$, and $b$ are learned weights and biases, $\sigma$ is the logistic sigmoid activation function, and $\odot$ denotes element-wise multiplication. According to Eq. 2, the facial features of each image are fixed across all time steps, and the LSTM automatically learns to condition the next generated word on these features at the appropriate time. To initialize the LSTM's memory state ($c_0$) and hidden state ($h_0$), we feed the facial features through two typical multilayer perceptrons, shown in Eq. 3.

$$ c_0 = \mathrm{MLP}_{c}(s), \qquad h_0 = \mathrm{MLP}_{h}(s) \qquad (3) $$

We use the current hidden state ($h_t$) to calculate the negative log-likelihood of $s$ at each time step (Eq. 4), which we name the face loss function. Using this method, $h_t$ learns to record a combination of the word, image, and facial features at each time step.

$$ \ell^{\mathrm{face}}_t \;=\; -\sum_{e=1}^{7} s_e \,\log p_{t,e} \qquad (4) $$

where a multilayer perceptron applied to $h_t$ generates $p_t$, the categorical probability distribution over the emotion classes at the current step. In this we adapt You et al. [38], who use this loss function for injecting ternary sentiment (positive, neutral, negative) into captions. The loss is estimated and averaged over all time steps during the training phase.
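As an illustration of Eqs. 2 and 4 together, here is a small NumPy sketch of one Face-Cap-F time step; the parameter names in `params` and the single-layer MLP used to produce p_t are illustrative assumptions, not the exact implementation.

```python
# One Face-Cap-F time step: LSTM update conditioned on word, image, and facial
# features (Eq. 2), plus the per-step face loss (Eq. 4).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def face_cap_f_step(params, x_prev, h_prev, c_prev, z_t, s):
    """x_prev: embedded word, z_t: attended image features, s: one-hot facial features."""
    gates = {}
    for g in ("i", "f", "o", "g"):
        pre = (params["W_" + g] @ x_prev + params["U_" + g] @ h_prev +
               params["Z_" + g] @ z_t + params["A_" + g] @ s + params["b_" + g])
        gates[g] = np.tanh(pre) if g == "g" else sigmoid(pre)   # g_t uses tanh
    c_t = gates["f"] * c_prev + gates["i"] * gates["g"]
    h_t = gates["o"] * np.tanh(c_t)
    # Face loss (Eq. 4): an MLP over h_t (here a single linear layer as an
    # assumption) predicts p_t, and the NLL of the one-hot target s is taken.
    p_t = softmax(params["W_face"] @ h_t + params["b_face"])
    face_loss_t = -np.log(p_t[s.argmax()] + 1e-8)
    return h_t, c_t, face_loss_t
```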

3.4.2 Face-Cap-L

The Face-Cap-F model described above feeds in the facial features at the initialization step (Eq. 3) and at each time step (Eq. 2), as shown in Fig. 2 (top). Because the features are fed at every time step, the LSTM in Face-Cap-F uses the facial features in generating every word. Since only a few words in the ground-truth captions (e.g., Fig. 1) are related to these features, this mechanism can sometimes lead to less effective results.

Our second variant, Face-Cap-L, is as above except that the facial feature term $s$ is removed from Eq. 2: we do not apply the facial feature information at each time step (Fig. 2 (bottom)). Using this mechanism, the LSTM can effectively exploit the facial features when generating image captions and ignore them when they are irrelevant. To handle the same issue, You et al. [38] implemented a sentiment cell, which works similarly to the LSTM's memory cell and is initialized with the ternary sentiment, while the image features were fed in to initialize the memory cell and hidden state of the LSTM. In comparison, Face-Cap-L uses the facial features to initialize the memory cell and hidden state rather than a separate sentiment cell, which requires more time and memory to compute. Using the attention mechanism, our model applies the image features in generating every caption word.
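A corresponding sketch of the Face-Cap-L initialization (Eq. 3) is shown below: the facial features determine $c_0$ and $h_0$ through two separate multilayer perceptrons, and the per-step $A\,s$ terms of Eq. 2 are simply dropped. The single hidden layer and tanh activations are illustrative assumptions.

```python
# Face-Cap-L initialization (Eq. 3): two small MLPs map the facial feature
# vector s to the LSTM's initial memory and hidden states.
import numpy as np

def init_lstm_from_faces(params, s):
    """Return (c_0, h_0) computed from the one-hot facial features s."""
    c0 = np.tanh(params["W_c1"] @ np.tanh(params["W_c0"] @ s + params["b_c0"])
                 + params["b_c1"])
    h0 = np.tanh(params["W_h1"] @ np.tanh(params["W_h0"] @ s + params["b_h0"])
                 + params["b_h1"])
    return c0, h0
```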

Figure 2: The frameworks of Face-Cap-F (top) and Face-Cap-L (bottom). The face pre-processing and the feature extraction from the faces and the image are illustrated. The Face-Cap models are trained using the caption data plus the corresponding image features, selected using the attention mechanism, and the facial features.

4 Experiments

4.1 Evaluation Metrics and Testing

To evaluate Face-Cap-F and Face-Cap-L, we use standard evaluation metrics including BLEU [26], ROUGE-L [21], METEOR [5], CIDEr [34], and SPICE [2]. For all five metrics, larger values indicate better results.

We train and evaluate all models on the same splits of FlickrFace11K.

4.2 Models for Comparison

The model of Xu et al. [37], which selectively attends to a visual region at each time step, is the starting point of Face-Cap-F and Face-Cap-L. We train Xu's model on the FlickrFace11K dataset as our baseline.

We also consider two additional models to investigate the impact of the face loss function when using the facial features in different schemes. We train the Face-Cap-F model, which uses the facial features at every time step, without calculating the face loss function (Eq. 4); we refer to this as the Face-Step model. The Face-Cap-L model, which applies the facial features only at the initial time step, is modified in the same way; we refer to this as the Face-Init model.

4.3 Implementation Details

In our implementation, the memory cell and the hidden state of the LSTM each have 512 dimensions. We use TensorFlow to implement the models [1]. We set the size of the word embedding layer to 300, initialized using a uniform distribution. The mini-batch size is 100 and the epoch limit is 20. We train the models using the Adam optimization algorithm [19]. The learning rate is initialized to 0.001, with a minimum of 0.0001. If there is no improvement in METEOR for two successive epochs, the learning rate is divided by two and the prior network that has the best METEOR is reloaded. This approach leads to effective results in this work; tuning the learning rate decay for Adam in this way is supported by Wilson et al. [36]. METEOR on the validation set is used for model selection. We apply METEOR for the learning rate decay and the model selection because it shows reasonable correlation with human judgments but is quicker to calculate than SPICE (as it does not require dependency parsing) [2].
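The training-loop control described above can be sketched as follows; `train_one_epoch`, `validation_meteor`, `save_checkpoint`, and `load_checkpoint` are hypothetical helpers, and only the METEOR-based learning-rate schedule and model selection logic are the point here.

```python
# Halve the learning rate and reload the best checkpoint when validation
# METEOR stalls for two successive epochs; select the best-METEOR model.
def train(model, lr=1e-3, min_lr=1e-4, epochs=20, patience=2):
    best_meteor, stale, best_ckpt = -1.0, 0, None
    for epoch in range(epochs):
        train_one_epoch(model, lr)             # Adam updates with the current lr (hypothetical helper)
        meteor = validation_meteor(model)      # METEOR on the validation set (hypothetical helper)
        if meteor > best_meteor:
            best_meteor, stale = meteor, 0
            best_ckpt = save_checkpoint(model) # kept for final model selection
        else:
            stale += 1
        if stale >= patience:                  # no METEOR gain for 2 epochs:
            lr = max(lr / 2.0, min_lr)         # halve the learning rate and
            load_checkpoint(model, best_ckpt)  # reload the best network so far
            stale = 0
    return best_ckpt
```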

Exactly the same visual feature size and vocabulary are used for all five models. As the encoder of images, in this work as for Xu et al., we use the Oxford VGGnet [30] trained on ImageNet, and take the output of its fourth convolutional layer (after ReLU) as the visual features for the attention mechanism. For all five models, the negative log-likelihood of the generated word is calculated as the general loss function at each time step.

Model B-1 B-2 B-3 B-4 METEOR ROUGE-L CIDEr SPICE
Xu’s model 55.95 35.43 23.06 15.69 16.96 43.71 21.94 9.30
Face-Step 58.43 37.56 24.78 16.96 17.45 45.04 22.83 9.90
Face-Init 56.63 36.49 24.30 16.86 17.17 44.84 23.13 9.80
Face-Cap-F 57.13 36.51 24.07 16.52 17.19 44.76 23.04 9.70
Face-Cap-L 58.90 37.89 25.07 17.19 17.44 45.47 24.72 10.00

Table 1: Comparison of image caption results (%) on the test split of the FlickrFace11K dataset. B-1, …, SPICE are standard evaluation metrics, where B-N is the BLEU-N metric.

4.4 Results

4.4.1 Overall Metrics

The experimental results are summarized in Table 1. All Face models outperform Xu's model on all standard evaluation metrics, showing that the facial features are effective in image captioning tasks. As predicted, Face-Cap-L performs best among all models on every metric except METEOR, where it is only very marginally (0.01) lower. Under most metrics, Face-Step performs second best, with the notable exception of CIDEr, suggesting that its strength on other metrics might come from the use of popular words (which are discounted under CIDEr). Comparing the mechanics of the top two approaches: Face-Cap-L uses the face loss function to retain the facial features and apply them at the appropriate time, whereas Face-Step does not apply the face loss function; Face-Cap-L only applies the facial features at the initial time step, while Face-Step uses the features at each time step when generating an image caption. In this way, Face-Step can retain the features without applying the face loss function. This yields comparable results between Face-Cap-L and Face-Step; however, the results show that applying the face loss function is more effective than feeding the facial features at each time step. This relationship can also be seen in the results of Face-Init, which is Face-Cap-L without the face loss function. The results of Face-Cap-F show that the combination of applying the face loss function and feeding the facial features at each time step is problematic.

Model Entropy Top 4
Xu’s model 2.7864 77.05%
Face-Step 2.9059 74.80%
Face-Init 2.6792 78.78%
Face-Cap-F 2.7592 77.68%
Face-Cap-L 2.9306 73.65%
Table 2: Comparisons of distributions of verbs in generated captions: entropies, and probability mass of the top 4 frequent verbs (is, sitting, are, standing)
Model Smiling Looking Singing Reading Eating Laughing
Xu’s model 19 n/a 15 n/a 24 n/a
Face-Step 11 18 10 n/a 15 n/a
Face-Init 10 21 12 n/a 14 n/a
Face-Cap-F 12 20 9 n/a 14 n/a
Face-Cap-L 9 18 15 22 13 27
Table 3: The ranks of sample generated verbs under each model.

4.4.2 Caption Analysis

To analyze what it is about the captions themselves that differs under the various models, with respect to our aim of injecting information about emotional states of the faces in images, we first extracted all generated adjectives, which are tagged using the Stanford part-of-speech tagger software [33]. Perhaps surprisingly, emotions do not manifest themselves in the adjectives in Face-Cap models: the adjectives used by all systems are essentially the same. This may be because adjectives with weak sentiment values (e.g. long, small) predominate in the training captions, relative to the adjectives with strong sentiment values (e.g. happy, surprised).

We therefore also investigated the difference in the distributions of the verbs generated under each model. Entropy (in the information-theoretic sense) indicates which distributions are closer to deterministic and which are more spread out (a higher score indicating a more spread-out distribution); it is calculated using Eq. 5:

$$ H \;=\; -\sum_{i=1}^{N} p(v_i)\,\log_2 p(v_i) \qquad (5) $$

where $H$ is the entropy score, $N$ is the number of unique verbs generated under each model, and $p(v_i)$ is the probability of each unique verb $v_i$, estimated as the maximum likelihood estimate from the sample. From Table 2, Face-Cap-L has the highest entropy, indicating the greatest variability of expression. Relatedly, we look at the four most frequent verbs, which are the same for all models (is, sitting, are, standing); these verbs carry relatively little semantic content and for the most part act as syntactic props for the content words of the sentence. Table 2 also shows that Face-Cap-L has the lowest proportion of the probability mass taken up by these four verbs, leaving more for other verbs.
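The analysis reported in Table 2 can be reproduced with a short sketch like the one below, assuming `verbs` is the list of verb tokens tagged in one model's generated captions (e.g., via a POS tagger); the token-level maximum likelihood estimate is the assumption stated above.

```python
# Verb-distribution statistics: entropy (Eq. 5) and the probability mass of
# the top-k most frequent verbs, for one model's generated captions.
import math
from collections import Counter

def verb_statistics(verbs, top_k=4):
    counts = Counter(verbs)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]             # MLE p(v_i)
    entropy = -sum(p * math.log2(p) for p in probs)          # Eq. 5
    top_mass = sum(c for _, c in counts.most_common(top_k)) / total
    return entropy, top_mass

# Example usage:
# entropy, top4 = verb_statistics(["is", "sitting", "standing", "smiling", "is", "are"])
```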

The ranks of the generated verbs under the models, calculated from their frequencies, are also interesting. Table 3 includes some example verbs; of these, smiling, singing, and eating are ranked higher under the Face-Cap models, and reading and laughing appear only under the Face-Cap-L model. Looking is also generated only by the models that include the facial features. These kinds of verbs are relevant to the facial features and show the effectiveness of applying the features in generating image captions.

Figure 3: Examples of captions from the different image captioning models, including X (Xu's model), S (Face-Step), I (Face-Init), F (Face-Cap-F), and L (Face-Cap-L).
Figure 4: Examples where the models make errors of various kinds.

4.4.3 Samples

Fig. 3 shows generated captions for six sample images under all models in this work. In example 1, the models that include facial features properly describe the emotional content of the image using smiling. The Face-Cap-L model also generates laughing according to the emotional content of example 4. In example 3, the Face-Init model and one of the Face-Cap variants generate playing, which is connected to the emotional content of the example: it is perhaps because the child in the example is happy that the models generate playing, which has a positive sentiment connotation. In example 5, a Face-Cap variant also uses playing in a similar way. Example 2 shows that the Face-Cap models apply singing at the appropriate time. Similarly, looking is used by a Face-Cap variant in example 6. Singing and looking are generated because of the facial features of the people in the examples, which relate to emotional states such as surprised and neutral. Fig. 3 shows that our models can effectively apply the facial features to describe images in different ways. In Fig. 4, we show three examples in which our models apply the facial features inappropriately. Smiling is used to describe the emotional content of example 1, although the girl in the example is not happy. The captions for examples 2 and 3 wrongly contain holding a microphone and eating, which are inferred from the facial features due to visual similarity.

5 Conclusion and Future Work

In this paper, we have proposed two variants of an image captioning model, Face-Cap, which employ facial features to describe images. To this end, a facial expression recognition model is applied to extract the features from images containing human faces. Using these features, our models are informed about the emotional content of the images and automatically condition the generation of image captions on it. We have shown the effectiveness of the models, using standard evaluation metrics, compared to a state-of-the-art baseline model. The generated captions demonstrate that the Face-Cap models succeed in incorporating the facial features at the appropriate time. Linguistic analyses of the captions suggest that the improved effectiveness in describing image content comes through greater variability of expression.

Future work can involve designing new facial expression recognition models which cover a richer set of emotions, including confusion and curiosity, and effectively applying their corresponding facial features to generate image captions. In addition, we would like to explore alternative architectures for injecting facial emotions, such as the soft injection approach of [37].

References

  • [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
  • [2] Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: Semantic propositional image caption evaluation. In: ECCV. pp. 382–398. Springer (2016)
  • [3] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and vqa. arXiv preprint arXiv:1707.07998 (2017)
  • [4] Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
  • [5] Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: WMT. pp. 376–380 (2014)
  • [6] Ekman, P.: Basic emotions. In: Dalgleish, T., Power, T. (eds.) The Handbook of Cognition and Emotion, pp. 45–60. John Wiley & Sons, Sussex, UK (1999)
  • [7] Ekman, P.: Darwin and facial expression: A century of research in review. Ishk (2006)
  • [8] Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J., et al.: From captions to visual concepts and back. In: CVPR. IEEE (2015)
  • [9] Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern recognition 36(1), 259–275 (2003)
  • [10] Gan, C., Gan, Z., He, X., Gao, J., Deng, L.: Stylenet: Generating attractive visual captions with styles. In: CVPR. IEEE (2017)
  • [11] Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation learning: A report on three machine learning contests. In: ICONIP. pp. 117–124. Springer (2013)
  • [12] Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: Fully convolutional localization networks for dense captioning. In: CVPR. pp. 4565–4574. IEEE (2016)
  • [13] Kahou, S.E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K., Jean, S., Froumenty, P., Dauphin, Y., Boulanger-Lewandowski, N., et al.: Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces 10(2), 99–111 (2016)
  • [14] Kahou, S.E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vincent, P., Courville, A., Bengio, Y., Ferrari, R.C., et al.: Combining modality specific deep neural networks for emotion recognition in video. In: ICMI. pp. 543–550. ACM (2013)
  • [15] Karpathy, A.: Connecting Images and Natural Language. Ph.D. thesis, Stanford University (2016)
  • [16] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR. pp. 3128–3137. IEEE (2015)
  • [17] Kim, B.K., Dong, S.Y., Roh, J., Kim, G., Lee, S.Y.: Fusing aligned and non-aligned face information for automatic affect recognition in the wild: A deep learning approach. In: CVPR Workshops. pp. 48–57. IEEE (2016)
  • [18] King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10(Jul), 1755–1758 (2009)
  • [19] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [20] Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
  • [21] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)
  • [22] Lisetti, C.: Affective computing (1998)
  • [23] Mathews, A.P., Xie, L., He, X.: Senticap: Generating image descriptions with sentiments. In: AAAI. pp. 3574–3580 (2016)
  • [24] Pang, B., Lee, L.: A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: ACL. pp. 271–278. Barcelona, Spain (July 2004). https://doi.org/10.3115/1218955.1218990, http://www.aclweb.org/anthology/P04-1035
  • [25] Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1-2), 1–135 (Jan 2008). https://doi.org/10.1561/1500000011, http://dx.doi.org/10.1561/1500000011
  • [26] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: ACL. pp. 311–318. Association for Computational Linguistics (2002)
  • [27] Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional neural networks: State of the art. arXiv preprint arXiv:1612.02903 (2016)
  • [28] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS. pp. 91–99 (2015)
  • [29] Sariyanidi, E., Gunes, H., Cavallaro, A.: Automatic analysis of facial affect: A survey of registration, representation, and recognition. IEEE transactions on pattern analysis and machine intelligence 37(6), 1113–1133 (2015)
  • [30] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [31] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS. pp. 3104–3112 (2014)
  • [32] Tang, Y.: Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013)
  • [33] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: NAACL HLT. pp. 173–180. Association for Computational Linguistics (2003)
  • [34] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: CVPR. pp. 4566–4575. IEEE (2015)
  • [35] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR. pp. 3156–3164. IEEE (2015)
  • [36] Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: NIPS. pp. 4151–4161 (2017)
  • [37] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
  • [38] You, Q., Jin, H., Luo, J.: Image captioning at will: A versatile scheme for effectively injecting sentiments into image descriptions. arXiv preprint arXiv:1801.10121 (2018)
  • [39] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014)
  • [40] Yu, Z., Zhang, C.: Image based static facial expression recognition with multiple deep network learning. In: ICMI. pp. 435–442. ACM (2015)