Building Emotional Machines: Recognizing Image Emotions through Deep Neural Networks

Hye-Rin Kim, Yeong-Seok Kim, Seon Joo Kim, In-Kwon Lee

An image is a very effective tool for conveying emotions. Many researchers have investigated computing image emotions by using various features extracted from images. In this paper, we focus on two high-level features, the object and the background, and assume that the semantic information of images is a good cue for predicting emotion. An object is one of the most important elements that define an image, and we find through experiments that there is a high correlation between the object and the emotion in images. Even with the same object, there may be a slight difference in emotion due to different backgrounds, and we use the semantic information of the background to improve the prediction performance. By combining the different levels of features, we build an emotion-based feed-forward deep neural network which produces the emotion values of a given image. The output emotion values in our framework are continuous values in the 2-dimensional space (Valence and Arousal), which are more effective than a small number of emotion categories in describing emotions. Experiments confirm the effectiveness of our network in predicting the emotion of images.

I Introduction

Images are very powerful tools for conveying moods and emotions, as shown in Figure 1. Through images, people can express their feelings and communicate with other people. With recent developments in deep learning technology, computers have become better at recognizing objects, faces, and actions. Computers have also started to write image captions and answer questions about images. But how about emotions? Can we teach computers to have feelings similar to those humans have when looking at images? Predicting the emotion evoked by an image is a difficult task and is still in its early stages.

Fig. 1: Images with different emotions.
Fig. 2: Two dimensional emotion models (Valence and Arousal).

As deep learning shows remarkable performance in various computer vision tasks such as image classification [1, 2, 3, 4], segmentation [5, 6] and image processing [7, 8], several recent studies have applied deep learning to emotion prediction [9, 10, 11, 12, 13]. Those works mostly use the convolutional neural network (CNN) [14], which has shown better emotion classification results than models that use a shallow network such as a linear model.

The CNN has had a big impact, especially in image classification, and it is an effective network model for learning filters that capture the shapes that repeatedly appear in images. However, we argue that the learning process for image emotion prediction should be different from that of image classification. This is because some images with different appearances can have the same emotions, and some images with similar appearances can have different emotions. For example, an image that includes a person riding a bike and an image that includes a person surfing in the ocean may give the same feelings, even though they look different. From this point of view, the performance may be limited if a CNN is applied directly to an emotion prediction system.

In addition, one of the main issues in emotion recognition is the affective gap. The affective gap is the lack of coincidence between the measurable signal properties, commonly referred to as features, and the expected affective state in which the user is brought by perceiving the signal [15]. To narrow this affective gap, several works proposed emotion classification systems that use high-level features based on psychology and art theory, such as harmony, movement, and the rule of thirds [16, 17, 18]. While those features help to improve emotion recognition, a better set of features is still necessary. Distinguishing the effective features among various features is also important. As another example, images with objects such as guns or sharks arouse fear, while images with babies or flowers lead to more happiness. It can be speculated that certain objects affect the determination of emotions. Based on this observation, we assume that the main object appearing in the image plays an important role in determining the emotion. The idea is that object categories can be good cues for emotions. Through experiments, we show that objects appearing in images are related to emotion values, and objects are used as one of the features of our model. Besides, even if images include the same object, the emotion can vary depending on the background. We therefore also use the semantic information of the background as a feature to improve prediction performance.

Predicting emotions from images is a complex task that is quite different from object detection or image classification. In this paper, we combine high-level features, namely the object and background information extracted from deep network models pre-trained for image classification and segmentation, with low-level features such as color statistics to obtain a set of features for emotion prediction. Using this feature set, we design a feed-forward deep network (FFNN) [19] which produces the emotion value for a given image.

In previous works on emotion prediction, emotions are categorized into a small number of classes, such as happy, awe, sad, and fear. In comparison, we opt for the dimensional model for expressing emotions [20], which is widely used in the field of psychology. Specifically, the dimensional model consists of two parameters, Valence and Arousal (Figure 2). Valence represents pleasure on a scale from 1 (negative) to 9 (positive). Arousal is the level of excitement, which also ranges from 1 (calm) to 9 (excited). Using this model, any emotion can be represented by these two values in 2D space. As this dimensional model can express more emotions than a discrete set of emotion categories, we train our emotion prediction system based on the V-A model.

The contributions of our work are summarized as follows:

  • We propose an image emotion recognition system which outputs valence-arousal values for expressing emotion. As far as we know, this is the first system to build a deep learning model for the dimensional emotion model (V-A model).

  • We build a new image emotion dataset through crowdsourcing. The feed-forward neural network is used to learn emotion features from this database.

  • We propose a novel idea of using a pre-trained CNN to relate the main object and background of the image to the emotional feature.

  • We show the effectiveness of an object for estimating the emotion of an image using correlations between emotional values of words and images.

II Related Work

The emotion of an image can be evoked by various factors. To identify significant features for the emotion prediction problem, many researchers have considered various types of features, from color statistics to art and psychological features. Machajdik et al. [17] introduced an affective image classification system using psychology and art theory based features such as Itten’s color contrast and the rule of thirds. Zhao et al. [18] proposed to extract principles-of-art features for emotion classification. Similarly, Lu et al. [16] computed shape-based features in natural images. As a high-level concept for representing the sentiment of an image, adjective-noun pairs were introduced by Borth et al. in [21].

Despite the rise of deep learning studies, relatively few studies have attempted to address the emotion prediction of images using deep networks. With the dataset from Borth et al. [21], Chen et al. [22] classified the adjective-noun pairs using a CNN and achieved more accurate classification than Borth et al. [21].

Several studies utilized a model pre-trained for image classification and transferred the learned parameters. By changing the number of outputs to match the number of labels in their dataset, the classifier can be trained. For example, for a binary classification with a positive and a negative label, the number of outputs would be two. Some researchers [10, 12, 13, 11, 9] used AlexNet’s structure [1] with pre-trained weights to train their sentiment prediction frameworks with different numbers of outputs. Binary classification, which gives either a positive or a negative label, was considered in [13, 23, 24, 11]. Peng et al. [10] and You et al. [12] trained classifiers for seven and eight classes, respectively. In most studies, an emotion classification model was created through transfer learning using a model pre-trained for image classification. In general, fine-tuned models achieve better emotion classification performance than previous studies with shallow networks. In this paper, we use a feed-forward neural network with low-level and high-level features rather than a transfer learning method. Our method is compared with the transfer learning method using the same training and test datasets.

In contrast to previous CNN-based emotion recognition systems, which are limited to outputting a small number of emotional states, we propose to use the valence-arousal model to represent emotions. By using these two parameters, which lie in a continuous space, we can represent a wider range of emotions than the previous works. To enable the learning of the V-A values using the CNN, we build a large set of data with various emotions and obtain the V-A labels through a crowdsourcing survey.

Fig. 3: Image representation of the Self Assessment Manikin (SAM) [25] used for V-A value assignment in the user study.

III Image Emotion Dataset

In order to build the dataset with emotion values (valence and arousal), we collected a large set of images from two resources. We first searched for the emotional images using the 22 keywords from [26] in Flickr [27]. Those words include what we call basic emotions (happy, angry, afraid, sad), as well as less prototypical emotions (gloomy, bored) and affective states (sleepy, serene).

Using the keywords, we first collected over twenty thousand images. Since the goal is to collect emotional images, we manually eliminated non-emotional images from the collection. Each candidate image in the dataset was assessed by three human subjects, and only the images that were determined to be emotional by more than two subjects were included in the database. The words used for the search are listed along with the number of images in the dataset: afraid (113), alarmed (14), angry (217), annoyed (179), distressed (150), frustrated (77), tense (8), aroused (4), delighted (28), excited (734), glad (315), happy (102), astonished (13), at ease (16), content (656), satisfied (33), serene (1917), pleased (42), depressed (179), bored (312), tired (942), gloomy (793). As a result, the total number of images is 6844.

Second, we collected more images from [12], which includes eight emotion categories (Amusement, Awe, Contentment, Excitement, Anger, Disgust, Fear, Sad). We took 3,236 images out of 23,308 images. Images were selected evenly from the eight classes and included in our database. As a result, our dataset contains a total of 10766 images.

Next, we use Amazon Mechanical Turk (AMT) to assign V-A emotion values to all images in our database. Given an image, a worker rates its emotion values using a pictorial representation of the V-A scale, the Self-Assessment Manikin (SAM) [25] (Figure 3). In each question, two images are shown to the worker. One is the image whose emotion value is to be measured, and the other is the image from the previous question together with the emotion value recorded by the worker. By seeing the previous image, the worker can rate the given image relative to the value selected in the previous question. This also alleviates the difficulty of having to choose absolute numbers. Images are presented to each worker in random order, and each image is evaluated only once by the same worker. To ensure the quality of the answers, we only allow workers with a 95% approval rate to participate in our Human Intelligence Task (HIT). We also limit the number of questions for each worker to 200 to keep them focused throughout the test.

Fig. 4: The histogram of valence and arousal value from the collected images.
Fig. 5: Example images in our database. From left to right, the valence value of the image increases. The images on the left/right have negative/positive emotion. From bottom to top, the arousal value of the image increases. The images on the bottom/top have calm/exciting emotion.

A total of 1,339 workers were recruited to assign the emotion values of the 10,766 images in the dataset. The evaluation time per image varied from 10 seconds to a minute depending on the worker. Each image is evaluated by at least five workers, and the average of the acquired values is assigned as the emotion value of the image. Figure 4 shows the overall distribution of the V-A values of the images. In the case of Valence, there are relatively more positive images than negative images. People usually share positive images rather than negative ones, and this tendency is reflected in the distribution. In the case of Arousal, a wide range of emotion values was obtained, except for the extreme values. Figure 5 shows some of the images in our database.
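The label assignment described above can be sketched as follows. This is a minimal illustration, assuming each worker rating is a (valence, arousal) pair on the 1 to 9 SAM scale; the function name `aggregate_ratings` and the sample ratings are hypothetical and not taken from our data.

```python
from statistics import mean

MIN_RATINGS = 5  # the text states each image is rated by at least five workers


def aggregate_ratings(ratings):
    """Average per-worker (valence, arousal) pairs into a single image label."""
    if len(ratings) < MIN_RATINGS:
        raise ValueError("expected at least %d ratings" % MIN_RATINGS)
    for valence, arousal in ratings:
        if not (1 <= valence <= 9 and 1 <= arousal <= 9):
            raise ValueError("ratings must lie on the 1-9 SAM scale")
    return (mean(v for v, _ in ratings), mean(a for _, a in ratings))


# Hypothetical ratings from five workers for one image.
example_ratings = [(7, 4), (8, 5), (6, 4), (7, 3), (7, 4)]
example_label = aggregate_ratings(example_ratings)
```

A plain mean is used here because the text only states that the average of the acquired values is assigned; any outlier filtering would be an additional design choice.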

We compare the emotion distribution of the dataset we collected with the distribution of the IAPS dataset, an emotion dataset introduced in [28]. The IAPS consists of 1182 images, each of which is annotated with valence, arousal, and dominance values. Although the number of images is different, we can see that our data is spread more widely than the IAPS data in the V-A emotion space (Figure 6).

We provide additional analysis to validate our new dataset. We split the Valence-Arousal space into 4 by 4 sections. Valence is divided into most negative (from 1 to 3), negative (from 3 to 5), positive (from 5 to 7), and most positive (from 7 to 9). Arousal is divided into most calm (from 1 to 3), calm (from 3 to 5), exciting (from 5 to 7), and most exciting (from 7 to 9). We then place the images in the emotion space according to the obtained emotion values and extract the most used words (using the image tags) in each section. The results are shown in Figure 7. In the negative sections of the V-A space, most words have low valence values, such as afraid, angry, annoyed, depressed, and gloomy. On the other hand, words such as content, serene, excited, and glad appear in the positive sections. In the case of arousal, the lowest arousal side (from 1 to 3) consists of words such as serene, tired, and gloomy, and the highest arousal side (from 7 to 9) includes words such as excited, angry, and distressed. This analysis indicates that the words used for image collection and the user study are appropriate for obtaining emotion values that are as diverse and well distributed as possible in the space. The new image emotion dataset can be used for emotion recognition in two ways: a classifier can be trained to output either the continuous V-A values or discrete categories of emotions using the words in Figure 7.
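The sectioning described above can be sketched as follows. This is a minimal illustration: the handling of values that fall exactly on a boundary (3, 5, or 7) is an assumption, since the text does not specify it, and the function names and sample records are hypothetical.

```python
from collections import Counter, defaultdict

EDGES = [3, 5, 7]  # section boundaries on the 1-9 scale
VALENCE_NAMES = ["most negative", "negative", "positive", "most positive"]
AROUSAL_NAMES = ["most calm", "calm", "exciting", "most exciting"]


def section_index(value):
    """Map a value in [1, 9] to a section index in 0..3."""
    if not 1 <= value <= 9:
        raise ValueError("value must lie in [1, 9]")
    # Assumption: a value on a boundary is assigned to the upper section.
    return sum(value >= edge for edge in EDGES)


def va_section(valence, arousal):
    """Return the (valence section, arousal section) names for one image."""
    return (VALENCE_NAMES[section_index(valence)],
            AROUSAL_NAMES[section_index(arousal)])


def top_words(images, k=3):
    """images: iterable of (valence, arousal, tags); most used tags per section."""
    counts = defaultdict(Counter)
    for valence, arousal, tags in images:
        counts[va_section(valence, arousal)].update(tags)
    return {sec: [w for w, _ in c.most_common(k)] for sec, c in counts.items()}


# Hypothetical records: (valence, arousal, image tags).
sample = [
    (2.0, 7.5, ["angry"]),
    (2.5, 8.0, ["angry", "distressed"]),
    (8.0, 2.0, ["serene"]),
]
sample_words = top_words(sample, k=1)
```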

Fig. 6: Emotion distribution of our database and the IAPS [28] in V-A emotion space.
Fig. 7: The most used words in each section.

IV Features

We now extract various types of features for emotion prediction, including color, local, object, and semantic features.

IV-A Color features

Color is the most basic and powerful element for expressing emotions and can be effectively used by artists to induce emotional effects. Many studies have changed the color of an image as a means of changing the emotion of the image [29, 30, 31, 32]. Color alone cannot directly resolve the affective gap because it is a low-level feature, but it is still a crucial factor in emotion recognition. We extract the mean values in the RGB and HSV color spaces as the basic color characteristics. We also calculate the HSV histogram and extract the label number and value of the bin with the largest value in the histogram. As a concept similar to a color histogram, we also calculate how much of each of the 11 basic colors exists in the image [33]. According to [20], saturation and brightness can have a direct influence on pleasure, arousal, and dominance. Using the saturation and brightness values, Valdez and Mehrabian [34] introduced formulas, derived through experiments, to measure the values of pleasure, arousal, and dominance. Each of the three values is expressed as a linear combination of brightness and saturation.
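A minimal sketch of such a computation is given below. It assumes the linear relations commonly quoted from Valdez and Mehrabian [34]; the coefficients should be checked against that reference, and the normalization of mean brightness and saturation to [0, 1] as well as the function name `pad_from_color` are assumptions made for illustration.

```python
def pad_from_color(brightness, saturation):
    """Estimate (pleasure, arousal, dominance) from mean brightness and saturation.

    Coefficients are the linear relations commonly quoted from Valdez and
    Mehrabian [34] (assumed here). Inputs are assumed to be normalized to [0, 1].
    """
    if not (0.0 <= brightness <= 1.0 and 0.0 <= saturation <= 1.0):
        raise ValueError("brightness and saturation must lie in [0, 1]")
    pleasure = 0.69 * brightness + 0.22 * saturation
    arousal = -0.31 * brightness + 0.60 * saturation
    dominance = -0.76 * brightness + 0.32 * saturation
    return pleasure, arousal, dominance


# Example: a bright, weakly saturated image (hypothetical mean values).
pleasure, arousal, dominance = pad_from_color(0.8, 0.2)
```

Under these relations, brighter images score higher on pleasure, while more saturated images score higher on arousal.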


We measure the values for the three elements from the image and use them as features.

Fig. 8: Correlation between image emotion and object emotion. Image emotion values are taken from the IAPS dataset [28] and object emotion values are taken from the word emotion dictionary [35].
Fig. 9: The overall architecture of the proposed emotion prediction model.

IV-B Local features

We exploit two kinds of local features used in [21]. We use a 512-dimensional GIST descriptor, which is effective in detecting scenes, and a 59-dimensional local binary pattern (LBP) descriptor, which is effective in detecting textures.

IV-C Object features

Identifying the emotion of an image from low-level features, such as color statistics and texture-related features, is difficult for a human subject. Many researchers have stated the need for high-level features at the affective level and designed various types of features for emotion prediction. In our study, we assume that the object is one of the most important factors contributing to the emotion of an image. We conducted an experiment to show that the object in an image is relevant to the emotion elicited by the image. Each image in the IAPS dataset, an emotion dataset introduced in [28], includes a tag representing the primary object (e.g., baby, snake, and shark) and the V-A emotion values. In order to replace the object tag with an emotion-level representation, we adopt the word emotion dictionary [35], which provides valence and arousal values for each word. We convert each tag to V-A values by looking up the tag in the word emotion dictionary. By measuring the Pearson correlation coefficient between these values and the image emotion values, we can quantify the relationship between the object and the image emotion. Figure 8 shows the results of the experiment. The correlation between the object emotion and the image emotion is high, which suggests that the object affects the emotion of the image. In particular, the correlation for valence is higher than for arousal, which suggests that valence is more likely to be affected by the object than arousal.
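The correlation experiment can be sketched as follows. This is a minimal illustration with made-up numbers: the dictionary entries and image values below are hypothetical placeholders standing in for the word emotion dictionary [35] and the IAPS records [28], and only the valence dimension is shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical word valence values (placeholder for the dictionary in [35]).
WORD_VALENCE = {"baby": 8.0, "flower": 7.0, "snake": 3.0, "shark": 2.5}
# Hypothetical (object tag, image valence) pairs (placeholder for IAPS [28]).
IMAGES = [("baby", 7.8), ("flower", 6.9), ("snake", 3.4), ("shark", 3.0)]

# Convert each tag to its word valence, then correlate with the image valence.
object_valence = [WORD_VALENCE[tag] for tag, _ in IMAGES]
image_valence = [valence for _, valence in IMAGES]
r_valence = pearson(object_valence, image_valence)
```

The same procedure applies to arousal by swapping in the arousal values of the words and images.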

Based on this observation, we add object-based features to our system for predicting emotions. In recent years, many studies have trained CNN models of various structures on the ImageNet dataset [36] for image classification. We use three of the most popular models (AlexNet, VGG16, and ResNet) to extract object features from our image dataset and examine the effect of the features extracted from each model on the emotion prediction results. Our object feature is the result of the final output layer, and it represents the probabilities of 1000 object categories.

IV-D Semantic features

As a high-level feature similar in concept to the object, we consider additional semantic information that describes the background of the image. What the object in the image is matters, but what the background is made up of is just as important. For example, if the main object of an image is a person, the emotion may differ depending on whether the background is a city with many buildings or nature such as a mountain or the sea. The ratio of sky, sea, or buildings in the background can also affect the emotion. To use the semantic information of the background as a feature, we perform scene parsing on all images. Wu et al. [37] proposed a semantic segmentation method based on a deep network, which classifies each pixel of an image into one of 150 semantic categories. Given a semantic map, we find out which of the 150 semantic categories each pixel in the image belongs to. As a result, a 150-dimensional vector is obtained and used as one of the input features.
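One way to turn a semantic map into a 150-dimensional vector is sketched below. This is an assumption-laden illustration: the text does not state how per-pixel labels are aggregated, so the normalization to per-category area ratios is our assumption, and the function name and the tiny label map are hypothetical.

```python
NUM_CATEGORIES = 150  # number of semantic categories produced by scene parsing


def semantic_feature(label_map, num_categories=NUM_CATEGORIES):
    """Convert a 2D map of per-pixel category ids into a category-ratio vector."""
    counts = [0] * num_categories
    total = 0
    for row in label_map:
        for label in row:
            if not 0 <= label < num_categories:
                raise ValueError("category id out of range: %r" % (label,))
            counts[label] += 1
            total += 1
    if total == 0:
        raise ValueError("label map is empty")
    # Assumption: each entry is the fraction of pixels in that category.
    return [count / total for count in counts]


# Hypothetical 2x3 label map with two categories present (ids 2 and 21).
tiny_map = [[2, 2, 2],
            [2, 21, 21]]
feature = semantic_feature(tiny_map)
```

Using ratios rather than raw counts makes the feature independent of image resolution, which fits the observation that the ratio of sky, sea, or buildings can affect the emotion.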

These object-based and semantic high-level features are combined with low-level features, namely the color and local features, to train our networks for emotion prediction.
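The combination step can be sketched as a simple concatenation. The dimensions below follow the input sizes listed in Table III (438 low-level, 1000 object, 150 semantic, 1588 in total); concatenation order and the function name `build_input` are assumptions made for illustration.

```python
LOW_DIM, OBJECT_DIM, SEMANTIC_DIM = 438, 1000, 150  # input sizes from Table III


def build_input(low, obj, sem):
    """Concatenate low-level, object, and semantic features into one vector."""
    for name, vec, dim in (("low", low, LOW_DIM),
                           ("object", obj, OBJECT_DIM),
                           ("semantic", sem, SEMANTIC_DIM)):
        if len(vec) != dim:
            raise ValueError("%s feature must have %d values" % (name, dim))
    # Assumption: features are concatenated in this order.
    return list(low) + list(obj) + list(sem)


# Placeholder feature vectors filled with zeros.
x = build_input([0.0] * LOW_DIM, [0.0] * OBJECT_DIM, [0.0] * SEMANTIC_DIM)
```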

V Learning emotion model

In this section, we introduce the details of our emotion prediction framework. The overall architecture of our framework is shown in Figure 9. Our model is a fully connected feed-forward neural network. In general, a neural network consists of an input layer, an output layer, and one or more hidden layers. Normally, when there are two or more hidden layers, the network is called a deep network. Each layer is made up of multiple neurons, and the edges that connect neurons between adjacent layers have weights. The weights are learned during a training phase, and the values of the neurons (except those in the input layer) are computed from them.

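Our network, which includes an input layer, three hidden layers, and an output layer, is as follows:

$$F(X, \theta) = f^{4} \circ g^{4} \circ f^{3} \circ g^{3} \circ f^{2} \circ g^{2} \circ f^{1} \circ g^{1}(X),$$

where $X$ is the input feature vector, $\theta$ is a set of weights including the weights $w$ and the biases $b$, and $F$ produces the final output of our neural network (the Valence or Arousal value).

Specifically, given an input vector $a^{l}$ in layer $l$, the preactivation value $p_j^{l+1}$ for neuron $j$ of layer $l+1$ is obtained through the function $g^{l+1}$:

$$p_j^{l+1} = g^{l+1}(a^{l}) = \sum_{i=1}^{n_l} w_{ij}^{l+1} a_i^{l} + b_j^{l+1},$$

where $w_{ij}^{l+1}$ is the connection weight connecting neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$, $b_j^{l+1}$ is the bias of neuron $j$ in layer $l+1$, and $n_l$ is the number of neurons in layer $l$. Then, the output value of layer $l+1$ (also the input vector for layer $l+2$) is obtained through the function $f^{l+1}$, which is a nonlinear activation function in layer $l+1$:

$$a_j^{l+1} = f^{l+1}(p_j^{l+1}).$$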

Note that the rectified linear unit (ReLU) [1], max(0, x), is used as the nonlinear activation function throughout the network.
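The forward computation described above can be sketched in a few lines. This is a minimal pure-Python illustration rather than our TensorFlow implementation: each layer computes a weighted sum plus bias followed by ReLU. Applying ReLU to the output layer as well is an assumption drawn from the statement that ReLU is used throughout the network, and the tiny weights below are hypothetical values chosen so the result can be checked by hand.

```python
def relu(x):
    """Rectified linear unit, max(0, x)."""
    return x if x > 0.0 else 0.0


def dense(a, weights, biases):
    """Pre-activation of one layer: p[j] = sum_i weights[i][j] * a[i] + biases[j]."""
    if len(weights) != len(a):
        raise ValueError("weight rows must match the input size")
    return [sum(w_row[j] * a_i for w_row, a_i in zip(weights, a)) + biases[j]
            for j in range(len(biases))]


def forward(x, layers):
    """Apply each (weights, biases) layer followed by ReLU, in order."""
    a = list(x)
    for weights, biases in layers:
        a = [relu(p) for p in dense(a, weights, biases)]
    return a


# Hypothetical 2-2-1 network; the real model would use the sizes in Table I.
toy_layers = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, 0.0]),  # input (2) -> hidden (2)
    ([[1.0], [1.0]], [0.5]),                  # hidden (2) -> output (1)
]
toy_output = forward([2.0, 1.0], toy_layers)
```

For the toy input, the hidden pre-activations are 2.5 and 0.0, so the single output is 2.5 + 0.0 + 0.5 = 3.0.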

The number of neurons in each layer is given in Table I. We set the loss function of our network as $\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ and $y_i$ are the output value predicted by our model and the ground truth emotion value of the $i$-th image, and $N$ is the number of images. In the training phase, network weights are updated by backpropagating the gradients through all layers. By minimizing the loss function, we optimize the weights of our network. We set the learning rate to 0.0001, and the network is trained using the stochastic gradient descent (SGD) optimization method with a momentum of 0.9. We set the batch size to 1,000 and train our model until the error no longer diminishes. All experiments are implemented using the open source deep learning framework TensorFlow [38].

                 Input   H1     H2     H3     Output
Num of neurons   1588    3000   1000   1000   1
TABLE I: The number of neurons in each layer.
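The training objective and update rule can be sketched as follows. This is a minimal illustration, assuming a mean squared error loss and one common formulation of SGD with momentum; frameworks differ in the exact form of the momentum update, and the one-parameter toy problem and its larger learning rate are hypothetical choices made so that the example converges quickly.

```python
def mse(predictions, targets):
    """Mean squared error between predicted and ground truth values."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def sgd_momentum_step(params, grads, velocity, lr=1e-4, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then param <- param + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem (hypothetical): fit a constant prediction w to two targets.
targets = [5.0, 7.0]
w, v = [0.0], [0.0]
for _ in range(2000):
    # Gradient of the MSE with respect to w.
    grad = [sum(2.0 * (w[0] - t) for t in targets) / len(targets)]
    # A larger learning rate than 0.0001 is used only to keep the toy short.
    w, v = sgd_momentum_step(w, grad, v, lr=0.01)
final_loss = mse([w[0]] * len(targets), targets)
```

In the toy problem the best constant prediction is the mean of the targets, so the loss cannot drop below the variance of the targets; training stops improving at that point, which mirrors the stopping criterion of training until the error no longer diminishes.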

VI Experiment

VI-A Model performance

We first evaluate the performance of our model. The entire dataset is divided into five groups; four groups are used for the training phase, and the remaining group is used for the test phase. Each group is used as the test set once. In each group, the number of training images is 8600, and the number of test images is 2166. All groups are trained using the same network structure.
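The evaluation protocol can be sketched as a standard 5-fold split. This is a minimal illustration: the shuffling, the seed, and the near-equal fold sizes are assumptions, so the resulting fold sizes differ slightly from the 8600/2166 partition reported above.

```python
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold is the test set once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment to folds
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Split the 10766 images of the dataset into five train/test partitions.
splits = list(five_fold_splits(10766))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Reporting the error of each fold separately, together with the average, shows how much the result depends on the particular partition.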

The results are presented in Table II. The label in the first row represents the group number. Column g1 shows the training and test error when using groups 2 to 5 as training data and group 1 as test data. The dimension of the input feature is 1588, as described in Table I, and the output is the valence or arousal value. To compare the performance of the object features obtained from the three CNNs, we build various models by combining object features with other features and compare their performance. Note that ‘A’, ‘V’, and ‘R’ in the third column represent AlexNet, VGG16, and ResNet, respectively. The number in each cell represents the mean square error between the ground truth emotion value and the emotion value predicted by each model. We also extracted category-level features from [39], which is not a CNN-based model, and included them as input features in emotion prediction. Borth et al. [21] also used this feature to predict the sentiment of images. Note that the dimension of this feature is 2000, so the number of input neurons in our model is changed to 2588. The row labeled ‘O’ shows the prediction result for this feature. The results show that the features from VGG16 achieved the best test performance in both the Valence and Arousal models (valence: 1.64, arousal: 1.47).

                        g1     g2     g3     g4     g5     Avg.
Valence  Training  A    1.37   1.37   1.35   1.37   1.38   1.37
                   V    1.31   1.31   1.30   1.30   1.33   1.31
                   R    1.29   1.30   1.28   1.31   1.30   1.30
                   O    1.47   1.47   1.46   1.48   1.48   1.47
         Test      A    1.72   1.68   1.67   1.69   1.61   1.67
                   V    1.68   1.66   1.63   1.63   1.59   1.64
                   R    1.70   1.66   1.65   1.62   1.60   1.65
                   O    1.80   1.81   1.73   1.73   1.69   1.75
Arousal  Training  A    1.22   1.24   1.20   1.21   1.23   1.22
                   V    1.16   1.21   1.17   1.16   1.16   1.17
                   R    1.18   1.22   1.18   1.18   1.18   1.19
                   O    1.26   1.30   1.25   1.27   1.29   1.27
         Test      A    1.50   1.52   1.50   1.48   1.44   1.49
                   V    1.48   1.49   1.46   1.48   1.44   1.47
                   R    1.49   1.49   1.47   1.47   1.45   1.48
                   O    1.56   1.53   1.54   1.50   1.48   1.52
TABLE II: Results of 5-fold validation experiments

Figures 10 to 14 show qualitative results with the emotion values predicted by our model and their accuracy. In Figure 10, the images are placed so that their predicted values match their positions in the V-A space. Figures 11 and 12 show the results of the valence model. Figures 13 and 14 show the results of the arousal model.

VI-B Feature performance

We also investigate the effects of the various features we proposed. First, we combine the color feature and the local features into a low-level feature. To construct a network model for each feature, we use the structure of our model and change the number of neurons. Our proposed model consists of 3 hidden layers with 3000, 1000, and 500 neurons. In the model for each individual feature, the number of nodes in each layer is based on the ratio of the number of nodes in two adjacent layers of our model (Table III (bottom)). Among the three features, the object feature showed the best result in emotion prediction. For valence, the object feature extracted from VGG16 had the best result (MSE: 1.92). For arousal, the object feature extracted from AlexNet showed the best result (MSE: 1.61). Semantic features also showed lower error than low-level features. Our model combining all the features resulted in the best prediction performance, which we attribute to the synergy between features extracted from various sources and the strong representational power of the deep neural network.

          Low     Object                       Semantic   All
                  AlexNet   VGG16    ResNet
Valence   1.98    1.93      1.92     1.98      1.97       1.64
Arousal   1.66    1.61      1.62     1.68      1.64       1.47
input     438     1000      1000     1000      150        1588
h1        900     2000      2000     2000      300        3000
h2        300     700       700      700       100        1000
h3        150     350       350      350       50         500
output    1       1         1        1         1          1
TABLE III: Performance of each feature

VI-C Comparison with CNN

Some studies have trained emotion classification models using pre-trained weights learned for image classification [10, 12, 13, 11, 9]. We compare our emotion prediction model with CNN-based emotion prediction models generated by transfer learning. Two CNN structures are used for comparison: AlexNet and VGG19. We first initialize the weights of AlexNet and VGG19 to the weights learned for image classification. Except for the final output layer, the convolutional layers and the fully connected layers use the existing model structure. Since the output layer of a CNN model trained on ImageNet has 1000 neurons, we change the number of output neurons to 1 for our purpose (Valence and Arousal).

Transfer learning can also be configured in different ways. AlexNet has five convolutional layers and three fully connected layers, and VGG19 has 16 convolutional layers and three fully connected layers. In transfer learning, we can determine which layers to freeze and which layers to train. We experiment with two conditions. In the first, the convolutional layers are frozen and only the fully connected layers are trained (conv-frozen); in the second, all layers are trained together (conv-train). The learning environment for the CNN-based models is the same as that of our FFNN model except for the batch size and learning rate. The learning rates of both conv-frozen network and conv-train network are , and the batch size of both CNN models is 50. We use the same training and test datasets for the CNN models and for our model.

The results are shown in Table IV. The second column shows the training range: conv-frozen means that only the fully connected layers have been trained, and conv-train means that all layers have been trained. From the training range perspective, the error of transfer learning over the entire network is smaller than that of transfer learning over only the fully connected layers. This result implies that the low-level filters, as well as the high-level filters, must be learned in order to achieve a better emotion prediction model. In other words, the convolutional layers and the fully connected layers must be trained together. On the model side, the model trained with the VGG19 structure performs better than the one trained with AlexNet. However, both models produce much larger errors than our proposed emotion-based FFNN. CNN-based models learn well and predict accurately when images of the same class share a similar appearance, as in object detection or image classification. However, as mentioned earlier, images with different shapes can have the same emotions, and images with similar shapes can have different emotions. We also conducted the test using other machine learning methods, linear regression and support vector regression (SVR), with the same dataset. Compared to the CNN-based models, SVR performs better for both valence and arousal and linear regression performs better for valence, but our model still shows the best performance (see Table IV).

Emotion   Train range   AlexNet   VGG19   Linear   SVR    Ours
Valence   conv-frozen   2.76      2.64    2.42     1.75   1.64
          conv-train    2.64      2.60
Arousal   conv-frozen   1.95      1.91    2.65     1.54   1.47
          conv-train    1.89      1.87
TABLE IV: Comparison with other learning methods
Fig. 10: Prediction results. The values below each image show the prediction results of FFNN and ground truth emotion values, respectively. The prediction accuracy results (V/A) are also shown.
Fig. 11: Valence prediction results of the images with low arousal. The values below each image represent the predicted valence value and ground truth value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of valence increases. All images in this example have low arousal values.
Fig. 12: Valence prediction results of the images with high arousal. The values below each image represent the predicted valence value and ground truth value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of valence increases. All images in this example have high arousal values.
Fig. 13: Arousal prediction results of the images with low valence. The values below each image represent the predicted arousal value and ground truth value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of arousal increases. All images in this example have low valence values.
Fig. 14: Arousal prediction results of the images with high valence. The values below each image represent the predicted arousal value and ground truth value (prediction/ground truth). The prediction accuracy result is also shown. From the top left to the bottom right, the value of arousal increases. All images in this example have high valence values.

VII Conclusion

In this paper, we presented a new emotion recognition system with a deep learning framework. To reduce the affective gap, we designed and extracted object and background semantic features as high-level features, and showed that these features are effective for emotion prediction. The high-level and low-level features complement each other well, which leads to better emotion recognition performance. As expected, the accuracy of the object recognition has an impact on the performance of the emotion prediction: object features from an incorrect recognition may lead to incorrect emotion prediction results. There is also a problem when the main object of the image is not included in the existing 1000 classes. However, with the rapid progress of deep learning technology and large datasets, both the recognition accuracy and the number of recognizable classes are expected to increase, which in turn will also help our emotion recognition system.

As interesting future work, one can consider the presence of a person in the image and that person's facial expression. A facial expression is one of the features that can greatly affect the emotion prediction. Even if the overall mood of the image is dark, a smiling face can slightly mitigate the negative emotion (Figure 15). In addition, when the face occupies most of the photograph, the facial expression and the emotion are directly connected. We will consider enhancing the emotion recognition performance by adding a facial expression recognition framework.

Several studies have demonstrated that biometric data has a positive effect on emotion recognition [40, 41]. We can also consider using biometric data or an observer’s facial features as additional features. However, deep networks generally require thousands or tens of thousands of training samples, and collecting such biometric and facial data is not an easy task. Finding a method that improves the performance of the model with only a small amount of biometric data is left as another future work.

We also built a database for emotion estimation with the V-A model and will continue to collect more data. We expect that our dataset will be widely used in the field of affective computing.

Fig. 15: Different emotions with different facial expressions.

References


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
  • [5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • [7] Z. Cheng, Q. Yang, and B. Sheng, “Deep colorization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 415–423.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision.   Springer, 2014, pp. 184–199.
  • [9] V. Campos, A. Salvador, X. Giro-i Nieto, and B. Jou, “Diving deep into sentiment: Understanding fine-tuned cnns for visual sentiment prediction,” in Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia.   ACM, 2015, pp. 57–62.
  • [10] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 860–868.
  • [11] Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [12] ——, “Building a large scale dataset for image emotion recognition: The fine print and the benchmark,” 2016.
  • [13] C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li, “Visual sentiment prediction with deep convolutional neural networks,” arXiv preprint arXiv:1411.5731, 2014.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [15] A. Hanjalic, “Extracting moods from pictures and sounds: Towards truly personalized tv,” IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90–100, 2006.
  • [16] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z. Wang, “On shape and the computability of emotions,” in Proceedings of the 20th ACM international conference on Multimedia.   ACM, 2012, pp. 229–238.
  • [17] J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in Proceedings of the 18th ACM international conference on Multimedia.   ACM, 2010, pp. 83–92.
  • [18] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, “Exploring principles-of-art features for image emotion recognition,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 47–56.
  • [19] S. Haykin, Neural Networks: A Comprehensive Foundation.   Prentice Hall, 1999.
  • [20] C. E. Osgood, “The nature and measurement of meaning.” Psychological bulletin, vol. 49, no. 3, p. 197, 1952.
  • [21] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proceedings of the 21st ACM international conference on Multimedia.   ACM, 2013, pp. 223–232.
  • [22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks,” arXiv preprint arXiv:1410.8586, 2014.
  • [23] V. Campos, B. Jou, and X. Giro-i Nieto, “From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction,” Image and Vision Computing, 2017.
  • [24] J. Islam and Y. Zhang, “Visual sentiment analysis for social images using transfer learning approach,” in Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)(BDCloud-SocialCom-SustainCom), 2016 IEEE International Conferences on.   IEEE, 2016, pp. 124–130.
  • [25] M. M. Bradley and P. J. Lang, “Measuring emotion: the self-assessment manikin and the semantic differential,” Journal of behavior therapy and experimental psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
  • [26] J. A. Russell, “A circumplex model of affect,” 1980.
  • [27] Flickr, “”
  • [28] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, “International affective picture system (iaps): Affective ratings of pictures and instruction manual,” Technical report A-8, 2008.
  • [29] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Computer graphics and applications, no. 5, pp. 34–41, 2001.
  • [30] G. Csurka, S. Skaff, L. Marchesotti, and C. Saunders, “Learning moods and emotions from color combinations,” in Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing.   ACM, 2010, pp. 298–305.
  • [31] L. He, H. Qi, and R. Zaretzki, “Image color transfer to evoke different emotions based on color combinations,” Signal, Image and Video Processing, pp. 1–9, 2014.
  • [32] H.-R. Kim, H. Kang, and I.-K. Lee, “Image recoloring with valence-arousal emotion model,” in Computer Graphics Forum, vol. 35, no. 7.   Wiley Online Library, 2016, pp. 209–216.
  • [33] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, “Learning color names for real-world applications,” IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
  • [34] P. Valdez and A. Mehrabian, “Effects of color on emotions.” Journal of experimental psychology: General, vol. 123, no. 4, p. 394, 1994.
  • [35] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of valence, arousal, and dominance for 13,915 english lemmas,” Behavior research methods, vol. 45, no. 4, pp. 1191–1207, 2013.
  • [36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [37] Z. Wu, C. Shen, and A. v. d. Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” arXiv preprint arXiv:1611.10080, 2016.
  • [38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [39] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang, “Designing category-level attributes for discriminative visual recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 771–778.
  • [40] L. Aftanas, N. Reva, A. Varlamov, S. Pavlov, and V. Makhnev, “Analysis of evoked eeg synchronization and desynchronization in conditions of emotional activation in humans: temporal and topographic characteristics,” Neuroscience and behavioral physiology, vol. 34, no. 8, pp. 859–867, 2004.
  • [41] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun, “Emotion assessment: Arousal evaluation using eeg’s and peripheral physiological signals,” Multimedia content representation, classification and security, pp. 530–537, 2006.