Scene-based Factored Attention for Image Captioning


Chen Shen, Rongrong Ji, Fuhai Chen, Xiaoshuai Sun, Xiangming Li
Media Analytics and Computing Lab, Department of Artificial Intelligence,
School of Informatics, Xiamen University, 361005, China.
Peng Cheng Laboratory, Shenzhen, China.
schenxmu@stu.xmu.edu.cn, rrj@xmu.edu.cn,
{cfh3c.xmu, xiaoshuaisun.hit}@gmail.com, lixiangming@stu.xmu.edu.cn
Corresponding author.
Abstract

Image captioning has attracted ever-increasing research attention in the multimedia community. Most cutting-edge works rely on an encoder-decoder framework with attention mechanisms, which have achieved remarkable progress. However, such a framework does not use scene concepts to guide the attention over visual information, which leads to sentence bias in caption generation and degrades performance accordingly. We argue that scene concepts capture higher-level visual semantics and serve as an important cue for describing images. In this paper, we propose a novel scene-based factored attention module for image captioning. Specifically, the proposed module first embeds the scene concepts into factored weights explicitly and attends to the visual information extracted from the input image. An adaptive LSTM then generates captions conditioned on the specific scene type. Experimental results on the Microsoft COCO benchmark show that the proposed scene-based attention module substantially improves model performance, outperforming state-of-the-art approaches under various evaluation metrics.

1 Introduction

Describing what is in an image, known as image captioning, is a challenging task that attracts increasing attention in multimedia research. In order to translate images into sentences, an encoder-decoder architecture is typically adopted for image captioning [45, 51, 46] and has achieved promising performance. Recent works on image captioning favor attention mechanisms, which let the model dynamically focus on different regional features as needed rather than relying on a static image representation. Since object-centered visual concepts have proven effective in visual recognition [35], some captioning methods [49, 53, 15] also selectively attend to a set of detected object-centered visual concepts. These concepts are then combined into the hidden states of a recurrent neural network (RNN) for dynamic caption generation.

Figure 1: Top: Scene concepts affect word choice in caption generation. Middle: Word probability distribution when leveraging scene concepts as semantic concepts. Bottom: Word probability distribution of our scene-based factored attention method.

Despite the exciting recent progress, these works model attention based on either regional features or object-centered visual concepts. Attention driven by scene concepts, however, has never been explicitly considered, even though scene concepts play a very important role in determining the major keywords of captions. As shown in the left case of Fig. 1 (top), it is better to say "a person is laying" (due to the variance of crowdsourced labeling, "laying" is used more frequently than "lying" in the captions of the MS COCO dataset) than "a person is sleeping" when the scene is obviously outdoor. By contrast, in the right case of Fig. 1 (top), when the photo is taken in a room with a man lying down, a caption like "a man is sleeping" is more likely. Clearly, scene concepts have a considerable influence on word generation.

It is intuitive to introduce scene cues into image captioning. One possible way is to apply semantic concept attention; for example, one can follow You et al. [54] and attend to scene cues as semantic concepts. Nevertheless, visual information is inherently hierarchical [26], which makes the existing works suboptimal. As shown by the word probability distribution in Fig. 1 (middle), after a partial sentence has been generated for the images in Fig. 1 (top), the model with scene semantic attention is still unsure whether to choose the word "laying" or "sleeping". We argue that scene concepts and object-centered visual concepts should not be treated equally, since scene concepts carry more global and macroscopic context information than object-centered visual concepts. A more explicit mechanism is therefore needed to use them as core guidance in the attention module.

Figure 2: Overview of the proposed model. Given a set of visual information, i.e., regional features, object-centered visual concepts and scene concepts extracted from the input image by the encoder, the factored attention module embeds the scene concepts into the current hidden feature of the first LSTM to attend to the regional features and object-centered visual concepts. The weighted visual information is then fed into the second LSTM of the decoder to generate the next word.

In this paper, we argue that the fundamental issue lies in explicitly and separately modeling scene concepts, object-centered visual concepts and sentence generation. On one hand, scene concepts usually correspond to attribute keywords in captions. On the other hand, the context of scene concepts can guide the attention over object-centered visual concepts during sentence generation. Driven by these insights, we propose a novel scene-based factored attention module for image captioning. The framework of the proposed method is illustrated in Fig. 2. To fully encode the input image, we first integrate hierarchical visual information (including regional features, object-centered visual concepts and scene concepts) to enrich the keywords and details in caption generation. Then, we design a scene-based factored attention module to attend to this hierarchical visual information. Generally speaking, we embed scene concepts into the hidden feature of an LSTM [20]. Conditioned on the scene-embedded hidden feature, the module determines which regional features and object-centered visual concepts are more important by assigning corresponding weights. Finally, the outputs of the factored attention module are fed into a second LSTM to generate the next word. As shown in Fig. 1 (bottom), our model with scene-based factored attention is more confident about the chosen words.

The contributions of this paper are summarized as follows: (1) We are the first to explicitly embed scene concepts in image captioning. We are also the first to explicitly model relevance among scene concepts, object-centered visual concepts and caption generation. (2) We propose a factored attention module to better perceive the hierarchical visual information. Quantitative comparisons to the state-of-the-art demonstrate our merits.

2 Related Work

Our work relates to three topics: image captioning, tensor factorization and scene understanding. In this section, we categorize and review related work as follows.

2.1 Image Captioning

Most existing image captioning methods rely on the encoder-decoder framework inspired by machine translation [3, 41]. The framework "translates" an image into a sentence: visual features are extracted by a convolutional neural network (CNN) and fed into a Long Short-Term Memory (LSTM) network to generate captions. Image captioning techniques have been extensively explored in [22, 45, 31, 9, 4, 5]. A few models [51, 54, 2] apply attention mechanisms to bridge the gap between visual understanding and language processing. Prior attention mechanisms rely on either regional convolutional features or object-centered visual concepts extracted from images. The former allows the model to dynamically select regional features during sentence generation. The latter, such as semantic attention [54, 49], applies top-down attention over detected object-centered visual concepts. However, these object-centered visual concepts have two major drawbacks. First, they do not retain spatial information or scene guidance, so captions may miss scene keywords and scene details. Second, they do not take the hierarchy of semantics into account, which may lead to sentence bias. As demonstrated in our experiments, considering hierarchical semantic concepts at the scene and object levels better guides attention selection and caption generation.

2.2 Tensor Factorization

Tensor factorization has been used in many multimedia tasks, such as attribute learning [32], motion style modeling [43], image transformations [40] and sequence learning [39, 50]. Recently, tensor factorization has also been applied in [24, 13, 15, 14] to further improve model performance. More specifically, Kiros et al. [24] used a factored tensor to condition word embeddings on visual features. Fu et al. [13] inferred a topic vector (named a scene vector) for tensor factorization in an LSTM. Gan et al. [15] used factorization to avoid dimension explosion. Gan et al. [14] introduced a factored LSTM to learn captions of different styles. In contrast to these works, we use tensor factorization not only to explicitly model the relevance between visual information and sentence generation, but also to guide the attention selection mechanism.

2.3 Scene understanding

In the last few years, CNNs have emerged as powerful image representations for scene classification [33, 48, 38, 25, 55, 47, 17]. Thanks to the development of the Scene-15, MIT Indoor-67, SUN-397 and Places datasets [56], scene classification has made great progress and has gradually moved away from hand-crafted features. Recently, deep convolutional networks have been exploited for scene classification by Zhou et al. [56]. We take full advantage of recent scene understanding methods to help improve the quality of caption generation.

3 The Proposed Model

Firstly, a set of hierarchical visual information, i.e., regional features $V = \{v_1, \dots, v_k\}$, object-centered visual concepts $C = \{c_1, \dots, c_m\}$ and scene concepts $s$, is extracted from the input image. Secondly, the scene-based factored attention module embeds the scene concepts into the current hidden feature of the first LSTM to attend to the regional features and object-centered visual concepts. Finally, the weighted visual information is fed into the second LSTM to generate the next word.

In Sec. 3.1, we briefly introduce the basic architecture of our proposed image captioning method. Then in Sec. 3.2, we introduce the factored attention module in details. Finally, in Sec. 3.3, we introduce the objective function used in our work.

3.1 Caption Generation

Long Short-Term Memory (LSTM) [20] is a widely-used Recurrent Neural Network (RNN), which is known to learn patterns with long-term temporal dependencies. We briefly refer to the operation of the LSTM over a single time step using the following notation:

$h_t = \mathrm{LSTM}(x_t, h_{t-1})$,  (1)

where $x_t$ is the input vector of the LSTM and $h_t$ is the hidden feature of the LSTM at time step $t$.

The hidden feature $h_t$ at time step $t$ in Eq. 1 can be expanded as follows:

$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1})$,  (2)
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1})$,  (3)
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1})$,  (4)
$g_t = \tanh(W_{gx} x_t + W_{gh} h_{t-1})$,  (5)
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$,  (6)
$h_t = o_t \odot \tanh(c_t)$,  (7)

where $i_t$, $f_t$, $o_t$, $c_t$ and $h_t$ are the input gate, forget gate, output gate, memory cell and hidden feature, respectively, $g_t$ is the candidate memory, and $\sigma$ and $\odot$ denote the sigmoid function and the element-wise Hadamard product, respectively. For brevity, we omit all bias terms in the rest of the paper.
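For reference, the gate updates of Eqs. 2-7 can be sketched in PyTorch as follows (the class name and dimensions are illustrative; biases are omitted as stated above):

```python
import torch
import torch.nn as nn

class BasicLSTMCell(nn.Module):
    """Minimal LSTM cell following Eqs. 2-7 (bias terms omitted)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}]
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size, bias=False)
        self.W_f = nn.Linear(input_size + hidden_size, hidden_size, bias=False)
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size, bias=False)
        self.W_g = nn.Linear(input_size + hidden_size, hidden_size, bias=False)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = torch.cat([x_t, h_prev], dim=-1)
        i_t = torch.sigmoid(self.W_i(z))   # input gate, Eq. 2
        f_t = torch.sigmoid(self.W_f(z))   # forget gate, Eq. 3
        o_t = torch.sigmoid(self.W_o(z))   # output gate, Eq. 4
        g_t = torch.tanh(self.W_g(z))      # candidate memory, Eq. 5
        c_t = f_t * c_prev + i_t * g_t     # memory cell update, Eq. 6
        h_t = o_t * torch.tanh(c_t)        # hidden feature, Eq. 7
        return h_t, c_t
```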

The core of the LSTM is a memory cell $c_t$ that maintains the multi-modal knowledge of the inputs observed up to time step $t$. Updates to the memory cell are modulated by three gates, i.e., the input gate $i_t$, the output gate $o_t$ and the forget gate $f_t$, which determine when and how information flows. Specifically, the input gate controls the input to the LSTM. The output gate manages the transfer of memory into the hidden feature $h_t$, from which the next word is generated. The forget gate decides whether to forget the previous memory $c_{t-1}$.

Our captioning model consists of two LSTM layers, referred to as the first LSTM and the second LSTM; superscripts on variables indicate the corresponding layer. The first LSTM generates a hidden feature of the current sequence based on an input that combines the partial sequence generated so far, the current input word and the context information of the second LSTM. It is formulated as follows:

$x_t^1 = [\,h_{t-1}^2;\, W_e \Pi_t\,]$,  (8)
$h_t^1 = \mathrm{LSTM}(x_t^1, h_{t-1}^1)$,  (9)

where $W_e$ is a word embedding matrix for a vocabulary $\Sigma$ of size $|\Sigma|$, and $\Pi_t$ is the one-hot vector of the input word at time step $t$.

We define the notation $y_{1:t}$ as a sequence of words $(y_1, \dots, y_t)$, and obtain the first conditional word probability distribution at time step $t$ as follows:

$p^1(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_p^1 h_t^1)$,  (10)

where $W_p^1$ is a learned weight matrix. Note that this output distribution over words is used only for loss optimization during training; the details are described in Sec. 3.3.

In our proposed scene-based factored attention module, at each time step $t$, we use the current hidden feature $h_t^1$ to obtain the attentively weighted visual information $\hat{z}_t$; the details are described in Sec. 3.2.

We devise the second LSTM layer to make use of the weighted visual information $\hat{z}_t$ to generate a word at each time step $t$, which is formulated as:

$x_t^2 = [\,\hat{z}_t;\, h_t^1\,]$,  (11)
$h_t^2 = \mathrm{LSTM}(x_t^2, h_{t-1}^2)$,  (12)
$p^2(y_t \mid y_{1:t-1}) = \mathrm{softmax}(W_p^2 h_t^2)$,  (13)

where $W_p^2$ is a learned weight matrix. The output $p^2$ is the second distribution over words, which not only participates in loss optimization during training but is also used on its own to sample words at test time. The distribution over the whole generated caption is computed as the product of conditional distributions:

$p(y_{1:T}) = \prod_{t=1}^{T} p^2(y_t \mid y_{1:t-1})$.  (14)
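For clarity, one decoding step of Eqs. 8-13 can be sketched in PyTorch as follows; the class and argument names and the feature dimensions are illustrative, and `attention` stands for a callable implementing the module of Sec. 3.2, returning the $\hat{z}_t$ of Eq. 23:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerDecoderStep(nn.Module):
    """One decoding step of the two-LSTM decoder (Eqs. 8-13); a sketch only.

    zhat_size is the dimension of the attended vector z_hat (region feature
    dimension + concept embedding dimension in this sketch).
    """

    def __init__(self, vocab_size, embed_size=1000, hidden_size=1000, zhat_size=2348):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)        # W_e
        self.lstm1 = nn.LSTMCell(embed_size + hidden_size, hidden_size)
        self.lstm2 = nn.LSTMCell(zhat_size + hidden_size, hidden_size)
        self.logit1 = nn.Linear(hidden_size, vocab_size)         # W_p^1
        self.logit2 = nn.Linear(hidden_size, vocab_size)         # W_p^2

    def forward(self, word_ids, state1, state2, attention, V, C, s):
        h1, c1 = state1
        h2, c2 = state2
        # Eq. 8: previous second-LSTM hidden feature + current word embedding
        x1 = torch.cat([h2, self.embed(word_ids)], dim=-1)
        h1, c1 = self.lstm1(x1, (h1, c1))                        # Eq. 9
        p1 = F.log_softmax(self.logit1(h1), dim=-1)              # Eq. 10 (training only)
        # Scene-based factored attention (Sec. 3.2) over regions V and concepts C
        z_hat = attention(h1, V, C, s)                           # Eq. 23
        x2 = torch.cat([z_hat, h1], dim=-1)                      # Eq. 11
        h2, c2 = self.lstm2(x2, (h2, c2))                        # Eq. 12
        p2 = F.log_softmax(self.logit2(h2), dim=-1)              # Eq. 13
        return p1, p2, (h1, c1), (h2, c2)
```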

3.2 Scene-based Factored Attention Module

In order to take full advantage of scene concepts and to model hierarchical semantic concepts, we propose a factorization method that embeds the scene concepts into the attention mechanism.

We first obtain a diagonal matrix $S$ by directly diagonalizing the scene concept vector $s$. This diagonal scene matrix is then embedded into the LSTM hidden feature by factorizing the parameter matrix of the traditional attention mechanism [51, 2] into three matrices $W_a$, $S$ and $W_b$:

$S = \mathrm{diag}(s)$,  (15)
$W_s = W_a S W_b$,  (16)

where $W_a$ and $W_b$ are learned weight matrices shared by all images and scene concepts.
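A minimal sketch of Eqs. 15-16 is given below; the scene vector length and the dimensions are illustrative, and a single image (no batch dimension) is assumed for clarity:

```python
import torch
import torch.nn as nn

class FactoredSceneProjection(nn.Module):
    """Builds W_s = W_a diag(s) W_b and applies it to the hidden feature (Eqs. 15-16)."""

    def __init__(self, hidden_size=1000, num_scenes=365, attn_size=512):
        super().__init__()
        # W_a and W_b are shared by all images and scene concepts
        self.W_a = nn.Parameter(0.01 * torch.randn(attn_size, num_scenes))
        self.W_b = nn.Parameter(0.01 * torch.randn(num_scenes, hidden_size))

    def forward(self, h1, s):
        # h1: (hidden_size,) first-LSTM hidden feature; s: (num_scenes,) scene scores
        S = torch.diag(s)                  # Eq. 15
        W_s = self.W_a @ S @ self.W_b      # Eq. 16
        return W_s @ h1                    # scene-conditioned transform of h_t^1
```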

The factored matrix $W_s$ is used to transform the hidden feature $h_t^1$, which directly injects the context of the scene concepts; the hidden feature thereby acquires the scene context. Given the regional features $V = \{v_1, \dots, v_k\}$, we generate the first set of normalized attention weights as follows:

$a_{i,t} = w_r^{\top} \tanh(W_v v_i + W_s h_t^1)$,  (17)
$\alpha_t = \mathrm{softmax}(a_t)$,  (18)
$\hat{v}_t = \sum_{i=1}^{k} \alpha_{i,t} v_i$,  (19)

where $w_r$ and $W_v$ are learned weight matrices.

Similarly, given the object-centered visual concepts $C = \{c_1, \dots, c_m\}$, the second set of normalized attention weights is generated as follows:

$b_{j,t} = w_c^{\top} \tanh(W_c c_j + W_s h_t^1)$,  (20)
$\beta_t = \mathrm{softmax}(b_t)$,  (21)
$\hat{c}_t = \sum_{j=1}^{m} \beta_{j,t} c_j$,  (22)

where $w_c$ and $W_c$ are learned weight matrices.

Finally, the weighted regional features $\hat{v}_t$ and the weighted object-centered visual concepts $\hat{c}_t$ are concatenated via Eq. 23 and fed into the second LSTM through Eq. 11 and Eq. 12.

$\hat{z}_t = [\,\hat{v}_t;\, \hat{c}_t\,]$.  (23)
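Putting Eqs. 15-23 together, the full attention module can be sketched as follows; a single image is again assumed for clarity and the feature dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactoredAttention(nn.Module):
    """Scene-based factored attention over regions and visual concepts (Eqs. 15-23), as a sketch."""

    def __init__(self, hidden_size=1000, region_size=2048, concept_size=300,
                 num_scenes=365, attn_size=512):
        super().__init__()
        self.W_a = nn.Parameter(0.01 * torch.randn(attn_size, num_scenes))
        self.W_b = nn.Parameter(0.01 * torch.randn(num_scenes, hidden_size))
        self.W_v = nn.Linear(region_size, attn_size, bias=False)
        self.W_c = nn.Linear(concept_size, attn_size, bias=False)
        self.w_r = nn.Linear(attn_size, 1, bias=False)
        self.w_c = nn.Linear(attn_size, 1, bias=False)

    def forward(self, h1, V, C, s):
        # h1: (hidden,), V: (k, region_size), C: (m, concept_size), s: (num_scenes,)
        W_s = self.W_a @ torch.diag(s) @ self.W_b                  # Eqs. 15-16
        h_s = W_s @ h1                                             # scene-conditioned hidden feature
        a = self.w_r(torch.tanh(self.W_v(V) + h_s)).squeeze(-1)    # Eq. 17
        alpha = F.softmax(a, dim=-1)                               # Eq. 18
        v_hat = alpha @ V                                          # Eq. 19
        b = self.w_c(torch.tanh(self.W_c(C) + h_s)).squeeze(-1)    # Eq. 20
        beta = F.softmax(b, dim=-1)                                # Eq. 21
        c_hat = beta @ C                                           # Eq. 22
        return torch.cat([v_hat, c_hat], dim=-1)                   # Eq. 23
```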

3.3 Objective Function

Given a target ground-truth sequence $y_{1:T}^{*}$ and a model with parameters $\theta$, we minimize the following maximum likelihood estimation (MLE) loss:

$L_{MLE}(\theta) = -\sum_{t=1}^{T} \log p\big(y_t^{*} \mid y_{1:t-1}^{*}; \theta\big)$.  (24)

In order to regularize the first LSTM more directly, we calculate the MLE loss on both word distributions $p^1$ and $p^2$ and combine them as:

$L(\theta) = \lambda\, L_{MLE}^{1}(\theta) + (1-\lambda)\, L_{MLE}^{2}(\theta)$,  (25)

where $\lambda$ is a hyper-parameter between 0 and 1.
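A minimal sketch of this combined loss is shown below, assuming the convex combination of the two MLE losses written above; the function name, the default weight and the padding index are illustrative, and the inputs are the log-probabilities produced by Eqs. 10 and 13:

```python
import torch.nn.functional as F

def combined_mle_loss(logp1, logp2, targets, lam=0.5, pad_idx=0):
    """Combined MLE loss over both word distributions (Eqs. 24-25), as a sketch.

    logp1, logp2: (T, vocab) log-probabilities from the first/second LSTM
    targets:      (T,) ground-truth word indices
    lam:          hyper-parameter in (0, 1) weighting the two losses
    """
    loss1 = F.nll_loss(logp1, targets, ignore_index=pad_idx)   # Eq. 24 on p^1
    loss2 = F.nll_loss(logp2, targets, ignore_index=pad_idx)   # Eq. 24 on p^2
    return lam * loss1 + (1.0 - lam) * loss2                   # Eq. 25
```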

Finally, we also introduce a reinforcement learning (RL) objective into our framework for fair comparison with recent RL-based works such as [37, 2, 6, 30, 16]. (It should be noted that our scene-based factored attention module can also be broadly used in other RL-based or GAN-based methods [11, 7].) We minimize the negative expected reward after MLE training:

$L_{RL}(\theta) = -\mathbb{E}_{y^{s} \sim p_{\theta}}\big[r(y^{s})\big]$,  (26)

where $y^{s}$ is a sampled caption and $r(\cdot)$ is the CIDEr [44] reward function. Similar negative expected reward objectives have been proven effective in other works [19, 37, 2].

Following Self-critical Sequence Training (SCST) [37], the gradient of $L_{RL}(\theta)$ can be approximated as:

$\nabla_{\theta} L_{RL}(\theta) \approx -\big(r(y^{s}) - r(\hat{y})\big)\, \nabla_{\theta} \log p_{\theta}(y^{s})$,  (27)

where $y^{s}$ is a sampled caption and $r(\hat{y})$ is the baseline score obtained by greedy decoding.
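A sketch of this self-critical update is given below; `model.sample`, `model.greedy_decode` and `cider` are assumed helpers (sampling returns a caption together with its summed log-probability, and `cider(caption, references)` returns a scalar CIDEr score), not part of any particular library:

```python
import torch

def scst_loss(model, image_feats, cider, references):
    """Self-critical sequence training loss (Eqs. 26-27), as a sketch."""
    # Sample a caption y^s and keep its log-probability for the gradient.
    sampled_caption, log_prob = model.sample(image_feats)
    # Greedy-decoded baseline caption \hat{y} and rewards need no gradient.
    with torch.no_grad():
        baseline_caption = model.greedy_decode(image_feats)
        reward = cider(sampled_caption, references)      # r(y^s)
        baseline = cider(baseline_caption, references)   # r(\hat{y})
    # Eq. 27: -(r(y^s) - r(\hat{y})) * grad log p_theta(y^s)
    return -(reward - baseline) * log_prob
```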

4 Experiments

In this section, we conduct extensive experiments to validate the effectiveness of the scene-based factored attention module. In Section 4.1, we briefly introduce the dataset, the image and caption pre-processing, the evaluation metrics and the implementation details. Next, in Section 4.2, we present an ablation study of the proposed model. In Section 4.3, we compare and analyze the results of the proposed model against other state-of-the-art models both offline and online. Finally, in Section 4.4, we qualitatively analyze our merits in detail.

4.1 Experimental Settings

4.1.1 Dataset

In this paper, we utilize the MS COCO dataset [8], which is widely used for image captioning training and evaluation. The MS COCO dataset contains 123,287 images, each annotated with at least five captions by different Amazon Mechanical Turk (AMT) workers. Following the Karpathy split (https://github.com/karpathy/neuraltalk) in [22], we use 113,287 images for training, 5K images for validation and 5K for testing.

Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE CIDEr SPICE
Baseline 0.764 0.602 0.460 0.349 0.269 0.559 1.088 0.201
Baseline + VC 0.765 0.605 0.468 0.359 0.274 0.564 1.131 0.205
Baseline + Scene 0.776 0.616 0.473 0.359 0.271 0.568 1.124 0.205
Baseline + [VC, Scene] 0.776 0.618 0.476 0.361 0.272 0.567 1.132 0.208
Baseline + VC + Scene 0.776 0.618 0.477 0.367 0.277 0.570 1.147 0.209
Table 1: Ablation study results on the MS COCO Karpathy test split. "VC" denotes adding traditional visual concept attention, and "Scene" denotes adding the factored attention module. "[VC, Scene]" denotes concatenating the visual concepts and scene concepts and treating them as semantic attention, in the same way as "VC".
Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE CIDEr SPICE
NIC[45] 0.663 0.423 0.277 0.183 0.237 - 0.855 -
Soft-Attention[51] 0.707 0.492 0.344 0.243 0.239 - - -
Hard-Attention[51] 0.718 0.504 0.357 0.250 0.230 - - -
ATT[54] 0.709 0.537 0.402 0.304 0.243 - - -
LSTM-A5[53] 0.730 0.565 0.429 0.325 0.251 0.538 0.986 -
ARNet[10] 0.740 0.576 0.440 0.335 0.261 0.546 1.034 0.190
LTG-Review-Net[21] 0.743 0.579 0.442 0.336 0.261 0.548 1.039 -
Up-Down[2] 0.772 - - 0.362 0.270 0.564 1.135 0.203
DA[16] 0.758 - - 0.357 0.274 0.562 1.119 0.205
Ours 0.776 0.618 0.477 0.367 0.277 0.570 1.147 0.209
SCST:Att2in[37] - - - 0.313 0.260 0.543 1.013 -
SCST:Att2all[37] - - - 0.300 0.259 0.534 0.994 -
BAM[6] - - - 0.350 0.262 0.559 1.111 -
ATTN+C+D(1)[30] - - - 0.363 0.273 0.571 1.141 0.211
Up-Down[2] 0.798 - - 0.363 0.277 0.569 1.201 0.214
DA[16] 0.799 - - 0.375 0.285 0.582 1.256 0.223
Ours 0.803 0.646 0.601 0.381 0.285 0.582 1.268 0.220
Table 2: Single-model image captioning performance on the MS COCO Karpathy test split. Results are reported for models trained with the standard MLE loss (top) and for RL-based methods (bottom). Numbers in boldface are the best known results and underlined numbers are the second best.

4.1.2 Images and Captions Pre-processing

In the encoder-decoder framework, the image encoder is an essential part of image captioning, used to extract the visual information of images. To fully encode the input image, we use three kinds of visual information at different levels. The low level consists of the regional features $V$, extracted from the output of a Faster R-CNN [36] with ResNet-101 [18], as in [2, 30, 16]; note that the number of regional features varies from image to image. The middle level consists of the object-centered visual concepts $C$, extracted by a visual concept extractor CNN trained on the MS COCO dataset [8]. We take nouns from the captions as our visual semantic concepts and treat their prediction as a multi-label classification problem, minimizing an element-wise logistic loss with label smoothing [42]. The high level consists of the scene concepts $s$, extracted by a scene classifier CNN pretrained on the Places dataset [56].

We follow standard practice and perform only minimal text pre-processing. All sentences in the training set are truncated to 16 words, converted to lower case and tokenized on white space; words that occur fewer than 5 times are filtered out, resulting in a model vocabulary of 9,487 words.
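A sketch of this pre-processing pipeline is shown below; the function name and the special tokens are illustrative, while the truncation length and frequency threshold follow the description above:

```python
from collections import Counter

def build_vocab(train_captions, max_len=16, min_count=5):
    """Tokenize captions and build the vocabulary, as a sketch of the
    pre-processing described above (names and special tokens are illustrative)."""
    counter = Counter()
    tokenized = []
    for caption in train_captions:
        # lower-case, split on white space, truncate to max_len tokens
        tokens = caption.lower().split()[:max_len]
        counter.update(tokens)
        tokenized.append(tokens)
    # keep words that occur at least min_count times
    vocab = ["<pad>", "<bos>", "<eos>", "<unk>"] + \
            sorted(w for w, c in counter.items() if c >= min_count)
    word2idx = {w: i for i, w in enumerate(vocab)}
    encoded = [[word2idx.get(t, word2idx["<unk>"]) for t in tokens]
               for tokens in tokenized]
    return vocab, word2idx, encoded
```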

4.1.3 Evaluation Metric

To evaluate the quantitative performance of the captions generated by our proposed model, we use five metrics commonly used in image captioning: BLEU [34], METEOR [12], ROUGE [27], CIDEr [44] and SPICE [1]. All results are computed with the Microsoft COCO caption evaluation tool (https://github.com/tylin/coco-caption); for all five metrics, a higher score means better performance.
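As an example of how such scores can be computed with this toolkit (assuming the coco-caption repository is on the Python path; the image id and captions below are placeholders):

```python
# Assumes the coco-caption toolkit (https://github.com/tylin/coco-caption)
# is available on the Python path; ids and captions are placeholders.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# ground-truth references and generated results, keyed by image id
gts = {0: ["a man is sleeping on a couch", "a person lies on a sofa"]}
res = {0: ["a man is sleeping on a couch"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1..4
cider_score, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr
print(bleu_scores, cider_score)
```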

4.1.4 Implementation Details

We set the number of hidden units in each LSTM to 1,000, the number of hidden units in the attention layer to 512, and the size of the input word embedding to 1,000. During training, we use the Adam optimizer [23] with a learning rate initialized to 5e-4 and decayed by a factor of 0.8 every three epochs. The batch size is 100. During testing, beam search is used to sample captions, with a beam size of 2.
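For reference, these hyper-parameter choices correspond to a training configuration along the following lines (a sketch only; the placeholder model and the number of epochs are illustrative and the training loop body is omitted):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the captioning network described above
# (LSTM hidden size 1000, attention size 512, word embedding size 1000).
model = nn.Linear(10, 10)  # placeholder; the real model is the decoder of Sec. 3

# Adam with initial learning rate 5e-4, decayed by a factor of 0.8 every 3 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

batch_size = 100   # training batch size
beam_size = 2      # beam width used when sampling captions at test time

num_epochs = 30    # illustrative value; not specified in the text
for epoch in range(num_epochs):
    # ... one epoch of MLE (and later CIDEr-based RL) training would go here ...
    scheduler.step()  # decay the learning rate on schedule
```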

Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE CIDEr
c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40
Google NIC[45] 0.713 0.895 0.542 0.802 0.407 0.694 0.309 0.587 0.254 0.346 0.530 0.682 0.943 0.946
ATT[54] 0.731 0.901 0.565 0.816 0.424 0.710 0.316 0.600 0.251 0.336 0.535 0.683 0.944 0.959
Review Net[52] 0.720 0.900 0.550 0.812 0.414 0.705 0.311 0.597 0.256 0.347 0.535 0.686 0.965 0.969
Adaptive[29] 0.748 0.920 0.584 0.845 0.444 0.744 0.336 0.637 0.264 0.359 0.555 0.705 1.042 1.059
PG-BCMR[28] 0.754 0.918 0.591 0.841 0.445 0.738 0.332 0.624 0.257 0.340 0.550 0.695 1.013 1.031
SCST:Att2all[37] 0.781 0.937 0.619 0.860 0.470 0.759 0.352 0.645 0.270 0.355 0.563 0.707 1.147 1.167
LSTM-A3[53] 0.787 0.937 0.627 0.867 0.476 0.765 0.356 0.652 0.270 0.354 0.564 0.705 1.160 1.180
DA[16] 0.794 0.944 0.635 0.880 0.487 0.784 0.368 0.674 0.282 0.370 0.577 0.722 1.205 1.220
Up-Down[2] 0.802 0.952 0.641 0.888 0.491 0.794 0.369 0.685 0.276 0.367 0.571 0.724 1.179 1.205
Ours 0.803 0.947 0.647 0.887 0.500 0.797 0.379 0.690 0.282 0.372 0.581 0.730 1.235 1.256
Table 3: Quantitative comparison with state-of-the-art image captioning methods on the c5 and c40 reference sets of the online MS COCO test server. Both SCST:Att2all and Up-Down are ensembles of 4 models, while ours is a single model. LSTM-A3 utilizes ResNet-152 based visual features. Numbers in bold are the best and underlined numbers are the second best.

4.2 Ablation Study

In order to figure out the contribution of each component, we conduct the following ablation studies on the MS COCO dataset with Karpathy test split. Specifically, we remove the visual concepts (VC) and the proposed factored attention module (Scene) respectively from our model.

We summarize the experimental results in Tab. 1. The baseline is our re-implementation of the Up-Down method proposed in [2]. "VC" denotes adding traditional visual concept attention, and "Scene" denotes adding the factored attention module. "[VC, Scene]" denotes concatenating the visual concepts and scene concepts as semantic attention. "Baseline + VC + Scene" is our full model, i.e., the baseline model equipped with our scene-based factored attention module.

From the results in Tab. 1, we can see that our model outperforms the baseline model, with relative improvements ranging from 1.6% to 6.3%. With the guidance of scene concepts, the model makes better use of the visual information. In addition, comparison with "Baseline + [VC, Scene]" shows that although adding scene cues to visual concept attention helps the model choose words, this is not the optimal solution: "Baseline + VC + Scene" obtains higher performance on all 5 metrics, which verifies the importance of our scene-based factored attention module.

In order to determine the hyper-parameter $\lambda$ in Eq. 25, we design an experiment with a variable-controlling approach. The objective results on the Karpathy test split for different values of $\lambda$ are shown in Fig. 3; the evaluation metrics reach their optimal scores at a particular value of $\lambda$.

Figure 3: A variable-controlling experiment for $\lambda$ selection.

4.3 Comparing with State-of-the-Arts

In Tab. 2, we report the performance of our framework in comparison with existing state-of-the-art methods on the test portion of the Karpathy split. For a fair comparison, results are reported for models trained with the standard MLE loss in Tab. 2 (top) and for models optimized for the CIDEr score in Tab. 2 (bottom). For offline evaluation, all image captioning models are single models without fine-tuning of the input ResNet / R-CNN model. Our model performs best on the commonly used evaluation metrics, e.g., BLEU, ROUGE and CIDEr. These experimental results demonstrate that the proposed scene-based factored attention module significantly boosts the scores compared with the existing state of the art.

We also compare our model with recent results on the official MS COCO evaluation by uploading results to the online MS COCO test server. The online server provides "c5" and "c40" metrics, which use 5 and 40 reference captions, respectively. The results are summarized in Tab. 3: a single model trained with CIDEr optimization achieves the best performance on most metrics among the published state-of-the-art image captioning models on the blind test split.

Figure 4: Qualitative analysis. "Detected" denotes the scene concepts detected from the image, and "Ours w scene" and "Ours wo scene" denote our proposed model with and without the scene-based factored attention module, respectively. The model with the proposed module pays more attention to the details of the scenes and is more inclined to mention the scene keywords in the generated descriptions.
Figure 5: Visualization of attention regions with/without scene. "Ours w scene" and "Ours wo scene" denote our proposed model with and without the scene-based factored attention module. The region with the maximum attention weight is shown in orange.

4.4 Qualitative analysis

Here, we show some qualitative results in Fig. 4 for a better understanding of our proposed model. "Detected" denotes the scene concepts detected from the image, and "Ours w scene" and "Ours wo scene" denote our proposed model with and without the scene-based factored attention module. The model with the proposed module pays more attention to the details of the scenes and is more inclined to mention the scene keywords when generating descriptions.

We further visualize heatmaps of the attention regions for words generated with and without the scene-based factored attention module on the same image in Fig. 5. Following common practice [51, 2], we directly visualize the attention weights $\alpha_t$ in Eq. 18 associated with the word emitted at the same time step $t$. We find that the attended area is clearer when the scene semantic concepts are used as guidance. In the complex scene shown at the top of Fig. 5, the model attends to regional features more clearly and discriminatively and tends to describe the scene in more detail. In the relatively simple scene shown at the bottom of Fig. 5, the attention weights generated by our model are more reasonable, indicating a more accurate use of the regional features. As the captions are generated, the attention weights in both examples vary appropriately with the emitted words.

5 Conclusions

In this work, we propose a novel scene-based factored attention module for image captioning. Different from previous works based on either regional feature attention or object-centered visual concept attention, our model takes scene concepts into account. To the best of our knowledge, we are the first to exploit scene concepts in image captioning and to model the relevance among scene concepts, object-centered visual concepts and caption generation. In our proposed scene-based factored attention module, we explicitly embed scene concepts, via a factored tensor, into the LSTM hidden feature. Conditioned on the scene-embedded hidden feature, we obtain the relative importance of regional features and object-centered visual concepts. The real power of the proposed module lies in its ability to attend to hierarchical visual information for better captions. Experiments conducted on the MS COCO captioning dataset validate the superiority of the proposed approach.

Acknowledgement

This work is supported by the National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), Nature Science Foundation of China (No.U1705262, No.61772443, and No.61572410), Post Doctoral Innovative Talent Support Program under Grant BX201600094, China Post-Doctoral Science Foundation under Grant 2017M612134, Scientific Research Project of National Language Committee of China (Grant No. YB135-49), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

References

  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: §4.1.3.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §2.1, §3.2, §3.3, §4.1.2, §4.2, §4.4, Table 2, Table 3.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.1.
  • [4] F. Chen, R. Ji, J. Su, Y. Wu, and Y. Wu (2017) Structcap: structured semantic embedding for image captioning. In Proceedings of the 25th ACM international conference on Multimedia, pp. 46–54. Cited by: §2.1.
  • [5] F. Chen, R. Ji, X. Sun, Y. Wu, and J. Su (2018) Groupcap: group-based image captioning with structured relevance and diversity constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1345–1353. Cited by: §2.1.
  • [6] S. Chen and Q. Zhao (2018) Boosted attention: leveraging human attention for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84. Cited by: §3.3, Table 2.
  • [7] T. Chen, Y. Liao, C. Chuang, W. Hsu, J. Fu, and M. Sun (2017) Show, adapt and tell: adversarial training of cross-domain image captioner. In Proceedings of the IEEE International Conference on Computer Vision, pp. 521–530. Cited by: footnote 2.
  • [8] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.1.1, §4.1.2.
  • [9] X. Chen and C. Lawrence Zitnick (2015) Mind’s eye: a recurrent visual representation for image caption generation. In CVPR, pp. 2422–2431. Cited by: §2.1.
  • [10] X. Chen, L. Ma, W. Jiang, J. Yao, and W. Liu (2018) Regularizing rnns for caption generation by reconstructing the past with the present. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7995–8003. Cited by: Table 2.
  • [11] B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2970–2979. Cited by: footnote 2.
  • [12] M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pp. 376–380. Cited by: §4.1.3.
  • [13] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang (2017) Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2321–2334. Cited by: §2.2.
  • [14] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng (2017) Stylenet: generating attractive visual captions with styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3137–3146. Cited by: §2.2.
  • [15] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng (2017) Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5630–5639. Cited by: §1, §2.2.
  • [16] L. Gao, K. Fan, J. Song, X. Liu, X. Xu, and H. T. Shen (2019) Deliberate attention networks for image captioning. Cited by: §3.3, §4.1.2, Table 2, Table 3.
  • [17] S. Guo, W. Huang, L. Wang, and Y. Qiao (2017) Locally supervised deep hybrid model for scene recognition. IEEE transactions on image processing 26 (2), pp. 808–820. Cited by: §2.3.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.2.
  • [19] X. He and L. Deng (2012) Maximum expected bleu training of phrase and lexicon translation models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 292–301. Cited by: §3.3.
  • [20] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §3.1.
  • [21] W. Jiang, L. Ma, X. Chen, H. Zhang, and W. Liu (2018) Learning to guide decoding for image captioning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Table 2.
  • [22] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137. Cited by: §2.1, §4.1.1.
  • [23] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.4.
  • [24] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Multimodal neural language models.. In ICML, Vol. 14, pp. 595–603. Cited by: §2.2.
  • [25] L. Li, H. Su, L. Fei-Fei, and E. P. Xing (2010) Object bank: a high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems, pp. 1378–1386. Cited by: §2.3.
  • [26] L. Li, S. Jiang, and Q. Huang (2012) Learning hierarchical semantic description via mixed-norm regularization for image understanding. IEEE Transactions on Multimedia 14 (5), pp. 1401–1413. Cited by: §1.
  • [27] C. Lin (2004) Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §4.1.3.
  • [28] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy (2017) Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE international conference on computer vision, pp. 873–881. Cited by: Table 3.
  • [29] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: Table 3.
  • [30] R. Luo, B. Price, S. Cohen, and G. Shakhnarovich (2018) Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6964–6974. Cited by: §3.3, §4.1.2, Table 2.
  • [31] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632. Cited by: §2.1.
  • [32] R. Memisevic and G. Hinton (2007) Unsupervised learning of image transformations. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.2.
  • [33] A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International journal of computer vision 42 (3), pp. 145–175. Cited by: §2.3.
  • [34] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §4.1.3.
  • [35] D. Parikh and K. Grauman (2011) Relative attributes. In 2011 International Conference on Computer Vision, pp. 503–510. Cited by: §1.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §4.1.2.
  • [37] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §3.3, §3.3, Table 2, Table 3.
  • [38] D. Song and D. Tao (2010) Biologically inspired feature manifold for scene classification. IEEE Transactions on Image Processing 19 (1), pp. 174–184. Cited by: §2.3.
  • [39] J. Song, Z. Gan, and L. Carin (2016) Factored temporal sigmoid belief networks for sequence learning. In International Conference on Machine Learning, pp. 1272–1281. Cited by: §2.2.
  • [40] I. Sutskever, J. Martens, and G. E. Hinton (2011) Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024. Cited by: §2.2.
  • [41] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §2.1.
  • [42] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.1.2.
  • [43] G. W. Taylor and G. E. Hinton (2009) Factored conditional restricted boltzmann machines for modeling motion style. In Proceedings of the 26th annual international conference on machine learning, pp. 1025–1032. Cited by: §2.2.
  • [44] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §3.3, §4.1.3.
  • [45] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, pp. 3156–3164. Cited by: §1, §2.1, Table 2, Table 3.
  • [46] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2017) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE transactions on pattern analysis and machine intelligence 39 (4), pp. 652–663. Cited by: §1.
  • [47] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao (2017) Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns. IEEE Transactions on Image Processing 26 (4), pp. 2055–2068. Cited by: §2.3.
  • [48] L. Wu, S. C. Hoi, and N. Yu (2010) Semantics-preserving bag-of-words models and applications. IEEE Transactions on Image Processing 19 (7), pp. 1908–1920. Cited by: §2.3.
  • [49] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel (2016) What value do explicit high level concepts have in vision to language problems?. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 203–212. Cited by: §1, §2.1.
  • [50] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. R. Salakhutdinov (2016) On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pp. 2856–2864. Cited by: §2.2.
  • [51] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention.. In ICML, Vol. 14, pp. 77–81. Cited by: §1, §2.1, §3.2, §4.4, Table 2.
  • [52] Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov (2016) Review networks for caption generation. In Advances in Neural Information Processing Systems, pp. 2361–2369. Cited by: Table 3.
  • [53] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei (2017) Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4894–4902. Cited by: §1, Table 2, Table 3.
  • [54] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, pp. 4651–4659. Cited by: §1, §2.1, Table 2, Table 3.
  • [55] J. Yu, D. Tao, Y. Rui, and J. Cheng (2013) Pairwise constraints based multiview features fusion for scene classification. Pattern Recognition 46 (2), pp. 483–496. Cited by: §2.3.
  • [56] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pp. 487–495. Cited by: §2.3, §4.1.2.