Reflective Decoding Network for Image Captioning
State-of-the-art image captioning methods mostly focus on improving visual features, while less attention has been paid to exploiting the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and the syntactic paradigm of sentences are also important for generating high-quality image captions. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency modeling and the position perception of words in a caption decoder. Our model learns to collaboratively attend to both visual and textual features and meanwhile perceive each word's relative position in the sentence, so as to maximize the information delivered in the generated caption. We evaluate the effectiveness of our RDN on the COCO image captioning dataset and achieve superior performance over previous methods. Further experiments reveal that our approach is particularly advantageous for hard cases with complex scenes that are difficult to describe.
The goal of image captioning is to automatically generate a fluent and informative language description of an image for human understanding. As an interdisciplinary task connecting Computer Vision and Natural Language Processing, it pushes toward the cutting edge of scene understanding [li2009towards] and has been drawing increasing interest in recent years. ¹This work was done while Lei Ke was an intern at Tencent.
To build a top captioning system, there are two crucial requirements. First, the captioning model needs to distill representative and meaningful visual representations from an image. Thanks to the success in image classification [krizhevsky2012imagenet] and object recognition [he2014spatial, ren2015faster], recent methods [anderson2017bottom, lu2017knowing, xu2015showattendtell, yang2016review] have shown significant advancements that mostly benefit from the improved quality of extracted visual features. Second, and relatively neglected, is the requirement that the generated captions be coherent and intelligent. Similar to the human language system, the model needs to infer and reason during the generation process based on what has already been generated and observed. Typically, this is achieved by an RNN (specifically, an LSTM [hochreiter1997long]) that stores the sequential information during caption decoding.
The traditional LSTM model, however, tends to focus more on nearby words while neglecting distant ones. For example, in Figure 1, the word 'bridge' provides an important hint for predicting the word 'river' (which is missed by the base decoder), but the two words are separated by 6 words. Current mainstream caption decoders are weak at handling this kind of long-term dependency in sequential sentences, especially when the visual content of an image is complex and hard to describe, which usually leads to generic and less accurate captions.
In this paper, we propose the Reflective Decoding Network (RDN) for image captioning, which mitigates the drawback of traditional caption decoders by enhancing their long-sequence modeling ability. Different from previous methods that boost captioning performance by improving the visual attention mechanism [anderson2017bottom, lu2017knowing, xu2015showattendtell], or by improving the encoder to supply a more meaningful intermediate representation for the decoder [jiang2018recurrent, yang2016review, yao2018exploring, you2016image], our RDN focuses directly on the target decoding side and jointly applies the attention mechanism in both the visual and textual domains. Besides, we propose to model the positional information of each word within a caption in a supervised way to capture the syntactic structure of natural language. Another advantage of RDN is that it can visualize how the model infers and makes word predictions based on the generated words. For instance, our RDN successfully decodes the word 'river' in Figure 1 by referring to the previously generated words, especially the most relevant word 'bridge'.
The main contributions of this paper are fourfold:
We propose the RDN that effectively enhances the long sequential modeling ability of the traditional caption decoder for generating high-quality image captions.
By considering long-term textual attention, we explicitly explore the coherence between words and visualize the word decision-making process in the text domain, showing how the principle and results of the framework can be interpreted from a novel perspective.
We design a novel positional module to enable our RDN to perceive the relative position of each word in the whole caption and thereby better comprehend the syntactic paradigm of natural language.
Our RDN achieves state-of-the-art performance on the COCO captioning dataset and is particularly superior to existing methods in hard cases with complex scenes that are difficult to describe.
2 Related Work
State-of-the-art captioning methods are mostly driven by advancements in machine translation [cho2014learning, sutskever2014sequence], where the encoder-decoder framework has been demonstrated to generate much more novel and coherent sentences than traditional template-based [kulkarni2013babytalk, yang2011corpus] or search-based [devlin2015language] methods. In [donahue2015long, vinyals2015showtell], the authors introduced a framework that utilizes a pre-trained CNN as an encoder to extract image features, followed by an RNN as a decoder to generate image descriptions. This model was further improved by incorporating high-level semantic attribute information [wu2016value, yao2017boosting] or by regularizing the RNN decoder [chen2018regularizing]. To distill the salient objects or important regions from an image, different kinds of attention mechanisms were integrated into the captioning framework to examine the relevant image regions when generating sentences [anderson2017bottom, lu2017knowing, xu2015showattendtell, yang2016review, you2016image].
Fusion learning of multiple encoders or decoders forms an essential part of boosting image captioning performance. In [jiang2018recurrent], the authors utilized multiple CNNs to extract complementary image features, forming a more informative and integrated representation for the decoder. Yao et al. [yao2018exploring] proposed GCN-LSTM, which builds two kinds of graphs to incorporate both semantic and spatial relations into the framework; the outputs from two separately trained decoders are linearly fused to produce the final prediction.
Similar to [anderson2017bottom, lu2018neural, yao2018exploring], our RDN also utilizes the attention mechanism and follows the encoder-decoder framework. However, we explicitly study the coherence between words, which remedies the drawback of the current captioning framework in modeling long-term dependencies in the decoder.
Language Attention in joint vision and language tasks. Learning language attention has attracted increasing interest in other joint vision and language problems, such as VQA and grounding referential expressions. In [lu2016hierarchical], the authors proposed a model that jointly reasons over both visual and language attention for visual question answering. Yu et al. [yu2018mattnet] attentively parsed expressions into three phrase embeddings to address referring expression comprehension. Different from these tasks, image captioning is a sequential language-generation process: the target description of an image is unknown during the inference stage. Our RDN therefore explores language attention over the words generated in previous states. As time steps accumulate, the attended language content grows dynamically, which enables later-predicted words to capture more useful reference information.
Language Attention in NLP tasks. Our RDN shares some ideas with the self-attention mechanism in machine translation models [miculicich2018self, tran2016recurrent, vaswani2017attention], abstractive summarization models [paulus2017deep] and dialogue systems [mei2017coherent]. A typical self-attention model such as the Transformer [vaswani2017attention] aims to learn a latent representation for each position of a sequence by referring to the whole context. In contrast, the Reflective Attention Module (RAM) of our RDN is designed as an attachable module that is seamlessly integrated into the recurrent decoding framework. Thanks to our special two-layer recurrent structure, the RAM collaborates smoothly with the visual attention component of our RDN by sharing the same query value, optimizing the captioning process jointly; this helps ensure that the generated captions match the visual content of an image. To our knowledge, this paper is the first work to jointly explore both visual and language attention in image captioning.
3 Reflective Decoding Network
The overall architecture of our framework is shown in Figure 2. Given an input image, our model first uses Faster R-CNN [ren2015faster] as the Encoder to obtain the visual features of objects in the image. The visual features are then fed to our Reflective Decoding Network (RDN) to generate the caption. Our RDN contains three components: (1) the Attention-based Recurrent Module, which attends to the visual features from the Encoder; (2) the Reflective Attention Module, which employs textual attention to model the compatibility between the current and past decoding hidden states and is thus able to capture more historical and comprehensive information for word decisions; (3) the Reflective Position Module, which introduces relative position information for each word in the generated caption and helps the model perceive the syntactic structure of sentences. RDN is thereby able to tackle the long-term dependency difficulty in caption decoding.
3.1 Object-Level Encoder
The encoder in the encoder-decoder framework aims to extract meaningful semantic representations from an input image. We leverage an object detection module (Faster R-CNN [ren2015faster]) with a pretrained ResNet-101 [he2016deep] to produce region-level representations. The set of extracted regional visual representations of an image is denoted as $V = \{v_1, \dots, v_k\}$, $v_i \in \mathbb{R}^D$, where $k$ denotes the number of extracted regions, $D$ denotes the feature dimension of each region, and $v_i$ is the mean-pooled convolutional feature within the $i$-th extracted region. Compared to the conventional uniform meshing method on CNN features, the object-level encoder focuses more on salient objects/regions in an image, which is closely related to the perception mechanism of the human visual system [buschman2007top].
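As a minimal sketch of this encoding step (with toy shapes, not the paper's $k \le 100$, $D = 2048$), the output of the encoder reduces to the region feature set $V$ plus a mean over regions used later as global image context:

```python
import numpy as np

def encode_regions(region_feats):
    """Collect the k region features (each already mean-pooled inside its
    box by the detector) into the set V, and also return the mean over all
    regions, which the decoder later uses as global image context.
    Shapes here are toy-sized, not the paper's k<=100, D=2048."""
    V = np.asarray(region_feats, dtype=float)  # (k, D) region feature set
    v_bar = V.mean(axis=0)                     # (D,) global context vector
    return V, v_bar

# toy example with 3 regions and 4-dim features
V, v_bar = encode_regions([[1., 0., 0., 2.],
                           [3., 0., 0., 0.],
                           [2., 0., 0., 1.]])
```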
3.2 Reflective Decoder
Given the set of regional image features $V$ produced by the encoder, the goal of the decoder is to generate the caption $Y = \{y_1, \dots, y_T\}$ consisting of $T$ words. The generated caption should not only capture the content of the image but also be meaningful and coherent. Specifically, in Figure 2, the Attention-based Recurrent Module is employed to selectively attend to the detected regional features and serves the basic function of a captioning decoder, while the Reflective Attention Module and Reflective Position Module are built on top of it to further enhance captioning quality. Thus, the complete Reflective Decoder is able to take both the historical coherence between words and syntactic structure information into consideration while generating image captions.
Attention-based Recurrent Module includes the first LSTM layer and the visual attention layer $A^{vis}$, which is designed mainly for top-down visual attention calculation. Its input at time step $t$ contains three concatenated parts: the mean-pooled image feature $\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i$, the embedding vector $W_e \Pi_t$ of the current input word, and the previous output $h^2_{t-1}$ from the second LSTM layer, where $\bar{v}$ represents the contextual information of the given image, $W_e \in \mathbb{R}^{E \times \Sigma}$ is the embedding matrix for the one-hot vector $\Pi_t$, $\Sigma$ is the size of the captioning vocabulary and $E$ is the embedding size. The update of the LSTM units in the first layer is defined as:

$h^1_t = \mathrm{LSTM}_1\big([\bar{v};\, W_e \Pi_t;\, h^2_{t-1}],\; h^1_{t-1}\big)$
For the visual attention layer $A^{vis}$, given the generated $h^1_t$ and the set of image features $V = \{v_1, \dots, v_k\}$, we calculate the normalized attention weight distribution over all the proposed object-level regions, denoted as $\alpha_t$:

$a_{i,t} = w_a^{\top} \tanh\big(W_{va} v_i + W_{ha} h^1_t\big), \qquad \alpha_t = \mathrm{softmax}(a_t)$

where $W_{va}$, $W_{ha}$ and $w_a$ are learned embedding matrices, and $\alpha_{i,t}$ denotes the attention probability for the $i$-th regional feature at time step $t$. The attended feature is then the weighted combination of the subregions, $\hat{v}_t = \sum_{i=1}^{k} \alpha_{i,t} v_i$, based on the weight distribution $\alpha_t$.
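This additive visual attention can be sketched in a few lines of NumPy; the weight names (`W_va`, `W_ha`, `w_a`) and the toy dimensions below are illustrative assumptions, not the trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_attention(V, h1, W_va, W_ha, w_a):
    """Additive visual attention: score each region v_i against the first
    LSTM's hidden state h^1_t, normalize with softmax, and return both the
    weights alpha_t and the attended feature v_hat_t."""
    scores = np.tanh(V @ W_va.T + h1 @ W_ha.T) @ w_a  # (k,) raw scores
    alpha = softmax(scores)                           # attention weights
    v_hat = alpha @ V                                 # weighted sum of regions
    return alpha, v_hat

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))                 # 5 regions, 8-dim features
h1 = rng.normal(size=8)                     # current hidden state
W_va, W_ha = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
w_a = rng.normal(size=8)
alpha, v_hat = visual_attention(V, h1, W_va, W_ha, w_a)
```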
3.2.1 Reflective Attention Module.
The Reflective Attention Module contains the reflective attention layer $A^{ref}$ combined with the second LSTM layer, which is designed to output the language description. Its input vector is the concatenation of the attended feature $\hat{v}_t$ and the hidden state $h^1_t$. Thus the update of the LSTM units in the second layer is denoted as:

$h^2_t = \mathrm{LSTM}_2\big([\hat{v}_t;\, h^1_t],\; h^2_{t-1}\big)$
Based on the current hidden state $h^2_t$ at time step $t$ and the set of past hidden states $\{h^2_1, \dots, h^2_{t-1}\}$, the reflective attention layer calculates the normalized weight distribution $\beta_t$ over all the generated hidden states, as shown in the top right of Figure 2:

$b_{i,t} = w_r^{\top} \tanh\big(W_{h} h^2_i + W_{h'} h^2_t\big), \qquad \beta_t = \mathrm{softmax}(b_t)$

where $W_h$, $W_{h'}$ and $w_r$ are three trainable parameter matrices, and $\beta_{i,t}$ denotes the attention probability for hidden state $h^2_i$ at time step $t$; it reflects the relevance between the word predicted at the $i$-th step and the current prediction (at the $t$-th step) by measuring the compatibility between their corresponding hidden states. We then calculate the attended hidden state $\hat{h}_t = \sum_{i=1}^{t} \beta_{i,t} h^2_i$.
The reflective decoding output $\hat{h}_t$ of the top attention layer is utilized to predict the word $y_t$ under the conditional probability distribution:

$p(y_t \mid y_{1:t-1}) = \mathrm{softmax}\big(W_p \hat{h}_t + b_p\big)$

where $W_p$ are the trainable weights and $b_p$ are the biases. By calculating the prediction in this way, all the generated hidden states play a role in word prediction, and the extent of their contributions can be clearly visualized, as demonstrated in Section 4.3.2.
It should be noted that our proposed Reflective Attention Module models the dependencies between pairs of words at different time steps explicitly, taking into account the corresponding hidden states. In contrast, LSTM memorizes the historical sequence information by balancing the overall relevance of all time steps instead of modeling the dependency for each pair of words specifically.
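The Reflective Attention Module can be sketched analogously to the visual attention: the same additive scoring, but applied to the second LSTM's hidden states up to the current step (parameter names and sizes are again illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reflective_attention(H, h_t, W_h, W_hp, w_r):
    """Score every generated hidden state h^2_i (rows of H, i <= t) against
    the current state h^2_t, normalize into beta_t, and return the attended
    hidden state h_hat_t = sum_i beta_{i,t} h^2_i."""
    scores = np.tanh(H @ W_h.T + h_t @ W_hp.T) @ w_r  # (t,) raw scores
    beta = softmax(scores)                            # reflective weights
    h_hat = beta @ H                                  # attended hidden state
    return beta, h_hat

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 6))      # hidden states of decoding steps 1..4
h_t = H[-1]                      # the current state also attends to itself
W_h, W_hp = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
w_r = rng.normal(size=6)
beta, h_hat = reflective_attention(H, h_t, W_h, W_hp, w_r)
```

As the caption grows, `H` gains one row per step, so later words can refer back to every earlier hidden state, which is exactly the long-range behavior the module is designed to add.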
3.2.2 Reflective Position Module.
It is often the case that many words have relatively fixed positions in a sentence due to the syntactic structure of natural language. For example, the numeral and subject words, i.e., 'a man' or 'a woman', mostly appear at the beginning of a sentence, while predicates tend to occupy the middle. So we propose the Reflective Position Module, which injects word position information during training as guidance for the sequence decoding model to perceive its relative position, or progress, within the whole sentence. When decoding the $t$-th word, its actual relative position $pos_t$ and predicted relative position $\widehat{pos}_t$ are calculated as:

$pos_t = \frac{t}{T}, \qquad \widehat{pos}_t = \sigma\big(W_{pos} \hat{h}_t\big)$

where $T$ is the length of the sentence, $\sigma$ is the sigmoid function and $W_{pos}$ is the trainable relative position embedding matrix. The Reflective Position Module, shown in the top left of Figure 2, aims to minimize the difference between $pos_t$ and $\widehat{pos}_t$ through a loss function, which refines the attended hidden state $\hat{h}_t$ mentioned in Section 3.2.1 by enabling it to perceive more sequential information about its relative position.
This differs from popular position embedding methods [gehring2017convolutional, vaswani2017attention], which add an absolute position embedding to the corresponding input features in each dimension. Our Reflective Position Module models the relative position information individually in a supervised way. A key benefit of this design is that it avoids potential inter-pollution between the regular input features and the position embedding, and equips our model with a strong perception of the relative position of each word in the caption. Thus, the syntactic structure of natural language can be well preserved.
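The position targets and prediction above can be sketched as follows; the projection `w_pos` is an illustrative stand-in for the trainable position matrix:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relative_positions(T):
    """Actual relative position pos_t = t / T for t = 1..T."""
    return np.arange(1, T + 1) / T

def predicted_position(h_hat, w_pos):
    """Predicted relative position: project the attended hidden state
    with an (illustrative) learned vector and squash into (0, 1)."""
    return sigmoid(h_hat @ w_pos)
```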
3.3 Training and Inference
Two kinds of losses are utilized for optimizing our RDN model. The first is the cross-entropy loss used in traditional captioning training, which minimizes the negative log-likelihood:

$L_{cap} = -\sum_{t=1}^{T} \log p\big(y^*_t \mid y^*_{1:t-1}, I\big)$

where $I$ is the given image, $y^* = \{y^*_1, \dots, y^*_T\}$ is the ground-truth caption, the probability $p$ is calculated as defined in Equation 7, and $y^*_0$ denotes the start-of-sentence token.
The second loss is the Position-Perceptive Loss $L_{pos}$:

$L_{pos} = \sum_{t=1}^{T} \big(pos_t - \widehat{pos}_t\big)^2$

where $pos_t$ and $\widehat{pos}_t$ are the actual and predicted relative positions defined in Equation 8; $L_{pos}$ is designed to minimize the gap between them.
The objective function for optimizing our RDN is defined as:

$L = L_{cap} + \lambda\, L_{pos}$

The trade-off parameter $\lambda$ balances the contribution between the traditional caption loss of the encoder-decoder framework and the Position-Perceptive Loss.
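Putting the two losses together, a minimal sketch of the objective (the squared-error form of the position gap is an assumption; the paper's exact distance may differ):

```python
import numpy as np

def rdn_objective(log_p_gt, pos_true, pos_pred, lam=0.02):
    """Combined loss L = L_cap + lambda * L_pos: negative log-likelihood of
    the ground-truth words plus the weighted position-perceptive loss
    (sketched here as a squared error over relative positions)."""
    L_cap = -np.sum(log_p_gt)                                    # cross-entropy term
    L_pos = np.sum((np.asarray(pos_true) - np.asarray(pos_pred)) ** 2)
    return L_cap + lam * L_pos
```

With `lam=0.02` as tuned in Section 4, the position term acts as a light regularizer rather than a competing objective.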
During the inference stage, since the length of the predicted sentence is unknown, the relative position information is removed from the input. During training, the previous ground-truth word is fed as input, a strategy known as the teacher forcing algorithm [williams1989learning]; because of the resulting discrepancy [lamb2016professor] between training and inference, where ground-truth tokens are unavailable, we instead use the previously predicted word as input at inference time, as in [chen2015mind, xu2015showattendtell]. Also, instead of greedy search, we adopt the popular beam search strategy, which iteratively keeps the top-$k$ best sentences at time step $t$ as candidates to generate the new top-$k$ sentences at time $t+1$.
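The beam search described above can be sketched as follows, with `step_fn` standing in for the decoder's per-step log-probabilities (a simplification: real implementations also batch the expansion and often length-normalize scores):

```python
def beam_search(step_fn, start_token, end_token, beam_size=5, max_len=20):
    """Minimal beam-search sketch: at each step, expand every candidate
    sequence with the next-word log-probabilities from step_fn(sequence)
    and keep only the top-k scoring sequences."""
    beams = [([start_token], 0.0)]     # (sequence, cumulative log-prob)
    complete = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)   # dict: token -> log-probability
            for tok, lp in log_probs.items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # finished sequences leave the beam; the rest keep expanding
            (complete if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    return max(complete + beams, key=lambda c: c[1])[0]
```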
4 Experiments
4.1 Datasets and Experimental Settings
COCO Dataset. The COCO captions dataset [chen2015microsoft] contains 82,783 images for training and 40,504 images for validation. Each image has five corresponding human-annotated captions. We adopt the 'Karpathy' split setting [karpathy2015deep], which includes 113,287 training images, 5K validation images and 5K testing images for offline evaluation. For the online server evaluation, all the images and captions in the dataset are used for training. Following the text preprocessing in [anderson2017bottom], we convert all captions to lower case and remove words that occur fewer than 5 times, obtaining a captioning vocabulary of 10,010 words.
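The vocabulary construction described above amounts to a frequency cutoff; a minimal sketch (rare words would additionally be mapped to an `<unk>` token, omitted here):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Lower-case every caption, count word frequencies, and keep only
    words occurring at least min_count times (sorted for determinism)."""
    counts = Counter()
    for cap in captions:
        counts.update(cap.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)
```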
Visual Genome Dataset. Visual Genome [krishna2017visual] is a large dataset for modeling the interactions and relations between objects within an image. It consists of 108K images with densely annotated objects, attributes and pairwise relations. Unlike [yao2018exploring], we only utilize the annotated object and attribute data to pretrain the object-level encoder and discard the pairwise relation data. We follow the same data split setting as [anderson2017bottom], with 98K images for training and 5K images each for validation and testing. After cleaning the annotated object and attribute strings, we obtain a dataset of 400 attribute and 1,600 object classes to train our Faster R-CNN model.
Evaluation Metrics. To objectively evaluate the performance of our captioning model, we use five widely accepted automatic evaluation metrics, including CIDEr [vedantam2015cider], SPICE [anderson2016spice], BLEU [papineni2002bleu], METEOR [denkowski2014meteor] and ROUGE-L [lin2004rouge].
Implementation Details. We implement our RDN using Caffe [jia2014caffe]. To train the object-level encoder, we use Faster R-CNN with ResNet-101 pre-trained for image classification on ImageNet [russakovsky2015imagenet] and further refine it on the Visual Genome dataset. For each image, we set the IoU thresholds for region proposal suppression and object prediction to 0.7 and 0.3, respectively. For the remaining image subregions, we apply a filter threshold of 0.2, rank the leftover boxes by their confidence scores from high to low, and choose at most the top 100 as the final feature representations. Each region feature, of dimension 2,048, is the global average pooling of the Res5c layer output.
We set the word embedding size and the hidden size of each LSTM layer to 1,000. The dimensions of the attention layers $A^{vis}$ and $A^{ref}$ are both set to 512. During training, the initial learning rate is set to 0.01 and a polynomial decay strategy is adopted to decay the effective learning rate to zero by 70k iterations, using a batch size of 100. We tune the trade-off parameter $\lambda$ on the 'Karpathy' validation split to obtain the best performance and finally set it to 0.02. We adopt data augmentation only for the online test server submission, boosting performance by flipping the original image and randomly cropping 90% of it. During decoding, the beam search size is set to 5.
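The polynomial decay schedule mentioned above can be sketched as below; the decay power is an assumed default (1.0, i.e., linear decay), since the text does not specify it:

```python
def poly_lr(base_lr, iteration, max_iter, power=1.0):
    """Polynomial learning-rate decay from base_lr down to zero at
    max_iter (base_lr=0.01, max_iter=70000 in the text); power=1.0 is
    an assumed default, not stated in the paper."""
    frac = min(iteration, max_iter) / max_iter
    return base_lr * (1.0 - frac) ** power
```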
4.2 Ablation Study on Reflective Modules
To study the effects of the Reflective Attention Module and Reflective Position Module in our model, an ablation experiment is designed to compare the performance of the following combinations: (1) Baseline: the RDN without the Reflective Attention Module and Reflective Position Module; (2) RDN-Pos: the RDN with the Reflective Attention Module removed and only the position module retained, so the number of attention layers in the decoder is reduced to one; (3) RDN-Att: the RDN with the Reflective Position Module removed, cutting off the relative position information input; (4) RDN: the complete RDN implementation.
In Table 1, it can be observed that both the Reflective Position Module and Reflective Attention Module are important for our model, and the complete RDN improves captioning performance over the baseline on all metrics. The fact that RDN-Pos outperforms the baseline model validates the contribution of the Reflective Position Module in enhancing the quality of decoding hidden states during caption generation. Likewise, by injecting the Reflective Attention Module, RDN-Att performs clearly better than the baseline, which shows the importance of the model's ability to capture long-term dependencies between words. In particular, with a suitable combination of the two modules, the complete RDN achieves the best result, with a CIDEr score of 115.7, BLEU-4 of 37.0 and BLEU-3 of 47.9, improving over our baseline model by 2.0%, 2.2% and 1.5%, respectively, a considerable advancement over the benchmark. Compared to the baseline model with 1.15B parameters in total, RDN is only 0.84% larger in model size, which is negligible.
4.3 Performance Comparison and Analysis
We compare our proposed RDN with other state-of-the-art image captioning methods from different aspects, both offline and online. The latest and most representative works include: (1) Adaptive [lu2017knowing], which proposes adaptive attention by designing a visual sentinel gate that lets the captioning model decide whether to attend to the image features or rely solely on the sequential language model; (2) LSTM-A3 [yao2017boosting], which incorporates high-level semantic attribute information into the encoder-decoder model; (3) Up-Down [anderson2017bottom], which introduces a bottom-up and top-down attention mechanism to enable attention at the level of objects or salient subregions; and (4) RFNet [jiang2018recurrent], which uses multiple kinds of CNNs to extract complementary image features and generate a more informative representation for the decoder.
For fair comparison, our model and the baseline use the standard ResNet-101 as the basic encoder architecture, and all results reported on the test portion of the MSCOCO 'Karpathy' split are trained without additional CIDEr optimization [rennie2017self]. GCN-LSTM [yao2018exploring] is not included because it uses additional densely annotated pairwise relation data between objects to pretrain a semantic relation detector and build convolutional graphs. We only adopt the CIDEr optimization strategy for the online server submission, since directly optimizing the CIDEr metric has little effect on perceived caption quality in human evaluation [liu2016optimization], and small differences in its optimization implementation can substantially influence captioning performance.
4.3.1 Quantitative Analysis
For offline evaluation, we compare the performance of different models on the 'Karpathy' split in both single-model and ensemble settings. In Table 2, it can be observed that our single RDN achieves the best results among all existing captioning methods across six evaluation metrics, including all the BLEU entries, ROUGE-L and CIDEr; it performs on par with RFNet in SPICE and is slightly inferior to it in METEOR. Different from previous captioning models (Up-Down, RFNet, Review Net, Adaptive, etc.) that boost performance by extracting more indicative and compact visual representations, the enhancement of our captioning model is attributable solely to the better reasoning and inference ability of the decoder on the target side. Moreover, from Table 3, we can see that our ensembled RDN outperforms other ensemble models on most evaluation metrics, with the highest CIDEr score of 117.3, and is inferior to RFNet only in the METEOR and SPICE entries. Our ensemble combines 6 single models with different random seed initializations, while the RFNet ensemble is composed of 4 RFNets with a total of 20 groups of different image representations.
Online Evaluation on the COCO Testing Server. We also compare our model with published state-of-the-art captioning systems on the COCO Testing Server with 5 (c5) and 40 (c40) reference sentences, as shown in Table 4. Using an ensemble of 9 CIDEr-optimized models, our RDN achieves leading performance on all metrics while performing on par with RFNet [jiang2018recurrent]. Surprisingly, RFNet performs much better in the online evaluation than in the offline case, where it trails our model (and even Up-Down [anderson2017bottom] in some cases), as shown in Table 2. Since the code of RFNet is not released, it is hard to investigate this inconsistency. Nevertheless, our RDN achieves superior performance on all the c40 entries. Compared to c5, c40 has far more reference sentences, and existing evaluation experiments show it achieves higher correlation with human judgement [chen2015microsoft, vedantam2015cider]. Moreover, our model is simpler and more elegant, with only one encoder-decoder in the single model, compared to RFNet, which utilizes multiple encoders (ResNet, DenseNet [huang2017densely], Inception-V3 [szegedy2016rethinking], Inception-V4, and Inception-ResNet-V2 [szegedy2017inception]) to extract 5 groups of features and includes time-consuming feature fusion steps to produce the final thought vectors. Besides, our RDN boosts captioning performance by optimizing the decoding stage, while RFNet mainly focuses on improving the encoder; it is thus a promising extension to apply the encoding mechanisms of RFNet to our RDN. Compared to Up-Down [anderson2017bottom], which uses a traditional LSTM with an object-level encoder, our CIDEr-c40, CIDEr-c5 and METEOR-c40 scores are improved by 3.9%, 2.7% and 2.9%, respectively.
Evaluation on Hard Image Captioning. We further investigate the effect of the average length of annotations (ground-truth captions) on captioning performance, since images with longer average annotations generally contain more complex scenes and are thus harder to caption. Specifically, we rank the whole 'Karpathy' test set (5,000 images) by average annotation length in descending order and extract four subsets of different sizes (the full set, top-1000, top-500 and top-300, respectively). A smaller subset corresponds to longer average annotations and thus implies harder image captioning. Figure 3 shows the comparison between our RDN and Up-Down [anderson2017bottom] (the main difference between the two models is that Up-Down uses a traditional LSTM). It reveals that the performance of both models decreases as the average annotation length increases, reflecting that captioning becomes harder. However, our model exhibits a larger advantage over Up-Down in the harder cases, which in turn validates the ability of our RDN to capture long-term dependencies within captions. We also provide a similar comparison between our RDN and Att2in [rennie2017self] on hard captioning in the supplementary file.
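The hard-case protocol described above reduces to ranking images by average caption length; a minimal sketch (the dictionary layout of the annotations is an illustrative assumption):

```python
def hard_subsets(annotations, sizes=(1000, 500, 300)):
    """Rank image ids by the average word count of their ground-truth
    captions, descending, and return the top-N ids per subset size."""
    avg_len = {img: sum(len(c.split()) for c in caps) / len(caps)
               for img, caps in annotations.items()}
    ranked = sorted(avg_len, key=avg_len.get, reverse=True)
    return {n: ranked[:n] for n in sizes}
```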
4.3.2 Qualitative Analysis
To investigate the interpretability of the RDN model's improvement, some qualitative comparisons of captioning results are shown in Figure 4. Compared to the base decoder, our RDN is able to generate more detailed and discriminative descriptions of images. Take the first case in Figure 4 as an example: the base decoder provides a general and reasonable caption for the image, but it cannot recognize the word 'station', which actually applies to the image, while our model successfully infers it based on the previously generated words, especially the closely related words 'train', 'tracks' and 'sitting'. In the reflective-weight visualization, the generated words contributing most to the predicted word are usually strongly related in vocabulary coherence, such as the pairs 'boat' and 'water', or 'beach' and 'surfboards'. We show additional results in the supplementary material.
In Figure 5, compared to other captioning models, our RDN is able to predict each word and its relative position in the sentence simultaneously during caption generation. The predicted relative position (in blue) of each word is very close to its actual relative position in the sentence, which demonstrates the strong position-perceptive ability of our model in capturing the syntactic structure of a sentence.
5 Conclusion
We have presented a novel architecture, the Reflective Decoding Network (RDN), which explicitly explores the coherence between words in a caption and enhances the long-term sequence inference ability of the LSTM. In particular, the attention mechanism applied in both the visual and textual domains and the proposed position-perceptive scheme maximize the reference information available to the captioning model. We also show how the learned attention in the textual domain provides interpretability of the caption generation process from a new perspective. Extensive experiments conducted on the standard and hard COCO image captioning settings, with superior performance, validate the effectiveness of our proposal. For future work, we are interested in extending our model to source code captioning and text summarization.