M²: Meshed-Memory Transformer for Image Captioning
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored.
With the aim of filling this gap, we present M², a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions that integrates learned a priori knowledge, and it uses mesh-like connectivity at the decoding stage to exploit both low- and high-level features.
Experimentally, we investigate the performance of the Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the “Karpathy” test split and on the online test server. We also assess its performance when describing objects unseen in the training set.
Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
1 Introduction
Image captioning is the task of describing the visual content of an image in natural language. As such, it requires an algorithm to understand and model the relationships between visual and textual elements, and to generate a sequence of output words. This has usually been tackled via Recurrent Neural Network models [42, 17, 45, 44, 7], in which the sequential nature of language is modeled with the recurrent relations of either RNNs or LSTMs. Additive attention or graph-like structures are often added to the recurrence [45, 14] in order to model the relationships between image regions, words, and possibly tags.
This schema has remained the dominant approach in the last few years, with the exception of the investigation of Convolutional language models, which however did not become a leading choice. The recent advent of fully-attentive models, in which the recurrent relation is abandoned in favour of self-attention, offers unique opportunities in terms of set and sequence modeling performance, as testified by the Transformer and BERT models and their applications to retrieval and video understanding. This setting also offers novel architectural modeling capabilities, as for the first time the attention operator is used in a multi-layer and extensible fashion. Nevertheless, the multi-modal nature of image captioning demands specific architectures, different from those employed for the understanding of a single modality.
Following this premise, we investigate the design of a novel fully-attentive approach for image captioning. Our architecture takes inspiration from the Transformer model for machine translation and incorporates two key novelties with respect to all previous image captioning algorithms: (i) image regions and their relationships are encoded in a multi-level fashion, in which both low-level and high-level relations are taken into account; when modeling these relationships, our model can learn and encode a priori knowledge by using persistent memory vectors. (ii) The generation of the sentence, done with a multi-layer architecture, exploits both low- and high-level visual relationships instead of having just a single input from the visual modality. This is achieved through a learned gating mechanism, which weights multi-level contributions at each stage. As this creates a mesh connectivity schema between encoder and decoder layers, we name our model Meshed-Memory Transformer, or M² Transformer for short. Figure 1 depicts a schema of the architecture.
Experimentally, we explore different fully-attentive baselines and recent proposals, gaining insights on the performance of fully-attentive models in image captioning. Our M² Transformer, when tested on the COCO benchmark, achieves a new state of the art on the “Karpathy” test set, in both single-model and ensemble configurations. Most importantly, it surpasses existing proposals on the online test server, ranking first among published algorithms.
Contributions. To sum up, our contributions are as follows:
We propose a novel fully-attentive image captioning algorithm. Our model encapsulates a multi-layer encoder for image regions and a multi-layer decoder which generates the output sentence. To exploit both low-level and high-level contributions, encoding and decoding layers are connected in a mesh-like structure, weighted through a learnable gating mechanism;
In our visual encoder, relationships between image regions are encoded in a multi-level fashion exploiting learned a priori knowledge, which is modeled via persistent memory vectors;
We show that the M² Transformer surpasses all previous proposals for image captioning, achieving a new state of the art on the online COCO evaluation server;
As a complementary contribution, we conduct experiments to compare different fully-attentive architectures on image captioning and validate the performance of our model on novel object captioning, using the recently proposed nocaps dataset. Finally, to improve reproducibility and foster new research in the field, we will publicly release the source code and trained models of all experiments.
2 Related work
A broad collection of methods has been proposed in the field of image captioning in the last few years. Earlier captioning approaches were based on the generation of simple templates, filled by the output of an object detector or attribute predictor [34, 47]. With the advent of Deep Neural Networks, most captioning techniques have employed RNNs as language models and used the output of one or more layers of a CNN to encode visual information and condition language generation [43, 33, 9, 16]. On the training side, while initial methods were based on a time-wise cross-entropy training, a notable achievement has been made with the introduction of Reinforcement Learning, which enabled the use of non-differentiable caption metrics as optimization objectives [33, 31, 25]. On the image encoding side, instead, single-layer attention mechanisms have been adopted to incorporate spatial knowledge, initially from a grid of CNN features [45, 26, 50], and then using image regions extracted with an object detector [4, 29, 27]. To further improve the encoding of objects and their relationships, Yao et al. proposed to use a graph convolution neural network in the image encoding phase to integrate semantic and spatial relationships between objects. Along the same line, Yang et al. used a multi-modal graph convolution network to modulate scene graphs into visual representations.
Despite their wide adoption, RNN-based models suffer from their limited representation power and sequential nature. After the emergence of Convolutional language models, which have been explored for captioning as well, new fully-attentive paradigms [39, 8, 36] have been proposed and achieved state-of-the-art results in machine translation and language understanding tasks. Likewise, some recent approaches have investigated the application of the Transformer model to the image captioning task.
In a nutshell, the Transformer comprises an encoder made of a stack of self-attention and feed-forward layers, and a decoder which uses self-attention on words and cross-attention over the output of the last encoder layer. Herdade et al. used the Transformer architecture for image captioning and incorporated geometric relations between detected input objects. In particular, they computed an additional geometric weight between object pairs which is used to scale attention weights. Liu et al. used the Transformer in a model that exploits visual information and additional semantic knowledge given by an external tagger. On a related line, Huang et al. introduced an extension of the attention operator in which the final attended information is weighted by a gate guided by the context. In their approach, a Transformer-like encoder was paired with an LSTM decoder.
While all the aforementioned approaches have exploited the original Transformer architecture, in this paper we devise a novel fully-attentive model that improves the design of both the image encoder and the language decoder, introducing two novel attention operators and a different design of the connectivity between encoder and decoder.
3 Meshed-Memory Transformer
Our model can be conceptually divided into an encoder and a decoder module, both made of stacks of attentive layers. While the encoder is in charge of processing regions from the input image and devising relationships between them, the decoder reads from the output of each encoding layer to generate the output caption word by word. All intra-modality and cross-modality interactions between word and image-level features are modeled via scaled dot-product attention, without using recurrence. Attention operates on three sets of vectors, namely a set of queries $Q$, keys $K$ and values $V$, and takes a weighted sum of value vectors according to a similarity distribution between query and key vectors. In the case of scaled dot-product attention, the operator can be formally defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V, \tag{1}$$
where $Q$ is a matrix of $n_q$ query vectors, $K$ and $V$ both contain $n_k$ keys and values, all with the same dimensionality, and $d$ is a scaling factor.
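The scaled dot-product attention described above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the row-vector layout (one query, key or value per row) and the toy shapes are assumptions, not taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_q, d) queries; K: (n_k, d) keys; V: (n_k, d) values.
    Returns an (n_q, d) array holding one weighted sum of value rows per query.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_q, n_k) query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # toy sizes, chosen only for the example
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)
```

Because every row of the attention weights sums to one, each output row is a convex combination of the value vectors.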
3.1 Memory-Augmented Encoder
Given a set of image regions $X$ extracted from an input image, attention can be used to obtain a permutation-invariant encoding of $X$ through the self-attention operations used in the Transformer. In this case, queries, keys, and values are obtained by linearly projecting the input features, and the operator can be defined as
$$\mathcal{S}(X) = \mathrm{Attention}(W_q X, W_k X, W_v X), \tag{2}$$
where $W_q$, $W_k$, $W_v$ are matrices of learnable weights. The output of the self-attention operator is a new set of elements $\mathcal{S}(X)$, with the same cardinality as $X$, in which each element of $X$ is replaced with a weighted sum of the values, i.e. of linear projections of the input (Eq. 1).
Notably, attentive weights depend solely on the pairwise similarities between linear projections of the input set itself. Therefore, the self-attention operator can be seen as a way of encoding pairwise relationships inside the input set. When using image regions (or features derived from image regions) as the input set, the self-attention operator can naturally encode the pairwise relationships between regions that are needed to understand the input image before describing it.
This peculiarity in the definition of self-attention has, however, a significant limitation. Because everything depends solely on pairwise similarities, self-attention cannot model a priori knowledge on relationships between image regions. For example, given one region encoding a man and another encoding a basketball, it would be difficult to infer the concept of player or game without any a priori knowledge. Likewise, given regions encoding eggs and toast, the fact that the picture depicts a breakfast could be easily inferred using a priori knowledge on relationships.
Memory-Augmented Attention. To overcome this limitation of self-attention, we propose a memory-augmented attention operator. In our proposal, the set of keys and values used for self-attention is extended with additional “slots” which can encode a priori information. To stress that a priori information should not depend on the input set $X$, the additional keys and values are implemented as plain learnable vectors which can be directly updated via SGD. Formally, the operator is defined as:
$$\begin{aligned} \mathcal{M}_{mem}(X) &= \mathrm{Attention}(W_q X, K, V) \\ K &= [W_k X, M_k] \\ V &= [W_v X, M_v], \end{aligned} \tag{3}$$
where $M_k$ and $M_v$ are learnable matrices with $n_m$ rows, and $[\cdot, \cdot]$ indicates concatenation. Intuitively, by adding learnable keys and values, attention can retrieve learned knowledge which is not already embedded in $X$. At the same time, our formulation leaves the set of queries unaltered. This helps to avoid hallucination, given that knowledge is always retrieved because of similarities with queries which are seen in the image.
Just like the self-attention operator, our memory-augmented attention can be applied in a multi-head fashion. In this case, the memory-augmented attention operation is repeated $h$ times, using different projection matrices and different learnable memory slots for each head. Then, we concatenate the results from different heads and apply a linear projection.
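The following single-head sketch illustrates the idea of extending keys and values with memory slots. It uses NumPy with a row-vector convention (`X @ W` rather than `W X`); the function name, the random initialisation and the toy sizes are assumptions made for the example, and the multi-head extension is left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memory_augmented_attention(X, W_q, W_k, W_v, M_k, M_v):
    """Single-head memory-augmented attention (row-vector convention).

    X: (n, d) region features; W_q, W_k, W_v: (d, d) projections;
    M_k, M_v: (n_m, d) learnable memory slots that do not depend on X.
    """
    Q = X @ W_q                                   # queries come only from the regions
    K = np.concatenate([X @ W_k, M_k], axis=0)    # (n + n_m, d)
    V = np.concatenate([X @ W_v, M_v], axis=0)    # (n + n_m, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n, n + n_m)
    return weights @ V                            # (n, d): same cardinality as X

rng = np.random.default_rng(0)
n, n_m, d = 4, 3, 8   # toy sizes, not the paper's settings
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
M_k, M_v = rng.normal(size=(n_m, d)), rng.normal(size=(n_m, d))
out = memory_augmented_attention(X, W_q, W_k, W_v, M_k, M_v)
```

Since only keys and values are extended, the output keeps one row per input region regardless of how many memory slots are used.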
Encoding layer. We embed our memory-augmented operator into a Transformer-like layer: the output of the memory-augmented attention is fed to a position-wise feed-forward layer composed of two affine transformations with a single non-linearity, which are independently applied to each element of the set. Formally,
$$\mathcal{F}(X)_i = U \sigma(V X_i + b) + c, \tag{4}$$
where $X_i$ indicates the $i$-th vector of the input set, and $\mathcal{F}(X)_i$ the $i$-th vector of the output. Also, $\sigma(\cdot)$ is the ReLU activation function, $V$ and $U$ are learnable weight matrices, and $b$ and $c$ are bias terms.
Each of these sub-components (memory-augmented attention and position-wise feed-forward) is then encapsulated within a residual connection and a layer norm operation. The complete definition of an encoding layer can be finally written as:
$$\begin{aligned} Z &= \mathrm{AddNorm}(\mathcal{M}_{mem}(X)) \\ \tilde{X} &= \mathrm{AddNorm}(\mathcal{F}(Z)), \end{aligned} \tag{5}$$
where $\mathrm{AddNorm}$ indicates the composition of a residual connection and of a layer normalization.
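As a rough sketch of how the two sub-components fit together, the snippet below wires an arbitrary attention function and a position-wise feed-forward block through residual connections and layer normalization. The names are hypothetical, the layer norm has no learned gain or bias, and a uniform-averaging function stands in for the memory-augmented attention; all of these are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

def feed_forward(x, V, b, U, c):
    """Position-wise feed-forward: two affine maps with a ReLU in between."""
    return np.maximum(x @ V + b, 0.0) @ U + c

def encoding_layer(X, attention_fn, V, b, U, c):
    """One encoding layer: attention, then feed-forward, each wrapped in add_norm."""
    Z = add_norm(X, attention_fn(X))
    return add_norm(Z, feed_forward(Z, V, b, U, c))

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16   # toy sizes, chosen only for the example
X = rng.normal(size=(n, d))
V, b = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
U, c = rng.normal(size=(d_ff, d)), np.zeros(d)

# Stand-in attention: every element attends uniformly to the whole set.
uniform_attention = lambda X: np.broadcast_to(X.mean(axis=0), X.shape)
out = encoding_layer(X, uniform_attention, V, b, U, c)
```

Stacking several such layers and keeping every intermediate output would give the multi-level representation that the decoder reads from.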
Full encoder. Given the aforementioned structure, multiple encoding layers are stacked in sequence, so that the $i$-th layer consumes the output set computed by layer $i-1$. This amounts to creating multi-level encodings of the relationships between image regions, in which higher encoding layers can exploit and refine relationships already identified by previous layers, possibly using a priori knowledge. A stack of $N$ encoding layers will therefore produce a multi-level output $\tilde{\mathcal{X}} = (\tilde{X}^1, \ldots, \tilde{X}^N)$, obtained from the outputs of each encoding layer.
3.2 Meshed Decoder
Our decoder is conditioned on both previously generated words and region encodings, and is in charge of generating the next tokens of the output caption. Here, we exploit the aforementioned multi-level representation of the input image while still building a multi-layer structure. To this aim, we devise a meshed attention operator which, unlike the cross-attention operator of the Transformer, can take advantage of all encoding layers during the generation of the sentence.
Meshed Cross-Attention. Given an input sequence of vectors $Y$, and outputs from all $N$ encoding layers $\tilde{\mathcal{X}} = (\tilde{X}^1, \ldots, \tilde{X}^N)$, the Meshed Attention operator connects $Y$ to all elements in $\tilde{\mathcal{X}}$ through gated cross-attentions. Instead of attending only to the last encoding layer, we perform a cross-attention with all encoding layers. These multi-level contributions are then summed together after being modulated. Formally, our meshed attention operator is defined as
$$\mathcal{M}_{mesh}(\tilde{\mathcal{X}}, Y) = \sum_{i=1}^{N} \alpha_i \odot \mathcal{C}(\tilde{X}^i, Y), \tag{6}$$
where $\mathcal{C}(\cdot, \cdot)$ stands for the encoder-decoder cross-attention, computed using queries from the decoder and keys and values from the encoder:
$$\mathcal{C}(\tilde{X}^i, Y) = \mathrm{Attention}(W_q Y, W_k \tilde{X}^i, W_v \tilde{X}^i), \tag{7}$$
and $\alpha_i$ is a matrix of weights having the same size as the cross-attention results. Weights in $\alpha_i$ modulate both the single contribution of each encoding layer, and the relative importance between different layers. These are computed by measuring the relevance between the result of the cross-attention computed with each encoding layer and the input query, as follows:
$$\alpha_i = \sigma\left(W_i \left[Y, \mathcal{C}(\tilde{X}^i, Y)\right] + b_i\right), \tag{8}$$
where $[\cdot, \cdot]$ indicates concatenation, $\sigma$ is the sigmoid activation, $W_i$ is a $2d \times d$ weight matrix, and $b_i$ is a learnable bias vector.
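The gated sum over encoder layers can be illustrated with the single-head NumPy sketch below. Sharing the cross-attention projections across encoder layers, the row-vector layout and all names are assumptions made for brevity, and the output scaling mentioned in the implementation details is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attention(Y, X_i, W_q, W_k, W_v):
    """Queries from the decoder input Y, keys and values from one encoder output X_i."""
    Q, K, V = Y @ W_q, X_i @ W_k, X_i @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def meshed_attention(Y, enc_outputs, W_q, W_k, W_v, gate_W, gate_b):
    """Sum of cross-attentions over all encoder layers, each modulated by its own gate."""
    out = np.zeros_like(Y)
    for X_i, W_i, b_i in zip(enc_outputs, gate_W, gate_b):
        C_i = cross_attention(Y, X_i, W_q, W_k, W_v)                       # (n_y, d)
        alpha_i = sigmoid(np.concatenate([Y, C_i], axis=-1) @ W_i + b_i)   # (n_y, d)
        out += alpha_i * C_i                                               # element-wise gating
    return out

rng = np.random.default_rng(0)
N, n_regions, n_words, d = 3, 5, 4, 8   # toy sizes, chosen only for the example
enc_outputs = [rng.normal(size=(n_regions, d)) for _ in range(N)]
Y = rng.normal(size=(n_words, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
gate_W = [rng.normal(size=(2 * d, d)) for _ in range(N)]
gate_b = [np.zeros(d) for _ in range(N)]
out = meshed_attention(Y, enc_outputs, W_q, W_k, W_v, gate_W, gate_b)
```

Because each gate is an independent sigmoid, several encoder layers can contribute strongly at the same time, which is the behaviour contrasted with softmax gating in the ablation study.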
Architecture of decoding layers. As for encoding layers, we apply our meshed attention in a multi-head fashion. As the prediction of a word should only depend on previously predicted words, the decoder layer comprises a masked self-attention operation which connects queries derived from the $t$-th element of its input sequence $Y$ with keys and values obtained from the left-hand subsequence, i.e. $Y_{\le t}$. Also, the decoder layer contains a position-wise feed-forward layer (as in Eq. 4), and all components are encapsulated within $\mathrm{AddNorm}$ operations (a residual connection followed by layer normalization). The final structure of the decoder layer can be written as:
$$\begin{aligned} Z &= \mathrm{AddNorm}\left(\mathcal{M}_{mesh}\left(\tilde{\mathcal{X}}, \mathrm{AddNorm}(\mathcal{S}_{mask}(Y))\right)\right) \\ \tilde{Y} &= \mathrm{AddNorm}(\mathcal{F}(Z)), \end{aligned} \tag{9}$$
where $Y$ is the input sequence of vectors and $\mathcal{S}_{mask}$ indicates a masked self-attention over time. Finally, our decoder stacks together multiple decoder layers, helping to refine both the understanding of the textual input and the generation of next tokens. Overall, the decoder takes as input word vectors, and the $t$-th element of its output sequence encodes the prediction of a word at time $t+1$, conditioned on $Y_{\le t}$. After taking a linear projection and a softmax operation, this encodes a probability over words in the dictionary.
3.3 Training details
Following a standard practice in image captioning [31, 33, 4], we pre-train our model with a word-level cross-entropy loss (XE) and finetune the sequence generation using reinforcement learning. When training with XE, the model is trained to predict the next token given previous ground-truth words; in this case, the input sequence for the decoder is immediately available and the computation of the entire output sequence can be done in a single pass, parallelizing all operations over time.
When training with reinforcement learning, we employ a variant of the self-critical sequence training approach on sequences sampled using beam search: to decode, we sample the top-$k$ words from the decoder probability distribution at each timestep, and always maintain the top-$k$ sequences with highest probability. As sequence decoding is iterative in this step, the aforementioned parallelism over time cannot be exploited. However, intermediate keys and values used to compute the output token at time $t$ can be reused in the next iterations.
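A generic beam search of the kind described above can be sketched as follows. The `step_log_probs` callback, the toy three-word vocabulary and the end-of-sequence handling are hypothetical stand-ins for a real decoder, introduced only for illustration.

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos):
    """Keep the beam_size highest-scoring sequences at every timestep.

    step_log_probs(prefix) must return a list of log-probabilities over the vocabulary.
    Returns a list of (sequence, log-probability) pairs, best first.
    """
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:          # finished sequences are carried over
                candidates.append((seq, score))
                continue
            log_probs = step_log_probs(seq)
            top_words = sorted(range(len(log_probs)),
                               key=lambda w: log_probs[w], reverse=True)[:beam_size]
            for w in top_words:                 # expand with the top-k words only
                candidates.append((seq + (w,), score + log_probs[w]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

def toy_model(prefix):
    """Toy distribution over 3 words; word 2 plays the role of end-of-sequence."""
    probs = [0.6, 0.3, 0.1] if len(prefix) < 2 else [0.1, 0.1, 0.8]
    return [math.log(p) for p in probs]

beams = beam_search(toy_model, beam_size=2, max_len=5, eos=2)
best_sequence, best_log_prob = beams[0]
```

In a real decoder, `step_log_probs` would run the network on the current prefix, which is where re-using cached keys and values pays off.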
Following previous works, we use the CIDEr-D score as reward, as it correlates well with human judgment. We baseline the reward using the mean of the rewards rather than greedy decoding as done in previous methods [33, 4], as we found it to slightly improve the final performance. The final gradient expression for one sample is thus:
$$\nabla_\theta L(\theta) = -\frac{1}{k} \sum_{i=1}^{k} \left( \left( r(w^i) - b \right) \nabla_\theta \log p(w^i) \right), \tag{10}$$
where $w^i$ is the $i$-th sentence in the beam, $r(\cdot)$ is the reward function, and $b = \left( \sum_i r(w^i) \right) / k$ is the baseline, computed as the mean of the rewards obtained by the sampled sequences. At prediction time, we decode again using beam search, and keep the sequence with highest predicted probability among those in the last beam.
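To make the mean-reward baseline concrete, the sketch below computes a surrogate loss whose gradient with respect to the model parameters (treating rewards as constants) has the form described above. The function name and the toy numbers are assumptions; in practice the log-probabilities would come from the decoder and be differentiated by an autograd framework.

```python
import numpy as np

def scst_mean_baseline_loss(log_probs, rewards):
    """Surrogate loss for self-critical training with a mean-reward baseline.

    log_probs: log p(w^i) for each of the k sampled sentences.
    rewards:   r(w^i), e.g. CIDEr-D scores, treated as constants.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()             # b: mean reward over the beam
    advantages = rewards - baseline       # r(w^i) - b
    loss = -(advantages * log_probs).mean()
    return loss, advantages

# Toy beam of k = 3 sentences (made-up values).
loss, advantages = scst_mean_baseline_loss(log_probs=[-2.0, -1.0, -3.0],
                                           rewards=[1.0, 0.5, 0.0])
```

Sentences scoring above the beam average get a positive advantage and have their probability pushed up, while below-average ones are pushed down; the advantages always sum to zero.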
4 Experiments
4.1 Datasets
We first evaluate our model on the COCO dataset, which is the most commonly used test-bed for image captioning. Then, we assess the captioning of novel objects by testing on the recently proposed nocaps dataset.
COCO. The dataset contains more than images, each of them annotated with different captions. We follow the splits provided by Karpathy et al., where images are used for validation, for testing and the rest for training. We also evaluate the model on the COCO online test server, composed of images for which annotations are not made publicly available.
nocaps. The dataset consists of images taken from the Open Images  validation and test sets, each annotated with human-generated captions. Images are divided into validation and test splits, respectively composed of and elements. Images can be further grouped into three subsets depending on the nearness to COCO, namely in-domain, near-domain, and out-of-domain images. Under this setting, we use COCO as training data and evaluate our results on the nocaps test server.
4.2 Experimental settings
Implementation details. To represent image regions, we use Faster R-CNN  with ResNet-101  finetuned on the Visual Genome dataset [20, 4], thus obtaining a -dimensional feature vector for each region. To represent words, we use one-hot vectors and linearly project them to the input dimensionality of the model . We also employ sinusoidal positional encodings  to represent word positions inside the sequence and sum the two embeddings before the first decoding layer.
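The word-side input described above (a linear projection of one-hot vectors summed with sinusoidal positional encodings) can be sketched as follows. The sinusoidal formula is the standard one from the original Transformer; the vocabulary size, model dimensionality and variable names are toy assumptions.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal position table; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]             # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, even_dims / d_model)
    table = np.zeros((n_positions, d_model))
    table[:, 0::2] = np.sin(angles)   # even dimensions
    table[:, 1::2] = np.cos(angles)   # odd dimensions
    return table

vocab_size, d_model = 10, 8   # toy values, not the paper's settings
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))   # linear projection applied to one-hot vectors
token_ids = np.array([3, 1, 4])

word_emb = E[token_ids]   # row lookup, equivalent to one_hot(token_ids) @ E
decoder_input = word_emb + sinusoidal_positions(len(token_ids), d_model)
```

Projecting a one-hot vector selects a single row of the projection matrix, which is why an index lookup is used instead of an explicit matrix product.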
In our model, we set the dimensionality of each layer to , the number of heads to , and the number of memory vectors to . We employ dropout with keep probability after each attention and feed-forward layer. In our meshed attention operator (Eq. 6), we normalize the output with a scaling factor of . Pre-training with XE is done following the learning rate scheduling strategy of  with a warmup equal to iterations. Then, during CIDEr-D optimization, we use a fixed learning rate of . We train all models using the Adam optimizer , a batch size of , and a beam size equal to .
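Assuming the warmup-based schedule referred to above is the inverse-square-root schedule of the original Transformer paper, the learning rate at each iteration could be computed as in this sketch. The function name is hypothetical, and the values passed in the example are placeholders rather than the settings used in the experiments.

```python
def warmup_learning_rate(step, d_model, warmup):
    """Linear warmup for `warmup` steps, then decay proportional to step ** -0.5."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Placeholder values, for illustration only.
rates = [warmup_learning_rate(s, d_model=512, warmup=100) for s in (1, 50, 100, 200, 400)]
```

The rate grows linearly until `step == warmup`, where it peaks, and then decays slowly.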
Novel object captioning. To train the model on the nocaps dataset, instead of using one-hot vectors, we represent words with GloVe word embeddings. Two fully-connected layers are added to convert between the GloVe dimensionality and the model dimensionality, one before the first decoding layer and one after the last decoding layer. Before the final softmax, we multiply with the transpose of the word embeddings. All other implementation details are kept unchanged.
Additional details on model architecture and training can be found in the supplementary material.
Table 1: Ablation study (B-1: BLEU-1, B-4: BLEU-4, M: METEOR, R: ROUGE, C: CIDEr, S: SPICE).
|Model|B-1|B-4|M|R|C|S|
|Transformer (w/ 6 layers)|79.1|36.2|27.7|56.9|121.8|20.9|
|Transformer (w/ 3 layers)|79.6|36.5|27.8|57.0|123.6|21.1|
|Transformer (w/ AoA)|80.3|38.8|29.0|58.4|129.1|22.7|
|M² Transformer 1-to-1 (w/o mem.)|80.5|38.2|28.9|58.2|128.4|22.2|
|M² Transformer (w/o mem.)|80.4|38.3|29.0|58.2|129.4|22.6|
|M² Transformer (w/ softmax)|80.3|38.4|29.1|58.3|130.3|22.5|
4.3 Ablation study
Performance of the Transformer. In previous works, the Transformer model has been applied to captioning only in its original configuration with six layers and self/cross attention, with the structure of connections that has been successful for uni-modal scenarios like machine translation. As we speculate that captioning requires specific architectures, we compare variations of the original Transformer with our approach.
First, we investigate the impact of the number of encoding and decoding layers on captioning performance. As can be seen in Table 1, the original Transformer (six layers) achieves 121.8 CIDEr, slightly superior to the Up-Down approach, which uses a two-layer recurrent language model with additive attention and includes a global feature vector ( CIDEr). Varying the number of layers, we observe a significant increase in performance when using three encoding and three decoding layers, which leads to 123.6 CIDEr. We hypothesize that this is due to the reduced training set size and to the lower semantic complexity of sentences in captioning with respect to those of language understanding tasks. Following this finding, all subsequent experiments use three layers.
Attention on Attention baseline. We also evaluate a recent proposal that can be straightforwardly applied to the Transformer as an alternative to standard dot-product attention. Specifically, we evaluate the addition of the “Attention on Attention” (AoA) approach to the attentive layers, both in the encoder and in the decoder. Notably, in the original proposal this was done with a recurrent language model with attention, but the approach is sufficiently general to be applied to any attention stage. In this case, the result of dot-product attention is concatenated with the initial query and fed to two fully connected layers to obtain an information vector and a sigmoidal attention gate, then the two vectors are multiplied together. The final result is used as an alternative to the standard dot-product attention. This addition to a standard Transformer with three layers leads to 129.1 CIDEr (Table 1), thus underlining the usefulness of the approach also in Transformer-based models.
Meshed Connectivity. We then evaluate the role of the meshed connections between encoder and decoder layers. In Table 1, we first introduce a reduced version of our approach in which the $i$-th decoder layer is only connected to the corresponding $i$-th encoder layer (1-to-1), instead of being connected to all encoder layers. As can be seen, this 1-to-1 connectivity schema already brings an improvement with respect to using the output of the last encoder layer as in the standard Transformer ( CIDEr vs CIDEr), thus confirming that exploiting a multi-level encoding of image regions is beneficial. When we instead use our meshed connectivity schema, which exploits relationships encoded at all levels and weights them with a sigmoid gating, we observe a further performance improvement, from CIDEr to CIDEr. This amounts to a total improvement of CIDEr points with respect to the standard Transformer. Also, the result of our full model is superior to that obtained using AoA.
As an alternative to the sigmoid gating approach for weighting the contributions from different encoder layers (Eq. 6), we also test a softmax gating schema. In this case, the element-wise sigmoid applied to each encoder layer is replaced with a softmax operation over the rows of the gating weights $\alpha_i$. Using this alternative leads to a reduction of around 1 CIDEr point, underlining that it is beneficial to exploit the full potential of a weighted sum of the contributions from all encoding layers, rather than forcing a peaky distribution in which one layer is given more importance than the others.
Role of persistent memory. We evaluate the role of memory vectors in both the 1-to-1 configuration and in the final configuration with meshed connections. As can be seen from Table 1, removing memory vectors leads to a reduction in performance of around 1 CIDEr point in both connectivity settings, thus confirming the usefulness of exploiting a priori learned knowledge when encoding image regions. Further experiments on the number of memory vectors can be found in the supplementary material.
4.4 Comparison with state of the art
We compare the performance of our approach with that of several recent proposals for image captioning. The models we compare to include SCST, which uses attention over the grid of features and a one-layer LSTM language model, and Up-Down, which introduces attention over regions and uses a two-layer LSTM language model. We also compare to the RFNet approach, which uses a recurrent fusion network to merge different CNN features; GCN-LSTM, which exploits pairwise relationships between image regions through a Graph Convolutional Neural Network; and SGAE, which instead uses auto-encoding scene graphs. Further, we compare with the original AoANet approach, which uses attention on attention for encoding image regions and an LSTM language model. Finally, we compare with ORT, which uses a plain Transformer and weights attention scores in the region encoder with pairwise distances between detections.
We evaluate our approach on the COCO “Karpathy” test split, using both single model and ensemble configurations, and on the online COCO evaluation server.
Single model. In Table 2 we report the performance of our method in comparison with the aforementioned competitors, using captions predicted from a single model and optimization on the CIDEr-D score. As can be observed, our method surpasses all other approaches in terms of BLEU-4, METEOR and CIDEr, while being competitive on BLEU-1 and SPICE with the best performer, and slightly worse on ROUGE with respect to AoANet. In particular, it advances the current state of the art on CIDEr by 1.4 points.
Ensemble model. Following the common practice [33, 14] of building an ensemble of models, we also report the performance of our approach when averaging the output probability distributions of multiple, independently trained instances of our model. In Table 3, we use ensembles of two and four models, trained from different random seeds. Notably, when using four models our approach achieves the best performance according to all metrics, with an increase of 2.5 CIDEr points with respect to the current state of the art.
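Averaging output distributions across models is simple to express in code. The sketch below uses made-up next-word distributions over a three-word vocabulary from two hypothetical models; the function name is an assumption.

```python
import numpy as np

def ensemble_next_word(prob_list):
    """Average the next-word probability distributions predicted by several models."""
    return np.mean(np.stack(prob_list), axis=0)

# Made-up distributions from two hypothetical models.
p_model_a = np.array([0.7, 0.2, 0.1])
p_model_b = np.array([0.1, 0.6, 0.3])
avg = ensemble_next_word([p_model_a, p_model_b])
```

The average of valid distributions is itself a valid distribution, so it can be plugged directly into beam search in place of a single model's output.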
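Table 5: Performance on the nocaps dataset (C: CIDEr, S: SPICE).
|Model|In-domain C|In-domain S|Out-of-domain C|Out-of-domain S|Overall C|Overall S|
|NBT + CBS|62.1|10.1|62.4|8.9|60.2|9.5|
|Up-Down + CBS|80.0|12.0|66.4|9.7|73.1|11.1|
|Transformer + CBS|74.3|11.0|62.5|9.2|66.9|10.3|
|M² Transformer + CBS|81.2|12.0|69.4|10.0|75.0|11.4|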
Finally, we also report the performance of our method on the online COCO test server.
4.5 Describing novel objects
We also assess the performance of our approach when dealing with images containing object categories that are not seen in the training set. We compare with the Up-Down model and Neural Baby Talk, when using GloVe word embeddings and Constrained Beam Search (CBS) to address the generation of out-of-vocabulary words and constrain the presence of categories detected by an object detector. To compare with our model, we use a simplified implementation of the original procedure to extract constraints, without using word phrases (e.g. plurals).
Results are shown in Table 5: as can be seen, the original Transformer performs significantly worse than Up-Down on both in-domain and out-of-domain categories, while our approach can properly deal with novel categories, surpassing the Up-Down baseline on both in-domain and out-of-domain images. As expected, the use of CBS significantly enhances performance, in particular on out-of-domain captioning.
4.6 Qualitative results and visualization
Figure 3 shows qualitative results generated by our model and the original Transformer. On average, our model is able to generate more accurate and descriptive captions, integrating fine-grained details and object relations.
Finally, to better understand the effectiveness of our M² Transformer, we investigate the contribution of detected regions to the model output. Unlike recurrent captioning models, in which attention weights over regions can be easily extracted, in our model the contribution of one region with respect to the output is given by more complex non-linear dependencies. Therefore, we resort to attribution methods: specifically, we employ the Integrated Gradients approach, which approximates the integral of gradients with respect to the given input. Results are presented in Figure 4, where we observe that our approach correctly grounds image regions to words, even in the presence of object details and small detections. More visualizations are included in the supplementary material.
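To illustrate what approximating the integral of gradients means, here is a small NumPy sketch of Integrated Gradients on a toy function with a known analytic gradient. The midpoint Riemann sum, the all-zeros baseline and the toy function are assumptions made for the example; with a captioning model the gradient would come from automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(z) returns the gradient of the scalar model output at input z.
    """
    alphas = (np.arange(steps) + 0.5) / steps             # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)    # straight line from baseline to x
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad                      # one attribution per input feature

# Toy "model": f(x) = sum(x ** 2), whose gradient is 2 * x.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(grad_f, x, baseline)
```

A useful sanity check is completeness: the attributions should add up to the difference between the output at the input and at the baseline.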
5 Conclusion
We presented M² Transformer, a novel Transformer-based architecture for image captioning. Our model incorporates a region encoding approach that exploits a priori knowledge through memory vectors and a meshed connectivity between encoding and decoding modules. Notably, this connectivity pattern is unprecedented for other fully-attentive architectures. Experimental results demonstrated that our approach achieves a new state of the art on COCO, ranking first in the online leaderboard. Finally, we validated the components of our model through ablation studies, and its performance when describing novel objects.
Appendix A Supplementary material
In the following, we present additional material about our M² Transformer model. In particular, we provide additional training and implementation details, further experimental results, and visualizations.
A.1 Additional implementation details
Decoding optimization. As mentioned in Sec. 3.3, during the decoding stage computation cannot be parallelized over time as the input sequence is iteratively built. A naive approach would be to feed the model at each iteration with the previously generated words, and sample the next predicted word after computing the results of each attention and feed-forward layer over all timesteps. In practice, this requires re-computing the same queries, keys, values and attentive states multiple times, with intermediate results depending on $w_t$ being recomputed $T - t$ times, where $T$ is the length of the sampled sequence (in our experiments $T$ is equal to ).
In our implementation, we resort to a more computationally friendly approach in which we re-use intermediate results computed at previous timesteps. Each attentive layer of the decoder internally stores previously computed keys and values. At each timestep $t$ of the decoding, the model is fed only with $w_t$, and we only compute queries, keys and values depending on $w_t$.
In PyTorch, this can be implemented by exploiting the register_buffer method of nn.Module, and creating buffers to hold previously computed results. When running on an NVIDIA 2080Ti GPU, we found this to reduce training and inference times by approximately a factor of 3.
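The caching idea can be illustrated independently of PyTorch. The NumPy sketch below is a simplified single-head layer with hypothetical names, not the released implementation: it appends one key and one value per decoding step and checks that the result matches a full recomputation with a causal mask.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class CachedSelfAttention:
    """Single-head self-attention that stores keys and values across timesteps."""

    def __init__(self, W_q, W_k, W_v):
        self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
        self.keys = np.zeros((0, W_k.shape[1]))     # grows by one row per step
        self.values = np.zeros((0, W_v.shape[1]))

    def step(self, y_t):
        """Process only the newest token y_t, re-using cached keys and values."""
        q = y_t @ self.W_q
        self.keys = np.vstack([self.keys, y_t @ self.W_k])
        self.values = np.vstack([self.values, y_t @ self.W_v])
        scores = self.keys @ q / np.sqrt(q.shape[-1])   # similarities with all tokens so far
        return softmax(scores) @ self.values

rng = np.random.default_rng(0)
d = 6   # toy dimensionality
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Y = rng.normal(size=(4, d))   # embeddings of 4 tokens, fed one at a time

layer = CachedSelfAttention(W_q, W_k, W_v)
incremental = np.stack([layer.step(y_t) for y_t in Y])

# Reference: recompute everything at once with a causal (lower-triangular) mask.
scores = (Y @ W_q) @ (Y @ W_k).T / np.sqrt(d)
scores = np.where(np.tril(np.ones((4, 4), dtype=bool)), scores, -np.inf)
full = softmax(scores) @ (Y @ W_v)
```

Both computations give the same outputs, but the cached version projects each token only once instead of once per remaining timestep.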
Vocabulary and tokenization. We convert all captions to lowercase, remove punctuation characters and tokenize using the spaCy NLP toolkit.
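As a rough illustration of this preprocessing, the sketch below lowercases a caption, strips ASCII punctuation and splits it into tokens. Whitespace splitting is used here as a stand-in for the spaCy tokenizer, so the output may differ from the actual pipeline on contractions and similar cases; the function name `normalize_caption` is hypothetical.

```python
import string


def normalize_caption(caption):
    # Lowercase, drop ASCII punctuation, then split on whitespace.
    # Whitespace splitting is an assumption standing in for spaCy tokenization.
    table = str.maketrans("", "", string.punctuation)
    return caption.lower().translate(table).split()


tokens = normalize_caption("A man, riding a wave on a surfboard.")
```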
Model dimensionality and weight initialization. The size of queries, keys and values in each head is determined by the number of attentive heads. Weights of attentive layers are initialized from the uniform distribution proposed by Glorot et al., while weights of feed-forward layers are initialized following He et al. All biases are initialized to 0. Memory vectors for keys and values are initialized from zero-mean normal distributions whose variances depend, respectively, on the dimensionality of keys and on the number of memory vectors.
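The initialization schemes named above can be sketched as follows. This is an illustrative NumPy version rather than the released code. It assumes that the feed-forward scheme is the one by He et al. (the rectifier-initialization reference cited for this section) in its normal-distribution variant. All layer sizes are hypothetical, and the memory-vector variance is left as an argument because its value is not restated here.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    # Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    # He normal: zero mean, standard deviation sqrt(2 / fan_in).
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))


def init_memory(num_slots, dim, variance, rng):
    # Zero-mean normal memory vectors; the variance is supplied by the caller.
    return rng.normal(0.0, np.sqrt(variance), size=(num_slots, dim))


rng = np.random.default_rng(0)
attn_w = glorot_uniform(256, 256, rng)   # hypothetical attentive-layer weight
ff_w = he_normal(256, 256, rng)          # hypothetical feed-forward weight
bias = np.zeros(256)                     # all biases start at 0
memory_k = init_memory(10, 32, 0.05, rng)  # hypothetical slots, size, variance
```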
A.2 Additional experimental results
Memory vectors. In Table 6, we report the performance of our approach when using a varying number of memory vectors. As can be seen, a single setting gives the best result in terms of BLEU, METEOR, ROUGE and CIDEr, while a different number of memory vectors provides a slightly superior result in terms of SPICE. All experiments in the main paper are therefore carried out with the former setting.
Encoder and decoder layers. To complement the analysis presented in Sec. 4.3, we also investigate the performance of our model when changing the number of encoding and decoding layers. Table 7 shows that the best performance is obtained with three encoding and decoding layers, thus confirming the initial findings on the base Transformer model. As our model can deal with a different number of encoding and decoding layers, we also experimented with non-symmetric encoding-decoding architectures, without however noticing significant improvements in performance.
SPICE F-scores. Finally, in Table 8 we report a breakdown of SPICE F-scores over various subcategories on the “Karpathy” test split, in comparison with the Up-Down approach and the base Transformer model with three layers. As can be seen, our model significantly improves on identifying objects, attributes and relationships between objects.
A.3 Qualitative results and visualization
Figure 6 shows additional qualitative results obtained from our model in comparison to the original Transformer and corresponding ground-truth captions. On average, the proposed model shows an improvement in terms of caption correctness and provides more detailed and exhaustive descriptions.
Figures 7 and 8, instead, visualize attentive states on a variety of sample images, following the approach outlined in Sec. 4.6 of the main paper. Specifically, the Integrated Gradients approach produces an attribution score for each feature channel of each input region. To obtain the attribution of each region, we average over the feature channels and re-normalize the obtained scores by their sum. For visualization purposes, we apply a contrast stretching function to project scores into the 0-1 interval.
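The aggregation of channel-level attributions into region-level scores can be sketched as below. The Integrated Gradients computation itself is not reproduced: the input array is made up for illustration. The contrast stretching function is not specified in the text, so a simple min-max rescaling is assumed. Function names are hypothetical.

```python
import numpy as np


def region_attributions(channel_scores):
    # channel_scores: (num_regions, num_channels) attribution scores.
    # Average over feature channels, then re-normalize by the sum.
    per_region = channel_scores.mean(axis=1)
    return per_region / per_region.sum()


def contrast_stretch(scores):
    # Assumed min-max stretching into the 0-1 interval.
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)


# Made-up attributions for 3 regions with 2 feature channels each.
channel_scores = np.array([[1.0, 3.0], [2.0, 6.0], [0.0, 4.0]])
region_scores = region_attributions(channel_scores)  # sums to 1
display_scores = contrast_stretch(region_scores)     # spans 0 to 1
```

Since attribution scores can be negative, a real implementation would also need to guard against a zero or near-zero sum before normalizing.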
A.4 Novel object captioning
Figure 5 reports sample captions produced by our approach on images from the nocaps dataset. On each image, we compare to the baseline Transformer and show the constraints provided by the object detector. Overall, our model is better able to incorporate the constraints while maintaining the fluency and correctness of the generated sentences.
Following prior work on nocaps, we use an object detector trained on Open Images.
- Equal contribution.
- Taking another perspective, self-attention is also conceptually equivalent to an attentive encoding of graph nodes.
- Specifically, the tf_faster_rcnn_inception_resnet_v2_atrous_oidv2 model from the TensorFlow model zoo.
- (2019) Nocaps: novel object captioning at scale. In Proceedings of the International Conference on Computer Vision, Cited by: §A.4, §4.1, §4.5, Table 5.
- (2016) SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the European Conference on Computer Vision, Cited by: §4.2.
- (2017) Guided open vocabulary image captioning with constrained beam search. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §4.5.
- (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.2, Table 8, §2, §3.3, §3.3, §3.3, §4.2, §4.3, §4.4, §4.5, Table 2, Table 4.
- (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
- (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Cited by: §4.2.
- (2019) Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
- (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.
- (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
- (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Cited by: §A.1.
- (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.1.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
- (2019) Image Captioning: Transforming Objects into Words. arXiv preprint arXiv:1906.05963. Cited by: §2, §4.4, Table 2.
- (2019) Attention on Attention for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §1, §2, §4.3, §4.4, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
- (2018) Recurrent Fusion Network for Image Captioning. In Proceedings of the European Conference on Computer Vision, Cited by: §4.4, Table 2, Table 3, Table 4.
- (2016) DenseCap: Fully convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
- (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.1.
- (2019) Reflective Decoding Network for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: Table 4.
- (2015) Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §4.2.
- (2017) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §4.2.
- (2018) The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §4.1.
- (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL Workshop, Vol. 8. Cited by: §4.2.
- (2014) Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Cited by: §4.1.
- (2019) Entangled Transformer for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §1, §2, Table 3, Table 4.
- (2017) Improved Image Captioning via Policy Gradient Optimization of SPIDEr. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
- (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
- (2018) Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.5.
- (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Cited by: §4.2.
- (2017) Areas of attention for image captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
- (2014) GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §4.2.
- (2015) Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations, Cited by: §2, §3.3.
- (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §4.2.
- (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.3, §3.3, §3.3, §4.4, §4.4, Table 2, Table 3, Table 4.
- (2010) Connecting modalities: semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
- (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
- (2019) Augmenting Self-attention with Persistent Memory. arXiv preprint arXiv:1907.01470. Cited by: §2.
- (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the International Conference on Computer Vision, Cited by: §1.
- (2017) Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, Cited by: §A.3, §4.6.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §2, §3.1, §4.2, §4.2, Table 1.
- (2015) CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3, §4.2.
- (2018) Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Cited by: footnote 1.
- (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
- (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. Cited by: §2.
- (2017) Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. Cited by: §1.
- (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Cited by: §1, §2.
- (2019) Auto-Encoding Scene Graphs for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.4, Table 2, Table 3, Table 4.
- (2010) I2t: image parsing to text description. Proceedings of the IEEE 98 (8), pp. 1485–1508. Cited by: §2.
- (2018) Exploring Visual Relationship for Image Captioning. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2, §4.4, Table 2, Table 3, Table 4.
- (2019) Hierarchy Parsing for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: Table 2, Table 3, Table 4.
- (2016) Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.