$\mathcal{M}^2$: Meshed-Memory Transformer for Image Captioning

Abstract

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present $\mathcal{M}^2$ – a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions integrating learned a priori knowledge, and uses a mesh-like connectivity at decoding stage to exploit low- and high-level features. Experimentally, we investigate the performance of the $\mathcal{M}^2$ Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the “Karpathy” test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.


1 Introduction

Image captioning is the task of describing the visual content of an image in natural language. As such, it requires an algorithm to understand and model the relationships between visual and textual elements, and to generate a sequence of output words. This has usually been tackled via Recurrent Neural Network models [42, 17, 45, 44, 7], in which the sequential nature of language is modeled with the recurrent relations of either RNNs or LSTMs. Additive attention or graph-like structures [48] are often added to the recurrence [45, 14] in order to model the relationships between image regions, words, and, possibly, tags [24].

Figure 1: Our image captioning approach encodes relationships between image regions exploiting learned a priori knowledge. Multi-level encodings of image regions are connected to a language decoder through a meshed and learnable connectivity.

This schema has remained the dominant approach in the last few years, with the exception of the investigation of Convolutional language models [5], which however did not become a leading choice. The recent advent of fully-attentive models, in which the recurrent relation is abandoned in favour of the use of self-attention, offers unique opportunities in terms of set and sequence modeling performance, as demonstrated by the Transformer [39] and BERT [8] models and their applications to retrieval [35] and video understanding [37]. Also, this setting offers novel architectural modeling capabilities, as for the first time the attention operator is used in a multi-layer and extensible fashion. Nevertheless, the multi-modal nature of image captioning demands specific architectures, different from those employed for the understanding of a single modality.

Following this premise, we investigate the design of a novel fully-attentive approach for image captioning. Our architecture takes inspiration from the Transformer model [39] for machine translation and incorporates two key novelties with respect to all previous image captioning algorithms: (i) image regions and their relationships are encoded in a multi-level fashion, in which both low-level and high-level relations are taken into account. When modeling these relationships, our model can learn and encode a priori knowledge by using persistent memory vectors. (ii) The generation of the sentence, done with a multi-layer architecture, exploits both low- and high-level visual relationships instead of having just a single input from the visual modality. This is achieved through a learned gating mechanism, which weights multi-level contributions at each stage. As this creates a mesh connectivity schema between encoder and decoder layers, we name our model Meshed-Memory Transformer ($\mathcal{M}^2$ Transformer for short). Figure 1 depicts a schema of the architecture.

Experimentally, we explore different fully-attentive baselines and recent proposals, gaining insights on the performance of fully-attentive models in image captioning. Our $\mathcal{M}^2$ Transformer, when tested on the COCO benchmark, achieves a new state of the art on the “Karpathy” test set, in both single-model and ensemble configurations. Most importantly, it surpasses existing proposals on the online test server, ranking first among published algorithms.

Contributions. To sum up, our contributions are as follows:

  • We propose a novel fully-attentive image captioning algorithm. Our model encapsulates a multi-layer encoder for image regions and a multi-layer decoder which generates the output sentence. To exploit both low-level and high-level contributions, encoding and decoding layers are connected in a mesh-like structure, weighted through a learnable gating mechanism;

  • In our visual encoder, relationships between image regions are encoded in a multi-level fashion exploiting learned a priori knowledge, which is modeled via persistent memory vectors;

  • We show that the $\mathcal{M}^2$ Transformer surpasses all previous proposals for image captioning, achieving a new state of the art on the online COCO evaluation server;

  • As a complementary contribution, we conduct experiments to compare different fully-attentive architectures on image captioning and validate the performance of our model on novel object captioning, using the recently proposed nocaps dataset. Finally, to improve reproducibility and foster new research in the field, we will publicly release the source code and trained models of all experiments.

2 Related work

A broad collection of methods has been proposed in the field of image captioning in the last few years. Earlier captioning approaches were based on the generation of simple templates, filled by the output of an object detector or attribute predictor [34, 47]. With the advent of Deep Neural Networks, most captioning techniques have employed RNNs as language models and used the output of one or more layers of a CNN to encode visual information and condition language generation [43, 33, 9, 16]. On the training side, while initial methods were based on a time-wise cross-entropy training, a notable achievement has been made with the introduction of Reinforcement Learning, which enabled the use of non-differentiable caption metrics as optimization objectives [33, 31, 25]. On the image encoding side, instead, single-layer attention mechanisms have been adopted to incorporate spatial knowledge, initially from a grid of CNN features [45, 26, 50], and then using image regions extracted with an object detector [4, 29, 27]. To further improve the encoding of objects and their relationships, Yao et al. [48] have proposed to use a graph convolution neural network in the image encoding phase to integrate semantic and spatial relationships between objects. On the same line, Yang et al. [46] used a multi-modal graph convolution network to modulate scene graphs into visual representations.

Despite their wide adoption, RNN-based models suffer from their limited representation power and sequential nature. After the emergence of Convolutional language models, which have been explored for captioning as well [5], new fully-attentive paradigms [39, 8, 36] have been proposed and achieved state-of-the-art results in machine translation and language understanding tasks. Likewise, some recent approaches have investigated the application of the Transformer model [39] to the image captioning task.

In a nutshell, the Transformer comprises an encoder made of a stack of self-attention and feed-forward layers, and a decoder which uses self-attention on words and cross-attention over the output of the last encoder layer. Herdade et al. [13] used the Transformer architecture for image captioning and incorporated geometric relations between detected input objects. In particular, they computed an additional geometric weight between object pairs which is used to scale attention weights. Liu et al. [24] used the Transformer in a model that exploits visual information and additional semantic knowledge given by an external tagger. On a related line, Huang et al. [14] introduced an extension of the attention operator in which the final attended information is weighted by a gate guided by the context. In their approach, a Transformer-like encoder was paired with an LSTM decoder.

While all the aforementioned approaches have exploited the original Transformer architecture, in this paper we devise a novel fully-attentive model that improves the design of both the image encoder and the language decoder, introducing two novel attention operators and a different design of the connectivity between encoder and decoder.

Figure 2: Architecture of the $\mathcal{M}^2$ Transformer. Our model is composed of a stack of memory-augmented encoding layers, which encodes multi-level visual relationships with a priori knowledge, and a stack of decoder layers, in charge of generating textual tokens. For the sake of clarity, AddNorm operations are not shown. Best seen in color.

3 Meshed-Memory Transformer

Our model can be conceptually divided into an encoder and a decoder module, both made of stacks of attentive layers. While the encoder is in charge of processing regions from the input image and devising relationships between them, the decoder reads from the output of each encoding layer to generate the output caption word by word. All intra-modality and cross-modality interactions between word and image-level features are modeled via scaled dot-product attention, without using recurrence. Attention operates on three sets of vectors, namely a set of queries $Q$, keys $K$ and values $V$, and takes a weighted sum of value vectors according to a similarity distribution between query and key vectors. In the case of scaled dot-product attention, the operator can be formally defined as

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) V \qquad (1)$$

where $Q$ is a matrix of query vectors, $K$ and $V$ both contain keys and values, all with the same dimensionality $d$, and $\sqrt{d}$ is a scaling factor.
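For reference, a minimal PyTorch sketch of the scaled dot-product attention of Eq. 1 is reported below; function and tensor names are illustrative and not taken from the released implementation.

```python
import torch

def scaled_dot_product_attention(queries, keys, values):
    # queries: (batch, n_q, d); keys, values: (batch, n_k, d)
    d = queries.shape[-1]
    # similarity between every query and every key, scaled by sqrt(d)
    scores = torch.matmul(queries, keys.transpose(-2, -1)) / (d ** 0.5)
    weights = torch.softmax(scores, dim=-1)   # distribution over keys for each query
    return torch.matmul(weights, values)      # weighted sum of value vectors
```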

3.1 Memory-Augmented Encoder

Given a set of image regions $X$ extracted from an input image, attention can be used to obtain a permutation invariant encoding of $X$ through the self-attention operations used in the Transformer [39]. In this case, queries, keys, and values are obtained by linearly projecting the input features, and the operator can be defined as

$$\mathcal{S}(X) = \text{Attention}(W_q X, W_k X, W_v X) \qquad (2)$$

where $W_q$, $W_k$, $W_v$ are matrices of learnable weights. The output of the self-attention operator is a new set of elements $\mathcal{S}(X)$, with the same cardinality as $X$, in which each element of $X$ is replaced with a weighted sum of the values, i.e. of linear projections of the input (Eq. 1).

Noticeably, attention weights depend solely on the pairwise similarities between linear projections of the input set itself. Therefore, the self-attention operator can be seen as a way of encoding pairwise relationships inside the input set. When using image regions (or features derived from image regions) as the input set, $\mathcal{S}(\cdot)$ can naturally encode the pairwise relationships between regions that are needed to understand the input image before describing it2.

This peculiarity in the definition of self-attention has, however, a significant limitation. Because everything depends solely on pairwise similarities, self-attention cannot model a priori knowledge on relationships between image regions. For example, given one region encoding a man and a region encoding a basketball, it would be difficult to infer the concept of player or game without any a priori knowledge. Similarly, given regions encoding eggs and toast, the knowledge that the picture depicts a breakfast could be easily inferred only by using a priori knowledge on relationships.

Memory-Augmented Attention. To overcome this limitation of self-attention, we propose a memory-augmented attention operator. In our proposal, the set of keys and values used for self-attention is extended with additional “slots” which can encode a priori information. To stress that a priori information should not depend on the input set $X$, the additional keys and values are implemented as plain learnable vectors which can be directly updated via SGD. Formally, the operator is defined as:

$$\mathcal{M}_{\text{mem}}(X) = \text{Attention}(W_q X, K, V), \quad K = \left[W_k X, M_k\right], \quad V = \left[W_v X, M_v\right] \qquad (3)$$

where $M_k$ and $M_v$ are learnable matrices with $n_m$ rows, and $[\cdot, \cdot]$ indicates concatenation. Intuitively, by adding learnable keys and values, through attention it will be possible to retrieve learned knowledge which is not already embedded in $X$. At the same time, our formulation leaves the set of queries unaltered. Intuitively again, this helps to avoid hallucination, given that knowledge is always retrieved because of similarities with queries, which are computed from what is seen in the image.

Just like the self-attention operator, our memory-augmented attention can be applied in a multi-head fashion. In this case, the memory-augmented attention operation is repeated $h$ times, using different projection matrices and different learnable memory slots for each head. Then, we concatenate the results from the different heads and apply a linear projection.
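As a rough illustration, the following single-head sketch implements the operator of Eq. 3 with memory slots stored as learnable parameters; class and attribute names are assumptions, and the multi-head extension is omitted for brevity.

```python
import torch
from torch import nn

class MemoryAugmentedAttention(nn.Module):
    """Single-head sketch of Eq. 3: keys and values are extended with learnable memory slots."""
    def __init__(self, d_model, n_memory):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # a priori knowledge: plain learnable vectors, independent of the input regions
        self.mem_k = nn.Parameter(torch.randn(n_memory, d_model))
        self.mem_v = nn.Parameter(torch.randn(n_memory, d_model))

    def forward(self, regions):  # regions: (batch, n_regions, d_model)
        b = regions.size(0)
        q = self.w_q(regions)
        k = torch.cat([self.w_k(regions), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.w_v(regions), self.mem_v.expand(b, -1, -1)], dim=1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v  # queries remain those of the input set
```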

Encoding layer. We embed our memory-augmented operator into a Transformer-like layer: the output of the memory-augmented attention is fed to a position-wise feed-forward layer composed of two affine transformations with a single non-linearity, which are independently applied to each element of the set. Formally,

$$\mathcal{F}(X)_i = U\,\sigma(V x_i + b) + c \qquad (4)$$

where $x_i$ indicates the $i$-th vector of the input set, and $\mathcal{F}(X)_i$ the $i$-th vector of the output. Also, $\sigma(\cdot)$ is the ReLU activation function, $V$ and $U$ are learnable weight matrices, and $b$ and $c$ are bias terms.

Each of these sub-components (memory-augmented attention and position-wise feed-forward) is then encapsulated within a residual connection and a layer normalization operation. The complete definition of an encoding layer can finally be written as:

$$Z = \text{AddNorm}(\mathcal{M}_{\text{mem}}(X)), \qquad \tilde{X} = \text{AddNorm}(\mathcal{F}(Z)) \qquad (5)$$

where $\text{AddNorm}$ indicates the composition of a residual connection and of a layer normalization.
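Putting Eq. 4 and Eq. 5 together, an encoding layer can be sketched as below, reusing the MemoryAugmentedAttention sketch above; the feed-forward width and the absence of dropout are simplifying assumptions.

```python
from torch import nn

class EncodingLayer(nn.Module):
    """Sketch of Eq. 4-5: memory-augmented attention and a position-wise feed-forward
    network, each wrapped by a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, d_ff=2048, n_memory=40):
        super().__init__()
        self.attn = MemoryAugmentedAttention(d_model, n_memory)  # from the previous sketch
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        z = self.norm1(x + self.attn(x))    # AddNorm around the attention sub-layer
        return self.norm2(z + self.ffn(z))  # AddNorm around the feed-forward sub-layer
```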

Full encoder. Given the aforementioned structure, multiple encoding layers are stacked in sequence, so that the $i$-th layer consumes the output set computed by layer $i-1$. This amounts to creating multi-level encodings of the relationships between image regions, in which higher encoding layers can exploit and refine relationships already identified by previous layers, eventually using a priori knowledge. A stack of $N$ encoding layers will therefore produce a multi-level output $\tilde{\mathcal{X}} = (\tilde{X}^1, \ldots, \tilde{X}^N)$, obtained from the outputs of each encoding layer.

3.2 Meshed Decoder

Our decoder is conditioned on both previously generated words and region encodings, and is in charge of generating the next tokens of the output caption. Here, we exploit the aforementioned multi-level representation of the input image while still building a multi-layer structure. To this aim, we devise a meshed attention operator which, unlike the cross-attention operator of the Transformer, can take advantage of all encoding layers during the generation of the sentence.

Meshed Cross-Attention. Given an input sequence of vectors $Y$, and the outputs from all encoding layers $\tilde{\mathcal{X}}$, the Meshed Attention operator connects $Y$ to all elements in $\tilde{\mathcal{X}}$ through gated cross-attentions. Instead of attending only the last encoding layer, we perform a cross-attention with all encoding layers. These multi-level contributions are then summed together after being modulated. Formally, our meshed attention operator is defined as

$$\mathcal{M}_{\text{mesh}}(\tilde{\mathcal{X}}, Y) = \sum_{i=1}^{N} \alpha_i \odot \mathcal{C}(\tilde{X}^i, Y) \qquad (6)$$

where $\mathcal{C}(\cdot, \cdot)$ stands for the encoder-decoder cross-attention, computed using queries from the decoder and keys and values from the encoder:

$$\mathcal{C}(\tilde{X}^i, Y) = \text{Attention}(W_q Y, W_k \tilde{X}^i, W_v \tilde{X}^i) \qquad (7)$$

and $\alpha_i$ is a matrix of weights having the same size as the cross-attention results. Weights in $\alpha_i$ modulate both the single contribution of each encoding layer, and the relative importance between different layers. These are computed by measuring the relevance between the result of the cross-attention computed with each encoding layer and the input query, as follows:

$$\alpha_i = \sigma\left(W_i \left[Y,\, \mathcal{C}(\tilde{X}^i, Y)\right] + b_i\right) \qquad (8)$$

where $[\cdot, \cdot]$ indicates concatenation, $\sigma(\cdot)$ is the sigmoid activation, $W_i$ is a learnable weight matrix, and $b_i$ is a learnable bias vector.
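A compact sketch of Eqs. 6-8 is given below: one cross-attention and one sigmoid gate per encoding layer, with the gated contributions summed. The names, the single-head form, and the projection matrices shared across layers are assumptions; the additional output normalization mentioned in Sec. 4.2 is omitted.

```python
import torch
from torch import nn

class MeshedCrossAttention(nn.Module):
    """Sketch of Eq. 6-8: gated cross-attention with every encoding layer, summed."""
    def __init__(self, d_model, n_enc_layers):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # one gate per encoding layer, fed with the concatenation [Y; C(X_i, Y)] (Eq. 8)
        self.gates = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(n_enc_layers))

    def cross_attention(self, y, x):  # Eq. 7: queries from the decoder, keys/values from the encoder
        q, k, v = self.w_q(y), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, y, enc_outputs):  # enc_outputs: list of N tensors (batch, n_regions, d_model)
        out = 0
        for gate, x in zip(self.gates, enc_outputs):
            c = self.cross_attention(y, x)
            alpha = torch.sigmoid(gate(torch.cat([y, c], dim=-1)))  # per-element gate (Eq. 8)
            out = out + alpha * c                                   # modulated sum (Eq. 6)
        return out
```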

Architecture of decoding layers. As for encoding layers, we apply our meshed attention in a multi-head fashion. As the prediction of a word should only depend on previously predicted words, the decoder layer comprises a masked self-attention operation which connects queries derived from the $t$-th element of its input sequence with keys and values obtained from the left-hand subsequence, i.e. $Y_{\leq t}$. Also, the decoder layer contains a position-wise feed-forward layer (as in Eq. 4), and all components are encapsulated within $\text{AddNorm}$ operations. The final structure of the decoder layer can be written as:

$$Z = \text{AddNorm}\left(\mathcal{M}_{\text{mesh}}\left(\tilde{\mathcal{X}}, \text{AddNorm}(\mathcal{S}_{\text{mask}}(Y))\right)\right), \qquad \tilde{Y} = \text{AddNorm}(\mathcal{F}(Z)) \qquad (9)$$

where $Y$ is the input sequence of vectors and $\mathcal{S}_{\text{mask}}$ indicates a masked self-attention over time. Finally, our decoder stacks together multiple decoder layers, helping to refine both the understanding of the textual input and the generation of the next tokens. Overall, the decoder takes word vectors as input, and the $t$-th element of its output sequence encodes the prediction of a word at time $t+1$, conditioned on $Y_{\leq t}$. After a linear projection and a softmax operation, this encodes a probability over words in the dictionary.
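The masking in $\mathcal{S}_{\text{mask}}$ can be realized by forbidding attention to future positions before the softmax; a minimal sketch (function names are illustrative) follows.

```python
import torch

def causal_mask(seq_len, device=None):
    # True above the diagonal marks future positions that must not be attended
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)

def masked_self_attention(y, w_q, w_k, w_v):
    # y: (batch, seq_len, d); w_*: linear projections for queries, keys, values
    q, k, v = w_q(y), w_k(y), w_v(y)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(causal_mask(y.size(1), y.device), float('-inf'))
    return torch.softmax(scores, dim=-1) @ v
```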

3.3 Training details

Following a standard practice in image captioning [31, 33, 4], we pre-train our model with a word-level cross-entropy loss (XE) and finetune the sequence generation using reinforcement learning. When training with XE, the model is trained to predict the next token given previous ground-truth words; in this case, the input sequence for the decoder is immediately available and the computation of the entire output sequence can be done in a single pass, parallelizing all operations over time.

When training with reinforcement learning, we employ a variant of the self-critical sequence training approach [33] on sequences sampled using beam search [4]: to decode, we sample the top-$k$ words from the decoder probability distribution at each timestep, and always maintain the top-$k$ sequences with highest probability. As sequence decoding is iterative in this step, the aforementioned parallelism over time cannot be exploited. However, intermediate keys and values used to compute the output token at time $t$ can be reused in the next iterations.

Following previous works [4], we use the CIDEr-D score as reward, as it correlates well with human judgment [40]. We baseline the reward with the mean of the rewards of the sampled sequences, rather than with the reward of greedy decoding as done in previous methods [33, 4], as we found this to slightly improve the final performance. The final gradient expression for one sample is thus:

$$\nabla_\theta L(\theta) = -\frac{1}{k}\sum_{i=1}^{k}\left(\left(r(w^i) - b\right)\nabla_\theta \log p(w^i)\right) \qquad (10)$$

where $w^i$ is the $i$-th sentence in the beam, $r(\cdot)$ is the reward function, and $b = \left(\sum_i r(w^i)\right)/k$ is the baseline, computed as the mean of the rewards obtained by the sampled sequences. At prediction time, we decode again using beam search, and keep the sequence with the highest predicted probability among those in the last beam.
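A sketch of the resulting loss is shown below: given the summed log-probabilities and the CIDEr-D rewards of the $k$ beam-sampled captions, subtracting the mean reward as baseline and minimizing the loss yields the gradient of Eq. 10; names and shapes are assumptions.

```python
def scst_loss(log_probs, rewards):
    """log_probs: (batch, k) sum of token log-probabilities of each sampled caption;
    rewards: (batch, k) CIDEr-D score of each sampled caption against the ground truths."""
    baseline = rewards.mean(dim=-1, keepdim=True)   # mean reward of the k sequences (b in Eq. 10)
    advantage = (rewards - baseline).detach()       # the reward is not differentiated through
    return -(advantage * log_probs).mean()          # minimizing this follows Eq. 10
```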

4 Experiments

4.1 Datasets

We first evaluate our model on the COCO dataset [23], which is the most commonly used test-bed for image captioning. Then, we assess the captioning of novel objects by testing on the recently proposed nocaps dataset [1].

COCO. The dataset contains more than 120,000 images, each of them annotated with 5 different captions. We follow the splits provided by Karpathy et al. [17], where 5,000 images are used for validation, 5,000 for testing and the rest for training. We also evaluate the model on the COCO online test server, composed of 40,775 images for which annotations are not made publicly available.

nocaps. The dataset consists of 15,100 images taken from the Open Images [21] validation and test sets, each annotated with 11 human-generated captions. Images are divided into validation and test splits, respectively composed of 4,500 and 10,600 elements. Images can be further grouped into three subsets depending on their closeness to COCO, namely in-domain, near-domain, and out-of-domain images. Under this setting, we use COCO as training data and evaluate our results on the nocaps test server.

4.2 Experimental settings

Metrics. Following the standard evaluation protocol, we employ the full set of captioning metrics: BLEU [28], METEOR [6], ROUGE [22], CIDEr [40], and SPICE [2].

Implementation details. To represent image regions, we use Faster R-CNN [32] with ResNet-101 [12] finetuned on the Visual Genome dataset [20, 4], thus obtaining a 2048-dimensional feature vector for each region. To represent words, we use one-hot vectors and linearly project them to the input dimensionality of the model, $d_{\text{model}}$. We also employ sinusoidal positional encodings [39] to represent word positions inside the sequence, and sum the two embeddings before the first decoding layer.

In our model, we set the dimensionality $d_{\text{model}}$ of each layer to 512, the number of heads to 8, and the number of memory vectors to 40. We employ dropout with keep probability 0.9 after each attention and feed-forward layer. In our meshed attention operator (Eq. 6), we normalize the output with a scaling factor of $\sqrt{N}$. Pre-training with XE is done following the learning rate scheduling strategy of [39] with a warmup equal to 10,000 iterations. Then, during CIDEr-D optimization, we use a fixed learning rate of $5\times10^{-6}$. We train all models using the Adam optimizer [19], a batch size of 50, and a beam size equal to 5.
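The warmup scheduling of [39] referenced above can be sketched as follows; the $d_{\text{model}}$ and warmup values repeat the (reconstructed) settings of this section and should be treated as assumptions.

```python
def transformer_learning_rate(step, d_model=512, warmup=10000):
    # schedule of [39]: linear warmup followed by inverse square-root decay
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```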

Novel object captioning. To train the model on the nocaps dataset, instead of using one-hot vectors, we represent words with GloVe word embeddings [30]. Two fully-connected layers are added to convert between the GloVe dimensionality and $d_{\text{model}}$, before the first decoding layer and after the last decoding layer. Before the final softmax, we multiply with the transpose of the word embeddings. All other implementation details are kept unchanged.
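The output side of this scheme can be sketched as below: decoder states are mapped back to the GloVe space and word scores are obtained by multiplying with the transpose of the embedding matrix. The class name and the frozen embeddings are assumptions.

```python
import torch
from torch import nn

class GloVeOutputLayer(nn.Module):
    """Sketch of the nocaps output layer: project decoder states to the GloVe space,
    then score vocabulary words via the transpose of the embedding matrix."""
    def __init__(self, glove_vectors, d_model=512):   # glove_vectors: (vocab_size, glove_dim)
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
        self.to_glove = nn.Linear(d_model, glove_vectors.size(1))

    def forward(self, decoder_states):                # (batch, seq_len, d_model)
        logits = self.to_glove(decoder_states) @ self.embed.weight.t()
        return torch.log_softmax(logits, dim=-1)      # log-probabilities over the vocabulary
```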

Additional details on model architecture and training can be found in the supplementary material.

B-1 B-4 M R C S
Transformer (w/ 6 layers as in [39]) 79.1 36.2 27.7 56.9 121.8 20.9
Transformer (w/ 3 layers) 79.6 36.5 27.8 57.0 123.6 21.1
Transformer (w/ AoA [14]) 80.3 38.8 29.0 58.4 129.1 22.7
$\mathcal{M}^2$ Transformer (1-to-1, w/o mem.) 80.5 38.2 28.9 58.2 128.4 22.2
$\mathcal{M}^2$ Transformer (1-to-1) 80.3 38.2 28.9 58.2 129.2 22.5
$\mathcal{M}^2$ Transformer (w/o mem.) 80.4 38.3 29.0 58.2 129.4 22.6
$\mathcal{M}^2$ Transformer (w/ softmax) 80.3 38.4 29.1 58.3 130.3 22.5
$\mathcal{M}^2$ Transformer 80.8 39.1 29.2 58.6 131.2 22.6
Table 1: Ablation study and comparison with Transformer-based alternatives. All results are reported after the REINFORCE optimization stage.

4.3 Ablation study

Performance of the Transformer. In previous works, the Transformer model has been applied to captioning only in its original configuration with six layers and self/cross attention, with the structure of connections that has been successful for uni-modal scenarios like machine translation. As we speculate that captioning requires specific architectures, we compare variations of the original Transformer with our approach.

Firstly, we investigate the impact of the number of encoding and decoding layers on captioning performance. As shown in Table 1, the original Transformer (six layers) achieves 121.8 CIDEr, slightly superior to the Up-Down approach [4], which uses a two-layer recurrent language model with additive attention and includes a global feature vector (120.1 CIDEr). Varying the number of layers, we observe a significant increase in performance when using three encoding and three decoding layers, which leads to 123.6 CIDEr. We hypothesize that this is due to the reduced training set size and to the lower semantic complexity of sentences in captioning with respect to those of language understanding tasks. Following this finding, all subsequent experiments use three layers.

Attention on Attention baseline. We also evaluate a recent proposal that can be straightforwardly applied to the Transformer as an alternative to standard dot-product attention. Specifically, we evaluate the addition of the “Attention on Attention” (AoA) approach [14] to the attentive layers, both in the encoder and in the decoder. Noticeably, in [14] this was done with a recurrent language model with attention, but the approach is sufficiently general to be applied to any attention stage. In this case, the result of dot-product attention is concatenated with the initial query and fed to two fully connected layers to obtain an information vector and a sigmoidal attention gate; the two vectors are then multiplied together. The final result is used as an alternative to the standard dot-product attention. This addition to a standard Transformer with three layers leads to 129.1 CIDEr (Table 1), thus underlining the usefulness of the approach also in Transformer-based models.

B-1 B-4 M R C S
SCST [33] - 34.2 26.7 55.7 114.0 -
Up-Down [4] 79.8 36.3 27.7 56.9 120.1 21.4
RFNet [15] 79.1 36.5 27.7 57.3 121.9 21.2
Up-Down+HIP [49] - 38.2 28.4 58.3 127.2 21.9
GCN-LSTM [48] 80.5 38.2 28.5 58.3 127.6 22.0
SGAE [46] 80.8 38.4 28.4 58.6 127.8 22.1
ORT [13] 80.5 38.6 28.7 58.4 128.3 22.6
AoANet [14] 80.2 38.9 29.2 58.8 129.8 22.4
$\mathcal{M}^2$ Transformer 80.8 39.1 29.2 58.6 131.2 22.6
Table 2: Comparison with the state of the art on the “Karpathy” test split, in single-model setting.
B-1 B-4 M R C S
Ensemble/Fusion of 2 models
GCN-LSTM [48] 80.9 38.3 28.6 58.5 128.7 22.1
SGAE [46] 81.0 39.0 28.4 58.9 129.1 22.2
ETA [24] 81.5 39.9 28.9 59.0 127.6 22.6
GCN-LSTM+HIP [49] - 39.1 28.9 59.2 130.6 22.3
$\mathcal{M}^2$ Transformer 81.6 39.8 29.5 59.2 133.2 23.1
Ensemble/Fusion of 4 models
SCST [33] - 35.4 27.1 56.6 117.5 -
RFNet [15] 80.4 37.9 28.3 58.3 125.7 21.7
AoANet [14] 81.6 40.2 29.3 59.4 132.0 22.8
$\mathcal{M}^2$ Transformer 82.0 40.5 29.7 59.5 134.5 23.5
Table 3: Comparison with the state of the art on the “Karpathy” test split, using an ensemble of models.
BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE CIDEr
c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40
SCST [33] 78.1 93.7 61.9 86.0 47.0 75.9 35.2 64.5 27.0 35.5 56.3 70.7 114.7 116.7
Up-Down [4] 80.2 95.2 64.1 88.8 49.1 79.4 36.9 68.5 27.6 36.7 57.1 72.4 117.9 120.5
RDN [18] 80.2 95.3 - - - - 37.3 69.5 28.1 37.8 57.4 73.3 121.2 125.2
RFNet [15] 80.4 95.0 64.9 89.3 50.1 80.1 38.0 69.2 28.2 37.2 58.2 73.1 122.9 125.1
GCN-LSTM [48] 80.8 95.9 65.5 89.3 50.8 80.3 38.7 69.7 28.5 37.6 58.5 73.4 125.3 126.5
SGAE [46] 81.0 95.3 65.6 89.5 50.7 80.4 38.5 69.7 28.2 37.2 58.6 73.6 123.8 126.5
ETA [24] 81.2 95.0 65.5 89.0 50.9 80.4 38.9 70.2 28.6 38.0 58.6 73.9 122.1 124.4
AoANet [14] 81.0 95.0 65.8 89.6 51.4 81.3 39.4 71.2 29.1 38.5 58.9 74.5 126.9 129.6
GCN-LSTM+HIP [49] 81.6 95.9 66.2 90.4 51.5 81.6 39.3 71.0 28.8 38.1 59.0 74.1 127.9 130.2
$\mathcal{M}^2$ Transformer 81.6 96.0 66.4 90.8 51.8 82.7 39.7 72.8 29.4 39.0 59.2 74.8 129.3 132.1
Table 4: Leaderboard of various methods on the online MS-COCO test server.

Meshed Connectivity. We then evaluate the role of the meshed connections between encoder and decoder layers. In Table 1, we first introduce a reduced version of our approach in which the $i$-th decoder layer is only connected to the corresponding $i$-th encoder layer (1-to-1), instead of being connected to all encoders. As can be noticed, using this 1-to-1 connectivity schema already brings an improvement with respect to using the output of the last encoder layer as in the standard Transformer (129.2 CIDEr vs 123.6 CIDEr), thus confirming that exploiting a multi-level encoding of image regions is beneficial. When we instead use our meshed connectivity schema, which exploits relationships encoded at all levels and weights them with a sigmoid gating, we observe a further performance improvement, from 129.2 CIDEr to 131.2 CIDEr. This amounts to a total improvement of 7.6 CIDEr points with respect to the standard Transformer with three layers. Also, the result of our full model is superior to that obtained using AoA.

As an alternative to the sigmoid gating approach for weighting the contributions from different encoder layers (Eq. 6), we also test a softmax gating schema. In this case, the element-wise sigmoid applied to each encoder is replaced with the application of a softmax operation over the rows of $\alpha_i$. Using this alternative leads to a reduction of around 1 CIDEr point (130.3 vs 131.2), underlining that it is beneficial to exploit the full potential of a weighted sum of the contributions from all encoding layers, rather than forcing a peaky distribution in which one layer is given more importance than the others.

Role of persistent memory. We evaluate the role of memory vectors in both the 1-to-1 configuration and in the final configuration with meshed connections. As can be seen from Table 1, removing memory vectors leads to a reduction in performance in both connectivity settings (129.2 vs 128.4 and 131.2 vs 129.4 CIDEr), thus confirming the usefulness of exploiting a priori learned knowledge when encoding image regions. Further experiments on the number of memory vectors can be found in the supplementary material.

4.4 Comparison with state of the art

We compare the performance of our approach with that of several recent proposals for image captioning. The models we compare to include SCST [33], which uses attention over a grid of features and a one-layer LSTM language model, and Up-Down [4], which introduces attention over regions and uses a two-layer LSTM language model. Also, we compare to the RFNet approach [15], which uses a recurrent fusion network to merge different CNN features; GCN-LSTM [48], which exploits pairwise relationships between image regions through a Graph Convolutional Neural Network; and SGAE [46], which instead uses auto-encoding scene graphs. Further, we compare with the original AoANet [14] approach, which uses attention on attention for encoding image regions and an LSTM language model. Finally, we compare with ORT [13], which uses a plain Transformer and weights attention scores in the region encoder with pairwise distances between detections.

We evaluate our approach on the COCO “Karpathy” test split, using both single model and ensemble configurations, and on the online COCO evaluation server.

GT: A cat looking at his reflection in the mirror. Transformer: A cat sitting in a window sill looking out. $\mathcal{M}^2$ Transformer: A cat looking at its reflection in a mirror. GT: A plate of food including eggs and toast on a table next to a stone railing. Transformer: A group of food on a plate. $\mathcal{M}^2$ Transformer: A plate of breakfast food with eggs and toast. GT: A truck parked near a tall pile of hay. Transformer: A truck is parked in the grass in a field. $\mathcal{M}^2$ Transformer: A green truck parked next to a pile of hay.
Figure 3: Examples of captions generated by our approach and the original Transformer model, as well as the corresponding ground-truths.
Figure 4: Visualization of attention states for three sample captions. For each generated word, we show the attended image regions, outlining the region with the maximum output attribution in red.

Single model. In Table 2 we report the performance of our method in comparison with the aforementioned competitors, using captions predicted from a single model and optimization on the CIDEr-D score. As can be observed, our method surpasses all other approaches in terms of BLEU-4, METEOR and CIDEr, while being competitive on BLEU-1 and SPICE with the best performer, and slightly worse on ROUGE with respect to AoANet [14]. In particular, it advances the current state of the art on CIDEr by 1.4 points.

Ensemble model. Following the common practice [33, 14] of building an ensemble of models, we also report the performance of our approach when averaging the output probability distributions of multiple, independently trained instances of our model. In Table 3, we use ensembles of two and four models, trained from different random seeds. Noticeably, when using four models our approach achieves the best performance according to all metrics, with an increase of 2.5 CIDEr points with respect to the current state of the art [14].

In-Domain Out-of-Domain Overall
CIDEr SPICE CIDEr SPICE CIDEr SPICE
NBT + CBS [1] 62.1 10.1 62.4 8.9 60.2 9.5
Up-Down + CBS [1] 80.0 12.0 66.4 9.7 73.1 11.1
Transformer 78.0 11.0 29.7 7.8 54.7 9.8
$\mathcal{M}^2$ Transformer 85.7 12.1 38.9 8.9 64.5 11.1
Transformer + CBS 74.3 11.0 62.5 9.2 66.9 10.3
$\mathcal{M}^2$ Transformer + CBS 81.2 12.0 69.4 10.0 75.0 11.4
Table 5: Performances on nocaps validation set, for in-domain and out-of-domain captioning.

Online Evaluation. Finally, we also report the performance of our method on the online COCO test server3. In this case, we use the ensemble of four models previously described, trained on the “Karpathy” training split. The evaluation is done on the COCO test split, for which ground-truth annotations are not publicly available. Results are reported in Table 4, in comparison with the top-performing approaches of the leaderboard. For fairness of comparison, these also employ ensemble configurations. As can be seen, our method surpasses the current state of the art on all metrics, achieving an advancement of 1.4 CIDEr points with respect to the best performer.

4.5 Describing novel objects

We also assess the performance of our approach when dealing with images containing object categories that are not seen in the training set. We compare with the Up-Down model [4] and Neural Baby Talk [27], when using GloVe word embeddings and Constrained Beam Search (CBS) [3] to address the generation of out-of-vocabulary words and constrain the presence of categories detected by an object detector. For our model, we use a simplified implementation of the procedure described in [1] to extract constraints, without using word phrases (e.g. plurals).

Results are shown in Table 5: as can be seen, the original Transformer performs significantly worse than Up-Down on both in-domain and out-of-domain categories, while our approach can properly deal with novel categories, surpassing the Up-Down baseline on both in-domain and out-of-domain images. As expected, the use of CBS significantly enhances performance, in particular on out-of-domain captioning.

4.6 Qualitative results and visualization

Figure 3 reports qualitative results generated by our model and the original Transformer. On average, our model is able to generate more accurate and descriptive captions, integrating fine-grained details and object relations.

Finally, to better understand the effectiveness of our $\mathcal{M}^2$ Transformer, we investigate the contribution of detected regions to the model output. Differently from recurrent-based captioning models, in which attention weights over regions can be easily extracted, in our model the contribution of one region to the output is given by more complex non-linear dependencies. Therefore, we resort to attribution methods: specifically, we employ the Integrated Gradients approach [38], which approximates the integral of gradients with respect to the given input. Results are presented in Figure 4, where we observe that our approach correctly grounds image regions to words, even in the presence of object details and small detections. More visualizations are included in the supplementary material.

5 Conclusion

We presented the $\mathcal{M}^2$ Transformer, a novel Transformer-based architecture for image captioning. Our model incorporates a region encoding approach that exploits a priori knowledge through memory vectors, and a meshed connectivity between encoding and decoding modules. Noticeably, this connectivity pattern has not been explored in other fully-attentive architectures. Experimental results demonstrated that our approach achieves a new state of the art on COCO, ranking first in the online leaderboard. Finally, we validated the components of our model through ablation studies, as well as its performance when describing novel objects.

Appendix A Supplementary material

In the following, we present additional material about our $\mathcal{M}^2$ Transformer model. In particular, we provide additional training and implementation details, further experimental results, and visualizations.

A.1 Additional implementation details

Decoding optimization. As mentioned in Sec. 3.3, during the decoding stage computation cannot be parallelized over time, as the input sequence is iteratively built. A naive approach would be to feed the model at each iteration with the previously generated words, and sample the next predicted word after computing the results of each attention and feed-forward layer over all timesteps. In practice, this requires re-computing the same queries, keys, values and attentive states multiple times, with intermediate results depending on the $t$-th word being recomputed at every subsequent timestep, proportionally to the length of the sampled sequence.

In our implementation, we instead adopt a more computationally friendly approach in which we re-use intermediate results computed at previous timesteps. Each attentive layer of the decoder internally stores previously computed keys and values. At each timestep of the decoding, the model is fed only with the last generated word, and we only compute the queries, keys and values that depend on it.

In PyTorch, this can be implemented by exploiting the register_buffer method of nn.Module, and creating buffers to hold previously computed results. When running on an NVIDIA 2080Ti GPU, we found this to reduce training and inference times by approximately a factor of 3.
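A minimal sketch of this caching scheme is given below; class, buffer and method names are assumptions and may differ from the released implementation.

```python
import torch
from torch import nn

class AttentionCache(nn.Module):
    """Sketch of the decoding-time cache: keys/values of past timesteps are kept in
    buffers, so only the projections of the current input need to be computed."""
    def __init__(self):
        super().__init__()
        self.register_buffer('cached_keys', None)
        self.register_buffer('cached_values', None)

    def update(self, new_keys, new_values):  # new_*: (batch, 1, d) for the current timestep
        if self.cached_keys is None:
            self.cached_keys, self.cached_values = new_keys, new_values
        else:
            self.cached_keys = torch.cat([self.cached_keys, new_keys], dim=1)
            self.cached_values = torch.cat([self.cached_values, new_values], dim=1)
        return self.cached_keys, self.cached_values
```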

Vocabulary and tokenization. We convert all captions to lowercase, remove punctuation characters and tokenize using the spaCy NLP toolkit4. To build vocabularies, we remove all words which appear less than 5 times in the training and validation splits. For each image, we use a maximum number of region feature vectors equal to 50.

Model dimensionality and weight initialization. Using 8 attention heads, the size of queries, keys and values in each head is set to $d_{\text{model}}/8 = 64$. Weights of attention layers are initialized from the uniform distribution proposed by Glorot et al. [10], while weights of feed-forward layers are initialized following He et al. [11]. All biases are initialized to 0. Memory vectors for keys and values are initialized from zero-mean normal distributions, whose variances depend, respectively, on the dimensionality of keys and on the number of memory vectors.
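For illustration, the initialization described above can be sketched as follows; the selection of feed-forward modules by name is an assumption about how the model is organized.

```python
from torch import nn

def initialize_weights(model):
    """Sketch: Xavier init [10] for attention projections, He init [11] for
    feed-forward weights, and zero biases, applied to every linear layer."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            if 'ffn' in name:                        # position-wise feed-forward weights
                nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')
            else:                                    # attention projection weights
                nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```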

Memories B-1 B-4 M R C S
No memory 80.4 38.3 29.0 58.2 129.4 22.6
20 80.7 38.9 29.0 58.4 129.9 22.7
40 80.8 39.1 29.2 58.6 131.2 22.6
60 80.0 37.9 28.9 58.1 129.6 22.5
80 80.0 38.2 29.0 58.3 128.9 22.9
Table 6: Captioning results of the $\mathcal{M}^2$ Transformer using different numbers of memory vectors.
Layers B-1 B-4 M R C S
2 80.5 38.6 29.0 58.4 128.5 22.8
3 80.8 39.1 29.2 58.6 131.2 22.6
4 80.8 38.6 29.1 58.5 129.6 22.6
Table 7: Captioning results of the $\mathcal{M}^2$ Transformer using different numbers of encoder and decoder layers.

A.2 Additional experimental results

Memory vectors. In Table 6, we report the performance of our approach when using a varying number of memory vectors. As can be seen, the best result in terms of BLEU, METEOR, ROUGE and CIDEr is obtained with 40 memory vectors, while 80 memory vectors provide a slightly superior result in terms of SPICE. Therefore, all experiments in the main paper are carried out with 40 memory vectors.

SPICE Obj. Attr. Rel. Color Count Size
Up-Down [4] 21.4 39.1 10.0 6.5 11.4 18.4 3.2
Transformer 21.1 38.6 9.6 6.3 9.2 17.5 2.0
$\mathcal{M}^2$ Transformer 22.6 40.0 11.6 6.9 12.9 20.4 3.5
Table 8: Breakdown of SPICE F-scores over various subcategories.

Encoder and decoder layers. To complement the analysis presented in Sec. 4.3, we also investigate the performance of the $\mathcal{M}^2$ Transformer when changing the number of encoding and decoding layers. Table 7 shows that the best performance is obtained with three encoding and decoding layers, thus confirming the initial findings on the base Transformer model. As our model can deal with a different number of encoding and decoding layers, we also experimented with non-symmetric encoder-decoder architectures, without however noticing significant improvements in performance.

SPICE F-scores. Finally, in Table 8 we report a breakdown of SPICE F-scores over various subcategories on the “Karpathy” test split, in comparison with the Up-Down approach [4] and the base Transformer model with three layers. As can be seen, our model significantly improves on identifying objects, attributes and relationships between objects.

Constraints: horse; cart. Transformer: A horse pulling a cart down a street. $\mathcal{M}^2$ Transformer: A white horse pulling a man in a cart. Constraints: bee; lavender. Transformer: A bee lavender of purple flowers in a field. $\mathcal{M}^2$ Transformer: A field of lavender purple flowers with bee. Constraints: monkey. Transformer: A brown bear sitting on a rock monkey. $\mathcal{M}^2$ Transformer: A small monkey sitting on a rock in the grass. Constraints: flag. Transformer: A red kite with a flag in the sky. $\mathcal{M}^2$ Transformer: A red and white flag flying in the sky. Constraints: bookcase. Transformer: A woman holding a bookcase in a store. $\mathcal{M}^2$ Transformer: A woman holding a book in front of a bookcase. Constraints: rabbit. Transformer: A cat sitting on the rabbit with a cell phone. $\mathcal{M}^2$ Transformer: A rabbit sitting on a table next to a person.
Figure 5: Sample nocaps images and corresponding predicted captions generated by our model and the original Transformer. For each image, we report the Open Images object classes predicted by the object detector and used as constraints during the generation of the caption.

A.3 Qualitative results and visualization

Figure 6 shows additional qualitative results obtained from our model in comparison to the original Transformer and corresponding ground-truth captions. On average, the proposed model shows an improvement in terms of caption correctness and provides more detailed and exhaustive descriptions.

Figures 7 and 8 report the visualization of attention states on a variety of sample images, following the approach outlined in Sec. 4.6 of the main paper. Specifically, the Integrated Gradients approach [38] produces an attribution score for each feature channel of each input region. To obtain the attribution of each region, we average over the feature channels and re-normalize the obtained scores by their sum. For visualization purposes, we apply a contrast stretching function to project scores into the [0, 1] interval.
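This post-processing can be sketched as below; the percentile bounds used for the contrast stretch are illustrative assumptions.

```python
import torch

def region_attribution_scores(attributions, low_pct=2.0, high_pct=98.0):
    """attributions: (n_regions, n_channels) Integrated Gradients output for one word."""
    scores = attributions.mean(dim=-1)               # average over feature channels
    scores = scores / scores.sum().clamp(min=1e-12)  # re-normalize by their sum
    lo = torch.quantile(scores, low_pct / 100.0)     # contrast stretching into [0, 1]
    hi = torch.quantile(scores, high_pct / 100.0)
    return ((scores - lo) / (hi - lo + 1e-12)).clamp(0.0, 1.0)
```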

A.4 Novel object captioning

Figure 5 reports sample captions produced by our approach on images from the nocaps dataset. On each image, we compare with the baseline Transformer and show the constraints provided by the object detector. Overall, the $\mathcal{M}^2$ Transformer is able to better incorporate the constraints while maintaining the fluency and properness of the generated sentences.

Following [1], we use an object detector trained on Open Images 5 and filter detections by removing 39 Open Images classes that contain parts of objects or which are seldom mentioned. We also discard overlapping detections by removing the higher-order of two objects based on the class hierarchy, and we use the top-3 detected objects as constraints based on the detection confidence score. As mentioned in Sec. 4.5 and differently from [1], we do not consider the plural forms or other word phrases of object classes, thus taking into account only the original class names. After decoding, we select the predicted caption with highest probability that satisfies the given constraints.

GT: A man milking a brown and white cow in barn. Transformer: A man is standing next to a cow. $\mathcal{M}^2$ Transformer: A man is milking a cow in a barn. GT: A man in a red Santa hat and a dog pose in front of a Christmas tree. Transformer: A Christmas tree in the snow with a Christmas tree. $\mathcal{M}^2$ Transformer: A man wearing a Santa hat with a dog in front of a Christmas tree. GT: A woman with blue hair and a yellow umbrella. Transformer: A woman is holding an umbrella. $\mathcal{M}^2$ Transformer: A woman with blue hair holding a yellow umbrella. GT: Several people standing outside a parked white van. Transformer: A group of people standing outside of a bus. $\mathcal{M}^2$ Transformer: A group of people standing around a white van. GT: Several zebras and other animals grazing in a field. Transformer: A herd of zebras are standing in a field. $\mathcal{M}^2$ Transformer: A herd of zebras and other animals grazing in a field. GT: A truck sitting on a field with kites in the air. Transformer: A group of cars parked in a field with a kite. $\mathcal{M}^2$ Transformer: A white truck is parked in a field with kites. GT: A woman who is skateboarding down the street. Transformer: A woman walking down a street talking on a cell phone. $\mathcal{M}^2$ Transformer: A woman standing on a skateboard on a street. GT: Orange cat walking across two red suitcases stacked on floor. Transformer: An orange cat sitting on top of a suitcase. $\mathcal{M}^2$ Transformer: An orange cat standing on top of two red suitcases. GT: Some people are standing in front of a red food truck. Transformer: A group of people standing in front of a bus. $\mathcal{M}^2$ Transformer: A group of people standing outside of a food truck. GT: A boat parked in a field with long green grass. Transformer: A field of grass with a fence. $\mathcal{M}^2$ Transformer: A boat in the middle of a field of grass. GT: A little girl is eating a hot dog and riding in a shopping cart. Transformer: A little girl sitting on a bench eating a hot dog. $\mathcal{M}^2$ Transformer: A little girl sitting in a shopping cart eating a hot dog. GT: A grilled sandwich sits on a cutting board by a knife. Transformer: A sandwich sitting on top of a wooden table. $\mathcal{M}^2$ Transformer: A sandwich on a cutting board with a knife. GT: A hotel room with a well-made bed, a table, and two chairs. Transformer: A bedroom with a bed and a table. $\mathcal{M}^2$ Transformer: A hotel room with a large bed with white pillows. GT: An open toaster oven with a glass dish of food inside. Transformer: An open suitcase with food in an oven. $\mathcal{M}^2$ Transformer: A toaster oven with a tray of food inside of it. GT: A empty bench on a snow covered beach. Transformer: Two benches sitting on a beach near the water. $\mathcal{M}^2$ Transformer: A bench sitting on the beach in the snow. GT: A brown and white dog wearing a red and white Santa hat. Transformer: A white dog wearing a red hat. $\mathcal{M}^2$ Transformer: A dog wearing a red and white Santa hat. GT: A man riding a small pink motorcycle on a track. Transformer: A man is riding a red motorcycle. $\mathcal{M}^2$ Transformer: A man riding a pink motorcycle on a track. GT: Three people sit on a bench looking out over the water. Transformer: Two people sitting on a bench in the water. $\mathcal{M}^2$ Transformer: Three people sitting on a bench looking at the water.
Figure 6: Additional sample results generated by our approach and the original Transformer, as well as the corresponding ground-truths.
Figure 7: Visualization of attention states for sample captions generated by our $\mathcal{M}^2$ Transformer. For each generated word, we show the attended image regions, outlining the region with the maximum output attribution in red.
Figure 8: Visualization of attention states for sample captions generated by our $\mathcal{M}^2$ Transformer. For each generated word, we show the attended image regions, outlining the region with the maximum output attribution in red.

Footnotes

  1. Equal contribution.
  2. Taking another perspective, self-attention is also conceptually equivalent to an attentive encoding of graph nodes [41].
  3. https://competitions.codalab.org/competitions/3221
  4. https://spacy.io/
  5. Specifically, the tf_faster_rcnn_inception_resnet_v2_atrous_oidv2 model from the Tensorflow model zoo.

References

  1. H. Agrawal, K. Desai, X. Chen, R. Jain, D. Batra, D. Parikh, S. Lee and P. Anderson (2019) Nocaps: novel object captioning at scale. In Proceedings of the International Conference on Computer Vision, Cited by: §A.4, §4.1, §4.5, Table 5.
  2. P. Anderson, B. Fernando, M. Johnson and S. Gould (2016) SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the European Conference on Computer Vision, Cited by: §4.2.
  3. P. Anderson, B. Fernando, M. Johnson and S. Gould (2017) Guided open vocabulary image captioning with constrained beam search. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §4.5.
  4. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.2, Table 8, §2, §3.3, §3.3, §3.3, §4.2, §4.3, §4.4, §4.5, Table 2, Table 4.
  5. J. Aneja, A. Deshpande and A. G. Schwing (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  6. S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Cited by: §4.2.
  7. M. Cornia, L. Baraldi and R. Cucchiara (2019) Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  8. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.
  9. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  10. X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Cited by: §A.1.
  11. K. He, X. Zhang, S. Ren and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.1.
  12. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  13. S. Herdade, A. Kappeler, K. Boakye and J. Soares (2019) Image Captioning: Transforming Objects into Words. arXiv preprint arXiv:1906.05963. Cited by: §2, §4.4, Table 2.
  14. L. Huang, W. Wang, J. Chen and X. Wei (2019) Attention on Attention for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §1, §2, §4.3, §4.4, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  15. W. Jiang, L. Ma, Y. Jiang, W. Liu and T. Zhang (2018) Recurrent Fusion Network for Image Captioning. In Proceedings of the European Conference on Computer Vision, Cited by: §4.4, Table 2, Table 3, Table 4.
  16. J. Johnson, A. Karpathy and L. Fei-Fei (2016) DenseCap: Fully convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  17. A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.1.
  18. L. Ke, W. Pei, R. Li, X. Shen and Y. Tai (2019) Reflective Decoding Network for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: Table 4.
  19. D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §4.2.
  20. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. Bernstein and L. Fei-Fei (2017) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §4.2.
  21. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci and T. Duerig (2018) The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §4.1.
  22. C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL Workshop, Vol. 8. Cited by: §4.2.
  23. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Cited by: §4.1.
  24. G. Li, L. Zhu, P. Liu and Y. Yang (2019) Entangled Transformer for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §1, §2, Table 3, Table 4.
  25. S. Liu, Z. Zhu, N. Ye, S. Guadarrama and K. Murphy (2017) Improved Image Captioning via Policy Gradient Optimization of SPIDEr. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  26. J. Lu, C. Xiong, D. Parikh and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  27. J. Lu, J. Yang, D. Batra and D. Parikh (2018) Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.5.
  28. K. Papineni, S. Roukos, T. Ward and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Cited by: §4.2.
  29. M. Pedersoli, T. Lucas, C. Schmid and J. Verbeek (2017) Areas of attention for image captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  30. J. Pennington, R. Socher and C. Manning (2014) GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §4.2.
  31. M. Ranzato, S. Chopra, M. Auli and W. Zaremba (2015) Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations, Cited by: §2, §3.3.
  32. S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §4.2.
  33. S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §3.3, §3.3, §3.3, §4.4, §4.4, Table 2, Table 3, Table 4.
  34. R. Socher and L. Fei-Fei (2010) Connecting modalities: semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  35. Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  36. S. Sukhbaatar, E. Grave, G. Lample, H. Jegou and A. Joulin (2019) Augmenting Self-attention with Persistent Memory. arXiv preprint arXiv:1907.01470. Cited by: §2.
  37. C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the International Conference on Computer Vision, Cited by: §1.
  38. M. Sundararajan, A. Taly and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, Cited by: §A.3, §4.6.
  39. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §2, §3.1, §4.2, §4.2, Table 1.
  40. R. Vedantam, C. Lawrence Zitnick and D. Parikh (2015) CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3, §4.2.
  41. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio and Y. Bengio (2018) Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Cited by: footnote 1.
  42. O. Vinyals, A. Toshev, S. Bengio and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  43. O. Vinyals, A. Toshev, S. Bengio and D. Erhan (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. Cited by: §2.
  44. O. Vinyals, A. Toshev, S. Bengio and D. Erhan (2017) Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 652–663. Cited by: §1.
  45. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Cited by: §1, §2.
  46. X. Yang, K. Tang, H. Zhang and J. Cai (2019) Auto-Encoding Scene Graphs for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, §4.4, Table 2, Table 3, Table 4.
  47. B. Z. Yao, X. Yang, L. Lin, M. W. Lee and S. Zhu (2010) I2t: image parsing to text description. Proceedings of the IEEE 98 (8), pp. 1485–1508. Cited by: §2.
  48. T. Yao, Y. Pan, Y. Li and T. Mei (2018) Exploring Visual Relationship for Image Captioning. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2, §4.4, Table 2, Table 3, Table 4.
  49. T. Yao, Y. Pan, Y. Li and T. Mei (2019) Hierarchy Parsing for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: Table 2, Table 3, Table 4.
  50. Q. You, H. Jin, Z. Wang, C. Fang and J. Luo (2016) Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.