Teacher-Critical Training Strategies for Image Captioning

Abstract

Existing image captioning models are usually trained with cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, these widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth captions and the caption model by generating easier-to-learn word proposals as soft targets. The teacher model is constructed by incorporating the ground-truth image attributes into the baseline caption model. To effectively learn from the teacher model, we propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training to facilitate better learning processes for the caption model. Experimental evaluations of several widely adopted caption models on the benchmark MSCOCO dataset show that the proposed TCTS comprehensively enhances most evaluation metrics, especially the Bleu and Rouge-L scores, in both training stages. TCTS achieves the best published single-model Bleu-4 and Rouge-L performances to date, 40.2% and 59.4%, on the MSCOCO Karpathy test split. Our code and pre-trained models will be open-sourced.

Affiliations

Department of Electronic Engineering, Tsinghua University

Introduction

Image captioning aims to automatically generate captions for images in natural language, which is of great significance to both the Computer Vision and Natural Language Processing fields. This challenging task facilitates many practical applications such as human-machine interaction and content-based image retrieval. In general, image captioning models utilize a CNN to encode the visual features of the image and leverage an LSTM Hochreiter and Schmidhuber (1997) or a Transformer Vaswani et al. (2017) as the language decoder to generate the captions. Recently, the attention-based encoder-decoder framework Vinyals et al. (2015) has become prevalent in image captioning. Generally speaking, different attention mechanisms focus on attending to different kinds of information. Visual attention Xu et al. (2015); Anderson et al. (2018) exploits the extracted spatial or object features to effectively utilize the visual information. Semantic attention You et al. (2016); Huang et al. (2020) and adaptive attention focus more on linguistic information such as image attributes and language features. Recently, X-linear attention Pan et al. (2020) was proposed to model higher-level information interactions.

Figure 1: We propose to leverage a teacher caption model to solve the problems in current cross-entropy (XE) and reinforcement learning (RL) training processes.

Although current image captioning methods achieve remarkable performance, the widely adopted model training strategies are far from satisfactory. Captioning models are usually trained in two stages, namely the cross-entropy (XE) training stage and the reinforcement learning (RL) training stage. In the XE training stage, the model is trained to maximize the probability of the subsequent word given the previous ground-truth words. However, we notice that about 98% of the images in the MSCOCO Chen et al. (2015) training set suffer from misalignment. As shown in the green circle in Figure 1, at the second time step the model is fed with the word ‘a’ and is required to predict different words such as ‘policeman’, ‘motorcycle’, and ‘cop’ simultaneously. Such supervision confuses the model as to which word’s probability should be maximized. In the RL stage, the widely adopted self-critical sequence training (SCST) Rennie et al. (2017) computes the CIDEr score to form the reward for the whole caption instead of for each word, which means that every word in the generated caption is assigned the same reward. As shown in the blue circle in Figure 1, the generated caption is semantically close to the ground-truth captions and should be assigned a relatively high reward. Thus, the inaccurate word ‘bike’ also receives a high reward under SCST. Consequently, it is hard for the model to identify that the word ‘bike’ is not appropriate and to correct this mistake. Therefore, we propose to leverage a teacher model, which achieves much better performance than current image captioning models, to guide the training process of the caption model (also denoted as the student model).

Instead of using a larger model with many more parameters and much more computation as the teacher Zhang et al. (2020), we propose to utilize attributes to train the teacher model. The image attributes are the most salient words describing the image, so incorporating the ground-truth attributes enables the teacher model to generate correct keywords in the caption. Several previous works Yao et al. (2017); Li et al. (2019) show that leveraging the ground-truth attributes boosts the performance of the captioning model by nearly 50%. The teacher model equipped with ground-truth attributes is applicable to guiding the student caption model in both the XE and RL stages with our proposed Teacher-Critical Training Strategies (TCTS). In the XE stage, apart from utilizing the ground-truth captions as hard labels, we propose to leverage the teacher model to generate word probabilities as soft labels. As the teacher model generates the same probability distribution for the same input, the misalignment in XE training can be mitigated. In the RL stage, we adopt the caption generated by the teacher, namely the teacher caption, as a sequence-level soft label. Although the teacher caption is not necessarily more accurate than the ground-truth captions, its tone is very similar to that of the model-generated caption. Thus, we can effectively utilize the teacher caption to discriminate between the appropriate and inaccurate words inside the generated caption and adjust their rewards. For instance, the reward for the word ‘bike’ in Figure 1 will be properly lowered. Therefore, the student model will be able to recognize the inaccurate word and predict a more appropriate word instead.

We evaluate both the XE and RL performances of our proposed training strategies on the benchmark MSCOCO dataset. To fairly compare and convincingly validate the effectiveness of our methods, we incorporate TCTS into several baseline models, including Att2in Rennie et al. (2017), BUTD Anderson et al. (2018), X-LAN and X-Trans Pan et al. (2020). Experimental results show that our approaches achieve comprehensive improvements, especially in terms of the Bleu scores, over these baseline models. Our methods achieve to-date the best single-model Bleu-4 scores of 40.2% in the offline evaluation and 39.0%/70.3% (c5/c40) in the online evaluation. The main contributions of our work are as follows:

  • We propose Teacher-Critical Training Strategies (TCTS) to improve the unreasonable supervision mechanisms in current training strategies, achieving state-of-the-art Bleu-4 and Rouge-L performances.

  • TCTS provides additional soft labels to mitigate the misalignment in the XE training stage.

  • In the RL stage, TCTS adjusts the rewards for appropriate words and inaccurate words to facilitate more reasonable reward assignment.

Figure 2: The overall framework of our proposal. We utilize the ground-truth image attributes to construct the teacher caption model. To effectively utilize the attribute information, we implement an attribute attention module and fuse the attended attribute information with the visual information. The teacher model guides the training of the student model in both the XE and RL stages. In the XE stage, the teacher model generates the same probability distribution given the identical previous input word ‘a’ as soft labels to alleviate the misalignment in the ground-truth captions (shown in green). In the RL stage, we compute the Longest Common Subsequence (LCS, shown in red) between the teacher caption and the student caption to adjust the rewards for appropriate words and inaccurate words (shown by the red and blue arrows).

Related Works

Image Captioning  Most modern image captioning methods Xu et al. (2015); You et al. (2016); Lu et al. (2017); Anderson et al. (2018); Huang et al. (2019); Pan et al. (2020) encode the image with CNNs and then decode it with RNNs or the Transformer Vaswani et al. (2017). Multiple attention mechanisms have been proposed to enhance image captioning. In particular, Xu et al. Xu et al. (2015) introduced soft and hard attention mechanisms to select the most relevant image regions during word generation. You et al. You et al. (2016) proposed semantic attention, which attends to the most salient words in the image, namely the image attributes, in the language decoder. Anderson et al. Anderson et al. (2018) leveraged Faster R-CNN Ren et al. (2015) to extract more explicit object features and predicted the image captions via bottom-up and top-down attention. Recently, Pan et al. Pan et al. (2020) designed an X-Linear attention block to capture higher-order interactions between the visual features and achieved state-of-the-art performance.

Training Strategies  Caption models are usually trained in two stages. In the first stage, the model is trained to predict the next token given the previous ground-truth words under the cross-entropy (XE) loss. Zhang et al. Zhang et al. (2020) introduced a teacher-recommended method to distill knowledge from an external language model. Although they also proposed to leverage soft labels in XE training with a teacher model, taking a pure language model as the teacher may suffer from modality bias, since the visual feature is significant in visual captioning. Thus, we propose to directly take an image captioning model as the teacher model to more effectively mitigate the misalignment in XE training. Moreover, while their method requires additional data to train the teacher model, our teacher model is constructed by additionally taking the ground-truth image attributes as input, which can be easily obtained from the ground-truth captions.

In the second stage, reinforcement learning (RL) is introduced to directly optimize evaluation metrics such as CIDEr and Bleu. Rennie et al. Rennie et al. (2017) introduced the self-critical sequence training (SCST) method, which forms the reward baseline from the current model under the inference algorithm. Discriminative rewards for individual words are not considered under this training strategy. To remedy this, Zhang et al. Zhang et al. (2017) used another RNN to predict the state value function for different words. However, the value they compute is not directly related to the evaluation scores, which introduces estimation bias. Recently, Gao et al. Gao et al. (2019) proposed N-step SCST to estimate the per-token value by additionally generating several captions in the training process. Although their method utilizes the differences of CIDEr scores to assess each word’s value, it omits evaluating the quality of the whole caption. Thus, the model may fall into local maxima where each word is plausible but the whole caption is less satisfying. Moreover, it incurs a relatively high computational cost, requiring several additional decoding passes and CIDEr computations in one iteration. While these previous methods criticize the captioning model by itself or with a value network, we leverage a model that achieves much higher performance than current captioning models as the teacher to criticize the caption model from a more experienced perspective. Moreover, the major additional computational cost of our method is only searching for the longest common subsequence at each training step, which can be done with a polynomial-time algorithm.

Method

Figure 2 gives an overview of our proposed method. We first construct a teacher model that exploits both the visual features and the ground-truth image attributes. The teacher model generates easier-to-learn word proposals in both the XE and RL training stages to alleviate the problems in current training strategies. The implementation details of the teacher model and the Teacher-Critical Training Strategies (TCTS) are elaborated in this section.

Constructing the Teacher Model

In this work, we adopt the X-LAN Pan et al. (2020) model as the backbone. This model consists of an X-linear attention based visual feature encoder and an X-linear attention based LSTM decoder. While the vanilla X-LAN model only exploits the visual features of each image, we additionally leverage another encoder to encode the embeddings of the ground-truth image attributes, as shown in the brown box in Figure 2. Denoting the attended visual feature and attended attribute feature as $\hat{v}$ and $\hat{a}$ respectively, we fuse them with the output $h_t$ of the LSTM layer as in Eq. 1-3. Here $\lambda_v$ and $\lambda_a$, each element of which lies in the range $[0, 1]$, are the fusion weights, and the $W$s are trainable parameters.

$\lambda_v = \sigma\left(W_v\,[\hat{v};\, h_t]\right)$  (1)
$\lambda_a = \sigma\left(W_a\,[\hat{a};\, h_t]\right)$  (2)
$f_t = \lambda_v \odot \hat{v} + \lambda_a \odot \hat{a}$  (3)

The fused feature $f_t$ is then concatenated with $h_t$ and sent to a Gated Linear Unit (GLU) layer to generate the context $c_t$ as in Eq. 4. The context $c_t$ is finally fed to a linear layer to generate the probability distribution $p_t$ over the vocabulary, as shown in Eq. 5.

$c_t = \mathrm{GLU}\left(W_c\,[f_t;\, h_t]\right)$  (4)
$p_t = \mathrm{softmax}\left(W_p\, c_t\right)$  (5)

Taking the ground-truth attributes as privileged knowledge, the teacher model is capable of generating the correct keywords when narrating captions and forms applicable guidance for the student in both the XE and RL training stages.
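For concreteness, the sketch below shows one way the gated fusion (Eq. 1-3) and the GLU-based context and word distribution (Eq. 4-5) could be implemented in PyTorch; the module and tensor names (AttributeFusion, v_hat, a_hat, h_t) are our own illustrative choices rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeFusion(nn.Module):
    """Illustrative fusion of attended visual and attribute features (Eq. 1-5)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.w_v = nn.Linear(2 * d_model, d_model)      # fusion gate for the visual feature (Eq. 1)
        self.w_a = nn.Linear(2 * d_model, d_model)      # fusion gate for the attribute feature (Eq. 2)
        self.w_c = nn.Linear(2 * d_model, 2 * d_model)  # pre-GLU projection (Eq. 4)
        self.w_p = nn.Linear(d_model, vocab_size)       # word distribution (Eq. 5)

    def forward(self, v_hat, a_hat, h_t):
        # v_hat / a_hat: attended visual / attribute features, h_t: LSTM output,
        # all of shape (batch, d_model).
        lam_v = torch.sigmoid(self.w_v(torch.cat([v_hat, h_t], dim=-1)))  # Eq. 1
        lam_a = torch.sigmoid(self.w_a(torch.cat([a_hat, h_t], dim=-1)))  # Eq. 2
        fused = lam_v * v_hat + lam_a * a_hat                             # Eq. 3
        c_t = F.glu(self.w_c(torch.cat([fused, h_t], dim=-1)), dim=-1)    # Eq. 4
        return F.softmax(self.w_p(c_t), dim=-1)                           # Eq. 5
```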

TCTS for XE Training

Denoting the ground-truth caption as $y^* = \{y^*_1, \dots, y^*_T\}$, the caption model is forced to maximize the probability of these ground-truth words in the XE training stage. Specifically, the loss function is computed as in Eq. 6, where $\theta$ denotes the trainable parameters of the caption model and $p_\theta$ is the word probability distribution it predicts.

$L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y^*_t \mid y^*_{1:t-1}\right)$  (6)

It can be noticed that only the probability of the ground-truth word is directly optimized, since the target at each time step is a one-hot vector that equals 1 only at the position of the ground-truth word. This constraint introduces misalignment, since several corresponding words shown in green in Figure 2, such as ‘dog’, ‘small’, and ‘kitten’, are simultaneously forced to reach the top probability given the same previous word ‘a’. Alternatively, we adopt the word probability distribution $q_t$ generated by the teacher model in its forward pass as the soft label, as shown at the bottom of Figure 2. We additionally compute the Kullback-Leibler (KL) divergence between the distributions of the teacher model and the student model to soften the loss function. The KL divergence loss is formulated in Eq. 7, where $|\Sigma|$ is the vocabulary size. As the teacher model generates the same probability distribution for identical inputs, the misalignment in the XE training can be mitigated. Thus, the student model is taught to generate a more reasonable probability distribution rather than to maximize the probability of the ground-truth word at each time step.

$L_{KL}(\theta) = \sum_{t=1}^{T} \sum_{k=1}^{|\Sigma|} q_t(k)\, \log \frac{q_t(k)}{p_\theta\!\left(k \mid y^*_{1:t-1}\right)}$  (7)

The KL divergence loss is finally summed with the XE loss with the weight $\beta$ to form the teacher-critical XE loss in Eq. 8.

$L_{TC\text{-}XE}(\theta) = L_{XE}(\theta) + \beta\, L_{KL}(\theta)$  (8)
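A minimal sketch of the teacher-critical XE objective (Eq. 6-8), assuming the teacher's per-step word distributions have already been computed in its forward pass; the function and argument names are illustrative, and padding positions are only masked in the XE term for brevity.

```python
import torch.nn.functional as F

def tcts_xe_loss(student_logits, teacher_probs, targets, beta=1.0, pad_id=0):
    # student_logits: (batch, T, vocab) raw scores from the student model.
    # teacher_probs:  (batch, T, vocab) soft labels from the teacher's forward pass.
    # targets:        (batch, T) ground-truth word indices.
    # beta is the KL weight of Eq. 8 (its actual value is not asserted here).
    vocab = student_logits.size(-1)
    xe = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1),
                         ignore_index=pad_id)                           # Eq. 6
    log_p_student = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(log_p_student, teacher_probs, reduction='batchmean')  # Eq. 7
    return xe + beta * kl                                               # Eq. 8
```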

TCTS for RL Training

Only optimizing the captioning model with the XE loss leads to exposure bias: the model is fed with the ground-truth word during training but with the previously generated word at inference time. More importantly, the evaluation metrics are not directly optimized in the training process. Consequently, several reinforcement learning (RL) strategies have been proposed to remedy these inconsistencies. We first briefly introduce the general idea of applying RL to image captioning. The caption model, parameterized by $\theta$, can be viewed as an agent that receives a reward $r$ after it takes an action (generates a caption $y^s$) against the current environment (the input features). The goal of reinforcement learning is to minimize the negative expected reward as in Eq. 9, where $p_\theta$ is the policy defined by the parameters $\theta$. In practice, the expected reward is estimated with a single sample $y^s$ drawn from $p_\theta$. Therefore, the gradient of reinforcement learning is formulated as in Eq. 10.

$L_{RL}(\theta) = -\,\mathbb{E}_{y^s \sim p_\theta}\left[r(y^s)\right]$  (9)
$\nabla_\theta L_{RL}(\theta) \approx -\,r(y^s)\,\nabla_\theta \log p_\theta(y^s)$  (10)

In the widely adopted Self-Critical Sequence Training (SCST) Rennie et al. (2017), the reward is the CIDEr score of the sampled caption $y^s$ minus the CIDEr score of the greedy-decoded caption $\hat{y}$, as in Eq. 11.

$r(y^s) = \mathrm{CIDEr}(y^s) - \mathrm{CIDEr}(\hat{y})$  (11)

Note that this reward is computed for the whole caption, which means that the reward of every word in the sampled caption is the same. However, as shown at the top of Figure 2, the student model correctly narrates ‘standing’ but erroneously states ‘two dogs’. It is not reasonable to assign ‘standing’ and ‘two dogs’ the same reward.
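For reference, the vanilla SCST reward of Eq. 11 can be sketched as below; cider_score is a placeholder for any CIDEr implementation (e.g. the coco-caption toolkit) and is passed in as a callable rather than defined here.

```python
def scst_rewards(sampled_caption, greedy_caption, references, cider_score):
    # cider_score(candidate_tokens, reference_captions) -> float (placeholder callable).
    r = cider_score(sampled_caption, references) - cider_score(greedy_caption, references)  # Eq. 11
    # Under vanilla SCST every word of the sampled caption shares the same reward.
    return [r] * len(sampled_caption)
```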

Consequently, we additionally leverage the teacher model to criticize the caption sampled from the student model and assign more reasonable rewards to each word. In practice, the teacher greedily decodes a caption denoted as $y^T$. As the teacher model is trained with the ground-truth attributes, most keywords in the teacher caption are correct. Thus, the teacher caption is well formulated and can be leveraged to guide the student model. We then search for the Longest Common Subsequence (LCS) between the teacher caption and the student caption, as shown at the top of Figure 2. A subsequence is a sequence that appears in the same relative order but is not necessarily contiguous; the LCS problem is a classic computer science problem that aims at finding the longest subsequence present in both given sequences. The words inside the LCS are assumed to be the appropriate words that should be assigned more reward. The remaining words are more likely to be inaccurate, so their rewards should be decreased. Suppose there are $m$ appropriate words and $n$ inaccurate words in the student caption; we compute the normalized CIDEr score of the teacher caption to modify the original reward in Eq. 11. The additional rewards for the appropriate words and the inaccurate words are formulated in Eq. 12-14.

$\bar{c} = \frac{\mathrm{CIDEr}(y^T)}{m+n}$  (12)
$r_{app} = \frac{n}{m+n}\,\bar{c}$  (13)
$r_{ina} = -\,\frac{m}{m+n}\,\bar{c}$  (14)

The additional rewards are added to the original reward with the weight $\lambda$ in Eq. 15 to form the teacher-critical reward for each appropriate and inaccurate word respectively, as shown by the red and blue arrows in Figure 2.

$r_{TC}(w_t) = r(y^s) + \lambda \cdot \begin{cases} r_{app}, & w_t \in \mathrm{LCS}(y^s, y^T) \\ r_{ina}, & \text{otherwise} \end{cases}$  (15)

Note that the normalized additional rewards of a caption sum to zero; therefore, adopting the teacher-critical rewards does not violate the goal of optimizing the CIDEr score in the reinforcement learning stage, but encourages the student model to generate more appropriate words and to replace the inaccurate ones.
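The sketch below illustrates our reading of the teacher-critical reward assignment (Eq. 12-15): a standard dynamic program marks the student words that lie on an LCS with the teacher caption, a zero-sum bonus derived from the teacher's CIDEr score is split between on-LCS and off-LCS words, and the bonus is added to the SCST reward with weight lam. The exact normalization in Eq. 12 is an assumption; only its zero-sum property is stated above.

```python
def lcs_mask(student, teacher):
    # Classic O(len(student) * len(teacher)) LCS dynamic program; returns a
    # boolean mask marking the student words that belong to one LCS.
    m, n = len(student), len(teacher)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = (dp[i + 1][j + 1] + 1 if student[i] == teacher[j]
                        else max(dp[i + 1][j], dp[i][j + 1]))
    mask, i, j = [False] * m, 0, 0
    while i < m and j < n:
        if student[i] == teacher[j]:
            mask[i] = True
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask

def teacher_critical_rewards(base_reward, student, teacher, teacher_cider, lam=1.0):
    mask = lcs_mask(student, teacher)
    m = sum(mask)               # appropriate words (on the LCS)
    n = len(student) - m        # inaccurate words (off the LCS)
    c_bar = teacher_cider / max(m + n, 1)   # Eq. 12 (assumed normalization)
    r_app = c_bar * n / max(m + n, 1)       # Eq. 13
    r_ina = -c_bar * m / max(m + n, 1)      # Eq. 14 (m*r_app + n*r_ina == 0)
    # Eq. 15: per-word reward = shared SCST reward + weighted zero-sum bonus.
    return [base_reward + lam * (r_app if on_lcs else r_ina) for on_lcs in mask]
```

For example, if the student caption shares 7 of its 10 words with the teacher caption, the 7 on-LCS words each receive a small positive bonus and the 3 remaining words a larger negative one, leaving the caption-level CIDEr objective unchanged.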

It should be noted that teacher captions are not necessarily better than the ground-truth captions semantically. However, since they are generated by a model instead of human beings, they better follow the tone of how a deep model describes images. Therefore, they are probably more achievable supervision for the student model compared to the human-labeled ground-truth. In fact, a teacher caption can be viewed as a sequence-level soft label that effectively bridges the gap between the student model and the human-labeled ground-truth captions. Detailed comparisons are elaborated in the quantitative analysis.

| Methods | B-1 | B-2 | B-3 | B-4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BUTD Anderson et al. (2018) | 77.2 | - | - | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 |
| +TCTS (ours) | 77.5 | 61.7 | 47.6 | 37.0 | 28.1 | 57.3 | 116.6 | 21.2 |
| X-Trans Pan et al. (2020) | 77.3 | 61.5 | 47.8 | 37.0 | 28.7 | 57.5 | 120.0 | 21.8 |
| +TCTS (ours) | 77.6 | 61.9 | 48.3 | 37.5 | 28.7 | 57.7 | 120.8 | 21.8 |
| X-LAN Pan et al. (2020) | 78.0 | 62.3 | 48.9 | 38.2 | 28.8 | 58.0 | 122.0 | 21.9 |
| +TCTS (ours) | 78.3 | 62.9 | 49.3 | 38.3 | 28.9 | 58.2 | 122.3 | 22.0 |
Table 1: Single model offline performances (%) of various methods on MSCOCO trained by cross-entropy loss only, where B-N, M, R, C and S are short for Bleu-N, Meteor, Rouge-L, CIDEr and SPICE scores. The best score in each column is marked in boldface.
| Methods | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | Meteor | Rouge-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCN-LSTM Yao et al. (2018) | 80.5 | - | - | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
| LBPF Qin et al. (2019) | 80.5 | - | - | 38.3 | 28.5 | 58.4 | 127.6 | 22.0 |
| SGAE Yang et al. (2019) | 80.8 | - | - | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
| AoA Huang et al. (2019) | 80.2 | - | - | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| MAD+SAP Huang et al. (2020) | - | - | - | 38.6 | 28.7 | 58.5 | 128.8 | 22.2 |
| M2Trans Cornia et al. (2020) | 80.8 | - | - | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| BUTD Anderson et al. (2018) | 79.8 | - | - | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| +TCTS (ours) | 80.3 | 65.0 | 50.7 | 38.8 | 28.6 | 58.7 | 126.4 | 22.1 |
| X-Trans Pan et al. (2020) | 80.9 | 65.8 | 51.5 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| X-Trans* | 81.0 | 65.8 | 51.5 | 39.6 | 29.4 | 59.0 | 132.0 | 23.4 |
| +TCTS (ours) | 81.2 | 66.1 | 51.9 | 40.1 | 29.5 | 59.3 | 132.3 | 23.5 |
| X-LAN Pan et al. (2020) | 80.8 | 65.6 | 51.4 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| X-LAN* | 80.7 | 65.5 | 51.2 | 39.3 | 29.5 | 59.0 | 131.7 | 23.3 |
| +TCTS (ours) | 81.0 | 66.1 | 52.0 | 40.2 | 29.5 | 59.4 | 132.2 | 23.4 |
Table 2: Single model offline performances (%) of various methods on MSCOCO trained by reinforcement learning.

Experiments

Experimental Settings

We evaluate our model on the MSCOCO captioning dataset Chen et al. (2015). Words that appear in the training set more than a fixed number of times are selected to form the vocabulary. We follow the widely adopted Karpathy data split Karpathy and Fei-Fei (2015) in the offline evaluation. We utilize the ResNet-101 object features released with BUTD Anderson et al. (2018) for image captioning. For fair comparison with the state-of-the-art X-LAN Pan et al. (2020) and for reproducibility, the model sizes (including the random seeds) and learning rate schedules are set identically to their open-source code. Our models are first trained under the XE+TCTS loss and then under SCST+TCTS. To evaluate the image captioning performance, the following metrics are used: Bleu Papineni et al. (2002), Meteor Denkowski and Lavie (2014), Rouge-L Lin (2004), CIDEr Vedantam et al. (2015), and SPICE Anderson et al. (2016). Similar to Fang et al. (2015); Gan et al. (2017), the 1000 most frequent words are selected to form the attribute vocabulary of the teacher model. The teacher model is fixed after its XE and RL training. The TCTS weights $\beta$ and $\lambda$ are set for the XE and RL training stages respectively.
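As a rough illustration of how the ground-truth attributes can be derived from the captions themselves, the sketch below keeps the 1000 most frequent training-set words as the attribute vocabulary and takes, for each image, the caption words that fall inside it; stop-word handling and any lemmatization are simplifications, not the exact pipeline of Fang et al. (2015) or Gan et al. (2017).

```python
from collections import Counter

def build_attribute_vocab(image_to_captions, vocab_size=1000, stop_words=frozenset()):
    # image_to_captions: dict mapping image id -> list of tokenized ground-truth captions.
    counter = Counter()
    for captions in image_to_captions.values():
        for caption in captions:
            counter.update(w for w in caption if w not in stop_words)
    return {w for w, _ in counter.most_common(vocab_size)}

def ground_truth_attributes(captions, attribute_vocab):
    # The ground-truth attributes of an image are the words of its captions
    # that appear in the attribute vocabulary.
    return {w for caption in captions for w in caption if w in attribute_vocab}
```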

Performance Evaluation

Offline Evaluation  Table 1 reports the performances of our proposed TCTS with XE training on several baseline models. It can be noticed that, with the misalignment mitigated, comprehensive improvements are achieved on these baseline models. In Table 2 we present the gains of incorporating TCTS in the RL stage. Note that the original X-LAN and X-Trans Pan et al. (2020) models are trained with a larger batch size than our GPUs allow; consequently, we re-train these two models with our smaller batch size for fair comparison. The performances of the re-trained models are reported with ‘*’. In general, our proposed TCTS also consistently exhibits better performance than the baseline models in the RL stage, especially in terms of the Bleu and Rouge-L scores. This is because the proposed TCTS encourages the student model to generate captions that have a longer LCS with the teacher captions. As the Bleu-4 and Rouge-L scores are closely related to the identical 4-grams and the LCS between the generated caption and the ground-truth, the higher scores imply that the semantic quality of the teacher captions is high enough to guide the student model in the training process.

Comparing with other recent image captioning methods in Table 2, we notice that the SGAE Yang et al. (2019) and the BUTD+TCTS model show comparable performances. While SGAE adopts the BUTD model as the backbone and additionally incorporates scene graph features, the BUTD+TCTS model outperforms SGAE in terms of Bleu-4, Meteor, and Rouge-L scores without taking in any additional inputs. Utilizing the LSTM-based X-LAN teacher to guide X-Trans* leads to relatively less improvement, perhaps because the LSTM is less powerful than the Transformer in the RL stage. Putting Table 1 and Table 2 together, we can see that directly applying the widely adopted SCST Rennie et al. (2017) improves the Bleu-4 and Rouge-L scores of the X-LAN* model by 1.1% and 1.0%. With the guidance from the teacher model, however, the Bleu-4 and Rouge-L scores enjoy improvements of 1.9% and 1.2% respectively, which suggests that our proposed RL method is more effective than SCST on the state-of-the-art captioning model. In general, utilizing our proposed TCTS reaches the best Bleu and Rouge-L scores among all compared methods. These performance comparisons again verify the effectiveness of adopting TCTS in the RL stage.

Online Evaluation Table 3 reports the performance gain of utilizing our proposed TCTS on the X-LAN* model in the RL stage on the official MSCOCO evaluation server. Compared with other methods that utilize an ensemble of multiple captioning models, we can see that our single model achieves very competitive performance. Similar to the offline evaluation, we reach the best Bleu-2, Bleu-3, Bleu-4 and Rouge-L scores among all the compared methods.

| Methods | Bleu-1 c5/c40 | Bleu-2 c5/c40 | Bleu-3 c5/c40 | Bleu-4 c5/c40 | Meteor c5/c40 | Rouge-L c5/c40 | CIDEr c5/c40 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SCST Rennie et al. (2017) | 78.1/93.7 | 61.9/86.0 | 47.0/75.9 | 35.2/64.5 | 27.0/35.5 | 56.3/70.7 | 114.7/116.7 |
| LSTM-A Yao et al. (2017) | 78.7/93.7 | 62.7/86.7 | 47.6/76.5 | 35.6/65.2 | 27.0/35.4 | 56.4/70.5 | 116.0/118.0 |
| Up-Down Anderson et al. (2018) | 80.2/95.2 | 64.1/88.8 | 49.1/79.4 | 36.9/68.5 | 27.6/36.7 | 57.1/72.4 | 117.9/120.5 |
| CAVP Zha et al. (2019) | 80.1/94.9 | 64.7/88.8 | 50.0/79.7 | 37.9/69.0 | 28.1/37.0 | 58.2/73.1 | 121.6/123.8 |
| SGAE Yang et al. (2019) | - | - | - | 38.5/69.7 | 28.2/37.2 | 58.6/73.6 | 123.8/126.5 |
| MAD+SAP Huang et al. (2020) | 80.5/94.9 | 65.1/89.1 | 50.4/80.0 | 38.4/69.4 | 28.6/37.7 | 58.7/73.3 | 125.1/127.0 |
| X-LAN Pan et al. (2020) | 80.3/94.7 | 65.0/88.9 | 50.5/80.2 | 38.6/69.8 | 29.1/38.3 | 58.7/73.8 | 125.4/127.3 |
| X-LAN* | 80.3/94.8 | 65.0/88.9 | 50.5/80.2 | 38.5/69.8 | 28.9/38.2 | 58.6/73.8 | 124.9/126.9 |
| +TCTS (ours) | 80.5/94.8 | 65.3/89.1 | 50.9/80.5 | 39.0/70.3 | 29.0/38.4 | 58.9/74.0 | 125.3/127.2 |
Table 3: Performance (%) on the online MSCOCO evaluation server. Several of the compared methods use an ensemble of multiple captioning models.
Figure 3: Performance comparisons of adopting TCTS over SCST in terms of Rouge-L, Bleu-4, and CIDEr scores. The horizontal axes are the training epochs and the vertical axes are the corresponding metrics on the MSCOCO Karpathy test split.

Quantitative Analysis

The Quality of the Teacher Caption  The key to our proposed TCTS in the RL stage is that captions generated by the teacher model are better guidance for the caption model. To quantitatively demonstrate the quality of the teacher captions, we use them as the ground-truth captions to train a captioning model with self-critical sequence training Rennie et al. (2017). As the teacher model only generates one caption per image, we randomly select one human-labeled ground-truth caption per image to train another caption model for a fair comparison. The comparison shown in Table 4 suggests that using the teacher caption reaches a higher CIDEr score under CIDEr-optimized SCST than using one ground-truth caption. This is because the teacher caption is narrated by a model rather than a human being. Although the teacher caption is not necessarily better than the ground-truth caption in depicting the image, its language style is closer to the model’s way of narrating captions. Consequently, utilizing such captions as additional guidance in our proposed TCTS is beneficial for training caption models.

| Methods | B-1 | B-2 | B-3 | B-4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OneGTCaption | 77.3 | 61.8 | 48.2 | 37.5 | 29.6 | 58.4 | 125.4 | 22.4 |
| TeacherCap | 77.2 | 61.8 | 48.3 | 37.3 | 29.8 | 58.5 | 126.1 | 23.0 |
Table 4: Single model offline performances (%) of adopting different ground-truth captions in SCST.
Figure 4: Qualitative results of our proposed TCTS (in red) compared with the ground-truth caption (in green) and the X-LAN* model trained under SCST (in blue).

Evaluating Different Training Strategies  We compare our method with several commonly used training strategies in Table 5. For fair comparison with the previous works, we re-implement the Att2in model Rennie et al. (2017), which adopts ResNet He et al. (2016) features for caption generation. It can be seen that our proposed TCTS achieves comprehensive improvements in the XE training stage. In the RL stage, we outperform the SCST baseline by a large margin. Compared with the more recent N-SCST Gao et al. (2019) method, our model also achieves the best scores in all the metrics. Specifically, our method enhances the CIDEr score by 3.4% over N-SCST, which suggests that our method performs much better in the CIDEr-optimized RL training process.

It is worth noticing that adopting TCTS yields more gain on the Att2in model than on the X-LAN* model. Considering that the Att2in model contains far fewer parameters than the X-LAN* model, we find that our method is more effective on lighter models. Consequently, it is promising that our method could contribute to practical implementations of image captioning models.

| Methods | B-1 | B-2 | B-3 | B-4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-Entropy Loss | | | | | | | | |
| XE Rennie et al. (2017) | - | - | - | 31.3 | 26.0 | 54.3 | 101.4 | - |
| +TCTS (ours) | 75.3 | 59.1 | 45.2 | 34.4 | 26.5 | 55.4 | 107.5 | 19.7 |
| Reinforcement Learning | | | | | | | | |
| SCST Rennie et al. (2017) | - | - | - | 33.3 | 26.3 | 55.3 | 111.4 | - |
| N-SCST Gao et al. (2019) | 77.9 | 61.5 | 46.8 | 35.0 | 26.9 | 56.3 | 115.2 | 20.4 |
| +TCTS (ours) | 78.1 | 62.1 | 47.4 | 35.7 | 26.9 | 56.8 | 118.6 | 20.4 |
Table 5: Single model offline performances (%) of adopting various training strategies on the Att2in model.

Improvements Visualization  To clearly demonstrate the gain of adopting our proposed teacher-critical training strategies over the widely adopted SCST, we compare the performance of three X-LAN* models and three X-LAN*+TCTS models trained with the same three random seeds. The average and standard deviation of these models’ Rouge-L, Bleu-4, and CIDEr scores on the MSCOCO test set are shown in Figure 3. Our method shows remarkable improvements over the SCST baseline in terms of the Rouge-L and Bleu-4 scores, due to the discriminative reward computation strategy adopted in TCTS. Our training strategy encourages the student model to generate a longer LCS with the teacher caption and penalizes the words that are not in the LCS; thus the Rouge-L and Bleu-4 scores can be effectively enhanced. Figure 3(c) shows that adopting TCTS even improves the CIDEr performance in later epochs. Consequently, additionally adopting our method not only does not violate the goal of optimizing the CIDEr score but also increases the CIDEr with the help of the teacher model.

Qualitative Results

Figure 4 shows some qualitative results of our proposed TCTS method against the ground-truth captions and the X-LAN* model trained with SCST Rennie et al. (2017). The images are chosen from the MSCOCO Karpathy test split. We show three captions for each image, in which the green, blue, and red captions are the ground-truth, the X-LAN* output, and the X-LAN*+TCTS output respectively. Generally, it can be noticed that with the incorporation of TCTS, the student model recognizes more correct keywords in the image. This is because the teacher model is trained with the ground-truth attributes and is therefore capable of generating captions with correct keywords. In the training process, the student model is encouraged to generate more appropriate keywords to prolong the LCS. Such an ability to recognize objects and generate correct keywords generalizes to the test set and helps the student model generate precise and detailed captions. As shown in Figures 4(a)-(c), while the baseline X-LAN* model fails to depict the correct instances in the image, the X-LAN*+TCTS model is able to precisely recognize the ‘Christmas tree’, the ‘toothbrush’, and the ‘tarmacs’. Figures 4(d)-(f) show that TCTS helps the student model generate more detailed captions. Our model narrates a caption very similar to the ground-truth in Figure 4(d), which means the teacher even helps the student to better learn the tone of the ground-truth to some extent. In Figure 4(f), our model even depicts the image in more detail than the ground-truth caption, which further verifies the effectiveness of the proposed TCTS.

Conclusions

In this paper, we propose Teacher-Critical Training Strategies (TCTS) to effectively leverage a teacher model to improve the unreasonable supervision mechanisms in the commonly adopted XE and RL training processes. The experimental results on MSCOCO suggest that our method achieves comprehensive improvements in the XE stage and obtains remarkable enhancements in terms of the Bleu and Rouge-L scores in the RL stage. The noticeable improvement over SCST on the light Att2in model suggests that our method is well suited to practical implementations of image captioning models. Currently, our method does not assign a fully individual per-token reward to each word in the student caption; we leave developing a more advanced teacher-critical per-token reward assignment strategy, to further boost the CIDEr score, as future work.

References

  1. Anderson et al. (2016). SPICE: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398.
  2. Anderson et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  3. Chen et al. (2015). Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  4. Cornia et al. (2020). Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587.
  5. Denkowski and Lavie (2014). Meteor universal: language specific translation evaluation for any target language. In Proceedings of the 9th Workshop on Statistical Machine Translation, pp. 376–380.
  6. Fang et al. (2015). From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1482.
  7. Gan et al. (2017). Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  8. Gao et al. (2019). Self-critical n-step training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6300–6308.
  9. He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  10. Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  11. Huang et al. (2019). Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4634–4643.
  12. Huang et al. (2020). Image captioning with end-to-end attribute detection and subsequent attributes prediction. IEEE Transactions on Image Processing 29, pp. 4013–4026.
  13. Karpathy and Fei-Fei (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137.
  14. Li et al. (2019). Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8928–8937.
  15. Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Proc. Text Summarization Branches Out, pp. 1–8.
  16. Lu et al. (2017). Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3242–3250.
  17. Pan et al. (2020). X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10971–10980.
  18. Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  19. Qin et al. (2019). Look back and predict forward in image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8367–8375.
  20. Ren et al. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems.
  21. Rennie et al. (2017). Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1179–1195.
  22. Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  23. Vedantam et al. (2015). CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575.
  24. Vinyals et al. (2015). Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
  25. Xu et al. (2015). Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pp. 2048–2057.
  26. Yang et al. (2019). Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694.
  27. Yao et al. (2018). Exploring visual relationship for image captioning. In European Conference on Computer Vision, pp. 684–699.
  28. Yao et al. (2017). Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 22–29.
  29. You et al. (2016). Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659.
  30. Zha et al. (2019). Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  31. Zhang et al. (2017). Actor-critic sequence training for image captioning. arXiv preprint arXiv:1706.09601.
  32. Zhang et al. (2020). Object relational graph with teacher-recommended learning for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13278–13288.