Self-critical n-step Training for Image Captioning
Abstract
Existing methods for image captioning are usually trained with a cross-entropy loss, which leads to exposure bias and to an inconsistency between the optimization objective and the evaluation metrics. Recently it has been shown that these two issues can be addressed with techniques from reinforcement learning. One popular technique is the advantage actor-critic algorithm, which calculates a per-token advantage by estimating the state value with a parametrized estimator, at the cost of introducing estimation bias. In this paper, we estimate the state value without a parametrized value estimator. Given two properties of image captioning, namely the deterministic state transition function and the sparse reward, the state value is equivalent to its preceding state-action value, and we reformulate the advantage function by simply replacing the former with the latter. Moreover, the reformulated advantage is extended to n steps, which generally increases the absolute value of the mean of the reformulated advantage while lowering its variance. Two kinds of rollout are then adopted to estimate the state-action value; we call the resulting method self-critical n-step training. Empirically, we find that our method outperforms state-of-the-art methods that use the sequence-level advantage or a parametrized estimator on the widely used MSCOCO benchmark.
1 Introduction
Image captioning aims at automatically generating natural-language captions for images, which is of great significance for scene understanding. It is a very challenging task: it requires recognizing the important objects in the image, as well as their attributes and the relationships between them, so that they can be described properly in natural language. The ability of a machine to mimic humans in expressing rich information in grammatically correct natural language is important, since it can be applied to human-robot interaction and to guiding blind users.
Inspired by the encoder-decoder framework introduced for machine translation in [6], most recent works in image captioning have adopted this paradigm to generate captions for images [24]. In general, an encoder, e.g. a convolutional neural network (CNN), encodes images into visual features, while a decoder, e.g. a long short-term memory network (LSTM) [10], decodes the visual features to generate captions. These methods are trained end-to-end to minimize the cross-entropy loss, i.e. to maximize the likelihood of each ground-truth word given the preceding ground-truth words, which is also known as "Teacher Forcing" [14].
The first problem of the cross-entropy loss is that it leads to "exposure bias": in the training stage the model is fed only the ground-truth word at each time step, while in the testing stage it is fed the previously predicted word. This discrepancy between training and testing easily results in error accumulation during generation, because the model is never exposed to its own predictions during training and so has difficulty handling errors that never occur in the training stage. To handle exposure bias, Bengio et al. [4] feed the model's own predictions back as input with scheduled sampling, while Lamb et al. [14] proposed "Professor Forcing" on top of "Teacher Forcing".
The second problem of the cross-entropy loss is that the generated sentences are evaluated in the testing stage by non-differentiable metrics, such as BLEU-1 to BLEU-4 [20], ROUGE [15], METEOR [3], CIDEr [23] and SPICE [1], while during training the model is trained to minimize the cross-entropy loss. This is the inconsistency between the optimization objective and the evaluation metrics. The methods proposed in [4, 14] cannot address this inconsistency. Recently, it has been shown that policy gradient algorithms from reinforcement learning (RL) can be used to avoid exposure bias and to directly optimize such non-differentiable evaluation metrics [21, 17, 29, 22]. In this way, the model is exposed to its own predictions during training. However, the algorithms in [22] use a sequence-level advantage, which implicitly makes the invalid assumption that every token makes the same contribution to the whole sequence. Many works [21, 17, 29] have been proposed to model the per-token advantage. However, they rely on a parametrized value/baseline estimator, at the cost of introducing estimation bias.
In this paper, we improve the advantage actor-critic algorithm to estimate the per-token advantage without introducing a biased parametrized value estimator. Given two properties of image captioning, namely the deterministic state transition function and the sparse reward, the state value is equivalent to its preceding state-action value, and we reformulate the advantage function by simply replacing the former with the latter. Since the state-action value cannot be estimated precisely, a model trained with the reformulated advantage function may easily converge to a local maximum. Therefore, we propose the n-step reformulated advantage function, which generally increases the absolute value of the mean of the reformulated advantage while lowering its variance. To estimate the state-action value, we use Monte Carlo rollouts inspired by [17, 28] and the max-probability rollout inspired by [22]; we term the resulting method self-critical n-step training. According to the empirical results, our model improves the performance of image captioning compared to methods that use the sequence-level advantage or a parametrized estimator.
Overall, we make the following contributions in this paper: (1) using the special properties of image captioning, we find the equivalence between the state value and its preceding state-action value, and reformulate the original advantage function for each action; (2) on top of the reformulated advantage function, we extend it to the n-step reformulated advantage function, which generally increases the absolute value of the mean of the reformulated advantage while lowering its variance; (3) we use two kinds of rollout to estimate the state-action value function and perform self-critical training.
2 Related Work
Many different models have been developed for image captioning, which can be divided into two categories: template-based methods [8, 13] and neural network-based methods. Since our method adopts a neural network architecture, we mainly introduce methods in this vein. Efforts along this line have been devoted to two directions: attention mechanisms and reinforcement learning.
2.1 Attention Mechanism
The encoder-decoder framework of machine translation [6] was first introduced to image captioning by [24], which feeds the last fully connected feature of the image into an RNN to generate the caption. Xu et al. [26] proposed soft and hard attention mechanisms to model the human eye focusing on different regions of the image when generating different words. This work was further improved in [2, 22, 18, 5]. Lu et al. [18] introduced a visual sentinel that allows the attention module to selectively attend to visual and language features. Anderson et al. [2] adopted a bottom-up module, which uses object detection to detect objects in the image, and a top-down module, which uses soft attention to dynamically attend to these object features. Chen et al. [5] proposed a spatial and channel-wise attention model to attend to visual features. Rennie et al. [22] proposed the FC and Att2in models, which achieve good performance.
2.2 Reinforcement Learning
Recently, a few works have used reinforcement learning-based methods to address the exposure bias and the mismatch between the optimization objective and the non-differentiable evaluation metrics [21, 22, 17, 29] in image captioning. Ranzato et al. [21] first introduced the REINFORCE algorithm [25] to sequence training with RNNs. However, REINFORCE often results in large variance in gradient estimation. To lower the variance of the policy gradient, many works have introduced different kinds of baseline into REINFORCE. For example, the reward of the caption generated by the inference algorithm is adopted as the baseline in [22], which uses a sequence-level advantage and does not consider the per-token advantage. The algorithms proposed in [21, 17, 29] aim at modeling the per-token advantage. Ranzato et al. [21] used a parametric estimator of the baseline reward. Liu et al. [17] used FC layers to predict the baseline and Monte Carlo rollouts to predict the state-action value function. Zhang et al. [29] combined the advantage actor-critic algorithm with temporal difference learning, and used another RNN to predict the state value function. However, the value/baseline estimator used in [21, 17, 29] introduces estimation bias. In this paper, we use the properties of image captioning to reformulate the advantage actor-critic method, and use different kinds of rollout to estimate the state-action value function, so that the per-token advantage is calculated without introducing this bias.
3 Methodology
3.1 Training with cross entropy loss
Given an image $I$, the goal of image captioning is to generate a token sequence $a_{0:T} = (a_0, a_1, \dots, a_T)$, $a_t \in D$, where $D$ is the dictionary. The captioning model predicts a token sequence starting with $a_0$ and ending with $a_T$, where $a_0$ is a special token BOS indicating the start of the sentence, and $a_T$ is a special token EOS indicating the end of the sentence. To simplify the formulas, $T$ denotes the total length of a generated sequence, ignoring the fact that generated token sequences have different lengths. We use the standard encoder-decoder architecture for image captioning, where a CNN encoder encodes the image $I$ into an image feature $I_F$, and an RNN decoder decodes $I_F$ into a token sequence. In this work, we adopt the Att2in model proposed in [22]. Given a ground-truth sequence $a^*_{0:T}$, the model parameters $\theta$ are trained to minimize the cross-entropy loss (XENT)
(1) $L_{XENT}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta(a^*_t \mid a^*_{0:t-1}, I_F)$
where $\pi_\theta(a_t \mid a_{0:t-1}, I_F)$ is the probability distribution of the token $a_t$ given the preceding tokens $a_{0:t-1}$ and the image feature $I_F$.
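The XENT objective of Eq. (1) can be illustrated with a minimal sketch. The function name `xent_loss` and the toy probabilities are assumptions made for illustration only; in a real captioning model the per-step distributions would come from the decoder run on the ground-truth prefix (teacher forcing).

```python
import math

def xent_loss(step_probs, target_ids):
    """Sum of negative log-likelihoods of the ground-truth tokens.

    step_probs[t] is the (hypothetical) model distribution over the
    dictionary at step t, computed from the ground-truth prefix.
    target_ids[t] is the index of the ground-truth token at step t.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

# Toy example: a 3-word dictionary and a 2-token ground-truth caption.
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
targets = [0, 1]
loss = xent_loss(probs, targets)
```

Because every step conditions on ground-truth tokens only, the model never sees its own predictions here, which is the source of the exposure bias discussed in the introduction.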
3.2 Training using policy gradient
Problem formulation. To address both problems of the cross-entropy loss described above, namely the exposure bias and the inconsistency between the optimization objective and the evaluation metrics, we incorporate reinforcement learning into image captioning. Formally, we consider the captioning process as a finite Markov decision process (MDP). Our captioning model introduced above can be viewed as an agent, which interacts with an environment (words and images). In the MDP setting $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, which is also the dictionary $D$, $P$ is the state transition probability, $R$ is the reward function and $\gamma$ is the discount factor. The agent selects an action, which corresponds to generating a token, from a conditional probability distribution $\pi(a_t \mid s_t)$ called the policy. In policy gradient algorithms, we consider a set of candidate policies $\pi_\theta$ parametrized by $\theta$. The state $s_t$ is a list composed of the image feature $I_F$ and the tokens/actions generated so far:
(2) $s_t = \{I_F, a_0, a_1, \dots, a_{t-1}\}$
Here we define the initial state $s_0 = \{I_F\}$. At each time step, the RNN consumes $s_t$ and uses its hidden state to generate the next token $a_t$. With this definition of the state, the next state is $s_{t+1} = \{s_t, a_t\}$: we simply append the token $a_t$ to $s_t$. The state transition function is therefore deterministic. Formally, we have:
(3) $P(s_{t+1} \mid s_t, a_t) = 1$
When the state $s_t$ is transferred to the next state $s_{t+1}$ by selecting action $a_t$, the agent receives a reward $r_t$ issued by the environment. However, in image captioning, we can only obtain a reward when the EOS token is generated, and the BOS token $a_0$ is not considered in the reward calculation. The reward is computed by evaluating the complete generated sentence against the corresponding ground-truth sentences under an evaluation metric. Therefore, we define the reward for each action as follows:
(4) $r_t = \begin{cases} 0, & t < T \\ r(a_{1:T}), & t = T \end{cases}$
In reinforcement learning, a value function is a prediction of the expected, accumulated, discounted future reward, measuring how good each state, or state-action pair, is. We define the state-action value function and the state value function of the policy $\pi$ as follows:
(5) $Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{T-t} \gamma^{l} r_{t+l}\right], \quad V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\left[\sum_{l=0}^{T-t} \gamma^{l} r_{t+l}\right]$
where $Q^{\pi}(s_t, a_t)$ is the expected discounted accumulated reward under policy $\pi$ starting from taking action $a_t$ at state $s_t$, and $V^{\pi}(s_t)$ is the expected discounted accumulated reward starting from state $s_t$. To simplify the notation, we write $Q^{\pi}(s_t, a_t)$ and $V^{\pi}(s_t)$ as $Q(s_t, a_t)$ and $V(s_t)$ in the rest of the paper. The difference between $Q(s_t, a_t)$ and $V(s_t)$ lies in whether the action $a_t$ is fixed at state $s_t$ when calculating the accumulated reward. In reinforcement learning, the agent aims to maximize the cumulative reward by estimating the gradient $\nabla_\theta J(\theta)$ of the expected reward $J(\theta)$ and updating its parameters, instead of minimizing the cross-entropy loss as in Eq. (1).
In policy gradient methods, the gradient can be written as:
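(6) $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} \left(Q(s_t, a_t) - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
where the baseline $b(s_t)$ can be an arbitrary function, as long as it does not depend on the action $a_t$. This baseline does not change the expected gradient, but it can decrease the variance of the gradient estimate significantly. This algorithm is known as REINFORCE with a baseline. Using $V(s_t)$ as the baseline $b(s_t)$, the algorithm becomes the advantage actor-critic (A2C) algorithm:
(7) $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} \left(Q(s_t, a_t) - V(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
In Eq. (7), $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is called the advantage function. This equation intuitively guides the agent in a direction that increases the probability of better-than-average actions and decreases the probability of worse-than-average actions [29].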
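The statement that an action-independent baseline leaves the expected gradient unchanged can be checked in a few lines. The derivation below is the standard argument, written with assumed notation: $b(s_t)$ for the baseline and $\pi_\theta(a_t \mid s_t)$ for the policy.

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\left[ b(s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
&= b(s_t) \sum_{a_t} \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= b(s_t) \sum_{a_t} \nabla_\theta \pi_\theta(a_t \mid s_t) \\
&= b(s_t)\, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t)
 = b(s_t)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The baseline term therefore contributes nothing in expectation, so choosing it only affects the variance of the estimate.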
1-step reformulated advantage function. Image captioning is a special case in reinforcement learning, because its state transition is deterministic, whereas in other applications, such as Atari games, an action can lead to different next states with certain probabilities. Here we use this property to reformulate Eq. (7).
With the definitions of $Q$ and $V$ in Eq. (5), we have
(8) $Q(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[V(s_{t+1})\right]$
Due to the deterministic state transition function described above in Eq. (3), Eq. (8) can be rewritten as
(9) $Q(s_t, a_t) = r_t + \gamma V(s_{t+1})$
In this paper, we set the discount factor $\gamma = 1$. According to the reward function of Eq. (4), when $t < T$, we have $r_t = 0$. Then $V(s_{t+1})$ can be written as
(10) $V(s_{t+1}) = Q(s_t, a_t), \quad t < T$
Eq. (10) indicates that, given the two properties of image captioning, namely the deterministic state transition function and the sparse reward function, the state value is equivalent to its preceding state-action value. We can then rewrite Eq. (7) by substituting Eq. (10) into it:
(11) $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=1}^{T} A_1(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \quad A_1(s_t, a_t) = Q(s_t, a_t) - Q(s_{t-1}, a_{t-1})$
where $A_1(s_t, a_t)$ is the reformulated advantage function derived from $A(s_t, a_t)$ in Eq. (7). Therefore, $Q(s_{t-1}, a_{t-1})$ is the new baseline of $Q(s_t, a_t)$ instead of $V(s_t)$. Each state-action value uses its preceding state-action value as the baseline, so we call this the 1-step reformulated advantage function.
In our approach, the agent follows the gradient of Eq. (11) rather than that of Eq. (7). Eq. (11) has an intuitive interpretation: it helps the agent increase the probability of an action whose expected accumulated reward is larger than that of the preceding action, and decrease the probability of an action whose expected accumulated reward is smaller than that of the preceding action.
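The 1-step reformulated advantage can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `one_step_advantages` is a hypothetical helper, and `q_values[t]` is assumed to hold an estimate of the state-action value after token t, with index 0 holding the value before any word has been generated.

```python
def one_step_advantages(q_values):
    """Advantage of token t = Q after token t minus Q after token t-1.

    q_values[0] is the baseline for the first generated word;
    q_values[-1] is the reward of the complete sampled caption.
    """
    return [q_values[t] - q_values[t - 1] for t in range(1, len(q_values))]

# Toy estimates for a 3-token caption (values chosen for illustration).
q = [0.5, 0.25, 0.75, 1.0]
adv = one_step_advantages(q)
```

In this toy case the first token lowered the estimated value (negative advantage) while the next two raised it. The per-token advantages telescope: their sum equals the final value minus the initial one.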
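The most straightforward way to simulate the environment with the current policy is to obtain a Monte Carlo trajectory from the multinomial strategy and estimate the gradient $\nabla_\theta J(\theta)$:
(12) $\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \left(\hat{Q}(s_t, a_t) - \hat{Q}(s_{t-1}, a_{t-1})\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
where $a_t \sim \pi_\theta(\cdot \mid s_t)$, and $\hat{Q}(s_t, a_t)$ is an empirical estimate of $Q(s_t, a_t)$.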
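For a softmax policy, the per-token term of this gradient estimate has a simple closed form with respect to the logits: the advantage times (one-hot of the sampled token minus the predicted probabilities). The sketch below shows that term only; the function names and toy logits are assumptions for illustration, and a real model would backpropagate this signal through the decoder.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def advantage_weighted_grad(logits, action, advantage):
    """Gradient of advantage * log softmax(logits)[action] w.r.t. the logits."""
    probs = softmax(logits)
    return [advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# Same sampled token, opposite advantage signs.
grad_up = advantage_weighted_grad([0.0, 0.0, 0.0], action=1, advantage=0.5)
grad_down = advantage_weighted_grad([0.0, 0.0, 0.0], action=1, advantage=-0.5)
```

A positive advantage pushes the sampled token's logit up and the others down; a negative advantage does the opposite, which matches the encourage/suppress interpretation given in the text.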
n-step reformulated advantage function. According to the property of Eq. (11) described above, the model encourages tokens that are better than their preceding token in terms of value, and suppresses worse tokens. Eq. (11) is a greedy strategy, and it can guide the model towards the global maximum only when the state-action value is estimated precisely. Image captioning is considered a model-free reinforcement learning task, which uses rollouts or function approximation to estimate the state-action value. However, the former suffers from large variance and the latter introduces estimation bias, so neither can predict the value with absolute precision, and under this strict greedy strategy the decision to encourage or suppress a token may turn out to be wrong. Therefore, we introduce the n-step reformulated advantage function. Here, we view n steps as one large step when applying Eq. (11). Each step within the large step shares the n-step reformulated advantage as follows:
(13) $A_n(s_t, a_t) = Q(s_{t_e}, a_{t_e}) - Q(s_{t_b}, a_{t_b}), \quad t_b = n \lfloor (t-1)/n \rfloor, \quad t_e = \min(t_b + n, T)$
where $\lfloor \cdot \rfloor$ denotes the round-down function, and $n$ ranges from $1$ to $T$, which unifies the two extremes, namely 1-step and $T$-step. In the n-step reformulated advantage, n steps show a much clearer evolution trend of the Monte Carlo trajectory sampled from the multinomial strategy than 1 step does, and the values of neighboring states in n-step have a more precise margin than those in 1-step, except when state-action value estimation uses the same strategy as the Monte Carlo trajectory, i.e. samples a single sequence from the multinomial strategy. In that case, the estimated values at each time step come from the same distribution, and thus a larger $n$ cannot enlarge the margin between neighboring state values. Therefore, except for that particular case, as $n$ increases, the absolute value of the mean of the reformulated advantage increases and its variance decreases.
However, as $n$ increases, the per-token advantage is inevitably and gradually lost, until a sequence-level advantage is reached at $n = T$. Different methods of estimating the state-action value yield different distributions, and therefore have a different most suitable $n$ that best balances the approximation of the per-token reformulated advantage against the increase in the absolute mean of the reformulated advantage. In general, the performance of small $n$ is better than that of $n = T$.
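The block-sharing idea can be sketched as follows. This is a minimal illustration under an assumed indexing: tokens are grouped into consecutive blocks of n steps, and every token in a block receives the value at the end of the block minus the value just before the block starts. The helper name and the toy values are hypothetical.

```python
def n_step_advantages(q_values, n):
    """Shared advantage per block of n steps (indexing is an assumption).

    q_values[t] estimates the state-action value after token t, for
    t = 0..T, where index 0 is the value before the first word.
    """
    T = len(q_values) - 1
    advantages = []
    for t in range(1, T + 1):
        start = ((t - 1) // n) * n      # last step before this block
        end = min(start + n, T)         # last step of this block
        advantages.append(q_values[end] - q_values[start])
    return advantages

q = [0.5, 0.25, 0.75, 1.0, 0.625]      # toy estimates, T = 4
adv_1 = n_step_advantages(q, 1)         # per-token extreme
adv_2 = n_step_advantages(q, 2)         # intermediate setting
adv_T = n_step_advantages(q, 4)         # sequence-level extreme
```

With n = 1 each token keeps its own advantage, while with n = T every token shares a single sequence-level value, so the per-token signal is gone. Intermediate n trades per-token resolution for a larger margin between the two values being compared.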
Estimating the state-action value function. According to Eq. (13), we only need to estimate $Q(s_t, a_t)$. Here, we propose two non-parametric methods to estimate $Q(s_t, a_t)$: Monte Carlo rollouts, inspired by [17, 28], and the inference algorithm (max-probability rollout), inspired by [22]. These processes are illustrated in Fig. 1. Since $Q(s_t, a_t)$ is an expected accumulated reward, Monte Carlo rollouts estimate it more stably and precisely than the max-probability rollout, at the additional computation cost of $K$ rollouts. With the 1-step reformulated advantage function, the model performs rollouts at every step, while with the n-step reformulated advantage function, it performs rollouts every $n$ steps. The model therefore adopts self-critical training [22], in which rollouts are used to estimate the value functions and act as the critic.
In Monte Carlo rollouts, we sample $K$ continuations of the sequence to obtain $a^{k}_{t+1:T}$, $k = 1, \dots, K$, which means that the subsequent tokens are sampled from the multinomial strategy. When $t < T$, according to Eq. (4) and Eq. (5), the state-action value function can be computed as the average of the rewards
(14) $Q(s_t, a_t) \approx \frac{1}{K} \sum_{k=1}^{K} r(a_{1:t}, a^{k}_{t+1:T})$
where $r(a_{1:t}, a^{k}_{t+1:T})$ denotes the reward of the $k$-th continuation sampled after $a_t$ from the multinomial strategy. In our experiment, we set . A slight difference between our method and [17, 28] is that we need to roll out from $s_0$ to estimate $Q(s_0, a_0)$, and they do not. If $K = 1$, state-action value estimation and the Monte Carlo trajectory both sample one sequence from the multinomial strategy at each step, and thus a larger $n$ cannot enlarge the margin between neighboring state values, as discussed above. As $K$ increases, although the rollouts used to estimate the state-action value are still sampled from the multinomial strategy, the mean reward with $K > 1$ estimates the state-action value more precisely than with $K = 1$ (i.e. state-action value estimation and the Monte Carlo trajectory use different strategies when $K > 1$), and thus a larger $n$ yields a larger absolute value of the mean of the reformulated advantage with lower variance.
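The Monte Carlo estimate is simply an average of rewards over sampled completions of the current prefix. The sketch below uses a uniform random toy policy and a unigram-overlap reward as stand-ins; both, along with all function names, are assumptions for illustration and replace the captioning model and the CIDEr metric used in the paper.

```python
import random

def mc_state_action_value(prefix, sample_continuation, reward_fn, num_rollouts):
    """Average reward over num_rollouts sampled completions of prefix."""
    total = 0.0
    for _ in range(num_rollouts):
        total += reward_fn(prefix + sample_continuation(prefix))
    return total / num_rollouts

VOCAB = ["a", "dog", "runs", "<eos>"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_continuation(prefix, max_len=5):
    """Toy multinomial policy: uniform over VOCAB until <eos> or max_len."""
    continuation = []
    while len(prefix) + len(continuation) < max_len:
        token = rng.choice(VOCAB)
        continuation.append(token)
        if token == "<eos>":
            return continuation
    return continuation + ["<eos>"]

def toy_reward(sequence, reference=("a", "dog", "runs")):
    """Hypothetical stand-in for CIDEr: fraction of reference words covered."""
    words = set(sequence) - {"<eos>"}
    return len(words & set(reference)) / len(reference)

q_hat = mc_state_action_value(["a", "dog"], sample_continuation, toy_reward,
                              num_rollouts=4)
```

More rollouts reduce the variance of the estimate but multiply the decoding cost, which is the trade-off against the max-probability rollout noted in the text.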
In the max-probability rollout, we generate only one continuation of the sequence, $\hat{a}_{t+1:T}$, whose tokens are those with the largest probability at every time step. Then we have
(15) $Q(s_t, a_t) \approx r(a_{1:t}, \hat{a}_{t+1:T})$
where $r(a_{1:t}, \hat{a}_{t+1:T})$ is the reward of the max-probability rollout sequence after $a_t$ under the inference algorithm. Interestingly, SCST [22] is equivalent to the $T$-step reformulated advantage function using the max-probability rollout, i.e. SCST is a variant of ours. Here, state-action value estimation and the Monte Carlo trajectory use different strategies: the former comes from the max-probability strategy and the latter from the multinomial strategy. Moreover, the max-probability strategy can always obtain a better sequence than the multinomial strategy. Therefore, although the reward of the max-probability rollout cannot reflect the real state-action value, a larger $n$ can yield a larger absolute value of the mean of the reformulated advantage with lower variance.
It is worth noting that the rollout of the preceding step can be used for both the preceding token and the current token, with different effects. Here, we directly optimize the CIDEr metric, i.e. $r$ is the CIDEr score. Moreover, only when calculating the last reformulated advantage of each sequence, which includes the EOS token, do we use CIDEr with EOS as a token. Otherwise, we use CIDEr without EOS as a token. This is because EOS is not a normal token of a sentence like other words but a special token indicating the end of the sentence, and it is ignored in the standard calculation of evaluation metric scores.
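The relation to SCST can be made concrete with a toy example. In the sketch below, the bigram table `POLICY`, the unigram-overlap `reward` (a stand-in for CIDEr) and the sampled caption are all hypothetical; the point is only that the difference between the last and the first max-probability value equals the SCST advantage (sampled reward minus greedy reward).

```python
# Hypothetical bigram policy: next-token probabilities given the last token.
POLICY = {
    "<bos>": {"a": 0.6, "dog": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.8, "<eos>": 0.2},
    "cat": {"runs": 0.6, "<eos>": 0.4},
    "runs": {"<eos>": 1.0},
}

def greedy_rollout(prefix, max_len=6):
    """Complete prefix by always taking the most probable next token."""
    seq = list(prefix)
    while seq[-1] != "<eos>" and len(seq) < max_len:
        nxt = POLICY[seq[-1]]
        seq.append(max(nxt, key=nxt.get))
    return seq

def reward(seq, reference=("a", "cat", "runs")):
    """Hypothetical stand-in for CIDEr: fraction of reference words covered."""
    words = set(seq) - {"<bos>", "<eos>"}
    return len(words & set(reference)) / len(reference)

sampled = ["<bos>", "a", "cat", "<eos>"]   # trajectory from the multinomial strategy
# Value after each prefix, estimated by one max-probability rollout.
q = [reward(greedy_rollout(sampled[: t + 1])) for t in range(len(sampled))]

one_step = [q[t] - q[t - 1] for t in range(1, len(q))]
t_step_advantage = q[-1] - q[0]
scst_advantage = reward(sampled) - reward(greedy_rollout(["<bos>"]))
```

Here the sequence-level advantage is zero, so SCST would not update on this sample, whereas the 1-step advantages credit the choice of "cat" and penalize the early "<eos>".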
4 Experiments
4.1 Dataset
We evaluate our method on the MSCOCO dataset [16]. For fair comparison, we use the widely used splits from [11]. The training set contains images with captions for each image and images for validation and images for offline testing. We follow standard practice to preprocess all captions, including converting all captions to lower case, tokenizing on white space, truncating captions longer than words, and replacing words that do not occur at least times with the UNK token, resulting in words in the dictionary. To evaluate generated caption quality, we use the standard metrics, namely BLEU, ROUGE, METEOR, CIDEr and SPICE. We extract image features using ResNet-101 [9] without fine-tuning.
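The caption preprocessing steps listed above can be sketched as follows. The function name, the parameters `max_len` and `min_count`, and the choice to count words after truncation are assumptions for illustration; the thresholds are left as arguments and the toy values below are not the paper's settings.

```python
from collections import Counter

def preprocess_captions(captions, max_len, min_count, unk="UNK"):
    """Lower-case, split on white space, truncate, and replace rare words."""
    tokenized = [caption.lower().split()[:max_len] for caption in captions]
    counts = Counter(word for tokens in tokenized for word in tokens)
    vocab = {word for word, count in counts.items() if count >= min_count}
    processed = [[word if word in vocab else unk for word in tokens]
                 for tokens in tokenized]
    return processed, vocab

captions = ["A dog runs", "a dog sleeps on the mat"]
processed, vocab = preprocess_captions(captions, max_len=4, min_count=2)
```

On this toy input only "a" and "dog" occur twice, so every other word is mapped to UNK and the second caption is cut to four tokens.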
4.2 Implementation Details
The embedding dimensions of the LSTM hidden state, image, word and attention are all fixed to for all the models. We pretrain all the models under the XENT loss for epochs using the Adam [12] optimizer with default settings and a fixed learning rate . During training under the XENT loss, our batch size is set to . We then run RL training with a fixed learning rate . In RL training, we use the models trained under the XENT loss as the pretrained models in order to reduce the search space, and the batch size is set to . Throughout training, we use a fixed dropout rate to prevent the models from overfitting.
4.3 Experiment Configuration
Here are the configurations of the basic model and several variants of our models. This series of experiments is designed to explore the effects of different $n$-step settings, different combinations of $n$, and Monte Carlo rollouts versus the max-probability rollout. In addition, we reimplement two state-of-the-art reinforcement learning-based models, SCST [22] and PG-CIDEr [17], with all hyperparameters the same as those of our proposed models for fair comparison.
(1) XENT is the basic model trained with the cross-entropy loss, which is then used as the pretrained model for all reinforcement learning-based models.
(2) For the max-probability rollout, we train $n$-step-maxpro models () with the $n$-step reformulated advantage throughout the whole training time. We also train models with different $n$-step settings applied successively, e.g. step-maxpro, step-maxpro, step-maxpro.
(3) For Monte Carlo rollouts, we train 1-step-sample with the 1-step reformulated advantage, using Monte Carlo rollouts to estimate the state-action value function. We also train step-sample.
(4) SCST [22] (i.e. $T$-step-maxpro) uses the sequence-level advantage for every token in a sampled sequence. Here, we compare the self-critical per-token advantage with the self-critical sequence-level advantage.
(5) PG-CIDEr [17] uses Monte Carlo rollouts with a parametrized estimator. Here, we compare the self-critical per-token advantage with the parametrized per-token advantage.

Table 1: Performance of single models on the Karpathy test split.
Model  BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
XENT  74.1  57.4  42.8  31.7  25.8  54.1  102.1  19.2
PG-CIDEr [17]  77.44  60.66  45.85  34.32  26.35  55.61  113.9  19.25
SCST [22] ($T$-step-maxpro)  76.83  60.65  46.05  34.61  26.65  56.03  112.7  19.99
step-sample  77.49  61.19  46.64  35.08  26.88  56.11  115.4  20.05
step-sample  77.41  61.10  46.46  34.88  26.88  56.10  114.9  20.23
step-maxpro  77.24  60.90  46.13  34.46  26.87  56.11  115.1  20.26
step-maxpro  77.82  61.30  46.45  34.80  26.95  56.29  114.6  20.35
step-maxpro  77.67  61.01  46.30  34.78  26.91  56.05  114.5  20.20
step-maxpro  77.45  61.02  46.25  34.59  26.89  56.26  114.8  20.38
step-maxpro  77.30  60.77  46.07  34.48  26.74  56.02  114.0  20.16
step-maxpro  77.93  61.54  46.75  34.96  26.92  56.27  115.2  20.42

Table 2: Performance on the official MSCOCO evaluation server (c5 and c40 reference sets). A dash marks a value that is not reported.
Model  BLEU-1 (c5)  BLEU-1 (c40)  BLEU-2 (c5)  BLEU-2 (c40)  BLEU-3 (c5)  BLEU-3 (c40)  BLEU-4 (c5)  BLEU-4 (c40)  METEOR (c5)  METEOR (c40)  ROUGE-L (c5)  ROUGE-L (c40)  CIDEr (c5)  CIDEr (c40)
Google NIC [24]  71.3  89.5  54.2  80.2  40.7  69.4  30.9  58.7  25.4  34.6  53.0  68.2  94.3  94.6
Hard-Attention [26]  70.5  88.1  52.8  77.9  38.3  65.8  27.7  53.7  24.1  32.2  51.6  65.4  86.5  86.3
MSRCap [7]  71.5  90.7  54.3  81.9  40.7  71.0  30.8  60.1  24.8  33.9  52.6  68.0  93.1  93.7
m-RNN [19]  71.6  89.0  54.5  79.8  40.4  68.7  29.9  57.5  24.2  32.5  52.1  66.6  91.7  93.5
ATT [27]  73.1  90.0  56.5  81.5  42.4  70.9  31.6  59.9  25.0  33.5  53.5  68.2  94.3  95.8
Adaptive [18]  74.8  92.0  58.4  84.5  44.4  74.4  33.6  63.7  26.4  35.9  55.0  70.5  104.2  105.9
MIXER [21]  74.7  -  57.9  -  43.1  -  31.7  -  25.8  -  54.5  -  99.1  -
PG-SPIDEr [17]  75.1  91.6  59.1  84.2  44.5  73.8  33.1  62.4  25.5  33.9  55.1  69.4  104.2  107.1
AC [29]  77.8  92.9  61.2  85.5  45.9  74.5  33.7  62.5  26.4  33.4  55.4  69.1  110.2  112.1
SCST-Att2in (Ens. 4) [22]  -  -  -  -  -  -  34.4  -  26.8  -  55.9  -  112.3  -
step-maxpro  77.1  92.5  60.6  85.1  45.8  74.9  34.1  63.5  26.6  35.2  55.6  70.0  111.1  114.0
step-sample  77.3  92.5  60.9  85.4  46.2  75.2  34.5  64.0  26.6  35.2  55.6  70.2  111.6  114.5
step-maxpro  77.4  92.9  60.9  85.6  46.0  75.2  34.3  63.7  26.7  35.2  55.8  70.0  111.3  113.5
step-maxpro (Ens. 4)  77.6  93.1  61.3  86.1  46.5  76.0  34.8  64.6  26.9  35.4  56.1  70.4  112.6  115.3
4.4 Quantitative Analysis
Performance on the Karpathy test split. In Table 1, we report the performance of our models, SCST [22] and PG-CIDEr [17] on the Karpathy test split; all the models are single models. In general, our models have the best performance on all metrics. Comparing our basic models step-maxpro and step-sample with XENT, we obtain a significant improvement in CIDEr score over XENT, from to and for step-maxpro and step-sample respectively, since our basic models are reinforcement learning-based models that can address the exposure bias and directly optimize the evaluation metric. In particular, step-sample outperforms step-maxpro on almost all metrics, from which we conclude that the average reward of Monte Carlo rollouts estimates the state-action value more precisely than the max-probability rollout, leading to better performance. However, step-sample needs to sample rollouts, at a greater computation cost.
Regarding the max-probability rollout, we compare different $n$-step-maxpro models in Table 1. Intermediate settings attain better overall scores than the two extremes $n = 1$ and $n = T$ (SCST [22]). The better performance of intermediate settings originates from the fact that they increase the absolute value of the mean of the reformulated advantage while lowering its variance at most time steps compared to $n = 1$, as shown quantitatively in Fig. 3(a) and 3(b). Since rollout-based methods only estimate a rough state-action value, when $n = 1$ the reformulated advantage is small with large variance, and under this strict greedy strategy the decision to encourage or suppress a token may turn out to be wrong. As $n$ increases, this dilemma is eased, but the per-token advantage is gradually lost until a sequence-level advantage is reached at $n = T$. This implies that an intermediate $n$, which balances the approximation of the per-token advantage against the increase in the absolute value of the mean of the reformulated advantage, is always better with the max-probability rollout. Moreover, different $n$ or combinations of different $n$ have different effects on balancing these two conflicts, e.g. the performance of is better than that of and close to that of , and and are both inferior to . We also show the performance curves on the Karpathy validation split during training in Fig. 2. In Fig. 2(a) and 2(b), our models have a clear advantage over SCST [22] throughout the whole training process, which demonstrates that the self-critical per-token advantage is better than the self-critical sequence-level advantage.
Regarding Monte Carlo rollouts, step-sample and step-sample are superior to PG-CIDEr [17] in Table 1 and Fig. 2(c) and 2(d), which demonstrates that the self-critical per-token advantage is better than the parametrized per-token advantage.
Comparing the effects of $n$-step on the max-probability rollout and on Monte Carlo rollouts, we find in Fig. 3 that a large $n$ increases the absolute value of the mean of the reformulated advantage while lowering its variance with both kinds of rollout. However, step-maxpro is superior to step-maxpro and step-sample is close to step-sample in Table 1. Therefore, $n$-step () is more effective with the max-probability rollout than with Monte Carlo rollouts. A possible reason is that the changes in the absolute value of the mean and in the variance of the reformulated advantage across different $n$ are relatively small with Monte Carlo rollouts, and thus possibly cannot offset the loss of per-token advantage, while those changes are relatively large with the max-probability rollout, so that a large $n$ (e.g. ) can better balance these two conflicts, as illustrated in Fig. 3.
Performance on the official MSCOCO testing server. Table 2 shows the results of our single models and of an ensemble of 4 models, using beam search with beam size set to , on the official MSCOCO evaluation server; all other results are based on single models. Our single models and ensembled model outperform the other methods on most metrics, including those that use complex attention mechanisms [27, 18] and other reinforcement learning-based models, which either introduce a parametrized estimator [21, 17, 29] or use a sequence-level advantage [22].
4.5 Qualitative Analysis
Fig. 4 shows some qualitative results of step-maxpro against the ground truth and the model trained with the XENT loss. Each image has three captions, one from each of these sources, listed below it. In general, the captions predicted by step-maxpro are better than those of the model trained with the XENT loss. In Fig. 4, we can see that when the image content is common in the dataset and not too complex to describe, both XENT and step-maxpro predict correct captions. Since the reinforcement learning-based model can avoid accumulating errors while generating the caption, the captions in Fig. 4 generated by step-maxpro describe more of the important objects and capture their relationships with more distinctive information about the image, while those generated by XENT are less descriptive or incorrect to some degree. When the image shows human activities that appear rarely in the dataset, or different activities involving the same objects that are difficult for the model to distinguish, the models easily make incorrect predictions. For example, in Fig. 4, step-maxpro and XENT both wrongly predict that the player on the base is throwing the ball, when in fact he is catching the ball with a glove.
5 Conclusion
We reformulate the advantage function to estimate the per-token advantage without using a parametrized estimator. Moreover, the n-step reformulated advantage is proposed to increase the absolute value of the mean of the reformulated advantage while lowering its variance. Our methods outperform state-of-the-art methods that use the sequence-level advantage or a parametrized estimator on the MSCOCO benchmark.
Acknowledgements
This work was supported in part by the National Key R&D Program of China (2017YFC0821005), the National Basic Research Program of China (973 Program, 2015CB351800), and the High-performance Computing Platform of Peking University, which are gratefully acknowledged.
References
 [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
 [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
 [3] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
 [4] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
 [5] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.S. Chua. Scacnn: Spatial and channelwise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
 [6] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoderdecoder for statistical machine translation. Computer Science, 2014.
 [7] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1473–1482, 2015.
 [8] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European conference on computer vision, pages 15–29. Springer, 2010.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [10] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997.
 [11] A. Karpathy and F. F. Li. Deep visualsemantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
 [14] A. M. Lamb, A. G. A. P. GOYAL, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.
 [15] C.Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
 [16] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
 [17] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of spider. In Proc. IEEE Int. Conf. Comp. Vis, volume 3, page 3, 2017.
 [18] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2, 2017.
 [19] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (mrnn). arXiv preprint arXiv:1412.6632, 2014.
 [20] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
 [21] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
 [22] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Selfcritical sequence training for image captioning. In CVPR, volume 1, page 3, 2017.
 [23] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensusbased image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
 [24] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
 [25] R. J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning, 8(34):229–256, 1992.
 [26] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
 [27] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
 [28] L. Yu, W. Zhang, J. Wang, and Y. Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
 [29] L. Zhang, F. Sung, F. Liu, T. Xiang, S. Gong, Y. Yang, and T. M. Hospedales. Actorcritic sequence training for image captioning. arXiv preprint arXiv:1706.09601, 2017.