From Credit Assignment to Entropy Regularization:
Two New Algorithms for Neural Sequence Prediction
Abstract
In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and entropy-regularized reinforcement learning. Inspired by the connection, we propose two sequence prediction algorithms, one extending RAML with fine-grained credit assignment and the other improving Actor-Critic with a systematic entropy regularization. On two benchmark datasets, we show that the proposed algorithms outperform RAML and Actor-Critic respectively, providing new alternatives to sequence prediction.
Zihang Dai†, Qizhe Xie†, Eduard Hovy (†equal contribution)
Language Technologies Institute, Carnegie Mellon University
{dzihang, qizhex, hovy}@cs.cmu.edu
1 Introduction
Modeling and predicting discrete sequences is central to many natural language processing tasks. In the last few years, the adoption of recurrent neural networks (RNNs) and the sequence-to-sequence model (seq2seq) (Sutskever et al., 2014; Bahdanau et al., 2014) has led to a wide range of successes in conditional sequence prediction, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), automatic summarization (Rush et al., 2015), image captioning (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015) and speech recognition (Chan et al., 2016).
Despite the distinct evaluation metrics for the aforementioned tasks, the standard training algorithm has been the same for all of them. Specifically, the algorithm is based on maximum likelihood estimation (MLE), which maximizes the log-likelihood of the "ground-truth" sequences empirically observed.¹

¹In this work, we use the terms "ground-truth" and "reference" to refer to the empirical observations interchangeably.
While largely effective, the MLE algorithm has two obvious weaknesses. Firstly, MLE training ignores the information of the task-specific metric. As a result, the potentially large discrepancy between the log-likelihood during training and the task evaluation metric at test time can lead to a suboptimal solution. Secondly, MLE can suffer from exposure bias, which refers to the phenomenon that the model is never exposed to its own failures during training, and thus cannot recover from an error at test time. Fundamentally, this issue stems from the difficulty of statistically modeling the exponentially large space of sequences, where most combinations cannot be covered by the observed data.
To tackle these two weaknesses, there have been various efforts recently, which we summarize into two broad categories:

A widely explored idea is to directly optimize the task metric for sequences produced by the model, with the specific approaches ranging from minimum risk training (MRT) (Shen et al., 2015) and learning as search optimization (LaSO) (Daumé III and Marcu, 2005; Wiseman and Rush, 2016) to reinforcement learning (RL) (Ranzato et al., 2015; Bahdanau et al., 2016). In spite of the technical differences, the key component to make these training algorithms practically efficient is often a delicate credit assignment scheme, which transforms the sequence-level signal into dedicated smaller units (e.g., token-level or chunk-level), and allocates them to specific decisions, allowing for efficient optimization with a much lower variance. For instance, beam search optimization (BSO) (Wiseman and Rush, 2016) utilizes the position of margin violations to produce signals for the specific chunks, while the actor-critic (AC) algorithm (Bahdanau et al., 2016) trains a critic to enable token-level signals.

Another alternative idea is to construct a task-metric-dependent target distribution, and train the model to match this task-specific target instead of the empirical data distribution. As a typical example, reward augmented maximum likelihood (RAML) (Norouzi et al., 2016) defines the target distribution as the exponentiated payoff (sequence-level reward) distribution. This way, RAML not only incorporates the task metric information into training, but also alleviates the exposure bias by exposing imperfect outputs to the model. However, RAML only operates on a sequence-level training signal.
In this work, we are intrigued by the question of whether it is possible to incorporate the idea of fine-grained credit assignment into RAML. More specifically, inspired by the token-level signal used in AC, we aim to find the token-level counterpart of the sequence-level RAML, i.e., to define a token-level target distribution for each autoregressive conditional factor to match. Motivated by this question, we first formally define the desiderata the token-level counterpart needs to satisfy and derive the corresponding solution (§2). Then, we establish a theoretical connection between the derived token-level RAML and entropy-regularized RL (§3). Motivated by this connection, we propose two algorithms for neural sequence prediction, where one is the token-level extension of RAML, and the other a RAML-inspired improvement to AC (§4). We empirically evaluate the two proposed algorithms, and show different levels of improvement over the corresponding baselines. We further study the importance of various techniques used in our experiments, providing practical suggestions to readers (§6).
2 Token-level Equivalence of RAML
We first introduce the notation used throughout the paper. Firstly, capital letters denote random variables and lowercase letters the values they take. As we mainly focus on conditional sequence prediction, we use x for the conditional input, and y for the target sequence. With y denoting a sequence, y_{i:j} denotes the subsequence from position i to j inclusively, while y_t denotes the single symbol at position t. Also, we use |y| to indicate the length of the sequence. To emphasize the ground-truth data used for training, we add a superscript * to the input and target, i.e., x* and y*. In addition, we use 𝒴 to denote the set of all possible sequences with one and only one eos symbol at the end, and 𝒱 to denote the set of all possible symbols at a position. Finally, we assume the length of sequences in 𝒴 is bounded by T.
2.1 Background: RAML
As discussed in §1, given a ground-truth pair (x*, y*), RAML defines the target distribution using the exponentiated payoff of sequences, i.e.,
q(y | y*; τ) = exp( r(y, y*) / τ ) / Σ_{y' ∈ 𝒴} exp( r(y', y*) / τ ),    (1)
where r(y, y*) is the sequence-level reward, such as the BLEU score, and τ is the temperature hyperparameter controlling the sharpness. With this definition, the RAML algorithm simply minimizes the cross entropy (CE) between the target distribution q and the model distribution p_θ(y | x*), i.e.,
L_RAML(θ) = − Σ_{y ∈ 𝒴} q(y | y*; τ) log p_θ(y | x*).    (2)
Note that this is quite similar to MLE training, except that the target distribution is different. With this particular choice of target distribution, RAML not only ensures that the ground-truth reference remains the mode, but also allows the model to explore sequences that are not exactly the same as the reference yet have relatively high rewards.
Compared to algorithms that directly optimize the task metric, RAML avoids the difficulty of tracking and sampling from a constantly changing model distribution. Hence, RAML enjoys much more stable optimization without the need for pretraining. However, in order to optimize the RAML objective (Eqn. (2)), one needs to sample from the exponentiated payoff distribution, which is quite challenging in practice. Thus, importance sampling is often used (Norouzi et al., 2016; Ma et al., 2017). We leave the details of the practical implementation to Appendix B.1.
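For intuition, the exponentiated payoff distribution of Eqn. (1) can be computed exactly on a toy sequence space. The candidate set, the position-wise matching reward, and the temperature below are illustrative assumptions, not the paper's setup:

```python
import math

def raml_target(candidates, reward_fn, y_star, tau=1.0):
    """Exponentiated payoff distribution q(y | y*) ∝ exp(r(y, y*) / tau), Eqn. (1)."""
    logits = [reward_fn(y, y_star) / tau for y in candidates]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stabilized softmax
    z = sum(exps)
    return {y: e / z for y, e in zip(candidates, exps)}

# Toy setup: sequences over {a, b} of length 2; reward = #positions matching y*.
def match_reward(y, y_star):
    return sum(1 for u, v in zip(y, y_star) if u == v)

candidates = ["aa", "ab", "ba", "bb"]
q = raml_target(candidates, match_reward, y_star="ab", tau=0.5)
```

As expected, the reference "ab" remains the mode, while near-misses such as "aa" and "bb" still receive non-negligible mass, which is exactly the exploration behavior described above.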
2.2 Token-level Target Distribution
Despite these appealing properties, RAML only operates on the sequence-level reward. As a result, the reward gap between any two sequences cannot be precisely attributed to the responsible decisions, which often leads to low sample efficiency. Ideally, since we rely on the autoregressive factorization p_θ(y | x) = Π_t p_θ(y_t | y_{1:t−1}, x), the optimization would be much more efficient if we had a target distribution for each token-level factor to match. Conceptually, this is exactly how the AC algorithm improves upon the vanilla sequence-level REINFORCE algorithm (Ranzato et al., 2015).
With this idea in mind, we set out to find such a token-level target. Firstly, we assume the token-level target shares the form of a Boltzmann distribution, parameterized by some unknown negative energy function Q(y_{1:t−1}, y_t), i.e.,²

²To avoid clutter, the conditioning on x* and y* will be omitted in the sequel when it is clear from the context.
q(y_t | y_{1:t−1}, y*) = exp( Q(y_{1:t−1}, y_t) / τ ) / Σ_{y' ∈ 𝒱} exp( Q(y_{1:t−1}, y') / τ ).    (3)
Intuitively, Q(y_{1:t−1}, y_t) measures how much future payoff one can expect if y_t is generated, given the current prefix y_{1:t−1} and the reference y*. This quantity highly resembles the action-value function (Q-function) in reinforcement learning. As we will show later, it is indeed the case.
Before we state the desiderata for Q, we need to extend the definition of r in order to evaluate the goodness of an unfinished partial prediction, i.e., a sequence without an eos suffix. Let 𝒴⁻ be the set of unfinished sequences. Following Bahdanau et al. (2016), we define the payoff function for a partial sequence y_{1:t} ∈ 𝒴⁻ as
r(y_{1:t}, y*) := r(y_{1:t} ∘ eos, y*),    (4)
where ∘ indicates string concatenation.
With this extension, we are ready to state the two requirements for Q:

Marginal match: For q(y_t | y_{1:t−1}, y*) to be the token-level equivalent of q(y | y*; τ), the sequence-level marginal distribution induced by the token-level factors must match q(y | y*; τ), i.e., for any y ∈ 𝒴,
Π_{t=1}^{|y|} q(y_t | y_{1:t−1}, y*) = q(y | y*; τ).    (5)

Note that there are infinitely many Q's satisfying Eqn. (5), because adding any constant value to Q does not change the Boltzmann distribution, a property known as shift-invariance w.r.t. the energy.

Terminal condition: Secondly, consider the value of Q when emitting an eos symbol to immediately terminate the generation. As mentioned earlier, Q measures the expected future payoff. Since the emission of eos ends the generation, the future payoff can only come from the immediate increase of the payoff. Thus, we require Q to be the incremental payoff when producing eos, i.e.,
Q(y_{1:t−1}, eos) = r(y_{1:t−1} ∘ eos, y*) − r(y_{1:t−1}, y*)    (6)

for any y_{1:t−1} ∈ 𝒴⁻. Since Eqn. (6) pins down the absolute value of Q at a point, it also resolves the ambiguity caused by the shift-invariance property.
Based on the two requirements, we can derive the form of Q, which is summarized in Proposition 1.
Proposition 1.
Requirements (5) and (6) are satisfied if and only if, for any prefix y_{1:t−1} and any y_t ≠ eos, Q satisfies the recursion

Q(y_{1:t−1}, y_t) = r(y_{1:t}, y*) − r(y_{1:t−1}, y*) + τ log Σ_{y' ∈ 𝒱} exp( Q(y_{1:t}, y') / τ ),    (7)

together with the terminal condition (6).
Proof.
See Appendix A.1. ∎
Note that, instead of giving an explicit form for the token-level target distribution, Proposition 1 only provides an equivalent condition in the form of an implicit recursion. Thus, we haven't obtained a practical algorithm yet. However, as we will discuss next, the recursion has a deep connection to entropy-regularized RL, which ultimately inspires our proposed algorithms.
3 Connection to Entropy-regularized RL
Before we dive into the connection, we first give a brief review of entropy-regularized RL. For an in-depth treatment, we refer readers to (Ziebart, 2010; Schulman et al., 2017).
3.1 Background: Entropy-regularized RL
Following the standard convention of RL, we denote a Markov decision process (MDP) by a tuple (𝒮, 𝒜, P, r, γ), where 𝒮, 𝒜, P, r and γ are the state space, action space, transition probability, reward function and discounting factor respectively.³

³In sequence prediction, we are only interested in the episodic (finite horizon) case.
Based on this notation, the goal of entropy-regularized RL is to learn a policy π which maximizes the discounted expected future return augmented with the causal entropy (Ziebart, 2010), i.e.,

π* = argmax_π E_π [ Σ_{t≥0} γ^t ( r(s_t, a_t) + τ H(π(· | s_t)) ) ],    (8)
where H(π(· | s)) = − Σ_{a ∈ 𝒜} π(a | s) log π(a | s) denotes the entropy and τ is a hyperparameter controlling the relative importance between the reward and the entropy. Intuitively, compared to standard RL, the extra entropy term encourages exploration and promotes multi-modal behaviors. Such properties are highly favorable in a complex environment.
Given an entropy-regularized MDP, for any fixed policy π, the state-value function V^π and the action-value function Q^π can be defined as
V^π(s) = E_π [ Σ_{t≥0} γ^t ( r(s_t, a_t) + τ H(π(· | s_t)) ) | s_0 = s ],
Q^π(s, a) = r(s, a) + γ E_{s' ∼ P(· | s, a)} [ V^π(s') ].    (9)
With the definitions above, it can further be proved (Ziebart, 2010; Schulman et al., 2017) that the optimal state-value function V*, the optimal action-value function Q* and the corresponding optimal policy π* satisfy the following equations:
V*(s) = τ log Σ_{a ∈ 𝒜} exp( Q*(s, a) / τ ),    (10)
Q*(s, a) = r(s, a) + γ E_{s' ∼ P(· | s, a)} [ V*(s') ],    (11)
π*(a | s) = exp( ( Q*(s, a) − V*(s) ) / τ ).    (12)
Here, Eqn. (10) and (11) are essentially the entropy-regularized counterparts of the optimal Bellman equations in standard RL. Following previous literature, we will refer to Eqn. (10) and (11) as the optimal soft Bellman equations, and to V* and Q* as the optimal soft value functions.
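For concreteness, the optimal soft Bellman equations can be solved on a tiny MDP by soft value iteration: alternate the updates of Eqn. (11) and Eqn. (10) until convergence, then read off the optimal policy via Eqn. (12). This is a minimal sketch; the two-state MDP, its rewards, and the hyperparameters are illustrative assumptions:

```python
import math

# Tiny 2-state, 2-action entropy-regularized MDP (toy numbers).
R = {  # reward[state][action]
    0: {0: 1.0, 1: 0.0},
    1: {0: 0.0, 1: 2.0},
}
P = {  # deterministic transitions: next_state[state][action]
    0: {0: 0, 1: 1},
    1: {0: 0, 1: 1},
}
GAMMA, TAU = 0.9, 0.5

def solve_soft(n_iters=500):
    V = {s: 0.0 for s in R}
    for _ in range(n_iters):
        # Eqn. (11): soft Bellman backup for Q
        Q = {s: {a: R[s][a] + GAMMA * V[P[s][a]] for a in R[s]} for s in R}
        # Eqn. (10): log-sum-exp "soft max" for V
        V = {s: TAU * math.log(sum(math.exp(q / TAU) for q in Q[s].values()))
             for s in R}
    # Eqn. (12): optimal Boltzmann policy
    pi = {s: {a: math.exp((Q[s][a] - V[s]) / TAU) for a in R[s]} for s in R}
    return V, Q, pi

V, Q, pi = solve_soft()
```

With τ → 0 the log-sum-exp collapses to a max and the iteration reduces to standard value iteration; with larger τ the resulting policy stays closer to uniform.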
3.2 An RL Equivalence of the Token-level RAML
To reveal the connection, it is convenient to define the incremental payoff
r̃(y_{1:t−1}, y_t) := r(y_{1:t}, y*) − r(y_{1:t−1}, y*),    (13)
and the log-sum-exp term of the recursion in Proposition 1 as
V(y_{1:t}) := τ log Σ_{y' ∈ 𝒱} exp( Q(y_{1:t}, y') / τ ).    (14)
Substituting the two definitions into the recursion, it simplifies to
Q(y_{1:t−1}, y_t) = r̃(y_{1:t−1}, y_t) + V(y_{1:t}).    (15)
Now, it is easy to see that Eqn. (14) and (15), which are derived from the token-level RAML, highly resemble the optimal soft Bellman equations (10) and (11) in entropy-regularized RL. The following corollary formalizes the connection.
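As a sanity check, the recursion can be run by backward induction on a toy sequence space, and the induced token-level Boltzmann factors can be verified to multiply back to the sequence-level RAML distribution of Eqn. (1). A minimal sketch, assuming a toy position-wise matching reward with eos written as "$" (all names and constants are illustrative):

```python
import itertools
import math

VOCAB, EOS, T, TAU = ["a", "b"], "$", 2, 0.7
Y_STAR = "ab$"   # toy reference, ending with eos

def reward(y):
    # Toy sequence-level payoff: position-wise matches with the reference.
    return sum(1.0 for u, v in zip(y, Y_STAR) if u == v)

def r_partial(prefix):
    # Eqn. (4): a partial sequence is scored by appending eos.
    return reward(prefix + EOS)

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

_memo = {}
def soft_q(prefix):
    """Q(prefix, a) by backward induction: the terminal condition (Eqn. (6))
    plus the recursion Q(s, a) = r~(s, a) + V(s.a) for a != eos (Eqn. (15))."""
    if prefix not in _memo:
        q = {EOS: reward(prefix + EOS) - r_partial(prefix)}  # Eqn. (6)
        if len(prefix) < T:  # at length T, eos is forced
            for a in VOCAB:
                nxt = prefix + a
                v_next = TAU * logsumexp([x / TAU for x in soft_q(nxt).values()])
                q[a] = r_partial(nxt) - r_partial(prefix) + v_next
        _memo[prefix] = q
    return _memo[prefix]

def token_logprob(prefix, a):
    # Token-level Boltzmann factor, Eqn. (3)
    q = soft_q(prefix)
    v = TAU * logsumexp([x / TAU for x in q.values()])
    return (q[a] - v) / TAU

def seq_logprob(y):
    return sum(token_logprob(y[:t], y[t]) for t in range(len(y)))

# All finished sequences (content length <= T, single trailing eos)
seqs = [("".join(p)) + EOS for n in range(T + 1)
        for p in itertools.product(VOCAB, repeat=n)]
log_z = logsumexp([reward(y) / TAU for y in seqs])
```

On this toy space the marginal-match requirement (Eqn. (5)) holds exactly: for every finished sequence, the product of token-level factors recovers exp(r(y, y*)/τ)/Z.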
Corollary 1.
For any ground-truth pair (x*, y*), the recursion specified by Eqn. (13), (14) and (15) is equivalent to the optimal soft Bellman equation of a "deterministic" MDP in entropy-regularized reinforcement learning, denoted ℳ(x*, y*), where

the state space corresponds to the set of possible prefixes, i.e., 𝒴 ∪ 𝒴⁻,

the action space corresponds to the symbol set 𝒱,

the transition probability is a deterministic process defined by string concatenation, i.e., appending the chosen symbol to the current prefix,

the reward function corresponds to the incremental payoff r̃ defined in Eqn. (13),

the discounting factor γ = 1,

the entropy hyperparameter of the MDP equals the RAML temperature τ,

and an episode terminates either when eos is emitted or when the length reaches T, at which point we enforce the generation of eos.
Moreover, the optimal soft value functions V* and Q* of the MDP exactly match the V and Q defined by Eqn. (14) and (15) respectively. The optimal policy π* is hence equivalent to the token-level target distribution in Eqn. (3).
Proof.
See Appendix A.1. ∎
The connection established by Corollary 1 is quite inspiring:

Firstly, it provides a rigorous and generalized view of the connection between RAML and entropy-regularized RL. In the original work, Norouzi et al. (2016) point out that RAML can be seen as reversing the direction of the KL divergence between the model distribution and the exponentiated payoff distribution, which is a sequence-level view of the connection. Now, with the equivalence between the token-level target and the optimal policy π*, it generalizes to matching the future action values, which consist of both the reward and the entropy.

Secondly, due to the equivalence, if we solve for the optimal soft Q-function of the corresponding MDP, we directly obtain the token-level target distribution. This hints at a practical algorithm with token-level credit assignment.

Moreover, since RAML is able to improve upon MLE by injecting entropy, the entropy-regularized RL counterpart of the standard AC algorithm should also lead to an improvement in a similar manner.
4 Proposed Algorithms
In this section, we explore the insights gained from Corollary 1 and present two new algorithms for sequence prediction.
4.1 Value Augmented Maximum Likelihood
The first algorithm we consider is the token-level extension of RAML, which we have been discussing since §2. As mentioned at the end of §2.2, Proposition 1 only gives an implicit form of Q, and hence of the token-level target distribution (Eqn. (3)). However, thanks to Corollary 1, we now know that Q is the same as the optimal soft action-value function Q* of the entropy-regularized MDP ℳ. Hence, by finding Q*, we gain access to the token-level target.
At first sight, recovering Q* seems as difficult as solving the original sequence prediction problem, because solving the MDP is essentially the same as learning the optimal policy for sequence prediction. However, this is not true, because Q* (i.e., Q) can condition on the correct reference y*, whereas the model distribution can only depend on x*. Therefore, the function approximator trained to recover Q* can take y* as input, making the estimation task much easier. Intuitively, when recovering Q*, we are trying to train an ideal "oracle", which has access to the ground-truth reference output, to decide the best behavior (policy) in any arbitrary (good or not) state.
Thus, following the reasoning above, we first train a parametric function approximator Q_φ to approximate the optimal soft action value. In this work, for simplicity, we employ the Soft Q-Learning algorithm (Schulman et al., 2017) to perform the policy optimization. In a nutshell, Soft Q-Learning is the entropy-regularized version of Q-Learning, an off-policy algorithm which minimizes the mean squared soft Bellman residual according to Eqn. (11). Specifically, given a ground-truth pair (x*, y*), for any trajectory y, the training objective is
L(φ) = Σ_{t=1}^{|y|} ( Q_φ(y_{1:t−1}, y_t) − Q̂(y_{1:t−1}, y_t) )²,    (16)
where Q̂(y_{1:t−1}, y_t) = r̃(y_{1:t−1}, y_t) + τ log Σ_{y' ∈ 𝒱} exp( Q_φ(y_{1:t}, y') / τ ) is the one-step look-ahead target Q-value, with the log-sum-exp term as defined in Eqn. (10). In recent instantiations of Q-Learning (Mnih et al., 2015), to stabilize training, the target Q-value is often estimated by a separate, slowly updated target network. In our case, as we have access to a significant amount of reference sequences, we find the target network unnecessary. Thus, we directly optimize Eqn. (16) using gradient descent, and let the gradient flow through both the prediction Q_φ(y_{1:t−1}, y_t) and the target Q̂ (Baird, 1995).
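This residual-gradient flavor of Soft Q-Learning can be sketched on a toy two-step problem: tabular Q-values stand in for Q_φ, and each update descends the squared soft Bellman residual with the gradient flowing through both the prediction and the log-sum-exp target. All rewards, rates, and names are illustrative assumptions, not the paper's settings:

```python
import math
import random

random.seed(0)
TAU, LR = 0.5, 0.05
R1 = {0: 1.0, 1: 0.0}   # first-step reward per action (toy)
R2 = {0: 0.0, 1: 2.0}   # second (terminal) step reward per action (toy)

# Tabular stand-in for the Q "network": Q[state][action]
Q = {s: {a: random.gauss(0.0, 0.1) for a in (0, 1)} for s in ("s0", "s1")}

def soft_v(qs):
    # V(s) = tau * log sum_a exp(Q(s, a) / tau), as in Eqn. (10)
    return TAU * math.log(sum(math.exp(q / TAU) for q in qs.values()))

def update(state, action, reward, next_state):
    """One residual-gradient step on the squared soft Bellman error.
    The gradient flows through both the prediction and the target (Baird, 1995)."""
    target = reward + (soft_v(Q[next_state]) if next_state else 0.0)
    delta = Q[state][action] - target
    Q[state][action] -= LR * 2.0 * delta
    if next_state:  # gradient through the log-sum-exp inside the target
        z = sum(math.exp(q / TAU) for q in Q[next_state].values())
        for a2, q2 in list(Q[next_state].items()):
            Q[next_state][a2] += LR * 2.0 * delta * math.exp(q2 / TAU) / z

for _ in range(5000):
    a = random.choice((0, 1))   # fully off-policy: a random behavior policy
    update("s0", a, R1[a], "s1")
    b = random.choice((0, 1))
    update("s1", b, R2[b], None)
```

After training, the terminal-step values converge to the immediate rewards and the first-step values to reward plus the soft value of the next state, i.e., the soft Bellman residuals vanish at every state-action pair.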
After the training of Q_φ converges, we fix its parameters and optimize the cross entropy w.r.t. the model parameters θ, which is equivalent to⁴

⁴See Appendix A.2 for a detailed derivation.
max_θ Σ_{t=1}^{|y|} Σ_{y' ∈ 𝒱} q_φ(y' | y_{1:t−1}, y*) log p_θ(y' | y_{1:t−1}, x*),    (17)

where q_φ denotes the token-level target distribution (Eqn. (3)) induced by Q_φ.
Compared to the objective of RAML in Eqn. (2), having access to Q_φ allows us to provide a distinct token-level target for each conditional factor of the model. While directly sampling from the sequence-level target q is practically infeasible (§2.1), having a parametric target distribution makes it theoretically possible to sample from q_φ and perform the optimization. However, empirically, we find that samples from q_φ are not diverse enough (§6). Hence, we fall back to the same importance sampling approach (see Appendix B.2) as used in RAML.
Finally, since the algorithm utilizes the optimal soft action-value function to construct the token-level target, we will refer to it as value augmented maximum likelihood (VAML) in the sequel.
4.2 Entropy-regularized Actor-Critic
The second algorithm follows the discussion at the end of §3.2, and is essentially an actor-critic algorithm based on the entropy-regularized MDP in Corollary 1. For this reason, we name the algorithm entropy-regularized actor-critic (ERAC). As with the standard AC algorithm, the training process interleaves the evaluation of the current policy using the parametric critic and the optimization of the actor policy given the current critic.
Critic Training.
The critic Q_φ is trained to perform policy evaluation via temporal difference (TD) learning, which minimizes the TD error
L(φ) = E_{y ∼ p_θ} [ Σ_{t=1}^{|y|} ( Q_φ(y_{1:t−1}, y_t) − q̂_t )² ],    (18)
where the TD target q̂_t is constructed based on the fixed-policy recursion in Eqn. (9), i.e.,
q̂_t = r̃(y_{1:t−1}, y_t) + Σ_{y' ∈ 𝒱} p_θ(y' | y_{1:t}, x*) [ Q_φ̄(y_{1:t}, y') − τ log p_θ(y' | y_{1:t}, x*) ].    (19)
It is worthwhile to emphasize that objective (18) trains the critic to evaluate the current policy. Hence, it is entirely different from objective (16), which performs policy optimization by Soft Q-Learning. Also, the trajectories used in (18) are sequences drawn from the actor policy p_θ, while objective (16) theoretically accepts any trajectory, since Soft Q-Learning can be fully off-policy.⁵ Finally, following Bahdanau et al. (2016), the TD target in Eqn. (19) is evaluated using a target network, which is indicated by the bar sign above the parameters, i.e., φ̄. The target network is slowly updated by linearly interpolating with the up-to-date network, i.e., φ̄ ← (1 − η) φ̄ + η φ for η ∈ (0, 1] (Lillicrap et al., 2015).

⁵Different from Bahdanau et al. (2016), we don't use a delayed actor network to collect trajectories for critic training.
We also adopt another technique proposed by Bahdanau et al. (2016), which smooths the critic by minimizing the "variance" of the Q-values, i.e.,

L_smooth(φ) = λ Σ_t Σ_{y' ∈ 𝒱} ( Q_φ(y_{1:t−1}, y') − Q̄(y_{1:t−1}) )²,

where Q̄(y_{1:t−1}) = (1 / |𝒱|) Σ_{y' ∈ 𝒱} Q_φ(y_{1:t−1}, y') is the mean Q-value, and λ is a hyperparameter controlling the relative weight between the TD loss and the smoothing loss.
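The two stabilization devices, the slowly updated target network and the Q-value smoothing penalty, reduce to a few lines. This is a schematic sketch with plain parameter lists standing in for networks; the η and λ values are illustrative:

```python
ETA, LAM = 0.01, 0.001  # illustrative update rate and smoothing weight

def polyak_update(target_params, online_params, eta=ETA):
    """target <- (1 - eta) * target + eta * online; eta = 1 disables the target net."""
    for i, w in enumerate(online_params):
        target_params[i] = (1.0 - eta) * target_params[i] + eta * w

def smoothing_penalty(q_values, lam=LAM):
    """lam * sum_a (Q(s, a) - mean_a' Q(s, a'))^2, the 'variance' regularizer."""
    mean_q = sum(q_values) / len(q_values)
    return lam * sum((q - mean_q) ** 2 for q in q_values)

online = [0.5, -1.2, 3.0]
target = [0.0, 0.0, 0.0]
polyak_update(target, online)  # target drifts slowly toward the online parameters
```

Note that a small η keeps the TD target nearly stationary between updates, while the penalty discourages the critic from assigning wildly different values to unexplored symbols at the same state.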
Actor Training.
Given the critic Q_φ, the actor gradient (to maximize the expected return) is given by the policy gradient theorem of entropy-regularized RL (Schulman et al., 2017), which has the form
∇_θ J = E_{y ∼ p_θ} [ Σ_{t=1}^{|y|} ( Σ_{y' ∈ 𝒱} Q_φ(y_{1:t−1}, y') ∇_θ p_θ(y' | y_{1:t−1}, x*) + τ ∇_θ H( p_θ(· | y_{1:t−1}, x*) ) ) ].    (20)
Here, for each step t, we follow Bahdanau et al. (2016) and sum over the entire symbol set 𝒱, instead of using the single-sample estimate often seen in RL. Hence, no baseline is employed. It is worth mentioning that Eqn. (20) is not simply adding an entropy term to the standard policy gradient as in A3C (Mnih et al., 2016). The difference lies in that the critic trained by Eqn. (18) additionally captures the entropy from future steps, while the τ∇_θH term only captures the entropy of the current step.
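At a single step, this actor update can be sketched with a softmax actor over a small symbol set and fixed critic values: the gradient sums over all symbols (no sampling, no baseline) and includes the entropy term scaled by τ. The Q-values and learning rate below are toy assumptions:

```python
import math

TAU, LR = 0.5, 0.1
Q = [1.0, 2.0, 0.0]        # fixed critic values Q(s, a) for each symbol (toy)
theta = [0.0, 0.0, 0.0]    # actor logits

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def objective(th):
    # One-step entropy-regularized return: E_pi[Q] + tau * H(pi)
    pi = softmax(th)
    ent = -sum(p * math.log(p) for p in pi)
    return sum(p * q for p, q in zip(pi, Q)) + TAU * ent

def grad(th):
    pi = softmax(th)
    ent = -sum(p * math.log(p) for p in pi)
    ev = sum(p * q for p, q in zip(pi, Q))
    # d/d theta_j of sum_a pi_a Q_a  is  pi_j (Q_j - E_pi[Q]);
    # d/d theta_j of H(pi)           is  -pi_j (log pi_j + H)
    return [pi[j] * (Q[j] - ev) - TAU * pi[j] * (math.log(pi[j]) + ent)
            for j in range(len(th))]

before = objective(theta)
theta = [t + LR * g for t, g in zip(theta, grad(theta))]  # one ascent step
after = objective(theta)
```

Iterating this update drives the actor toward the Boltzmann policy softmax(Q/τ), the fixed point predicted by Eqn. (12).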
Finally, similar to Bahdanau et al. (2016), we find it necessary to first pretrain the actor using MLE and then pretrain the critic before the actor-critic training. Also, to prevent divergence during actor-critic training, it is helpful to continue performing MLE training along with the update in Eqn. (20), though with a smaller weight.
5 Related Work
Task Loss Optimization and Exposure Bias
Apart from the previously introduced RAML, BSO and Actor-Critic (§1), MIXER (Ranzato et al., 2015) also utilizes chunk-level signals, where the length of the chunk grows as training proceeds. In contrast, minimum risk training (Shen et al., 2015) directly optimizes sentence-level BLEU. As a result, it requires a large number (100) of samples per instance to work well. To address the exposure bias, scheduled sampling (Bengio et al., 2015) adopts a curriculum learning strategy to bridge training and inference. Professor forcing (Lamb et al., 2016) introduces an adversarial training mechanism to encourage the dynamics of the model to be the same at training time and inference time. For image captioning, self-critical sequence training (SCST) (Rennie et al., 2016) extends the MIXER algorithm with an improved baseline based on the current model performance.
Entropyregularized RL
Entropy regularization has been explored by early work in RL and inverse RL (Williams and Peng, 1991; Ziebart et al., 2008). Lately, Schulman et al. (2017) establish the equivalence between policy gradients and Soft Q-Learning under entropy-regularized RL. Motivated by the multi-modal behavior induced by entropy-regularized RL, Haarnoja et al. (2017) apply energy-based policies and Soft Q-Learning to continuous domains. Later, Nachum et al. (2017) propose path consistency learning, which can be seen as a multi-step extension of Soft Q-Learning. More recently, in the domain of simulated control, Haarnoja et al. (2018) also consider the actor-critic algorithm under the framework of entropy-regularized reinforcement learning. Despite the conceptual similarity to the ERAC algorithm presented here, Haarnoja et al. (2018) focus on continuous control and employ the advantage actor-critic variant as in (Mnih et al., 2016), while ERAC follows the Q actor-critic as in (Bahdanau et al., 2016).
6 Experiments

Table 1: BLEU scores of all algorithms on the two tasks (mean ± std, min and max over 9 runs).

Algorithm   MT (w/o input feeding)      MT (w/ input feeding)       Image Captioning
            Mean          Min    Max    Mean          Min    Max    Mean          Min    Max
MLE         27.01 ± 0.20  26.72  27.27  28.06 ± 0.15  27.84  28.22  29.54 ± 0.21  29.27  29.89
RAML        27.74 ± 0.15  27.47  27.93  28.56 ± 0.15  28.35  28.80  29.84 ± 0.21  29.50  30.17
VAML        28.16 ± 0.11  28.00  28.26  28.84 ± 0.10  28.62  28.94  29.93 ± 0.22  29.51  30.24
AC          28.04 ± 0.05  27.97  28.10  29.05 ± 0.06  28.95  29.16  30.90 ± 0.20  30.49  31.16
ERAC        28.30 ± 0.06  28.25  28.42  29.31 ± 0.04  29.26  29.36  31.44 ± 0.22  31.07  31.82
6.1 Experiment Settings
In this work, we focus on two sequence prediction tasks: machine translation and image captioning. Due to the space limit, we only present the information necessary to compare the empirical results here. For a more detailed description, we refer readers to Appendix B and the released code.⁶

⁶https://github.com/zihangdai/ERACVAML
Machine Translation
Following Ranzato et al. (2015), we evaluate on the IWSLT 2014 German-to-English dataset (Mauro et al., 2012). The corpus contains roughly 153K sentence pairs in the training set. We follow the preprocessing procedure used in (Ranzato et al., 2015).
Architecture-wise, we employ a seq2seq model with dot-product attention (Bahdanau et al., 2014; Luong et al., 2015), where the encoder is a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and the decoder is another LSTM. Moreover, we consider two variants of the decoder, one using the input feeding technique (Luong et al., 2015) and the other not.
For all algorithms, the sequence-level BLEU score is employed as the payoff function, while the corpus-level BLEU score (Papineni et al., 2002) is used for the final evaluation. The sequence-level BLEU score is scaled by the sentence length so that the scale of the immediate reward at each step is invariant to the length.
Image Captioning
For image captioning, we consider the MSCOCO dataset (Lin et al., 2014). We adopt the same preprocessing procedure and the train/dev/test split used by Karpathy and Fei-Fei (2015).
The NIC model (Vinyals et al., 2015) is employed as the baseline, where a feature vector of the image is extracted by a pretrained CNN and then used to initialize the LSTM decoder. Different from the original NIC model, we employ a pretrained ResNet (He et al., 2016) rather than a GoogLeNet as the CNN encoder.
For training, each image-caption pair is treated as an i.i.d. sample, and the sequence-level BLEU score is used as the payoff. For testing, the standard multi-reference BLEU-4 is used.
6.2 Comparison with the Direct Baseline
Firstly, we compare ERAC and VAML with their corresponding direct baselines, namely AC (Bahdanau et al., 2016) and RAML (Norouzi et al., 2016) respectively. As a reference, the performance of MLE is also provided.
Due to the non-negligible performance variance observed across different runs, we run each algorithm 9 times with different random seeds,⁷ and report the average performance, the standard deviation and the performance range (min, max).

⁷For AC, ERAC and VAML, 3 different critics are trained first, and each critic is then used to train 3 actors.
Machine Translation
The results on MT are summarized in the left half of Tab. 1. Firstly, all four advanced algorithms significantly outperform the MLE baseline. More importantly, both VAML and ERAC improve upon their direct baselines, RAML and AC, by a clear margin on average. This suggests that the two proposed algorithms successfully combine the benefits of a delicate credit assignment scheme and entropy regularization, achieving improved performance.
Image Captioning
The results on image captioning are shown in the right half of Tab. 1. Despite the similar overall trend, the improvement of VAML over RAML is smaller compared to that in MT. Meanwhile, the improvement from AC to ERAC becomes larger in comparison. We suspect this is due to the multi-reference nature of the MSCOCO dataset, where a larger entropy is preferred. As a result, the explicit entropy regularization in ERAC becomes immediately fruitful. On the other hand, with multiple references, it can be more difficult to learn a good oracle (Eqn. (15)). Hence, the token-level target can be less accurate, resulting in a smaller improvement.
6.3 Comparison with Existing Work
To further evaluate the proposed algorithms, we compare ERAC and VAML with the large body of existing algorithms evaluated on IWSLT 2014. As a note of caution, previous works do not employ exactly the same architectures (e.g., number of layers, hidden size, attention type, etc.). Despite that, for VAML and ERAC, we use an architecture that is most similar to the majority of previous works, namely the one described in §6.1 with input feeding.
Based on this setting, the comparison is summarized in Table 2.⁸ As we can see, both VAML and ERAC outperform previous methods, with ERAC leading the comparison by a significant margin. This further verifies the effectiveness of the two proposed algorithms.

⁸For a more detailed comparison of performance together with the model architectures, see Table 7 in Appendix C.
Table 2: Comparison with existing algorithms on IWSLT 2014.

Algorithm  BLEU

MIXER (Ranzato et al., 2015)  20.73 
BSO (Wiseman and Rush, 2016)  27.9 
Q(BLEU) (Li et al., 2017)  28.3 
AC (Bahdanau et al., 2016)  28.53 
RAML (Ma et al., 2017)  28.77 
VAML  28.94 
ERAC  29.36 
6.4 Ablation Study
Due to the overall excellence of ERAC, we study the importance of its various components, hopefully offering a practical guide for readers. As the input feeding technique largely slows down training, we conduct the ablation based on the model variant without input feeding.
Table 3: Average validation BLEU when varying the target network update rate η, with and without the smoothing regularization.

η                0.001   0.01    0.1     1
w/o smoothing    27.91   26.27   28.88   27.38
w/ smoothing     29.41   29.26   29.32   27.44
Firstly, we study the importance of two techniques aimed at training stability, namely the target network and the smoothing technique (§4.2). Based on the MT task, we vary the update rate η of the target critic and the coefficient λ, which controls the strength of the smoothness regularization. The average validation performances for different hyperparameter values are summarized in Tab. 3.

Comparing the two rows of Tab. 3, the smoothing technique consistently leads to performance improvement across all values of η. In fact, removing the smoothing objective often causes the training to diverge, especially when η = 0.01 or η = 1. But interestingly, we find the divergence does not happen if we update the target network a little bit faster (η = 0.1) or quite slowly (η = 0.001).

In addition, even with the smoothing technique, the target network is still necessary. When the target network is not used (η = 1), the performance drops below the MLE baseline. However, as long as a target network is employed to ensure training stability, the specific choice of the update rate does not matter as much. Empirically, it seems that using a slower update rate (η = 0.001) yields the best result.
Next, we investigate the effect of enforcing different levels of entropy by varying the entropy hyperparameter τ. As shown in Fig. 1, there is always a sweet spot for the level of entropy. On the one hand, imposing an overly strong entropy regularization can easily cause the actor to diverge, which happens when τ becomes too large on either the image captioning or the machine translation task. On the other hand, as we decrease τ from the best value towards 0, the performance monotonically decreases as well. This observation further verifies the effectiveness of entropy regularization in ERAC, which well matches our theoretical analysis.
Finally, as discussed in §4.2, ERAC takes the effect of future entropy into consideration, and thus is different from simply adding an entropy term to the standard policy gradient as in A3C (Mnih et al., 2016). To verify the importance of explicitly modeling the entropy from future steps, we compare ERAC with a variant that applies the entropy regularization only to the actor but not to the critic. In other words, τ is set to 0 when performing policy evaluation according to Eqn. (19), while the τ for the entropy gradient in Eqn. (20) remains. The comparison based on 9 runs on the test set of IWSLT 2014 is shown in Table 4. As we can see, simply adding a local entropy gradient does not even improve upon AC. This further verifies the difference between ERAC and A3C, and shows the importance of taking future entropy into consideration.
Table 4: Effect of modeling future entropy (test BLEU on IWSLT 2014, 9 runs).

Algorithm               Mean          Max
ERAC                    28.30 ± 0.06  28.42
ERAC w/o Future Ent.    28.06 ± 0.05  28.11
AC                      28.04 ± 0.05  28.10
7 Discussion
In this work, motivated by the intriguing connection between the token-level RAML and entropy-regularized RL, we propose two algorithms for neural sequence prediction. Despite the distinct training procedures, both algorithms combine the idea of fine-grained credit assignment and entropy regularization, leading to positive empirical results.
However, many problems remain widely open. In particular, the oracle Q-function we obtain is far from perfect. We believe the ground-truth reference contains sufficient information for such an oracle, and the current bottleneck lies in the RL algorithm. Given the numerous potential applications of such an oracle, we believe improving its accuracy will be a promising future direction.
References
 Bahdanau et al. (2016) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
 Baird (1995) Leemon Baird. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, Elsevier, pages 30–37.
 Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems. pages 1171–1179.
 Chan et al. (2016) William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, pages 4960–4964.
 Daumé III and Marcu (2005) Hal Daumé III and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd international conference on Machine learning. ACM, pages 169–176.
 Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energybased policies. arXiv preprint arXiv:1702.08165 .
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 .
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 770–778.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long shortterm memory. Neural computation 9(8):1735–1780.
 Huang et al. (2017) PoSen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Toward neural phrasebased machine translation .
 Karpathy and FeiFei (2015) Andrej Karpathy and Li FeiFei. 2015. Deep visualsemantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. pages 3128–3137.
 Lamb et al. (2016) Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems. pages 4601–4609.
 Li et al. (2017) Jiwei Li, Will Monroe, and Dan Jurafsky. 2017. Learning to decode for future success. arXiv preprint arXiv:1701.06549 .
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .
 Lin et al. (2014) TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, pages 740–755.
 Luong et al. (2015) MinhThang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attentionbased neural machine translation. arXiv preprint arXiv:1508.04025 .
 Ma et al. (2017) Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax qdistribution estimation for structured prediction: A theoretical interpretation for raml. arXiv preprint arXiv:1705.07136 .
 Mauro et al. (2012) Cettolo Mauro, Girardi Christian, and Federico Marcello. 2012. Wit3: Web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation. pages 261–268.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. pages 1928–1937.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518(7540):529.
 Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. 2017. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems. pages 2772–2782.
 Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems. pages 1723–1731.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 .
 Rennie et al. (2016) Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Selfcritical sequence training for image captioning. arXiv preprint arXiv:1612.00563 .
 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 .
 Schulman et al. (2017) John Schulman, Pieter Abbeel, and Xi Chen. 2017. Equivalence between policy gradients and soft qlearning. arXiv preprint arXiv:1704.06440 .
 Shen et al. (2015) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433 .
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
 Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, pages 3156–3164.
 Williams and Peng (1991) Ronald J Williams and Jing Peng. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.
 Wiseman and Rush (2016) Sam Wiseman and Alexander M Rush. 2016. Sequencetosequence learning as beamsearch optimization. arXiv preprint arXiv:1606.02960 .
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. pages 2048–2057.
 Ziebart (2010) Brian D Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI. Chicago, IL, USA, volume 8, pages 1433–1438.
Appendix A Proofs
A.1 Main Proofs
Proposition 1.
For any ground-truth pair , and satisfy the following marginal match condition and terminal condition:
(21) 
(22) 
if and only if for any ,
(23) 
Proof.
To avoid clutter, we drop the dependency on and . The following proof holds for each possible pair of .
Firstly, it is easy to see that the terminal condition in Eqn. (22) exactly corresponds to the case of Eqn. (23), since for . So, we will focus on the nonterminal case next.
Sufficiency
For convenience, define . Suppose Eqn. (23) is true. Then for any ,
where denotes when and is an empty set. Since is a valid distribution by construction, we have
Hence,
which satisfies the marginal match requirement.
Necessity
Now, we show that the specific formulation of (Eqn. (23)) is also a necessary condition of the marginal match condition (Eqn. (21)).
The token-level target distribution can be simplified as
Corollary 1.
Please refer to §3.2 for the Corollary.
Proof.
Similarly, we drop the dependency on and to avoid clutter. We first prove the equivalence of with by induction.

Base case: When , for any , can only be eos. So, by definition, we have
Hence,
For the first case, it directly follows
For the second case, since only eos is allowed to be generated, the target distribution should be a single-point distribution at eos. This is equivalent to defining
which proves the second case. Combining the two cases, we conclude

Induction step: When , assume the equivalence holds when , i.e.,
Then,
Thus, holds for .
With the equivalence between and , we can easily prove and ,
∎
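To make the marginal match idea behind Proposition 1 concrete, the following toy numeric check (our own construction and notation, not the paper's) verifies that if each token-level conditional is defined as the ratio of consecutive prefix marginals under a sequence-level target distribution over eos-terminated sequences, then the chain-rule product of conditionals recovers every sequence-level probability exactly.

```python
# Toy sequence-level target distribution Q over eos-terminated
# sequences drawn from a two-token vocabulary; masses sum to 1.
Q = {
    ('a', 'eos'): 0.2,
    ('b', 'eos'): 0.1,
    ('a', 'a', 'eos'): 0.4,
    ('b', 'a', 'eos'): 0.3,
}

def marginal(prefix):
    """Total probability mass of all sequences starting with `prefix`."""
    return sum(p for seq, p in Q.items() if seq[:len(prefix)] == prefix)

def token_conditional(prefix, token):
    """Token-level target: ratio of consecutive prefix marginals."""
    return marginal(prefix + (token,)) / marginal(prefix)

# Chain rule: multiplying the token-level conditionals along any
# sequence reproduces its sequence-level probability.
for seq, p in Q.items():
    prod = 1.0
    for t, tok in enumerate(seq):
        prod *= token_conditional(seq[:t], tok)
    assert abs(prod - p) < 1e-12
```

Note the terminal condition is also visible here: once a prefix can only continue with eos, the conditional at that step is a single-point distribution on eos, mirroring the base case of the induction above.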