Non-Autoregressive Neural Dialogue Generation
Abstract
Maximum Mutual Information (MMI), which models the bidirectional dependency between responses ($T$) and contexts ($S$), i.e., the forward probability $p(T|S)$ and the backward probability $p(S|T)$, has been widely used as the objective in the Seq2Seq model to address the dull-response issue in open-domain dialog generation. Unfortunately, under the framework of the Seq2Seq model, direct decoding from MMI is infeasible, since the second part (i.e., $p(S|T)$) requires the completion of target generation before it can be computed, and the search space for $T$ is enormous. Empirically, an N-best list is first generated given $p(T|S)$, and $p(S|T)$ is then used to rerank the N-best list, which inevitably results in non-globally-optimal solutions.
In this paper, we propose to use a non-autoregressive (non-AR) generation model to address this non-global-optimality issue. Since target tokens are generated independently in non-AR generation, $p(S|y_t)$ for each target word $y_t$ can be computed as soon as $y_t$ is generated, without waiting for the completion of the whole sequence. This naturally resolves the non-global-optimality issue in decoding.
Experimental results demonstrate that the proposed nonAR strategy produces more diverse, coherent, and appropriate responses, yielding substantive gains in BLEU scores and in human evaluations.
1 Introduction
Open-domain neural dialogue generation (Vinyals and Le, 2015; Sordoni et al., 2015; Li et al., 2016a; Mou et al., 2016; Serban et al., 2016a; Asghar et al., 2016; Mei et al., 2016; Serban et al., 2016e, b, d; Baheti et al., 2018; Wang et al., 2018; Ghazvininejad et al., 2018; Zhang et al., 2018; Gao et al., 2019) treats dialog contexts ($S$) as sources and responses ($T$) as targets, and uses the encoder-decoder model Sutskever et al. (2014); Vaswani et al. (2017b) as the backbone to generate responses. Seq2Seq models offer the promise of scalability and language-independence, along with the capacity to capture contextual dependencies as well as semantic and syntactic relations between sources and targets.
One of the key issues with the Seq2Seq structure is that it exhibits a strong tendency to generate dull, trivial or non-committal responses (e.g., I don't know or I'm OK) regardless of the input, which has been observed in many recent works Li et al. (2016a); Sordoni et al. (2015); Serban et al. (2016c); Niu and Bansal (2020). Various strategies Li et al. (2016a); Vijayakumar et al. (2016); Baheti et al. (2018); Niu and Bansal (2020) have been proposed to address the dull-response issue in neural dialogue generation, and one of the most widely used is to replace the MLE objective in Seq2Seq training with the maximum mutual information objective (MMI for short) Li et al. (2016a). MMI models the bidirectional dependency between responses ($T$) and contexts ($S$): it takes the form of a linear combination of the forward probability $p(T|S)$ and the backward probability $p(S|T)$. The intuition behind MMI is straightforward: it is easy to predict a dull response given any context, but hard to predict the context given a dull response, since the context that corresponds to a dull response could be anything.
Unfortunately, under the framework of the Seq2Seq model, direct decoding from MMI is infeasible, since the second part (i.e., $p(S|T)$) requires the completion of target generation before it can be computed, and the search space for $T$ is huge. Empirically, an N-best list is first generated given $p(T|S)$, and $p(S|T)$ is then used to rerank the N-best list. One of the long-standing drawbacks of beam search, however, is the lack of diversity within the beam: candidates often differ only by punctuation or minor morphological variations, with most of the words overlapping. This reranking strategy thus inevitably results in non-globally-optimal solutions. Some strategies have been proposed to alleviate this non-global-optimality issue, such as generating a more diverse N-best list Li et al. (2016c); Gu et al. (2017); Vijayakumar et al. (2016), or using reinforcement learning to estimate the future score of $p(S|T)$ Li et al. (2017a). These help alleviate the non-globally-optimal issue, but cannot fully address it.
Non-autoregressive (non-AR) generation Gu et al. (2018); Ma et al. (2019); Lee et al. (2018) provides a natural way to address the non-global-optimality issue. Under the formalization of non-AR generation, target tokens are generated independently, which enables $p(S|y_t)$ to be computed as soon as $y_t$ is generated. This naturally resolves the non-global-optimality issue in decoding. We conduct experiments on the widely used OpenSubtitles dataset, and experimental results demonstrate that the proposed strategy produces more diverse, coherent, and appropriate responses, yielding substantive gains in BLEU scores and in human evaluations.
The rest of this paper is organized as follows: Section 2 and Section 3 present related work and background knowledge, respectively. The proposed model is described in Section 4. Experimental results are detailed in Section 5, followed by a brief conclusion in Section 6.
2 Related Work
2.1 Neural Dialogue Generation
End-to-end neural approaches for dialogue generation use Seq2Seq architectures Sutskever et al. (2014); Vaswani et al. (2017b) as the backbone to generate syntactically fluent and meaningful responses, providing the flexibility to capture contextual semantics between source contexts and target responses. Recent studies have endowed these models with the ability to model contexts Sordoni et al. (2015); Serban et al. (2016e, b); Tian et al. (2017); Lewis et al. (2017), generate coherent and personalized responses Li et al. (2016b); Zhao et al. (2017); Shao et al. (2017); Xing et al. (2017); Zhang et al. (2018); Bosselut et al. (2018), generate utterances with different attributes or topics Wang et al. (2017); Niu and Bansal (2018), and interact fluently with humans Ghazvininejad et al. (2018); Zhang et al. (2019); Adiwardana et al. (2020).
2.2 Diverse Decoding
One major issue with Seq2Seq systems is their propensity to select dull, non-committal responses regardless of the input, and many diverse decoding algorithms have been proposed to tackle this problem Li et al. (2016a); Li and Jurafsky (2016); Vijayakumar et al. (2016); Cho (2016); Kulikov et al. (2018); Kriz et al. (2019); Ippolito et al. (2019). Li et al. (2016a) proposed to use Maximum Mutual Information (MMI) as the objective function in neural dialog models. MMI models use both the forward probability and the backward probability to better capture the contextual relations between the source and target sequences. Li and Jurafsky (2016) introduced a beam search diversification heuristic that discourages sequences from sharing common roots, implicitly resulting in diverse sequences. Vijayakumar et al. (2016) improved upon Li and Jurafsky (2016) and presented Diverse Beam Search, which formalizes beam search as an optimization problem and augments the objective with a diversity term. Cho (2016) introduced Noisy Parallel Approximate Decoding, a method that encourages diversity by adding small amounts of noise to the hidden state of the decoder at each step, instead of manipulating the probabilities output by the model. Kulikov et al. (2018) attempted to explore a larger beam search space by running beam search many times, where the states explored by subsequent beam searches are restricted based on the intermediate states explored by previous iterations. These works have pushed dialogue models to generate more interesting and diverse responses that are both high-quality and meaningful.
2.3 Non-Autoregressive Sequence Generation
Besides the diversity of responses, another problem for these dialogue generation models is their autoregressive generation strategy that decodes words one by one, making them extremely slow on long sentences, especially in settings where multi-turn dialogue often appears Adiwardana et al. (2020). One solution is non-autoregressive sequence generation, which has recently attracted broad interest in the neural machine translation (NMT) community Gu et al. (2018); Lee et al. (2018); Ma et al. (2019); Sun et al. (2019); Shu et al. (2019); Bao et al. (2019). Gu et al. (2018) proposed to reduce the latency of autoregressive Seq2Seq NMT systems by generating all target tokens in parallel with the help of fertility, which led to a 15x speedup over traditional autoregressive methods, though at the cost of a rapid degradation in performance. Lee et al. (2018); Ma et al. (2019); Shu et al. (2019) proposed to use latent variables to model intermediate word alignments between source and target sequence pairs and to mitigate the trade-off between decoding speed and performance. Bao et al. (2019) pointed out that position information is crucial for non-autoregressive models and thus proposed to explicitly model positions as latent variables. Sun et al. (2019) incorporated a CRF into non-autoregressive models to enhance local dependencies during decoding. This work is greatly inspired by these advances in non-autoregressive sequence generation.
3 Background
3.1 Autoregressive Seq2Seq Models
An encoder-decoder model Sutskever et al. (2014); Vaswani et al. (2017b); Bahdanau et al. (2015) defines the probability of a target sequence $T = \{y_1, y_2, ..., y_{n_T}\}$, which is a response in the context of dialogue generation, given a source sequence $S = \{x_1, x_2, ..., x_{n_S}\}$, where $n_S$ and $n_T$ are the lengths of the source and target sentences, respectively.
An autoregressive encoder-decoder model decomposes the distribution over a target sequence into a chain of conditional probabilities:

$$p(T|S) = \prod_{t=1}^{n_T+1} p(y_t \mid y_0, ..., y_{t-1}, S) \quad (1)$$

with $y_0$ being the special $\langle\text{bos}\rangle$ token and $y_{n_T+1}$ being the special $\langle\text{eos}\rangle$ token. The probability of generating a token $y_t$ depends on all tokens in the source $S$ and all its previous tokens $y_{<t}$ in $T$. The concatenation of $S$ and $y_{<t}$ is mapped to a representation $h_t$ using LSTMs Sutskever et al. (2014), CNNs Gehring et al. (2017) or transformers Vaswani et al. (2017b); $h_t$ denotes the representation at time step $t$.
During decoding, the algorithm terminates when the $\langle\text{eos}\rangle$ token is predicted. At each time step, either greedy search or beam search can be adopted for word prediction. Greedy search selects the token with the largest conditional probability, whose embedding is then combined with the preceding output to predict the token at the next step.
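As a concrete illustration of greedy autoregressive decoding, the sketch below uses a toy, hand-written next-token distribution in place of a trained Seq2Seq decoder; `next_token_probs` and all probabilities are hypothetical stand-ins, not the paper's model:

```python
# Toy illustration of greedy autoregressive decoding (not the paper's code).
# next_token_probs stands in for a trained decoder's softmax output given
# the source and the previously generated prefix.

def next_token_probs(source, prefix):
    pos = len(prefix) - 1  # prefix[0] is <bos>
    if pos < len(source):
        # Toy deterministic "model": echo the source tokens one by one.
        return {source[pos]: 0.9, "<eos>": 0.1}
    return {"<eos>": 1.0}

def greedy_decode(source, max_len=10):
    prefix = ["<bos>"]
    while len(prefix) <= max_len:
        probs = next_token_probs(source, prefix)
        # Greedy search: pick the token with the largest conditional probability.
        token = max(probs, key=probs.get)
        prefix.append(token)
        if token == "<eos>":  # decoding terminates once <eos> is predicted
            break
    return prefix[1:]

print(greedy_decode(["how", "are", "you"]))  # ['how', 'are', 'you', '<eos>']
```

Note that each step must wait for the previous token's embedding, which is exactly the sequential dependency that non-AR generation removes.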
3.2 Non-Autoregressive Seq2Seq Models
Overview
Autoregressive generation models have two major drawbacks: they prohibit generating multiple tokens simultaneously, which leads to inefficient GPU usage; and erroneously generated tokens lead to error accumulation, while the performance of beam search deteriorates when exposed to a larger search space (Koehn and Knowles, 2017). Non-autoregressive methods address these two issues by removing the sequential dependencies within the target sentence and generating all target tokens simultaneously, with the probability given as follows:

$$p(T|S) = \prod_{t=1}^{n_T} p(y_t \mid S) \quad (2)$$
Now that each target token $y_t$ depends only on the source sentence $S$, the full target sentence can be decoded in parallel, with argmax applied to each token independently. A vital challenge that non-autoregressive models face is the inconsistency problem Gu et al. (2018): the decoded sequence may contain duplicated or missing tokens. Improving decoding consistency on the target side is thus crucial to non-AR models.
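The parallel argmax decoding described above, and the inconsistency problem that comes with it, can be sketched as follows (toy distributions of our own; not the paper's code):

```python
# Toy illustration of non-AR decoding (not the paper's code): every position's
# distribution p(y_t | S) is independent of the other target tokens, so the
# argmax can be taken at all positions in parallel.

def nonar_decode(position_probs):
    # position_probs[t] maps token -> p(y_t | S); no dependence on y_{<t}.
    return [max(probs, key=probs.get) for probs in position_probs]

# Hand-picked distributions that also exhibit the inconsistency problem:
# with no target-side dependencies, position 1 cannot know that position 0
# already emitted "thank", so the output duplicates it.
probs = [
    {"thank": 0.6, "thanks": 0.4},
    {"thank": 0.5, "you": 0.45},
    {"you": 0.7, "<eos>": 0.3},
]
print(nonar_decode(probs))  # ['thank', 'thank', 'you']
```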
4 Model
4.1 Overview
The maximum mutual information (MMI) model, proposed in Li et al. (2016a), tries to find the response that has the largest value of mutual information with respect to the context. The form of MMI is given as follows:

$$\hat{T} = \arg\max_T \big\{ (1-\lambda) \log p(T|S) + \lambda \log p(S|T) \big\} \quad (3)$$
This weighted MMI objective function can be viewed as representing a trade-off between sources given targets (i.e., $p(S|T)$) and targets given sources (i.e., $p(T|S)$). Direct decoding from Eq. 3 is infeasible, since the second part (i.e., $p(S|T)$) requires the completion of target generation before it can be computed. Empirically, an N-best list is first generated given $p(T|S)$, and $p(S|T)$ is then used to rerank the N-best list, which inevitably results in non-globally-optimal solutions.
Here we propose to use non-AR generation models to handle the non-global-optimality issue. The generation of each target word is independent under the non-AR formalization, and the forward probability is given as follows:

$$\log p(T|S) = \sum_{t=1}^{n_T} \log p(y_t \mid S) \quad (4)$$
For the backward probability $p(S|T)$, which denotes the probability of generating the source sequence given the target sequence, we propose to replace it with the geometric mean of the probability of generating the source sequence given each target token:

$$\log p(S|T) \approx \frac{1}{n_T} \sum_{t=1}^{n_T} \log p(S \mid y_t) \quad (5)$$
We also use the non-AR framework to model the backward probability. Based on the independence assumption of non-AR, in which the generations of the source tokens $x_i$ are independent, Eq. 5 can be further factorized as follows:

$$\log p(S|T) \approx \frac{1}{n_T} \sum_{t=1}^{n_T} \sum_{i=1}^{n_S} \log p(x_i \mid y_t) \quad (6)$$
A close look at Eq. 6 shows that it actually mimics the backward probability in the IBM model Brown et al. (1993): $p(x_i|y_t)$ handles the pairwise word alignment between sources and targets. Since position representations are incorporated at both the encoding and decoding stages, Eq. 6 actually mimics IBM Model 2, where relative positions between source and target words are modeled.
Combining the forward probability in Eq. 4 and the backward probability in Eq. 6, the full form of the mutual information objective in Eq. 3 can be rewritten as follows:

$$\hat{T} = \arg\max_T \sum_{t=1}^{n_T} \Big[ (1-\lambda) \log p(y_t \mid S) + \frac{\lambda}{n_T} \sum_{i=1}^{n_S} \log p(x_i \mid y_t) \Big] \quad (7)$$
As can be seen, we are able to factorize the full form of the MMI objective with respect to $y_t$ under the framework of non-AR generation. This means that the mutual information between the source and different target words is independent and can be computed in parallel. Also, for each token $y_t$, its mutual information with respect to the source can be readily computed as soon as $y_t$ is generated; we do not have to wait for the completion of the entire sequence. This naturally resolves the non-global-optimality issue in the AR generation model. Figure 1 gives an illustration of the proposed model.
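Under our reading of Eq. 7, the score decomposes into independent per-token terms, which could be computed as follows (illustrative sketch with toy probabilities; the function names and data layout are our own, not the authors' implementation):

```python
import math

# Sketch of the factorized MMI score of Eq. 7 (illustrative, not the authors'
# code). forward[t][y] stands in for p(y_t = y | S) and backward[y][i] for
# p(x_i | y_t = y); all numbers used with it are toy values.

def token_mmi(y, t, forward, backward, lam, n_t):
    # Per-token term: (1 - lam) * log p(y_t | S) + (lam / n_T) * sum_i log p(x_i | y_t).
    fwd = (1.0 - lam) * math.log(forward[t][y])
    bwd = (lam / n_t) * sum(math.log(p) for p in backward[y])
    return fwd + bwd

def mmi_score(target, forward, backward, lam=0.5):
    # The objective is a sum of independent per-token terms, so each term can
    # be computed (in parallel) as soon as its token is generated.
    n_t = len(target)
    return sum(token_mmi(y, t, forward, backward, lam, n_t)
               for t, y in enumerate(target))
```

For instance, with a single-token target, `mmi_score(["hi"], [{"hi": 0.5}], {"hi": [0.25]})` evaluates to $0.5\log 0.5 + 0.5\log 0.25 = 1.5\log 0.5$.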
4.2 Forward Probability
We use the non-autoregressive Seq2Seq model as the backbone to compute $p(y_t|S)$; it consists of two major components: the encoder and the decoder.
Encoder
We use transformers Vaswani et al. (2017a) as the backbone and use a stack of identical transformer blocks as the encoder. Given the source sequence $S$, the encoder produces its contextual representations $H_S$ from the last layer of the encoder.
Decoder
Target Length We first need to obtain the length of the target sequence for decoding. We follow previous work Gu et al. (2018); Ma et al. (2019); Bao et al. (2019) and predict the length difference between the source and target sequences using a classifier with a range of [-20, 20]. This is accomplished by max-pooling the source representations into a single vector and running it through a linear layer followed by a softmax operation:

$$p(n_T - n_S \mid S) = \text{softmax}\big(W \cdot \text{maxpool}(H_S)\big) \quad (8)$$

where $H_S$ denotes the encoder's contextual representations and $W$ is a learned projection.
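A minimal sketch of this length classifier under our assumptions about shapes (element-wise max-pooling over the source vectors, a linear layer with one row per length difference in [-20, 20], then a softmax); all names and values are illustrative:

```python
import math

# Sketch of the target-length predictor of Eq. 8 (illustrative; names and
# shapes are our own assumptions, not the authors' implementation).

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def predict_length_diff(source_vectors, W):
    dim = len(source_vectors[0])
    # Element-wise max-pooling over source positions -> a single vector.
    pooled = [max(v[d] for v in source_vectors) for d in range(dim)]
    # Linear layer: one logit per candidate length difference.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in W]
    probs = softmax(logits)
    # Class k corresponds to a length difference of k - 20.
    return probs.index(max(probs)) - 20
```

For example, with a 41-row weight matrix whose 25th row dominates the others, the predicted length difference is 25 - 20 = 5.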
Decoder Structure The decoder also consists of identical transformer blocks. The $t$-th position of the input to the decoder is the $\lfloor t \cdot n_S / n_T \rfloor$-th source position's contextual representation copied from the encoder, which is equivalent to scanning the source inputs from left to right and leads to a deterministic decoding process given the predicted target length. Both absolute and relative positional embeddings are incorporated. For relative position information, we follow Shaw et al. (2018), which produces a different learned embedding according to the offset between the "key" and "query" in the self-attention mechanism, with a clipping distance $k$ for relative positions. For absolute positional embeddings, we follow Radford et al. (2019) and use a learnable positional embedding for each position.
Attention over Vocabulary Layer-wise attention over the vocabulary is incorporated into each decoding layer to make the model aware of which token is to be generated at each position. More concretely, we use $h_t^{(l)}$ to denote the contextual representation of position $t$ in the $l$-th decoder layer. The intermediate token attention representation $a_t^{(l)}$ of position $t$ in the $l$-th decoder layer is given by:

$$a_t^{(l)} = \text{softmax}\big(h_t^{(l)} E^\top\big) E \quad (9)$$

where $E$ is the representation matrix of the token vocabulary. By doing so, each position is able to know which token is about to be decoded at the current position. The input to the next layer is the concatenation of the contextual representation $h_t^{(l)}$ and the intermediate token representation $a_t^{(l)}$.
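The attention over the vocabulary could be sketched as follows in pure Python (variable names are our own; `E` plays the role of the vocabulary embedding matrix, and this is an illustrative sketch rather than the authors' implementation):

```python
import math

# Sketch of layer-wise attention over the vocabulary: a decoder state h
# attends over all token embeddings in E, and the weighted sum (an
# "expected token embedding") is concatenated with h for the next layer.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def vocab_attention(h, E):
    # Attention weights: softmax over dot products with every token embedding.
    weights = softmax([sum(hi * ei for hi, ei in zip(h, e)) for e in E])
    dim = len(h)
    attended = [sum(w * E[v][d] for v, w in enumerate(weights))
                for d in range(dim)]
    return h + attended  # concatenation of h and the attended representation
```

With a state close to one token's embedding, the attended vector collapses onto that embedding, which is how each position "knows" which token it is about to emit.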
Softmax For each position $t$, $p(y_t|S)$ is computed by feeding the representation for that position into a softmax function over the vocabulary.
4.3 Backward Probability
We use the non-AR model to obtain $p(S|y_t)$.
Encoder
The encoder for $p(S|y_t)$ is again a stack of identical transformer blocks. The input to the encoder is a text sequence whose length is $n_T$, identical to the length of the target. The $t$-th position of the input sequence is the word $y_t$, with the rest being a place-holding dummy token. For each position, the embedding for the absolute position and the embedding for the relative position are appended.
Decoder
The decoder for the backward probability is the same as that for the forward probability, with the only difference being that the target $T$ is changed to the source $S$.
4.4 Decoding from Mutual Information
The most commonly used decoding strategy for non-AR generation is the noisy parallel decoding (NPD for short) strategy proposed in Gu et al. (2018): a number of sequence candidates are first generated by the non-AR model, and an AR Seq2Seq model is then used to select the candidate with the largest probability under the AR model. Since this NPD strategy is designed for the MLE objective, which concerns only the forward probability, we need to tailor it to the MMI objective. Specifically, we first generate the N-best sequences based on the score of the non-AR MMI function, computed from Eq. 7. The final selected response is the sequence with the highest AR MMI score, which is computed from two AR Seq2Seq models: one modeling the forward probability and the other modeling the backward probability.
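The tailored two-stage decoding procedure above can be sketched as follows; the function names are our own and the two scoring functions are stubs standing in for the trained non-AR MMI model of Eq. 7 and the pair of AR Seq2Seq models:

```python
# Sketch of the tailored NPD pipeline (illustrative; not the authors' code).

def decode_with_mmi(source, candidates, nonar_mmi_score, ar_mmi_score, n_best=3):
    # Step 1: shortlist the N best candidates under the non-AR MMI score (Eq. 7).
    shortlist = sorted(candidates,
                       key=lambda t: nonar_mmi_score(source, t),
                       reverse=True)[:n_best]
    # Step 2: return the shortlisted candidate with the highest AR MMI score,
    # computed from a forward and a backward autoregressive Seq2Seq model.
    return max(shortlist, key=lambda t: ar_mmi_score(source, t))
```

With toy stub scores, a candidate that ranks only mid-list under the non-AR score can still win the final AR reranking, which is the point of the two-stage selection.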
5 Experiments
5.1 Datasets
We use the OpenSubtitles dataset for evaluation. It is a widely used open-domain dataset, which contains roughly 60M-70M scripted lines spoken by movie characters, and it has been used in a broad range of recent work on data-driven conversation. This dataset does not specify which character speaks each subtitle line, which prevents us from inferring speaker turns. Following (Vinyals and Le, 2015; Li et al., 2016a), we assume that each line of subtitle constitutes a full speaker turn. Although this assumption is often violated, prior work has successfully trained and evaluated neural conversation models using this corpus. In our experiments we used a preprocessed version of this dataset distributed by Li et al. (2016a).
The noisy nature of the OpenSubtitles dataset renders it unreliable for evaluation purposes. We thus follow Li et al. (2016a) and use data from the Internet Movie Script Database (IMSDB) for evaluation.
5.2 Baselines
Our baselines include the AR generation models (with and without MMI) based on transformers Vaswani et al. (2017b), with the number of encoder and decoder blocks set to 6. For the standard AR model, the beam size is set to 10 for decoding, and the sequence with the largest value of $p(T|S)$ is selected. For AR+MMI, we followed Li et al. (2016a): we first use $p(T|S)$ to generate an N-best list with beam size 10, and $p(S|T)$ is then used to rerank the N-best list. $\lambda$ is treated as a hyperparameter to be tuned on the dev set.
We also implement two variants of the AR+MMI model: (1) AR+MMI+diverse Li et al. (2016c), which uses a diverse decoding model to generate the N-best list and uses the backward probability to rerank the diverse N-best list. The diverse decoding model adds an additional term that penalizes siblings in beam search (expansions of the same parent node in the search), thus favoring hypotheses from diverse parents; and (2) AR+MMI+RL Li et al. (2017a), which incorporates a critic that estimates the future backward probability into decoding.
5.3 Training Details
All experiments were run on 64 Nvidia V100 GPUs with mini-batches of approximately 100K tokens. We use the same hyperparameters for all experiments, i.e., word representations of size 1024 and feed-forward layers with inner dimension 4096. The dropout rate is set to 0.2 and the number of attention heads is set to 16. Models are optimized with Adam Kingma and Ba (2014). Differentiable scheduled sampling Goyal et al. (2017) is used to mitigate the exposure bias issue. We train models with 16-bit floating point operations.
Table 1: Automatic evaluation results on the OpenSubtitles dataset.

Model | BLEU | distinct-1 | distinct-2 | Avg. length | Stopword% | adv succ
Human | - | 16.8% | 58.1% | 14.2 | 69.8% | -
AR | 1.64 | 3.7% | 9.5% | 6.4 | 82.3% | 2.7%
AR+MMI | 2.10 | 10.6% | 20.5% | 7.2 | 76.4% | 6.3%
AR+MMI+diverse | 2.16 | 16.0% | 27.3% | 7.5 | 72.1% | 6.4%
AR+MMI+RL | 2.34 | 13.7% | 25.2% | 7.3 | 73.0% | 8.0%
Non-AR | 1.54 | 8.9% | 14.6% | 7.1 | 77.9% | 2.4%
Non-AR+MMI | 2.68 | 15.9% | 27.0% | 7.4 | 71.9% | 9.2%
5.4 Automatic Evaluation
For automatic evaluation, we report the results of the following metrics:

- BLEU score, following previous work. It should be noted that BLEU is not generally accepted Liu et al. (2016) to match human evaluation in generation tasks, since there are many distinct ways to reply to an input.
- distinct-1 and distinct-2 Li et al. (2016a): the number of distinct unigrams and bigrams in the generated responses, scaled by the total number of generated unigrams and bigrams.
- Avg. length: the average length of the generated responses.
- Stopword%: the percentage of stopwords^5 in the responses generated by each model.
- Adversarial Success: the adversarial evaluation strategy proposed by Kannan and Vinyals (2017); Li et al. (2017b). Adversarial evaluation trains a discriminator (or evaluator) to label dialogues as machine-generated (negative) or human-generated (positive). Positive examples are taken from training dialogues, while negative examples are decoded by the generative model. Adversarial success is the percentage of generated responses that fool the evaluator into believing they are human-generated. We refer readers to Li et al. (2017b) for more details about the adversarial evaluation.
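For reference, the distinct-n metric can be computed as follows (a sketch consistent with the definition above; our own illustrative code, not the authors' evaluation script):

```python
# Sketch of the distinct-n diversity metric (Li et al., 2016a): the number
# of distinct n-grams divided by the total number of generated n-grams.

def distinct_n(responses, n):
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

responses = ["i do not know", "i do not care"]
print(distinct_n(responses, 1))  # 5 distinct unigrams / 8 total = 0.625
```

A model that repeats the same dull response across inputs thus scores low, regardless of how fluent each individual response is.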
Results are shown in Table 1. Comparing AR with AR+MMI, AR+MMI significantly outperforms AR across all metrics, which is in line with previous findings Li et al. (2016a). For the variants of AR+MMI, AR+MMI+diverse generates a more diverse N-best list for reranking and thus outperforms AR+MMI; AR+MMI+RL uses a look-ahead strategy to estimate the future backward probability and thus also outperforms AR+MMI. It is hard to tell which model performs better, AR or non-AR: AR performs better than non-AR on BLEU and adversarial success, but worse on the other metrics. This suggests that, compared with the AR model, the non-AR model tends to generate more diverse responses, but might be less coherent. Because of its ability to handle the non-global-optimality issue, non-AR+MMI consistently outperforms AR+MMI by a large margin across all evaluation metrics. Comparing non-AR+MMI with AR+MMI+diverse, non-AR+MMI has slightly lower distinct scores, but significantly higher BLEU and adversarial success scores. This is because the diverse decoding strategy in AR sacrifices language model probability for diversity, which harms the BLEU score but promotes the diversity score. Non-AR+MMI outperforms AR+MMI+RL across all metrics.
5.5 Examples
Table 2: Sample responses generated by AR+MMI and non-AR+MMI.

Input | AR+MMI | Non-AR+MMI
it feels like i must have been asleep for weeks. | i'm sorry that i can't help you. | yeah, it has been long.
where does she work? | i don't know where she works. | she has work to do at the library.
who is in charge? | i don't know sir. | i'm the captain in charge!
i am off all week next week. | what time you going? | then who will be in charge?
why can't you just believe us? | i mean, it's not like we're gonna be together forever. | because you are not a good person.
can't you see how they're exploiting you? | i'm not a human being. | it's not that simple.
i mean, we'd have to talk to him. | i mean, he's a good guy. | let's do it.
i'm sorry to detain you for so long. | i'm sorry i can't. | that's fine since i'm not leaving until tomorrow.
do you have any idea what caused the explosion? | i have no idea. | the explosion was caused by the failure of the motor.
5.6 Qualitative Evaluation
We employed crowdsourced judges to evaluate a random sample of 1000 items from the test set. Following the protocol in Baheti et al. (2018), we assigned each output to a human judge, who was asked to score every model response on a 5-point scale (Strongly Agree, Agree, Unsure, Disagree, Strongly Disagree) in 2 categories: 1) Coherence: is the response coherent with the given source? and 2) Content Richness: does the response add new information to the conversation? Ratings were later collapsed into 3 categories (Agree, Unsure, Disagree).
Table 3: Human evaluation results for coherence and content richness.

Model | Disagree (%) | Unsure (%) | Agree (%)
Coherence
Human | 17.4 | 20.8 | 61.8
AR | 28.6 | 29.5 | 41.9
AR+MMI | 25.3 | 27.9 | 46.8
AR+MMI+diverse | 24.8 | 27.8 | 47.4
AR+MMI+RL | 24.1 | 26.5 | 49.4
Non-AR | 29.9 | 28.7 | 41.4
Non-AR+MMI | 23.1 | 24.0 | 52.9
Content Richness
Human | 14.0 | 16.6 | 69.4
AR | 38.2 | 30.4 | 31.4
AR+MMI | 30.6 | 26.2 | 43.2
AR+MMI+diverse | 23.9 | 21.3 | 54.8
AR+MMI+RL | 26.4 | 24.9 | 48.7
Non-AR | 31.4 | 25.0 | 44.6
Non-AR+MMI | 24.2 | 20.5 | 55.3
The results for coherence and content richness of the different models are presented in Table 3. For dialogue coherence, the trend is that non-AR+MMI is better than AR+MMI, followed by AR and non-AR; AR is slightly better than non-AR. For content richness, the proposed non-AR+MMI is significantly better than AR+MMI, and the gap is larger than for dialogue coherence. This is because the N-best list generated by the AR model tends to be dull and generic; the reranking model in AR+MMI can alleviate but cannot fully address this issue. The output of the AR+MMI model is thus far less diverse than that of non-AR+MMI, which obtains an MMI score for each generated token.
To verify the statistical significance of the reported results, we performed a paired bootstrap test Johnson (2001); Berg-Kirkpatrick et al. (2012) to compare the difference between the percentages of responses labeled as Agree. We computed p-values for non-AR vs AR, non-AR+MMI vs AR+MMI, and non-AR+MMI vs AR+MMI+RL. Regarding non-AR vs AR, we did not find a significant difference for coherence (p-value = 0.18), but we did find a significant difference for content richness (p-value < 0.01). For non-AR+MMI vs AR+MMI, we find a significant difference for both coherence (p-value < 0.01) and content richness (p-value < 0.01). For non-AR+MMI vs AR+MMI+RL, the difference for coherence is significant (p-value < 0.01), but the difference for content richness is insignificant (p-value = 0.25).
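A minimal sketch of such a paired bootstrap test over binary Agree/not-Agree judgments might look like this (our own illustrative implementation, not the exact protocol of the cited work):

```python
import random

# Sketch of a paired bootstrap test over binary judgments (1 = the response
# was labeled Agree, 0 = otherwise). Illustrative only.

def paired_bootstrap(labels_a, labels_b, n_resamples=10000, seed=0):
    # Estimate a p-value for "system B beats system A": the fraction of
    # bootstrap resamples in which B's Agree count does NOT exceed A's.
    rng = random.Random(seed)
    n = len(labels_a)
    losses = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(labels_b[i] - labels_a[i] for i in idx)
        if diff <= 0:
            losses += 1
    return losses / n_resamples
```

Because the resampling is paired (the same items are drawn for both systems), per-item difficulty cancels out, which makes the test more sensitive than an unpaired comparison of the two percentages.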
5.7 Sample Responses
Sample responses are presented in Table 2. As can be seen, non-AR+MMI tends to generate more diverse and content-rich responses. It is also interesting to see that responses from the AR+MMI model mostly start with the word "I". This is because the N-best list from the AR model lacks diversity: the prefixes of the responses are mostly the same, and the reranking process can only affect the suffixes. On the contrary, for non-AR+MMI, MMI reranking is performed as soon as a token is generated, without waiting for the completion of the whole target sequence, leading to more diverse and appropriate responses.
5.8 Results on Machine Translation
Mutual information has been found to improve machine translation, both in the context of NMT models Li and Jurafsky (2016) and phrase-based MT models Och and Ney (2002); Shen et al. (2010). It is thus interesting to see whether the proposed model can also help non-AR NMT. We evaluate the proposed method on widely used machine translation benchmarks: WMT 2014 En-De and De-En (4.5M sentence pairs), WMT 2016 Ro-En (610K sentence pairs) and IWSLT 2014 De-En (150K sentence pairs). We use the Transformer (Vaswani et al., 2017a) as the backbone. Knowledge distillation is applied for all models.
Table 4: BLEU scores on machine translation benchmarks.

Model | WMT14 En-De | WMT14 De-En | WMT16 Ro-En
NAT (Gu et al., 2018) | 17.69 | 20.62 | 29.79
iNAT (Lee et al., 2018) | 21.54 | 25.43 | 29.32
FlowSeq-large (raw data) (Ma et al., 2019) | 20.85 | 25.40 | 29.86
NAT (our implementation) | 22.32 | 24.83 | 29.93
NAT+MMI | 23.80 (+1.48) | 26.05 (+1.22) | 30.50 (+0.57)
Since achieving SOTA non-AR MT performance is beyond the scope of this paper, we used the commonly used non-AR model described in Section 4.2 as the backbone. Results are shown in Table 4. As can be seen, the incorporation of the MMI model significantly improves MT performance. This shows that the proposed model has the potential to benefit a wide range of generation tasks.
6 Conclusion
In this paper, we propose to use non-autoregressive (non-AR) generation to address the non-global-optimality issue of MMI in neural dialog generation. Target tokens are generated independently in non-AR generation; $p(S|y_t)$ for each target word can thus be computed as soon as $y_t$ is generated, without waiting for the completion of the whole sequence. This naturally resolves the non-global-optimality issue in decoding. Experimental results demonstrate that the proposed strategy produces more diverse, coherent, and appropriate responses, yielding substantive gains in BLEU scores and in human evaluations.
Footnotes
 Qinghong and Yuxian contribute equally to this work.
 We refer readers to (Li et al., 2016a) for how Eq.3 is obtained.
 http://nlp.stanford.edu/data/OpenSubData.tar
 http://www.imsdb.com/
 The combination of stopwords in https://www.ranks.nl/stopwords and punctuations.
References
 Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot.
 Nabiha Asghar, Pascal Poupart, Xin Jiang, and Hang Li. 2016. Online sequence-to-sequence reinforcement learning for open-domain conversational agents. arXiv preprint arXiv:1612.03929.
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
 Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. 2018. Generating more interesting responses in neural conversation models with distributional constraints. arXiv preprint arXiv:1809.01215.
 Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, and Lei Li. 2019. Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677.
 Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005. Association for Computational Linguistics.
 Antoine Bosselut, Asli Celikyilmaz, Xiaodong He, Jianfeng Gao, PoSen Huang, and Yejin Choi. 2018. Discourseaware neural rewards for coherent text generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 173–184, New Orleans, Louisiana. Association for Computational Linguistics.
 Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263–311.
 Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
 Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019. Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval, 13(2-3):127–298.
 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1243–1252. JMLR.org.
 Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.
 Kartik Goyal, Chris Dyer, and Taylor BergKirkpatrick. 2017. Differentiable scheduled sampling for credit assignment. arXiv preprint arXiv:1704.06970.
 Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
 Jiatao Gu, Kyunghyun Cho, and Victor OK Li. 2017. Trainable greedy decoding for neural machine translation. arXiv preprint arXiv:1702.02429.
 Daphne Ippolito, Reno Kriz, Joao Sedoc, Maria Kustikova, and Chris Callison-Burch. 2019. Comparison of diverse decoding methods from conditional language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3752–3762, Florence, Italy. Association for Computational Linguistics.
 Roger W Johnson. 2001. An introduction to the bootstrap. Teaching Statistics, 23(2):49–54.
 Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198.
 Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
 Reno Kriz, João Sedoc, Marianna Apidianaki, Carolina Zheng, Gaurav Kumar, Eleni Miltsakaki, and Chris Callison-Burch. 2019. Complexity-weighted loss and diverse reranking for sentence simplification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3137–3147, Minneapolis, Minnesota. Association for Computational Linguistics.
 Ilia Kulikov, Alexander H. Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of search and evaluation strategies in neural dialogue modeling.
 Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Brussels, Belgium. Association for Computational Linguistics.
 Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? End-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2443–2453, Copenhagen, Denmark. Association for Computational Linguistics.
 Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proc. of NAACL-HLT.
 Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany.
 Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation.
 Jiwei Li, Will Monroe, and Dan Jurafsky. 2016c. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
 Jiwei Li, Will Monroe, and Dan Jurafsky. 2017a. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
 Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017b. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
 Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
 Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow.
 Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2016. Coherent dialogue with attention-based language models. arXiv preprint arXiv:1611.06997.
 Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970.
 Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. Transactions of the Association for Computational Linguistics, 6.
 Tong Niu and Mohit Bansal. 2020. AvgOut: A simple output-probability measure to eliminate dull responses. arXiv preprint arXiv:2001.05467.
 Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 295–302. Association for Computational Linguistics.
 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
 Iulian V Serban, Alexander G Ororbia II, Joelle Pineau, and Aaron Courville. 2016a. Multimodal variational encoder-decoders. arXiv preprint arXiv:1612.00377.
 Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016b. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).
 Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016c. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI.
 Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016d. Generative deep neural networks for dialogue: A short review.
 Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016e. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
 Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2210–2219, Copenhagen, Denmark. Association for Computational Linguistics.
 Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
 Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-dependency statistical machine translation. Computational Linguistics, 36(4):649–671.
 Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2019. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior.
 Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL-HLT.
 Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. 2019. Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems, pages 3011–3020.
 Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
 Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to make context more useful? An empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–236, Vancouver, Canada. Association for Computational Linguistics.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
 Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of ICML Deep Learning Workshop.
 Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2140–2150, Copenhagen, Denmark. Association for Computational Linguistics.
 William Yang Wang, Jiwei Li, and Xiaodong He. 2018. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 19–21.
 Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 3351–3357. AAAI Press.
 Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
 Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pretraining for conversational response generation.
 Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.