Adversarial Bootstrapping for Dialogue Model Training
Open domain neural dialogue models, despite their successes, are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties as well as their interplay with training datasets and model architectures. Toward addressing these problems, this paper proposes bootstrapping a dialogue response generator with an adversarially trained discriminator as an effective solution. The proposed method involves training a neural generator in both auto-regressive and traditional teacher-forcing modes, with the maximum likelihood loss of the auto-regressive outputs weighted by the score from a metric-based discriminator model. The discriminator input is a mixture of ground truth labels, the teacher-forcing outputs of the generator, and distractors sampled from the dataset, thereby allowing for richer feedback on the autoregressive outputs of the generator. To improve the calibration of the discriminator output, we also bootstrap the discriminator by matching the intermediate features of the ground truth and the generator’s autoregressive output. We explore different sampling and adversarial policy optimization strategies during training in order to understand how to encourage response diversity without sacrificing relevance. Our experiments show that adversarial bootstrapping is effective at addressing exposure bias, leading to improvement in response relevance and coherence. The improvement is demonstrated by state-of-the-art results on the Movie and Ubuntu dialogue datasets with respect to human evaluations and BLEU, ROUGE, and DISTINCT scores.
End-to-end neural dialogue models have demonstrated the ability to generate reasonable responses to human interlocutors. However, a significant gap remains between these state-of-the-art dialogue models and human-level discourse. The fundamental problem with neural dialogue modeling is exemplified by their generic responses, such as I don’t know, I’m not sure, or how are you, when conditioned on broad ranges of dialogue contexts. In addition to the limited contextual information in single-turn Seq2Seq models [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Li et al.2016a], which has motivated the need for hierarchical recurrent encoder decoder (HRED) multi-turn models [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Xing et al.2017, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019], previous work points to three underlying reasons as to why neural models fail at dialogue response generation.
i) Exposure Bias: Similar to language and machine translation models, traditional conversation models are trained with the model input taken from the ground truth rather than a previous output (a method known as teacher forcing [Williams and Zipser1989]). During inference, however, the model conditions on its own past outputs, i.e., it runs autoregressively. Interestingly, training with teacher forcing does not present a significant problem in the machine translation setting since the conditional distribution of the target given the source is well constrained. On the other hand, this is problematic in the dialogue setting since the learning task is unconstrained [Lowe et al.2015]. In particular, there are several suitable target responses per dialogue context and vice versa. This discrepancy between training and inference is known as exposure bias [Williams and Zipser1989, Lamb et al.2016] and significantly limits the informativeness of the responses, as the decoding error compounds rapidly during inference. Training methods that incorporate autoregressive sampling into model training have been explored to address this [Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Zhang et al.2018b].
ii) Training data: The inherent problem with dialogue training data, although identified, has not been particularly addressed in the literature [Sharath, Tandon, and Bauer2017]. Human conversations contain a large number of generic, uninformative responses with little or no semantic meaning, giving rise to a classic class-imbalance problem. This problem also exists at the word and turn level; human dialogue [Banchs2012, Serban et al.2017b] contains non-uniform sequence entropy that is concave with respect to the token position, with the tokens at the beginning and end of a sequence having lower entropy than those in the middle (see Fig. 1). This initial positive entropy gradient can create learning barriers for recurrent models, and is a primary contributing factor to their short, generic outputs. In [Shao et al.2017], the use of glimpse-based decoding is seemingly able to circumvent this problem by breaking this data-induced pattern, but at the expense of both training and inference time.
iii) Training Objective: Most existing dialogue models are trained using the maximum likelihood estimation (MLE) [Sutskever, Vinyals, and Le2014, Vinyals and Le2015, Serban et al.2016, Xing et al.2017] with teacher forcing because autoregressive sampling leads to unstable training. Unfortunately, the use of MLE is incongruent with the redundant nature of dialogue datasets, exacerbates the exposure bias problem in dialogue datasets, and is the primary factor leading to uninteresting and generic responses. Alternative training frameworks that complement MLE with other constraints such as generative adversarial networks, reinforcement learning, and variational auto-encoders that specifically encourage diversity have been explored to overcome the limitations of the MLE objective alone [Li et al.2016a, Li et al.2016b, Li et al.2017, Yu et al.2017, Che et al.2017, Zhang et al.2017, Xu et al.2017, Serban et al.2017b, Zhang et al.2018b, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019].
In this paper, we propose an adversarial bootstrapping framework for training dialogue models. This framework tackles the class imbalance caused by the redundancy in dialogue training data, and addresses the problem of exposure bias in dialogue models. Bootstrapping has been proposed in the literature as a way to handle data with noisy, subjective, and incomplete labels by combining cross-entropy losses from both the ground truth (i.e., teacher forcing) and model outputs (i.e., autoregression) [Reed et al.2015, Grandvalet and Bengio2005, Grandvalet and Bengio2006]. Here, we first extend its use to dialogue model training to encourage the generation of high variance response sequences for a given ground truth target [Reed et al.2015]. This should reduce the tendency of dialogue models to reproduce the generic and uninteresting target responses present in the training data. This is achieved by training a discriminator adversarially and using the feedback from the discriminator to weight the cross-entropy loss from the model-predicted target. The gradient from the feedback provided by the discriminator encourages the dialogue model to generate a wide range of structured outputs. Second, we bootstrap the discriminator to improve the calibration of its output. We use the similarity between the representations of the generator’s autoregressive output and the ground truth from an intermediate layer of the discriminator as an additional target for the discriminator. This further improves the diversity of the generator’s output without sacrificing relevance. We apply adversarial bootstrapping to multi-turn dialogue models. Architecture-wise, we employ an HRED generator and an HRED discriminator, depicted in Figure 2, with a shared hierarchical recurrent encoder.
In our experiments, the proposed adversarial bootstrapping demonstrates state-of-the-art performance on the Movie and Ubuntu datasets as measured by both automatic metrics (BLEU, ROUGE, and DISTINCT scores) and human evaluations.
2 Related Work
The literature on dialogue modeling, even in the multi-turn scenario, is vast (see [Serban et al.2016, Xing et al.2017, Serban et al.2017b, Serban et al.2017a, Olabiyi et al.2018, Olabiyi, Khazan, and Mueller2018, Olabiyi et al.2019, Li et al.2016b]), and so in this section we focus on key relevant previous papers. The proposed adversarial bootstrapping is closely related to the use of reinforcement learning for dialogue response generation with an adversarially trained discriminator serving as a reward function [Li et al.2017]. First, we employ a different discriminator training strategy from [Li et al.2017]. The negative samples for our discriminator consist of the generator’s deterministic teacher-forcing output and distractors sampled from the training set. This makes the discriminator’s task more challenging and improves the quality of the feedback to the generator by discouraging the generation of high-frequency generic responses. Also, while [Li et al.2017] sample over all possible outputs of the generator, we take samples from the generator’s top_k outputs or the MAP output with Gaussian noise as additional inputs. This allows our model to explore mostly plausible trajectories during training, in contrast to [Li et al.2017], where the discriminator mostly scores the generated samples very low. The top_k sampling strategy also mitigates the gradient variance problem of the traditional policy optimization employed by [Li et al.2017]. Finally, we bootstrap our discriminator with the similarity between the intermediate representations of the generator’s autoregressive output and the ground truth to improve the calibration of the discriminator output.
3 Adversarial Bootstrapping
Let x_i denote the context or conversation history up to turn i and let y_i denote the associated target response. Provided input-target samples (x_i, y_i), we aim to learn a generative model that scores representative hypotheses given arbitrary dialogue contexts such that responses that are indistinguishable from informative and diverse target responses are favored with high scores and otherwise given low scores. Notationally, we write the collection of possible responses at turn i as the set Y_i containing elements y_i^j, where T_i^j is the length of the j-th candidate response and y_{i,t}^j is the t-th word of that response.
3.1 Generator Bootstrapping
To achieve the goal outlined above, we propose an Adversarial Bootstrapping (AB) approach to training multi-turn dialogue models such as the one depicted in Fig. 2. The adversarial bootstrapping for the generator can be expressed according to the objective
where t(y) is the target variable that controls the generator training. Indeed, hard bootstrapping [Reed et al.2015] is one such special case of (1) wherein t(y) = β for the ground truth response, t(y) = 1 − β for the model’s predicted output, and t(y) = 0 otherwise, where β is a hyperparameter. Similarly, MLE is another special case in which t(y) = 1 for the ground truth and t(y) = 0 otherwise. It is reasonable to assume from these formulations that bootstrapping will outperform MLE since it does not assume all negative outputs are equally wrong.
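As a concrete illustration, the target assignments for hard bootstrapping, and MLE as its limiting case, can be sketched as follows (the function name and the β symbol are illustrative; the actual objective weights the per-candidate cross-entropy terms with these targets):

```python
def hard_bootstrap_target(candidate, ground_truth, model_output, beta=0.8):
    """Illustrative target t(y) for hard bootstrapping [Reed et al. 2015]:
    the ground truth keeps weight beta, the model's own predicted output
    gets 1 - beta, and every other candidate gets 0. Setting beta = 1
    recovers plain MLE, which credits only the ground truth."""
    if candidate == ground_truth:
        return beta
    if candidate == model_output:
        return 1.0 - beta
    return 0.0
```

Under this sketch, MLE is simply `hard_bootstrap_target(..., beta=1.0)`.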
Interestingly, [Li et al.2017] make use of the MLE setting but additionally rely on the sampling stochasticity to obtain non-zero credit assignment information from the discriminator for the generator policy updates. To avoid this inconsistency, we instead modify the generator target to
where α is a hyperparameter and the bootstrapped target is obtained from a neural network discriminator with parameters θ_D. The first two assignments in (2) are also used in training the discriminator, in addition to the human-generated distractors from the dataset. In detail, we make use of the term
within the context of the objective function. Namely, the discriminator objective is the cross-entropy between the output and the target of the discriminator given by
The inclusion of human-generated negative samples encourages the discriminator to assign low scores to high frequency, generic target responses in the dataset, thereby discouraging the generator from reproducing them.
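A minimal sketch of how the discriminator’s training pairs could be assembled (the helper name is hypothetical; the ground truth is the positive example, while the generator’s teacher-forcing output and the sampled distractors are negatives):

```python
def discriminator_pairs(ground_truth, teacher_forcing_output, distractors):
    """Build (response, target) pairs for one dialogue context:
    1 for the human response, 0 for the generator's teacher-forcing
    output and for human-generated distractors drawn from the dataset."""
    pairs = [(ground_truth, 1.0), (teacher_forcing_output, 0.0)]
    pairs.extend((d, 0.0) for d in distractors)
    return pairs
```

Because the distractors are themselves fluent human responses, the discriminator cannot rely on surface fluency alone and must learn context-response relevance.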
3.2 Discriminator Bootstrapping
In addition to bootstrapping the generator with the discriminator in Section 3.1, we can also bootstrap the discriminator using a similarity measure between the latent representations of the sampled generator outputs and of the ground truth, as encoded by the discriminator, i.e.,
In our experiments, we chose cosine similarity as the metric, computed on the output of the discriminator before the logit projection. This helps to better calibrate the discriminator’s judgment of the generator’s outputs.
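As a sketch, this extra discriminator target could be computed as below (pure-Python cosine similarity over illustrative feature vectors taken from the layer before the logit projection):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def discriminator_bootstrap_target(sampled_features, ground_truth_features):
    """The similarity between the discriminator's intermediate features
    of the sampled output and of the ground truth serves as an
    additional soft target for scoring the sampled output."""
    return cosine_similarity(sampled_features, ground_truth_features)
```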
3.3 Sampling Strategy
To backpropagate the learning signal for the case where the target is sampled autoregressively, we explore both stochastic and deterministic policy gradient methods. For stochastic policies, we approximate the gradient of the objective w.r.t. the generator parameters by Monte Carlo samples using the REINFORCE policy gradient method [Li et al.2017, Glynn1990, Williams1992], i.e.,
where the sampling noise is the source of randomness. We denote models trained with Eq. 7 by their sampling strategy. To reduce the variance of Eq. 6, we propose a novel approach of sampling from the top_k generator outputs using (i) a categorical distribution based on the output logits, similar to the treatment in [Radford et al.2019], and (ii) a uniform distribution, where top_k is a hyperparameter. This is especially useful for dialogue modeling with large vocabulary sizes.
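The two stochastic sampling strategies can be sketched as follows (a simplified, list-based version; the real model samples token-by-token from RNN logits):

```python
import math
import random

def top_k_sample(logits, k, mode="cat", rng=None):
    """Sample a token index restricted to the k highest-scoring logits:
    'cat' renormalizes a categorical (softmax) distribution over the
    top k, while 'uni' samples uniformly among them."""
    rng = rng or random.Random()
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if mode == "uni":
        return rng.choice(top)
    m = max(logits[i] for i in top)                  # subtract max for stability
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

Restricting sampling to the top k candidates keeps the policy exploring only plausible continuations, which is the variance-reduction effect described above.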
Referring to the network architecture in Fig. 2, the generator and discriminator share the same encoder. The encoder uses two RNNs to handle multi-turn representations similar to the approach in [Serban et al.2016]. First, during turn , a bidirectional encoder RNN, , with an initial state of maps the conversation context comprising the sequence of input symbols , where is the sequence length, into a sequence of hidden state vectors according to
where is the embedding lookup and is the embedding matrix with dimension, and vocabulary size, . The vector representation of the input sequence is the pooling over the encoded sequence, [Serban et al.2016]. In addition, we use the output sequence as an attention memory to the generator as depicted in Fig. 2. This is done to improve the relevance of the generated response.
To capture the dialogue history across turns, we use a unidirectional context RNN to combine the past dialogue context with the pooling of the encoded sequence as
Note that we decided not to allow attention at turn-level, as was done in [Xing et al.2017], since the number of dialogue turns is unpredictable. Also, the use of a single vector representation helps to simplify both training and inference procedures. We however think that converting the turn-level sequential memory to a random access memory [Graves, Wayne, and Danihelka2014, Graves et al.2016] should lead to further system performance improvements without much impact on inference time. We hope to explore this direction in the future.
The generator, denoted , is a unidirectional decoder RNN with an attention mechanism [Bahdanau, Cho, and Bengio2015, Luong et al.2015]. Similar to [Serban et al.2016], the decoder RNN is initialized with the last state of the context RNN. The generator outputs a hidden state representation, for each previous token according to
where is the attention over the encoded sequence, . When the generator is run in teacher-forcing mode, as is typically used during training, the previous token from the ground truth is used, i.e., , whereas during inference (autoregressive mode), we use the generator’s previous decoded output, i.e., .
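The distinction between the two decoding modes can be sketched as follows (`step_fn` stands in for one decoder step and is purely illustrative):

```python
def decode(step_fn, ground_truth, start_token, teacher_forcing):
    """Run the decoder for len(ground_truth) steps. In teacher-forcing
    mode the previous token fed to each step comes from the ground
    truth; in autoregressive mode it is the decoder's own previous
    output, which is what the model sees at inference time."""
    outputs, prev = [], start_token
    for t in range(len(ground_truth)):
        nxt = step_fn(prev)
        outputs.append(nxt)
        prev = ground_truth[t] if teacher_forcing else nxt
    return outputs
```

The gap between the two modes, errors compounding only in the autoregressive branch, is exactly the exposure bias that adversarial bootstrapping targets.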
The decoder hidden state, is mapped to a probability distribution typically through a logistic layer, , yielding,
where is a hyperparameter and is the logit bias. Note that the output projection matrix is the same as the embedding matrix, similar to [Vaswani et al.2017]. The generative model can then be derived as:
The discriminator is a binary classifier that takes as input a response sequence and a dialogue context, and is trained with the output labels provided in Eq. 3. The discriminator, as shown in Figure 2, is an RNN that shares the hierarchical encoder and the word embeddings with the generator, with its initial state being the final state of the context RNN. The last layer of the discriminator RNN is fed to a logistic layer and a sigmoid function to produce a normalized (action-value function) value of the pair of dialogue context (state) and response (action). This definition is more general than in [Li et al.2017], as the target is not just a binary classification of the dialogue context and response pair as either human-generated or machine-generated. That is, the target also includes the distractor classification and the output of a similarity function. This formulation allows us to view the discriminator as a metric learner [Kulis2013], as opposed to the generator, which is a maximum likelihood learner. Therefore, adversarial bootstrapping can be viewed as joint metric and maximum likelihood learning.
We explore two options for estimating the value, i.e., at the word or utterance level. At the utterance level, we use Eq. 4 in conjunction with a unidirectional discriminator RNN. The value is calculated using the last output of the RNN, i.e.,
where a logit projection and bias are applied to the last output. At the word level, the discriminator RNN (we use a bidirectional RNN in our implementation) produces a word-level evaluation. The normalized value and the adversarial bootstrapping objective function are then respectively given by
We trained both the generator and discriminator simultaneously, with two samples for the generator and three for the discriminator. In all our experiments, we use the generator’s teacher-forcing outputs to train the discriminator (i.e., the corresponding cases of Eqs. 2 and 3). The encoder parameters are included with the generator, i.e., we do not update the encoder during discriminator updates. Each RNN is a 3-layer GRU with a hidden state size of 512. The word embedding size is the same as the hidden state size, and the vocabulary size is 50,000. We trained with a single top_k value, and selected the optimum top_k for inference (between 1 and 20) by searching on the validation set using the BLEU score. The initial learning rate is decayed whenever the generator loss has increased over two iterations. We use a batch size of 64 and clip gradients. All parameters are initialized with Xavier uniform random initialization [Glorot and Bengio2010]. Due to the large vocabulary size, we use a sampled softmax loss [Jean et al.2015] for the generator to limit the GPU memory requirement and expedite training; however, we use the full softmax for evaluation. The model is trained end-to-end using stochastic gradient descent.
We evaluated the proposed adversarial bootstrapping (aBoots) with both generator and discriminator bootstrapping, on the Movie Triples and Ubuntu Dialogue corpora randomly split into training, validation, and test sets, using 90%, 5%, and 5% proportions. We performed minimal preprocessing of the datasets by replacing all words except the top 50,000 most frequent words by an UNK symbol. The Movie dataset [Serban et al.2016] spans a wide range of topics with few spelling mistakes and contains about 240,000 dialogue triples which makes it suitable for studying the relevance vs. diversity tradeoff in multi-turn conversations. The Ubuntu dataset, extracted from the Ubuntu Relay Chat Channel, [Serban et al.2017b], contains about 1.85 million conversations with an average of 5 utterances per conversation. This dataset is ideal to train dialogue models that can provide expert knowledge/recommendation in domain-specific conversations.
We explore different variants of aBoots along the choice of discrimination, either word- (_w) or utterance-level (_u), and sampling strategy, either uniform (_uni), categorical (_cat), or with Gaussian noise (_gau). We compare their performance with existing state-of-the-art dialogue models including (V)HRED [Serban et al.2016, Serban et al.2017b] (implementation obtained from https://github.com/julianser/hed-dlg-truncated) and DAIM [Zhang et al.2018b] (implementation obtained from https://github.com/dreasysnail/converse_GAN). For completeness, we also include results from a transformer-based Seq2Seq model [Vaswani et al.2017].
We compare the performance of the models based on the informativeness (a combination of relevance and diversity metrics) of the generated responses. For relevance, we adopt the BLEU-2 [Papineni et al.2002] and ROUGE-2 [Lin2014] scores. For diversity, we adopt distinct unigram (DIST-1) and bigram (DIST-2) scores [Li et al.2016a] as well as the normalized average sequence length (NASL) score [Olabiyi et al.2018].
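For reference, the DIST-n diversity metric counts distinct n-grams relative to the total number of generated n-grams; a minimal sketch over tokenized responses:

```python
def distinct_n(responses, n):
    """DIST-n: the number of distinct n-grams divided by the total
    number of n-grams across all generated responses. Higher values
    indicate less repetitive, more diverse output."""
    total, seen = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        seen.update(ngrams)
    return len(seen) / total if total else 0.0
```

A model that repeats "i do not know" for every context scores near zero, which is why DIST-1/DIST-2 complement relevance-oriented metrics like BLEU and ROUGE.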
For human evaluation, we follow a setup similar to [Li et al.2016a], employing crowd-sourced judges to evaluate a random selection of 200 samples. We presented both the multi-turn context and the generated responses from the models to 3 judges and asked them to rank the response quality in terms of informativeness; ties are not allowed. The informativeness measure captures temporal appropriateness, i.e., the degree to which the generated response is temporally and semantically appropriate for the dialogue context, as well as other factors such as response length and repetition. For analysis, we pair the models and compute the average number of times each model is ranked higher than the other.
Table 2: Pairwise human preference scores.
|aBoots_w_cat – DAIM||0.957 – 0.043||0.960 – 0.040|
|aBoots_w_cat – HRED||0.645 – 0.355||0.770 – 0.230|
|aBoots_w_cat – VHRED||0.610 – 0.390||0.746 – 0.254|
|aBoots_w_cat – hredGAN_w||0.550 – 0.450||0.556 – 0.444|
6 Results and Discussion
6.1 Quantitative Evaluation
The quantitative measures reported in Table 1 show that adversarial bootstrapping gives the best overall relevance and diversity performance in comparison to the other models, (V)HRED, hredGAN, DAIM, and the Transformer, on both the Movie and Ubuntu datasets. We believe the combination of improved discriminator training and the policy-based objective is responsible for the observed performance improvement. On the other hand, (V)HRED and hredGAN suffer a performance loss due to exposure bias, since autoregressive sampling is not included in their training. Although DAIM uses autoregressive sampling, its poor performance shows the limitation of a single-turn architecture and GAN objective compared to the multi-turn architecture and policy-based objective in aBoots. The transformer Seq2Seq model, which performs better than RNNs on the machine translation task, also suffers from exposure bias and overfits very quickly to the low-entropy regions of the data, leading to poor inference performance. Also, the results from the aBoots models indicate that word-level discrimination performs better than utterance-level discrimination, consistent with the result reported in [Olabiyi et al.2018] for the hredGAN model. While it is difficult to identify why some models generate very long responses, we observe that models with Gaussian noise inputs (e.g., hredGAN and aBoots_gau) may be using the latent Gaussian distribution to better encode response length information; indeed, this is an area of ongoing work. Among the variants of aBoots, we observe that models trained with a stochastic policy (aBoots_cat and aBoots_uni) outperform those trained with a deterministic policy (aBoots_gau). Notably, we find that for the stochastic policy there is a tradeoff between relevance and diversity under top_k categorical versus uniform sampling: categorical sampling tends to perform better on relevance but worse on diversity.
We believe this is because top_k categorical sampling causes the generator to exploit high-likelihood outputs (i.e., those more likely to be encountered during inference) more than uniform sampling over the top candidates does, while still allowing the policy to explore. This, however, comes with some loss of diversity, although not a significant one. Overall, the automatic evaluation indicates that adversarial bootstrapping trained with a stochastic policy using the top_k categorical sampling strategy gives the best performance.
6.2 Qualitative Evaluation
As part of our evaluation we also consider scores from human judges. Specifically, we had each evaluator compare responses from five models: aBoots_w_cat, hredGAN_w, HRED, VHRED, and DAIM. The pairwise human preferences are reported in Table 2. These data indicate a significant preference for responses generated by aBoots_w_cat as compared to both (V)HRED and DAIM. We observe that aBoots_w_cat is preferred over hredGAN_w on average, although not by a significant margin. We note that this score was computed from only 200 evaluation samples, which is likely too small to demonstrate a strong preference. It is also worth noting that the hredGAN_w model represents a strong baseline, based on previous human evaluations [Olabiyi et al.2018], against which to compare our adversarially trained models. It is interesting that although automatic evaluation scores hredGAN_w much lower than aBoots_w_cat on relevance, the long responses from hredGAN_w, which indicate strong diversity, have a considerable impact on how human evaluators judge the informativeness of the responses it generates. Table 4 shows example responses from the models.
|Context_0||we ’ re grateful , person . it ’ s not that . those craftsman tools . they ’ re a dream come true .|
|Context_1||your dad and i were raised with a certain work ethic , that ’ s all .|
|VHRED||person , mr . person . person , i ’ m not going to tell you , person . i ’ m sorry . i ’ m sorry , person . i ’ m sorry|
|hredGAN_w||person , i know you ’ re see . continued_utterance person , please .|
|aBoots_w_cat||person , i ’ m a little worried about my future . i ’ m not here to marry you .|
|Context_0||where would they be going ?|
|Context_1||we ’ ve got two choices . one of them is wrong .|
|HRED||i ’ m sorry , person .|
|VHRED||i know .|
|hredGAN_w||person , i know , it ’ s a real hair . with you a little little little , little little little little somethin|
|aBoots_w_cat||i don ’ t care where they are , i don ’ t know anyone will be outside .|
|Context_0||DJones: update manager won’t detect 12.04.1 as a new version if you are already running 12.04, because 12.04.1 = 12.04 + lots of package updates|
|Context_1||DJones I am running 10.04.|
|Context_2||what does ’gksu update-manager c’ report?|
|HRED||I don’t know what the problem is|
|VHRED||you mean ?|
|hredGAN_w||yes , I get the new one , I just want to get the new one|
|aBoots_w_cat||it says that command not found|
|Context_0||libgnome2-dev is the package that has that pkgconfig file|
|Context_1||After I installed libgnome2-dev I got a different output. http://pastebin.com/zpDeGs81 sorry it is so long.|
|Context_2||I thought that was what was going on thanks. :) I ran the command ./configure && sudo make && sudo make install would that have worked.|
|HRED||I don’t know what the problem is|
|VHRED||what do you mean ? __eou__ what are you trying ?|
|hredGAN_w||I have no idea , I can just copy the files and copy files to the other one|
|aBoots_w_cat||yes I did . I just did sudo apt-get update and it worked|
6.3 Ablation Studies
In this section, we examine the effect of partial bootstrapping on model performance. Here, the target in Eq. (5) is excluded from the discriminator. The automatic evaluation results for all variants of aBoots are reported in Table 3. The table shows that generator models bootstrapped by a discriminator that is not itself bootstrapped generally perform worse than those with a bootstrapped discriminator. The improvement is particularly evident in the best-performing variant, aBoots_w_cat. We attribute this performance improvement to the better calibration of the discriminator obtained by bootstrapping the discriminator output with the similarity measure between the generator’s autoregressive output and the ground truth during training.
7 Conclusion
In this paper, we have proposed a novel training technique, adversarial bootstrapping, which is useful for dialogue modeling. The proposed method addresses the issues of data-induced redundancy and exposure bias in dialogue models trained with maximum likelihood. This is achieved by bootstrapping the teacher-forcing MLE objective with feedback on autoregressive outputs from an adversarially trained discriminator. This feedback discourages the generator from producing the bland and generic responses that are characteristic of MLE training. Experimental results indicate that a doubly bootstrapped system performs better than a system where only the generator is bootstrapped. Also, the model variant characterized by top_k categorical sampling, stochastic policy optimization, and word-level discrimination gives the best performance. The results demonstrate that the proposed method leads to models that generate more relevant and diverse responses in comparison to existing methods.
- [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference of Learning Representation (ICLR 2015).
- [Banchs2012] Banchs, R. E. 2012. Movie-dic: A movie dialogue corpus for research and development. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 203–207.
- [Che et al.2017] Che, T.; Li, Y.; Zhang, R.; Hjelm, R. D.; Li, W.; Song, Y.; and Bengio, Y. 2017. Maximum-likelihood augmented discrete generative adversarial networks. In arXiv preprint arXiv:1702.07983.
- [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics.
- [Glynn1990] Glynn, P. W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33(10):75–84.
- [Grandvalet and Bengio2005] Grandvalet, Y., and Bengio, Y. 2005. Semi-supervised learning by entropy minimization. In NIPS, 529–536.
- [Grandvalet and Bengio2006] Grandvalet, Y., and Bengio, Y. 2006. Entropy regularization. In Semi-Supervised Learning. MIT Press. doi:10.7551/mitpress/9780262033589.003.0009.
- [Graves et al.2016] Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; Badia, A. P.; Hermann, K. M.; Zwols, Y.; Ostrovski, G.; Cain, A.; King, H.; Summerfield, C.; Blunsom, P.; Kavukcuoglu, K.; and Hassabis, D. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538:471–476.
- [Graves, Wayne, and Danihelka2014] Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural turing machines. In arXiv preprint arXiv:1410.5401, 2014.
- [Jean et al.2015] Jean, S.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. On using very large target vocabulary for neural machine translation. In arXiv preprint arXiv:1412.2007.
- [Kulis2013] Kulis, B. 2013. Metric learning: A survey. Foundations and Trends in Machine Learning 5(4):287–364.
- [Lamb et al.2016] Lamb, A.; Goyal, A.; Zhang, Y.; Zhang, S.; Courville, A.; and Bengio, Y. 2016. Professor forcing: A new algorithm for training recurrent networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2016).
- [Li et al.2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT.
- [Li et al.2016b] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016b. Deep reinforcement learning for dialogue generation. In arXiv preprint arXiv:1606.01541v4.
- [Li et al.2017] Li, J.; Monroe, W.; Shi, T.; Ritter, A.; and Jurafsky, D. 2017. Adversarial learning for neural dialogue generation. In arXiv preprint arXiv:1701.06547.
- [Lin2014] Lin, C. Y. 2014. Rouge: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out.
- [Lowe et al.2015] Lowe, R.; Pow, N.; Serban, I.; and Pineau, J. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
- [Luong et al.2015] Luong, M. T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
- [Nakamura et al.2019] Nakamura, R.; Sudoh, K.; Yoshino, K.; and Nakamura, S. 2019. Another diversity-promoting objective function for neural dialogue generation. In AAAI Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL).
- [Olabiyi et al.2018] Olabiyi, O.; Salimov, A.; Khazane, A.; and Mueller, E. 2018. Multi-turn dialogue response generation in an adversarial learning framework. In arXiv preprint arXiv:1805.11752.
- [Olabiyi et al.2019] Olabiyi, O.; Khazane, A.; Salimov, A.; and Mueller, E. 2019. An adversarial learning framework for a persona-based multi-turn dialogue model. In NAACL NeuralGen Workshop.
- [Olabiyi, Khazan, and Mueller2018] Olabiyi, O.; Khazane, A.; and Mueller, E. 2018. An adversarial learning framework for a persona-based multi-turn dialogue model. In 17th IEEE International Conference on Machine Learning and Applications (ICMLA).
- [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
- [Radford et al.2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report. https://d4mucfpksywv.cloudfront.net/better-language-models.
- [Reed et al.2015] Reed, S.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; and Rabinovich, A. 2015. Training deep neural networks on noisy labels with bootstrapping. In ICLR.
- [Serban et al.2016] Serban, I.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of The Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), 3776–3784.
- [Serban et al.2017a] Serban, I. V.; Klinger, T.; Tesauro, G.; Talamadupula, K.; Zhou, B.; Bengio, Y.; and Courville, A. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Proceedings of The Thirty-first AAAI Conference on Artificial Intelligence (AAAI 2017).
- [Serban et al.2017b] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A.; and Bengio, Y. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogue. In Proceedings of The Thirty-first AAAI Conference on Artificial Intelligence (AAAI 2017).
- [Shao et al.2017] Shao, L.; Gouws, S.; Britz, D.; Goldie, A.; Strope, B.; and Kurzweil, R. 2017. Generating long and diverse responses with neural conversational models. In Proceedings of the International Conference on Learning Representations (ICLR).
- [Sharath, Tandon, and Bauer2017] Sharath, T.; Tandon, S.; and Bauer, R. 2017. A dual encoder sequence to sequence model for open-domain dialogue modeling. In arXiv preprint arXiv:1710.10520.
- [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. 2014. Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 3104–3112.
- [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
- [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. In Proceedings of ICML Deep Learning Workshop.
- [Williams and Zipser1989] Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270–280.
- [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4):229–256.
- [Xing et al.2017] Xing, C.; Wu, W.; Wu, Y.; Zhou, M.; Huang, Y.; and Ma, W. 2017. Hierarchical recurrent attention network for response generation. In arXiv preprint arXiv:1701.07149.
- [Xu et al.2017] Xu, Z.; Liu, B.; Wang, B.; Chengjie, S.; Wang, X.; Wang, Z.; and Qi, C. 2017. Neural response generation via gan with an approximate embedding layer. In EMNLP.
- [Yu et al.2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of The Thirty-first AAAI Conference on Artificial Intelligence (AAAI 2017).
- [Zhang et al.2017] Zhang, Y.; Gan, Z.; Fan, K.; Chen, Z.; Henao, R.; Shen, D.; and Carin, L. 2017. Adversarial feature matching for text generation. In arXiv preprint arXiv:1706.03850.
- [Zhang et al.2018a] Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; and Weston, J. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? In arXiv preprint arXiv:1801.07243v3.
- [Zhang et al.2018b] Zhang, Y.; Galley, M.; Gao, J.; Gan, Z.; Li, X.; Brockett, C.; and Dolan, B. 2018b. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS.