Low-Resource Knowledge-Grounded Dialogue Generation



Responding with knowledge has been recognized as an important capability for an intelligent conversational agent. Yet knowledge-grounded dialogues, as training data for learning such a response generation model, are difficult to obtain. Motivated by this practical challenge, we consider knowledge-grounded dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a disentangled response decoder in order to isolate the parameters that depend on knowledge-grounded dialogues from the rest of the generation model. By this means, the major part of the model can be learned from a large number of ungrounded dialogues and unstructured documents, while the remaining small set of parameters can be well fitted using the limited training examples. Evaluation results on two benchmarks indicate that with only a fraction of the training data, our model achieves state-of-the-art performance and generalizes well on out-of-domain knowledge.



1 Introduction

Open domain dialogue systems, owing to applications in social chatbots such as Microsoft XiaoIce (Shum et al., 2018) and virtual assistants such as Amazon Alexa (Ram et al., 2018), have drawn increasing attention from the natural language processing and artificial intelligence research communities. Thanks to advances in neural sequence modeling (Vaswani et al., 2017; Sutskever et al., 2014) and machine learning techniques (Li et al., 2017, 2016), such systems are now able to reply with plausible responses with regard to the conversation history, and thus allow an agent to have a natural conversation with humans. On the other hand, when people attempt to dive into a specific topic, they may clearly realize the gap between a conversation with a state-of-the-art system and a conversation with humans, as the system can only awkwardly keep up with the conversation, owing to its lack of knowledge of the subject.

We consider grounding open domain dialogue generation with knowledge in the form of unstructured documents. While documents are abundant on the Web, it is difficult to obtain large-scale dialogues that are naturally grounded on documents for learning a neural generation model. To overcome this challenge, some recent work (Zhou et al., 2018b; Dinan et al., 2019) resorts to crowd-sourcing and builds benchmarks sourced from Wikipedia. On the one hand, these datasets pave the way for recent research on knowledge-grounded response generation/selection (Zhao et al., 2019; Lian et al., 2019; Li et al., 2019); on the other hand, we argue that there is still a long way to go before the existing models can be applied in real scenarios, since (1) the models, especially those that achieve state-of-the-art performance via sophisticated neural architectures, simply overfit to the small training data (e.g., 18k dialogues). One piece of evidence is that when they are applied to documents out of the domain of the training data, their performance drops dramatically, as will be seen in our experiments; and (2) it is difficult to collect enough training data for a new domain or a new language, as human effort is expensive.

As a step towards applying knowledge-grounded dialogue generation in real-world systems, we explore how to learn a model with as few knowledge-grounded dialogues as possible, such that the model still achieves state-of-the-art performance and generalizes well on out-of-domain documents. The key idea is to make the parameters that rely on knowledge-grounded dialogues small and independent by disentangling the response decoder, so that we can learn the major part of the generation model from ungrounded dialogues and plain text, which are much easier to acquire. Specifically, the encoder of the generation model consists of two independent components, one for encoding the context and the other for representing the knowledge. The decoder is decomposed into conditionally independent components including a language model, a context processor, and a knowledge processor, and the three components are coordinated by a decoding manager that dynamically determines which component is activated for response prediction. The language model predicts the next word of a response based on the prior sub-sequence, and the context processor ensures coherence of the dialogue by attending over the conversation history. Both components, along with the context encoder, are independent of the extra knowledge, and thus can be pre-trained using the ungrounded dialogues. The knowledge encoder has nothing to do with dialogues, and thus can be pre-trained with the plain text. The knowledge processor is responsible for grounding response generation on the document. This part, together with the decoding manager, depends on the knowledge-grounded dialogues, but the parameters are small in size, and estimating them requires only a few training examples from the specific domain or task. By fixing the pre-trained parameters, we can adapt the model to a new domain at only a little cost.

We pre-train the language model, the context processor, and the context encoder with a clean version of Reddit data (Dziri et al., 2018), pre-train the knowledge encoder using a Wikipedia dump available on ParlAI, and compare our model with baselines that hold state-of-the-art performance on two benchmarks: the Wizard of Wikipedia (Wizard) (Dinan et al., 2019) and CMU Document Grounded Conversations (CMUDoG) (Zhou et al., 2018b). Evaluation results indicate that (1) to achieve state-of-the-art performance, our model only needs a small fraction of the training data; (2) on Wizard, the model significantly outperforms the baseline models on out-of-domain documents even though the baselines leverage all the training data while our model is learned with only a fraction of it; and (3) the model performs comparably well on in-domain and out-of-domain documents in a low-resource setting.

Contributions in this work are three-fold: (1) exploration of knowledge-grounded dialogue generation under a low-resource setting; (2) proposal of pre-training the knowledge-grounded dialogue generation model with a disentangled decoder using ungrounded dialogues and documents; and (3) empirical verification of the effectiveness of the model on two benchmarks.

2 Approach

We elaborate our approach to learning a response generation model with knowledge-grounded dialogues, ungrounded dialogues, and plain text.

2.1 Problem Formalization

Suppose that we have a dataset $\mathcal{D}_S=\{(d_i, c_i, r_i)\}_{i=1}^{n}$, where for each $i$, $d_i$ is a document that serves as the background of the dialogue, $c_i=(u_{i,1},\ldots,u_{i,l_i})$ is the context of the dialogue with $u_{i,j}$ the $j$-th utterance, and $r_i$ is the response with regard to $d_i$ and $c_i$. In addition to $\mathcal{D}_S$, we further assume that there are $\mathcal{D}_K=\{d_j\}_{j=1}^{N_k}$ with $d_j$ a document, and $\mathcal{D}_C=\{(c_j, r_j)\}_{j=1}^{N_c}$ with $(c_j, r_j)$ a context-response pair, where $N_k \gg n$ and $N_c \gg n$. The goal is to learn a generation model $P(r\,|\,c,d;\theta)$ ($\theta$ denotes the parameters of the model) with $\mathcal{D}=\mathcal{D}_S\cup\mathcal{D}_K\cup\mathcal{D}_C$. Thus, given a new document $d$ with the associated dialogue context $c$, one can generate a response $r$ following $P(r\,|\,c,d;\theta)$.

Our idea is inspired by an observation on the nature of open domain dialogues: despite the fact that a dialogue is based on a document, words and utterances in the dialogue are not always related to the document (e.g., a reply may just echo the previous turn), even for the turns from the interlocutor who has access to it, as demonstrated by the examples in Dinan et al. (2019) and Zhou et al. (2018b). Therefore, we postulate that the formation of a response can be decomposed into three uncorrelated actions: (1) selecting a word according to what has been generated so far to make the sentence linguistically valid (corresponding to a language model); (2) selecting a word according to the context to make the dialogue coherent (corresponding to a context processor); and (3) selecting a word according to the extra knowledge to ground the dialogue (corresponding to a knowledge processor). The three actions can be independently learned, which becomes the key to aiding the small knowledge-grounded corpus with the large ungrounded dialogues and documents.

Figure 1: Architecture of the generation model.

2.2 Generation Model

Figure 1 illustrates the architecture of the model. The model is made up of a context encoder, a knowledge encoder, a decoder, and a decoding manager. The major difference from existing models lies in the decoding phase, which simulates the aforementioned actions by decomposing the decoder into a language model, a context processor, and a knowledge processor. The three components are independent conditioned on the hidden states of the decoder, and are coordinated by the manager.


Given a dialogue context $c=(u_1,\ldots,u_l)$, the context encoder concatenates $u_1,\ldots,u_l$ as a word sequence $(w_1,\ldots,w_T)$ with $w_t$ the $t$-th word in the sequence, and then exploits a recurrent neural network with gated recurrent units (GRUs) (Chung et al., 2014) to transform the word sequence into a sequence of hidden vectors $\{h_t\}_{t=1}^{T}$ given by

$$h_t=\mathrm{GRU}\big(e(w_t),\,h_{t-1}\big),$$

where $e(w_t)$ is the embedding of $w_t$, initialized with GloVe (Pennington et al., 2014). $\{h_t\}_{t=1}^{T}$ serve as the input of the context processor in decoding.
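For readers who prefer code, the recurrence above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation; all weight names (`Wz`, `Uz`, etc.) and the toy dimensions are placeholders.

```python
import math
import random

def gru_cell(x, h, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.

    x: input embedding (list of floats); h: previous hidden state.
    params: dict of weight matrices (illustrative names, no biases).
    """
    def linear(W, v):
        return [sum(w * vi for w, vi in zip(row, v)) for row in W]

    def add(a, b):
        return [ai + bi for ai, bi in zip(a, b)]

    sigmoid = lambda v: [1.0 / (1.0 + math.exp(-vi)) for vi in v]

    z = sigmoid(add(linear(params["Wz"], x), linear(params["Uz"], h)))
    r = sigmoid(add(linear(params["Wr"], x), linear(params["Ur"], h)))
    rh = [ri * hi for ri, hi in zip(r, h)]
    h_tilde = [math.tanh(v) for v in add(linear(params["Wh"], x),
                                         linear(params["Uh"], rh))]
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_tilde)]

def encode(embeddings, hidden_size, params):
    """Run the GRU over a word sequence; return all hidden vectors."""
    h = [0.0] * hidden_size
    states = []
    for x in embeddings:
        h = gru_cell(x, h, params)
        states.append(h)
    return states
```

The returned list of states plays the role of $\{h_t\}_{t=1}^{T}$ above.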

In the meanwhile, given a document $d=(s_1,\ldots,s_m)$ with $s_j$ the $j$-th sentence, the knowledge encoder represents each $s_j=(w_{j,1},\ldots,w_{j,T_j})$ as a sequence of hidden vectors $\{h_{j,k}\}_{k=1}^{T_j}$ through a bidirectional GRU (Cho et al., 2014):

$$h_{j,k}=\mathrm{biGRU}\big(e(w_{j,k}),\,h_{j,k-1},\,h_{j,k+1}\big),$$

where $e(w_{j,k})$ is the embedding of the $k$-th word in $s_j$, initialized using GloVe. $\{h_{j,k}\}$ are fed to the knowledge processor to ground response prediction on $d$.

Different from the Transformer Memory Network (Dinan et al., 2019), our model does not perform knowledge selection in the encoding phase (e.g., via attention over the knowledge hidden vectors), but leaves it to the decoding phase. This removes the dependency between context encoding and knowledge encoding, and allows us to estimate the parameters of the two encoders with the ungrounded dialogues and the plain documents respectively.

Disentangled Decoder

The decoder maintains a hidden sequence $\{s_t\}$. Let $e(y_{t-1})$ be the embedding of the word predicted at step $t-1$, then $s_t$ is defined by

$$s_t=\mathrm{GRU}\big(e(y_{t-1}),\,s_{t-1}\big),$$

where $s_0$ is initialized from the context encoder. Based on $s_t$, the three components are defined as follows:

Language Model. The language model predicts a word based on the decoder hidden state $s_t$ alone. For words that need neither the context nor the document (e.g., function words), employing the language model may enhance decoding speed without loss of accuracy. Formally, the generation probability is defined by

$$P_{lm}(y_t=w)=\mathrm{softmax}\big(V_{lm}\,s_t+b_{lm}\big)_w,$$

where $V_{lm}$ and $b_{lm}$ are parameters.
Context Processor. The context processor predicts a word by attending over $\{h_i\}_{i=1}^{T}$. The word can be either fetched from the vocabulary or copied from the context $c$. Let $C_t$ be the context vector at step $t$, then $C_t$ can be formulated as

$$C_t=\sum_{i=1}^{T}\alpha_{t,i}\,h_i,\qquad \alpha_{t,i}=\frac{\exp\big(\eta(s_t,h_i)\big)}{\sum_{i'}\exp\big(\eta(s_t,h_{i'})\big)},$$

where $\{\alpha_{t,i}\}$ denotes the attention distribution and $\eta(\cdot,\cdot)$ is a similarity function. The generation probability is defined by

$$P_{c}(y_t=w)=\lambda_t\,P_{v}(w\,|\,s_t,C_t)+(1-\lambda_t)\sum_{i:\,w_i=w}\alpha_{t,i}. \tag{6}$$

In Equation (6), the first term models the correspondence between a context and a response, and is formulated as $P_{v}(w\,|\,s_t,C_t)=\mathrm{softmax}\big(V_c\,[s_t;C_t]+b_c\big)_w$. The second term models the copy mechanism, and $\lambda_t\in(0,1)$ acts as a trade-off between the two terms.
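The vocabulary/copy mixture of Equation (6) can be sketched as follows; the names `p_vocab` and `gate` are illustrative and not taken from the paper's implementation.

```python
def copy_mix(p_vocab, attention, context_words, gate, vocab):
    """Mix a vocabulary distribution with a copy distribution.

    p_vocab: dict word -> probability from the softmax over the vocabulary.
    attention: attention weights over context positions (sums to 1).
    context_words: the words at those positions.
    gate: the trade-off scalar (lambda in the text), in (0, 1).
    """
    p_final = {w: gate * p_vocab.get(w, 0.0) for w in vocab}
    for a, w in zip(attention, context_words):
        # Copy term: attention mass at position i goes to the word there,
        # so out-of-vocabulary context words remain reachable.
        p_final[w] = p_final.get(w, 0.0) + (1.0 - gate) * a
    return p_final
```

Because both input distributions sum to one, the mixture is again a valid distribution.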

Knowledge Processor. The knowledge processor goes through the document by a hierarchical attention mechanism, and predicts a word in a similar way as Equation (6). Formally, let $\{a_{t,j}\}_{j=1}^{m}$ and $\{\alpha_{t,j,k}\}$ be the sentence-level attention distribution and the word-level attention distributions respectively at step $t$, then $a_{t,j}$ and $\alpha_{t,j,k}$ are calculated by

$$a_{t,j}=\frac{\exp\big(\eta(s_t,\bar{h}_j)\big)}{Z_t},\qquad \alpha_{t,j,k}=\frac{\exp\big(\eta(s_t,h_{j,k})\big)}{Z_{t,j}},$$

where $Z_t$ and $Z_{t,j}$ are normalization factors, and $\bar{h}_j$ represents the average pooling of $\{h_{j,k}\}_{k}$. A knowledge vector $K_t$ that is analogous to the context vector is then defined by

$$K_t=\sum_{j=1}^{m} a_{t,j}\sum_{k}\alpha_{t,j,k}\,h_{j,k}.$$

Finally, the generation probability is formulated as

$$P_{k}(y_t=w)=\lambda'_t\,P_{v'}(w\,|\,s_t,K_t)+(1-\lambda'_t)\sum_{(j,k):\,w_{j,k}=w} a_{t,j}\,\alpha_{t,j,k},$$

where $w_{j,k}$ is the $k$-th word of the $j$-th sentence of $d$, $P_{v'}(w\,|\,s_t,K_t)=\mathrm{softmax}\big(V_k\,[s_t;K_t]+b_k\big)_w$, and $\lambda'_t\in(0,1)$ acts as a trade-off between the common term and the copy term.
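The hierarchical weighting can be illustrated with a small sketch. The raw scores below stand in for the similarity function; this is an assumption-laden illustration, not the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def hierarchical_attention(sent_scores, word_scores):
    """Combine sentence-level and word-level attention.

    sent_scores: one score per sentence.
    word_scores: per-sentence lists of word scores.
    The combined weight of word k in sentence j is a_j * alpha_{j,k},
    so the weights over all words still sum to 1.
    """
    a = softmax(sent_scores)
    combined = []
    for a_j, scores_j in zip(a, word_scores):
        alpha_j = softmax(scores_j)
        combined.append([a_j * al for al in alpha_j])
    return combined
```

Because each word-level distribution sums to one within its sentence, the sentence-level weights decide how much total mass each sentence receives.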

Decoding Manager

The three components are controlled by the decoding manager, with one picked up at each step of response prediction. Let $\beta_t\in\{0,1\}^{3}$ be a one-hot vector indicating which component is chosen at step $t$, then the probability to predict word $y_t$ can be formulated as

$$P(y_t=w)=\beta_t^{\top}\big[P_{lm}(y_t=w),\,P_{c}(y_t=w),\,P_{k}(y_t=w)\big]^{\top},$$

where $P_{lm}$, $P_{c}$, and $P_{k}$ are the generation probabilities of the language model, the context processor, and the knowledge processor respectively. In training, to handle the discrete and non-differentiable selection process, we employ the Gumbel trick (Jang et al., 2016) and define $\beta_t$ as

$$\beta_t=\mathrm{GumbelSoftmax}\big(\pi(s_t),\,\tau\big),$$

where $\pi(s_t)$ returns the logits of the three components given $s_t$, $\mathrm{GumbelSoftmax}(\cdot)$ denotes the Gumbel-Softmax function (Jang et al., 2016), and $\tau$ is the temperature (a hyperparameter). $\beta_t$ approaches a one-hot vector when $\tau\to 0$. We start from a high temperature and gradually reduce it. In test, we discretize $\beta_t$ as a one-hot vector according to the component distribution $\mathrm{softmax}\big(\pi(s_t)\big)$.
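A minimal sketch of Gumbel-Softmax sampling and temperature annealing follows. The uniform draws are injected as arguments so the example is deterministic; in training they would come from a random generator, and the variable names are illustrative.

```python
import math

def gumbel_softmax(logits, tau, uniforms):
    """Relaxed one-hot sample: add Gumbel noise -log(-log(U)) to each
    logit, then apply a temperature-scaled softmax."""
    noisy = [l - math.log(-math.log(u)) for l, u in zip(logits, uniforms)]
    m = max(noisy)
    exps = [math.exp((v - m) / tau) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def anneal(tau0, tau_min, rate, step):
    """Exponentially decay the temperature toward a floor."""
    return max(tau_min, tau0 * math.exp(-rate * step))
```

With identical noise on every logit, a low temperature concentrates the output on the largest logit (near one-hot), while a high temperature keeps it smooth.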

2.3 Learning Details

Let us denote $e_{lm}$, $e_{c}$, and $e_{k}$ as the parameters of word embedding in response prediction corresponding to the language model, the context processor, and the knowledge processor respectively. For simplicity, we let $e_{lm}=e_{c}=e_{k}$. Then the parameters of the context encoder, the parameters of the hidden states of the decoder, and the parameters of the context processor are estimated with maximum likelihood estimation (MLE) on the ungrounded dialogues.

To estimate the parameters of the language model and the shared embedding, we construct a corpus in which each example is a response or an utterance from a context in the ungrounded dialogues, and then learn the parameters with MLE on this corpus with the other parameters fixed.

Inspired by Peters et al. (2018), we estimate the parameters of the knowledge encoder using a bidirectional language model by minimizing the following loss function on the plain documents:

$$\mathcal{L}=-\sum_{j}\sum_{t}\Big(\log p\big(w_{j,t}\mid w_{j,<t};\overrightarrow{\theta}\big)+\log p\big(w_{j,t}\mid w_{j,>t};\overleftarrow{\theta}\big)\Big),$$

where $w_{j,t}$ is the $t$-th word of the $j$-th document, and $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ denote the parameters of the forward and the backward directions respectively.
The remaining parameters (i.e., parameters of the knowledge processor and parameters of the decoding manager) are learned with MLE on the knowledge-grounded dialogues with all other parameters fixed. Note that the parameters of the word embeddings in the encoders are included in the corresponding pre-trained parameter groups.
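The staged estimation described in this subsection can be summarized schematically. The group and corpus names below paraphrase the text; this is not the authors' training script.

```python
# Each stage updates one group of parameters on one data source;
# everything learned in earlier stages stays fixed afterwards.
STAGES = [
    # (parameters updated,                                        data used)
    (["context_encoder", "decoder_states", "context_processor"], "ungrounded_dialogues"),
    (["language_model"],                                         "responses_and_utterances"),
    (["knowledge_encoder"],                                      "plain_documents"),
    (["knowledge_processor", "decoding_manager"],                "knowledge_grounded_dialogues"),
]

def trainable(stage_index):
    """Parameters updated at a given stage."""
    return set(STAGES[stage_index][0])

def frozen(stage_index):
    """Parameters already learned in earlier stages, hence fixed."""
    done = set()
    for params, _ in STAGES[:stage_index]:
        done.update(params)
    return done
```

Only the last stage touches the knowledge-grounded dialogues, which is why it can make do with few examples.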

Remarks. We focus on document-grounded dialogue generation in this work, but the approach proposed actually provides a recipe for a general solution to low-resource knowledge-grounded dialogue generation in which the knowledge could be a structured knowledge base, images, or videos. To do that, one only needs to modify the knowledge encoder and the knowledge processor to make them compatible with the specific type of knowledge, and pre-train the knowledge encoder, if possible, on single-modal knowledge data.

3 Experiments

We test the proposed model on Wizard of Wikipedia (Wizard) published in Dinan et al. (2019) and CMU Document Grounded Conversations (CMUDoG) published in Zhou et al. (2018b).

3.1 Datasets and Evaluation Metrics

Both Wizard and CMUDoG consist of open domain dialogues grounded on wiki articles, and the dialogues are collected from crowd-workers on Amazon Mechanical Turk. In Wizard, the articles cover a wide range of topics such as bowling, Gouda cheese, and Arnold Schwarzenegger. Each conversation happens between a wizard who has access to knowledge about a specific topic and an apprentice who is eager to learn from the wizard about the topic. On average, each wizard turn is associated with a set of sentences retrieved from the wiki articles. The data are split into a training set, a validation set, and a test set by the data owner. The test set is split into two subsets: Test Seen and Test Unseen. Test Seen contains new dialogues with topics appearing in the training set, while topics in Test Unseen never appear in the training set or the validation set, and thus the data allow us to examine the generalization ability of models. The task is to generate a response for each wizard turn based on the dialogue history and the retrieved knowledge. As pre-processing, for each wizard turn in the training/validation/test sets, only the most recent words in the dialogue history are kept as a context. The pre-processing strictly follows the procedure in Dinan et al. (2019), and is conducted with the code published on ParlAI.

Different from Wizard, CMUDoG focuses on the movie domain (although covering various genres). In addition to the wizard & apprentice setting, the data also contain dialogues between two workers who both know the document and try to discuss its content in depth. Each document consists of several sections, and the sections are shown to the workers one by one every few turns (the first section lasts longer due to initial greetings). The data have been divided into a training set, a validation set, and a test set by the data owner. The task is to generate a response for each turn from a worker who has access to the document, based on the dialogue history and the associated section as knowledge. Similar to Wizard, only the most recent words in the dialogue history are kept as a context. More details of the datasets can be found in Appendix A.

We choose the Reddit Conversation Corpus cleaned by Dziri et al. (2018) as the ungrounded dialogues, with large numbers of context-response pairs for training and for validation, and each context consisting of multiple utterances. We use the Wikipedia dump published on ParlAI as the document corpus, keeping the first paragraph of each article for learning. Articles that appear in Wizard and CMUDoG are removed beforehand. For both Wizard and CMUDoG, the vocabulary is made up of the most frequent words appearing in the corpora, with other words regarded as unknown tokens.

Following the common practice in evaluating open domain dialogue generation, we choose perplexity (PPL) of the ground-truth response, BLEU (Papineni et al., 2002), and BOW Embedding (Liu et al., 2016) as metrics. Besides, we also follow Dinan et al. (2019) and employ unigram F1 as a metric. BLEU and embedding-based metrics are computed with the open-source NLG evaluation toolkit available at https://github.com/Maluuba/nlg-eval, and unigram F1 is calculated with the code published at https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/metrics.py. Besides quantitative evaluation, we also recruit human annotators for a qualitative analysis of response quality, which is presented in Appendix C.
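As a reminder of how the PPL metric is computed (this is the standard definition, not code from this paper): perplexity is the exponential of the average negative log-likelihood the model assigns to the ground-truth tokens.

```python
import math

def perplexity(token_probs):
    """PPL of a response given the model's per-token probabilities:
    exp of the mean negative log-likelihood."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns uniform probability over k choices to every token has perplexity exactly k, which is why lower PPL indicates a sharper, better-fitting model.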

3.2 Baselines

The following models are selected as baselines:

Transformer Memory Network (TMN). The model proposed by Dinan et al. (2019) along with the release of the Wizard data. It is built upon a transformer architecture with an external memory hosting the knowledge. We implement the model using the code shared at https://github.com/facebookresearch/ParlAI/blob/master/projects/wizard_of_wikipedia.

Incremental Transformer with Deliberation Decoder (ITDD). A transformer-based model published very recently on ACL’19 (Li et al., 2019). The encoder incrementally represents multi-turn dialogues and knowledge, and the decoder conducts response decoding in two passes similar to the deliberation network in machine translation. We implement the model using the code shared at https://github.com/lizekang/ITDD.

Note that to make the comparison fair, we employ the end-to-end version of TMN without the knowledge regularization in learning. After all, one can include ground-truth signals on knowledge selection in both our model and TMN, and improve the two in the same way, although such signals are not available in most scenarios (e.g., in CMUDoG).


Models \ Metrics: PPL F1 BLEU-1 BLEU-2 BLEU-3 BLEU-4 Average Extrema Greedy
TMN (Dinan et al., 2019) 66.5 15.9 0.184 0.073 0.033 0.017 0.844 0.427 0.658
ITDD (Li et al., 2019) 17.8 16.2 0.158 0.071 0.040 0.025 0.841 0.425 0.654
FULL DATA 23.0 18.0 0.218 0.115 0.075 0.055 0.835 0.434 0.658
1/2 DATA 25.3 17.5 0.217 0.113 0.073 0.053 0.833 0.431 0.657
1/4 DATA 29.2 16.9 0.212 0.105 0.064 0.044 0.833 0.429 0.658
1/8 DATA 33.5 16.3 0.206 0.098 0.059 0.039 0.832 0.425 0.658
1/16 DATA 38.6 15.7 0.197 0.091 0.052 0.033 0.834 0.428 0.655


Table 1: Evaluation results on Test Seen of Wizard.

3.3 Evaluation Results

To simulate a low-resource scenario, we start from the full training data and gradually reduce the number of training examples by halving the training set. Note that the baseline models are learned with the full training sets. Table 1 and Table 2 report evaluation results on Test Seen and Test Unseen of Wizard respectively, and Table 3 reports evaluation results on CMUDoG. Through pre-training the vast majority of the parameters with the ungrounded dialogues and the plain text and fixing these parameters afterwards, our model holds state-of-the-art performance in terms of most metrics on all test sets even when the training sets have been drastically cut, and has stable performance on Test Unseen with respect to different training sizes. Particularly, the model achieves more significant improvement over the baselines on Test Unseen, and when the training set shrinks, the performance gap between Test Seen and Test Unseen becomes marginal. The results show a good generalization ability of the proposed model on out-of-domain knowledge. ITDD achieves low PPL on both Test Seen and CMUDoG, which may stem from overfitting by the two-pass decoder. As evidence, the model is just comparable with TMN on most metrics except PPL on Test Seen and CMUDoG, and is worse than our model on Test Unseen even in terms of PPL.


Models \ Metrics: PPL F1 BLEU-1 BLEU-2 BLEU-3 BLEU-4 Average Extrema Greedy
TMN (Dinan et al., 2019) 103.6 14.3 0.168 0.057 0.022 0.009 0.839 0.408 0.645
ITDD (Li et al., 2019) 44.8 11.4 0.134 0.047 0.021 0.011 0.826 0.364 0.624
FULL DATA 25.6 16.5 0.207 0.101 0.062 0.043 0.828 0.422 0.628
1/2 DATA 27.7 16.7 0.208 0.103 0.064 0.045 0.827 0.421 0.647
1/4 DATA 32.4 16.2 0.205 0.098 0.060 0.041 0.828 0.423 0.650
1/8 DATA 35.8 16.0 0.201 0.093 0.054 0.035 0.831 0.419 0.653
1/16 DATA 41.0 15.3 0.191 0.087 0.050 0.032 0.832 0.424 0.652


Table 2: Evaluation results on Test Unseen of Wizard.


Models \ Metrics: PPL F1 BLEU-1 BLEU-2 BLEU-3 BLEU-4 Average Extrema Greedy
TMN (Dinan et al., 2019) 75.2 9.9 0.115 0.040 0.016 0.007 0.789 0.399 0.615
ITDD (Li et al., 2019) 26.0 10.4 0.095 0.036 0.017 0.009 0.748 0.390 0.587
FULL DATA 54.4 10.7 0.150 0.057 0.025 0.012 0.809 0.413 0.633
1/2 DATA 57.0 10.4 0.142 0.052 0.022 0.010 0.808 0.414 0.635
1/4 DATA 61.7 10.5 0.131 0.046 0.019 0.009 0.781 0.402 0.613
1/8 DATA 67.6 10.2 0.121 0.044 0.019 0.009 0.787 0.407 0.622


Table 3: Evaluation results on CMUDoG.
Figure 2: Performance of variants of the proposed model on Wizard. (a) Comparison of parameter fine-tuning and parameter fixing on Test Seen. (b) Comparison of parameter fine-tuning and parameter fixing on Test Unseen. (c) Results of pre-training ablation on Test Seen. (d) Results of pre-training ablation on Test Unseen.

3.4 Discussions

In addition to the performance of the model under low-resource settings, we are also curious about Q1: what if we fine-tune the pre-trained parameters, rather than fixing them, with the training data of the knowledge-grounded dialogues, given that pre-training followed by fine-tuning has become the fashion in NLP research and engineering? Q2: can we somehow leverage the ungrounded dialogues and the plain text in learning of TMN, and in this case, will there be any change in the comparison with our model? And Q3: what is the impact of pre-training on different components of the proposed model?

Answer to Q1: Figures 2(a) and 2(b) compare our models with fine-tuned parameters and with fixed parameters on Test Seen and Test Unseen respectively. Basically, when there are enough training data, fine-tuning can further improve the model on both in-domain and out-of-domain knowledge. On the other hand, when the training size is small, which is the assumption of this paper, fine-tuning may cause overfitting and lead to a performance drop on the test sets. Test Unseen is more vulnerable than Test Seen, and the smaller the training size is, the bigger the gap is between the model with fixed parameters and the model with fine-tuned parameters. Therefore, in a low-resource setting, it is better to fix the pre-trained parameters and only estimate the remaining small portion of parameters with the training data.

Answer to Q2: Normally, it is not trivial to learn an entangled architecture like TMN with ungrounded dialogues and plain text. However, to make the comparison even more fair, we first pre-train a transformer-based encoder-decoder with the Reddit data. The encoder is fixed and used for TMN, and the parameters of the decoder are used to initialize the parameters of the decoder of TMN. Then, we pre-train the document representation in TMN with the Wikipedia dump. Finally, the knowledge attention in encoding and the decoder are learned (fine-tuned) with the training data of knowledge-grounded dialogues, as knowledge and dialogue contexts are entangled in these two modules. Figure 3 compares the pre-trained TMN with our model. Even though we have tried our best to make TMN use the ungrounded dialogues and the plain text, it is still much worse than our model. The results indicate the importance of disentangling for leveraging ungrounded dialogues and plain text in low-resource knowledge-grounded dialogue generation.

Answer to Q3: Figures 2(c) and 2(d) show the results of the ablation study in terms of pre-training. -lm means that the parameters of the language model are not pre-trained but estimated together with the remaining parameters on the knowledge-grounded dialogues. Similarly, -context and -knowledge mean that pre-training is removed from the context-related components and the knowledge encoder respectively. We can conclude that (1) pre-training is crucial to low-resource knowledge-grounded dialogue generation, since removing any component from pre-training causes a performance drop when the training data is small; and (2) in terms of impact on performance, lm > context > knowledge on Test Seen, while knowledge > lm > context on Test Unseen.

Figure 3: Comparison with pre-trained TMN on Wizard. (a) PPL. (b) F1. (c) BLEU-1.

4 Related Work

Research on end-to-end open domain dialogue generation is encouraged by the success of neural sequence-to-sequence models on machine translation (Sutskever et al., 2014). On top of the basic architecture (Shang et al., 2015; Vinyals and Le, 2015), various extensions have been made to tackle the safe response problem (Li et al., 2015; Xing et al., 2017; Zhao et al., 2017; Song et al., 2018; Tao et al., 2018; Qiu et al., 2019); to model dialogue history for multi-turn conversation (Serban et al., 2016, 2017); and to learn with advanced machine learning techniques (Li et al., 2016, 2017). Very recently, grounding response generation on a specific type of knowledge, such as triples from a knowledge base (Zhou et al., 2018a), documents (Ghazvininejad et al., 2018; Zhao et al., 2019), personas (Zhang et al., 2018), and images (Mostafazadeh et al., 2017), has emerged as a new fashion in the research of open domain dialogue systems. This work aligns with the trend by considering document-grounded dialogue generation. Our model is built upon state-of-the-art neural generation techniques such as attention (Bahdanau et al., 2015; Yang et al., 2016) and copying (See et al., 2017; Raghu and Gupta, 2019; Yavuz et al., 2019), but is unique in that its components are pre-trained from various sources, thanks to the disentangled design. Thus, rather than testing new architectures on the benchmarks, our main contribution lies in the investigation of knowledge-grounded dialogue generation under a low-resource setting with pre-training techniques, which is rooted in practical requirements.

The idea of “disentangling response decoding” is inspired by the similar research in representation learning that aims to seek a representation axis aligning with the generative factors of data (Bengio et al., 2013). State-of-the-art models are built within the framework of variational auto-encoding (Kingma and Welling, 2013) either under an unsupervised assumption (Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2016, 2018) or aided by a few labels (Narayanaswamy et al., 2017; Locatello et al., 2019). In this work, we borrow the concept of “disentangling”, but apply it to the structure of the decoder of a response generation model. The result is a few independent components that allow asynchronous parameter estimation. The work is also encouraged by the recent breakthrough on pre-training for NLP tasks (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Song et al., 2019). We take advantage of disentanglement, and employ pre-training techniques to tackle the low-resource challenge in the task of knowledge-grounded dialogue generation.

5 Conclusions

We study knowledge-grounded dialogue generation under a low-resource setting. To overcome the challenge of insufficient training data, we propose decomposing the response decoder into independent components such that most parameters no longer rely on the training data and can be estimated from large-scale ungrounded dialogues and unstructured documents. Evaluation results on two benchmarks indicate that our model achieves state-of-the-art performance with only a fraction of the training data, and exhibits a good generalization ability on out-of-domain knowledge.


Acknowledgments

We would like to thank the reviewers for their constructive comments. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC No. 61876196 and NSFC No. 61672058). Rui Yan was sponsored as a young fellow of the Beijing Academy of Artificial Intelligence (BAAI). Rui Yan is the corresponding author.


Appendix A Details of datasets

Table 4 reports the statistics of the Wizard data and the CMUDoG data.


Wizard of Wikipedia CMUDoG
Train Valid Test Seen Test Unseen Train Valid Test
Number of Utterances 166,787 17,715 8,715 8,782 74,717 4,993 13,646
Number of Conversations 18,430 1,948 965 968 3,373 229 619
Number of Topics/Documents 1,247 599 533 58 30 30 30
Average Turns per Dialogue 9.0 9.1 9.0 9.1 22.2 21.8 22.0


Table 4: Statistics of the two datasets.

Appendix B More Implementation Details

In both Wizard and CMUDoG, the context encoder, the knowledge encoder, and the decoder share the same hidden size, and word embeddings are of a fixed size. The similarity functions are implemented with single-layer feed-forward networks (FFNs) with tanh non-linearity, and the gating and scoring functions are implemented with one- or two-layer FFNs. All models are learned with the Adam optimizer (Kingma and Ba, 2015). We increase the learning rate linearly for the first training steps and decrease it thereafter proportionally to the inverse square root of the step number. The temperature of the Gumbel-Softmax starts from a high initial value and is gradually annealed toward a minimum. In training, we add dropout, but do not see much difference. Early stopping on validation is adopted as a regularization strategy. We employ beam search in response decoding. We add weak supervision to guide the training of the decoding manager, where words that belong to modal verbs are forced to be classified as the language model.
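The learning-rate schedule described above (linear warmup followed by inverse-square-root decay) can be sketched as follows; `base_lr` and `warmup` stand in for the unspecified values and are purely illustrative.

```python
import math

def learning_rate(step, base_lr, warmup):
    """Linear warmup to base_lr over `warmup` steps, then decay
    proportionally to the inverse square root of the step number.
    The two pieces meet exactly at step == warmup."""
    if step <= warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup) / math.sqrt(step)
```

The sqrt(warmup) factor makes the decay branch continuous with the warmup peak, a common choice in transformer-style schedules.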

Appendix C Human Evaluation


Models \ Metrics: Seen (Fluency / Context Coherence / Knowledge Relevance / Kappa) | Unseen (Fluency / Context Coherence / Knowledge Relevance / Kappa)
TMN (Dinan et al., 2019) 1.26 0.51 0.47 0.60 1.40 0.35 0.46 0.68
ITDD (Li et al., 2019) 1.69 1.18 1.16 0.70 1.72 0.73 0.71 0.69
1/4 DATA 1.77 1.54 1.17 0.58 1.75 1.26 1.18 0.57
1/8 DATA 1.68 1.44 1.13 0.60 1.73 1.21 1.25 0.57


Table 5: Human evaluation results on Wizard.

The goal of the human study is to gain more insight into the quality of responses generated by the different models. To this end, we randomly sample examples from Test Seen and Test Unseen respectively, and recruit well-educated native speakers as annotators. Comparison is conducted among TMN, ITDD, our model with full training data, and our model with 1/8 training data. On each test set, for each example, an annotator is provided with a context, the ground-truth knowledge, and the responses from the models under evaluation (the top response in beam search). Responses are pooled and randomly shuffled to hide their sources. Each annotator then judges the responses on three aspects, namely fluency, context coherence, and knowledge relevance, and assigns each response a score from 0 to 2 on each aspect, where 0 means bad, 1 means fair, and 2 means good. Each response receives multiple scores on each aspect, and agreement among the annotators is measured with Fleiss' kappa (Fleiss, 1971). Table 5 reports the average scores on the three aspects. Overall, the proposed model achieves state-of-the-art performance on all three aspects on both Test Seen and Test Unseen, even when only 1/8 of the training examples are left. All kappa values exceed or are close to 0.6, indicating substantial agreement among the annotators. The results are consistent with those reported in Table 1 and Table 2. Our model estimates the decoder with abundant extra resources, and ITDD exploits a two-pass decoder; therefore, both models produce grammatical and fluent responses, regardless of whether the background knowledge is within or outside the domain of training. Moreover, with the large-scale Reddit data used in learning the context processor, our model makes the dialogues more coherent than the baselines, although there is a slight drop on Test Unseen compared to Test Seen.
Since the model obtains only limited guidance from training about the connection between the knowledge and the dialogues, making responses relevant to the knowledge remains challenging, although our model does a better job than the baselines.
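The agreement statistic above can be computed directly from the raw annotations. A minimal sketch of Fleiss' kappa (Fleiss, 1971), assuming every response is rated by the same number of annotators:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of category labels per item,
    where every item is rated by the same number of annotators."""
    n = len(ratings[0])                # annotators per item
    totals = Counter()                 # category counts over all ratings
    P_bar = 0.0                        # mean per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
        P_bar += (sum(v * v for v in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    N = len(ratings) * n               # total number of ratings
    P_e = sum((c / N) ** 2 for c in totals.values())  # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

With the 0/1/2 annotation scheme used here, values around 0.6 or above are conventionally read as substantial agreement.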

Table 6: A case from Test Unseen of Wizard.

Table 6 shows an example from Test Unseen, from which we can see that the response from our model (with 1/8 training data) not only smoothly follows the context, but also expands the topic with proper pieces of knowledge (highlighted in red). In contrast, the responses from the baselines merely reply to the context and lose the connection with the knowledge, as analyzed with the results in Table 5. We also visualize the sources of the words in the response with colors. Basically, words that have weak or no correlation with the context and the knowledge are generated by the language model, words that connect with the context but have nothing to do with the knowledge are generated by the context processor, and words that are copied from the knowledge are generated by the knowledge processor.
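This attribution can be reproduced mechanically by assigning each generated token to the decoder component carrying the largest mixture weight at its step. A small sketch; the per-step weight format is an illustrative assumption:

```python
def attribute_words(tokens, step_weights):
    """Map each generated token to the component ("lm", "context", or
    "knowledge") whose mixture weight is largest at that decoding step,
    mirroring the color-coding in Table 6."""
    return [(tok, max(w, key=w.get)) for tok, w in zip(tokens, step_weights)]
```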

Appendix D Comparison with MASS

(a) F1
(b) BLEU-1
Figure 4: Comparison with MASS on Wizard.

We compare our model with MASS (Song et al., 2019), a pre-training technique that achieves state-of-the-art performance on several language generation tasks such as machine translation, text summarization, and conversational response generation. MASS first pre-trains an encoder-decoder architecture on large-scale monolingual data from the WMT News Crawl datasets by reconstructing a fragment of a sentence from the remaining tokens, and then fine-tunes the architecture on downstream language generation tasks. We use the code and the model published at https://github.com/microsoft/MASS. The original model performs sequence-to-sequence generation; to adapt it to knowledge-grounded dialogue generation, we concatenate the knowledge sentences and the conversation history into one long context as the input to the encoder.
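The adaptation amounts to flattening the two inputs into one sequence before encoding. A minimal sketch; the separator token and length budget are illustrative assumptions, not MASS's actual preprocessing:

```python
def build_encoder_input(knowledge_sents, history, sep=" </s> ", max_tokens=512):
    """Concatenate knowledge sentences and dialogue history into a single
    encoder input, truncating from the front so that the most recent
    turns survive when the length budget is exceeded."""
    text = sep.join(knowledge_sents + history)
    tokens = text.split()
    if len(tokens) > max_tokens:
        tokens = tokens[-max_tokens:]  # keep the tail: recent turns matter most
    return " ".join(tokens)
```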

Figure 4 shows the evaluation results. Note that, unlike Figure 3, we do not include PPL as a metric, since MASS performs generation with sub-words and thus is not comparable with our model on PPL. On both Test Seen and Test Unseen, our model consistently outperforms MASS over all training sizes. One reason might be that "mask then predict", the pre-training strategy exploited by MASS, is not an effective way to leverage text data for knowledge-grounded dialogue generation, since the task needs more complicated operations such as deep copying. Another reason might be that MASS is designed for sequence-to-sequence generation and is not compatible with knowledge-grounded response generation, which has an extra knowledge input.
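For reference, the F1 in Figure 4 is the token-overlap F1 between a generated response and the ground truth, as used on the Wizard of Wikipedia benchmark; a minimal sketch:

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Unigram F1: harmonic mean of token-overlap precision and recall
    between a generated response and the reference response."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # multiset overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```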

Appendix E Ablation over Components

Figure 5: Ablation study over the three components of the decoder. (a) Results on Test Seen. (b) Results on Test Unseen.

We conduct an ablation study over the language model, the context processor, and the knowledge processor by completely dropping each of them from the decoding manager (in both training and test). Figure 5(a) and Figure 5(b) report the results on Test Seen and Test Unseen respectively. First, all three components are useful, since removing any of them in general causes a performance drop. Second, in terms of importance, knowledge processor > context processor > language model. The explanation is that (1) part of the function of the language model may be covered by the context processor and the knowledge processor after it is removed (see footnote 5), since both of them also contain language models, although in the full model the language model generates % of the words in the responses on Test Seen and Test Unseen; (2) the context processor is important (generating % of the words), but not always, since a large proportion of responses in the Wizard data depend heavily on the knowledge (e.g., the examples shown in (Dinan et al., 2019)); (3) the knowledge processor (generating % of the words) is the most important component due to the nature of the Wizard data. The results also suggest that in the future we could pre-train the language model with larger and more heterogeneous data such as Common Crawl.
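The ablation can be mimicked at the level of the decoding manager by zeroing out one component's weight and renormalizing the rest. A toy sketch over plain lists; the weighting scheme is an illustrative assumption, not the exact manager:

```python
def manager_step(dists, weights, drop=None):
    """Mix the per-component next-token distributions from the language
    model, context processor, and knowledge processor with the manager's
    weights. Ablating a component zeroes its weight and renormalizes."""
    names = ["lm", "context", "knowledge"]
    w = {n: (0.0 if n == drop else weights[n]) for n in names}
    total = sum(w.values())
    w = {n: v / total for n, v in w.items()}          # renormalize
    vocab = len(dists[names[0]])
    return [sum(w[n] * dists[n][i] for n in names) for i in range(vocab)]
```

Dropping the knowledge processor, for instance, redistributes its weight over the language model and the context processor, so the mixture can no longer place mass on tokens only the knowledge supports.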

Appendix F Comparison without Pre-Training

Figure 6: Comparison with the proposed model without pre-training. (a) Results on Test Seen. (b) Results on Test Unseen.

Figure 6(a) and Figure 6(b) compare two versions of our model on Test Seen and Test Unseen respectively: one pre-trained with ungrounded dialogues and documents, and one trained only on knowledge-grounded dialogues (i.e., no pre-training is performed). We also include the results of TMN for reference. When there is enough training data (e.g., the full data), our model without pre-training outperforms both TMN and the pre-trained version on Test Seen. This is because the attention and copying operations can well capture the correlation among the knowledge, the contexts, and the responses in the training data, while in the pre-trained version only a small proportion of the model benefits from the training data, and a large proportion may suffer from the gap between the knowledge-grounded dialogues collected by crowd-sourcing and the ungrounded dialogues and documents collected from the Web. However, when the training size shrinks, which is the problem we study in this paper, the performance of the model without pre-training drops dramatically, falling below TMN on Test Seen once the training size is small enough. This is because, with insufficient training data, our model is more prone than TMN to overfitting the small training set, which results in poor generalization. In the low-resource setting, pre-training, especially with the disentangled decoder if we consider the results in Figure 3, is an effective way to obtain good generalization on test data. These conclusions are further verified by the comparison on Test Unseen, where non-pre-training is worse than pre-training over all training sizes and quickly drops below TMN when the training data is halved. On Test Unseen, with only a fraction of the training data, the pre-trained model achieves the performance of the non-pre-trained model learned from the full training data.


  1. https://github.com/facebookresearch/ParlAI/blob/master/projects/wizard_of_wikipedia
  2. https://github.com/nouhadziri/THRED
  3. https://github.com/facebookresearch/ParlAI/tree/master/parlai/tasks/wikipedia
  4. “can”, “would”, “could”, “will”, “should”, “may”
  5. “Part of” is because the language model is pre-trained with monolingual Reddit data, which is different from the context processor and the knowledge processor.


  1. Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §4.
  2. Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §4.
  3. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §4.
  4. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §4.
  5. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.2.1.
  6. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §2.2.1.
  7. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.
  8. Wizard of wikipedia: knowledge-powered conversational agents. In ICLR, Cited by: Table 5, Appendix E, §1, §1, §2.1, §2.2.1, §3.1, §3.1, §3.2, Table 1, Table 2, Table 3, §3.
  9. Augmenting neural response generation with context-aware topical attention. arXiv preprint arXiv:1811.01063. Cited by: §1, §3.1.
  10. Measuring nominal scale agreement among many raters.. Psychological bulletin 76 (5), pp. 378. Cited by: Appendix C.
  11. A knowledge-grounded neural conversation model. In AAAI, Cited by: §4.
  12. Beta-vae: learning basic visual concepts with a constrained variational framework.. ICLR 2 (5), pp. 6. Cited by: §4.
  13. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §2.2.3.
  14. Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663. Cited by: §4.
  15. Adam: a method for stochastic optimization. In ICLR, Cited by: Appendix B.
  16. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.
  17. A diversity-promoting objective function for neural conversation models. NAACL, pp. 110–119. Cited by: §4.
  18. Deep reinforcement learning for dialogue generation. In EMNLP, pp. 1192–1202. Cited by: §1, §4.
  19. Adversarial learning for neural dialogue generation. In EMNLP, pp. 2157–2169. Cited by: §1, §4.
  20. Incremental transformer with deliberation decoder for document grounded conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 12–21. Cited by: Table 5, §1, §3.2, Table 1, Table 2, Table 3.
  21. Learning to select knowledge for response generation in dialog systems. arXiv preprint arXiv:1902.04911. Cited by: §1.
  22. How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132. Cited by: §3.1.
  23. Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.
  24. Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258. Cited by: §4.
  25. Image-grounded conversations: multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 462–472. Cited by: §4.
  26. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925–5935. Cited by: §4.
  27. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.1.
  28. Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §2.2.1.
  29. Deep contextualized word representations. In NAACL, pp. 2227–2237. Cited by: §2.3, §4.
  30. Are training samples correlated? learning to generate dialogue responses with multiple references. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3826–3835. External Links: Link, Document Cited by: §4.
  31. Disentangling language and knowledge in task-oriented dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1239–1255. Cited by: §4.
  32. Conversational ai: the science behind the alexa prize. arXiv preprint arXiv:1801.03604. Cited by: §1.
  33. Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: §4.
  34. Building end-to-end dialogue systems using generative hierarchical neural network models.. In AAAI, Vol. 16, pp. 3776–3784. Cited by: §4.
  35. A hierarchical latent variable encoder-decoder model for generating dialogues.. In AAAI, pp. 3295–3301. Cited by: §4.
  36. Neural responding machine for short-text conversation. In ACL, pp. 1577–1586. Cited by: §4.
  37. From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of IT & EE 19 (1), pp. 10–26. Cited by: §1.
  38. MASS: masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pp. 5926–5936. Cited by: Appendix D, §4.
  39. An ensemble of retrieval-based and generation-based human-computer conversation systems.. In IJCAI, pp. 4382–4388. Cited by: §4.
  40. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1, §4.
  41. Get the point of my utterance! learning towards effective responses with multi-head attention mechanism.. In IJCAI, pp. 4418–4424. Cited by: §4.
  42. Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1.
  43. A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §4.
  44. Topic aware neural response generation.. In AAAI, pp. 3351–3357. Cited by: §4.
  45. XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §4.
  46. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489. Cited by: §4.
  47. Deepcopy: grounded response generation with hierarchical pointer networks. arXiv preprint arXiv:1908.10731. Cited by: §4.
  48. Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243. Cited by: §4.
  49. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, pp. 654–664. Cited by: §4.
  50. A document-grounded matching network for response selection in retrieval-based chatbots. In IJCAI, pp. 5443–5449. Cited by: §1, §4.
  51. Commonsense knowledge aware conversation generation with graph attention.. In IJCAI, pp. 4623–4629. Cited by: §4.
  52. A dataset for document grounded conversations. arXiv preprint arXiv:1809.07358. Cited by: §1, §1, §2.1, §3.