Skeleton-to-Response: Dialogue Generation Guided by Retrieval Memory


Traditional generative dialogue models generate responses solely from input queries. Such information is insufficient for generating a specific response since a certain query could be answered in multiple ways. Recently, researchers have attempted to fill the information gap by exploiting information retrieval techniques. For a given query, similar dialogues are retrieved from the entire training data and considered as an additional knowledge source. While the use of retrieval may harvest extensive information, the generative models could be overwhelmed, leading to unsatisfactory performance. In this paper, we propose a new framework which exploits retrieval results via a skeleton-to-response paradigm. At first, a skeleton is extracted from the retrieved dialogues. Then, both the generated skeleton and the original query are used for response generation via a novel response generator. Experimental results show that our approach significantly improves the informativeness of the generated responses.


This paper tackles the challenges of developing a chit-chat style dialogue system (also known as a chatbot). A chit-chat style dialogue system aims to give meaningful and coherent responses to a dialogue query in the open domain. Most modern chit-chat systems fall into two categories, namely, information retrieval-based (IR) models and generative models.

The IR-based models [Ji, Lu, and Li 2014; Hu et al. 2014] directly copy an existing response from a training corpus when receiving a response request. Since the training corpus is usually collected from real-world conversations and possibly post-edited by humans, the retrieved responses are informative and grammatical. However, the performance of such systems drops when a given dialogue history is substantially different from those in the training corpus.

The generative models [Shang, Lu, and Li 2015; Vinyals and Le 2015; Li et al. 2016a], on the other hand, generate a new utterance from scratch. While those generative models generalize better to rare dialogue contexts, the generated responses tend to be universal and non-informative (e.g., “I don’t know”, “I think so”, etc.) [Li et al. 2016a]. This is partly due to the diversity of possible responses to a single query (i.e., the one-to-many problem). The dialogue query alone cannot determine a meaningful and specific response. Thus a well-trained model tends to generate the most frequent (safe but boring) responses instead.

To summarize, IR-based models may give informative but inappropriate responses, while generative models often do the opposite. It is desirable to combine the merits of both. Song et al. (2016) used an extra encoder for the retrieved response. The resulting dense representation, together with the original query, is fed to the decoder in a standard Seq2Seq model [Bahdanau, Cho, and Bengio 2014]. Weston, Dinan, and Miller (2018) used a single encoder that takes the concatenation of the original query and the retrieved response as input. Wu et al. (2018) noted that the retrieved information should be used with awareness of the context difference, and further proposed to construct an edit vector by explicitly encoding the lexical differences between the input query and the retrieved query.

However, in our preliminary experiments, we found that the IR-guided models are inclined to degenerate into a copy mechanism, in which the generative models simply repeat the retrieved response without the necessary modifications. Performance drops sharply when the retrieved response is irrelevant to the input query. A possible reason is that useful and useless information are mixed in the dense vector space, which is uninterpretable and uncontrollable.

To address the above issue, we propose a new framework, skeleton-to-response, for response generation. Our motivations are twofold: (1) The guidance from IR results should only specify a response aspect or pattern, but leave the query-specific details to be elaborated by the generative model itself; (2) The retrieval results typically contain excessive information, such as inappropriate words or entities. It is necessary to filter out irrelevant words and derive a useful skeleton before use.

Our approach consists of two components: a skeleton generator and a response generator. The skeleton generator extracts a response skeleton by detecting and removing unwanted words in a retrieved response. The response generator is responsible for adding query-specific details to the generated skeleton for query-to-response generation. A dialogue example illustrating our idea is shown in Fig. 1. Because the choice of skeleton words is discrete, the training loss is not differentiable with respect to the skeleton generator, and gradients cannot flow from the response back to it. Two techniques are proposed to solve this issue. The first is to employ the policy gradient method, rewarding the output of the skeleton generator based on the feedback from a pre-trained critic. The alternative is to solve both the skeleton generation and the response generation in a multi-task learning fashion.

Our contributions are summarized as below: (1) We develop a novel framework to inject the power of IR results into generative response models by introducing the idea of skeleton generation; (2) Our approach generates response skeletons by detecting and removing unnecessary words, which facilitates the generation of specific responses while not spoiling the generalization ability of the underlying generative models; (3) Experimental results show that our approach significantly outperforms other compared methods, resulting in more informative and specific responses.



In this work, we propose to construct a response skeleton based on the result of IR systems for guiding the response generation. The skeleton-then-response paradigm helps reduce the output space of possible responses and provides useful elements missing in the current query.

For each query, a set of historical query-response pairs is retrieved by an IR technique. We estimate the generation probability of a response conditioned on the query and the retrieved pairs. The whole process is decomposed into two parts. First, we assume that there exists a probabilistic model mapping each retrieved response to a response skeleton. Basically, we mask some parts (ideally useless or unnecessary parts) of a retrieved response to produce a response skeleton. Armed with this skeleton, the final response is generated by revising the skeleton with a second probabilistic model. Our overall model thus consists of two components, namely, the skeleton generator and the response generator, each parameterized by one of the two probabilistic models above.

For clarity, the proposed model is explained in detail under the default setting of a single retrieved pair in the rest of this section. Note that our model is readily extended to incorporate multiple IR results. Fig. 2 depicts the architecture of our proposed framework.

Figure 1: Our idea of leveraging the retrieved query-response pair. It first constructs a response skeleton by removing some words in the retrieved response, then a response is generated via rewriting based on the skeleton.
Figure 2: The architecture of our framework. Given a query “Do you like banana”, a similar historical query “Do you like apple” is retrieved along with its response, i.e., “Yes, apple is my favorite”. Upper: The skeleton generator removes inappropriate words and extracts a response skeleton. Lower: The response generator generates a response based on both the skeleton and the query.

Skeleton Generator

The skeleton generator transforms a retrieved response into a skeleton by explicitly removing information that is inappropriate or useless with respect to the input query. We consider this procedure as a series of word-level masking actions. Following [Wu et al. 2018], we first construct an edit vector by comparing the original query with the retrieved query. In [Wu et al. 2018] the edit vector is used to guide the response generation directly. In our model, the edit vector is used to estimate, for every word in the retrieved response, the probability of being kept or being masked. We define two word sets, namely insertion words and deletion words. The insertion words are the words that are in the original query but not in the retrieved query, while the deletion words are those in the retrieved query but not in the original query.
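The two word sets can be illustrated with a minimal Python sketch. It assumes whitespace-tokenized queries; the function name `edit_word_sets` is hypothetical and is used only for illustration, and the example queries are the ones from Fig. 2.

```python
def edit_word_sets(query_tokens, retrieved_query_tokens):
    """Return (insertion_words, deletion_words) as sorted lists.

    Hypothetical helper: the tokenization and the sorted-list output
    are assumptions made for this sketch.
    """
    query_vocab = set(query_tokens)
    retrieved_vocab = set(retrieved_query_tokens)
    # Insertion words: in the original query but not in the retrieved query.
    insertion = sorted(query_vocab - retrieved_vocab)
    # Deletion words: in the retrieved query but not in the original query.
    deletion = sorted(retrieved_vocab - query_vocab)
    return insertion, deletion


insertion, deletion = edit_word_sets(
    "do you like banana".split(), "do you like apple".split()
)
print(insertion, deletion)  # ['banana'] ['apple']
```

In this example the change in dialogue context is captured by a single insertion word and a single deletion word.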

The two bags of words highlight the changes in the dialogue context, corresponding to the changes in the response. The edit vector is thus defined as the concatenation of the representations of the two bags of words. We use the weighted sum of the word embeddings to get the dense representations of the insertion words and the deletion words. The edit vector is computed as:


Here the two representations are joined by concatenation, each word is mapped to its corresponding embedding vector, and every insertion word and every deletion word receives its own weight. The weights of different words are derived by an attention mechanism [Luong, Pham, and Manning 2015]. Formally, the retrieved response is processed by a bidirectional GRU network (biGRU), whose states are the concatenation of the forward and backward GRU states. The weight of an insertion word is calculated by:


where the scoring function has its own learnable parameters. The weight of a deletion word is obtained in a similar way with another set of parameters.
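The construction of the edit vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact parameterization of the model: the attention scores are taken as given inputs and normalized with a softmax (an assumption), and the names `softmax`, `weighted_sum` and `edit_vector` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    if not scores:
        return []
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, weights, dim):
    """Weighted sum of embedding vectors; a zero vector for an empty bag."""
    if not vectors:
        return [0.0] * dim
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def edit_vector(ins_embeddings, ins_scores, del_embeddings, del_scores, dim):
    """Concatenate the representations of insertion and deletion words."""
    ins_part = weighted_sum(ins_embeddings, softmax(ins_scores), dim)
    del_part = weighted_sum(del_embeddings, softmax(del_scores), dim)
    return ins_part + del_part  # list concatenation gives a 2 * dim vector


z = edit_vector(
    ins_embeddings=[[1.0, 0.0], [0.0, 1.0]], ins_scores=[0.0, 0.0],
    del_embeddings=[[2.0, 2.0]], del_scores=[3.0],
    dim=2,
)
print(z)
```

With equal scores the two insertion words are averaged, so the first half of `z` is their mean embedding and the second half is the single deletion embedding.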

After acquiring the edit vector, we transform the prototype response to a skeleton by the following equations:


where the indicator for a word of the retrieved response equals 0 if the word is replaced with the placeholder “blank” and 1 otherwise. The probability of the indicator is computed by


Response Generator

The response generator can be implemented using most existing IR-augmented models [Song et al. 2016; Weston, Dinan, and Miller 2018; Pandey et al. 2018], simply by replacing the retrieved response input with the corresponding skeleton. We discuss our choices below.

Encoders. Two separate bidirectional LSTM (biLSTM) networks are used to obtain the distributed representations of the query memories and the skeleton memories, respectively. For each biLSTM, the concatenation of the forward and the backward hidden states at each token position is considered a memory slot, producing two memory pools: one for the input query and one for the skeleton.¹

Decoder. During the generation process, our decoder reads information from both the query and the skeleton using an attention mechanism [Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015]. To query the memory pools, the decoder uses its own hidden state as the search key. The matching score function is implemented by bilinear functions:


where the two bilinear matrices, one per memory pool, are trainable parameters. A query context vector is then computed as a weighted sum of all memory slots in the query memory pool, where the weight for a memory slot is derived from its matching score. A skeleton context vector is computed in a similar spirit over the skeleton memory pool.
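The attention read over one memory pool can be sketched in plain Python. This is an illustrative sketch under assumptions: the matching scores are normalized with a softmax before the weighted sum (the normalization is assumed here), and the names `bilinear_score` and `attend` are hypothetical.

```python
import math


def bilinear_score(state, matrix, slot):
    """Compute state^T * matrix * slot for one memory slot."""
    projected = [
        sum(state[i] * matrix[i][j] for i in range(len(state)))
        for j in range(len(slot))
    ]
    return sum(p * s for p, s in zip(projected, slot))


def attend(state, matrix, memory):
    """Return (attention weights, context vector) for one memory pool."""
    scores = [bilinear_score(state, matrix, slot) for slot in memory]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    # The context vector is the weighted sum of all memory slots.
    context = [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]
    return weights, context


decoder_state = [1.0, 0.0]
bilinear_matrix = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only
query_memory = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(decoder_state, bilinear_matrix, query_memory)
print(weights, context)
```

The same function would be called a second time with a separate matrix and the skeleton memory pool to obtain the skeleton context vector.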

The probability of generating the next word is then jointly determined by the decoder’s state, the query context and the skeleton context. We first fuse the information of the decoder’s state and the query context by a linear transformation. For the skeleton context, a gating mechanism is additionally introduced to control the information flow from skeleton memories. Formally, the probability of the next token is estimated by the following transformation followed by a softmax function over the vocabulary:


where the gate is implemented by a single-layer neural network with a sigmoid output layer.


Given that our skeleton generator performs non-differentiable hard masking, the overall model cannot be trained end-to-end using standard maximum likelihood estimation (MLE). A possible solution that circumvents this problem is to treat the skeleton generation and the response generation as two parallel tasks and solve them jointly in a multi-task learning fashion. An alternative is to bridge the skeleton generator and the final response output using reinforcement learning (RL) methods, which can exclusively inform the skeleton generator with the ultimate goal. The latter option is referred to as cascaded integration while the former is called joint integration.

Recall that we have formulated the skeleton generation as a series of binary classifications. Nevertheless, most of the dialogue datasets are end-to-end query-response pairs without explicit skeletons. Hence, we propose to construct proxy skeletons to facilitate the training.

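Definition 1 (Proxy Skeleton): Given a training quadruplet (a query, its response, a retrieved query, and the retrieved response) and a stop word list, the proxy skeleton for the retrieved response is generated by replacing some of its tokens with the placeholder “blank”. A token is kept if and only if it meets the following conditions:
1. it is not in the stop word list;
2. it is a part of the longest common sub-sequence (LCS) [Wagner and Fischer 1974] of the response and the retrieved response.

Input: a training quadruplet, a stop word list
Output: the proxy skeleton, the proxy labels
1:  remove the stop words in the response and the retrieved response
2:  compute the LongestCommonSubsequence of the two filtered responses
3:  for each token of the retrieved response do
4:      set its proxy label to 1 if the token is not a stop word and belongs to the LCS, else 0
5:      keep the token if its label is 1, else replace it with “blank”
6:  end for
7:  return the proxy skeleton and the proxy labels
Algorithm 1 Proxy Skeleton Construction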

The detailed construction process is given in Algorithm 1. The proxy skeletons are used in different manners according to the integration method, which we will introduce below.
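The proxy skeleton construction can be sketched in Python as follows. This is a minimal sketch, assuming tokenized responses, a stop word set supplied by the caller, and a token-level LCS between the target response and the retrieved response; the function names and the example stop words are hypothetical.

```python
def lcs_positions(a, b):
    """Return the indices in `b` of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    positions, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            positions.append(j)
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return positions


def proxy_skeleton(response, retrieved_response, stop_words, blank="blank"):
    """Return (proxy skeleton, proxy labels) for the retrieved response."""
    # Remove stop words, remembering positions in the retrieved response.
    target = [w for w in response if w not in stop_words]
    kept = [(idx, w) for idx, w in enumerate(retrieved_response) if w not in stop_words]
    # LCS over the filtered sequences, mapped back to original positions.
    keep_positions = {kept[k][0] for k in lcs_positions(target, [w for _, w in kept])}
    labels = [1 if i in keep_positions else 0 for i in range(len(retrieved_response))]
    skeleton = [w if lab else blank for w, lab in zip(retrieved_response, labels)]
    return skeleton, labels


skeleton, labels = proxy_skeleton(
    "yes banana is my favorite".split(),
    "yes apple is my favorite".split(),
    stop_words={"is", "my"},  # illustrative stop words only
)
print(skeleton, labels)
# ['yes', 'blank', 'blank', 'blank', 'favorite'] [1, 0, 0, 0, 1]
```

The proxy labels are the binary targets for the masking decisions, and the proxy skeleton is the retrieved response with every unlabeled token replaced by the placeholder.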

Joint Integration

To avoid breaking the differentiable computation, we connect the skeleton generator and the response generator via shared network architectures rather than by passing the discrete skeletons. Concretely, the last hidden states in our skeleton generator (i.e., the hidden states that are utilized to make the masking decisions) are directly used as the skeleton memories in response generation. The skeleton generation and response generation are considered as two tasks. For skeleton generation, the objective is to maximize the log likelihood of the proxy skeleton labels:


while for response generation, it is trained to maximize the following log likelihood:


The joint network is then trained to maximize two parts of log likelihood:


where is a harmonic weight, and it is set as in our experiments.

Cascaded Integration

Policy gradient methods [Williams 1992] can be applied to optimize the full model while keeping it running as a cascaded process. We regard the skeleton generator as the first RL agent, and the response generator as the second one. The pipeline produces an intermediate skeleton and then a final response. Given the original query and the generated response, a reward for generating that response is calculated. All network parameters are then optimized to maximize the expected reward by the policy gradient. According to the policy gradient theorem [Williams 1992], the gradient for the first agent is


and the gradient for the second agent is


The reward function should convey both the naturalness of the generated response and its relevance to the given query. A pre-trained critic is utilized to make the judgment. Inspired by comparative adversarial learning in [Li et al. 2018], we design the critic as a classifier that receives four inputs every time: the query, a human-written response, a machine-generated response and a random response (also written by a human). The critic is trained to correctly pick the human-written response among the others. Formally, the following objective is maximized:


where each input is represented by a vector produced by a bidirectional LSTM (its last hidden state), and the scoring uses a trainable matrix.² The reward for a generated response is defined as:


However, when randomly initialized, the skeleton generator and the response generator transmit noisy signals to each other, which leads to sub-optimal policies. We hence propose pre-training each component using Equations (7) and (8) sequentially.

Related Work

Multi-source Dialogue Generation. Chit-chat style dialogue systems date back to ELIZA [Weizenbaum 1966]. Early work uses handcrafted rules, while modern systems usually use data-driven approaches, e.g., information retrieval techniques. Recently, end-to-end neural approaches [Vinyals and Le 2015; Serban et al. 2016; Li et al. 2016a; Sordoni et al. 2015] have attracted increasing interest. For those generative models, a notorious problem is the “safe response” problem: the generated responses are dull and generic, which may be attributed to the lack of sufficient input information. The query alone cannot specify an informative response. To mitigate the issue, much research effort has been devoted to introducing other information sources, such as unsupervised latent variables [Serban et al. 2017; Zhao, Lee, and Eskenazi 2018; Cao and Clark 2017; Shen et al. 2017], discourse-level variations [Zhao, Zhao, and Eskenazi 2017], topic information [Xing et al. 2017], speaker personality [Li et al. 2016b] and knowledge bases [Ghazvininejad et al. 2018; Zhou et al. 2018]. Our work follows a similar motivation and uses the output of IR systems as the additional knowledge source.

Combination of IR and Generative Models. To combine IR and generative models, early work [Qiu et al. 2017] tried to re-rank the output from both models. However, the performance of such models is limited by the capacity of the individual methods. Most related to our work, [Song et al. 2016; Weston, Dinan, and Miller 2018] and [Wu et al. 2018] encoded the retrieved result into a distributed representation and used it as an additional condition along with the standard query representation. While the former two only used the target side of the retrieved pairs, the latter took advantage of both sides. In a closed domain conversation setting, [Pandey et al. 2018] further proposed to weight different training instances by context similarity. Our model differs from them in that we take an extra intermediate step of skeleton generation to filter the retrieval information before use, which proves effective in avoiding erroneous copying in our experiments.

Multi-step Language Generation. Our work is also inspired by recent successes of decomposing an end-to-end language generation task into several sequential sub-tasks. For document summarization, Chen and Bansal (2018) first select salient sentences and then rewrite them in parallel. For sentiment-to-sentiment translation, Xu et al. (2018) first use a neutralization module to remove emotional words and then add sentiment to the neutralized content. Not only does their decomposition improve the overall performance, but it also makes the whole generation process more interpretable. Our skeleton-to-response framework also sheds some light on the use of retrieval memories.



We use the preprocessed data in [Wu et al. 2018] as our test bed. The total dataset consists of about 20 million single-turn query-response pairs collected from Douban Group.³ Since similar contexts may correspond to totally different responses, the training quadruples for IR-augmented models are constructed based on response similarity. All responses are indexed by Lucene.⁴ For each pair, the top 30 similar responses with their corresponding contexts are retrieved. However, only those whose Jaccard distance to the original response falls within a bounded range are leveraged for training. The reason for this data filter is that nearly identical responses drive the model to simply copy, while very different responses make the model ignore the retrieval input. About 42 million quadruples are obtained afterward.
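The data filter can be illustrated with a short Python sketch. It assumes token-level Jaccard similarity (the distance is one minus the similarity); the bounds `low` and `high` used in the example are arbitrary placeholders and are not the thresholds used in the experiments, and the function names are hypothetical.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(union)


def keep_pair(response, retrieved_response, low, high):
    """Keep a retrieved response that is neither too close nor too far."""
    similarity = jaccard_similarity(response, retrieved_response)
    return low <= similarity <= high


# Placeholder bounds chosen only for this illustration.
LOW, HIGH = 0.25, 0.75
print(keep_pair("i like apple".split(), "i like banana".split(), LOW, HIGH))  # True
print(keep_pair("i like apple".split(), "i like apple".split(), LOW, HIGH))   # False
```

Near-duplicates are dropped because they encourage copying, and unrelated responses are dropped because they teach the model to ignore the retrieval input.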

For computational efficiency, we randomly sample 5 million quadruples as training data for all experiments. The test set consists of 1,000 randomly selected queries that are not in our training data.⁵ For a fair comparison, when training a generative model without the help of IR, the quadruples are split into pairs.

Model Details

We implement the skeleton generator based on a bidirectional recurrent neural network with 500 LSTM units. We concatenate the hidden states from both directions. The word embedding size is set to 300. For the response generator, the encoder for queries, the encoder for skeletons and the decoder are three two-layer recurrent neural networks with 500 LSTM units, where both encoders are bidirectional. We use dropout [Srivastava et al. 2014] to alleviate overfitting. The dropout rate is set to 0.3 across different layers. The same architecture for the encoders and the decoder is shared across the following baseline models, if applicable.

Compared Methods

  • Seq2Seq: the standard attention-based RNN encoder-decoder model [Bahdanau, Cho, and Bengio 2014].

  • MMI: Seq2Seq with the Maximum Mutual Information (MMI) objective in decoding [Li et al. 2016a]. In practice, an inverse (response-to-query) Seq2Seq model is used to rerank the n-best hypotheses from the standard Seq2Seq model (n equals 100 in our experiments).

  • EditVec: the model proposed in [Wu et al. 2018], where the edit vector is used directly at each decoding step by concatenating it to the word embeddings.

  • IR: the Lucene system is also used as a benchmark.⁶

  • IR+rerank: rerank the results of IR by MMI.

Besides, we use JNT to denote our model with joint integration, and CAS for our model with cascaded integration. To validate the usefulness of the proposed skeletons, we design a response generator that takes an intact retrieved response as its skeleton input (i.e., it completely skips the skeleton generation step), denoted by SKP.⁷

model       human score   dist-1   dist-2
IR          2.093         0.238    0.723
IR+rerank   2.520         0.208    0.586
-----------------------------------------
Seq2Seq     2.433         0.156    0.336
MMI         2.554         0.170    0.464
EditVec     2.588         0.154    0.394
SKP         2.581         0.152    0.406
-----------------------------------------
JNT         2.612         0.147    0.377
CAS         2.747         0.156    0.411
Table 1: Response performance of different models. Sign tests on human score show that the CAS is significantly better than all other methods with p-value , and the p-value except for those marked by .

Figure 3: Response quality vs. query similarity.⁸

model   P      R      F      Acc.
JNT     0.32   0.61   0.42   0.60
CAS     0.50   0.86   0.63   0.76
Table 2: Performance of skeleton generator.

Evaluation Metrics

Our method is designed to promote the informativeness of the generative model and alleviate the inappropriateness problem of the retrieval model. To measure the performance effectively, we use human evaluation along with two automatic evaluation metrics.

  • Human evaluation: We asked three experienced annotators to score the group of responses (the best output of each model) for 300 test queries. The responses are rated on a five-point scale. A response should be scored 1 if it can hardly be considered a valid response, 3 if it is a valid but not informative response, and 5 if it is an informative response that can deepen the discussion of the current topic or lead to a new topic. Scores of 2 and 4 are for borderline cases.

  • dist-1 & dist-2: They are defined as the number of unique uni-grams (dist-1) or bi-grams (dist-2) divided by the total number of tokens, measuring the diversity of the generated responses [Li et al. 2016a]. Note that the two metrics do not necessarily reflect the response quality, as the target queries are not taken into consideration.
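The dist-n metrics can be computed with a few lines of Python. This is a minimal sketch that follows the definition given above (unique n-grams divided by the total number of tokens), assuming the responses are already tokenized and the counts are pooled over all generated responses; the function name `distinct_n` is hypothetical.

```python
def distinct_n(responses, n):
    """Number of unique n-grams divided by the total number of tokens."""
    unique_ngrams = set()
    total_tokens = 0
    for tokens in responses:
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            unique_ngrams.add(tuple(tokens[i:i + n]))
    return len(unique_ngrams) / total_tokens if total_tokens else 0.0


generated = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]
print(distinct_n(generated, 1))  # 0.625 (5 unique uni-grams over 8 tokens)
print(distinct_n(generated, 2))  # 0.5   (4 unique bi-grams over 8 tokens)
```

Repetitive outputs share most of their n-grams, which lowers both scores.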

Response Generation Results

The results are depicted in Table 1. Overall, both of our models surpass all other methods, and our cascaded model (CAS) gives the best performance according to human evaluation. The contrast with the SKP model illustrates that the use of skeletons brings a significant performance gain.

According to the dist-2 metric, the generative models achieve better diversity by using retrieval results (e.g., 0.411 for CAS versus 0.336 for Seq2Seq), while dist-1 stays roughly unchanged. The retrieval method yields the highest diversity, which is consistent with our intuition that the retrieved responses typically contain a large amount of information though they are not necessarily appropriate. The MMI model also gives strong diversity, yet we find that it tends to merely repeat the words in queries. After removing the words that appear in queries, the dist-2 scores of MMI and CAS become 0.710 and 0.751 respectively. This indicates that our models are better at generating new words.

To further reveal the source of the performance gain, we study the relation between response quality and query similarity (measured by the Jaccard similarity between the input query and the retrieved query). Our best model (CAS) is compared with the strong IR system (IR+rerank) and the previous state-of-the-art (EditVec) in Fig. 3. The CAS model significantly boosts the performance when query similarity is relatively low, which indicates that introducing skeletons can alleviate erroneous copying and keep the strong generalization ability of the underlying generative model.

Table 3: Upper: Skeleton-to-response examples of the CAS model. Lower: Responses from different models are for comparison.

More Analysis of Our Framework

Here, we present further discussions and empirical analysis of our framework.

Generated Skeletons. Although generating skeletons is not our primary goal, it is interesting to assess the skeleton generation. The word-level precision (P), recall (R), F score (F) and accuracy (Acc.) of the well-trained skeleton generators are reported in Table 2, taking the proxy skeletons as golden references.

Table 3 shows some skeleton-to-response examples of the CAS model and a case study among different models. In the leftmost example in Table 3, MMI and EditVec simply repeat the query, while the retrieved response is only weakly related to the query. Our CAS model extracts a useful word “boy” from the retrieved response and generates a more interesting response. In the middle example, the MMI response makes less sense, and some private information is included in the retrieved response. Our CAS model removes the private information without loss of informativeness, while the outputs of the other models are less informative. The rightmost case shows that our response generator is able to recover from possible mistakes made by the skeleton generator.

Retrieved Response vs. Generated Response. To measure the extent to which the generative models copy the retrieval, we compute the edit distances between generated responses and retrieved responses. As shown in Fig. 4, in the comparison between SKP and the other models, the use of skeletons makes the generated response deviate more from its prototype response. Ideally, when the retrieved context is very similar to the input query, the changes between the generated response and the prototype response should be minor. Conversely, when the retrieved context differs substantially from the input query, the changes should be drastic. Fig. 4 also shows that our models can learn this intuition.
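The amount of change between a retrieved response and a generated response can be measured with a standard Levenshtein distance, sketched below. The sketch assumes a token-level distance with unit costs for insertion, deletion and substitution; the granularity and costs used for Fig. 4 are not specified here, so these choices and the function name are assumptions.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between sequences a and b."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(
                previous[j] + 1,         # delete token_a
                current[j - 1] + 1,      # insert token_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


retrieved = "yes apple is my favorite".split()
generated = "yes banana is my favorite".split()
print(edit_distance(retrieved, generated))  # 1
```

A distance of zero means the model copied its prototype verbatim, and larger values indicate heavier rewriting.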

Single vs. Multiple Retrieval Pair(s). For a given query, the retrieval pair set could contain multiple query-response pairs. We investigate two ways of using it under the CAS setting.

  • Single: For each retrieved query-response pair, a response is generated solely based on the input query and that pair. The resulting responses are re-ranked by generation probability.

  • Multiple: The whole retrieval set is used in a single run. Multiple skeletons are generated and concatenated in the response generation stage.

The results are shown in Table 4. We attribute the failure of Multiple to the huge variety of the retrieved responses. The response generator receives many heterogeneous skeletons, yet it has no idea which to use. It remains an open question on how to effectively use multiple retrieval pairs for generating one single response, and we leave it for future work.

Figure 4: Changes between retrieved and generated responses vs. query similarity.

setting    human score   dist-1   dist-2
Single     2.747         0.156    0.411
Multiple   1.976         0.178    0.414
Table 4: Comparison of the usages of the retrieval set.


In this paper, we proposed a new methodology to enhance generative models with information retrieval technologies for dialogue response generation. Given a dialogue context, our methods generate a skeleton based on historical responses that respond to a similar context. The skeleton serves as an additional knowledge source that helps specify the response direction and complement the response content. Experiments on real world data validated the effectiveness of our method for more informative and appropriate responses.


  1. Note the skeleton memory pool could contain multiple response skeletons, further discussed in the experiment section.
  2. Note the classifier could be fine-tuned with the training of our generators, which falls into the adversarial learning setting [?].
  5. Note the retrieval results for test data are based on query similarity, and no data filter is adopted.
  6. Note IR selects response candidates from the entire data collection, not restricted to the filtered one.
  7. There are some other IR-augmented models using standard seq2seq models as SKP. Weston, Dinan, and Miller (2018) used a rule to select either the generated response or the retrieved response as output, while we would like to focus on improving the quality of generated responses. Pandey et al. (2018) concentrated on closed domain conversations, and their hierarchical encoder is not suitable for our open domain setting. We thus omit the empirical comparison with them.
  8. We merge the ranges and due to the sparsity of highly similar pairs.


  1. Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
  2. Cao, K., and Clark, S. 2017. Latent variable dialogue models and their diversity. In EACL, 182–187.
  3. Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL.
  4. Ghazvininejad, M.; Brockett, C.; Chang, M.-W.; Dolan, B.; Gao, J.; Yih, W.-t.; and Galley, M. 2018. A knowledge-grounded neural conversation model. In AAAI, 5110–5117.
  5. Hu, B.; Lu, Z.; Li, H.; and Chen, Q. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS, 2042–2050.
  6. Ji, Z.; Lu, Z.; and Li, H. 2014. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988.
  7. Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In NAACL, 110–119.
  8. Li, J.; Galley, M.; Brockett, C.; Spithourakis, G. P.; Gao, J.; and Dolan, B. 2016b. A persona-based neural conversation model. In ACL, 994–1003.
  9. Li, D.; He, X.; Huang, Q.; Sun, M.-T.; and Zhang, L. 2018. Generating diverse and accurate visual captions by comparative adversarial learning. arXiv preprint arXiv:1804.00861.
  10. Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, 1412–1421.
  11. Pandey, G.; Contractor, D.; Kumar, V.; and Joshi, S. 2018. Exemplar encoder-decoder for neural conversation generation. In ACL, 1329–1338.
  12. Qiu, M.; Li, F.-L.; Wang, S.; Gao, X.; Chen, Y.; Zhao, W.; Chen, H.; Huang, J.; and Chu, W. 2017. Alime chat: A sequence to sequence and rerank based chatbot engine. In ACL, 498–503.
  13. Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, 3776–3784.
  14. Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 3295–3301.
  15. Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL, 1577–1586.
  16. Shen, X.; Su, H.; Li, Y.; Li, W.; Niu, S.; Zhao, Y.; Aizawa, A.; and Long, G. 2017. A conditional variational framework for dialog generation. In ACL, 504–509.
  17. Song, Y.; Yan, R.; Li, X.; Zhao, D.; and Zhang, M. 2016. Two are better than one: An ensemble of retrieval-and generation-based dialog systems. arXiv preprint arXiv:1610.07149.
  18. Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.-Y.; Gao, J.; and Dolan, B. 2015. A neural network approach to context-sensitive generation of conversational responses. In NAACL, 196–205.
  19. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
  20. Vinyals, O., and Le, Q. 2015. A neural conversational model. In ICML (Deep Learning Workshop).
  21. Wagner, R. A., and Fischer, M. J. 1974. The string-to-string correction problem. Journal of the ACM (JACM) 21(1):168–173.
  22. Weizenbaum, J. 1966. Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM 9(1):36–45.
  23. Weston, J.; Dinan, E.; and Miller, A. H. 2018. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776.
  24. Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
  25. Wu, Y.; Wei, F.; Huang, S.; Li, Z.; and Zhou, M. 2018. Response generation by context-aware prototype editing. arXiv preprint arXiv:1806.07042.
  26. Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; and Ma, W.-Y. 2017. Topic aware neural response generation. In AAAI, 3351–3357.
  27. Xu, J.; Sun, X.; Zeng, Q.; Ren, X.; Zhang, X.; Wang, H.; and Li, W. 2018. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. In ACL, 675–686.
  28. Zhao, T.; Lee, K.; and Eskenazi, M. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In ACL, 1098–1107.
  29. Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, 654–664.
  30. Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 4623–4629.