Retrospective and Prospective Mixture-of-Generators for Task-oriented Dialogue Response Generation

Dialogue response generation (DRG) is a critical component of task-oriented dialogue systems (TDSs). Its purpose is to generate proper natural language responses given some context, e.g., historical utterances, system states, etc. State-of-the-art work focuses on how to better tackle DRG in an end-to-end way. Typically, such studies assume that each token is drawn from a single distribution over the output vocabulary, which may not always be optimal. Responses vary greatly with different intents, e.g., domains, system actions. We propose a novel mixture-of-generators network (MoGNet) for DRG, where we assume that each token of a response is drawn from a mixture of distributions. MoGNet consists of a chair generator and several expert generators. Each expert is specialized for DRG w.r.t. a particular intent. The chair coordinates multiple experts and combines the output they have generated to produce more appropriate responses. We propose two strategies to help the chair make better decisions, namely, a retrospective mixture-of-generators (RMoG) and a prospective mixture-of-generators (PMoG). The former only considers the historical expert-generated responses until the current time step while the latter also considers possible expert-generated responses in the future by encouraging exploration. In order to differentiate experts, we also devise a global-and-local (GL) learning scheme that forces each expert to be specialized towards a particular intent using a local loss and trains the chair and all experts to coordinate using a global loss. We carry out extensive experiments on the MultiWOZ benchmark dataset. MoGNet significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations, demonstrating its effectiveness for DRG.

1 Introduction

Figure 1: Density of the relative token frequency distribution for different intents (domains in the top plot, system actions in the bottom plot). We use kernel density estimation to estimate the probability density function of a random variable from a relative token frequency distribution.
Figure 2: Overview of MoGNet. It illustrates how the model generates one token of the response, given the dialogue context as input, in the process of generating the whole sequence as a dialogue response.

Task-oriented dialogue systems (TDSs) have sparked considerable interest due to their broad applicability, e.g., for booking flight tickets or scheduling meetings [32, 34]. Existing TDS methods can be divided into two broad categories: pipeline multiple-module models [2, 5, 34] and end-to-end single-module models [11, 30]. The former decomposes the TDS task into sequentially dependent modules that are addressed by separate models while the latter proposes to use an end-to-end model to solve the entire task. In both categories, there are many factors to consider in order to achieve good performance, such as user intent understanding [31], dialogue state tracking [37], and dialogue response generation (DRG). Given a dialogue context (dialogue history, states, retrieved results from a knowledge base, etc.), the purpose of DRG is to generate a proper natural language response that leads to task-completion, i.e., successfully achieving specific goals, and that is fluent, i.e., generating natural and fluent utterances.

Recently proposed DRG methods have achieved promising results (see, e.g., LaRLAttnGRU [36]). However, when generating a response, all current models assume that each token is drawn from a single distribution over the output vocabulary. This may be unreasonable because responses vary greatly with different intents, where intent may refer to domain, system action, or other criteria for partitioning responses, e.g., the source of dialogue context [24]. To support this claim, consider the training set of the Multi-domain Wizard-of-Oz (MultiWOZ) benchmark dataset [4], where 67.4% of the dialogues span multiple domains and all of the dialogues span multiple types of system actions. We plot the density of the relative token frequency distributions in responses of different intents over the output vocabulary in Fig. 1. Although there is some overlap among distributions, there are also clear differences. For example, a token can have a high probability of being drawn from the distribution for the intent of booking an attraction, but not from the distribution for booking a taxi. Thus, we hypothesize that a response should be drawn from a mixture of distributions for multiple intents rather than from a single distribution for a general intent.
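The density comparison described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a Gaussian kernel with a hand-picked bandwidth; the function names `relative_frequencies` and `gaussian_kde` are hypothetical, and the exact estimator and settings behind the figure are not specified in the text.

```python
import math
from collections import Counter


def relative_frequencies(responses):
    """Relative frequency of each token over a list of response strings."""
    counts = Counter(tok for resp in responses for tok in resp.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from `samples` with a Gaussian kernel.

    The bandwidth is an assumed free parameter, not a value from the paper.
    """
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


# Toy responses for two intents (illustrative data only).
train_freqs = relative_frequencies(["the train leaves at noon", "the train is booked"])
taxi_freqs = relative_frequencies(["the taxi is booked", "a taxi will arrive"])
train_density = gaussian_kde(list(train_freqs.values()), bandwidth=0.05)
taxi_density = gaussian_kde(list(taxi_freqs.values()), bandwidth=0.05)
```

Evaluating `train_density` and `taxi_density` on a common grid of relative frequencies gives curves that can be compared across intents.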

We propose a mixture-of-generators network (MoGNet) for DRG, which consists of a chair generator and several expert generators. Each expert is specialized for a particular intent, e.g., one domain, or one type of action of a system, etc. The chair coordinates multiple experts and generates the final response by taking the utterances generated by the experts into consideration. Compared with previous methods, the advantages of MoGNet are at least two-fold: First, the specialization of different experts and the use of a chair for combining the outputs break the bottleneck of a single model [10, 19]. Second, it is more easily traceable: we can analyze who is responsible when the model makes a mistake and generates an inappropriate response.

We propose two strategies to help the chair make good decisions, i.e., retrospective mixture-of-generators (RMoG) and prospective mixture-of-generators (PMoG). RMoG only considers the retrospective utterances generated by the experts, i.e., the utterances generated by all the experts prior to the current time step. However, a chair without a long-range vision is likely to make sub-optimal decisions. Consider, for example, these two responses: “what day will you be traveling?” and “what day and time would you like to travel?” If we only consider these responses until the 2nd token (which RMoG does), then the chair might choose the first response due to the absence of a more long-range view of the important token “time” located after the 2nd token. Hence, we also propose PMoG, which enables the chair to make full use of the prospective predictions of experts as well.

To effectively train MoGNet, we devise a global-and-local (GL) learning scheme. The local loss is defined on a segment of data with a certain intent, which forces each expert to specialize. The global loss is defined on all data, which forces the chair and all experts to coordinate with each other. The global loss can also improve data utilization by enabling the backpropagation error of each data sample to influence all experts as well as the chair.

To verify the effectiveness of MoGNet, we carry out experiments on the MultiWOZ benchmark dataset. MoGNet significantly outperforms state-of-the-art DRG methods, improving over the best performing model on this dataset by 5.64% in terms of overall performance (0.5*Inform + 0.5*Success + BLEU) and 0.97 in terms of response generation quality (Perplexity).

The main contributions of this paper are:

  • a novel MoGNet model that is the first framework that devises chair and expert generators for DRG, to the best of our knowledge;

  • two novel coordination mechanisms, i.e., RMoG and PMoG, to help the chair make better decisions; and

  • a global-and-local (GL) learning scheme to differentiate experts and fuse data efficiently.

2 Mixture-of-Generators Network

We focus on task-oriented DRG (a.k.a. the context-to-text generation task [4]). Formally, given a current dialogue context $X = (U, B, D)$, where $U$ is a combination of previous utterances, $B$ are the belief states, and $D$ are the retrieved database results based on $B$, the goal of task-oriented DRG is to generate a fluent natural language response $Y = (y_1, \ldots, y_n)$ that contains appropriate system actions to help users accomplish their task goals, e.g., booking a flight ticket. We propose MoGNet to model the generation probability $P(Y \mid X)$.

2.1 Overview

The MoGNet framework consists of two types of roles:

  • $k$ expert generators, each of which is specialized for a particular intent, e.g., a domain, a type of action of a system, etc. Let $\mathcal{D}$ denote a dataset with $|\mathcal{D}|$ independent samples of $(X, Y)$. Expert-related intents partition $\mathcal{D}$ into $k$ pieces $\mathcal{S} = \{\mathcal{S}_1, \ldots, \mathcal{S}_k\}$, where $\mathcal{S}_l \subseteq \mathcal{D}$. Then $\mathcal{S}_l$ is used to train the $l$-th expert by predicting $P^l(Y \mid X)$. We expect the $l$-th expert to perform better than the others on $\mathcal{S}_l$.

  • a chair generator, which learns to coordinate a group of experts to make an optimal decision. The chair is trained to predict $P(Y \mid X)$, where $(X, Y)$ is a sample from $\mathcal{D}$.
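The intent-based split of the dataset can be sketched as follows. This is a minimal illustration, assuming each sample carries a list of intent labels under a hypothetical `"intents"` key; a sample with several intents is copied to every matching expert, and the fallback to an `"UNK"` subset for unlabeled samples is an assumption.

```python
def partition_by_intent(samples, intents):
    """Group samples into one training subset per expert intent.

    A sample with several intents is assigned to each corresponding subset.
    Samples without a known intent go to "UNK" (an assumed fallback).
    """
    subsets = {intent: [] for intent in intents}
    for sample in samples:
        matched = [i for i in sample["intents"] if i in subsets]
        for intent in matched or ["UNK"]:
            subsets.setdefault(intent, []).append(sample)
    return subsets


# Toy data: sample 2 spans two domains, sample 3 has no labeled domain.
data = [
    {"id": 1, "intents": ["Train"]},
    {"id": 2, "intents": ["Train", "Hotel"]},
    {"id": 3, "intents": []},
]
subsets = partition_by_intent(data, ["Train", "Hotel", "UNK"])
```

Each expert would then be trained on its own subset, while the chair sees every sample.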

Fig. 2 shows our implementation of MoGNet; it consists of three types of components, i.e., a shared context encoder, expert decoders, and a chair decoder.

2.2 Shared context encoder

The role of the shared context encoder is to read the dialogue context and construct a representation. We follow Budzianowski et al. [3] and model the current dialogue context as a combination of user utterances $U$, belief states $B$, and retrieval results from a database $D$.

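First, we employ a Recurrent Neural Network (RNN) [7] to map a sequence of input tokens $(w_1, \ldots, w_m)$ to hidden vectors $(\mathbf{h}_1, \ldots, \mathbf{h}_m)$. The hidden vector $\mathbf{h}_i$ at the $i$-th step can be represented as:

$$\mathbf{h}_i = \mathrm{RNN}(\mathbf{e}_{w_i}, \mathbf{h}_{i-1}), \tag{1}$$

where $\mathbf{e}_{w_i}$ is the embedding of the token $w_i$. The initial state of the RNN, $\mathbf{h}_0$, is set to 0.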

Then, we represent the current dialogue context $\mathbf{x}$ as a combination of the user utterance representation $\mathbf{u}$, the belief state vector $\mathbf{b}$, and the database vector $\mathbf{d}$:

$$\mathbf{x} = \tanh(\mathbf{W}_u \mathbf{u} + \mathbf{W}_b \mathbf{b} + \mathbf{W}_d \mathbf{d}), \tag{2}$$

where $\mathbf{u}$ is the final hidden state $\mathbf{h}_m$ from Eq. 1; $\mathbf{b}$ is a 0-1 vector with each dimension representing a state (slot-value pair); $\mathbf{d}$ is also a 0-1 vector, which is built by querying the database with the current state $\mathbf{b}$. Each dimension of $\mathbf{d}$ represents a particular result from the database (e.g., whether a flight ticket is available).
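As a rough illustration of how the three inputs might be fused into one context vector, the sketch below applies a `tanh` over a sum of linear projections. This particular combination function and the names `matvec`, `context_vector`, `W_u`, `W_b`, and `W_d` are assumptions made for the example, written in plain Python so that it runs without extra dependencies.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def context_vector(u, b, d, W_u, W_b, W_d):
    """Combine utterance, belief-state, and database vectors into one vector.

    Assumed form: tanh(W_u u + W_b b + W_d d).
    """
    parts = [matvec(W_u, u), matvec(W_b, b), matvec(W_d, d)]
    return [math.tanh(sum(vals)) for vals in zip(*parts)]


# Toy inputs: a 2-d utterance state, a 3-slot belief vector, a 2-d database vector.
u = [1.0, 0.0]
b = [1.0, 0.0, 1.0]
d = [0.0, 1.0]
W_u = [[0.5, 0.0], [0.0, 0.5]]
W_b = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0]]
W_d = [[0.0, 0.2], [0.0, 0.0]]
x = context_vector(u, b, d, W_u, W_b, W_d)
```

The resulting vector has the dimensionality of the projection outputs and can serve as the initial state of a decoder.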

2.3 Expert decoder

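Given the current dialogue context $X$ and the current decoded tokens $(y_0, \ldots, y_{j-1})$, the $l$-th expert outputs the probability $p^l_j$ over the vocabulary at the $j$-th step by:

$$\mathbf{s}^l_j = \mathrm{RNN}(\mathbf{e}_{y_{j-1}} \oplus \mathbf{c}^l_j, \mathbf{s}^l_{j-1}), \qquad p^l_j = \mathrm{softmax}(\mathbf{W}^l \mathbf{s}^l_j + \mathbf{b}^l), \tag{3}$$

where $\mathbf{W}^l$ is the parameter matrix and $\mathbf{b}^l$ is bias; $\mathbf{s}^l_j$ is the state vector, which is initialized by the dialogue context vector from the shared context encoder, i.e., $\mathbf{s}^l_0 = \mathbf{x}$; $\mathbf{e}_{y_{j-1}}$ is the embedding of the generated token at time step $j-1$; $\oplus$ is the concatenation operation; $\mathbf{c}^l_j$ is the context vector which is calculated with a concatenation attention mechanism [1, 18] over the hidden representations from a shared context encoder as follows:

$$\mathbf{c}^l_j = \sum_{i=1}^{m} \alpha^l_{ji} \mathbf{h}_i, \qquad \alpha^l_{ji} = \frac{\exp(w^l_{ji})}{\sum_{i'=1}^{m} \exp(w^l_{ji'})}, \qquad w^l_{ji} = {\mathbf{v}^l_a}^{\top} \tanh\big(\mathbf{W}^l_a (\mathbf{h}_i \oplus \mathbf{s}^l_{j-1}) + \mathbf{b}^l_a\big), \tag{4}$$

where $\alpha^l_{ji}$ is a set of attention weights; $\oplus$ is the concatenation operation. $\mathbf{W}^l_a$, $\mathbf{b}^l_a$, $\mathbf{v}^l_a$ are learnable parameters, which are not shared by different experts in our experiments.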
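A concatenation-style attention step of the kind described above can be sketched as follows. This is a minimal plain-Python illustration, assuming the common scoring form in which each encoder hidden state is concatenated with the decoder state, passed through a `tanh` layer, and projected to a scalar; the parameter names `W_a`, `b_a`, and `v_a` are illustrative.

```python
import math


def concat_attention(hidden_states, state, W_a, b_a, v_a):
    """Return (attention weights, context vector) for one decoding step.

    hidden_states: encoder vectors; state: previous decoder state (lists of floats).
    Score for each hidden state h: v_a . tanh(W_a [h ; state] + b_a).
    """
    scores = []
    for h in hidden_states:
        z = h + state  # list concatenation plays the role of the concat operator
        pre = [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W_a, b_a)]
        scores.append(sum(v * math.tanh(p) for v, p in zip(v_a, pre)))
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return weights, context


# Toy example: two encoder states of size 2 and a decoder state of size 1.
weights, context = concat_attention(
    [[1.0, 2.0], [0.0, 1.0]], [0.5],
    W_a=[[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], b_a=[0.0, 0.1], v_a=[1.0, -1.0],
)
```

Giving every expert its own copy of these parameters lets each one attend to different parts of the same encoded context.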

2.4 Chair decoder

Given the current dialogue context $X$ and the current decoded tokens $(y_0, \ldots, y_{j-1})$, the chair decoder estimates the final token prediction distribution $p_j$ by combining the prediction probabilities from $k$ experts. Here, we consider two strategies to leverage the prediction probabilities from experts, i.e., RMoG and PMoG. The former only considers expert generator outputs from history (until the $j$-th time step), which follows the typical neural Mixture-of-Experts (MoE) architecture [25, 27]. We propose the latter to make the chair generator envision the future (i.e., after the $j$-th time step) by exploring expert generator outputs from $t$ extra steps.

Specifically, the chair determines the prediction as follows:


where is the prediction probability from the chair itself; is the prediction probability from expert ; are normalized coordination coefficients, which are calculated as:


, and are estimated w.r.t. , and , respectively. is a list of retrospective decoding outputs from all experts, which is defined as follows:


where is a special token “[BOS]” indicating the start of decoding; is the output of expert from the 1-st to the -th step using Eq. 3; is a list of prospective decoding outputs from all experts, which is defined as follows:


where are the outputs of expert from the -th to ()-th step. We obtain by forcing expert to generate steps using Eq. 3 based on the current generated tokens .
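The core idea of the chair, drawing each token from a mixture of the chair's own distribution and the experts' distributions, can be sketched as below. The sketch deliberately does not reproduce how the coordination coefficients are computed from retrospective and prospective expert outputs; it assumes one unnormalized score per generator is already available (the `scores` argument is a placeholder) and only shows the normalization and mixing step.

```python
import math


def softmax(scores):
    """Normalize unnormalized scores into coefficients that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_distributions(chair_probs, expert_probs, scores):
    """Mix the chair's and the experts' token distributions.

    scores: one placeholder score for the chair followed by one per expert.
    Returns a distribution over the vocabulary.
    """
    betas = softmax(scores)
    dists = [chair_probs] + expert_probs
    vocab = len(chair_probs)
    return [sum(beta * dist[v] for beta, dist in zip(betas, dists)) for v in range(vocab)]


# Toy example with a two-token vocabulary and two experts.
chair = [0.5, 0.5]
experts = [[0.9, 0.1], [0.2, 0.8]]
mixed = mix_distributions(chair, experts, scores=[0.0, 2.0, 0.0])
```

Because the coefficients are normalized, the mixture is again a valid probability distribution, and a high score for one expert pulls the final prediction toward that expert's output.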

2.5 Learning scheme

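We devise a global-and-local learning scheme to train MoGNet. Each expert $l$ is optimized by a localized expert loss defined on $\mathcal{S}_l$, which forces each expert to specialize on one of the portions of data $\mathcal{S}$. We use cross-entropy loss for each expert and the joint loss for all experts is as follows:

$$\mathcal{L}_{\mathit{experts}} = -\sum_{l=1}^{k} \sum_{r=1}^{|\mathcal{S}_l|} \sum_{j=1}^{n} \mathbf{y}_j^{\top} \log p^l_j, \tag{9}$$

where $p^l_j$ is the token prediction by expert $l$ (Eq. 3) computed on the $r$-th data sample; $\mathbf{y}_j$ is a one-hot vector indicating the ground truth token at $j$.

We also design a global chair loss to differentiate the losses incurred from different experts. The chair can attribute the source of errors to the expert in charge. For each data sample in $\mathcal{D}$, we calculate the combined token prediction $p_j$ (Eq. 5). Then the global loss becomes:

$$\mathcal{L}_{\mathit{chair}} = -\sum_{r=1}^{|\mathcal{D}|} \sum_{j=1}^{n} \mathbf{y}_j^{\top} \log p_j. \tag{10}$$

Our overall optimization follows the joint learning paradigm that is defined as a weighted combination of constituent losses:

$$\mathcal{L} = \lambda \cdot \mathcal{L}_{\mathit{experts}} + (1 - \lambda) \cdot \mathcal{L}_{\mathit{chair}}, \tag{11}$$

where $\lambda$ is a hyper-parameter to regulate the importance between the experts and the chair for optimizing the loss.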
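The weighted combination of local and global losses can be sketched for a single training sample as follows. This is an illustrative plain-Python version, assuming that a weight `lam` scales the local expert loss and `1 - lam` scales the global chair loss; the function names and the per-sample formulation are assumptions for the example.

```python
import math


def cross_entropy(pred_dists, target_ids):
    """Sum of negative log-probabilities of the ground-truth tokens."""
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, target_ids))


def gl_loss(expert_preds, chair_preds, target_ids, lam=0.5):
    """Global-and-local loss for one sample.

    expert_preds: per-step distributions from each expert in charge of the
    sample's intent(s); chair_preds: per-step mixed distributions from the chair.
    """
    local = sum(cross_entropy(preds, target_ids) for preds in expert_preds)
    global_ = cross_entropy(chair_preds, target_ids)
    return lam * local + (1.0 - lam) * global_


# Toy example: a two-token response over a two-word vocabulary, one expert in charge.
targets = [0, 1]
expert_preds = [[[0.5, 0.5], [0.25, 0.75]]]
chair_preds = [[0.5, 0.5], [0.5, 0.5]]
loss = gl_loss(expert_preds, chair_preds, targets, lam=0.5)
```

Setting `lam=0.0` leaves only the global term, so no expert receives an intent-specific training signal, while `lam=1.0` leaves only the local term.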

3 Experimental Setup

3.1 Research questions

We seek to answer the following research questions:

(RQ1) Does MoGNet outperform state-of-the-art end-to-end single-module DRG models?

(RQ2) How does the choice of a particular coordination mechanism (i.e., RMoG, PMoG, or neither of the two) affect the performance of MoGNet?

(RQ3) How does the GL learning scheme compare to using the general global learning as a learning scheme?

3.2 Dataset

Our experiments are conducted on the Multi-domain Wizard-of-Oz (MultiWOZ) [4] dataset. This is the latest large-scale human-to-human TDS dataset with rich semantic labels, e.g., domains and dialogue actions, and benchmark results of response generation. MultiWOZ consists of 10k natural conversations between a tourist and a clerk. It has 6 specific action-related domains, i.e., Attraction, Hotel, Restaurant, Taxi, Train, and Booking, and 1 universal domain, i.e., General. 67.4% of the dialogues are cross-domain, covering 2–5 domains on average. The average number of turns per dialogue is 13.68; a turn contains 13.18 tokens on average. The dataset is randomly split into 8,438/1,000/1,000 dialogues for training, validation, and testing, respectively.

3.3 Model variants and baselines

We consider a number of variants of the proposed mixture-of-generators model:

  • MoGNet: the proposed model with RMoG, PMoG, and the GL learning scheme.

  • MoGNet-P: the model without prospection ability, obtained by removing the PMoG coordination mechanism from MoGNet.

  • MoGNet-P-R: the model obtained by removing both coordination mechanisms while retaining the GL learning scheme.

  • MoGNet-GL: the model that removes the GL learning scheme from MoGNet.

See Table 1 for a summary. Without further indications, the intents used are based on identifying eight different domains: Attraction, Booking, Hotel, Restaurant, Taxi, Train, General, and UNK.

Model        Chair   Retrospective   Prospective   λ
MoGNet       True    True            True          0.5
MoGNet-P     True    True            False         0.5
MoGNet-P-R   True    False           False         0.5
MoGNet-GL    True    True            True          0.0
The first three columns refer to the chair, retrospective, and prospective components from Eq. 5. “True” means we preserve it and learn it as it is. “False” means we remove it (set it to 0). λ is from Eq. 11 and we report two settings, 0.0 and 0.5. See § 5.2.
Table 1: Model variants.

To answer RQ1, we compare MoGNet with the following methods that have reported results on this task according to the official leaderboard.

  • S2SAttnLSTM. We follow the dominant Sequence-to-Sequence (Seq2Seq) model under an encoder-decoder architecture [5] and reproduce the benchmark baseline, i.e., single-module model named S2SAttnLSTM [4, 3], based on the source code provided by the authors. See footnote 4.

  • S2SAttnGRU. A variant of S2SAttnLSTM, with Gated Recurrent Units instead of LSTMs and other settings kept the same.

  • Structured Fusion. It learns the traditional dialogue modules and then incorporates these pre-trained sequentially dependent modules into end-to-end dialogue models by structured fusion networks [20].

  • LaRLAttnGRU. The state-of-the-art model [36], which uses reinforcement learning and models system actions as latent variables. LaRLAttnGRU uses ground truth system action annotations and user goals to estimate the rewards for reinforcement learning during training.

3.4 Evaluation metrics

We use the following commonly used evaluation metrics [4, 36]:

  • Inform: the fraction of responses that provide a correct entity out of all responses.

  • Success: the fraction of responses that answer all the requested attributes out of all responses.

  • BLEU: for comparing the overlap between a generated response to one or more reference responses.

  • Score: defined as $0.5 \times \mathrm{Inform} + 0.5 \times \mathrm{Success} + \mathrm{BLEU}$. This measures the overall performance in terms of both task completion and response fluency [20].

  • PPL: denotes the perplexity of the generated responses, which is defined as the exponentiation of the entropy. This measures how well a probabilistic DRG model predicts a token in a response generation process.
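The two aggregate metrics can be sketched in a few lines. `combined_score` follows the Score definition directly; `perplexity` uses the common formulation as the exponential of the average negative log-likelihood per token, which is an assumption about how the entropy is estimated. Both function names are hypothetical and this is not the official evaluation toolkit.

```python
import math


def combined_score(inform, success, bleu):
    """Overall Score: 0.5 * Inform + 0.5 * Success + BLEU (all in percentage points)."""
    return 0.5 * inform + 0.5 * success + bleu


def perplexity(token_probs):
    """Perplexity from the model probabilities assigned to the reference tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# MoGNet row of Table 2: Inform 85.30, Success 73.30, BLEU 20.13.
score = combined_score(85.30, 73.30, 20.13)  # approximately 99.43

# A model that assigns probability 0.25 to every reference token has perplexity of about 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Plugging in the other rows of Table 2 reproduces their reported Score values in the same way, e.g., 71.33, 60.96, and 18.90 give approximately 85.05.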

We use the toolkit released by Budzianowski et al. [3] to compute the metrics. Following their settings, we also use Score as the selection criterion to choose the best model on the validation set and report the performance of the model on the test set. We use a paired t-test to measure statistical significance of relative improvements.

3.5 Implementation details

Theoretically, the training time complexity of each data sample is , where is the number of response tokens. To reduce the computation cost, we assign and compute the expert prediction with Eq. 3. This means that the chair will make a final decision only after all the experts have decoded their final tokens. Thus, the time complexity decreases to .

For a fair comparison, the vocabulary size is the same as Budzianowski et al. [4], which has 400 tokens. Out-of-vocabulary words are replaced with “[UNK]”. We set the word embedding size to 50 and all GRU hidden state sizes to 150. We use Adam [13] as our optimization algorithm with hyperparameters , , and . We also apply gradient clipping [22] with range [–5, 5] during training. We use regularization to alleviate overfitting, the weight of which is set to . We set the mini-batch size to 64. We use greedy search to generate the responses during testing. Please note that if a data point has multiple intents, then we assign it to each corresponding expert, respectively. The code is available online.5

4 Results

4.1 Automatic evaluation

We evaluate the overall performance of MoGNet and the comparable baselines on the metrics defined in §3.4. The results are shown in Table 2. First of all, MoGNet outperforms all baselines by a large margin in terms of overall performance metric, i.e., satisfaction Score.

Model                    BLEU     Inform   Success   Score   PPL
S2SAttnLSTM              18.90%   71.33%   60.96%    85.05   3.98
S2SAttnGRU               18.21%   81.50%   68.80%    93.36   4.12
Structured Fusion [20]   16.34%   82.70%   72.10%    93.74   n/a
LaRLAttnGRU [36]         12.80%   82.78%   79.20%    93.79   5.22
MoGNet                   20.13%   85.30%   73.30%    99.43   4.25
Bold face indicates leading results. Significant improvements over the best baseline are marked (paired t-test).
Table 2: Comparison results of MoGNet and the baselines.

It significantly outperforms the state-of-the-art baseline LaRLAttnGRU by 5.64% (Score) and 0.97 (PPL). Thus, MoGNet not only improves the satisfaction of responses but also improves the quality of the language modeling process. MoGNet also achieves more than 6.07% overall improvement over the benchmark baseline S2SAttnLSTM and its variant S2SAttnGRU. This demonstrates the effectiveness of the proposed MoGNet model.

Second, LaRLAttnGRU achieves the highest performance in terms of Success, followed by MoGNet. However, it results in a 7.33% decrease in BLEU and a 2.56% decrease in Inform compared to MoGNet. Hence, LaRLAttnGRU is good at answering all requested attributes but not as good at providing more appropriate entities with high fluency as MoGNet. LaRLAttnGRU tends to generate more slot values to increase the probability of answering the requested attributes. Take an extreme case as an example: if we force a model to generate all tokens with slot values, then it will achieve an extremely high Success but a low BLEU.

Third, S2SAttnLSTM is the worst model in terms of overall performance (Score). But it achieves the best PPL. It tends to generate frequent tokens from the vocabulary which exhibits better language modeling characteristics. However, it fails to provide useful information (the requested attributes) to meet the user goals. By contrast, MoGNet improves the user satisfaction (i.e., Score) greatly and achieves response fluency by taking specialized generations from all experts into account.

4.2 Human evaluation

To further understand the results in Table 2, we conducted a human evaluation of the generated responses from S2SAttnGRU, LaRLAttnGRU, and MoGNet. We ask workers on Amazon Mechanical Turk (AMT) to read the dialogue context, and choose the responses that satisfy the following criteria: (i) Informativeness measures whether the response provides appropriate information that is requested by the user query. No extra inappropriate information is provided. (ii) Consistency measures whether the generated response is semantically aligned with the ground truth response. (iii) Satisfactory measures whether the response has an overall satisfactory performance promising both Informativeness and Consistency. As with existing studies [20], we sample one hundred context-response pairs to do human evaluation. Each sample is labeled by three workers. The workers are asked to choose either all responses that satisfy the specific criteria or the “NONE” option, which denotes that none of the responses satisfy the criteria. To make sure that the annotations are of high quality, we calculate the fraction of the responses that satisfy each criterion out of all responses that pass the golden test. That is, we only consider the data from the workers who have chosen the golden response as an answer.

                  S2SAttnGRU         LaRLAttnGRU        MoGNet
Criterion         loose    strict    loose    strict    loose    strict
Informativeness   56.79%   31.03%    76.54%   44.83%    80.25%   53.45%
Consistency       45.21%   23.53%    71.23%   39.22%    80.82%   50.98%
Satisfactory      26.79%   25.00%    44.64%   21.88%    60.71%   37.50%
Bold face indicates the best results. The loose and strict conditions differ in the minimum number of AMT workers that must regard a response as good w.r.t. Informativeness, Consistency and Satisfactory.
Table 3: Results of human evaluation.

The results are displayed in Table 3. MoGNet performs better than S2SAttnGRU and LaRLAttnGRU on Informativeness because it frequently outputs responses that provide richer information (compared with S2SAttnGRU) and less extra inappropriate information (compared with LaRLAttnGRU). MoGNet also obtains the best results on Consistency, which means MoGNet is able to generate responses that are semantically similar to the golden responses with large overlaps. LaRLAttnGRU outperforms S2SAttnGRU in all cases except for Satisfactory under the strict condition. This reveals that balancing between Informativeness and Consistency makes it difficult for the AMT workers to assess the overall quality measured by Satisfactory. In this case, MoGNet receives the most votes on Satisfactory under the strict condition as well as the loose condition. This shows that the workers consider the responses from MoGNet more appropriate than those of the other two models, with a high degree of agreement. To sum up, MoGNet is able to generate user-favored responses in addition to the improvements for automatic metrics.
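The aggregation behind such percentages can be sketched as follows. The data layout, the `"golden"` option label, and the vote thresholds are assumptions for illustration: each annotation is the set of options one worker selected for one sample, annotations that do not include the golden response are discarded, and a response counts as good when at least `min_votes` of the remaining workers selected it.

```python
def approval_rate(annotations, model, min_votes):
    """Fraction of samples on which at least `min_votes` trusted workers chose `model`.

    annotations: list of (sample_id, chosen_options) pairs, one per worker and sample.
    An annotation is trusted only if it contains the assumed "golden" option.
    """
    votes = {}
    for sample_id, chosen in annotations:
        if "golden" not in chosen:
            continue  # the worker failed the golden test for this sample
        votes.setdefault(sample_id, 0)
        if model in chosen:
            votes[sample_id] += 1
    if not votes:
        return 0.0
    return sum(v >= min_votes for v in votes.values()) / len(votes)


# Toy annotations for two samples (illustrative only).
annotations = [
    (1, {"golden", "MoGNet"}),
    (1, {"golden", "MoGNet", "LaRLAttnGRU"}),
    (1, {"MoGNet"}),  # discarded: golden response not selected
    (2, {"golden"}),
    (2, {"golden", "MoGNet"}),
]
loose = approval_rate(annotations, "MoGNet", min_votes=1)
strict = approval_rate(annotations, "MoGNet", min_votes=2)
```

A stricter threshold can only lower the rate, which matches the pattern of each model's paired columns in Table 3.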

4.3 Coordination mechanisms

In Table 4 we contrast the effectiveness of different coordination mechanisms. We can see that MoGNet-P loses 4.32% in overall performance, with a 0.62% decrease in BLEU, a 5.90% decrease in Inform, and a 1.50% decrease in Success. This shows that the prospection design of the PMoG mechanism is beneficial to both task completion and response fluency. In particular, most improvements come from providing more correct entities while improving generation fluency. MoGNet-P-R loses 2.62% in Score, with a 1.97% decrease in BLEU, a 0.2% decrease in Inform, and a 1.10% decrease in Success. Thus, the MoGNet framework is effective thanks to its design with two types of roles: the chair and the experts.

Model        BLEU     Inform   Success   Score   PPL
MoGNet       20.13%   85.30%   73.30%    99.43   4.25
MoGNet-P     19.51%   79.40%   71.80%    95.11   4.19
MoGNet-P-R   18.16%   85.10%   72.20%    96.81   4.12
Underlined results indicate the worst results with a statistically significant decrease compared to MoGNet (paired t-test).
Table 4: The impact of coordination mechanisms.

4.4 Learning scheme

We use MoGNet-GL to refer to the model that removes the GL learning scheme from MoGNet and uses the general global learning instead. MoGNet-GL results in a sharp reduction of 6.95% in overall performance, with decreases of 0.80% in BLEU, 6.90% in Inform, and 5.40% in Success. The main improvement of the GL learning scheme is thus attributed to a stronger task completion ability. This shows the effectiveness and importance of the GL learning scheme as it encourages each expert to specialize on a particular intent while the chair prompts all experts to coordinate with each other.

Model       BLEU     Inform   Success   Score   PPL
MoGNet      20.13%   85.30%   73.30%    99.43   4.25
MoGNet-GL   19.33%   78.40%   67.90%    92.48   3.97
Underlined results indicate the worst results with a statistically significant decrease compared with MoGNet (paired t-test).
Table 5: Impact of the learning scheme.

5 Analysis

In this section, we explore MoGNet in more detail. In particular, we examine

(i) whether the intent partition affects the performance of MoGNet (§5.1);

(ii) whether the improvements of MoGNet could simply be attributed to having a larger number of parameters (§5.2);

(iii) how the hyper-parameter $\lambda$ (Eq. 11) affects the performance of MoGNet (§5.2); and

(iv) how RMoG, PMoG and GL influence DRG using a small case study (§5.3).

5.1 Intent partition analysis

As stated above, the responses vary a lot for different intents which are differentiated by the domain and the type of system action. Therefore, we experiment with two types of intents as shown in Table 6.

Type             Intents
Domains          Attraction, Booking, Hotel, Restaurant, Taxi, Train, General, UNK.
System actions   Book, Inform, NoBook, NoOffer, OfferBook, OfferBooked, Select, Recommend, Request, Bye, Greet, Reqmore, Welcome, UNK.
Table 6: Two groups of intents that are divided by domains and the type of system actions.

To address (i), we compared two ways of partitioning intents. MoGNet-domain and MoGNet-action denote the intent partitions w.r.t. domains and system actions, respectively. MoGNet-domain has 8 intents (domains) and MoGNet-action has 14 intents (actions), as shown in Table 6. The results are shown in Table 7.

Model           BLEU     Inform   Success   Score   PPL
MoGNet-domain   20.13%   85.30%   73.30%    99.43   4.25
MoGNet-action   17.28%   79.40%   69.70%    91.83   4.48
Table 7: Results of MoGNet with two ways of partitioning intents.

MoGNet consistently outperforms the baseline S2SAttnGRU for both ways of partitioning intents. Interestingly, MoGNet-domain greatly outperforms MoGNet-action. We believe there are two reasons: First, the system actions are not suitable for grouping intents because some partition subsets are hard to distinguish from each other, e.g., OfferBook and OfferBooked. Second, some system actions only have a few data samples, which is simply not enough to specialize the experts. The results show that different ways of partitioning intents may greatly affect the performance of MoGNet. Therefore, more effective intent partition methods, e.g., adaptive implicit intent partitions, need to be explored in future work.

5.2 Hyper-parameter analysis

To address (ii), we show the results of MoGNet and S2SAttnGRU with different hidden sizes in Fig. 3. S2SAttnGRU outperforms MoGNet when the number of parameters is less than 0.6e7. However, MoGNet achieves much better results with more parameters. Most importantly, the results from both models show that a larger number of parameters does not always mean better performance, which indicates that the improvement of MoGNet is not simply due to more parameters.

Figure 3: Score of MoGNet and S2SAttnGRU with different number of parameters.
Figure 4: Score of MoGNet with different values of $\lambda$.

To address (iii), we report the Score values of MoGNet with different values of $\lambda$ (Eq. 11), as shown in Fig. 4. When $\lambda = 0$, no expert is trained on a particular intent. When $\lambda = 1$, the model ignores the global loss, i.e., the RMoG and PMoG mechanisms are not used and the chair is only trained as a general expert. We can see that these two settings decrease the performance greatly, which further verifies the effectiveness of RMoG and PMoG as well as the MoGNet framework. We also note that the performance of MoGNet is quite stable for intermediate values of $\lambda$, with the best performance for $\lambda = 0.5$. Hence, MoGNet is not very sensitive to the hyper-parameter $\lambda$.

5.3 Case study

Turn 1
  User: i am looking for a train departing from [value_place] that will arrive at [value_place].
  S2SAttnGRU: what day would you like to travel ?
  LaRLAttnGRU: there are [value_count] trains leaving [value_place] on [value_day] . what day would you like to travel ?
  MoGNet: what day and time would you like to travel?
  MoGNet-P: what day will you be traveling?
  MoGNet-P-R: what day will you be traveling?
  MoGNet-GL: there are [value_count] trains … do you have a time you would like to depart?

Turn 2
  User: i would like to leave on [value_day] and arrive no later than [value_time].
  S2SAttnGRU: [train_id] will arrive at [value_time] .
  LaRLAttnGRU: i have train [train_id] leaving at [value_time] and arriving at [value_time] . would you like me to book this for you ?
  MoGNet: [train_id] leaves at [value_time] and arrives by [value_time] .
  MoGNet-P: [train_id] will arrive at [value_time].
  MoGNet-P-R: the [train_id] arrives at [value_time].
  MoGNet-GL: i have train [train_id] that leaves at [value_time] and arrives at [value_time].
  +All: would you like me to book that for you?

Turn 3
  User: can i please have the reference number?
  All: … how many tickets do you need?

Turn 4
  User: could you book [value_count] tickets for me and send me the reference
  S2SAttnGRU: i have booked you [value_count] tickets on the [train_id] . your reference number is [train_reference]. is there anything else i can help you with ?
  LaRLAttnGRU: i have booked [value_count] tickets on [train_id] . the reference number is [train_reference] .
  MoGNet: i have booked [value_count] tickets for you. the total fee is [value_price] gbp payable at the station. your reference number is [train_reference]. is there anything else i can help you with ?
  MoGNet-P: i have booked [value_count] tickets. the total fee is [value_price] gbp payable at the station. your reference number is [train_reference].
  MoGNet-P-R: booking was successful , the total fee is [value_price] gbp payable at the station. reference number is [train_reference].
  MoGNet-GL: i have booked [value_count] tickets for you. the reference number is [train_reference]. is there anything else i can help you with ?
Table 8: Example responses from MoGNet with the ablation settings in a 4-turn dialogue.

To address (iv), we select an example to illustrate the influence of RMoG, PMoG, and GL. Table 8 exhibits the responses generated by comparable baselines (i.e., S2SAttnGRU, LaRLAttnGRU) and MoGNet variants as in Table 4. In red we highlight the tokens that show the differences in terms of task completion. Generally, MoGNet can generate more appropriate and meaningful responses. Specifically, without PMoG, MoGNet-P and MoGNet-P-R ignore the fact that the attribute time is important for searching a train ticket (1st turn) and omit the exact departure time ([value_time]) of the train (2nd turn). Without GL, MoGNet-GL ignores the primary time information need, i.e., the day (1st turn), and omits the implicit need for [value_price] (4th turn). There are also some low-quality cases, e.g., MoGNet and the baselines occasionally generate redundant and lengthy responses, because none of them has addressed this issue explicitly during training.

6 Related Work

Traditional models for DRG [8, 33] decompose the task into sequentially dependent modules, e.g., Dialogue State Tracking (DST) [37], Policy Learning (PL) [35], and Natural Language Generation (NLG) [21]. Such models allow for targeted failure analyses, but inevitably incur upstream propagation problems [5]. Recent work views DRG as a source-to-target transduction problem, which maps a dialogue context to a response [11, 17, 31]. Sordoni et al. [28] show that using an RNN to generate text conditioned on dialogue history results in more natural conversations. Later improvements include the addition of attention mechanisms [16, 29], modeling the hierarchical structure of dialogues [26], and jointly learning belief spans [15]. Strengths of these methods include global optimization and easier adaptation to new domains [5].

The studies listed above assume that each token of a response is sampled from a single distribution, given a complex dialogue context. In contrast, MoGNet uses multiple cooperating modules, which exploit the specialization capabilities of different experts and the generalization capability of a chair. Work most closely related to ours in terms of modeling multiple experts includes [6, 12, 14, 23]. Le et al. [14] integrate a chat model with a question answering model using an LSTM-based mixture-of-experts method. Their model is similar to MoGNet-GL-P (without PMoG and GL) except that they simply use two implicit expert generators that are not specialized on particular intents. Guo et al. [12] introduce a mixture-of-experts to use the data relationship between multiple domains for binary classification and sequence tagging. Sequence tagging generates a set of fixed labels, whereas DRG generates diverse, appropriate response sequences. The differences between MoGNet and these two approaches are three-fold: First, MoGNet consists of a group of modules including a chair generator and several expert generators; this design addresses the module interdependence problem since each module is independent from the others. Second, the chair generator alleviates the error propagation problem because it is able to manage the overall errors through an effective learning scheme. Third, the models of those two approaches cannot be directly applied to task-oriented DRG. The recently published HDSA [6] slightly outperforms MoGNet on Score (+0.07), but it relies heavily on BERT [9] and graph structured dialog acts. MoGNet follows the same modular TDS framework [23], but it performs substantially better because it equips the expert generators with both retrospection and prospection abilities and adopts the GL learning scheme to conduct more effective learning.
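To make the contrast with single-distribution decoding concrete, the following is a minimal sketch of drawing one token from a mixture of expert distributions. It is not the MoGNet implementation: the function names, the toy vocabulary, the two hypothetical experts, and the use of a softmax over raw chair scores as mixture weights are all assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_token_distributions(
    expert_dists: List[List[float]],
    chair_scores: List[float],
) -> List[float]:
    """Combine per-expert token distributions into one mixture.

    expert_dists[k][v] is expert k's probability for vocabulary item v.
    chair_scores[k] is a raw coordination score for expert k; weighting
    experts by a softmax over these scores is an illustrative assumption.
    """
    if len(expert_dists) != len(chair_scores):
        raise ValueError("need exactly one chair score per expert")
    weights = softmax(chair_scores)
    vocab_size = len(expert_dists[0])
    mixed = [0.0] * vocab_size
    for weight, dist in zip(weights, expert_dists):
        if len(dist) != vocab_size:
            raise ValueError("all experts must share one vocabulary")
        for v, p in enumerate(dist):
            mixed[v] += weight * p
    return mixed


# Toy vocabulary (hypothetical): ["[value_price]", "[train_reference]", "tickets"]
train_expert = [0.1, 0.7, 0.2]
booking_expert = [0.6, 0.2, 0.2]

# Equal chair scores give each expert the same weight in the mixture.
mixed = mix_token_distributions([train_expert, booking_expert], [0.0, 0.0])
print(mixed)
```

Because the weights sum to one and each expert output is a distribution, the mixture is again a valid distribution over the vocabulary. A single-distribution decoder corresponds to the special case in which one expert receives all of the weight.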

7 Conclusion and Future Work

In this paper, we propose a novel mixture-of-generators network (MoGNet) model with different coordination mechanisms, namely, RMoG and PMoG, to enhance dialogue response generation. We also devise a GL learning scheme to effectively learn MoGNet. Experiments on the MultiWOZ benchmark demonstrate that MoGNet significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations. We also conduct analyses that confirm the effectiveness of MoGNet, the RMoG and PMoG mechanisms, as well as the GL learning scheme.

As to future work, we plan to devise more fine-grained expert generators and to experiment on more datasets to test MoGNet. In addition, MoGNet can be advanced in many directions: First, better mechanisms can be proposed to improve the coordination between the chair and the expert generators. Second, it would be interesting to study how to partition intents automatically. Third, it is also important to investigate how to avoid redundant and lengthy responses in order to provide a better user experience.


This research was partially supported by Ahold Delhaize, the Association of Universities in the Netherlands (VSNU), the China Scholarship Council (CSC), and the Innovation Center for Artificial Intelligence (ICAI). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.




  1. D. Bahdanau, K. Cho and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
  2. A. Bordes and J. Weston (2017) Learning end-to-end goal-oriented dialog. In ICLR.
  3. P. Budzianowski, I. Casanueva, B. Tseng and M. Gasic (2018) Towards end-to-end multi-domain dialogue modelling. Technical report, Cambridge University.
  4. P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan and M. Gasic (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP, pp. 5016–5026.
  5. H. Chen, X. Liu, D. Yin and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. ACM SIGKDD Explorations Newsletter 19 (2), pp. 25–35.
  6. W. Chen, J. Chen, P. Qin, X. Yan and W. Y. Wang (2019) Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In ACL, pp. 3696–3709.
  7. K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pp. 1724–1734.
  8. P. Crook, A. Marin, V. Agarwal, K. Aggarwal, T. Anastasakos, R. Bikkula, D. Boies, A. Celikyilmaz, S. Chandramohan, Z. Feizollahi, R. Holenstein, M. Jeong, O. Z. Khan, Y.-B. Kim, E. Krawczyk, X. Liu, D. Panic, V. Radostev, N. Ramesh, J.-P. Robichaud, A. Rochette, S. L. and R. Sarikaya (2016) Task completion platform: a self-serve multi-domain goal oriented dialogue platform. In NAACL, pp. 47–51.
  9. J. Devlin, M. Chang, K. Lee and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186.
  10. T. G. Dietterich (2000) Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, pp. 1–15.
  11. M. Eric, L. Krishnan, F. Charette and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. In SIGDIAL, pp. 37–49.
  12. J. Guo, D. J. Shah and R. Barzilay (2018) Multi-source domain adaptation with mixture of experts. In EMNLP, pp. 4694–4703.
  13. D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  14. P. Le, M. Dymetman and J. Renders (2016) LSTM-based mixture-of-experts for knowledge-aware dialogues. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 94–99.
  15. W. Lei, X. Jin, M. Kan, Z. Ren, X. He and D. Yin (2018) Sequicity: simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In ACL, pp. 1437–1447.
  16. J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao and B. Dolan (2016) A persona-based neural conversation model. In ACL, pp. 994–1003.
  17. J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In EMNLP, pp. 2157–2169.
  18. T. Luong, H. Pham and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412–1421.
  19. S. Masoudnia and R. Ebrahimpour (2014) Mixture of experts: a literature survey. Artificial Intelligence Review 42 (2), pp. 275–293.
  20. S. Mehri, T. Srinivasan and M. Eskenazi (2019) Structured fusion networks for dialog. In SIGDIAL, pp. 165–177.
  21. F. Mi, M. Huang, J. Zhang and B. Faltings (2019) Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In IJCAI, pp. 3151–3157.
  22. R. Pascanu, T. Mikolov and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In ICML, pp. 1310–1318.
  23. J. Pei, P. Ren and M. de Rijke (2019) A modular task-oriented dialogue system using a neural mixture-of-experts. In SIGIR Workshop on Conversational Interaction Systems.
  24. J. Pei, A. Stienstra, J. Kiseleva and M. de Rijke (2019) SEntNet: source-aware recurrent entity networks for dialogue response selection. In 4th International Workshop on Search-Oriented Conversational AI (SCAI).
  25. P. Schwab, D. Miladinovic and W. Karlen (2019) Granger-causal attentive mixtures of experts: learning important features with neural networks. In AAAI, pp. 4846–4853.
  26. I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pp. 3776–3784.
  27. N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In ICLR.
  28. A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, pp. 196–205.
  29. O. Vinyals and Q. Le (2015) A neural conversational model. In ICML Deep Learning Workshop.
  30. T. Wen, D. Vandyke, N. Mrkšić, M. Gasic, L. M. R. Barahona, P. Su, S. Ultes and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In EACL, pp. 438–449.
  31. T. Wen, D. Vandyke, N. Mrkšić, M. Gasic, L. M. R. Barahona, P. Su, S. Ultes and S. Young (2017) A network-based end-to-end trainable task-oriented dialogue system. In EACL, pp. 438–449.
  32. J. D. Williams, K. Asadi and G. Zweig (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL, pp. 665–677.
  33. Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou and Z. Li (2017) Building task-oriented dialogue systems for online shopping. In AAAI, pp. 4618–4626.
  34. S. Young, M. Gašić, B. Thomson and J. D. Williams (2013) POMDP-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179.
  35. Z. Zhang, M. Huang, Z. Zhao, F. Ji, H. Chen and X. Zhu (2019) Memory-augmented dialogue management for task-oriented dialogue systems. TOIS 37 (3), pp. 34.
  36. T. Zhao, K. Xie and M. Eskenazi (2019) Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In NAACL, pp. 1208–1218.
  37. V. Zhong, C. Xiong and R. Socher (2018) Global-locally self-attentive encoder for dialogue state tracking. In ACL, pp. 1458–1467.