
Semantically Conditioned Dialog Response Generation
via Hierarchical Disentangled Self-Attention

Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan and William Yang Wang
University of California, Santa Barbara, CA, USA
Tencent AI Lab, Bellevue, WA, USA
Beijing University of Posts and Telecommunications, China

Semantically controlled neural response generation on limited domains has achieved great performance. However, moving towards multi-domain, large-scale scenarios has proven difficult because the possible combinations of semantic inputs grow exponentially with the number of domains. To alleviate this scalability issue, we exploit the structure of dialog acts to build a multi-layer hierarchical graph, where each act is represented as a root-to-leaf route on the graph. We then incorporate this graph structure as an inductive bias to build a hierarchical disentangled self-attention network, in which attention heads are disentangled to model designated nodes on the dialog act graph. By activating different (disentangled) heads at each layer, combinatorially many dialog act semantics can be modeled to control the neural response generation. On the large-scale MultiWOZ dataset, our algorithm yields an improvement of over 5.0 BLEU score, and in human evaluation it also significantly outperforms other baselines over various metrics, including consistency.


1 Introduction

Conversational artificial intelligence (Young et al., 2013) is one of the critical milestones in artificial intelligence. Recently, there has been increasing interest in industry in building task-oriented conversational agents (Wen et al., 2017; Li et al., 2017; Rojas-Barahona et al., 2017) to solve pre-defined tasks such as restaurant or flight booking (see Figure 1 for an example dialog from MultiWOZ (Budzianowski et al., 2018)). Traditional agents are built on slot-filling techniques, which require significant handcrafting effort and make it hard to generate natural-sounding utterances in a generalizable and scalable manner. Therefore, different semantically controlled neural language generation models have been developed (Wen et al., 2015, 2016a, 2016b; Dusek and Jurcícek, 2016) to replace the traditional systems, where an explicit semantic representation (the dialog act) is used to influence the RNN generation. The canonical approach, proposed in Wen et al. (2015), encodes each individual dialog act as a one-hot vector and feeds it as an extra input into the cell of a long short-term memory (LSTM) network to influence the generation. As pointed out in Wen et al. (2016b), these models, though achieving good performance on limited domains, suffer from a scalability problem: the possible combinations of dialog acts grow exponentially with the number of domains.

Figure 1: An example dialog from MultiWOZ dataset, where the upper rectangle includes the dialog history, the tables at the bottom represent the external database, and the lower rectangle contains the dialog action and the language surface form that we need to predict.

To alleviate this issue, we propose a hierarchical graph representation that leverages the structural properties of dialog acts. Specifically, we first build a multi-layer tree to represent the entire dialog act space based on the acts' inter-relationships. Then, we merge tree nodes with similar semantics to construct an acyclic multi-layered graph, where each dialog act is interpreted as a root-to-leaf route on the graph. This graph representation not only captures the inter-relationships between different acts but also reduces the exponential representation cost to almost linear, which endows it with greater generalization ability. Instead of simply feeding such a vectorized representation to a neural network like the Semantically Controlled LSTM (SC-LSTM) (Wen et al., 2015, 2016b; Dusek and Jurcícek, 2016), we propose to incorporate this structural prior of the dialog act space as an inductive bias for designing the neural architecture, which we name the hierarchical disentangled self-attention network (HDSA). Figure 2 shows how the dialog act graph structure is explicitly encoded into the model architecture. Specifically, HDSA consists of multiple layers of disentangled self-attention (DSA) modules. Each DSA has multiple switches to set the on/off states of its heads, and each head is bound to model a designated node in the dialog act graph. At the training stage, conditioned on the given dialog acts and the target output sentences, we only activate the heads in HDSA corresponding to the given acts (i.e., the path in the graph) to assign the heads their designated semantics. At test time, we first predict the dialog acts and then use them to activate the corresponding heads to generate the output sequence, thereby controlling the semantics of the generated responses without handcrafted rules. As depicted in Figure 2, by activating the orange nodes (“taxi” and “request”), we can generate querying sentences about “taxi”.

Figure 2: The left part is the graph representation of the dialog acts, where each path in the graph denotes a unique dialog act. The right part denotes our proposed HDSA, where the orange nodes are activated while the others are blocked. (For details, refer to Figure 5)

Experimental results on the large-scale MultiWOZ dataset (Budzianowski et al., 2018) show that our HDSA significantly outperforms other competing algorithms. (Our code and data are publicly released.) In particular, the proposed hierarchical dialog act representation effectively improves generalization on unseen test cases and decreases the sample complexity on seen cases. In summary, our contributions are: (i) we propose a hierarchical graph representation of dialog acts that exploits their inter-relationships, greatly reducing the sample complexity and improving generalization; (ii) we incorporate this structural prior on the semantic space to design HDSA, which explicitly models the semantics of neural generation and outperforms strong baselines.

2 Related Work & Background

Figure 3: Illustration of the neural dialog system. We decompose it into two parts: the lower part describes dialog state tracking and the DB query, and the upper part denotes dialog act prediction and response generation. In this paper, we are mainly interested in improving the upper part.

Canonical task-oriented dialog systems are built as pipelines of separately trained modules: (i) user intention classification (Shi et al., 2016; Goo et al., 2018), for understanding human intention; (ii) belief state tracking (Williams et al., 2013; Mrksic et al., 2017a, b; Zhong et al., 2018; Chen et al., 2018), for tracking the user's query constraints and formulating a DB query to retrieve entries from a large database; (iii) dialog act prediction (Wen et al., 2017), for classifying the system action; and (iv) response generation (Rojas-Barahona et al., 2017; Wen et al., 2016b; Li et al., 2017; Lei et al., 2018), for realizing the language surface form given the semantic constraint. To handle the massive number of entities in responses, Rojas-Barahona et al. (2017); Wen et al. (2016b, 2015) suggest breaking response generation into two steps: first generate delexicalized sentences with placeholders like Res.Name, and then post-process the sentence by replacing the placeholders with the DB record. Existing modularized neural models have achieved promising performance on limited-domain datasets like DSTC (Williams et al., 2016), CamRes767 (Rojas-Barahona et al., 2017), and KVRET (Eric et al., 2017). However, the recently introduced multi-domain, large-scale MultiWOZ dataset (Budzianowski et al., 2018) poses great challenges to these approaches due to its large number of slots and complex ontology. Dealing with such a large semantic space remains a challenging research problem.

Here we first follow the nomenclature of Rojas-Barahona et al. (2017) to visualize the pipeline system in Figure 3, decomposing it into two parts: the lower part (blue rectangle) contains state tracking and symbolic DB execution, while the upper part consists of dialog act prediction and response generation conditioned on the state tracking and DB results. In this paper, we are particularly interested in the upper part (act prediction and response generation), assuming the ground-truth belief state and DB records are available. More specifically, we investigate how to handle the large semantic space of dialog acts and leverage it to control neural response generation. Our approach encodes the history utterances into distributed representations to predict dialog acts and then uses the predicted dialog acts to control response generation. The key idea of our model is to formulate a more compact structured representation of the dialog acts to avoid the exponential growth issue and then incorporate this structural prior on the semantic space into the neural architecture design. Our proposed HDSA is inspired by linguistically-informed self-attention (Strubell et al., 2018), which combines multi-head self-attention with multiple NLP tasks to enhance the linguistic awareness of the model. In contrast, our model disentangles different heads to model different semantic conditions within a single task.

Figure 4: The left figure describes the tree representation of the dialog acts, and the right figure denotes the obtained hierarchical graph representation from the left after merging the cross-branch nodes that have the same semantics.

3 Dialog Act Representation

As discussed in (Wen et al., 2015; Budzianowski et al., 2018; Wen et al., 2016b; Dusek and Jurcícek, 2016; Novikova et al., 2017), dialog acts are the semantic conditions of a language sequence, comprising domains, act types, and slots. One standard approach is to encode each individual dialog act as a one-hot vector. However, this method is difficult to scale up, as the number of possible combinations grows dramatically with the number of domains. Moreover, such a one-hot representation makes every pair of dialog acts equally distant, failing to capture the inter-relationships between different dialog acts. Furthermore, since each dialog act only provides one sparse instance for training, it can also lead to poor generalization. To address these issues, we propose a two-step strategy to build a hierarchical graph representation:

Tree Structure

First, we note that the dialog acts defined in different spoken dialog systems (Wen et al., 2016b; Dusek and Jurcícek, 2016; Tran et al., 2017; Sharma et al., 2016; Nayak et al., 2017; Budzianowski et al., 2018) share a universal but unexploited structural property, which stems from the different semantic granularities of dialog acts. For example, consider two dialog acts: (i) a general dialog act “hotel-inform” (tell the user information about a hotel), and (ii) a more specific dialog act “hotel-inform-name” (tell the user the name of a hotel). The second dialog act can be viewed as inheriting from the first. Another example is “restaurant-inform-location” vs. “restaurant-inform-name”: both give restaurant information to the user, one about the location and the other about the name, so they stand in a sibling relationship. Given this kinship among dialog acts, we propose to expand all the potential dialog acts into the tree shown in the left part of Figure 4, where each dialog act is a root-to-leaf path. (We add a dummy node “none” to turn non-leaf acts into leaf acts so that all acts are normalized into triplets; for example, “hotel-inform” is converted into “hotel-inform-none”.) With this tree structure, kinship is better captured; e.g., “restaurant-inform-location” is more similar to “restaurant-inform-name” than to “hotel-request-address”.
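The triplet normalization described above (padding with the dummy “none” node) can be sketched in a few lines; this is an illustrative sketch of the scheme in the text, not the authors' released code:

```python
# Minimal sketch of act normalization: pad non-leaf acts like "hotel-inform"
# with the dummy "none" node so every act becomes a (domain, type, slot) triplet.
def normalize_act(act: str) -> tuple:
    parts = act.split("-")
    parts += ["none"] * (3 - len(parts))  # e.g. "hotel-inform" -> hotel-inform-none
    return tuple(parts)
```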

Graph Structure

However, the tree-structured representation cannot capture cross-branch relationships such as “restaurant-inform-location” vs. “hotel-inform-location”, leading to a huge expansion of the tree. Therefore, we further merge the cross-branch nodes that share the same semantics to build the compact acyclic graph in the right part of Figure 4. (We call it a graph because one child node can now have multiple parents, which violates the definition of a tree.) Formally, let $\mathcal{A}$ denote the set of all original dialog acts. For each act $a \in \mathcal{A}$, we use $A = [A_1; A_2; A_3]$ to denote its 3-layer graph form, where $A_i$ is its one-hot representation in the $i$-th layer of the graph. For example, the dialog act “hotel-inform-name” has a graph representation with exactly the “hotel”, “inform”, and “name” bits set. More formally, let $N_1, N_2, N_3$ denote the numbers of nodes at the first, second, and third layers of the graph, respectively. Ideally, the total representation cost decreases dramatically from $O(N_1 \times N_2 \times N_3)$ in the one-hot representation to $N = N_1 + N_2 + N_3$ in our graph representation. Due to the page limit, we include the full dialog act graph and its corresponding semantics in the supplementary material. When multiple dialog acts are involved in a single response, we aggregate them as $A = A^{(1)} \vee A^{(2)} \vee \cdots$, the $N$-dimensional graph representation, where $\vee$ denotes the bit-wise OR operator. (For example, two acts sharing the same domain and act type aggregate into a single vector with both slot bits set.)
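As a concrete (hypothetical) sketch of this encoding, each act maps to the concatenation of three one-hot vectors, and multiple acts in one response are OR-ed together. The node inventories below are illustrative, not the full MultiWOZ graph:

```python
# Toy node inventories for the three graph layers (domain / act type / slot).
DOMAINS = ["hotel", "restaurant", "taxi"]
ACT_TYPES = ["inform", "request", "recommend"]
SLOTS = ["name", "area", "price", "none"]

def encode_act(domain, act_type, slot):
    """Concatenate the three per-layer one-hot vectors into one binary vector."""
    vec = [0] * (len(DOMAINS) + len(ACT_TYPES) + len(SLOTS))
    vec[DOMAINS.index(domain)] = 1
    vec[len(DOMAINS) + ACT_TYPES.index(act_type)] = 1
    vec[len(DOMAINS) + len(ACT_TYPES) + SLOTS.index(slot)] = 1
    return vec

def aggregate(acts):
    """Bit-wise OR over the graph representations of all acts in one response."""
    out = [0] * (len(DOMAINS) + len(ACT_TYPES) + len(SLOTS))
    for a in acts:
        out = [x | y for x, y in zip(out, encode_act(*a))]
    return out
```

Note how two acts sharing “hotel” and “inform” only add one extra active bit each, which is the source of the near-linear representation cost.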

Generalization Ability

Such a graph representation of dialog acts has a great advantage under sparse training instances. For example, suppose the exact dialog act “hotel-recommend-area” never appears in the training set. At test time, when it is used for response generation, the one-hot representation will obviously fail. In contrast, with our hierarchical representation, “hotel”, “recommend”, and “area” may each have appeared separately in other instances (e.g., “recommend” appears in “attraction-recommend-name”). Its graph representation can therefore still be well-behaved and generalize to unseen (or less frequent) cases, thanks to its strong compositionality.

4 Model

Figure 5: The left figure describes the dialog act predictor and HDSA, and the right figure shows the details of DSA. The predicted hierarchical dialog acts are used to control the switches in HDSA at each layer. Here we use 3 layers with 10/7/27 heads, so the graph representation is 44-dimensional. We use $n$ to denote the dialog history length and $m$ for the response length.

Figure 5 gives an overview of our dialog system. We now proceed to discuss its components below.

Dialog Act Predictor

We first explain the utterance encoder module, which uses a neural network to encode the dialog history (i.e., the concatenation of the previous user and system utterances $w_1, \dots, w_n$) into distributed token-wise representations along with an overall representation:

$$[h_1, \dots, h_n],\; \bar{h} = \mathrm{Encoder}(w_1, \dots, w_n)$$

where the $\mathrm{Encoder}$ can be a CNN, LSTM, Transformer, etc., and $h_1, \dots, h_n$ are the token-wise representations. The overall feature $\bar{h}$ is used to predict the hierarchical representation of the dialog act. That is, we output a vector $p \in [0,1]^N$ whose $i$-th component $p_i$ gives the probability that the $i$-th node in the dialog act graph is activated:

$$p = \sigma\big(W_p\, [\bar{h};\, b_t;\, k_t] + b_p\big)$$

where the pooled feature $\bar{h}$ is obtained with an attention matrix over the token representations, the weights $W_p$ and $b_p$ are learnable parameters that project the input into the $N$-dimensional graph space, and $\sigma$ is the sigmoid function. Here, we follow Budzianowski et al. (2018); Rojas-Barahona et al. (2017) in using one-hot vectors $k_t$ and $b_t$ to represent the DB records and the belief state (see the original papers for details). For convenience, we use $\theta$ to collect all the parameters of the utterance encoder and act predictor. At training time, we maximize the cross-entropy objective:

$$\mathcal{L}(\theta) = \langle A, \log p \rangle + \langle \mathbf{1} - A, \log(\mathbf{1} - p) \rangle$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors and $A$ is the ground-truth graph representation of the dialog act. At test time, we predict the dialog acts as $\hat{A}_i = \mathbb{1}[p_i > \tau]$, where $\tau$ is the threshold and $\mathbb{1}[\cdot]$ is the indicator function.
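A minimal pure-Python sketch of the prediction step (the weight shapes and the threshold value here are illustrative assumptions; the paper tunes the threshold by small-scale search):

```python
import math

def predict_acts(feature, W, b, tau=0.5):
    """Sigmoid over a linear projection of the pooled feature, thresholded at tau.
    W has one row per graph node; returns the binary act vector A-hat."""
    logits = [sum(w * f for w, f in zip(row, feature)) + bj
              for row, bj in zip(W, b)]
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p > tau else 0 for p in probs]
```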

Disentangled Self-Attention

Recently, the self-attention-based Transformer model has achieved state-of-the-art performance on various NLP tasks such as machine translation (Vaswani et al., 2017) and language understanding (Devlin et al., 2018). The success of the Transformer is partly attributed to the multi-view representation produced by the multi-head attention architecture. Unlike the standard Transformer, which concatenates the vectors from different heads into one vector, we use a switch to activate certain heads and pass only their information to the next level (depicted on the right of Figure 5). Hence, we are able to disentangle the attention heads to model different semantic functionalities; we refer to this module as disentangled self-attention (DSA). Formally, we follow the canonical Transformer (Vaswani et al., 2017) in defining the scaled dot-product attention function over the input features as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

where $m$ denotes the sequence length of the input and $Q, K, V$ denote the query, key, and value matrices. We use $H$ different self-attention functions with independent parameterizations to compute the multi-head representation as follows:

$$G_i = \mathrm{Attention}(XW_i^Q,\; XW_i^K,\; XW_i^V), \quad i = 1, \dots, H$$

where the input matrices are computed from the input token embeddings $X \in \mathbb{R}^{m \times d}$, and $d$ denotes the dimension of the embedding. The $i$-th head adopts its own parameters $W_i^Q$, $W_i^K$, $W_i^V$ to compute its output $G_i$. We shrink the per-head dimension to $d/H$ to reduce the computation cost, as suggested in Vaswani et al. (2017).
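The scaled dot-product attention above is the standard Transformer operation; a small NumPy sketch (NumPy is assumed here purely for illustration):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```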

We first use a cross-attention network to incorporate the encoded dialog history $[h_1, \dots, h_n]$, and then apply a position-wise feed-forward network, layer normalization, and finally a linear projection layer to obtain each head output $G_i$; these layers are shared across the different heads. The main innovation of our architecture lies in the head disentanglement: instead of concatenating the $G_i$ to obtain the layer output as in the standard Transformer, we employ a binary switch vector $z \in \{0,1\}^H$ to control the heads and aggregate them into an output matrix $M = \sum_{i=1}^{H} z_i G_i$. The $t$-th row of $M$, denoted $M_t$, can be thought of as the output corresponding to the $t$-th input token in the response. This acts like a gating function that selectively passes the desired information: by manipulating the attention-head switch $z$, we can control the information flow inside the self-attention module. We visualize the gated summation over multiple heads in Figure 6.

Figure 6: The disentangled multi-head attention, with a sequence length of 3, 3 heads, and hidden dimension 7. The switch enables information flow only from the 1st and 3rd heads.
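The switch-controlled aggregation itself can be sketched as follows (shapes are illustrative):

```python
import numpy as np

def gated_heads(G, z):
    """Sum only the switched-on head outputs.
    G: (H, m, d) stack of per-head outputs; z: binary switch of length H."""
    z = np.asarray(z, dtype=float).reshape(-1, 1, 1)
    return (np.asarray(G) * z).sum(axis=0)  # (m, d)
```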

Hierarchical DSA

When the dialog system involves a more complex ontology, the semantic space can grow rapidly. Consequently, a single-layer disentangled self-attention with a large number of heads has difficulty handling this complexity. Therefore, we further propose to stack multiple DSA layers to better model the huge, strongly compositional semantic space. As depicted in Figure 5, the lower layers are responsible for grasping coarse-level semantics and the upper layers for capturing fine-level semantics. Such progressive generation bears a strong similarity to how human brains construct precise responses. In each DSA layer, we feed the utterance encoding $[h_1, \dots, h_n]$ and the previous layer's output as input to obtain the new output matrix. We collect the output $M$ from the last DSA layer to compute the joint probability of the observed sequence $y = (y_1, \dots, y_m)$, which decomposes into a product of conditional probabilities (we follow the standard Transformer approach of using a mask so that $y_t$ depends only on $y_{<t}$ during training; at test time, we decode sequentially from left to right):

$$p(y \mid \mathcal{Z}; \phi) = \prod_{t=1}^{m} \mathrm{softmax}\big(W M_t + b\big)\,[y_t]$$

where $W$ and $b$ are the projection weight and bias onto a vocabulary of size $|V|$, $t$ is the token index, $\mathrm{softmax}$ denotes the softmax operation, $\mathcal{Z}$ denotes the set of attention switches $z^{(1)}, \dots, z^{(L)}$ over the $L$ layers, and $\phi$ denotes all the decoder parameters.
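Putting the pieces together, the layer stacking with per-level switches can be caricatured as below. The head transforms are plain matrices rather than real attention blocks, purely to show how the three switch vectors (domain / act type / slot) route information:

```python
import numpy as np

def dsa_layer(X, heads, z):
    """One disentangled layer: sum the outputs of switched-on heads only."""
    out = np.zeros_like(X)
    for zi, W in zip(z, heads):
        if zi:
            out += X @ W
    return out

def hdsa(X, layers, switches):
    """Stack DSA layers; switches[l] is the binary switch vector for layer l."""
    for heads, z in zip(layers, switches):
        X = dsa_layer(X, heads, z)
    return X
```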

Recall that the graph structure of dialog acts is explicitly encoded into HDSA as a prior, where each head in HDSA is bound to model a designated semantic node on the graph. Consequently, the graph representation $A$ can be used to control the head switches $\mathcal{Z}$. At training time, the model parameters are optimized on training triples (dialog history, ground-truth act, response) to maximize the likelihood of the ground-truth acts and responses given the dialog history. Formally, we maximize the following objective:

$$\max_{\theta, \phi}\; \log p(A \mid w_{1:n}; \theta) + \log p(y \mid \mathcal{Z} = A; \phi)$$

At test time, we use the predicted dialog act $\hat{A}$ to control the language generation. Errors can therefore come from two sources: inaccurate dialog act prediction and erroneous response generation.

5 Experiments


To evaluate our proposed methods, we use the recently released MultiWOZ dataset (Budzianowski et al., 2018) as the benchmark, which was specifically designed to cover challenging multi-domain, large-scale dialog management (see the summary in Table 1). This benchmark involves a much larger dialog act space due to its multiple domains and complex database backend. We represent the 625 potential dialog acts as a three-layered hierarchical graph with a total of 44 nodes (see the Appendix for the complete graph).

#Dialogs: 8538   Total #Turns: 115,424   Unique #Tokens: 24,071   #Values: 4510
#Dialog Acts: 625   #Domains: 10   #Act Types: 7   #Slots: 27
Table 1: Summary of the MultiWOZ dataset.

We follow Budzianowski et al. (2018) in selecting 1000 dialogs as the test set and 1000 dialogs as the development set, and we mainly focus on the context-to-response problem, with dialog act prediction as a preliminary task. The best HDSA uses three DSA layers with 10/7/27 heads to separately model the semantics of domains, act types, and slots (a dummy head is included to model the “none” node). Adam (Kingma and Ba, 2014) is used to optimize the objective. A beam size of 2 is adopted to search the hypothesis space during decoding, with a vocabulary size of 3130. The threshold $\tau$ is fixed via a small-scale search for better empirical results.

Methods Precision Recall F1
Bi-directional LSTM 72.4 70.5 71.4
Word-CNN 72.8 70.3 71.5
3-layer Transformer 73.3 72.6 73.1
12-layer BERT 77.5 77.4 77.3
Table 2: Dialog act prediction results (precision, recall, F1).
Dialog-Act Methods Delex. BLEU Inform Request Entity F1 Restored BLEU
None LSTM (Budzianowski et al., 2018) 18.8 71.2 60.2 54.8 15.1
3-layer Transformer (Vaswani et al., 2017) 19.1 71.1 59.9 55.1 15.2
One-Hot SC-LSTM (Wen et al., 2015) 19.0 73.5 62.5 55.2 15.7
3-layer Transformer-out 18.9 74.4 61.1 55.4 15.6
3-layer Transformer-in 19.1 73.8 62.1 55.3 15.5
Hierarchical (Pred Act)
3-layer Transformer-out 22.5 80.8 64.8 64.2 19.3
3-layer Transformer-in 22.7 80.4 65.1 64.6 19.9
Straight DSA (44 heads) + 2 x SA 22.6 80.3 67.1 65.0 20.0
2-layer HDSA (7/27 heads) + SA 23.2 82.9 69.1 65.1 20.3
3-layer HDSA (10/7/27 heads) 23.6 82.9 68.9 65.7 20.6
Hierarchical (Groundtruth Act)
3-layer Transformer-in 29.1 85.5 72.6 83.8 25.1
Straight DSA (44 heads) + 2 x SA 29.6 86.4 75.6 84.1 25.5
3-layer HDSA (10/7/27 heads) 30.4 87.9 78.0 86.2 26.2
Table 3: Empirical results on MultiWOZ response generation. We experiment with three forms of dialog act: none, one-hot, and hierarchical.

Dialog Act Prediction

We first train dialog act predictors using different neural networks to compare their performance. The results (precision, recall, and F1) are reported in Table 2. Fine-tuning pre-trained BERT (Devlin et al., 2018) leads to significantly better performance than the other models, so we stick with this model in the following experiments. Instead of jointly training the predictor and the response generator, we simply fix the trained predictor and optimize the generator independently.

5.1 Automatic Evaluation

We follow Budzianowski et al. (2018) in using delexicalized BLEU (Papineni et al., 2002), inform rate, and request success as three basic metrics to compare the delexicalized generation against the delexicalized reference. We further use entity F1 (Rojas-Barahona et al., 2017) to evaluate entity coverage accuracy (including all slot values, days, numbers, references, etc.), and restored BLEU to compare the restored generation against the raw reference. The evaluation metrics are detailed in the supplementary material.

Before diving into the experiments, we first list all the models we experiment with as follows:

  1. Without Dialog Act, we use the official code: (i) LSTM (Budzianowski et al., 2018), which uses the history as the attention context and applies the belief state and KB results as side inputs; (ii) Transformer (Vaswani et al., 2017), which uses a stacked Transformer architecture with the dialog history as the source attention context.

  2. With One-Hot Dialog Act, we follow the publicly available multi-domain dialog act annotations as the dialog act representation: (i) SC-LSTM (Wen et al., 2015), which uses a semantic gate to influence the generation process; (ii) Transformer-in, which appends the one-hot vector to the input word embeddings; (iii) Transformer-out, which appends the one-hot vector to the last layer's output, before the softmax function.

  3. With Hierarchical Dialog Act (Pred Act): these results are obtained with predicted dialog acts. (i) Transformer-in/out: as above, but concatenated with the hierarchical representation. (ii) Straight DSA: models all potential dialog acts with a single-layer DSA followed by two layers of self-attention. (iii) 2-layer HDSA: adopts only the act-type- and slot-level acts, used as an ablation. (iv) 3-layer HDSA: captures all hierarchical dialog act layers.

  4. With Hierarchical Dialog Act (Groundtruth Act): these results are obtained with ground-truth dialog acts, to measure the upper bound of the proposed response generator as an ablation.

To make these models comparable, we choose hidden dimensions so that their total parameter sizes are comparable. Table 3 reports the performance of the different models, which we briefly summarize as follows: (i) simply adding the one-hot representation to the input/output layer (Transformer-in/out) fails to capture the large semantic space of dialog acts with sparse training instances, which unsurprisingly yields no performance gain over the Transformer with no dialog act input; (ii) the dialog act structure is important for guiding response generation: replacing one-hot with the graph representation brings significant and consistent improvements across methods (Transformer-in/out); (iii) the hierarchical graph structure prior is an effective inductive bias: the structure-aware HDSA better models the compositional semantic space of dialog acts and yields a decent gain over Transformer-in/out; (iv) our approaches yield significant gains (10+%) on the inform/request success rates, which reflects that the explicit structured representation of dialog acts is very effective in guiding responses toward accomplishing the desired tasks; (v) the generator is greatly limited by the predictor's accuracy: when fed the ground-truth acts, the proposed HDSA achieves a 7.0 BLEU increase.

Generalization Ability

To better understand the performance gain from the hierarchical representation, we examine its generalization ability and compare it to the one-hot representation. Specifically, we divide the dialog acts into five categories based on their frequency in the training data: very few shot (1-100 occurrences), few shot (100-500), medium shot (500-2K), many shot (2K-5K), and very many shot (5K+). We compute the average BLEU score of the turns within each frequency category and plot the results in Figure 7. First, we observe that for small numbers of shots, the hierarchical representation is significantly better even when used with the Transformer-in network. This validates our hypothesis in Sec. 3 that the hierarchical representation generalizes better to unseen (or less frequent) cases. Furthermore, HDSA exploits the hierarchical representation more effectively, as its improvement is even larger than that of the Transformer-in network.

Figure 7: The BLEU scores for dialog acts with different numbers of shots.
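The bucketing itself is straightforward; a sketch with the boundaries from the text (boundary ties are assigned to the higher bucket here, an arbitrary choice):

```python
# Map a dialog act's training-set count to its shot bucket.
BUCKETS = [(1, 100, "very few"), (100, 500, "few"), (500, 2000, "medium"),
           (2000, 5000, "many"), (5000, float("inf"), "very many")]

def shot_bucket(count):
    for lo, hi, name in BUCKETS:
        if lo <= count < hi:
            return name
    return None
```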

5.2 Human Evaluation

Response Quality

Owing to the low consistency between automatic metrics and human perception on conversational tasks, we also recruited qualified judges from Amazon Mechanical Turk (AMT) (with prior approval rate 95%) to compare the responses generated by HDSA and SC-LSTM. Three criteria are adopted: (i) relevance: the response correctly answers the most recent user query; (ii) coherence: the response is coherent with the dialog history; (iii) consistency: the generated sentence is semantically aligned with the ground truth. During the evaluation, each AMT worker is presented with two responses, generated separately by HDSA and SC-LSTM, as well as the ground-truth dialog history. Each HIT assignment contains 5 comparison problems, and we distributed a total of 200 HIT assignments. Finally, we performed statistical analysis on the harvested results after rejecting failure cases and report the statistics in Table 4.

Winner Consistency Relevance Coherence
SC-LSTM 32.8% 38.8% 36.1%
Tie 11.8% 11.4% 19.0%
HDSA 55.4% 49.8% 44.8%
Model Match Partial Match Mismatch
HDSA 90% 7% 3%
Trans-in 81% 12% 7%
SC-LSTM 72% 10% 18%
Table 4: Results of the two human evaluations for HDSA vs. SC-LSTM vs. Transformer-in. The top table gives the response quality evaluation and the bottom table the controllability evaluation described in subsection 5.2.

From the results, we observe that our model significantly outperforms SC-LSTM on coherence, i.e., our model better controls the generation to maintain coherence with the dialog history.

Semantic Controllability

To quantitatively compare the controllability of HDSA, Transformer-in, and SC-LSTM, we design a synthetic NLG experiment: we randomly pick 50 dialog histories from the test set as context, and then randomly select 3 dialog acts and their combinations as the semantic conditions to control the models' response generation. We show an example of the evaluation procedure in the supplementary material. Quantitatively, we ask human workers to rate (as match, partial match, or mismatch) whether each model follows the given semantic condition and generates coherent sentences. The results are reported in the bottom half of Table 4; they demonstrate that both the compact graph representation and the hierarchical structure prior are essential for better controllability.

6 Discussion

Graph Representation as Transfer Learning

The proposed graph representation works well when the domains' slot-value pairs overlap significantly, as with Restaurant and Hotel, where knowledge transfers easily. When such exact overlap is scarce, we propose to group similar concepts together under a hypernym and use one switch to control the hypernym, which generalizes the proposed method to broader domains.

Compression vs. Expressiveness

A trade-off we found in our structure-based encoding scheme is that when multiple dialog acts with overlapping middle layers are predicted, the graph representation becomes ambiguous. For example, the two dialog acts “restaurant-inform-price” and “hotel-inform-location” are merged into “[restaurant, hotel] [inform] [price, location]”, and the compressed representation cannot distinguish this pair from “hotel-inform-price” with “restaurant-inform-location”. Though such cases are very rare in the given dataset and do not hurt performance per se, we argue that this expressiveness problem should be addressed in future research.
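The ambiguity is easy to reproduce with a toy version of the Section 3 encoding (hypothetical node inventories): the two act sets below yield identical OR-aggregated representations.

```python
# Toy two-domain, two-slot graph, enough to exhibit the ambiguity.
DOMAINS, TYPES, SLOTS = ["restaurant", "hotel"], ["inform"], ["price", "location"]

def encode(domain, act_type, slot):
    v = [0] * (len(DOMAINS) + len(TYPES) + len(SLOTS))
    v[DOMAINS.index(domain)] = 1
    v[len(DOMAINS) + TYPES.index(act_type)] = 1
    v[len(DOMAINS) + len(TYPES) + SLOTS.index(slot)] = 1
    return v

def aggregate(acts):
    out = [0] * (len(DOMAINS) + len(TYPES) + len(SLOTS))
    for a in acts:
        out = [x | y for x, y in zip(out, encode(*a))]
    return out

set_a = [("restaurant", "inform", "price"), ("hotel", "inform", "location")]
set_b = [("restaurant", "inform", "location"), ("hotel", "inform", "price")]
```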

7 Conclusion and Future Work

In this paper, we propose a new semantically controlled neural generation framework to resolve the scalability and generalization problems of existing models. Currently, our method only considers the supervised setting with annotated dialog acts; we have not investigated the situation where such annotation is unavailable. In the future, we intend to infer dialog acts from the annotated responses and use such noisy data to guide response generation.

8 Acknowledgements

We appreciate the efforts of the anonymous reviewers and cherish their valuable comments, which have helped us improve the paper considerably. We are grateful for the support of a Tencent AI Lab Rhino-Bird Gift Fund. We also thank the University of Cambridge and PolyAI for the publicly available dialog dataset.


  • Budzianowski et al. (2018) Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 5016–5026.
  • Chen et al. (2018) Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. 2018. Xl-nbt: A cross-lingual neural belief tracking framework. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 414–424.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dusek and Jurcícek (2016) Ondrej Dusek and Filip Jurcícek. 2016. A context-aware natural language generator for dialogue systems. In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 13-15 September 2016, Los Angeles, CA, USA, pages 185–190.
  • Eric et al. (2017) Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pages 37–49.
  • Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 753–757.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lei et al. (2018) Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1437–1447.
  • Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Çelikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers, pages 733–743.
  • Mrksic et al. (2017a) Nikola Mrksic, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve J. Young. 2017a. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1777–1788.
  • Mrksic et al. (2017b) Nikola Mrksic, Ivan Vulic, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gasic, Anna Korhonen, and Steve J. Young. 2017b. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. TACL, 5:309–324.
  • Nayak et al. (2017) Neha Nayak, Dilek Hakkani-Tür, Marilyn A Walker, and Larry P Heck. 2017. To plan or not to plan? discourse planning in slot-value informed sequence to sequence models for language generation. In INTERSPEECH, pages 3339–3343.
  • Novikova et al. (2017) Jekaterina Novikova, Ondrej Dusek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pages 201–206.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Rojas-Barahona et al. (2017) Lina Maria Rojas-Barahona, Milica Gasic, Nikola Mrksic, Pei-Hao Su, Stefan Ultes, Tsung-Hsien Wen, Steve J. Young, and David Vandyke. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 438–449.
  • Sharma et al. (2016) Shikhar Sharma, Jing He, Kaheer Suleman, Hannes Schulz, and Philip Bachman. 2016. Natural language generation in dialogue using lexicalized and delexicalized data. arXiv preprint arXiv:1606.03632.
  • Shi et al. (2016) Yangyang Shi, Kaisheng Yao, Le Tian, and Daxin Jiang. 2016. Deep LSTM based feature mapping for query classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1501–1511.
  • Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 5027–5038.
  • Tran et al. (2017) Van-Khanh Tran, Le-Minh Nguyen, and Satoshi Tojo. 2017. Neural-based natural language generation in dialogue using RNN encoder-decoder with semantic aggregation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pages 231–240.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Wen et al. (2016a) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve J. Young. 2016a. Conditional generation and snapshot learning in neural dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2153–2162.
  • Wen et al. (2016b) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve J. Young. 2016b. Multi-domain neural network language generation for spoken dialogue systems. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 120–129.
  • Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-hao Su, David Vandyke, and Steve J. Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1711–1721.
  • Wen et al. (2017) Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J. Young. 2017. Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3732–3741.
  • Williams et al. (2016) Jason D. Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. D&D, 7(3):4–33.
  • Williams et al. (2013) Jason D. Williams, Antoine Raux, Deepak Ramachandran, and Alan W. Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, The 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 22-24 August 2013, SUPELEC, Metz, France, pages 404–413.
  • Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
  • Zhong et al. (2018) Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive dialogue state tracker. In ACL.

Appendix A Automatic Evaluation

We demonstrate an example of our automatic evaluation metrics in Figure 8.

Figure 8: Illustration of different evaluation metrics, in the delexicalized and non-delexicalized form.

Appendix B Details of Model Implementation

Here we explain in detail the implementation of the baselines and our proposed HDSA model. On the encoder side, we use a three-layer transformer with an input embedding size of 64 and 4 heads; the query/key/value dimensions are all set to 16. In the output layer, the results of the 4 heads are concatenated into a 64-dimensional vector, which is first projected up to 256 dimensions and then back down to 64 dimensions. By stacking three such layers, we obtain a sequence of 64-dimensional vectors. Following BERT, we use the first symbol as the sentence-wise representation and compute its matching score against every tree node to predict the hierarchical representation of dialog acts.
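The encoder dimensions above can be sanity-checked with a minimal numpy sketch (randomly initialized projections, residual connections and layer norm omitted for brevity; this is an illustration of the stated sizes, not the released implementation):

```python
import numpy as np

d_model, n_heads, d_k, d_ffn, seq_len = 64, 4, 16, 256, 12
rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x):
    """One encoder layer: 4 heads of dim 16, concatenated back to 64,
    then a 64 -> 256 -> 64 position-wise feed-forward network."""
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv            # each (seq_len, 16)
        attn = softmax(q @ k.T / np.sqrt(d_k))      # (seq_len, seq_len)
        heads.append(attn @ v)
    h = np.concatenate(heads, axis=-1)              # back to (seq_len, 64)
    W1 = rng.normal(size=(d_model, d_ffn))          # project up to 256
    W2 = rng.normal(size=(d_ffn, d_model))          # project back to 64
    return np.maximum(h @ W1, 0) @ W2

x = rng.normal(size=(seq_len, d_model))
for _ in range(3):                                  # three stacked layers
    x = encoder_layer(x)
sentence_repr = x[0]                                # first symbol, as in BERT
```

Every layer maps a `(seq_len, 64)` input to a `(seq_len, 64)` output, and the first position serves as the sentence-wise representation used for node matching.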

The decoder takes as input a sequence of features of any length, each of dimension 64. In the first layer, since we have 10 heads, the dimension for each head is 6, so the key and query feature dimensions are fixed to 6; the second layer has per-head dimension 9, and the third has per-head dimension 2. The value feature dimensions are all fixed to 16, as on the encoder side. After self-attention, the position-wise feed-forward network projects each feature back to 64 dimensions, which is further projected onto the 3.1K-word vocabulary to model the word probabilities.
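The core of the disentangled decoder can be sketched as follows: each head is bound to one graph node, and a 0/1 switch gates its output before concatenation, so only activated semantics influence generation (a minimal sketch with random weights, assuming the 10/7/27 node counts of the three graph layers; key/query dims shown per layer, value projections simplified to the per-head dim):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 8, 64

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def disentangled_layer(x, d_head, switches):
    """Self-attention where each head is tied to one graph node and its
    output is multiplied by the node's 0/1 switch."""
    outs = []
    for s in switches:
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d_head))
        outs.append(s * (attn @ (x @ Wv)))       # gate this head's output
    h = np.concatenate(outs, axis=-1)            # (seq_len, n_heads*d_head)
    Wo = rng.normal(size=(h.shape[-1], d_model))
    return h @ Wo                                # back to (seq_len, 64)

x = rng.normal(size=(seq_len, d_model))
dom = [0] * 10; dom[0] = 1                       # domain layer: one node on
fun = [0] * 7;  fun[0] = 1                       # function layer: one node on
slo = [0] * 27; slo[10] = 1                      # slot layer: one node on
h = disentangled_layer(x, 6, dom)                # 10 heads, per-head dim 6
h = disentangled_layer(h, 9, fun)                # 7 heads, per-head dim 9
h = disentangled_layer(h, 2, slo)                # 27 heads, per-head dim 2
```

Activating a different subset of switches at each layer selects a different root-to-leaf route on the graph, which is how combinatorially many dialog act semantics are modeled with a fixed set of heads.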

Appendix C Baseline Implementation

Figure 9 visualizes how we feed the dialog act as an embedding into the transformer to control the sequence generation process.

Figure 9: Illustration of the architecture of Transformer-in.

Appendix D Human Evaluation Interface

To better understand the human evaluation procedure, we demonstrate the user interface in Figure 10.

Figure 10: Illustration of Human Evaluation Interface.

Appendix E Controllability Evaluation

To better understand the results, we depict an example in Figure 11, where 3 different dialog acts are picked as the semantic condition to constrain the response generation.

Figure 11: Illustration of an example in controlling response generation given dialog act condition. Check mark means pass and cross mark means fail.

Appendix F Enumeration of all the Dialog Acts

Here we first enumerate the node semantics of the graph representation as follows:

  1. Domain-Layer 10 choices: ’restaurant’, ’hotel’, ’attraction’, ’train’, ’taxi’, ’hospital’, ’police’, ’bus’, ’booking’, ’general’.

  2. Function-Layer 7 choices: ’inform’, ’request’, ’recommend’, ’book’, ’select’, ’sorry’, ’none’.

  3. Slot-Layer 27 choices: ’pricerange’, ’id’, ’address’, ’postcode’, ’type’, ’food’, ’phone’, ’name’, ’area’, ’choice’, ’price’, ’time’, ’reference’, ’none’, ’parking’, ’stars’, ’internet’, ’day’, ’arriveby’, ’departure’, ’destination’, ’leaveat’, ’duration’, ’trainid’, ’people’, ’department’, ’stay’.

Then we enumerate the entire graph as follows:

Figure 12: Illustration of entire dialog graph.
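As a sanity check on the sizes above (a small sketch, not from the paper's code): predicting the 44 node switches of the three-layer graph covers all 1,890 possible single root-to-leaf acts, which is the compactness the graph representation buys.

```python
# Node counts of the three graph layers enumerated above.
DOMAINS, FUNCTIONS, SLOTS = 10, 7, 27

nodes = DOMAINS + FUNCTIONS + SLOTS      # switches the model predicts
routes = DOMAINS * FUNCTIONS * SLOTS     # possible root-to-leaf acts
```

A flat one-hot scheme would need one output per route (and exponentially many for act combinations), whereas the graph encoding only scales with the number of nodes.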