Addressee and Response Selection in Multi-Party Conversations
with Speaker Interaction RNNs
In this paper, we study the problem of addressee and response selection in multi-party conversations. Understanding multi-party conversations is challenging because of complex speaker interactions: multiple speakers exchange messages with each other, playing different roles (sender, addressee, observer), and these roles vary across turns. To tackle this challenge, we propose the Speaker Interaction Recurrent Neural Network (SI-RNN). Whereas the previous state-of-the-art system updated speaker embeddings only for the sender, SI-RNN uses a novel dialog encoder to update speaker embeddings in a role-sensitive way. Additionally, unlike the previous work that selected the addressee and response separately, SI-RNN selects them jointly by viewing the task as a sequence prediction problem. Experimental results show that SI-RNN significantly improves the accuracy of addressee and response selection, particularly in complex conversations with many speakers and responses to distant messages many turns in the past.
Rui Zhang Yale University firstname.lastname@example.org Honglak Lee University of Michigan email@example.com Lazaros Polymenakos IBM T. J. Watson Research Center firstname.lastname@example.org Dragomir Radev Yale University email@example.com
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Real-world conversations often involve more than two speakers. In the Ubuntu Internet Relay Chat (IRC) channel, for example, one user can initiate a discussion about an Ubuntu-related technical issue, and many other users can work together to solve the problem. Dialogs can have complex speaker interactions: at each turn, users play one of three roles (sender, addressee, observer), and those roles vary across turns.
In this paper, we study the problem of addressee and response selection in multi-party conversations: given a responding speaker and a dialog context, the task is to select an addressee and a response from a set of candidates for the responding speaker. The task requires modeling multi-party conversations and can be directly used to build retrieval-based dialog systems (?; ?; ?; ?).
The previous state-of-the-art Dynamic-RNN model from Ouchi and Tsuboi (2016) maintains speaker embeddings to track each speaker's status, which changes dynamically across time steps. It then produces the context embedding from the speaker embeddings and selects the addressee and response based on embedding similarity. However, this model updates only the sender embedding with the corresponding utterance, not the embeddings of the addressee or observers, and it selects the addressee and response separately. As a result, it models only who says what and fails to capture addressee information. Experimental results show that the separate selection process often produces inconsistent addressee-response pairs.
To solve these issues, we introduce the Speaker Interaction Recurrent Neural Network (SI-RNN). SI-RNN redesigns the dialog encoder by updating speaker embeddings in a role-sensitive way. Speaker embeddings are updated in different GRU-based units depending on their roles (sender, addressee, observer). Furthermore, we note that the addressee and response are mutually dependent and view the task as a joint prediction problem. Therefore, SI-RNN models the conditional probability (of addressee given the response and vice versa) and selects the addressee and response pair by maximizing the joint probability.
On a public standard benchmark data set, SI-RNN significantly improves the addressee and response selection accuracy, particularly in complex conversations with many speakers and responses to distant messages many turns in the past. Our code and data set are available online (released code: https://github.com/ryanzhumich/sirnn).
2 Related Work
We follow a data-driven approach to dialog systems. ? (?), ? (?), and ? (?) optimize the dialog policy using Reinforcement Learning or the Partially Observable Markov Decision Process framework. In addition, ? (?) propose to use a predefined ontology as a logical representation for the information exchanged in the conversation. The dialog system can be divided into different modules, such as Natural Language Understanding (?; ?), Dialog State Tracking (?; ?), and Natural Language Generation (?). Furthermore, ? (?) and ? (?) propose end-to-end trainable goal-oriented dialog systems.
Recently, short text conversation has been popular. The system receives a short dialog context and generates a response using statistical machine translation or sequence-to-sequence networks (?; ?; ?; ?; ?; ?). In contrast to response generation, the retrieval-based approach uses a ranking model to select the highest scoring response from candidates (?; ?; ?; ?). However, these models are single-turn responding machines and thus still are limited to short contexts with only two speakers. As for larger context, ? (?) propose the Next Utterance Classification (NUC) task for multi-turn two-party dialogs. ? (?) extend NUC to multi-party conversations by integrating the addressee detection problem. Since the data is text based, they use only textual information to predict addressees as opposed to relying on acoustic signals or gaze information in multimodal dialog systems (?; ?).
Furthermore, several recent papers focus on modeling role-specific information given the dialogue contexts (?; ?; ?). For example, ? (?) combine content and temporal information to predict the utterance speaker. By contrast, our SI-RNN explicitly utilizes the speaker interaction to maintain speaker embeddings and predicts the addressee and response by joint selection.
3.1 Addressee and Response Selection
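Ouchi and Tsuboi (2016) propose the addressee and response selection task for multi-party conversation. Given a responding speaker $a_{res}$ and a dialog context $\mathcal{C}$, the task is to select a response and an addressee. $\mathcal{C}$ is a list ordered by time step:
$$\mathcal{C} = \left[\left(a_{sender}^{(t)},\, a_{addressee}^{(t)},\, u^{(t)}\right)\right]_{t=1}^{T}$$
where $a_{sender}^{(t)}$ says $u^{(t)}$ to $a_{addressee}^{(t)}$ at time step $t$, and $T$ is the total number of time steps before the response and addressee selection. The set of speakers appearing in $\mathcal{C}$ is denoted $\mathcal{A}(\mathcal{C})$. As for the output, the addressee is selected from $\mathcal{A}(\mathcal{C})$, and the response is selected from a set of candidates $\mathcal{R}$. Here, $\mathcal{R}$ contains the ground-truth response and one or more false responses. We provide some examples in Table 4 (Section 6).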
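|Sender ID at time $t$||$a_{sender}^{(t)}$|
|Addressee ID at time $t$||$a_{addressee}^{(t)}$|
|Utterance at time $t$||$u^{(t)}$|
|Utterance embedding at time $t$||$\mathbf{u}^{(t)}$|
|Speaker embedding of $a_i$ at time $t$||$\mathbf{a}_i^{(t)}$|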
3.2 Dynamic-RNN Model
In this section, we briefly review the state-of-the-art Dynamic-RNN model (Ouchi and Tsuboi 2016), on which our proposed model is based. Dynamic-RNN solves the task in two phases: 1) the dialog encoder maintains a set of speaker embeddings to track each speaker's status, which changes dynamically with time step $t$; 2) Dynamic-RNN then produces the context embedding from the speaker embeddings and selects the addressee and response based on embedding similarity among context, speaker, and utterance.
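Figure 1 (Left) illustrates the dialog encoder in Dynamic-RNN on an example context. In this example, $a_2$ says $u^{(1)}$ to $a_1$, then $a_1$ says $u^{(2)}$ to $a_3$, and finally $a_3$ says $u^{(3)}$ to $a_2$. The context $\mathcal{C}$ will be:
$$\mathcal{C} = \left[(a_2, a_1, u^{(1)}),\ (a_1, a_3, u^{(2)}),\ (a_3, a_2, u^{(3)})\right] \tag{1}$$
with the set of speakers $\mathcal{A}(\mathcal{C}) = \{a_1, a_2, a_3\}$.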
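For a speaker $a_i$, the bold letter $\mathbf{a}_i^{(t)}$ denotes its embedding at time step $t$. Speaker embeddings are initialized as zero vectors and updated recurrently as hidden states of GRUs (Cho et al. 2014; Chung et al. 2014). Specifically, for each time step $t$ with the sender $a_{sdr}$ and the utterance $u^{(t)}$, the sender embedding $\mathbf{a}_{sdr}$ is updated recurrently from the utterance:
$$\mathbf{a}_{sdr}^{(t)} = \mathrm{GRU}\left(\mathbf{a}_{sdr}^{(t-1)},\, \mathbf{u}^{(t)}\right)$$
where $\mathbf{u}^{(t)}$ is the embedding for utterance $u^{(t)}$. Other speaker embeddings are updated from $\mathbf{u}^{(t)} = \mathbf{0}$. The speaker embeddings are updated until time step $T$.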
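To summarize the whole dialog context $\mathcal{C}$, the model applies element-wise max pooling over all the speaker embeddings to get the context embedding $\mathbf{h}_{\mathcal{C}}$:
$$\mathbf{h}_{\mathcal{C}} = \max_{a_i \in \mathcal{A}(\mathcal{C})} \mathbf{a}_i^{(T)} \tag{2}$$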
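The probability of an addressee and a response being the ground truth is calculated based on embedding similarity. To be specific, for addressee selection, the model compares the candidate speaker $a_p$, the dialog context $\mathcal{C}$, and the responding speaker $a_{res}$:
$$P(a_p \mid \mathcal{C}) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}]^{\top}\, \mathbf{W}_a\, \mathbf{a}_p\right) \tag{3}$$
where $\mathbf{a}_{res}$ is the final speaker embedding for the responding speaker $a_{res}$, $\mathbf{a}_p$ is the final speaker embedding for the candidate addressee $a_p$, $\mathbf{h}_{\mathcal{C}}$ is the context embedding, $\sigma$ is the logistic sigmoid function, $[\,;\,]$ is the row-wise concatenation operator, and $\mathbf{W}_a$ is a learnable parameter. Similarly, for response selection,
$$P(r_q \mid \mathcal{C}) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}]^{\top}\, \mathbf{W}_r\, \mathbf{r}_q\right) \tag{4}$$
where $\mathbf{r}_q$ is the embedding for the candidate response $r_q$, and $\mathbf{W}_r$ is a learnable parameter.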
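The separate selection step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`max_pool`, `bilinear_score`, `select_separately`) are hypothetical, vectors are plain lists, the weight matrices are passed in rather than learned, and excluding the responding speaker from the addressee candidates is an assumption of this sketch rather than something stated in the text.

```python
import math


def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors (context embedding)."""
    return [max(column) for column in zip(*vectors)]


def bilinear_score(left, weight, right):
    """sigmoid(left^T W right), where W has len(left) rows and len(right) columns."""
    total = 0.0
    for i, l_i in enumerate(left):
        for j, r_j in enumerate(right):
            total += l_i * weight[i][j] * r_j
    return 1.0 / (1.0 + math.exp(-total))


def select_separately(speaker_emb, responder, cand_resp_emb, w_a, w_r):
    """Pick the addressee and the response independently of each other.

    speaker_emb: dict mapping speaker id -> final speaker embedding.
    cand_resp_emb: list of candidate response embeddings.
    Returns (best addressee id, index of best response).
    """
    h_c = max_pool(list(speaker_emb.values()))
    query = speaker_emb[responder] + h_c  # list concatenation: [a_res; h_C]
    # Assumption of this sketch: the responder is not a candidate addressee.
    adr_scores = {s: bilinear_score(query, w_a, e)
                  for s, e in speaker_emb.items() if s != responder}
    res_scores = [bilinear_score(query, w_r, r) for r in cand_resp_emb]
    best_adr = max(adr_scores, key=adr_scores.get)
    best_res = max(range(len(res_scores)), key=res_scores.__getitem__)
    return best_adr, best_res
```

Because the two `max` calls never look at each other's result, nothing prevents the chosen response from being one that fits a different addressee, which is the inconsistency discussed in the next section.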
4 Speaker Interaction RNN
While Dynamic-RNN can track the speaker status by capturing who says what in multi-party conversation, there are still some issues. First, at each time step, only the sender embedding is updated from the utterance. Therefore, other speakers are blind to what is being said, and the model fails to capture addressee information. Second, while the addressee and response are mutually dependent, Dynamic-RNN selects them independently. Consider a case where the responding speaker is talking to two other speakers in separate conversation threads. The choice of addressee is likely to be either of the two speakers, but the choice is much less ambiguous if the correct response is given, and vice versa. Dynamic-RNN often produces inconsistent addressee-response pairs due to the separate selection. See Table 4 for examples.
In contrast to Dynamic-RNN, the dialog encoder in SI-RNN updates the embeddings of all speakers, not only the sender, at each time step. Speaker embeddings are updated depending on their roles: the update of the sender differs from that of the addressee, which in turn differs from that of the observers. Furthermore, a speaker embedding is updated not only from the utterance but also from other speakers. This is achieved by designing variations of GRUs for different roles. Finally, SI-RNN selects the addressee and response jointly by maximizing the joint probability.
4.1 Utterance Encoder
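To encode an utterance $u = (w_1, w_2, \ldots, w_N)$ of $N$ words, we use an RNN with Gated Recurrent Units (Cho et al. 2014; Chung et al. 2014):
$$\mathbf{h}_j = \mathrm{GRU}\left(\mathbf{h}_{j-1},\, \mathbf{x}_j\right)$$
where $\mathbf{x}_j$ is the word embedding for $w_j$, and $\mathbf{h}_j$ is the hidden state. $\mathbf{h}_0$ is initialized as a zero vector, and the utterance embedding is the last hidden state, i.e. $\mathbf{u} = \mathbf{h}_N$.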
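For readers who want the recurrence spelled out, one common formulation of a GRU step is given below. It is a sketch of the standard unit, not necessarily the exact variant used in the experiments: bias terms are omitted, the parameter names $\mathbf{W}_z, \mathbf{U}_z, \mathbf{W}_r, \mathbf{U}_r, \mathbf{W}, \mathbf{U}$ are introduced here for illustration, and some references swap the roles of $\mathbf{z}_j$ and $1 - \mathbf{z}_j$ in the last line.

```latex
\begin{aligned}
\mathbf{z}_j &= \sigma\left(\mathbf{W}_z \mathbf{x}_j + \mathbf{U}_z \mathbf{h}_{j-1}\right) \\
\mathbf{r}_j &= \sigma\left(\mathbf{W}_r \mathbf{x}_j + \mathbf{U}_r \mathbf{h}_{j-1}\right) \\
\tilde{\mathbf{h}}_j &= \tanh\left(\mathbf{W} \mathbf{x}_j + \mathbf{U}\left(\mathbf{r}_j \odot \mathbf{h}_{j-1}\right)\right) \\
\mathbf{h}_j &= \mathbf{z}_j \odot \mathbf{h}_{j-1} + \left(1 - \mathbf{z}_j\right) \odot \tilde{\mathbf{h}}_j
\end{aligned}
```

Here $\odot$ is element-wise multiplication; the reset gate $\mathbf{r}_j$ decides how much of the previous state feeds the proposal, and the update gate $\mathbf{z}_j$ interpolates between the previous state and the proposal.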
4.2 Dialog Encoder
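Figure 1 (Right) shows how SI-RNN encodes the example in Eq 1. Unlike Dynamic-RNN, SI-RNN updates all speaker embeddings in a role-sensitive manner. For example, at the first time step when $a_2$ says $u^{(1)}$ to $a_1$, Dynamic-RNN only updates $\mathbf{a}_2^{(1)}$ using $\mathbf{u}^{(1)}$, while other speakers are updated using $\mathbf{0}$. In contrast, SI-RNN updates each speaker's status with different units: $\mathit{IGRU}^S$ updates the sender embedding $\mathbf{a}_2^{(1)}$ from the utterance embedding $\mathbf{u}^{(1)}$ and the addressee embedding $\mathbf{a}_1^{(0)}$; $\mathit{IGRU}^A$ updates the addressee embedding $\mathbf{a}_1^{(1)}$ from $\mathbf{u}^{(1)}$ and $\mathbf{a}_2^{(0)}$; $\mathit{GRU}^O$ updates the observer embedding $\mathbf{a}_3^{(1)}$ from $\mathbf{u}^{(1)}$.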
Algorithm 1 gives a formal definition of the dialog encoder in SI-RNN. The dialog encoder is a function that takes as input a dialog context $\mathcal{C}$ (lines 1-5) and returns speaker embeddings at the final time step (lines 28-30). Speaker embeddings are initialized as $d$-dimensional zero vectors, where $d$ is the speaker embedding size (lines 6-9). Speaker embeddings are updated by iterating over each line in the context (lines 10-27).
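The control flow of such a role-sensitive encoder can be sketched as follows. This is an illustrative outline rather than the released implementation: `encode_dialog` and the three `update_*` callables are hypothetical names, the callables stand in for the role-specific recurrent units, and the sketch assumes every context line has an explicit addressee who differs from the sender.

```python
def encode_dialog(context, dim, update_sender, update_addressee, update_observer):
    """Return the final embedding of every speaker appearing in `context`.

    context: list of (sender_id, addressee_id, utterance_embedding) tuples,
        ordered by time step.
    dim: size of each speaker embedding.
    update_sender(own_prev, addressee_prev, u) -> new sender embedding
    update_addressee(own_prev, sender_prev, u) -> new addressee embedding
    update_observer(own_prev, u) -> new observer embedding
    """
    speakers = {s for sender, addressee, _ in context for s in (sender, addressee)}
    emb = {s: [0.0] * dim for s in speakers}  # zero initialization

    for sender, addressee, u in context:
        prev = emb  # all updates at step t read the embeddings from step t-1
        nxt = {}
        for s in speakers:
            if s == sender:
                nxt[s] = update_sender(prev[s], prev[addressee], u)
            elif s == addressee:
                nxt[s] = update_addressee(prev[s], prev[sender], u)
            else:
                nxt[s] = update_observer(prev[s], u)
        emb = nxt
    return emb
```

Building the new dictionary `nxt` before replacing `emb` keeps the sender and addressee updates symmetric: each reads the other's embedding from the previous time step, not a half-updated one.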
4.3 Role-Sensitive Update
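In this subsection, we explain in detail how $\mathit{IGRU}^S$/$\mathit{IGRU}^A$/$\mathit{GRU}^O$ update speaker embeddings according to their roles at each time step (Algorithm 1 lines 19-26).
As shown in Figure 2, $\mathit{IGRU}^S$/$\mathit{IGRU}^A$/$\mathit{GRU}^O$ are all GRU-based units. $\mathit{IGRU}^S$ updates the sender embedding from the previous sender embedding $\mathbf{a}_{sdr}^{(t-1)}$, the previous addressee embedding $\mathbf{a}_{adr}^{(t-1)}$, and the utterance embedding $\mathbf{u}^{(t)}$:
$$\mathbf{a}_{sdr}^{(t)} \leftarrow \mathit{IGRU}^S\left(\mathbf{a}_{sdr}^{(t-1)},\, \mathbf{a}_{adr}^{(t-1)},\, \mathbf{u}^{(t)}\right)$$
The update, as illustrated in the upper part of Figure 2, is controlled by three gates. The $\mathbf{r}_S^{(t)}$ gate controls the previous sender embedding $\mathbf{a}_{sdr}^{(t-1)}$, and $\mathbf{p}_S^{(t)}$ controls the previous addressee embedding $\mathbf{a}_{adr}^{(t-1)}$. Those two gated interactions together produce the sender embedding proposal $\tilde{\mathbf{a}}_{sdr}^{(t)}$. Finally, the update gate $\mathbf{z}_S^{(t)}$ combines the proposal $\tilde{\mathbf{a}}_{sdr}^{(t)}$ and the previous sender embedding $\mathbf{a}_{sdr}^{(t-1)}$ to update the sender embedding $\mathbf{a}_{sdr}^{(t)}$. The computations in $\mathit{IGRU}^S$ (including gates $\mathbf{r}_S^{(t)}$, $\mathbf{p}_S^{(t)}$, $\mathbf{z}_S^{(t)}$, the proposal embedding $\tilde{\mathbf{a}}_{sdr}^{(t)}$, and the final updated embedding $\mathbf{a}_{sdr}^{(t)}$) are formulated as:
$$\begin{aligned}
\mathbf{r}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{r}\mathbf{u}^{(t)} + \mathbf{U}_S^{r}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{r}\mathbf{a}_{adr}^{(t-1)}\right) \\
\mathbf{p}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{p}\mathbf{u}^{(t)} + \mathbf{U}_S^{p}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{p}\mathbf{a}_{adr}^{(t-1)}\right) \\
\mathbf{z}_S^{(t)} &= \sigma\left(\mathbf{W}_S^{z}\mathbf{u}^{(t)} + \mathbf{U}_S^{z}\mathbf{a}_{sdr}^{(t-1)} + \mathbf{V}_S^{z}\mathbf{a}_{adr}^{(t-1)}\right) \\
\tilde{\mathbf{a}}_{sdr}^{(t)} &= \tanh\left(\mathbf{W}_S\mathbf{u}^{(t)} + \mathbf{U}_S\left(\mathbf{r}_S^{(t)} \odot \mathbf{a}_{sdr}^{(t-1)}\right) + \mathbf{V}_S\left(\mathbf{p}_S^{(t)} \odot \mathbf{a}_{adr}^{(t-1)}\right)\right) \\
\mathbf{a}_{sdr}^{(t)} &= \mathbf{z}_S^{(t)} \odot \mathbf{a}_{sdr}^{(t-1)} + \left(1 - \mathbf{z}_S^{(t)}\right) \odot \tilde{\mathbf{a}}_{sdr}^{(t)}
\end{aligned}$$
where $\{\mathbf{W}_S^{r}, \mathbf{W}_S^{p}, \mathbf{W}_S^{z}, \mathbf{U}_S^{r}, \mathbf{U}_S^{p}, \mathbf{U}_S^{z}, \mathbf{V}_S^{r}, \mathbf{V}_S^{p}, \mathbf{V}_S^{z}, \mathbf{W}_S, \mathbf{U}_S, \mathbf{V}_S\}$ are learnable parameters. $\mathit{IGRU}^A$ uses the same formulation with a different set of parameters, as illustrated in the middle of Figure 2. In addition, we update the observer embeddings from the utterance. $\mathit{GRU}^O$ is implemented as the traditional GRU unit in the lower part of Figure 2. Note that the parameters in $\mathit{IGRU}^S$/$\mathit{IGRU}^A$/$\mathit{GRU}^O$ are not shared. This allows SI-RNN to learn role-dependent features to control speaker embedding updates. The formulations of $\mathit{IGRU}^A$ and $\mathit{GRU}^O$ are similar.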
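A three-gate interaction unit of this kind can be sketched in dependency-free Python as below. It is a minimal illustration, not the released code: the names `init_params` and `igru_step`, the parameter keys (`Wr`, `Ur`, `Vr`, and so on), and the omission of bias terms are all assumptions of this sketch, and the final interpolation follows the common GRU convention of mixing the previous state with the proposal through the update gate.

```python
import math
import random


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def hadamard(a, b):
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def init_params(dim, utt_dim, seed=0):
    """Random W (utterance), U (own state), V (other state) per gate/proposal."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for name in ("r", "p", "z", "h"):  # "h" holds the proposal weights
        params["W" + name] = mat(dim, utt_dim)
        params["U" + name] = mat(dim, dim)
        params["V" + name] = mat(dim, dim)
    return params


def igru_step(own_prev, other_prev, u, p):
    """One update of a speaker embedding from its own previous state,
    the interacting speaker's previous state, and the utterance embedding."""
    def pre(name, own, other):
        return add(matvec(p["W" + name], u),
                   matvec(p["U" + name], own),
                   matvec(p["V" + name], other))

    r = sigmoid(pre("r", own_prev, other_prev))   # gate on own previous state
    q = sigmoid(pre("p", own_prev, other_prev))   # gate on the other speaker
    z = sigmoid(pre("z", own_prev, other_prev))   # update gate
    proposal = [math.tanh(x)
                for x in pre("h", hadamard(r, own_prev), hadamard(q, other_prev))]
    return [zi * o + (1.0 - zi) * c for zi, o, c in zip(z, own_prev, proposal)]
```

Using one parameter dictionary for the sender and a separately initialized one for the addressee gives two units with the same formulation but unshared weights; passing the same dictionary to both corresponds to the shared-parameter ablation.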
4.4 Joint Selection
The dialog encoder takes the dialog context $\mathcal{C}$ as input and returns speaker embeddings at the final time step, $\mathbf{a}_i^{(T)}$. Recall from Section 3.2 that Dynamic-RNN produces the context embedding $\mathbf{h}_{\mathcal{C}}$ using Eq 2 and then selects the addressee and response separately using Eq 3 and Eq 4.
In contrast, SI-RNN performs addressee and response selection jointly: the response is dependent on the addressee and vice versa. Therefore, we view the task as a sequence prediction process: given the context and responding speaker, we first predict the addressee, and then predict the response given the addressee. (We also use the reversed prediction order as in Eq 7.)
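In addition to Eq 3 and Eq 4, SI-RNN is also trained to model the conditional probability as follows. To predict the addressee, we calculate the probability of the candidate speaker $a_p$ to be the ground truth given the ground-truth response $r$ (available during training time):
$$P(a_p \mid \mathcal{C}, r) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}; \mathbf{r}]^{\top}\, \mathbf{W}_{ar}\, \mathbf{a}_p\right) \tag{5}$$
The key difference from Eq 3 is that Eq 5 is conditioned on the correct response $r$ with embedding $\mathbf{r}$. Similarly, for response selection, we calculate the probability of a candidate response $r_q$ given the ground-truth addressee $a_{adr}$:
$$P(r_q \mid \mathcal{C}, a_{adr}) = \sigma\left([\mathbf{a}_{res}; \mathbf{h}_{\mathcal{C}}; \mathbf{a}_{adr}]^{\top}\, \mathbf{W}_{ra}\, \mathbf{r}_q\right) \tag{6}$$
At test time, SI-RNN selects the addressee-response pair from $\mathcal{A}(\mathcal{C}) \times \mathcal{R}$ to maximize the joint probability $P(r_q, a_p \mid \mathcal{C})$:
$$\hat{a}, \hat{r} = \operatorname*{argmax}_{(a_p, r_q) \in \mathcal{A}(\mathcal{C}) \times \mathcal{R}} P(r_q, a_p \mid \mathcal{C}) = \operatorname*{argmax}_{(a_p, r_q) \in \mathcal{A}(\mathcal{C}) \times \mathcal{R}} \left[ P(r_q \mid \mathcal{C}) \cdot P(a_p \mid \mathcal{C}, r_q) + P(a_p \mid \mathcal{C}) \cdot P(r_q \mid \mathcal{C}, a_p) \right] \tag{7}$$
In Eq 7, we decompose the joint probability into two terms: the first term selects the response given the context, and then selects the addressee given the context and the selected response; the second term selects the addressee and response in the reversed order. (Footnote 2: We also considered an alternative decomposition of the joint probability as $\log P(r_q, a_p \mid \mathcal{C}) = \frac{1}{2}\left[\log P(r_q \mid \mathcal{C}) + \log P(a_p \mid \mathcal{C}, r_q) + \log P(a_p \mid \mathcal{C}) + \log P(r_q \mid \mathcal{C}, a_p)\right]$, but the performance was similar to Eq 7.)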
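The test-time search over addressee-response pairs can be illustrated with a short sketch. It assumes the four probability tables have already been computed by the model; the function name `select_jointly` and the dictionary layout (keys `(addressee, response)` and `(response, addressee)` for the conditional tables) are hypothetical choices made for this example.

```python
def select_jointly(p_adr, p_res, p_adr_given_res, p_res_given_adr):
    """Return the (addressee, response) pair with the highest two-term score.

    p_adr[a]                 ~ P(a | C)
    p_res[r]                 ~ P(r | C)
    p_adr_given_res[(a, r)]  ~ P(a | C, r)
    p_res_given_adr[(r, a)]  ~ P(r | C, a)
    """
    best_pair, best_score = None, float("-inf")
    for a, pa in p_adr.items():
        for r, pr in p_res.items():
            # response first, then addressee  +  addressee first, then response
            score = pr * p_adr_given_res[(a, r)] + pa * p_res_given_adr[(r, a)]
            if score > best_score:
                best_pair, best_score = (a, r), score
    return best_pair, best_score


# Toy numbers: picked separately, "x" and "r1" each win on their own,
# but the conditional terms say that "r1" fits "y" and "r2" fits "x".
p_adr = {"x": 0.6, "y": 0.4}
p_res = {"r1": 0.55, "r2": 0.45}
p_adr_given_res = {("x", "r1"): 0.1, ("y", "r1"): 0.9,
                   ("x", "r2"): 0.9, ("y", "r2"): 0.1}
p_res_given_adr = {("r1", "x"): 0.1, ("r2", "x"): 0.9,
                   ("r1", "y"): 0.9, ("r2", "y"): 0.1}
pair, score = select_jointly(p_adr, p_res, p_adr_given_res, p_res_given_adr)
```

In this toy setting the joint search returns the consistent pair `("x", "r2")`, whereas independent maximization would have paired `"x"` with `"r1"`. The search costs one score per pair, so it grows with the number of speakers times the number of candidate responses.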
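Table 2 (ablation rows; accuracy in %):
|Model||T||RES-CAND = 2: DEV ADR-RES||TEST ADR-RES||TEST ADR||TEST RES||RES-CAND = 10: DEV ADR-RES||TEST ADR-RES||TEST ADR||TEST RES|
|SI-RNN w/ shared IGRUs||15||59.50||59.47||74.20||78.08||28.31||28.45||73.35||36.00|
|SI-RNN w/o joint selection||15||63.13||63.40||77.56||80.38||32.24||32.53||77.61||39.73|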
Table 3 (data statistics):
| ||Total||Train||Dev||Test|
|Adr Mention Freq||-||0.32||0.34||0.34|
|# Speakers / Doc||26.8||26.3||30.7||32.1|
|# Utters / Doc||326.3||317.9||360.8||396.1|
|# Words / Utter||11.1||11.1||11.2||11.3|
5 Experimental Setup
We use the Ubuntu Multiparty Conversation Corpus (Ouchi and Tsuboi 2016) and summarize the data statistics in Table 3.
The whole data set (including the Train/Dev/Test split and the false response candidates) is publicly available at https://github.com/hiroki13/response-ranking/tree/master/data/input.
The data set is built from the Ubuntu IRC chat room where a number of users discuss Ubuntu-related technical issues.
The log is organized as one file per day, each corresponding to a document $\mathcal{D}$.
Each document consists of (Time, SenderID, Utterance) lines.
If users explicitly mention addressees at the beginning of the utterance, the addresseeID is extracted.
Then a sample, namely a unit of input (the dialog context and the current sender) and output (the addressee and response prediction) for the task, is created to predict the ground-truth addressee and response of this line.
Note that samples are created only when the addressee is explicitly mentioned for clear, unambiguous ground-truth labels.
False response candidates are randomly chosen from all other utterances within the same document.
Therefore, distractors are likely from the same sub-conversation or even from the same sender but at different time steps.
This makes the task harder than that of Lowe et al. (2015), where distractors are randomly chosen from all documents.
If no addressee is explicitly mentioned, the addressee is left blank and the line is marked as a part of the context.
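To make the labeling rule concrete, the sketch below shows one way an explicit addressee mention at the start of an utterance could be detected. The exact extraction rule used to build the corpus is not specified here, so everything in this block is an assumption: the helper name `split_addressee`, the `name:` or `name,` pattern, and the check against a set of known speaker IDs.

```python
import re

# Hypothetical pattern: a leading token followed by ":" or "," and whitespace.
MENTION = re.compile(r"^\s*([^\s:,]+)[:,]\s+(.*)$")


def split_addressee(utterance, known_speakers):
    """Return (addressee_id, remaining_text), or (None, utterance) if the
    utterance does not start with a mention of a known speaker."""
    match = MENTION.match(utterance)
    if match and match.group(1) in known_speakers:
        return match.group(1), match.group(2)
    return None, utterance


speakers = {"alice", "bob"}
labeled = split_addressee("alice: try sudo apt update", speakers)
unlabeled = split_addressee("note: this is not addressed to a user", speakers)
```

Under this sketch, a line like the first would yield a sample with a ground-truth addressee, while a line like the second would have its addressee left blank and serve only as context.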
Baselines. Apart from Dynamic-RNN, we also include several other baselines. Recent+TF-IDF always selects the most recent speaker (except the responding speaker $a_{res}$) as the addressee and chooses the response to maximize the tf-idf cosine similarity with the context. We improve it by using a slightly different addressee selection heuristic (Direct-Recent+TF-IDF): select the most recent speaker that directly talks to $a_{res}$ by an explicit addressee mention. We select from the previous 15 utterances, which is the longest context among all the experiments. This works much better when there are multiple concurrent sub-conversations and $a_{res}$ responds to a distant message in the context. We also include another GRU-based model, Static-RNN, from Ouchi and Tsuboi (2016). Unlike Dynamic-RNN, speaker embeddings in Static-RNN are based on the order of speakers and are fixed. Furthermore, inspired by ? (?) and ? (?), we implement Static-Hier-RNN, a hierarchical version of Static-RNN. It first builds utterance embeddings from words and then uses high-level RNNs to process utterance embeddings.
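The two addressee heuristics can be sketched as follows. The function names are hypothetical, the context is assumed to be a list of `(sender, addressee, utterance)` tuples ordered by time, and falling back to the most recent speaker when nobody has addressed the responder within the window is an assumption of this sketch, since the text does not say what happens in that case.

```python
def recent_addressee(context, responder):
    """Most recent speaker other than the responder, or None."""
    for sender, _addressee, _utterance in reversed(context):
        if sender != responder:
            return sender
    return None


def direct_recent_addressee(context, responder, window=15):
    """Most recent speaker who explicitly addressed the responder
    within the last `window` utterances."""
    for sender, addressee, _utterance in reversed(context[-window:]):
        if sender != responder and addressee == responder:
            return sender
    # Assumed fallback: behave like the plain most-recent-speaker heuristic.
    return recent_addressee(context, responder)


context = [
    ("a1", "a2", "how do I mount this drive?"),
    ("a2", "a1", "which file system is it?"),
    ("a3", "a4", "anyone seen this kernel panic?"),
    ("a4", "a3", "paste the log please"),
]
```

With `a2` as the responding speaker, the plain heuristic picks `a4` (the latest speaker, from an unrelated thread), while the direct variant picks `a1`, who actually addressed `a2` earlier.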
Implementation Details. For a fair comparison, we follow the hyperparameters from Ouchi and Tsuboi (2016), which are chosen based on the validation data set. We take a maximum of 20 words for each utterance. We use 300-dimensional GloVe word vectors (http://nlp.stanford.edu/projects/glove/), which are fixed during training. SI-RNN uses 50-dimensional vectors for both speaker embeddings and hidden states. Model parameters are initialized with a uniform distribution between -0.01 and 0.01. We set the mini-batch size to 128. The joint cross-entropy loss function with 0.001 L2 weight decay is minimized by Adam (Kingma and Ba 2015). The training is stopped early if the validation accuracy is not improved for 5 consecutive epochs. All experiments are performed on a single GTX Titan X GPU. The maximum number of epochs is 30, and most models converge within 10 epochs.
6 Results and Discussion
For fair and meaningful quantitative comparisons, we follow the evaluation protocols of Ouchi and Tsuboi (2016).
SI-RNN improves the overall accuracy on the addressee and response selection task.
Two ablation experiments further analyze the contribution of role-sensitive units and joint selection respectively.
We then confirm the robustness of SI-RNN with the number of speakers and distant responses.
Finally, in a case study we discuss how SI-RNN handles complex conversations by either engaging in a new sub-conversation or responding to a distant message.
Overall Result. As shown in Table 2, SI-RNN significantly improves upon the previous state-of-the-art. In particular, addressee selection (ADR) benefits most across different numbers of candidate responses (denoted as RES-CAND): around 12% in RES-CAND = 2 and more than 10% in RES-CAND = 10. Response selection (RES) is also improved, suggesting role-sensitive GRUs and joint selection are helpful for response selection as well. The improvement is more obvious with more candidate responses (2% in RES-CAND = 2 and 4% in RES-CAND = 10). These together result in significantly better accuracy on the ADR-RES metric as well.
Ablation Study. We show an ablation study in the last rows of Table 2. First, we share the parameters of $\mathit{IGRU}^S$/$\mathit{IGRU}^A$/$\mathit{GRU}^O$. The accuracy decreases significantly, indicating that it is crucial to learn role-sensitive units to update speaker embeddings. Second, to examine our joint selection, we fall back to selecting the addressee and response separately, as in Dynamic-RNN. We find that joint selection improves ADR and RES individually, and it is particularly helpful for pair selection ADR-RES.
Number of Speakers.
A large number of speakers creates complex dialogs and more candidate addressees, so the task becomes more challenging.
In Figure 3 (Upper), we investigate how ADR accuracy changes with the number of speakers in a context of length 15, corresponding to the rows with $T = 15$ in Table 2.
Recent+TF-IDF always chooses the most recent speaker, and its accuracy drops dramatically as the number of speakers increases.
Direct-Recent+TF-IDF shows better performance, and Dynamic-RNN is marginally better.
SI-RNN is much more robust and remains above 70% accuracy across all bins.
The advantage is more obvious for bins with more speakers.
Addressing Distance. Addressing distance is the time difference from the responding speaker to the ground-truth addressee. As the histogram in Figure 3 (Lower) shows, while the majority of responses target the most recent speaker, many responses go back five or more time steps. It is important to note that for those distant responses, Dynamic-RNN sees a clear performance decrease, even worse than Direct-Recent+TF-IDF. In contrast, SI-RNN handles distant responses much more accurately.
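One way to compute such a distance for a sample is sketched below. It reads the definition as the number of time steps between the response and the addressee's most recent utterance in the context; that reading, the function name `addressing_distance`, and the `(sender, addressee, utterance)` tuple layout are assumptions of this sketch.

```python
def addressing_distance(context, addressee):
    """Number of time steps back from the response to the addressee's most
    recent utterance (1 = the addressee spoke last), or None if the
    addressee never speaks in the context."""
    for steps_back, (sender, _adr, _utt) in enumerate(reversed(context), start=1):
        if sender == addressee:
            return steps_back
    return None


context = [
    ("a1", "a2", "u1"),
    ("a3", "a1", "u2"),
    ("a2", "a3", "u3"),
]
```

Here a response addressed to `a2` has distance 1, while a response addressed to `a1` has distance 3, which would fall toward the distant end of the histogram.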
Case Study. Examples in Table 4 show how SI-RNN can handle complex multi-party conversations by selecting from 10 candidate responses. In both examples, the responding speakers participate in two or more concurrent sub-conversations with other speakers.
Example (a) demonstrates the ability of SI-RNN to engage in a new sub-conversation. The responding speaker “wafflejock” is originally involved in two sub-conversations: sub-conversation 1 with “codepython”, and the Ubuntu installation issue with “theoletom”. While it is reasonable to address “codepython” or “theoletom”, the responses from the other baselines do not help solve the corresponding issues. TF-IDF prefers the response with the “install” keyword, yet the response is repetitive and not helpful. Dynamic-RNN selects an irrelevant response to “codepython”. SI-RNN chooses to engage in a new sub-conversation by suggesting a solution to “releaf” about Ubuntu dedicated laptops.
Example (b) shows the advantage of SI-RNN in responding to a distant message. The responding speaker “nicomachus” is actively engaged with “VeryBewitching” in sub-conversation 1 and is also loosely involved in sub-conversation 2: “chingao” mentions “nicomachus” in the most recent utterance. SI-RNN remembers the distant sub-conversation 1 and responds to “VeryBewitching” with a detailed answer. Direct-Recent+TF-IDF selects the ground-truth addressee because “VeryBewitching” talks to “nicomachus”, but the response is not helpful. Dynamic-RNN is biased toward the recent speaker “chingao”, yet the response is not relevant.
SI-RNN jointly models who says what to whom by updating speaker embeddings in a role-sensitive way. It provides state-of-the-art addressee and response selection, which can directly benefit retrieval-based dialog systems. In future work, we plan to use SI-RNN to extract sub-conversations from unlabeled conversation corpora and provide a large-scale disentangled multi-party conversation data set.
We thank the members of the UMichigan-IBM Sapphire Project and all the reviewers for their helpful feedback. This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM.
-  Bordes, A., and Weston, J. 2017. Learning end-to-end goal-oriented dialog. In ICLR.
-  Chen, P.-C.; Chi, T.-C.; Su, S.-Y.; and Chen, Y.-N. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In ASRU.
-  Chi, T.-C.; Chen, P.-C.; Su, S.-Y.; and Chen, Y.-N. 2017. Speaker role contextual modeling for language understanding and dialogue policy learning. In IJCNLP.
-  Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP.
-  Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS 2014 Deep Learning and Representation Learning Workshop.
-  Henderson, J.; Lemon, O.; and Georgila, K. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. Computational Linguistics 34(4):487–511.
-  Henderson, M.; Thomson, B.; and Williams, J. 2014. The second dialog state tracking challenge. In SIGDIAL.
-  Henderson, M.; Thomson, B.; and Young, S. 2014. Word-based dialog state tracking with recurrent neural networks. In SIGDIAL.
-  Hu, B.; Lu, Z.; Li, H.; and Chen, Q. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS.
-  Ji, Z.; Lu, Z.; and Li, H. 2014. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988.
-  Jovanović, N.; Akker, R. o. d.; and Nijholt, A. 2006. Addressee identification in face-to-face meetings. In EACL.
-  Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. International Conference for Learning Representations (ICLR).
-  Li, J.; Galley, M.; Brockett, C.; Spithourakis, G.; Gao, J.; and Dolan, B. 2016. A persona-based neural conversation model. In ACL.
-  Lowe, R.; Pow, N.; Serban, I.; and Pineau, J. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
-  Lu, Z., and Li, H. 2013. A deep architecture for matching short texts. In NIPS.
-  Mei, H.; Bansal, M.; and Walter, M. R. 2017. Coherent dialogue with attention-based language models. In AAAI.
-  Meng, Z.; Mou, L.; and Jin, Z. 2017. Towards neural speaker modeling in multi-party conversation: The task, dataset, and models. arXiv preprint arXiv:1708.03152.
-  Mesnil, G.; Dauphin, Y.; Yao, K.; Bengio, Y.; Deng, L.; Hakkani-Tur, D.; He, X.; Heck, L.; Tur, G.; Yu, D.; et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. Audio, Speech, and Language Processing, IEEE/ACM Transactions on 23(3):530–539.
-  op den Akker, R., and Traum, D. 2009. A comparison of addressee detection methods for multiparty conversations. Workshop on the Semantics and Pragmatics of Dialogue.
-  Ouchi, H., and Tsuboi, Y. 2016. Addressee and response selection for multi-party conversation. In EMNLP.
-  Ritter, A.; Cherry, C.; and Dolan, W. B. 2011. Data-driven response generation in social media. In EMNLP.
-  Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
-  Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL.
-  Singh, S. P.; Kearns, M. J.; Litman, D. J.; and Walker, M. A. 1999. Reinforcement learning for spoken dialogue systems. In NIPS.
-  Vinyals, O., and Le, Q. 2015. A neural conversational model. ICML Deep Learning Workshop.
-  Wang, M.; Lu, Z.; Li, H.; and Liu, Q. 2015. Syntax-based deep matching of short texts. In IJCAI.
-  Wen, T.-H.; Gašić, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In EMNLP.
-  Wen, T.-H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
-  Williams, J.; Raux, A.; and Henderson, M. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse 7(3):4–33.
-  Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; and Shi, Y. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, 189–194. IEEE.
-  Young, S.; Gasic, M.; Thomson, B.; and Williams, J. D. 2013. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179.
-  Zhou, X.; Dong, D.; Wu, H.; Zhao, S.; Yu, D.; Tian, H.; Liu, X.; and Yan, R. 2016. Multi-view response selection for human-computer conversation. In EMNLP.