Dialog Intent Induction with Deep Multi-View Clustering
Abstract

We introduce the dialog intent induction task and present a novel deep multi-view clustering approach to tackle the problem. Dialog intent induction aims at discovering user intents from user query utterances in human-human conversations, such as dialogs between customer support agents and customers. (We focus on inducing abstract intents like BookFlight and ignore detailed arguments such as departure date and destination.) Motivated by the intuition that a dialog intent is not only expressed in the user query utterance but also captured in the rest of the dialog, we split a conversation into two independent views and exploit multi-view clustering techniques for inducing the dialog intent. In particular, we propose alternating-view k-means (Av-Kmeans) for joint multi-view representation learning and clustering analysis. The key innovation is that the view-specific representations are updated iteratively by predicting the cluster assignment obtained from the other view, so that the multi-view representations of the instances lead to similar cluster assignments. Experiments on two public datasets show that Av-Kmeans can induce better dialog intent clusters than state-of-the-art unsupervised representation learning methods and standard multi-view clustering approaches. (When ready, the data and code will be published at https://github.com/asappresearch/dialog-intent-induction.)
1 Introduction

Goal-oriented dialog systems assist users in accomplishing well-defined tasks with clear intents within a limited number of dialog turns. They have been adopted in a wide range of applications, including booking flights and restaurants hemphill1990atis; williams2012belief, providing tourist information kim2016fifth, aiding in the customer support domain, and powering intelligent virtual assistants such as Apple Siri, Amazon Alexa, and Google Assistant. The first step towards building such systems is to determine the target tasks and construct the corresponding ontologies to define the constrained set of dialog states and actions henderson2014word; mrkvsic2015multi.
Existing work assumes the target tasks are given and excludes dialog intent discovery from the dialog system design pipeline. As a result, most existing work focuses on a few simple dialog intents and fails to explore the realistic complexity of the user intent space williams2013dialog; budzianowski2018multiwoz. This assumption severely limits the adaptation of goal-oriented dialog systems to important but complex domains like customer support and healthcare, where obtaining a complete view of user intents in advance is impossible. For example, as shown in Fig. 1, it is non-trivial to predict user intents for troubleshooting a newly released product in advance. To address this problem, we propose to employ data-driven approaches that automatically discover user intents from human-human conversations. Follow-up analysis can then identify the most valuable dialog intents and inform the design of dialog systems that automate the corresponding conversations.
Similar to previous work on user question/query intent induction sadikov2010clustering; haponchyk2018supervised, we can induce dialog intents by clustering user query utterances (we treat the initial user utterances of the dialogs as user query utterances) in human-human conversations. The key is to learn discriminative query utterance representations in the user intent semantic space. Unsupervised learning of such representations is challenging due to the semantic shift across different domains nida2015componential. We propose to overcome this difficulty by leveraging the rest of a conversation, in addition to the user query utterance, as a weak supervision signal. Consider the two dialogs presented in Fig. 1, where both users are asking how to find their AirPods. Although the user query utterances vary in the choice of lexical items and syntactic structures, the human agents follow the same workflow to assist the users, resulting in similar conversation structures. (Note this is not always the case: for the same dialog intent, the agent treatments may differ depending on the user profiles, and a user may change intent in the middle of a conversation. Thus, the supervision is often very noisy.)
We present a deep multi-view clustering approach, alternating-view k-means (Av-Kmeans), to leverage this weak supervision for the semantic clustering problem. Specifically, we partition a dialog into two independent views: the user query utterance and the rest of the conversation. Av-Kmeans uses separate neural encoders to embed the inputs corresponding to the two views and encourages the representations learned by the encoders to yield similar cluster assignments. Concretely, we alternately perform k-means-style updates to compute the cluster assignment on one view and then train the encoder of the other view to predict that assignment using a metric learning algorithm snell2017prototypical. Our method diverges from previous work on multi-view clustering bickel2004multi; chaudhuri2009multi; kumar2011co in that it learns robust neural representations that lie in geometric spaces amenable to clustering analysis. Experimental results on a dialog intent induction dataset and a question intent clustering dataset show that Av-Kmeans significantly outperforms multi-view clustering algorithms without joint representation learning by –% absolute F1 scores. It also achieves higher F1 scores than quick thoughts logeswaran2018efficient, a state-of-the-art unsupervised representation learning method.
Our contributions are summarized as follows:
We introduce the dialog intent induction task and present a multi-view clustering formulation to solve the problem.
We propose a novel deep multi-view clustering approach that jointly learns cluster-discriminative representations and cluster assignments.
We derive and annotate a dialog intent induction dataset obtained from a public Twitter corpus and process a duplicate question detection dataset into a question intent clustering dataset.
The presented algorithm, Av-Kmeans, significantly outperforms previous state-of-the-art multi-view clustering algorithms as well as two unsupervised representation learning methods on the two datasets.
2 Deep Multi-View Clustering
In this section, we present a novel method for joint multi-view representation learning and clustering analysis. We consider the case of two independent views, in which the first view corresponds to the user query utterance (query view) and the second one corresponds to the rest of the conversation (content view).
Formally, given a set of instances $\{x_i\}_{i=1}^{N}$, we assume that each data point $x_i$ can be naturally partitioned into two independent views $x_i^{(1)}$ and $x_i^{(2)}$. We further use two neural network encoders $f^{(1)}$ and $f^{(2)}$ to transform the two views into vector representations $v_i^{(1)} = f^{(1)}(x_i^{(1)})$ and $v_i^{(2)} = f^{(2)}(x_i^{(2)})$. We are interested in grouping the data points into $K$ clusters using the multi-view feature representations. In particular, the neural encoders corresponding to the two views are jointly optimized so that they commit to similar cluster assignments for the same instances.
In this work, we implement the query-view encoder with a bi-directional LSTM (BiLSTM) network hochreiter1997long and the content-view encoder with a hierarchical BiLSTM model that consists of an utterance-level BiLSTM encoder and a content-level BiLSTM encoder. The concatenation of the hidden representations from the last time steps is adopted as the query or content embedding.
2.1 Alternating-view k-means clustering
In this work, we propose alternating-view k-means (Av-Kmeans) clustering, a novel method for deep multi-view clustering that iteratively updates neural encoders corresponding to the two views by encouraging them to yield similar cluster assignments for the same instances. In each semi-iteration, we perform k-means-style updates to compute a cluster assignment and centroids on feature representations corresponding to one view, and then project the cluster assignment to the other view where the assignment is used to train the view encoder in a supervised learning fashion.
The full training algorithm is presented in Alg. 1. The k-means subroutine takes as input a set of vector representations and the number of clusters, along with two optional arguments, the number of k-means iterations and the initial cluster assignment, and returns a cluster assignment. A visual demonstration of one semi-iteration of Av-Kmeans is also available in Fig. 2.
In particular, we initialize the encoders randomly or with pretrained encoders (§ 2.3). We then obtain the initial cluster assignment by performing k-means clustering on the vector representations of view 1. During each Av-Kmeans iteration, we first project the cluster assignment from view 1 to view 2 and update the neural encoder for view 2 by formulating a supervised learning problem (§ 2.2). We then perform vanilla k-means steps to adjust the cluster assignment in view 2 based on the updated encoder. The same procedure is then repeated in the reverse direction, from view 2 to view 1, within the same iteration. Note that in each semi-iteration, the initial centroids corresponding to a view are calculated from the cluster assignment obtained on the other view. The algorithm runs for a fixed total number of iterations.
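The alternating loop above can be sketched in a few lines of pure Python. The sketch below is illustrative only: it replaces the neural encoders with fixed toy feature vectors (so there is no encoder training via prototypical networks), and all function names are our own.

```python
def assign_step(points, cents):
    """Vanilla k-means assignment: nearest centroid by squared Euclidean distance."""
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(cents)), key=lambda j: d2(p, cents[j])) for p in points]

def centroids_from_assignment(points, assign, k):
    """Project a cluster assignment into this view: mean of each cluster's points."""
    cents = []
    for c in range(k):
        members = [p for p, a in zip(points, assign) if a == c]
        cents.append([sum(xs) / len(members) for xs in zip(*members)] if members else points[0])
    return cents

def av_kmeans(view1, view2, k, iters=5):
    """Alternating-view k-means on fixed (non-learned) feature vectors."""
    assign = assign_step(view1, view1[:k])  # initial clustering on view 1
    for _ in range(iters):
        for view in (view2, view1):  # two semi-iterations per full iteration
            cents = centroids_from_assignment(view, assign, k)  # project assignment
            assign = assign_step(view, cents)                   # k-means step in this view
    return assign
```

When the two views induce consistent groupings, the assignment stabilizes after a few alternations; in the real algorithm, the projection step also trains the view's encoder rather than merely recomputing centroids.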
2.2 Prototypical episode training
In each Av-Kmeans iteration, we need to solve two supervised classification problems, one per view, using the pseudo training sets given by the projected cluster assignments. A simple approach would be to put a softmax classification layer on top of each encoder network. However, we find it beneficial to perform classification directly in the k-means clustering space. To this end, we adopt prototypical networks snell2017prototypical, a metric learning approach that relies solely on the encoders to form the classifiers, instead of introducing additional classification layers.
Given input data and a neural network encoder $f$, prototypical networks compute a vector representation $\mathbf{c}_k$, or prototype, of each class by averaging the vectors of the embedded support points belonging to the class:

$$\mathbf{c}_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} f(x_i),$$

where $S_k$ denotes the support set of class $k$;
here we drop the view superscripts for simplicity. Conceptually, the prototypes are similar to the centroids in the k-means algorithm, except that a prototype is computed on a subset of the instances of a class (the support set) while a centroid is computed based on all instances of a class.
Given a sampled query data point $x$, prototypical networks produce a distribution over classes based on a softmax over distances to the prototypes in the embedding space:

$$p(y = k \mid x) = \frac{\exp(-d(f(x), \mathbf{c}_k))}{\sum_{k'} \exp(-d(f(x), \mathbf{c}_{k'}))},$$
where the distance function is the squared Euclidean distance $d(\mathbf{v}, \mathbf{v}') = \|\mathbf{v} - \mathbf{v}'\|_2^2$.
The model minimizes the negative log-likelihood of the data, $J = -\log p(y = k \mid x)$. Training episodes are formed by randomly selecting a subset of classes from the training set, then choosing a subset of examples within each class to act as the support set and a subset of the remainder to serve as query points. We refer to the original paper snell2017prototypical for a more detailed description of the model.
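The prototype computation and the softmax over negative squared distances can be written out directly. The following is a pure-Python sketch with plain lists standing in for the encoder outputs; it is our own illustration, not the paper's implementation.

```python
import math

def prototypes(support, labels):
    """Prototype of each class = mean of its embedded support points."""
    protos = {}
    for c in sorted(set(labels)):
        pts = [e for e, l in zip(support, labels) if l == c]
        protos[c] = [sum(xs) / len(pts) for xs in zip(*pts)]
    return protos

def class_distribution(query, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scores = {c: math.exp(-d2(query, p)) for c, p in protos.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}
```

A query point is assigned to the class whose prototype is nearest, which is exactly the behavior the k-means-style cluster updates expect from the classifier.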
2.3 Parameter initialization
Although Av-Kmeans can work effectively with random parameter initializations, we expect it to benefit from initializations obtained by pretraining with well-studied unsupervised learning objectives. We present two methods to initialize the utterance encoders for both the query and content views. The first approach is based on recurrent autoencoders. We embed an utterance using a BiLSTM encoder. The utterance embedding is then concatenated with each word vector of the decoder inputs, which are fed into a uni-directional LSTM decoder. We use the neural encoder trained with this autoencoding objective to initialize the two utterance encoders in Av-Kmeans.
Recurrent autoencoders reconstruct each input utterance independently, without capturing semantic dependencies across consecutive utterances. We therefore consider a second initialization method, quick thoughts logeswaran2018efficient, which addresses the problem by predicting a context utterance from a set of candidates given a target utterance. Here, the target utterances are sampled randomly from the corpus, and the context utterances are sampled from within each pair of adjacent utterances. We use two separate BiLSTM encoders, the target encoder and the context encoder, to encode utterances. To score the compatibility of a target utterance and a candidate context utterance, we simply use the inner product of the two utterance vectors. The training objective maximizes the log-likelihood of the context utterance given the target utterance and the candidate utterance set. After pretraining, we adopt the target encoder to initialize the two utterance encoders in Av-Kmeans.
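The scoring step of this objective is easy to sketch. Below, plain vectors stand in for the target-encoder and context-encoder outputs, and the function returns the log-probability that a softmax over inner products assigns to the true context utterance; this is an illustrative sketch under those assumptions, not the actual implementation.

```python
import math

def qt_log_prob(target, true_context, negatives):
    """Log-probability of the true context under a softmax over inner products
    between the target utterance vector and all candidate context vectors."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [dot(target, c) for c in [true_context] + negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_z
```

Training maximizes this quantity, i.e., pushes the target vector toward its true context vector and away from the sampled negatives.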
3 Datasets

As discussed in the introduction, existing goal-oriented dialog datasets mostly concern predefined dialog intents in narrow domains such as restaurant or travel booking henderson2014second; budzianowski2018multiwoz; serban2018survey. To carry out this study, we adopt a more challenging corpus that consists of human-human conversations for customer service and manually annotate the user intents of a small number of dialogs. We also build a question intent clustering dataset to assess the generalization ability of the proposed method on a related problem.
3.1 Twitter airline customer support
Table 1: Annotated dialog intents, dialog counts, and example query utterances in the TwACS corpus.

| Dialog intent | # Dialogs | Query utterance example |
| --- | --- | --- |
| Baggage | 40 | hi, do suit bags count as a personal items besides carry on baggage? |
| BookFlight | 27 | trying all day to book an international flight, only getting error msg. |
| ChangeFlight | 16 | can i request to change my flight from lax to msy on 10/15? |
| CheckIn | 21 | hy how can i have some help… having some problems with a check in |
| CustomerService | 19 | 2 hour wait time to talk to a customer service agent?!? |
| FlightDelay | 85 | delay… detroit orlando |
| FlightEntertainment | 40 | airline is killing it with these inflight movie options |
| FlightFacility | 32 | just flew airline economy… best main cabin seat ive ever sat in. |
| FlightStaff | 30 | great crew on las vegas to baltimore tonight. |
| Other | 116 | hi, i have a small question! |
| RequestFeature | 10 | when are you going to update your app for iphone x? |
| Reward | 17 | need to extend travel funds that expire tomorrow! |
| TerminalFacility | 13 | thx for the new digital signs at dallas lovefield. well done!! |
| TerminalOperation | 34 | would be nice if you actually announced delays |
We consider the Customer Support on Twitter corpus released by Kaggle (https://www.kaggle.com/thoughtvector/customer-support-on-twitter), which contains more than three million tweets and replies in the customer support domain. The tweets constitute conversations between the customer support agents of several large companies and their customers. As the conversations cover a variety of dynamic topics, they serve as an ideal testbed for the dialog intent induction task. In the customer service domain, different industries generally address unrelated topics and concerns. We focus on dialogs in the airline industry (we combined conversations involving the following Twitter handles: Delta, British_Airways, SouthwestAir, and AmericanAir), as they represent the largest number of conversations in the corpus. We name the resulting dataset the Twitter airline customer support (TwACS) corpus. We rejected any conversation that redirects the customer to a URL or another communication channel, e.g., direct messages. We ended up with a dataset of dialogs. The total numbers of dialog turns and tokens are and respectively.
After investigating randomly sampled conversations from TwACS, we established an annotation task with dialog intents and hired two annotators to label the sampled dialogs based on the user query utterances. The Cohen's kappa coefficient was , indicating substantial agreement between the annotators. Disagreements were resolved by a third annotator. To our knowledge, this is the first dialog intent induction dataset. The data statistics and user query utterance examples corresponding to different dialog intents are presented in Table 1.
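For reference, Cohen's kappa corrects the raw inter-annotator agreement rate for the agreement expected by chance. A minimal sketch of the computation (our own illustration, with hypothetical label sequences):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(ann_a)
    observed = sum(x == y for x, y in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    # Chance agreement: probability both annotators pick the same label independently.
    chance = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    return (observed - chance) / (1 - chance)
```

Values above roughly 0.6 are conventionally read as substantial agreement.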
3.2 AskUbuntu

AskUbuntu is a dataset collected and processed by Shah et al. (2018) for the duplicate question detection task. The dataset consists of technical support questions posted by users on the AskUbuntu website, with annotations indicating that two questions are semantically equivalent. For instance,
how to install ubuntu w/o removing windows
installing ubuntu over windows 8.1
are duplicates and can be resolved with similar answers. A total number of questions are included in the dataset and pairs of questions are labeled as duplicates. In addition, we obtain the top-rated answer for each question from the AskUbuntu website dump (https://archive.org/details/stackexchange).
In this work, we reprocess the data and build a question intent clustering dataset using an automatic procedure. Following Haponchyk et al. (2018), we transform the duplicate question annotations into question intent cluster annotations with a simple heuristic: for each question pair annotated as a duplicate, we assign the two questions to the same cluster. As a result, the question intent clusters correspond to the connected components of the duplicate question graph. There are such connected components. However, most of the clusters are very small: of the clusters contain only – questions. Therefore, we experiment with the largest clusters that contain questions in this work. The sizes of the largest and the smallest clusters considered in this study are and respectively.
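This connected-components heuristic amounts to a single union-find pass over the duplicate pairs. A small illustrative sketch (the question ids are hypothetical):

```python
def duplicate_clusters(pairs):
    """Group questions into intent clusters = connected components of the
    duplicate-question graph, computed with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)
    return sorted(comps.values(), key=len, reverse=True)
```

Sorting by size makes it easy to keep only the largest clusters, as done for the AskUbuntu experiments.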
4 Experiments

In this section, we evaluate Av-Kmeans on the TwACS and AskUbuntu datasets described in § 3. We compare Av-Kmeans with competitive systems for representation learning and multi-view clustering and present our main findings in § 4.2. In addition, we examine the output clusters produced by Av-Kmeans on the TwACS dataset to perform a thorough error analysis.
Table 2: Evaluation results for each combination of clustering algorithm and pretraining method on the TwACS and AskUbuntu datasets.
4.1 Experimental settings
We train the models on all the instances of a dataset and evaluate on the labeled instances. We employ the publicly available 300-dimensional GloVe vectors pennington2014glove pretrained with 840 billion tokens to initialize the word embeddings for all the models.
We consider state-of-the-art methods for representation learning and/or multi-view clustering as our baseline systems. We formulate the dialog intent induction task as an unsupervised clustering task and include two popular clustering algorithms, k-means and spectral clustering. Multi-view spectral clustering (MVSC) kanaan2018multiview is a competitive standard multi-view clustering approach. (We use the scikit-learn k-means implementation and the MVSC implementation available at https://pypi.org/project/multiview/.) In particular, we carry out clustering using the query-view and content-view representations learned by the representation learning methods (k-means only requires query-view representations). In the case where a content-view input corresponds to multiple utterances, we take the average of the utterance vectors as the content-view output representation for autoencoders and quick thoughts.
Av-Kmeans is a joint representation learning and multi-view clustering method. Therefore, we compare against state-of-the-art representation learning methods: autoencoders and quick thoughts logeswaran2018efficient. Quick thoughts is a strong representation learning baseline that is adopted in BERT bert. We also include principal component analysis (PCA), a classic representation learning and dimensionality reduction method, since bag-of-words representations are too expensive to work with for clustering analysis.
We compare three variants of Av-Kmeans that differ in the pretraining strategies. In addition to the Av-Kmeans systems pretrained with autoencoders and quick thoughts, we also consider a system whose encoder parameters are randomly initialized (no pretraining).
We compare the competitive approaches on a number of standard evaluation measures for clustering analysis. Following prior work kumar2011co; haponchyk2018supervised; xie2016unsupervised, we set the number of clusters to the number of ground truth categories and report precision, recall, F1 score, and unsupervised clustering accuracy (ACC). To compute precision or recall, we assign each predicted cluster to the most frequent gold cluster or assign each gold cluster to the most frequent predicted cluster respectively. The F1 score is the harmonic average of the precision and recall. ACC uses a one-to-one assignment between the gold standard clusters and the predicted clusters. The assignment can be efficiently computed by the Hungarian algorithm kuhn1955hungarian.
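These measures can be made concrete with a short sketch (our own illustration; the brute-force permutation search in `acc` plays the role of the Hungarian algorithm and is only feasible for a small number of clusters):

```python
from collections import Counter
from itertools import permutations

def prec_rec_f1(gold, pred):
    """Precision maps each predicted cluster to its most frequent gold cluster;
    recall maps each gold cluster to its most frequent predicted cluster."""
    def mapped_acc(src, dst):
        hits = 0
        for s in set(src):
            votes = Counter(d for x, d in zip(src, dst) if x == s)
            hits += votes.most_common(1)[0][1]
        return hits / len(src)
    p, r = mapped_acc(pred, gold), mapped_acc(gold, pred)
    return p, r, 2 * p * r / (p + r)

def acc(gold, pred, k):
    """Unsupervised clustering accuracy: best one-to-one relabeling of the
    predicted clusters (brute force here; Hungarian algorithm in practice)."""
    best = max(sum(perm[p] == g for g, p in zip(gold, pred))
               for perm in permutations(range(k)))
    return best / len(gold)
```

Because ACC enforces a one-to-one mapping while precision and recall allow many-to-one mappings, ACC is never higher than the mapped precision.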
We empirically set both the dimension of the LSTM hidden state and the number of principal components in PCA to . The number of Av-Kmeans iterations and the number of k-means steps in an Av-Kmeans semi-iteration are set to and respectively, as we find that more iterations lead to similar cluster assignments. We adopt the same set of hyperparameter values used by Snell et al. (2017) for training the prototypical networks. Specifically, we fix the number of query examples and the number of support examples to and . The networks are trained for episodes per Av-Kmeans semi-iteration. The number of sampled classes per episode is chosen to be , as it has to be smaller than the number of ground truth clusters. Adam kingma2015adam is used to optimize the models and the initial learning rate is . During autoencoder or quick thoughts pretraining, we check the performance on the development set after each epoch to perform early stopping, where we randomly sample 10% of the unlabeled instances as the development data.
4.2 Main results

Our main empirical findings are presented in Table 2, in which we compare Av-Kmeans with standard single-view and multi-view clustering algorithms. We also evaluate classic and neural approaches for representation learning, where the pretrained representations are fixed during k-means and MVSC clustering but fine-tuned during Av-Kmeans clustering. We analyze the empirical results in detail in the following paragraphs.
Utilizing multi-view information
Among all the systems, k-means clustering on representations trained with PCA or autoencoders only employs the single-view information encoded in user query utterances. These systems clearly underperform the rest, which leverage multi-view information from the entire conversations. Quick thoughts infuses multi-view knowledge by learning query-view vectors that are aware of content-view semantics. In contrast, multi-view spectral clustering can work with representations that are learned separately for the individual views; the multi-view information is aggregated using the common eigenvectors of the data similarity Laplacian matrices. As shown, k-means clustering on quick thoughts vectors gives better results than MVSC pretrained with PCA or autoencoders by more than F1 or ACC, which indicates that multi-view representation learning is effective for problems beyond simple supervised learning tasks. Combining representation learning and multi-view clustering in a static way seems to be less ideal: MVSC performs worse than k-means when using the quick thoughts vectors as clustering inputs. Multi-view representation learning breaks the independent-view assumption that is critical for classic multi-view clustering algorithms.
Joint representation learning and clustering
We now investigate whether joint representation learning and clustering can reconcile the conflict between cross-view representation learning and classic multi-view clustering. Av-Kmeans outperforms the k-means and MVSC baselines by considerable margins. It achieves and F1 scores and and ACC scores on the TwACS and AskUbuntu datasets, which are – percent higher than those of competitive systems. Compared to alternative methods, Av-Kmeans is able to effectively seek clustering-friendly representations that encourage similar cluster assignments across the two views of the same instances. With the help of quick thoughts pretraining, Av-Kmeans improves upon the strongest baseline, k-means clustering on quick thoughts vectors, by ACC on the TwACS dataset and F1 on the AskUbuntu dataset.
Model pretraining for Av-Kmeans
Evaluation results for Av-Kmeans with different parameter initialization strategies are shown in Table 2. As the results suggest, pretraining the neural encoders is important for obtaining competitive results on the TwACS dataset, while its impact on the AskUbuntu dataset is less pronounced. AskUbuntu is six times larger than TwACS, and models trained on AskUbuntu are less sensitive to their parameter initializations. This observation is consistent with early research on unsupervised pretraining, where Schmidhuber et al. (2012) argue that unsupervised initialization/pretraining is not necessary if a large amount of training data is available. Between the two pretraining methods, quick thoughts is much more effective than autoencoders: it improves upon no pretraining and autoencoders by and ACC scores respectively on the TwACS dataset.
4.3 Error analysis
Table 3: The most frequent confusions between ground truth clusters and predicted clusters, with instance counts.
Our best-performing system still fails to reach F1 or ACC on the TwACS dataset. We examine the outputs of the quick-thoughts-pretrained Av-Kmeans on TwACS, focusing on the most frequent errors made by the system. To this end, we compute the confusion matrix based on the one-to-one assignment between the gold clusters and the predicted clusters used by ACC. The top 5 most frequent errors are presented in Table 3. As shown, three of the five errors involve Other. Instances under the Other category correspond to miscellaneous dialog intents, and thus they are less likely to be grouped together based on semantic meaning representations.
The other two frequent errors confuse FlightDelay with TerminalOperation and ChangeFlight respectively. Poor terminal operations often incur unexpected customer delays. Two example query utterances are shown as follows:
who’s running operation at mia flight 1088 been waiting for a gate.
have been sitting in the plane waiting for our gate for 25 minutes.
Sometimes, a user may express more than one intent in a single query utterance. For example, in the following query utterance, the user complains about a delay and requests an alternative flight:
why is ba flight 82 from abuja to london delayed almost 24 hours? and are you offering any alternatives?
We leave multi-intent induction to future work.
5 Related Work
User intent clustering
Automatic discovery of user intents by clustering user utterances is a critical task in understanding the dynamics of a domain with user-generated content. Previous work focuses on grouping similar web queries or user questions together using supervised or unsupervised clustering techniques. Kathuria et al. (2010) perform simple k-means clustering on a variety of query traits to understand user intents. Cheung and Li (2012) present an unsupervised method for query intent clustering that produces, for each intent, a pattern consisting of a sequence of semantic concepts and/or lexical items. Jeon et al. (2005) use machine translation to estimate word translation probabilities and retrieve similar questions from question archives. A variation of the k-means algorithm, MiXKmeans, is presented by Deepak (2016) to cluster threads on forums and Community Question Answering websites. Haponchyk et al. (2018) propose to cluster questions into intents using a supervised learning method that yields better semantic similarity modeling. Our work focuses on a related but different task: automatically inducing user intents for building dialog systems. Two sources of information are naturally available for exploring our deep multi-view clustering approach.
Multi-view clustering (MVC) aims at grouping similar subjects into the same cluster by combining the available multi-view feature information to search for consistent cluster assignments across different views chao2017survey. Generative MVC approaches assume that the data is drawn from a mixture model and that the membership information can be inferred using a multi-view EM algorithm bickel2004multi. Most work on MVC employs discriminative approaches that directly optimize an objective function involving pairwise similarities, so that the average similarity within clusters is maximized and the average similarity between clusters is minimized. In particular, Chaudhuri et al. (2009) propose to exploit canonical correlation analysis to learn multi-view representations that are then used for downstream clustering. Multi-view spectral clustering kumar2011co; kanaan2018multiview constructs a similarity matrix for each view and then iteratively updates each view's matrix using the eigenvectors of the similarity matrix computed on the other view. Standard MVC algorithms expect multi-view feature inputs that are fixed during unsupervised clustering. Av-Kmeans instead works with raw multi-view text inputs and learns representations that are particularly suitable for clustering.
Joint representation learning and clustering
Several recent works propose to jointly learn feature representations and cluster assignments via neural networks. Xie et al. (2016) present the deep embedded clustering (DEC) method, which learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a KL-divergence-based clustering objective. Deep clustering network (DCN) yang2017towards is a joint dimensionality reduction and k-means clustering framework, in which the dimensionality reduction model is implemented with a deep neural network. These methods focus on learning single-view representations, leaving multi-view information under-explored. Lin et al. (2018) present a joint framework for deep multi-view clustering (DMJC), which is the closest work to ours. However, DMJC only works with single-view inputs, and the feature representations are learned using a multi-view fusion mechanism. In contrast, Av-Kmeans assumes that the inputs can be naturally partitioned into multiple views and carries out learning on the multi-view inputs directly.
6 Conclusion

We introduce the novel task of dialog intent induction, which concerns the automatic discovery of dialog intents from user query utterances in human-human conversations. The resulting dialog intents provide valuable insights for designing goal-oriented dialog systems. We propose to leverage the dialog structure to divide a dialog into two independent views and present Av-Kmeans, a deep multi-view clustering algorithm, to jointly perform multi-view representation learning and clustering on the views. We conduct extensive experiments on a Twitter conversation dataset and a question intent clustering dataset. The results demonstrate the superiority of Av-Kmeans over competitive representation learning and multi-view clustering baselines. In the future, we would like to obtain multi-view data from multi-lingual and multi-modal sources and investigate the effectiveness of Av-Kmeans on a wider range of tasks in multi-lingual or multi-modal settings.
Acknowledgments

We thank Shawn Henry and Ethan Elenberg for their comments on an early draft of the paper. We also thank the conversational AI and general research teams of ASAPP for their support throughout the project. We thank the EMNLP reviewers for their helpful feedback.