HINT3: Raising the bar for Intent Detection in the Wild
Intent Detection systems in the real world are exposed to the complexities of imbalanced datasets containing varying perceptions of intent, unintended correlations and domain-specific aberrations. To facilitate benchmarking that reflects near real-world scenarios, we introduce 3 new datasets created from live chatbots in diverse domains. Unlike most existing datasets, which are crowdsourced, our datasets contain real user queries received by the chatbots and facilitate penalising unwanted correlations picked up during training. We evaluate 4 NLU platforms and a BERT-based classifier and find that performance saturates at inadequate levels on the test sets because all systems latch on to unintended patterns in the training data.
Over the last few years, task-oriented dialogue systems have gained increasing traction for applications such as personal assistants and automated customer support agents. This has led to the availability of several commercialised and/or open conversational bot building platforms. Most popular systems today include intent detection as a vital part of their Natural Language Understanding (NLU) pipeline. Recent advances in transfer learning (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) have enabled systems that perform quite well on existing benchmarking datasets (Larson et al., 2019; Casanueva et al., 2020).
Definitions of intent often vary across users, tasks and domains. Perception of intent can range from a generic abstraction such as "Ordering a product" to extreme granularity such as "Enquiring for a discount on a specific product if ordered using a specific card". Additionally, factors such as imbalanced data distribution in the training set, assumptions made during training data generation, and the diverse backgrounds of the domain experts who define the classes make this task more challenging. During inference, these systems may be deployed to users with diverse cultural backgrounds who might frame their queries differently even when communicating in the same language. Furthermore, apart from correctly identifying in-scope queries, the system is expected to accurately reject out-of-scope queries (Larson et al., 2019), adding to the challenge.
Most existing datasets for intent detection are generated using crowdsourcing services. To benchmark accurately in real-world settings, we release 3 new single-domain datasets, each spanning multiple coarse and fine-grained intents, with the test sets drawn entirely from actual user queries on live systems at scale instead of being crowdsourced. On these datasets, we find that the performance of existing systems saturates at unsatisfactory levels because they end up learning spurious patterns from the training dataset instead of generalising to the perceived meanings of intents.
We evaluate 4 NLU platforms (Dialogflow, LUIS, RASA and Haptik) and a BERT-based classifier on these datasets.
2 Prior Work
Despite intent detection being an important component of most dialogue systems, very few datasets have been collected from real users. The Web Apps, Ask Ubuntu and Chatbot datasets from Braun et al. (2017) contain a limited number of intents (<10), oversimplifying the task. More recent datasets such as HWU64 (Liu et al., 2019) and CLINC150 (Larson et al., 2019) span a large number of intents in multiple domains but are generated using crowdsourcing services. They are therefore limited in the diversity of user expressions, which arises from, among other things, domain-specific presumptions, context from how and where the bot is made available, paraphrases emerging from the cultural and ethnic diversity of the user base, and conversational slang. Our work has some similarity with CLINC150, which also highlights the problem of out-of-scope intent detection, and with BANKING77 (Casanueva et al., 2020), which focuses on a single domain. However, all three (HWU64, CLINC150 and BANKING77) offer relatively large and well-balanced training sets, which might not always be feasible to collect for every new domain. For all datasets mentioned so far, recent works have reported reasonably high performance (>90% average) on in-scope queries. Despite this, gaps in language understanding become apparent when such systems are deployed. The datasets introduced in this paper, and the further analysis of results, attempt to identify critical gaps in language understanding and call for further research into more robust methods.
We introduce HINT3, a collection of three datasets (Table 1): SOFMattress, Curekart and Powerplay11. Each contains a diverse set of intents in a single domain: mattress products retail, fitness supplements retail and online gaming, respectively.
3.1 Training Data Collection
Training data is prepared by a team of domain experts who try to emulate real users after in-depth research of historical user queries. The experts do not create an explicit set of out-of-scope queries, primarily because the universe of such queries is effectively unbounded. The training datasets exhibit class imbalance and contain domain-specific words and acronyms. All training queries are in English.
In addition to the Full training sets, we create a Subset version of each training set. For each class, after retaining the first query, we iterate over the rest and discard a query if it has an entailment score (Bowman et al., 2015) greater than 0.6 in both directions with any of the queries retained so far, i.e. the Subset version has the following property:
$$\forall c \in I,\ \forall a, b \in Q_c,\ a \neq b:\quad \neg\big(E(a, b) > 0.6 \,\land\, E(b, a) > 0.6\big)$$
where $I$ is the set of all intents, $Q_c$ is the set of queries retained for class $c$, and $E(a, b)$ is the entailment scoring function with $a$ as hypothesis and $b$ as premise. We use an ELMo model trained on SNLI (Peters et al., 2018; Parikh et al., 2016) for entailment scoring.
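The greedy filtering described above can be sketched as follows. This is a minimal illustration, not the authors' code: `build_subset` and `toy_entail_score` are hypothetical names, and the toy scorer (token overlap) is only a stand-in for the ELMo-based entailment model so that the example runs on its own.

```python
from typing import Callable, Dict, List


def build_subset(
    queries_by_intent: Dict[str, List[str]],
    entail_score: Callable[[str, str], float],
    threshold: float = 0.6,
) -> Dict[str, List[str]]:
    """Greedily keep queries that are not mutually entailed by a kept query."""
    subset: Dict[str, List[str]] = {}
    for intent, queries in queries_by_intent.items():
        retained: List[str] = []
        for query in queries:
            # A query is dropped only if the score exceeds the threshold
            # in BOTH directions with some already-retained query.
            redundant = any(
                entail_score(query, kept) > threshold
                and entail_score(kept, query) > threshold
                for kept in retained
            )
            if not redundant:
                # The first query of each class always lands here,
                # because `retained` is still empty.
                retained.append(query)
        subset[intent] = retained
    return subset


def toy_entail_score(hypothesis: str, premise: str) -> float:
    """Stand-in scorer (assumption): share of hypothesis tokens in the premise."""
    h_tokens = set(hypothesis.lower().split())
    p_tokens = set(premise.lower().split())
    return len(h_tokens & p_tokens) / len(h_tokens) if h_tokens else 0.0


if __name__ == "__main__":
    full = {"EMI": ["is emi available", "emi is available", "what is the interest rate"]}
    print(build_subset(full, toy_entail_score))
```

Because the procedure is greedy, the result depends on the order of queries within each class; a real entailment model would replace `toy_entail_score` without changing the loop.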
3.2 Test Data Collection and Annotation
Our test sets contain the first message received by live systems from real users over a period of 15 days. Inter-annotator agreement was 75.8%, 80.0% and 73.4% for SOFMattress, Curekart and Powerplay11 respectively, and conflicts were resolved by domain experts. Coming directly from real users, our test set queries also contain messaging slang, acronyms, spelling mistakes, grammatical mistakes and code-mixed language. Queries in non-Latin script or code-mixed languages were marked as out of scope (labelled as NO_NODES_DETECTED). Since live chat systems don't cater to all queries related to a brand, our test sets contain relevant out-of-scope queries received from users about that domain. Any identifiable information about users or brands was replaced with made-up values in both train and test sets.
4 Benchmark Evaluation
We evaluated Dialogflow, LUIS, RASA and Haptik on our datasets, in addition to a BERT-based classifier. All layers of BERT were fine-tuned with a learning rate of 4e-5 for up to 50 epochs, with a warmup proportion of 0.1 and early stopping.
4.1 Out-Of-Scope (OOS) prediction
We use thresholds on the model’s probability estimate for the task of predicting whether a query is OOS. We show performance on thresholds ranging from 0.1 to 0.9 at an interval of 0.1 to show the maximum performance a model can achieve irrespective of how we choose the threshold.
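A minimal sketch of this threshold sweep is shown below. It assumes each system returns a probability per in-scope intent, that a query is predicted OOS when the top probability falls below the threshold (whether the comparison is strict is not stated in the text), and that OOS is represented by the `NO_NODES_DETECTED` label; the function names are illustrative.

```python
from typing import Dict, List, Tuple

OOS_LABEL = "NO_NODES_DETECTED"

# Thresholds 0.1, 0.2, ..., 0.9 as described in the text.
THRESHOLDS = [round(0.1 * k, 1) for k in range(1, 10)]


def predict_with_threshold(probs: Dict[str, float], threshold: float) -> str:
    """Return the top intent, or the OOS label if its probability is too low."""
    top_label = max(probs, key=probs.get)
    return top_label if probs[top_label] >= threshold else OOS_LABEL


def accuracy_by_threshold(
    examples: List[Tuple[Dict[str, float], str]],
    thresholds: List[float],
) -> Dict[float, float]:
    """Overall accuracy (in-scope and OOS queries together) per threshold."""
    results: Dict[float, float] = {}
    for threshold in thresholds:
        correct = sum(
            predict_with_threshold(probs, threshold) == gold
            for probs, gold in examples
        )
        results[threshold] = correct / len(examples)
    return results


if __name__ == "__main__":
    # Made-up scores for illustration only.
    examples = [
        ({"EMI": 0.8, "OFFERS": 0.2}, "EMI"),
        ({"EMI": 0.55, "OFFERS": 0.45}, OOS_LABEL),
    ]
    for threshold, acc in accuracy_by_threshold(examples, THRESHOLDS).items():
        print(threshold, acc)
```

Reporting the whole sweep, rather than one tuned threshold, shows the best score a system could reach under any threshold choice.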
We consider Accuracy and Matthews Correlation Coefficient
| Test query | True label | Top predicted label | Sample training queries for true label | Sample training queries for predicted label |
|---|---|---|---|---|
| Ergo 7272 inches price? | MATTRESS_COST | L,H,D,R: ERGO_FEATURES | | |
| Trail option are there | 100_NIGHT_TRIAL_OFFER | | | |
| I require 75 inch 57 inch. Is it available? | SIZE_CUSTOMIZATION | | | |
| 20 % discount available on emi | OFFERS | L,H,D,R: EMI | | |
| How will u deliver with this LockDown in place ? | NO_NODES_DETECTED | L,H,D,R: CHECK_PINCODE | - | |
| Covid19 how can you deliver | NO_NODES_DETECTED | L,H,D,R: CHECK_PINCODE | | |
Figure 1 presents results for all systems, for both the Full and Subset variations of each dataset. The best Accuracy on every dataset is in the low 70s. The best MCC varies from 0.4 to 0.6 across datasets, suggesting that the systems are far from perfectly understanding natural language.
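For readers unfamiliar with MCC in the multi-class setting, the sketch below computes the standard multi-class generalisation from label counts. It is an illustration under the assumption that the multi-class form is the one used here (the text does not say), and it returns 0.0 when the denominator is zero, which is a common convention rather than something stated in the paper.

```python
import math
from collections import Counter
from typing import Sequence


def multiclass_mcc(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Multi-class Matthews correlation coefficient from label counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # t_k: how often class k truly occurs
    pred_counts = Counter(y_pred)   # p_k: how often class k is predicted
    labels = set(true_counts) | set(pred_counts)

    sum_true_pred = sum(true_counts[k] * pred_counts[k] for k in labels)
    sum_pred_sq = sum(n * n for n in pred_counts.values())
    sum_true_sq = sum(n * n for n in true_counts.values())

    numerator = correct * total - sum_true_pred
    denominator = math.sqrt(
        (total * total - sum_pred_sq) * (total * total - sum_true_sq)
    )
    # Degenerate case, e.g. every query predicted as the same class.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    gold = ["EMI", "OFFERS", "NO_NODES_DETECTED", "EMI"]
    pred = ["EMI", "EMI", "NO_NODES_DETECTED", "EMI"]
    print(round(multiclass_mcc(gold, pred), 3))
```

Unlike accuracy, MCC stays near 0 for a system that predicts a single label for everything, which makes it a useful complement when OOS queries form a large share of the test set.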
In Table 2, we consider in-scope accuracy at a very low threshold of 0.1 to see the maximum in-scope accuracy that current systems could achieve if false positives on OOS queries did not matter. Our results show that even at such a low threshold, the maximum in-scope accuracy achieved on the Full training set is low, unlike the 90+ in-scope accuracies reported for these systems on other public datasets such as CLINC150 (Larson et al., 2019). In-scope accuracy is even worse on the Subset of the training data.
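The in-scope accuracy described here can be sketched as below: only queries whose gold label is in scope are counted, and a prediction is still rejected as OOS if its top probability is under the threshold. The function name, the input format and the non-strict comparison are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def in_scope_accuracy(
    examples: List[Tuple[Dict[str, float], str]],
    threshold: float = 0.1,
    oos_label: str = "NO_NODES_DETECTED",
) -> float:
    """Accuracy restricted to queries whose gold label is in scope."""
    in_scope = [(probs, gold) for probs, gold in examples if gold != oos_label]
    if not in_scope:
        raise ValueError("no in-scope examples to evaluate")
    correct = 0
    for probs, gold in in_scope:
        top_label = max(probs, key=probs.get)
        predicted = top_label if probs[top_label] >= threshold else oos_label
        correct += predicted == gold
    return correct / len(in_scope)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    examples = [
        ({"EMI": 0.6, "OFFERS": 0.4}, "EMI"),
        ({"EMI": 0.3, "OFFERS": 0.7}, "EMI"),
        ({"EMI": 0.9, "OFFERS": 0.1}, "NO_NODES_DETECTED"),
    ]
    print(in_scope_accuracy(examples, threshold=0.1))
```

With a threshold as low as 0.1 almost no in-scope query is rejected, so the number approximates the ceiling a system can reach when OOS false positives are ignored.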
Table 4 shows the percentage drop in in-scope accuracy on the Subset data across all systems, compared with in-scope accuracy on the Full data. The drop varies from 0.6% to 22.3% across datasets and platforms. Ideally, this drop should be close to 0 on all datasets: if a system understands the meaning of the queries in its training data, removing training queries that are semantically similar to ones already present should not affect its performance.
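As a small worked example, the drop can be computed as below. The text does not state whether the reported figure is a relative drop or an absolute difference in percentage points; this sketch assumes the relative form, and the accuracies used are made up for illustration.

```python
def percentage_drop(full_accuracy: float, subset_accuracy: float) -> float:
    """Relative drop (in %) of Subset accuracy with respect to Full accuracy.

    Assumption: the drop is measured relative to the Full-data accuracy.
    """
    if full_accuracy <= 0:
        raise ValueError("full_accuracy must be positive")
    return 100.0 * (full_accuracy - subset_accuracy) / full_accuracy


if __name__ == "__main__":
    # Hypothetical accuracies, not values from the paper's tables.
    print(percentage_drop(80.0, 60.0))
```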
Analyzing a few example queries that failed on all platforms (Table 3) suggests that these models are not actually "understanding" language or capturing "meaning", but are instead capturing spurious patterns in the training data, as also pointed out by Bender and Koller (2020). Predicting based on these spurious patterns, which models latch on to during training, leads to high confidence even on OOS queries. Figure 2 shows this behaviour on the SOFMattress Full dataset: a significant percentage of OOS queries receive high confidence scores on all systems except LUIS, for which this comes at the cost of in-scope accuracy.
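One way to inspect this behaviour is to bucket the top confidence score of every OOS test query and look at the share of queries per bucket. The sketch below is an assumption about how such a distribution could be computed, not a description of how Figure 2 was produced; the bin count and function name are illustrative.

```python
from typing import List, Sequence


def oos_confidence_histogram(
    top_confidences: Sequence[float], num_bins: int = 10
) -> List[float]:
    """Percentage of OOS queries whose top confidence falls in each bin."""
    counts = [0] * num_bins
    for confidence in top_confidences:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        counts[index] += 1
    total = len(top_confidences)
    if total == 0:
        return [0.0] * num_bins
    return [100.0 * count / total for count in counts]


if __name__ == "__main__":
    # Made-up confidences for four OOS queries.
    print(oos_confidence_histogram([0.95, 0.92, 0.15, 1.0]))
```

A large mass in the rightmost bins means no threshold can reject those OOS queries without also rejecting confidently classified in-scope queries.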
This paper analyzed intent detection on 3 new datasets consisting of both in-scope and out-of-scope queries received on 3 live chat bots over a period of 15 days. Our findings
- Access requests for signup on Haptik are processed via contact form at https://haptik.ai/contact-us/
- Refer to the supplementary material for datasets and reproducibility instructions
- Bender and Koller (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5185–5198.
- Bowman et al. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642.
- Braun et al. (2017). Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 174–185.
- Casanueva et al. (2020). Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 38–45.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Howard and Ruder (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339.
- Larson et al. (2019). An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1311–1316.
- Liu et al. (2019). Benchmarking natural language understanding services for building conversational agents. In Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology (IWSDS), Ortigia, Siracusa (SR), Italy, pp. xxx–xxx.
- Parikh et al. (2016). A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2249–2255.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.