HINT3: Raising the bar for Intent Detection in the Wild


Intent Detection systems in the real world are exposed to the complexities of imbalanced datasets containing varying perceptions of intent, unintended correlations and domain-specific aberrations. To facilitate benchmarking that reflects near real-world scenarios, we introduce 3 new datasets created from live chatbots in diverse domains. Unlike most existing datasets, which are crowdsourced, our datasets contain real user queries received by the chatbots and facilitate penalising unwanted correlations picked up during training. We evaluate 4 NLU platforms and a BERT-based classifier and find that performance saturates at inadequate levels on the test sets because all systems latch on to unintended patterns in the training data.


1 Introduction

Over the last few years, task-oriented dialogue systems have gained increasing traction for applications like personal assistants, automated customer support agents, etc. This has led to the availability of several commercialised and/or open conversational bot building platforms. Most popular systems today involve intent detection as a vital part of their Natural Language Understanding (NLU) pipeline. Recent advances in transfer learning (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) have enabled systems that perform quite well on existing benchmarking datasets (Larson et al., 2019; Casanueva et al., 2020).

Definitions of intent often vary across users, tasks and domains. Perception of intent can range from a generic abstraction such as “Ordering a product” to extreme granularity such as “Enquiring for a discount on a specific product if ordered using a specific card”. Additionally, factors such as imbalanced data distribution in the training set, assumptions made during training data generation, and the diverse backgrounds of the domain experts who define the classes make this task more challenging. During inference, these systems may be deployed to users with diverse cultural backgrounds who might frame their queries differently even when communicating in the same language. Furthermore, apart from correctly identifying in-scope queries, the system is expected to accurately reject out-of-scope queries (Larson et al., 2019), which adds to the challenge.

Most existing datasets for intent detection are generated using crowdsourcing services. To benchmark accurately in real-world settings, we release 3 new single-domain datasets, each spanning multiple coarse- and fine-grained intents, with the test sets drawn entirely from actual user queries on live systems at scale instead of being crowdsourced. On these datasets, we find that the performance of existing systems saturates at unsatisfactory levels because they end up learning spurious patterns from the training dataset instead of generalising to the perceived meanings of intents.

We evaluate 4 NLU platforms (Dialogflow¹, LUIS², Rasa NLU³ and Haptik⁴,⁵) and a BERT-based (Devlin et al., 2019) classifier on all 3 datasets and highlight gaps in language understanding. We further probe queries on which all the current systems fail and question the efficacy of the current approach to learning. Additionally, we repeat all our experiments on a subset of the training data and show a performance drop in all the systems despite retaining relevant and sufficient utterances in the training subset.

2 Prior Work

Despite intent detection being an important component of most dialogue systems, very few datasets have been collected from real users. The Web Apps, Ask Ubuntu and Chatbot datasets from (Braun et al., 2017) contain a limited number of intents (<10), oversimplifying the task. More recent datasets like HWU64 from (Liu et al., 2019) and CLINC150 from (Larson et al., 2019) span a large number of intents in multiple domains but are generated using crowdsourcing services. They are therefore limited in the diversity of user expressions, which arises from, among other things, domain-specific presumptions, context from how and where the bot is made available, paraphrases emerging from the cultural and ethnic diversity of the user base, and conversational slang. Our work has some similarity with CLINC150, in that it also highlights the problem of out-of-scope intent detection, and with BANKING77 from (Casanueva et al., 2020), which focuses on a single domain. However, all three (HWU64, CLINC150 and BANKING77) offer relatively large and well-balanced training sets, which might not always be feasible to collect for every new domain. For all datasets mentioned so far, recent works have reported reasonably high performance (>90% average) on in-scope queries. Despite this, gaps in language understanding become apparent when such systems are deployed. The datasets introduced in this paper and the further analysis of results attempt to identify critical gaps in language understanding and call for further research into more robust methods.

Dataset       #Intents   #Queries
                         Train (Full)   Train (Subset)   Test (in-scope)   Test (oos)
SOFMattress   21         328            180              231               166
Curekart      28         600            413              452               539
Powerplay11   59         471            261              275               708
Table 1: Statistics of the 3 datasets in HINT3

3 Datasets

We introduce HINT3, a collection of three datasets shown in Table 1: SOFMattress, Curekart and Powerplay11. Each contains a diverse set of intents in a single domain: mattress products retail, fitness supplements retail and online gaming, respectively.

3.1 Training Data Collection

Training data is prepared by a team of domain experts who try to emulate real users after in-depth research of historical user queries. The experts do not create an explicit set of out-of-scope queries, primarily because the universe of such queries is effectively unbounded. The training datasets exhibit class imbalance and contain domain-specific words and acronyms. All training data queries are in English.

Dataset Variants

In addition to the Full training sets, we create Subset versions of each training set. For each class, after retaining the first query, we iterate over the rest, discarding a query if it has an entailment score (Bowman et al., 2015) greater than 0.6 in both directions with any of the queries retained so far, i.e. the Subset version has the following property:

$$\forall c \in C,\ \forall a, b \in Q_c,\ a \neq b:\quad \neg\big(E(a, b) > 0.6 \,\land\, E(b, a) > 0.6\big)$$

where $C$ is the set of all intents, $Q_c$ is the set of queries retained for class $c$, and $E(a, b)$ is the entailment scoring function with $a$ as hypothesis and $b$ as premise. We use the ELMo model trained on SNLI (Peters et al., 2018; Parikh et al., 2016)⁶ for $E$. The Subset versions are intended to evaluate performance with only semantically different sentences in the training set, as ideally systems should already understand queries that are semantically similar to the ones present in the training set.
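The filtering procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `build_subset` and `toy_entailment` are hypothetical names, and the token-overlap scorer is only a stand-in for the ELMo/SNLI entailment model so that the example runs without external dependencies.

```python
from typing import Callable, Dict, List

# Assumed interface: score(hypothesis, premise) -> entailment score in [0, 1].
EntailmentFn = Callable[[str, str], float]


def build_subset(
    queries_by_intent: Dict[str, List[str]],
    entailment: EntailmentFn,
    threshold: float = 0.6,
) -> Dict[str, List[str]]:
    """Greedily keep queries that are not mutually entailed by a retained query."""
    subset: Dict[str, List[str]] = {}
    for intent, queries in queries_by_intent.items():
        retained: List[str] = []
        for query in queries:
            # Discard only if the score exceeds the threshold in BOTH
            # directions with at least one query retained so far.
            redundant = any(
                entailment(query, kept) > threshold
                and entailment(kept, query) > threshold
                for kept in retained
            )
            if not redundant:
                retained.append(query)  # the first query is always retained
        subset[intent] = retained
    return subset


def toy_entailment(hypothesis: str, premise: str) -> float:
    """Stand-in scorer (NOT the ELMo/SNLI model): share of hypothesis tokens in the premise."""
    hyp = set(hypothesis.lower().split())
    prem = set(premise.lower().split())
    return len(hyp & prem) / len(hyp) if hyp else 0.0


# Illustrative training queries for a single intent.
train = {
    "EMI": [
        "No cost EMI is available?",
        "is no cost emi available?",
        "You guys provide EMI option?",
    ]
}
subset = build_subset(train, toy_entailment)
```

With the toy scorer, the second query is dropped because it scores above 0.6 against the first in both directions, while the third is kept. Swapping in a real entailment model only requires passing a different `entailment` callable.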

3.2 Test Data Collection and Annotation

Our test sets contain the first message received by the live systems from real users over a period of 15 days. Inter-annotator agreement was 75.8%, 80.0% and 73.4% for SOFMattress, Curekart and Powerplay11 respectively, and conflicts were resolved by domain experts. Coming directly from real users, our test set queries also contain messaging slang, acronyms, spelling mistakes, grammatical mistakes and code-mixed language. Queries in non-Latin script or code-mixed languages were marked as out of scope (labelled as NO_NODES_DETECTED). Since live chat systems do not cater to all queries related to a brand, our test sets contain relevant out-of-scope queries received from users about that domain. Any identifiable information about users or brands was replaced with made-up values in both train and test sets.

Figure 1: Matthews Correlation Coefficient and Accuracy across all datasets and platforms
Platform     SOFMattress       Curekart          Powerplay11
             Full    Subset    Full    Subset    Full    Subset
Dialogflow   73.1    65.3      75.0    71.2      59.6    55.6
RASA         69.2    56.2      84.0    80.5      49.0    38.5
LUIS         59.3    49.3      72.5    71.6      48.0    44.0
Haptik       72.2    64.0      80.3    79.8      66.5    59.2
BERT         73.5    57.1      83.6    82.3      58.5    53.0
Table 2: In-scope accuracy (%) at a low threshold of 0.1 for the Full and Subset data variants

4 Benchmark Evaluation

We evaluated the Dialogflow, LUIS, RASA and Haptik platforms on our datasets, in addition to a BERT-based classifier. All layers of BERT were fine-tuned with a learning rate of 4e-5 for up to 50 epochs with a warmup period of 0.1 and early stopping.
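The fine-tuning schedule is only summarised above, so the following is a rough sketch under stated assumptions rather than the configuration actually used: it reads "warmup period of 0.1" as the fraction of total training steps spent in linear warmup, assumes linear decay afterwards, and uses a hypothetical `patience` value for early stopping. The function names are illustrative.

```python
from typing import List


def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 4e-5,
    warmup_fraction: float = 0.1,
) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (decay shape is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


def should_stop(val_losses: List[float], patience: int = 3) -> bool:
    """Early stopping: stop once the best validation loss is `patience` epochs old.

    `patience` is a hypothetical setting; the source does not state one.
    """
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

For a run of 1000 steps, this schedule reaches the peak rate of 4e-5 at step 100 and returns to zero at step 1000.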

4.1 Out-Of-Scope (OOS) prediction

We use thresholds on the model’s probability estimate for the task of predicting whether a query is OOS. We show performance on thresholds ranging from 0.1 to 0.9 at an interval of 0.1 to show the maximum performance a model can achieve irrespective of how we choose the threshold.
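As an illustration of the thresholding scheme, the sketch below maps a query to the out-of-scope label whenever the top intent's probability falls below the threshold. The function name, the example scores and the choice of `>=` at the boundary are assumptions made for this example.

```python
from typing import Dict, List

OOS_LABEL = "NO_NODES_DETECTED"  # out-of-scope label used in the HINT3 test sets


def predict_with_threshold(probabilities: Dict[str, float], threshold: float) -> str:
    """Return the top intent, or the OOS label if its probability is below the threshold."""
    top_intent = max(probabilities, key=probabilities.get)
    return top_intent if probabilities[top_intent] >= threshold else OOS_LABEL


# Thresholds 0.1, 0.2, ..., 0.9, as in the evaluation described above.
THRESHOLDS: List[float] = [round(0.1 * k, 1) for k in range(1, 10)]

# Hypothetical classifier output for a single query.
scores = {"EMI": 0.55, "OFFERS": 0.35, "CHECK_PINCODE": 0.10}
predictions = {t: predict_with_threshold(scores, t) for t in THRESHOLDS}
```

Here the query is labelled EMI for thresholds up to 0.5 and out of scope from 0.6 onwards, which is the trade-off that sweeping the threshold is meant to expose.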

4.2 Metrics

We consider Accuracy and the Matthews Correlation Coefficient (MCC)⁷ as overall performance metrics for the systems. We use OOS recall (Larson et al., 2019) to evaluate performance on OOS queries and in-scope accuracy to evaluate performance on in-scope queries.
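The four metrics can be written down in a few lines. The sketch below is a dependency-free illustration, not the evaluation code used for the paper: the multiclass MCC follows the standard count-based formulation, and the function names and example labels are made up.

```python
import math
from collections import Counter
from typing import Sequence

OOS_LABEL = "NO_NODES_DETECTED"


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of all queries (in-scope and OOS) labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def matthews_corrcoef(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Multiclass MCC computed from per-label counts."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(c * c for c in true_counts.values())
    cov_pp = n * n - sum(c * c for c in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


def oos_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of truly out-of-scope queries that were predicted as out of scope."""
    preds = [p for t, p in zip(y_true, y_pred) if t == OOS_LABEL]
    return sum(p == OOS_LABEL for p in preds) / len(preds) if preds else 0.0


def inscope_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Accuracy restricted to queries whose true label is an in-scope intent."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != OOS_LABEL]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0


# Made-up labels for six test queries.
y_true = ["EMI", "EMI", "OFFERS", OOS_LABEL, OOS_LABEL, OOS_LABEL]
y_pred = ["EMI", "OFFERS", "OFFERS", OOS_LABEL, "CHECK_PINCODE", OOS_LABEL]
mcc = matthews_corrcoef(y_true, y_pred)
```

On this toy example, overall accuracy, OOS recall and in-scope accuracy are all 2/3, while MCC is lower because it also accounts for how predictions are distributed across labels.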

5 Results

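Test query: Ergo 7272 inches price?
  True label: MATTRESS_COST
  Top predicted label: L,H,D,R: ERGO_FEATURES
  Sample training queries for true label:
    • Price of mattress
    • Custom size cost
  Sample training queries for predicted label:
    • Features of Ergo mattress
    • Tell me about SOF Ergo mattress

Test query: Trail option are there
  True label: 100_NIGHT_TRIAL_OFFER
  Sample training queries for true label:
    • Trial details
    • How to enroll for trial
  Sample training queries for predicted label:
    • Can I get COD option?
    • Can it deliver by COD

Test query: I require 75 inch 57 inch. Is it available?
  True label: SIZE_CUSTOMIZATION
  Sample training queries for true label:
    • Will I get an option to Customise the size
    • How can I order a custom sized mattress
  Sample training queries for predicted label:
    • Want to know the custom size chart
    • Show me all available sizes

Test query: 20 % discount available on emi
  True label: OFFERS
  Top predicted label: L,H,D,R: EMI
  Sample training queries for true label:
    • Want to know the discount
    • Tell me about the latest offers
  Sample training queries for predicted label:
    • You guys provide EMI option?
    • No cost EMI is available?

Test query: How will u deliver with this LockDown in place ?
  True label: NO_NODES_DETECTED
  Top predicted label: L,H,D,R: CHECK_PINCODE
  Sample training queries for true label: -
  Sample training queries for predicted label:
    • Do you deliver to my pincode
    • Will you be able to deliver here

Test query: Covid19 how can you deliver
  True label: NO_NODES_DETECTED
  Top predicted label: L,H,D,R: CHECK_PINCODE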
Table 3: A few examples of test queries in SOFMattress which failed on all platforms. L: LUIS, H: Haptik, D: Dialogflow, R: Rasa. NO_NODES_DETECTED is the out-of-scope label.

Figure 1 presents results for all systems, for both the Full and Subset variants of each dataset. The best accuracy on every dataset is in the low 70s (percent). The best MCC varies from 0.4 to 0.6 across datasets, suggesting the systems are far from perfectly understanding natural language.

In Table 2, we consider in-scope accuracy at a very low threshold of 0.1 to see the maximum in-scope accuracy that current systems can achieve if false positives on OOS queries did not matter. Our results show that even with such a low threshold, the maximum in-scope accuracy that systems achieve on the Full training set is quite low, unlike the 90+ in-scope accuracies reported for these systems on other public datasets such as CLINC150 (Larson et al., 2019). The in-scope accuracy is even worse for the Subset of the training data.

Table 4 shows the percentage drop in in-scope accuracy on the Subset data, relative to in-scope accuracy on the Full data, across all systems. The drop varies from 0.6% to 22.3% across datasets and platforms. Ideally, this drop should be close to 0 on all datasets: if a system understands the meaning of the queries in the training data, its performance should not be affected by removing training queries that are semantically similar to the ones already present.
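As a worked example of this arithmetic, the snippet below computes the drop relative to the Full accuracy from two SOFMattress entries of Table 2. The truncation to one decimal place is an inference from comparing Tables 2 and 4, not something stated in the text, and the function names are illustrative.

```python
import math


def relative_drop(full_acc: float, subset_acc: float) -> float:
    """Percentage drop of the Subset accuracy relative to the Full accuracy."""
    return 100.0 * (full_acc - subset_acc) / full_acc


def truncate_1dp(value: float) -> float:
    """Truncate (not round) to one decimal place; assumed reporting convention."""
    return math.floor(value * 10) / 10


# In-scope accuracies (Full, Subset) on SOFMattress, taken from Table 2.
dialogflow_drop = relative_drop(73.1, 65.3)  # roughly 10.67
bert_drop = relative_drop(73.5, 57.1)        # roughly 22.31
```

Truncated to one decimal, these give 10.6 and 22.3, matching the SOFMattress entries reported for Dialogflow and BERT in Table 4.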

Platform     SOFMattress   Curekart   Powerplay11
Dialogflow   10.6          5.0        6.7
RASA         18.7          4.1        10.5
LUIS         16.8          1.2        8.3
Haptik       11.3          0.6        10.9
BERT         22.3          1.5        9.4
Table 4: Percentage drop in in-scope accuracy at a low threshold of 0.1 on Subset data as compared to Full
Figure 2: Out-of-Scope (OOS) Recall at the cost of In-scope Accuracy for SOFMattress Full dataset

Analyzing a few example queries that failed on all platforms (Table 3) suggests that these models are not actually “understanding” language or capturing “meaning”, but are instead capturing spurious patterns in the training data, as was also pointed out in (Bender and Koller, 2020). Predicting on the basis of these spurious patterns, which models latch on to during training, leads to high confidence even on OOS queries. Figure 2 shows this behaviour on the SOFMattress Full dataset: a significant percentage of OOS queries receive high confidence scores on all systems except LUIS, for which the better OOS behaviour comes at the cost of in-scope accuracy.

6 Conclusion

This paper analyzed intent detection on 3 new datasets consisting of both in-scope and out-of-scope queries received on 3 live chatbots over a period of 15 days. Our findings⁸ indicate that there is a significant gap between performance on crowdsourced datasets and performance in a real-world setup. NLU systems do not seem to be actually “understanding” language or capturing “meaning”. We believe our analysis and datasets will lead to the development of better, more robust dialogue systems.


  1. https://cloud.google.com/dialogflow
  2. https://www.luis.ai/
  3. https://github.com/RasaHQ/rasa/
  4. https://haptik.ai
  5. Access requests for signup on Haptik are processed via contact form at https://haptik.ai/contact-us/
  6. https://demo.allennlp.org/textual-entailment
  7. https://scikit-learn.org/stable/modules/model_evaluation
  8. Refer to the supplementary material for datasets and reproducibility instructions


  1. Bender and Koller (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5185–5198.
  2. Bowman et al. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642.
  3. Braun et al. (2017). Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 174–185.
  4. Casanueva et al. (2020). Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 38–45.
  5. Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  6. Howard and Ruder (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339.
  7. Larson et al. (2019). An evaluation dataset for intent classification and out-of-scope prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1311–1316.
  8. Liu et al. (2019). Benchmarking natural language understanding services for building conversational agents. In Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology (IWSDS), Ortigia, Siracusa (SR), Italy, pp. xxx–xxx.
  9. Parikh et al. (2016). A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2249–2255.
  10. Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.