Validating Label Consistency in NER Data Annotation
Data annotation plays a crucial role in ensuring that named entity recognition (NER) models are trained with the right information to learn from. Producing the most accurate labels is a challenge due to the complexity involved with annotation. Label inconsistency between multiple subsets of data annotation (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catch the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified the label inconsistency of test data in the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It validated the consistency in the corrected versions of both datasets.
|Original Examples||Corrected Examples|
|Starting from a DP-based solution to the [traveling salesman problem]Method, we present a novel technique …||Starting from a DP-based solution to the [traveling salesman problem]Task, we present a novel technique …|
|FERRET utilizes a novel approach to [Q/A]Method known as predictive questioning which attempts to identify …||FERRET utilizes a novel approach to [Q/A]Task known as predictive questioning which attempts to identify …|
|The goal of this work is the enrichment of [human-machine interactions]Task in a natural language environment.||The goal of this work is the [enrichment of human-machine interactions]Task in a natural language environment.|
Named entity recognition (NER) is one of the foundations of many downstream tasks such as relation extraction, event detection, and knowledge graph construction. NER models require vast amounts of labeled data to learn and identify patterns that humans cannot, so obtaining accurate training data is essential. As end-to-end neural models achieve excellent performance on NER in various domains Lample et al. (2016); Ma and Hovy (2016); Liu et al. (2018); Luan et al. (2018), building useful and challenging NER benchmarks, such as CoNLL03, WNUT16, and SCIERC, contributes greatly to the research community.
Data annotation plays a crucial role in building benchmarks and ensuring NER models are trained with the right information to learn from. Producing the necessary annotation from any asset at scale is a challenge, mainly due to the complexity involved with annotation. Getting the most accurate labels demands time and expertise.
Label mistakes can hardly be avoided, especially when the labeling process splits the data into multiple sets for distributed annotation. The mistakes cause label inconsistency between subsets of annotated data (e.g., training set and test set, or multiple training subsets). For example, in the CoNLL03 dataset Sang and De Meulder (2003), a standard NER benchmark that has been cited over 2,300 times, label mistakes were found in 5.38% of the test set Wang et al. (2019). State-of-the-art F1 scores on CoNLL03 are already high, so even though the label mistakes make up a small part of the data, they cannot be neglected when researchers try to improve the results further. In the work of Wang et al., five annotators were recruited to correct the label mistakes. Compared to the results on the original test set, the results on the corrected test set are more accurate and stable.
However, two critical issues were not resolved in this process: i) How to identify label inconsistency between the subsets of annotated data? ii) How to validate that the label consistency was recovered by the correction?
Another example is SCIERC Luan et al. (2018) (cited 50 times), a multi-task benchmark (including NER) in the AI domain. It has 1,861 sentences for training, 455 for dev, and 551 for test. When we looked at the false predictions given by SCIIE, a multi-task model released along with the SCIERC dataset, we found that as many as 147 sentences (26.7% of the test set) were not properly annotated. (We also recruited five annotators and counted a mistake only when all the annotators reported it.) Three examples are given in Table 1: two of them have wrong entity types; the third has a wrong span boundary. As shown in the experiments section, after the correction, the NER performance becomes more accurate and stable.
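The counting rule in the parenthetical above (a sentence counts as a mistake only when every annotator flags it) can be sketched as follows. This is a minimal illustration and not the authors' tooling: the function name `unanimous_mistakes` and the toy votes are hypothetical, and only the figures 147 and 551 come from the text.

```python
from typing import Dict, List


def unanimous_mistakes(votes: Dict[int, List[bool]]) -> List[int]:
    """Return ids of sentences that every annotator flagged as mislabeled.

    `votes` maps a sentence id to one boolean per annotator
    (True = this annotator reported a label mistake).
    """
    return sorted(sid for sid, flags in votes.items() if flags and all(flags))


# Toy votes from five hypothetical annotators (not real annotation data).
toy_votes = {
    0: [True, True, True, True, True],       # unanimous: counted as a mistake
    1: [True, True, False, True, True],      # one dissent: not counted
    2: [False, False, False, False, False],  # nobody flagged it
}
flagged = unanimous_mistakes(toy_votes)

# Share of the SCIERC test set reported as mislabeled in the text.
mistake_rate = 147 / 551
print(flagged, f"{mistake_rate:.1%}")
```

Requiring unanimity is a conservative choice: it lowers the number of reported mistakes but makes each reported mistake harder to dispute.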
Besides the significant correction on the SCIERC dataset, our contributions in this work are as follows: i) an empirical, visual method to identify the label inconsistency between subsets of annotated data (see Figure 1), ii) a method to validate the label consistency of corrected data annotation (see Figure 2). Experiments show that they are effective on the CoNLL03 and SCIERC datasets.
2 Proposed Methods
2.1 A method to identify label inconsistency
Suppose the labeling processes on two parts of the annotated data, A and B, were consistent. Then the two parts are likely to be equivalently predictive of each other. In other words, if we train a model with a set of samples from either part A or part B to predict a different set from part A, the performance should be similar.
Take SCIERC as an example. We were wondering whether the labels in the test set were consistent with those in the training set. Our method to identify the inconsistency is presented in Figure 1.
We sample three mutually exclusive subsets of equal size from the training set, with the size chosen according to the size of the original test set. We use one of the subsets as the new test set. We then train the SCIIE NER model Luan et al. (2018) and evaluate it on the new test set. We build three new training sets to feed into the model:
- “TrainTest”: first fed with one training subset and then the original test set;
- “PureTrain”: fed with two training subsets;
- “TestTrain”: first fed with the original test set and then one of the training subsets.
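The construction of the new test set and the three ordered training sets listed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `build_orders`, the `"NewTest"` key, the seed, and the choice to make each sampled subset exactly as large as the original test set are all hypothetical, and sentences are represented by placeholder strings.

```python
import random
from typing import Dict, List, Sequence


def build_orders(
    train: Sequence[str], test: Sequence[str], seed: int = 0
) -> Dict[str, List[str]]:
    """Build the new test set and the three ordered training sets."""
    size = len(test)  # assumption: subsets are as large as the original test set
    if len(train) < 3 * size:
        raise ValueError("training set too small for three disjoint subsets")
    rng = random.Random(seed)
    # Sampling without replacement keeps the three subsets mutually exclusive.
    picked = rng.sample(list(train), 3 * size)
    new_test, sub1, sub2 = (picked[i * size:(i + 1) * size] for i in range(3))
    original_test = list(test)
    return {
        "NewTest": new_test,                 # held out from all training sets
        "TrainTest": sub1 + original_test,   # training subset first, then test
        "PureTrain": sub1 + sub2,            # training subsets only
        "TestTrain": original_test + sub1,   # test first, then training subset
    }


# Toy usage with placeholder sentence ids (not real SCIERC data).
toy_train = [f"train-{i}" for i in range(12)]
toy_test = [f"test-{i}" for i in range(3)]
orders = build_orders(toy_train, toy_test, seed=0)
print({name: len(items) for name, items in orders.items()})
```

All three training sets have the same length and differ only in where the original test sentences appear, so differences between their curves can be attributed to the data source and its position rather than to the amount of training data.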
Results show that “TestTrain” performed the worst at the early stage because the quality of the original test set is not reliable. In “TrainTest”, the performance no longer improved once the model started being fed with the original test set. “PureTrain” performed the best. These observations suggest that the original test set is less predictive of training samples than the training set itself, which may be due to label inconsistency. Moreover, we did not make such observations on two other datasets, WikiGold and WNUT16.
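One way to obtain curves like those described above is to retrain on growing prefixes of each ordered training set and score every model on the new test set. The text does not state how the curves were produced, so the prefix-retraining scheme, the step size, and the `train_and_score` callback are assumptions. The `toy_scorer` below is a rigged stand-in for real NER training (it simply rewards training-set sentences) and is there only to make the sketch runnable.

```python
from typing import Callable, List, Sequence, Tuple


def learning_curve(
    ordered_train: Sequence[str],
    new_test: Sequence[str],
    train_and_score: Callable[[Sequence[str], Sequence[str]], float],
    step: int,
) -> List[Tuple[int, float]]:
    """Score a model trained on each growing prefix of `ordered_train`."""
    if step <= 0:
        raise ValueError("step must be positive")
    curve = []
    for end in range(step, len(ordered_train) + 1, step):
        curve.append((end, train_and_score(ordered_train[:end], new_test)))
    return curve


def toy_scorer(train_part: Sequence[str], _test: Sequence[str]) -> float:
    """Stand-in for real training: counts only 'train-' sentences as useful."""
    return sum(s.startswith("train-") for s in train_part) / 10


# Placeholder sentence ids in "TrainTest" and "TestTrain" order.
train_first = [f"train-{i}" for i in range(4)] + [f"test-{i}" for i in range(4)]
test_first = train_first[4:] + train_first[:4]
curve_train_test = learning_curve(train_first, [], toy_scorer, step=2)
curve_test_train = learning_curve(test_first, [], toy_scorer, step=2)
print(curve_train_test)
print(curve_test_train)
```

With a real model, `train_and_score` would train an NER tagger on the prefix and return its F1 score on the new test set. A plateau once the original test sentences enter the prefix, or a weak start when they come first, is the pattern the method looks for.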
2.2 A method to validate label consistency after correction
After we corrected the label mistakes, how could we empirically validate the recovery of label consistency? Again, we use a subset of training data as the new test set. We evaluate the predictability of the original wrong test subset, the corrected test subset, and the rest of the training set. We expect to see that the wrong test subset delivers weaker performance and that the other two sets make comparably good predictions. Figure 2 illustrates this idea.
Take SCIERC as an example. Suppose we corrected z of the y + z sentences in the test set. The original wrong test subset (“Mistake”) and the corrected test subset (“Correct”) are both of size z. Here z = 147, and the original good test subset (“Test”) has y = 404 sentences. We sampled three mutually exclusive subsets of size x, y, and w from the training set (“Train”). We use the first subset (of size x) as the new test set. We build four new training sets (each in a “Mistake” and a “Correct” variant) and feed them into the SCIIE model. Each new training set has y + w + z sentences.
- “TestTrainMistake”/“TestTrainCorrect”: the original good test subset, the third sampled training subset, and the original wrong test subset (or the corrected test subset);
- “PureTrainMistake”/“PureTrainCorrect”: the second and third sampled training subsets and the original wrong test subset (or the corrected test subset);
- “MistakeTestTrain”/“CorrectTestTrain”: the original wrong test subset (or the corrected test subset), the original good test subset, and the third sampled training subset;
- “MistakePureTrain”/“CorrectPureTrain”: the original wrong test subset (or the corrected test subset) and the second and third sampled training subsets.
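The eight ordered training sets listed above can be assembled as in the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_validation_orders` and its parameter names are hypothetical, sentences are placeholder strings, and the check that the second training subset matches the good test subset in size is inferred from the statement that every new training set has the same number of sentences.

```python
from typing import Dict, List, Sequence


def build_validation_orders(
    good_test: Sequence[str],
    mistake: Sequence[str],
    correct: Sequence[str],
    train_2: Sequence[str],
    train_3: Sequence[str],
) -> Dict[str, List[str]]:
    """Assemble the eight ordered training sets used for validation."""
    if len(mistake) != len(correct):
        raise ValueError("Mistake and Correct must cover the same sentences")
    if len(train_2) != len(good_test):
        # Inferred requirement: keeps all training sets the same size.
        raise ValueError("second training subset must match the good test subset")
    test_train = list(good_test) + list(train_3)
    pure_train = list(train_2) + list(train_3)
    orders: Dict[str, List[str]] = {}
    for tag, variant in (("Mistake", list(mistake)), ("Correct", list(correct))):
        orders[f"TestTrain{tag}"] = test_train + variant  # variant fed last
        orders[f"PureTrain{tag}"] = pure_train + variant
        orders[f"{tag}TestTrain"] = variant + test_train  # variant fed first
        orders[f"{tag}PureTrain"] = variant + pure_train
    return orders


# Toy usage with placeholder ids (not real SCIERC sentences).
validation_orders = build_validation_orders(
    good_test=["good-0", "good-1"],
    mistake=["wrong-0"],
    correct=["fixed-0"],
    train_2=["train2-0", "train2-1"],
    train_3=["train3-0", "train3-1", "train3-2"],
)
print(sorted(validation_orders))
```

Each “Mistake” set differs from its “Correct” counterpart only in the labels of the corrected sentences, so a gap between the paired curves isolates the effect of the label mistakes.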
Results show that the label mistakes (i.e., the original wrong test subset) hurt model performance whether they are fed at the beginning or later. The corrected test subset delivers performance comparable to that of the original good test subset and the training set. This demonstrates the label consistency of the corrected test set with the training set.
3 Experiments
3.1 Results on SCIERC
The visual results of the proposed methods have been presented in Section 2. Here we deploy five state-of-the-art NER models to investigate their performance on the corrected SCIERC dataset. The NER models are BiLSTM-CRF Lample et al. (2016), LM-BiLSTM-CRF Liu et al. (2018), single-task and multi-task SCIIE Luan et al. (2018), and multi-task DyGIE Luan et al. (2019).
|Method||Corrected SCIERC||Original SCIERC|
As shown in Table 2, all NER models deliver better performance on the corrected SCIERC than on the original dataset. This indicates that the training set is more consistent with the corrected test set than with the original wrong test set. In future work, we will explore more baselines on the leaderboard.
3.2 Results on CoNLL03
Building on the correction contributed by Wang et al. (2019), we use the proposed method to identify the label inconsistency, even though the label mistakes account for “only” 5.38% of the test set. The method also validates the label consistency after correction. Figure 3(a) shows that starting with the wrong labels in the original test set makes the performance worse than starting with the training set or the good test subset. After label correction, this issue is resolved, as shown in Figure 3(b).
4 Related Work
NER is typically cast as a sequence labeling problem and solved by models that integrate LSTMs, CRF, and language models Lample et al. (2016); Ma and Hovy (2016); Liu et al. (2018). Another idea is to generate span candidates and predict their types. Span-based models have been proposed with multi-task learning strategies Luan et al. (2018, 2019). The multiple tasks include concept recognition, relation extraction, and co-reference resolution.
Researchers have noticed label mistakes in many NLP tasks Manning (2011); Wang et al. (2019); Eskin (2000); Kvĕtoň and Oliva (2002). For instance, it is reported that the bottleneck of the POS tagging task is the consistency of the annotation result Manning (2011). Prior work has tried to detect label mistakes automatically and to minimize the influence of noise in training. The mistake re-weighting mechanism is effective in the NER task Wang et al. (2019). In contrast, we focus on visually evaluating label consistency.
We presented an empirical method to explore the relationship between label consistency and NER model performance. It identified the label inconsistency of test data in the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It also validated the label consistency of the corrected annotation on both benchmarks.
- Eskin (2000). Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pp. 148–153.
- Kvĕtoň and Oliva (2002). (Semi-)automatic detection of errors in PoS-tagged corpora. In COLING 2002: The 19th International Conference on Computational Linguistics.
- Lample et al. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
- Liu et al. (2018). Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Luan et al. (2018). Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Luan et al. (2019). A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Ma and Hovy (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074.
- Manning (2011). Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In International conference on intelligent text processing and computational linguistics, pp. 171–189.
- Sang and De Meulder (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050.
- Wang et al. (2019). CrossWeigh: training named entity tagger from imperfect annotations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5157–5166.