Towards Open-Domain Named Entity Recognition via Neural Correction Models


Mengdi Zhu1, Zheye Deng1, Wenhan Xiong2, Mo Yu3, Ming Zhang1, William Yang Wang2
1Peking University
2University of California, Santa Barbara
3IBM Research
{1600012990,dzy97,mzhang_cs}, {xwhan,william},
Equal Contribution. Corresponding Author.

Named Entity Recognition (NER) plays an important role in a wide range of natural language processing tasks, such as relation extraction, question answering, etc. However, previous studies on NER are limited to a particular genre, using small manually-annotated or large but low-quality datasets. In this work, we propose a semi-supervised annotation framework to make full use of abstracts from Wikipedia and obtain a large and high-quality dataset called AnchorNER. We assume anchored strings in abstracts are named entities and annotate them with entity types mentioned in DBpedia. To improve the coverage, we design a neural correction model trained with a human-annotated NER dataset, DocRED, to correct the false-negative entity labels, and then train a BERT model with the corrected dataset. We evaluate our trained model on six NER datasets and our experimental results show that we have obtained state-of-the-art open-domain performances — on top of the strong baselines BERT-base and BERT-large, we achieve relative improvements of 4.66% and 3.07% respectively.

1 Introduction

Named entity recognition (NER) aims to identify named entities, such as persons, locations, organizations, etc., in text [Yadav and Bethard2018]. As a key component of many natural language processing (NLP) tasks such as data mining, summarization, and information extraction [Chen et al.2004, Banko et al.2007, Aramaki et al.2009], NER has drawn much attention, and many studies have been conducted in this field. In practice, NER is applied to various genres of text; a NER model, and especially a NER toolkit, should therefore perform well in almost every case. However, the open-domain setting remains a great challenge for NER due to the limited availability of annotated NER datasets.

Previously, there have been two main ways to address this problem. The first is using external knowledge and pre-training. Some researchers exploit lexical features [Ghaddar and Langlais2018, Wu, Liu, and Cohn2018], and other studies utilize dictionaries or gazetteers [Nadeau, Turney, and Matwin2006, Shang et al.2018]. However, such external resources are usually domain-specific, and building them up is time-consuming. Moreover, dictionaries are finite, and dictionary matching suffers from false and missed matches. Other studies utilize word and character embeddings. Pre-trained word embeddings are widely used in long short-term memory (LSTM) models [Huang, Xu, and Yu2015], and convolutional neural networks (CNN) are used to implement character embeddings [Ma and Hovy2016, Chiu and Nichols2016]. Some researchers also propose contextual embeddings [Akbik, Bergmann, and Vollgraf2019] to replace fixed embeddings and obtain promising results. Recently, with the development of pre-trained language models [Devlin et al.2019], these models have also been applied to NER and significantly improve performance. These strong models, nonetheless, only learn word representations without information about named entities. After pre-training, they still need to predict NER labels from word representations, which means that many improvements can still be made on top of these models.

The second way is increasing the training data with semi-labeled or auto-labeled data. Nothman et al. (2013) implement a semi-supervised method to classify Wikipedia articles into named entity types and link anchor texts to the corresponding articles. Ghaddar and Langlais (2017) exploit out-links and out-links of out-links to annotate named entities in the target article and build an auto-labeled dataset called WiNER. However, these methods suffer from noise, and many false labels are generated during annotation.

In this work, our solution combines the best of both worlds. We propose a neural correction model and use it to correct the false-negative entity labels of a large-scale but low-quality NER dataset. As correction is an easier task compared to labeling from scratch, a relatively small training set already suffices, which also alleviates the scarcity of NER data. We first make use of abstracts from Wikipedia and leverage anchored strings in Wikipedia together with a well-built knowledge base, DBpedia, to annotate named entities in the abstracts. Then, in order to correct the false-negative entity labels, we implement a neural correction model trained with a manually-annotated, high-precision dataset, DocRED. To train the correction model better, we utilize curriculum learning. We use the correction model to build a large-scale, high-quality dataset called AnchorNER. After that, we train BERT models with our corrected dataset. Experimental results demonstrate that we obtain state-of-the-art open-domain performance. We will release our OpenNER toolkit for open-domain named entity recognition. Our code and data are released at

The main contributions of our work are three-fold:

  • We build a large-scale and high-quality NER dataset called AnchorNER with Wikipedia and DBpedia.

  • We implement a semi-supervised correction model with curriculum learning to correct the false-negative entity labels.

  • BERT models trained with AnchorNER yield state-of-the-art results in the open-domain setting, and we will release our model as a NER toolkit called OpenNER.

2 Related Work

Figure 1: Illustration of the process with which we gather annotations into AnchorNER for the abstract of Clive Anderson.
Statistical Models for Named Entity Recognition

: Many prior studies have been conducted on NER. Before deep learning models were widely used, many classical machine learning systems were proposed for NER, such as Hidden Markov Models (HMM), Support Vector Machines (SVM), Conditional Random Fields (CRF) and decision trees. Zhou and Su (2002) propose an HMM and an HMM-based chunk tagger built on four types of internal and external features. Li et al. (2004) implement an SVM model with uneven margins and experiment with different window sizes to obtain different combinations of feature vectors. Krishnan and Manning (2006) utilize two CRF models, in which one makes predictions based on local features and the other is trained using both the outputs of the first CRF and local information. These feature-engineered NER systems rely on manually designed features and capture only shallow information; moreover, the features are usually task-specific, so the models do not generalize well.

Neural Models for Named Entity Recognition

: Recently, neural networks have become increasingly popular, and their adoption has substantially improved NER performance. Collobert et al. (2011) propose the first word-level neural network model using a CNN and a CRF. Some other studies exploit character-level architectures. Kuru, Can, and Yuret (2016) propose CharNER, which utilizes a stacked biLSTM and a Viterbi decoder. Gillick et al. (2016) describe an LSTM-based model called BTS, which encodes each character into bytes and achieves high performance on the CoNLL 2003 dataset without feature engineering. Many models also combine word-level and character-level embeddings. Ma and Hovy (2016) implement a CNN to obtain character embeddings, which are fed into a biLSTM-CRF model together with word embeddings to make predictions. Chiu and Nichols (2016) also add character and word features to embeddings and achieve a 91.62% F1 score on the CoNLL 2003 English dataset. Limsopatham and Collier (2016) focus on NER on Twitter: concatenated representations obtained from character- and word-level features and embeddings are fed into a biLSTM-CRF model. Lample et al. (2016) use a biLSTM to obtain character-level embeddings and concatenate them with word embeddings as the inputs of a biLSTM-CRF model. Yadav, Sharp, and Bethard (2018) incorporate affix features into a character+word neural architecture and show that affix embeddings can capture complementary information.
These neural models are limited by the small size of NER datasets and, when trained on a specific dataset, struggle to obtain high results in the open-domain setting.

Unsupervised Pre-training Methods for Named Entity Recognition

: In recent years, unsupervised pre-training has become increasingly popular, and many strong models have been proposed. Peters et al. (2018) introduce the ELMo representation, a new type of deep contextualized word representation learned from a pre-trained bidirectional language model. BERT is proposed in [Devlin et al.2019] and is pre-trained with the masked language model task and the next sentence prediction task; both BERT-base and BERT-large achieve very high performance on the English CoNLL 2003 dataset. Akbik, Blythe, and Vollgraf (2018) propose contextual string embeddings, produced with a pre-trained character language model; a biLSTM-CRF model using these embeddings obtains a state-of-the-art 93.09% F1 score on the English CoNLL 2003 dataset. Baevski et al. (2019) present a pre-trained bidirectional transformer model using a cloze-style word reconstruction task; it outperforms all previous models and achieves 93.5% F1 on the English CoNLL 2003 dataset. Although pre-trained language models have proved very helpful for many natural language understanding tasks including NER, these strong models are not fully exploited due to the lack of a large-scale, high-quality NER dataset. Only limited NER supervision from small datasets reaches the model during fine-tuning. From this point of view, the amount of NER supervision still needs to be increased, even though much semantic and syntactic information has already been learned.

Open-domain Named Entity Recognition

: Previous NER models are trained with different datasets in different text genres. The most common datasets are the CoNLL 2003 dataset [Sang and Meulder2003], the Ontonote5 dataset [Pradhan et al.2012], etc. At the same time, some models focus on NER on Twitter [Ritter et al.2011, Li et al.2012]. However, few studies address the open-domain NER task. Although there are NER toolkits, such as StanfordNLP [Manning et al.2014], spaCy and NLTK [Wagner2010], that can be used on various texts, most of them are trained with a single dataset or mixed datasets. Due to the lack of specific designs and strategies, their performance is limited in the open-domain setting (results are shown in Section 5.3).

3 AnchorNER Dataset

We apply the pipeline described hereafter to a dump of abstracts of English Wikipedia from 2017 and obtain AnchorNER. This dataset is built from 5.2M abstracts of Wikipedia articles, consisting of 268M tokens across 12M sentences. The pipeline used to annotate named entities in abstracts of Wikipedia is illustrated in Figure 1. Our annotation strategy is a three-step process:

  1. We consider the title of an article and the anchored strings of hyperlinks in its abstract as the spans most likely to be named entities.

  2. We search for the types of these potential entities in DBpedia and map them to entity types. For instance, we map Person, Place, Organisation to PER, LOC and ORG, respectively. If an entry does not belong to any of the previous classes, we tag it as MISC.

  3. We make an exact match in the original text using the entities we find in DBpedia.
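The three steps above can be sketched as follows. This is a minimal illustration with a toy DBpedia lookup table and hypothetical helper names (`entity_type`, `annotate`); the actual pipeline operates on full Wikipedia abstracts and the DBpedia ontology.

```python
# Toy sketch of the three-step annotation process (illustrative names/data).

# Step 2: map DBpedia ontology classes to CoNLL-style entity types;
# anything outside the three mapped classes becomes MISC.
DBPEDIA_TO_CONLL = {"Person": "PER", "Place": "LOC", "Organisation": "ORG"}

def entity_type(dbpedia_class):
    return DBPEDIA_TO_CONLL.get(dbpedia_class, "MISC")

def annotate(tokens, entities):
    """Step 3: exact-match known entity spans in the token sequence.

    `entities` maps an entity's token tuple to its DBpedia class,
    e.g. {("Clive", "Anderson"): "Person"}; step 1 would collect these
    spans from the article title and anchored strings.
    """
    tags = ["O"] * len(tokens)
    for span, dbp_class in entities.items():
        t, n = entity_type(dbp_class), len(span)
        for i in range(len(tokens) - n + 1):
            # Only tag spans whose tokens are still unannotated.
            if tuple(tokens[i:i + n]) == span and all(x == "O" for x in tags[i:i + n]):
                tags[i] = "B-" + t
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + t
    return tags
```

Entities missed here (anything never anchored, hence absent from `entities`) are exactly the false negatives the correction model later recovers.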

Through this process, we get an initial version of AnchorNER. The coverage, i.e., the ratio of annotated tokens, of the dataset is 10.22%, which is lower than the 16.64% coverage of the manually-annotated CoNLL 2003 dataset. The main reason for our relatively low coverage is that we miss entities that do not appear in any anchored string. These missing entities, for example, Middlesex (LOC) and British Comedy Award (MISC), will be captured by the correction model described in the next section. The importance of the correction step is illustrated in detail in the ablation study section.

After the false-negative entity labels are corrected by the correction model, the coverage of AnchorNER rises to 22.26%. We further assess the annotation quality of a random subset of 1,000 tokens and measure a label accuracy of 98%, which is better than the 92% accuracy of the WiNER dataset.

4 Our Method

In this section, we introduce our methods. First, we briefly give a definition of the NER task in Section 4.1. Then we build up a correction dataset in Section 4.2 and propose a semi-supervised correction model in Section 4.3. Section 4.4 describes how we use the idea of curriculum learning to learn the correction model.

4.1 Task Definition

Before introducing our method, we first give a formal formulation of the NER task.

NER is the process of locating and classifying named entities in text into predefined entity categories. In this paper, we define four types of entities: person (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC).

Formally, we define a sentence as a sequence of tokens X = {x_1, x_2, …, x_n}. NER is the task of annotating each token with a tag. Tokens tagged with O are outside of named entities. The B-t tag indicates that a token is the first word of a named entity of type t, while the I-t tag is used for subsequent words inside a named entity of type t (e.g., B-PER, I-PER).
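As an illustration of this tagging scheme, the following sketch (with a made-up sentence and a helper `extract_entities` that is not part of the paper) shows how BIO tags encode entity spans:

```python
# Example BIO annotation under the four-type scheme described above.
sentence = ["Clive", "Anderson", "won", "a", "British", "Comedy", "Award"]
tags     = ["B-PER", "I-PER",    "O",   "O", "B-MISC",  "I-MISC", "I-MISC"]

def extract_entities(tokens, tags):
    """Recover (type, text) spans from a BIO tag sequence."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [tag[2:], [tok]]       # start a new entity
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)           # continue the current entity
        else:
            current = None                   # O tag or inconsistent I- tag
    return [(t, " ".join(words)) for t, words in entities]
```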

4.2 Correction Dataset

The goal of the neural correction model is to use a carefully annotated, high-quality dataset to correct the false negatives of large open-domain Wikipedia text. In order to implement the correction model, we first build a correction dataset from our AnchorNER dataset and the DocRED dataset [Yao et al.2019]. DocRED is the largest human-annotated dataset constructed from Wikipedia and Wikidata with named entities annotated. Since DocRED is obtained from Wikipedia, as is AnchorNER, there are 2,937 articles consisting of 8,882 sentences that appear in both datasets. Though some of the articles are not identical, most of the entities manually annotated in articles from DocRED appear in abstracts from AnchorNER. We believe that the initial labels in AnchorNER can help us learn how annotations are made in DocRED, in other words, learn the pattern of manual annotation. To get a ground truth for each token in the articles that appear in both datasets, we find all the entities marked in DocRED and use them to make exact matches in abstracts from AnchorNER, in descending order of entity length. If we match a phrase that has not yet been matched, we take its tag as the ground truth for all the tokens in the phrase. We remove sentences that obtain no ground truth from our dataset, because the difference between them and the corresponding sentences in DocRED may be too large, which means they cannot help the correction model learn. As a result, the dataset comprises 121,627 tokens accounting for 4,288 sentences, containing one word per line with empty lines representing sentence boundaries. Each word is followed by two tags. The first is the initial label in AnchorNER, and the second is the ground truth obtained from DocRED.
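The ground-truth alignment step can be sketched as below: DocRED mentions are matched in the abstract in descending order of length, so a longer mention wins over a shorter one it contains. The function name `align_ground_truth` and the BIO output format are illustrative assumptions.

```python
# Sketch of aligning DocRED entity mentions to an AnchorNER abstract.

def align_ground_truth(tokens, docred_entities):
    """`docred_entities`: list of (entity_token_tuple, type) pairs from DocRED."""
    truth = [None] * len(tokens)
    # Longer mentions first, so "Canada Deuterium Uranium" wins over "Canada".
    for span, etype in sorted(docred_entities, key=lambda e: -len(e[0])):
        n = len(span)
        for i in range(len(tokens) - n + 1):
            # Only accept a match whose tokens are not yet assigned.
            if tokens[i:i + n] == list(span) and all(t is None for t in truth[i:i + n]):
                truth[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    truth[j] = "I-" + etype
    return ["O" if t is None else t for t in truth]
```

A sentence whose `truth` comes back entirely "O" would be dropped, as described above.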

4.3 Correction Model

Figure 2: Overall architecture of the correction model.

We denote our correction dataset as D = {X, Y^a, Y^d}, where X is the set of sentences in the correction dataset, Y^a is the entity labels from the AnchorNER dataset and Y^d is the entity labels from the DocRED dataset. We define X = {X_1, X_2, …, X_N}, where X_i = {x_1, x_2, …, x_m} is a sentence and x_j is a word. For the entity labels, Y^a = {Y^a_1, Y^a_2, …, Y^a_N} and Y^d = {Y^d_1, Y^d_2, …, Y^d_N}, where Y^a_i = {y^a_1, y^a_2, …, y^a_m} and Y^d_i = {y^d_1, y^d_2, …, y^d_m}.

For a sentence X_i, we first input it into a BERT model to obtain its representations H_i = {h_1, h_2, …, h_m}, where m denotes the length of the sentence. Before the last classification layer, we embed the entity labels from the AnchorNER dataset and concatenate the entity label embeddings with the sentence representations at the corresponding positions. We denote e_j = Emb(y^a_j) and c_j = [h_j; e_j], where [·;·] means concatenation. Therefore, the last classification layer can be defined as p_j = softmax(W c_j + b), where W and b denote the parameters of the classification layer. The objective function for our correction model can be defined as

L = − Σ_{i=1}^{N} Σ_{j=1}^{m} log p_j(y^d_{i,j}),

i.e., the negative log-likelihood of the DocRED labels given the sentence and its AnchorNER labels.
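The label-conditioned classification step can be sketched with a toy stand-in for BERT: each token representation is concatenated with a one-hot embedding of its AnchorNER label before the softmax classifier. The label set, dimensions, and function name `classify` here are simplifications (the paper uses 12-dimensional one-hot label embeddings on top of BERT representations).

```python
import math

# Simplified label inventory (the paper's one-hot embeddings are 12-dimensional).
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

def one_hot(label):
    v = [0.0] * len(LABELS)
    v[LABELS.index(label)] = 1.0
    return v

def softmax(z):
    m = max(z)
    exp = [math.exp(x - m) for x in z]
    s = sum(exp)
    return [x / s for x in exp]

def classify(h, anchor_label, W, b):
    """p_j = softmax(W [h_j ; e_j] + b): predict the corrected label for one
    token, given its representation h_j and its noisy AnchorNER label."""
    c = h + one_hot(anchor_label)  # concatenation [h_j ; e_j]
    z = [sum(w * x for w, x in zip(row, c)) + bi for row, bi in zip(W, b)]
    return softmax(z)
```

Training would minimize the negative log-likelihood of the DocRED label under this distribution, summed over tokens and sentences.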
4.4 Curriculum Learning

Curriculum learning [Bengio et al.2009] is a training strategy proposed by Bengio et al. (2009) in the context of machine learning. They demonstrate that models can be better trained when the inputs are not randomly presented but organized in a meaningful order, such as from easy to hard.

Inspired by this idea, we rank all sentences in the correction dataset from easy to hard and split the dataset into three sets, which are input into the correction model in order. Specifically, we calculate an F1 score for each sentence in the correction dataset between the corresponding AnchorNER and DocRED entity labels (see Figure 3 for the distribution). We remove the sentences whose F1 scores are lower than 0.1. Then, we rank all remaining sentences by their F1 score from high to low and split the correction dataset into three sets D_1, D_2 and D_3. That means that in D_1 the sentences have more similar AnchorNER and DocRED labels and are easier for the correction model to learn; similarly, D_3 is more difficult for the model. We input the three sets from D_1 to D_3 and train our correction model on each set for five epochs.
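The curriculum construction can be sketched as follows. The 0.1 threshold and the easy-to-hard three-way split come from the paper; the token-level F1 scoring and the equal-sized splits are simplifying assumptions, and the function names are illustrative.

```python
# Sketch of the curriculum split over (AnchorNER tags, DocRED tags) pairs.

def token_f1(anchor_tags, docred_tags):
    """Token-level F1 agreement between the two tag sequences (non-O tokens)."""
    pred = {(i, t) for i, t in enumerate(anchor_tags) if t != "O"}
    gold = {(i, t) for i, t in enumerate(docred_tags) if t != "O"}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def curriculum_split(sentences, threshold=0.1):
    """`sentences`: list of (anchor_tags, docred_tags) pairs.
    Drops low-agreement sentences, ranks the rest easy-to-hard,
    and cuts them into three sets D1 (easiest) .. D3 (hardest)."""
    scored = [(token_f1(a, d), (a, d)) for a, d in sentences]
    kept = [s for s in scored if s[0] >= threshold]
    kept.sort(key=lambda x: -x[0])            # high F1 (easy) first
    ordered = [pair for _, pair in kept]
    n = len(ordered)
    cut1, cut2 = n // 3, 2 * n // 3
    return ordered[:cut1], ordered[cut1:cut2], ordered[cut2:]
```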

Figure 3: F1 score distribution over the correction dataset.

5 Experiments

In this section, we evaluate the effectiveness of the proposed method using several different NER data sets.

5.1 Experiment Setup

Data Sets

We conduct our experiments on six open-domain NER data sets.

  • CoNLL03 [Sang and Meulder2003] The CoNLL 2003 Shared Task dataset is a well-known NER dataset built from Reuters newswire articles. It is annotated with four entity types (PER, LOC, ORG and MISC).

  • DocRED [Yao et al.2019] DocRED is a human-annotated dataset constructed from Wikipedia and Wikidata. Yao et al. (2019) annotate 5,053 Wikipedia documents containing 1,002k words with named entities and their relations. Entity types include person, location, organization, time and number, which are mapped to CoNLL 2003 named entity (NE) classes.

  • Ontonote5 [Pradhan et al.2012] The OntoNotes 5.0 dataset contains newswire, magazine articles, broadcast news, broadcast conversations, web data and conversational speech data. The dataset has about 1.6M words and is annotated with 18 named entity types. We follow [Nothman2008] to map the annotations to the CoNLL 2003 tag set.

  • Tweet [Ritter et al.2011] Ritter et al. (2011) annotate 2,400 tweets (34k tokens) with 10 entity types, and we map the entity types to the CoNLL 2003 tag set.

  • Webpage Ratinov and Roth (2009) manually annotate a collection of 20 webpages (8k tokens) on different topics with the CoNLL 2003 NE classes.

  • WikiGold [Balasuriya et al.2009] Balasuriya et al. (2009) manually annotate a set of Wikipedia articles comprising 40k tokens with the CoNLL 2003 tag set.

Competing Methods

We compare the performance of our model against the following approaches.

  • Bi-LSTM-CRF [Huang, Xu, and Yu2015] A bidirectional Long Short-Term Memory (LSTM) network with a Conditional Random Field (CRF) layer.

  • CVT [Clark et al.2018] A semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data.

  • ELMo [Peters et al.2018] A type of deep contextualized word representation which models both complex characteristics of word use and how these uses vary across linguistic contexts.

  • Flair [Akbik, Bergmann, and Vollgraf2019, Akbik, Blythe, and Vollgraf2018] A type of contextual word representation which is distilled from all contextualized instances using a pooling operation.

  • BERT [Devlin et al.2019] A language representation model which is designed to pre-train deep bidirectional representations from the unlabeled text by jointly conditioning on both left and right context in all layers.

Evaluation Metrics

We evaluate NER systems by comparing their outputs against human annotations with exact-match. A named entity is considered correctly recognized only if both boundaries and type match ground truth.

The F1 score is the harmonic mean of precision and recall, and the balanced F1 score is most commonly used:

F1 = 2 · Precision · Recall / (Precision + Recall)
We calculate the F1 score on all six data sets and average the scores to measure the open-domain performance of each model.
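The exact-match evaluation can be sketched as follows: an entity counts as correct only if both its span boundaries and its type match the gold annotation. The helper names (`spans`, `exact_match_f1`) are illustrative; actual CoNLL-style evaluation is typically done with the conlleval script.

```python
# Sketch of entity-level exact-match F1 over BIO tag sequences.

def spans(tags):
    """Extract (start, end, type) entity spans from a BIO sequence (end exclusive)."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel closes a trailing span
        if start is not None and tag != "I-" + etype:
            out.append((start, i, etype))
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return out

def exact_match_f1(gold_tags, pred_tags):
    """Entity counts as correct only if boundaries AND type both match."""
    gold, pred = set(spans(gold_tags)), set(spans(pred_tags))
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note how a boundary error is fully penalized: predicting only "Clive" for the gold entity "Clive Anderson" yields both a false positive and a false negative.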

5.2 Implementation Details

Model CoNLL03 DocRED Ontonote5 Tweet Webpage Wikigold Avg.
Public NER toolkits
NLTK 48.91 45.19 39.00 21.18 28.17 44.22 37.78
SpaCy 65.14 51.32 76.66 31.87 36.39 58.86 53.37
StanfordNER 87.95 53.37 60.64 35.74 44.34 62.00 57.34
State-of-the-art neural models
Bi-LSTM-CRF 87.00 52.30 56.33 33.66 46.45 61.81 56.26
CVT 92.09 67.43 62.41 40.49 51.76 73.57 64.63
ELMo 92.51 69.84 62.35 33.82 52.33 75.63 64.41
Flair 92.90 67.45 64.76 38.87 51.56 74.99 65.09
BERT-base 91.95 71.82 66.63 47.79 45.58 78.21 67.00
BERT-large 92.10 73.67 66.31 50.00 49.85 80.44 68.72
Our models
Ours (BERT-base) 92.02 79.80 66.71 50.22 48.75 82.37 70.12 (+4.66%)
Ours (BERT-large) 92.15 80.20 68.19 51.45 50.81 82.16 70.83 (+3.07%)
Table 1: Comparing our methods with state-of-the-art methods on various open-domain NER datasets.
Model CoNLL03 DocRED Ontonote5 Tweet Webpage Wikigold Avg.
Ours (BERT-base) 92.02 79.80 66.71 50.22 48.75 82.37 70.12
w/o Wiki label 92.11 78.03 67.29 49.50 48.56 81.20 69.45 (-0.96%)
smaller correction dataset 92.13 77.56 67.29 48.57 47.09 81.90 69.09 (-1.47%)
w/o curriculum learning 91.58 77.59 66.94 46.98 47.21 81.05 68.56 (-2.22%)
w/o correction 91.86 73.44 66.44 46.75 46.29 78.41 67.20 (-4.16%)
Ablation on fine-tuning with labeled data
CoNLL only (i.e., BERT-base) 91.95 71.82 66.63 47.79 45.58 78.21 67.00
DocRED 70.25 88.18 56.82 47.03 50.68 76.76 64.95 (-7.37%)
DocRED+CoNLL (mixed) 90.63 88.02 64.89 43.25 49.30 77.44 68.92 (-1.71%)
DocRED+CoNLL (sequential) 91.69 75.81 66.24 47.86 48.23 79.79 68.27 (-2.64%)

Table 2: Ablation Study Results.

We first split the AnchorNER dataset into training, development and testing sets according to the categories of the article titles. Considering the limitations of our computing resources, we randomly sample 70,000 abstracts for the training phase and 20,000 abstracts for the testing phase, accounting for three million tokens from the AnchorNER dataset.

All the BERT models in our method use default parameters. We use the cased BERT models with a maximum sequence length of 128. In the correction model, we utilize 12-dimensional one-hot vectors as embeddings of the entity labels from AnchorNER. The optimizer is BERTAdam [Devlin et al.2019] with a learning rate of 1e-5 for both BERT-base and BERT-large. We set the batch size to 32 for BERT-base and 8 for BERT-large, and the warm-up proportion to 0.4. We fine-tune both BERT-base and BERT-large on our AnchorNER training set for 5 epochs, and then on the CoNLL dataset for 50 epochs. For the competing methods, the BERT models are fine-tuned, and the other models are trained, with the CoNLL 2003 English dataset only.

5.3 Main Results

Comparison with competing methods

The performance of each model is presented in Table 1. From the results, we have the following observations: (1) Our approach is comparable to the state-of-the-art result on the benchmark dataset CoNLL03. (2) Our approach outperforms existing methods for open-domain NER, increasing the average F1 score by 4.66% and 3.07% with BERT base and large respectively.

Comparison with existing NER toolkits

As shown in Table 1, our toolkit OpenNER significantly and consistently outperforms existing NER toolkits on six NER data sets.

5.4 Ablation Studies

Now we systematically look into some important components of our method, and we conduct some ablation studies to analyze the effectiveness of each component.

Effectiveness of AnchorNER dataset

We evaluate the effectiveness of the AnchorNER dataset by removing it: we train a BERT model directly with the DocRED and CoNLL datasets. We try two training methods: (1) fine-tuning BERT first on DocRED and then on CoNLL; (2) fine-tuning BERT on a mixture of DocRED and CoNLL. These two methods reduce performance by 2.64% and 1.71% respectively, as shown in Table 2. Although DocRED is a high-quality NER dataset, its size is limited; with AnchorNER, our models are exposed to many more named entities and obtain much more information.

Effectiveness of AnchorNER label

We remove the AnchorNER labels from the correction model to evaluate their effectiveness. In this way, the correction model becomes a plain BERT classifier trained with the DocRED dataset. Results show that removing these labels slightly hurts performance, reducing the average F1 score by 0.96%. Although our uncorrected dataset has a low recall rate, its labels are precise and add information to the DocRED dataset.

Effectiveness of the size of the correction dataset

We restrict the collection of the correction dataset, keeping only the sentences that appear identically in both AnchorNER and DocRED. A total of 2,587 sentences consisting of 61,343 tokens meet this requirement, which is almost half the size of our correction dataset. The results in the third row of Table 2 show that the overall performance with this smaller correction dataset decreases by 1.47%. When the correction model is exposed to a larger correction dataset, it learns more information about corrected named entities, which improves the performance of the model.

Effectiveness of correction

We study the effectiveness of correction by removing the correction process and using only the initial labels from AnchorNER. As we mentioned in Section 3, we miss some entities in this way, resulting in a lower recall rate. The fifth row of Table 2 shows that removing the correction step leads to a reduction of 4.16% in average F1 score. This result illustrates that the correction process is the most important part of our approach, and that improving the quality of the AnchorNER dataset can greatly improve the final performance of the model.

Effectiveness of curriculum learning

During the training of the correction model, we take advantage of the idea of curriculum learning. We compare this training method with a variant in which we directly train the correction model with the whole correction dataset. From Table 2 we can see that without curriculum learning, performance decreases by 2.22%. By using curriculum learning, we feed the correction dataset to the model in order of difficulty, which makes it easier for the correction model to learn the pattern of the correction process.

5.5 Qualitative Study of the Correction Model

Corrected vs. uncorrected AnchorNER dataset

We compare the labels of the same 1,000 tokens in the corrected and uncorrected AnchorNER dataset; 61 entities are retrieved after correction. This result clearly illustrates the effectiveness and necessity of the correction process.

Case study

Figure 4 shows two examples of correction. The sentence in Figure 4(a) first appears in Figure 1, where the entity British Comedy Award is not recognized during the three-step process mentioned in Section 3 because it does not appear in any hyperlink. The second example in Figure 4(b) shows that entity CANDU and Canada Deuterium Uranium are not recognized. In the uncorrected dataset, Canada is mislabeled as B-LOC. This is because Canada appears in one of the hyperlinks, but Canada Deuterium Uranium does not. With the help of the correction model, all the labels are corrected.

(a) A case showing that the correction model retrieves the missing entity called British Comedy Award.
(b) A case showing that the correction model retrieves two missing entities called CANDU and Canada Deuterium Uranium.
(c) A case showing that the correction model fails to recognize the entity called Two Flint.
Figure 4: Comparison between the dataset before correction and the dataset after correction.

However, we also identify some errors during the correction process, as shown in Figure 4(c). Two Flint should be marked as MISC, but the correction model fails to recognize it; instead, it only tags Flint as I-MISC. Another type of error is that the model changes a correct label to a wrong one. The correction model could be further improved to address these issues.

5.6 Discussion

Our work is similar to the application of distant supervision to relation extraction [Mintz et al.2009]. Just as they extract relations with Freebase, we extract entities from anchored strings and look them up in DBpedia. For each entity that appears in DBpedia, we find all the positions where it appears in the sentence and annotate them with the entity type from DBpedia.

When building a large dataset, instead of being supervised by knowledge bases as in their work, our algorithm is supervised by a correction model trained with a relatively small but high-quality dataset. We use the correction model to correct the false negatives of Wikipedia text so as to increase the recall rate of the AnchorNER dataset. Our approach thus takes advantage of semi-supervised methods, leading to state-of-the-art performance.

6 Conclusion and Future Work

In this paper, we propose a semi-supervised annotation framework to make full use of abstracts from Wikipedia and obtain a large, high-quality dataset called AnchorNER. We utilize anchored strings in abstracts and DBpedia to annotate the dataset. We also design a neural correction model trained with DocRED to correct the false-negative entity labels, and then train a BERT model with the corrected dataset. Our trained BERT model obtains state-of-the-art open-domain performance: the average F1 score is increased by 4.66% and 3.07% with BERT base and large respectively. In the future, we could also use our AnchorNER dataset during the pre-training process and develop more pre-training methods leveraging NER information.


  • [Akbik, Bergmann, and Vollgraf2019] Akbik, A.; Bergmann, T.; and Vollgraf, R. 2019. Pooled contextualized embeddings for named entity recognition. In NAACL-HLT (1).
  • [Akbik, Blythe, and Vollgraf2018] Akbik, A.; Blythe, D.; and Vollgraf, R. 2018. Contextual string embeddings for sequence labeling. In COLING.
  • [Aramaki et al.2009] Aramaki, E.; Miura, Y.; Tonoike, M.; Ohkuma, T.; Mashuichi, H.; and Ohe, K. 2009. TEXT2TABLE: medical text summarization system based on named entity recognition and modality identification. In BioNLP@HLT-NAACL.
  • [Baevski et al.2019] Baevski, A.; Edunov, S.; Liu, Y.; Zettlemoyer, L.; and Auli, M. 2019. Cloze-driven pretraining of self-attention networks. CoRR.
  • [Balasuriya et al.2009] Balasuriya, D.; Ringland, N.; Nothman, J.; Murphy, T.; and Curran, J. R. 2009. Named entity recognition in wikipedia. In PWNLP@IJCNLP.
  • [Banko et al.2007] Banko, M.; Cafarella, M. J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open information extraction from the web. In IJCAI.
  • [Bengio et al.2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML.
  • [Chen et al.2004] Chen, H.; Chung, W.; Xu, J. J.; Wang, G.; Qin, Y.; and Chau, M. 2004. Crime data mining: A general framework and some examples. IEEE Computer.
  • [Chiu and Nichols2016] Chiu, J. P. C., and Nichols, E. 2016. Named entity recognition with bidirectional lstm-cnns. TACL.
  • [Clark et al.2018] Clark, K.; Luong, M.; Manning, C. D.; and Le, Q. V. 2018. Semi-supervised sequence modeling with cross-view training. In EMNLP.
  • [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res.
  • [Devlin et al.2019] Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1).
  • [Ghaddar and Langlais2017] Ghaddar, A., and Langlais, P. 2017. Winer: A wikipedia annotated corpus for named entity recognition. In IJCNLP(1).
  • [Ghaddar and Langlais2018] Ghaddar, A., and Langlais, P. 2018. Robust lexical features for improved neural network named-entity recognition. In COLING.
  • [Gillick et al.2016] Gillick, D.; Brunk, C.; Vinyals, O.; and Subramanya, A. 2016. Multilingual language processing from bytes. In HLT-NAACL.
  • [Huang, Xu, and Yu2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR.
  • [Krishnan and Manning2006] Krishnan, V., and Manning, C. D. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In ACL.
  • [Kuru, Can, and Yuret2016] Kuru, O.; Can, O. A.; and Yuret, D. 2016. Charner: Character-level named entity recognition. In COLING.
  • [Lample et al.2016] Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
  • [Li et al.2012] Li, C.; Weng, J.; He, Q.; Yao, Y.; Datta, A.; Sun, A.; and Lee, B. 2012. Twiner: named entity recognition in targeted twitter stream. In SIGIR.
  • [Li, Bontcheva, and Cunningham2004] Li, Y.; Bontcheva, K.; and Cunningham, H. 2004. SVM based learning system for information extraction. In Deterministic and Statistical Methods in Machine Learning.
  • [Limsopatham and Collier2016] Limsopatham, N., and Collier, N. 2016. Bidirectional LSTM for named entity recognition in twitter messages. In NUT@COLING.
  • [Ma and Hovy2016] Ma, X., and Hovy, E. H. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In ACL (1).
  • [Manning et al.2014] Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J. R.; Bethard, S.; and McClosky, D. 2014. The stanford corenlp natural language processing toolkit. In ACL (System Demonstrations).
  • [Mintz et al.2009] Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In ACL.
  • [Nadeau, Turney, and Matwin2006] Nadeau, D.; Turney, P. D.; and Matwin, S. 2006. Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. In Canadian Conference on AI, Lecture Notes in Computer Science.
  • [Nothman et al.2013] Nothman, J.; Ringland, N.; Radford, W.; Murphy, T.; and Curran, J. R. 2013. Learning multilingual named entity recognition from wikipedia. Artif. Intell.
  • [Nothman2008] Nothman, J. 2008. Learning named entity recognition from wikipedia. Honours Bachelor thesis, The University of Sydney Australia.
  • [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT.
  • [Pradhan et al.2012] Pradhan, S.; Moschitti, A.; Xue, N.; Uryupina, O.; and Zhang, Y. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In EMNLP-CoNLL Shared Task.
  • [Ratinov and Roth2009] Ratinov, L., and Roth, D. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
  • [Ritter et al.2011] Ritter, A.; Clark, S.; Mausam; and Etzioni, O. 2011. Named entity recognition in tweets: An experimental study. In EMNLP.
  • [Sang and Meulder2003] Sang, E. F. T. K., and Meulder, F. D. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In CoNLL.
  • [Shang et al.2018] Shang, J.; Liu, L.; Gu, X.; Ren, X.; Ren, T.; and Han, J. 2018. Learning named entity tagger using domain-specific dictionary. In EMNLP.
  • [Wagner2010] Wagner, W. 2010. Steven bird, ewan klein and edward loper: Natural language processing with python, analyzing text with the natural language toolkit - o’reilly media, beijing, 2009, ISBN 978-0-596-51649-9. Language Resources and Evaluation.
  • [Wu, Liu, and Cohn2018] Wu, M.; Liu, F.; and Cohn, T. 2018. Evaluating the utility of hand-crafted features in sequence labelling. CoRR.
  • [Yadav and Bethard2018] Yadav, V., and Bethard, S. 2018. A survey on recent advances in named entity recognition from deep learning models. In COLING.
  • [Yadav, Sharp, and Bethard2018] Yadav, V.; Sharp, R.; and Bethard, S. 2018. Deep affix features improve neural named entity recognizers. In *SEM@NAACL-HLT.
  • [Yao et al.2019] Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, Z.; Huang, L.; Zhou, J.; and Sun, M. 2019. Docred: A large-scale document-level relation extraction dataset. In ACL (1).
  • [Zhou and Su2002] Zhou, G., and Su, J. 2002. Named entity recognition using an hmm-based chunk tagger. In ACL.