pioNER: Datasets and Baselines for Armenian Named Entity Recognition
In this work, we tackle the problem of Armenian named entity recognition, providing silver- and gold-standard datasets as well as establishing baseline results on popular models. We present a 163000-token named entity corpus automatically generated and annotated from Wikipedia, and another 53400-token corpus of news sentences with manual annotation of person, organization, and location named entities. The corpora were used to train and evaluate several popular named entity recognition models. Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and an encyclopedia.
Named entity recognition is an important task of natural language processing, featured in many popular text processing toolkits. This area has been actively studied in recent decades, and the advent of deep learning reinvigorated the research on more effective and accurate models. However, most existing approaches require large annotated corpora. To the best of our knowledge, no such work has been done for the Armenian language. In this work we address several problems: the creation of a corpus for training machine learning models, the development of a gold-standard test corpus, and the evaluation of established approaches on the Armenian language.
Considering the cost of creating a manually annotated named entity corpus, we focused on alternative approaches. The lack of named entity corpora is a common problem for many languages and has attracted the attention of researchers around the globe. Projection-based transfer schemes (e.g. Yarowsky et al., Zitouni and Florian, Ehrmann et al.) have been shown to be very effective, using a resource-rich language's corpora to generate annotated data for the low-resource language. In this approach, the annotations of the high-resource language are projected over the corresponding tokens of parallel texts in the low-resource language. This strategy can be applied only to language pairs that have parallel corpora, and thus would not work for Armenian, as we did not have access to a sufficiently large parallel corpus with a resource-rich language.
Another popular approach is using Wikipedia. Klesti Hoxha and Artur Baxhaku employ gazetteers extracted from Wikipedia to generate an annotated corpus for Albanian, and Weber and Pötzl propose a rule-based system for German that leverages the information from Wikipedia. However, the latter relies on external tools such as part-of-speech taggers, making it nonviable for the Armenian language.
Nothman et al. generated a silver-standard corpus for 9 languages by extracting Wikipedia article texts with outgoing links and turning those links into named entity annotations based on the target article’s type. Sysoev and Andrianov used a similar approach for the Russian language. Based on its success for a wide range of languages, our choice fell on this model to tackle automated data generation and annotation for the Armenian language.
Aside from the lack of training data, we also address the absence of a benchmark dataset of Armenian texts for named entity recognition. We propose a gold-standard corpus with manual annotation of the CoNLL named entity categories: person, location, and organization, hoping it will be used to evaluate future named entity recognition models.
Furthermore, popular entity recognition models were applied to the mentioned data in order to obtain baseline results for future research in the area. Along with the datasets, we developed GloVe word embeddings to train and evaluate the deep learning models in our experiments.
The contributions of this work are (i) the silver-standard training corpus, (ii) the gold-standard test corpus, (iii) GloVe word embeddings, and (iv) baseline results for 3 different models on the proposed benchmark dataset. All aforementioned resources are available on GitHub.
II Automated training corpus generation
We used Sysoev and Andrianov’s modification of the Nothman et al. approach to automatically generate data for training a named entity recognizer. This approach uses links between Wikipedia articles to generate sequences of named-entity annotated tokens.
II-A Dataset extraction
The main steps of the dataset extraction system are described in Figure 1.
First, each Wikipedia article is assigned a named entity class (e.g. the article Քիմ Քաշքաշյան (Kim Kashkashian) is classified as PER (person), Ազգերի լիգա (League of Nations) as ORG (organization), Սիրիա (Syria) as LOC, etc.). One of the core differences between our approach and Nothman's system is that we do not rely on manual classification of articles and do not use inter-language links to project article classifications across languages. Instead, our classification algorithm looks only at the article's Wikidata entry: it takes the subclass of labels of the entry's first instance of value, which are language-independent and thus can be used for any language.
Then, outgoing links in articles are assigned the type of the article they lead to. Sentences are included in the training corpus only if they contain at least one named entity and all contained capitalized words have an outgoing link to an article of known type. Since in Wikipedia only the first mention of each entity is linked, this approach becomes very restrictive, so in order to include more sentences, additional links are inferred. This is accomplished by compiling a list of common aliases for articles corresponding to named entities, and then finding text fragments matching those aliases to assign a named entity label. An article's aliases include its title, the titles of disambiguation pages containing the article, and the texts of links leading to the article (e.g. Լենինգրադ (Leningrad), Պետրոգրադ (Petrograd), Պետերբուրգ (Peterburg) are aliases for Սանկտ Պետերբուրգ (Saint Petersburg)). The list of aliases is compiled for all PER, ORG, and LOC articles.
After that, link boundaries are adjusted: labels are removed for expressions in parentheses and for text after a comma, and in some cases the linked text is broken into separate named entities if it contains a comma. For example, [LOC Աբովյան (քաղաք)] (Abovyan (town)) is reworked into [LOC Աբովյան] (քաղաք).
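The alias matching and boundary adjustment described above can be sketched as follows; the ALIASES table and the greedy phrase matcher are simplified, hypothetical stand-ins for the actual implementation:

```python
# Hypothetical alias table: surface form -> named entity type of the
# target article (compiled from titles, disambiguation pages, link texts).
ALIASES = {
    "Լենինգրադ": "LOC",   # Leningrad  -> Saint Petersburg
    "Պետրոգրադ": "LOC",   # Petrograd  -> Saint Petersburg
}

def adjust_boundary(surface):
    """Strip a parenthesized qualifier and anything after a comma from a
    linked text fragment, as done when cleaning link boundaries."""
    surface = surface.split("(")[0]
    surface = surface.split(",")[0]
    return surface.strip()

def infer_labels(tokens, aliases):
    """Greedily match alias phrases over a token list and emit
    (start, end, type) spans for inferred named entity links."""
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first (up to 3 tokens here).
        for j in range(min(len(tokens), i + 3), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in aliases:
                spans.append((i, j, aliases[phrase]))
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return spans
```

For instance, `adjust_boundary("Աբովյան (քաղաք)")` yields `"Աբովյան"`, matching the boundary-adjustment example above.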
Table I: Mapping between Wikidata subclass of values and named entity types

|subclass of values|Type|
|company, business enterprise, company, juridical person, air carrier, political organization, government organization, secret service, political party, international organization, alliance, armed organization, higher education institution, educational institution, university, educational organization, school, fictional magic school, broadcaster, newspaper, periodical literature, religious organization, football club, sports team, musical ensemble, music organisation, vocal-musical ensemble, sports organization, criminal organization, museum of culture, scientific organisation, non-governmental organization, nonprofit organization, national sports team, legal person, scholarly publication, academic journal, association, band, sports club, institution, medical facility|ORG|
|state, disputed territory, country, occupied territory, political territorial entity, city, town, village, rural area, rural settlement, urban-type settlement, geographical object, geographic location, geographic region, community, administrative territorial entity, former administrative territorial entity, human settlement, county, province, federated state, district, county-equivalent, municipal formation, raion, nahiyah, mintaqah, muhafazah, realm, principality, historical country, watercourse, lake, sea, still waters, body of water, landmass, minor planet, landform, natural geographic object, mountain range, mountain, protected area, national park, geographic region, geographic location, arena, bridge, airport, stadium, performing arts center, public building, venue, sports venue, church, temple, place of worship, retail building|LOC|
|person, fictional character, fictional humanoid, human who may be fictional, given name, fictional human, magician in fantasy|PER|
II-B Using Wikidata to classify Wikipedia
Instead of manually classifying Wikipedia articles as it was done in Nothman et al., we developed a rule-based classifier that used an article’s Wikidata instance of and subclass of attributes to find the corresponding named entity type.
The classification could be done using solely the instance of labels, but these labels are unnecessarily specific for the task, and building a mapping over them would require more time-consuming and meticulous work. Therefore, we classified articles based on the subclass of values of their first instance of attribute. Table I displays the mapping between these values and named entity types. Using higher-level subclass of values was not an option, as they were often too general, making it impossible to derive the correct named entity category.
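A minimal sketch of such a rule-based classifier is shown below; the SUBCLASS_TO_TYPE mapping is a small excerpt of Table I, and the dictionary-based entry structure is a simplified, hypothetical stand-in for real Wikidata records:

```python
# Small excerpt of the Table I mapping from "subclass of" labels
# to named entity types.
SUBCLASS_TO_TYPE = {
    "political party": "ORG",
    "university": "ORG",
    "city": "LOC",
    "country": "LOC",
    "human settlement": "LOC",
    "person": "PER",
    "fictional character": "PER",
}

def classify_article(wikidata_entry):
    """Assign a named entity type to an article using the 'subclass of'
    labels of its first 'instance of' value; None means non-entity."""
    instance_of = wikidata_entry.get("instance of", [])
    if not instance_of:
        return None
    first = instance_of[0]
    for subclass in first.get("subclass of", []):
        if subclass in SUBCLASS_TO_TYPE:
            return SUBCLASS_TO_TYPE[subclass]
    return None

# Hypothetical, simplified entry for Սիրիա (Syria):
syria = {"instance of": [{"label": "sovereign state",
                          "subclass of": ["country", "state"]}]}
```

Because the mapping keys are Wikidata labels rather than article text, the same table works unchanged for any language edition of Wikipedia.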
II-C Generated data
Using the algorithm described above, we generated 7455 annotated sentences with 163247 tokens based on the 20 February 2018 dump of Armenian Wikipedia.
The generated data is still significantly smaller than the manually annotated corpora from CoNLL 2002 and 2003. For comparison, the train set of the English CoNLL 2003 corpus contains 203621 tokens and the German one 206931, while the Spanish and Dutch corpora from CoNLL 2002 contain 273037 and 218737 lines, respectively. The smaller size of our generated data can be attributed to the strict selection of candidate sentences as well as simply to the relatively small size of Armenian Wikipedia.
The accuracy of annotation in the generated corpus heavily relies on the quality of links in Wikipedia articles. During generation, we assumed that the first mention of every named entity has an outgoing link to its article; however, this was not always the case in the actual source data, and as a result the train set contained sentences where not all named entities are labeled. Annotation inaccuracies also stemmed from wrongly assigned link boundaries (for example, the Wikipedia article Արթուր Ուելսլի Վելինգթոն (Arthur Wellesley) contains a link to the Napoleon article with the text “է Նապոլեոնը” (“Napoleon is”), when it should be “Նապոլեոնը” (“Napoleon”)). Another common kind of annotation error occurred when a named entity appeared inside a link not targeting a LOC, ORG, or PER article (e.g. “ԱՄՆ նախագահական ընտրություններում” (“USA presidential elections”) is linked to the article ԱՄՆ նախագահական ընտրություններ 2016 (United States presidential election, 2016), and as a result [LOC ԱՄՆ] (USA) is lost).
III Test dataset
In order to evaluate the models trained on generated data, we manually annotated a named entity dataset comprising 53453 tokens and 2566 sentences, selected from over 250 news texts from ilur.am.
During annotation, we generally relied on the categories and guidelines assembled by BBN Technologies for the TREC 2002 question answering track.
We ignored entities of other categories (e.g. works of art, law, or events), including those cases where an ORG, LOC or PER entity was inside an entity of extraneous type (e.g. ՀՀ (RA) in ՀՀ Քրեական Օրենսգիրք (RA Criminal Code) was not annotated as LOC).
Quotation marks around a named entity were not annotated unless those quotations were a part of that entity’s full official name (e.g. «Նաիրիտ գործարան» ՓԲԸ (“Nairit Plant” CJSC)).
Depending on context, metonyms such as Կրեմլ (Kremlin), Բաղրամյան 26 (Baghramyan 26) were annotated as ORG when referring to respective government agencies. Likewise, country or city names were also tagged as ORG when referring to sports teams representing them.
IV Word embeddings
Apart from the datasets, we also developed word embeddings for the Armenian language, which we used in our experiments to train and evaluate named entity recognition algorithms. Considering their ability to capture semantic regularities, we used GloVe to train the word embeddings. We assembled a dataset of Armenian texts containing 79 million tokens from the articles of Armenian Wikipedia, the Armenian Soviet Encyclopedia, a subcorpus of the Eastern Armenian National Corpus, and over a dozen Armenian news websites and blogs. The included texts cover topics such as economics, politics, weather forecasts, IT, law, and society, and come from non-fiction as well as fiction genres.
Similar to the original embeddings published for the English language, we release 50-, 100-, 200-, and 300-dimensional word vectors for Armenian with a vocabulary size of 400000. Before training, all words in the dataset were lowercased. For the final models we used a window size of 15 and 20 training epochs.
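The released vectors follow the standard GloVe text format (one word per line, followed by its space-separated vector components), so they can be loaded with a short stdlib-only helper; the function below is an illustrative sketch, not part of the released code:

```python
def load_glove(path, dim):
    """Parse a GloVe-format text file into a {word: [float, ...]} dict.
    Each line holds a word followed by `dim` space-separated floats.
    Since the vectors were trained on lowercased text, callers should
    lowercase words before lookup."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            assert len(values) == dim, "unexpected vector dimensionality"
            vectors[word] = values
    return vectors
```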
V Experiments

In this section we describe a number of experiments aimed at comparing the performance of popular named entity recognition algorithms on our data. We trained and evaluated three systems: Stanford NER, the spaCy 2.0 named entity recognizer, and a recurrent Char-biLSTM+biLSTM+CRF model.
Stanford NER is a conditional random fields (CRF) classifier based on lexical and contextual features such as the current word, character-level n-grams of up to length 6 at its beginning and end, previous and next words, and word shape and sequence features.
spaCy 2.0 uses a CNN-based transition system for named entity recognition. For each token, a Bloom embedding is calculated based on its lowercase form, prefix, suffix, and shape; then, using residual CNNs, a contextual representation of that token is extracted that potentially draws information from up to 4 tokens on each side. Each update of the transition system's configuration is a classification task that uses the contextual representation of the top token on the stack, the preceding and succeeding tokens, the first two tokens of the buffer, and their leftmost, second-leftmost, rightmost, and second-rightmost children. The valid transition with the highest score is applied to the system. This approach reportedly performs within 1% of the current state-of-the-art for English.
The main model that we focused on was the recurrent model with a CRF top layer; the above-mentioned methods served mostly as baselines. The distinctive feature of this approach is the way contextual word embeddings are formed. For each token separately, a character-based representation is extracted using a bidirectional LSTM to capture its word shape features. This representation is concatenated with a distributional word vector such as GloVe, forming an intermediate word embedding. Applying another bidirectional LSTM cell to these intermediate word embeddings yields the contextual representation of tokens (Figure 4). Finally, a CRF layer labels the sequence of these contextual representations. In our experiments, we used Guillaume Genthial's implementation.
Experiments were carried out using the IOB tagging scheme, with a total of 7 class tags: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG.
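To illustrate the scheme, a hypothetical helper converting (start, end, type) entity spans into per-token IOB tags might look like this:

```python
# The 7 class tags used in the experiments.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

def spans_to_iob(n_tokens, spans):
    """Convert (start, end, type) entity spans into per-token IOB tags:
    B-* on the first token of an entity, I-* inside it, O elsewhere.
    `end` is exclusive; spans are assumed non-overlapping."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags
```

For a 5-token sentence with a two-token PER entity at position 0 and a one-token LOC entity at position 3, this produces B-PER, I-PER, O, B-LOC, O.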
We randomly selected 80% of generated annotated sentences for training and used the other 20% as a development set. The models with the best F1 score on the development set were tested on the manually annotated gold dataset.
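Model selection and the reported scores rely on entity-level precision, recall, and F1; the following is a minimal sketch of exact-match scoring over predicted vs. gold spans (an illustration, not the evaluation script actually used):

```python
def evaluate(gold_spans, pred_spans):
    """Exact-match entity-level precision, recall, and F1.
    Spans are sets of (sentence_id, start, end, type) tuples;
    a prediction counts only if boundaries and type both match."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```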
[Table: dev F1 and test F1 for each model, with train embeddings=False vs. train embeddings=True]
Table III shows the average scores of the evaluated models. The highest F1 score was achieved by the recurrent model using a batch size of 8 and the Adam optimizer with an initial learning rate of 0.001. Updating word embeddings during training also noticeably improved performance. GloVe word vector models of four different sizes (50, 100, 200, and 300) were tested, with the 50-dimensional vectors producing the best results (Table IV).
For the spaCy 2.0 named entity recognizer, the same word embedding models were tested; in this case, however, the 200-dimensional embeddings performed best (Table V). Unsurprisingly, both deep learning models outperformed the feature-based Stanford recognizer in recall; the latter, however, demonstrated noticeably higher precision.
It is clear that the development set of automatically generated examples was not an ideal indicator of models' performance on the gold-standard test set. Higher development set scores often led to lower test scores, as seen in the evaluation results for spaCy 2.0 and Char-biLSTM+biLSTM+CRF (Tables V and IV). Analysis of errors on the development set revealed that many were caused by the incompleteness of annotations: named entity recognizers correctly predicted entities that were absent from the annotations (e.g. [ԽՍՀՄ-ի LOC] (USSR's), [Դինամոն ORG] (the_Dinamo), [Պիրենեյան թերակղզու LOC] (Iberian Peninsula's), etc.). Similarly, the recognizers often correctly ignored non-entities that were incorrectly labeled in the data (e.g. [օսմանների PER], [կոնսերվատորիան ORG], etc.).
Generally, the tested models demonstrated relatively high precision in recognizing tokens that start named entities, but failed to do so for descriptor words of organizations and, to a certain degree, locations. The confusion matrix for one of the trained recurrent models illustrates that difference (Table VI). This can be partly attributed to the quality of the generated data: descriptor words are sometimes superfluously labeled (e.g. [Հավայան կղզիների տեղաբնիկները LOC] (the indigenous people of Hawaii)), which is likely caused by the inconsistent style of linking in Armenian Wikipedia (in the article ԱՄՆ մշակույթ (Culture of the United States), the linked text fragment “Հավայան կղզիների տեղաբնիկները” (“the indigenous people of Hawaii”) leads to the article Հավայան կղզիներ (Hawaii)).
VI Conclusion

We release two named-entity annotated datasets for the Armenian language: a silver-standard corpus for training NER models, and a gold-standard corpus for testing. It is worth underlining the importance of the latter corpus, which we intend to serve as a benchmark for future named entity recognition systems designed for the Armenian language. Along with the corpora, we publish GloVe word vector models trained on a collection of Armenian texts.
Additionally, to establish the applicability of Wikipedia-based approaches for the Armenian language, we provide evaluation results for 3 different named entity recognition systems trained and tested on our datasets. The results demonstrate the ability of deep learning approaches to achieve relatively high recall on this task, as well as the power of using character-extracted embeddings alongside conventional word embeddings.
There are several avenues of future work. Since Nothman et al. (2013), more efficient methods of exploiting Wikipedia have been proposed, namely WiNER, which could help increase both the quantity and quality of the training corpus. Another potential area of work is the further enrichment of the benchmark test set with annotation of additional classes such as MISC, or more fine-grained types (e.g. CITY, COUNTRY, REGION etc. instead of LOC).
- David Yarowsky, Grace Ngai, Richard Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT’01, pp. 1–8. 2001.
- Imed Zitouni, Radu Florian. Mention detection crossing the language barrier. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 600–609. 2008.
- Maud Ehrmann, Marco Turchi, Ralf Steinberger. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of Recent Advances in Natural Language Processing. Association for Computational Linguistics, pp. 118–124. 2011.
- Klesti Hoxha, Artur Baxhaku. An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition. Cybernetics and Information Technologies, vol. 18, no. 1. 2018.
- Weber and Pötzl. NERU: Named Entity Recognition for German. Proceedings of GermEval 2014 Named Entity Recognition Shared Task, pp. 157-162. 2014.
- Nothman J., Ringland N., Radford W., Murphy T., Curran J. R. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, vol. 194, pp. 151–175. 2013.
- Sysoev A. A., Andrianov I. A. Named Entity Recognition in Russian: the Power of Wiki-Based Approach. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016". 2016.
- Turdakov D., Astrakhantsev N., Nedumov Y., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Texterra: A Framework for Text Analysis. Proceedings of the Institute for System Programming of RAS, vol. 26, Issue 1, pp. 421–438. 2014.
- Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2002, pp. 155–158. Taipei, Taiwan. 2002.
- Erik F. Tjong Kim Sang, Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2003, vol. 4, pp. 142–147. Association for Computational Linguistics. 2003.
- Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP) 2014, pp. 1532–1543. 2014.
- Marat M. Yavrumyan, Hrant H. Khachatrian, Anna S. Danielyan, Gor D. Arakelyan. ArmTDP: Eastern Armenian Treebank and Dependency Parser. XI International Conference on Armenian Linguistics, Abstracts. Yerevan. 2017.
- Khurshudian V. G., Daniel M. A., Levonian D. V., Plungian V. A., Polyakov A. E., Rubakov S. V. Eastern Armenian National Corpus. "Dialog 2009". 2009.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. Neural Architectures for Named Entity Recognition. Proceedings of NAACL-2016, San Diego, California, USA. 2016.
- Xuezhe Ma, Eduard Hovy. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Proceedings of ACL. 2016.
- Guillaume Genthial. Sequence Tagging with Tensorflow. 2017. https://guillaumegenthial.github.io/
- Jenny Rose Finkel, Trond Grenager, Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370. 2005.
- Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum. Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. 2017. https://arxiv.org/pdf/1702.02098.pdf
- Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, Isabel Trancoso. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. Proceedings of EMNLP. 2015.
- Abbas Ghaddar, Philippe Langlais. WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition. Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 413–422. 2017.
- Toldova S. Y., Starostin A. S., Bocharov V. V., Alexeeva S. V., Bodrova A. A., Chuchunkov A. S., Dzhumaev S. S., Efimenko I. V., Granovsky D. V., Khoroshevsky V. F., Krylova I. V., Nikolaeva M. A., Smurov I. M. FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016". 2016.