pioNER: Datasets and Baselines for Armenian Named Entity Recognition

Abstract

In this work, we tackle the problem of Armenian named entity recognition, providing silver- and gold-standard datasets as well as establishing baseline results on popular models. We present a 163000-token named entity corpus automatically generated and annotated from Wikipedia, and another 53400-token corpus of news sentences with manual annotation of people, organization and location named entities. The corpora were used to train and evaluate several popular named entity recognition models. Alongside the datasets, we release 50-, 100-, 200-, 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and encyclopedia.

Index Terms: machine learning, deep learning, natural language processing, named entity recognition, word embeddings

I Introduction

Named entity recognition is an important task of natural language processing, featuring in many popular text processing toolkits. This area of natural language processing has been actively studied in recent decades, and the advent of deep learning reinvigorated research on more effective and accurate models. However, most existing approaches require large annotated corpora. To the best of our knowledge, no such work has been done for the Armenian language, and in this work we address several problems: the creation of a corpus for training machine learning models, the development of a gold-standard test corpus, and the evaluation of the effectiveness of established approaches for the Armenian language.

Considering the cost of creating a manually annotated named entity corpus, we focused on alternative approaches. The lack of named entity corpora is a common problem for many languages and has therefore attracted the attention of researchers around the globe. Projection-based transfer schemes have been shown to be very effective (e.g. [1], [2], [3]), using a resource-rich language’s corpora to generate annotated data for the low-resource language. In this approach, the annotations of the high-resource language are projected onto the corresponding tokens of parallel texts in the low-resource language. This strategy can be applied to language pairs that have parallel corpora; however, it was not viable for Armenian, as we did not have access to a sufficiently large parallel corpus with a resource-rich language.

Another popular approach is using Wikipedia. Klesti Hoxha and Artur Baxhaku employ gazetteers extracted from Wikipedia to generate an annotated corpus for Albanian[4], and Weber and Pötzl propose a rule-based system for German that leverages the information from Wikipedia[5]. However, the latter relies on external tools such as part-of-speech taggers, making it nonviable for the Armenian language.

Nothman et al. generated a silver-standard corpus for 9 languages by extracting Wikipedia article texts with outgoing links and turning those links into named entity annotations based on the target article’s type[6]. Sysoev and Andrianov used a similar approach for the Russian language[7][8]. Based on its success for a wide range of languages, our choice fell on this model to tackle automated data generation and annotation for the Armenian language.

Aside from the lack of training data, we also address the absence of a benchmark dataset of Armenian texts for named entity recognition. We propose a gold-standard corpus with manual annotation of CoNLL named entity categories: person, location, and organization [9][10], hoping it will be used to evaluate future named entity recognition models.

Furthermore, popular entity recognition models were applied to the mentioned data in order to obtain baseline results for future research in the area. Along with the datasets, we developed GloVe[11] word embeddings to train and evaluate the deep learning models in our experiments.

The contributions of this work are (i) the silver-standard training corpus, (ii) the gold-standard test corpus, (iii) GloVe word embeddings, and (iv) baseline results for 3 different models on the proposed benchmark dataset. All aforementioned resources are available on GitHub¹.

II Automated training corpus generation

We used Sysoev and Andrianov’s modification of the Nothman et al. approach to automatically generate data for training a named entity recognizer. This approach uses links between Wikipedia articles to generate sequences of named-entity annotated tokens.

II-A Dataset extraction

Fig. 1: Steps of automatic dataset extraction from Wikipedia:
  1. Classification of Wikipedia articles into NE types
  2. Labelling common article aliases to increase coverage
  3. Extraction of text fragments with outgoing links
  4. Labelling links according to their target article’s type
  5. Adjustment of labeled entities’ boundaries

The main steps of the dataset extraction system are described in Figure 1.

First, each Wikipedia article is assigned a named entity class (e.g. the article Քիմ Քաշքաշյան (Kim Kashkashian) is classified as PER (person), Ազգերի լիգա (League of Nations) as ORG (organization), Սիրիա (Syria) as LOC, etc.). One of the core differences between our approach and Nothman’s system is that we do not rely on manual classification of articles and do not use inter-language links to project article classifications across languages. Instead, our classification algorithm uses only the subclass of labels of the first instance of value in an article’s Wikidata entry; these labels are language-independent and can therefore be used for any language.

Then, outgoing links in articles are assigned the type of the article they lead to. Sentences are included in the training corpus only if they contain at least one named entity and all contained capitalized words have an outgoing link to an article of known type. Since only the first mention of each entity in a Wikipedia article is linked, this criterion is very restrictive, so additional links are inferred in order to include more sentences. This is accomplished by compiling a list of common aliases for articles corresponding to named entities and then finding text fragments matching those aliases to assign a named entity label. An article’s aliases include its title, the titles of disambiguation pages containing the article, and the texts of links leading to the article (e.g. Լենինգրադ (Leningrad), Պետրոգրադ (Petrograd), Պետերբուրգ (Peterburg) are aliases for Սանկտ Պետերբուրգ (Saint Petersburg)). The list of aliases is compiled for all PER, ORG, and LOC articles.
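To illustrate how the compiled aliases are turned into additional annotations, the following minimal sketch performs a greedy longest-match lookup of aliases over a token sequence. The dictionary alias_to_type and the 5-token alias limit are hypothetical simplifications for illustration, not part of the described system.

```python
# Minimal sketch of alias-based link inference (illustrative, not the exact system).
# `alias_to_type` is a hypothetical dict mapping an alias string, e.g. "Պետրոգրադ",
# to the NE type of the article it refers to, e.g. "LOC".

def infer_entities(tokens, alias_to_type, max_alias_len=5):
    """Greedily match the longest known alias starting at each token position."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for length in range(min(max_alias_len, len(tokens) - i), 0, -1):
            ne_type = alias_to_type.get(" ".join(tokens[i:i + length]))
            if ne_type is not None:
                labels[i] = "B-" + ne_type
                labels[i + 1:i + length] = ["I-" + ne_type] * (length - 1)
                i += length
                break
        else:
            i += 1
    return labels
```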

After that, link boundaries are adjusted by removing the labels for expressions in parentheses, the text after a comma, and in some cases breaking into separate named entities if the linked text contains a comma. For example, [LOC Աբովյան (քաղաք)] (Abovyan (town)) is reworked into [LOC Աբովյան] (քաղաք).
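A rough sketch of this adjustment step is given below; the two regular-expression rules are a simplified approximation of the heuristics described above, shown only for illustration.

```python
import re

# Illustrative boundary adjustment: drop a trailing parenthesized qualifier and
# anything after a comma from the labeled span. These rules approximate the
# heuristics described above; they are not the exact implementation.

def adjust_boundary(linked_text):
    linked_text = re.sub(r"\s*\([^)]*\)\s*$", "", linked_text)  # "Աբովյան (քաղաք)" -> "Աբովյան"
    linked_text = linked_text.split(",")[0].rstrip()            # keep only the text before a comma
    return linked_text
```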


ORG: company, business enterprise, company, juridical person, air carrier, political organization, government organization, secret service, political party, international organization, alliance, armed organization, higher education institution, educational institution, university, educational organization, school, fictional magic school, broadcaster, newspaper, periodical literature, religious organization, football club, sports team, musical ensemble, music organisation, vocal-musical ensemble, sports organization, criminal organization, museum of culture, scientific organisation, non-governmental organization, nonprofit organization, national sports team, legal person, scholarly publication, academic journal, association, band, sports club, institution, medical facility

LOC: state, disputed territory, country, occupied territory, political territorial entity, city, town, village, rural area, rural settlement, urban-type settlement, geographical object, geographic location, geographic region, community, administrative territorial entity, former administrative territorial entity, human settlement, county, province, federated state, district, county-equivalent, municipal formation, raion, nahiyah, mintaqah, muhafazah, realm, principality, historical country, watercourse, lake, sea, still waters, body of water, landmass, minor planet, landform, natural geographic object, mountain range, mountain, protected area, national park, geographic region, geographic location, arena, bridge, airport, stadium, performing arts center, public building, venue, sports venue, church, temple, place of worship, retail building

PER: person, fictional character, fictional humanoid, human who may be fictional, given name, fictional human, magician in fantasy

TABLE I: The mapping of Wikidata subclass of values to named entity types

II-B Using Wikidata to classify Wikipedia

Instead of manually classifying Wikipedia articles, as was done by Nothman et al., we developed a rule-based classifier that uses an article’s Wikidata instance of and subclass of attributes to find the corresponding named entity type.

The classification could be done using solely instance of labels, but these labels are unnecessarily specific for the task, and building a mapping on them would require more time-consuming and meticulous work. Therefore, we classified articles based on the subclass of values of their first instance of attribute. Table I displays the mapping between these values and named entity types. Using higher-level subclass of values was not an option, as they were often too general, making it impossible to derive the correct named entity category.
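The classification rule can be summarized by the sketch below. SUBCLASS_TO_NE abbreviates the full mapping of Table I, and get_claims is a hypothetical helper returning the labels of a Wikidata property for an item (P31 is "instance of", P279 is "subclass of"); the actual implementation may differ.

```python
# Hedged sketch of the rule-based article classifier. SUBCLASS_TO_NE abbreviates
# the mapping in Table I; `get_claims` is a hypothetical helper that returns the
# labels of a Wikidata property for an item (P31 = "instance of", P279 = "subclass of").

SUBCLASS_TO_NE = {
    "political party": "ORG", "university": "ORG", "sports team": "ORG",
    "city": "LOC", "country": "LOC", "body of water": "LOC",
    "person": "PER", "fictional character": "PER",
}

def classify_article(wikidata_item, get_claims):
    instance_of = get_claims(wikidata_item, "P31")
    if not instance_of:
        return None                                 # article is not a named entity
    first = instance_of[0]                          # only the first "instance of" value is used
    for parent in get_claims(first, "P279"):        # its "subclass of" labels
        ne_type = SUBCLASS_TO_NE.get(parent)
        if ne_type is not None:
            return ne_type
    return None
```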

II-C Generated data

Using the algorithm described above, we generated 7455 annotated sentences with 163247 tokens based on the 20 February 2018 dump of the Armenian Wikipedia.

The generated data is still significantly smaller than the manually annotated corpora from CoNLL 2002 and 2003. For comparison, the train set of the English CoNLL 2003 corpus contains 203621 tokens and the German one 206931 tokens, while the Spanish and Dutch corpora from CoNLL 2002 contain 273037 and 218737 lines respectively. The smaller size of our generated data can be attributed to the strict selection of candidate sentences as well as to the relatively small size of the Armenian Wikipedia.

The accuracy of annotation in the generated corpus heavily relies on the quality of links in Wikipedia articles. During generation, we assumed that the first mention of every named entity has an outgoing link to its article; however, this was not always the case in the actual source data, and as a result the train set contained sentences where not all named entities were labeled. Annotation inaccuracies also stemmed from wrongly assigned link boundaries (for example, in the Wikipedia article Արթուր Ուելսլի Վելինգթոն (Arthur Wellesley) there is a link to the Napoleon article with the text “է Նապոլեոնը” (“Napoleon is”), when it should be “Նապոլեոնը” (“Napoleon”)). Another common kind of annotation error occurred when a named entity appeared inside a link not targeting a LOC, ORG, or PER article (e.g. “ԱՄՆ նախագահական ընտրություններում” (“USA presidential elections”) is linked to the article ԱՄՆ նախագահական ընտրություններ 2016 (United States presidential election, 2016), and as a result [LOC ԱՄՆ] (USA) is lost).

III Test dataset

In order to evaluate the models trained on generated data, we manually annotated a named entity dataset comprising 53453 tokens and 2566 sentences selected from over 250 news texts from ilur.am². This dataset is comparable in size to the test sets of other languages (Table II). The included sentences are from political, sports, local and world news (Figures 2, 3), covering the period between August 2012 and July 2018. The dataset provides annotations for 3 popular named entity classes: people (PER), organizations (ORG), and locations (LOC), and is released in CoNLL03 format with the IOB tagging scheme. Tokens and sentences were segmented according to the UD standards for the Armenian language [12].
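For illustration, a constructed two-column fragment in the IOB scheme is shown below (token followed by its tag). It is not an excerpt from the released corpus, and the released files may contain additional columns.

```
Քիմ         B-PER
Քաշքաշյանը  I-PER
ելույթ      O
ունեցավ     O
Երևանում    B-LOC
։           O
```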

Fig. 2: Topics distribution in the gold-standard corpus

Fig. 3: Distribution of examples by location in the gold-standard corpus

Test set                   Tokens  LOC   ORG   PER
Armenian                   53453   1306  1337  1274
English CoNLL03            46435   1668  1661  1617
German CoNLL03             51943   1035  773   1195
Spanish CoNLL02            51533   1084  1400  735
Russian factRuEval-2016    59382   1239  1595  1353

TABLE II: Comparison of Armenian, English, German, Spanish and Russian test sets: token and named entity counts

During annotation, we generally relied on the categories and guidelines assembled by BBN Technologies for the TREC 2002 question answering track³. Only named entities corresponding to BBN’s person name category were tagged as PER. These include proper names of people (including fictional people), first and last names, family names, and unique nicknames. Similarly, entities from BBN’s organization name categories (company names, government agencies, educational and academic institutions, sports clubs, musical ensembles and other groups, hospitals, museums, and newspaper names) were marked as ORG. However, unlike BBN, we did not mark adjectival forms of organization names as named entities. BBN’s gpe name, facility name, and location name categories were combined and annotated as LOC.

We ignored entities of other categories (e.g. works of art, law, or events), including those cases where an ORG, LOC or PER entity was inside an entity of extraneous type (e.g. ՀՀ (RA) in ՀՀ Քրեական Օրենսգիրք (RA Criminal Code) was not annotated as LOC).

Quotation marks around a named entity were not annotated unless those quotations were a part of that entity’s full official name (e.g. «Նաիրիտ գործարան» ՓԲԸ (“Nairit Plant” CJSC)).

Depending on context, metonyms such as Կրեմլ (Kremlin), Բաղրամյան 26 (Baghramyan 26) were annotated as ORG when referring to respective government agencies. Likewise, country or city names were also tagged as ORG when referring to sports teams representing them.

IV Word embeddings

Apart from the datasets, we also developed word embeddings for the Armenian language, which we used in our experiments to train and evaluate named entity recognition algorithms. Considering their ability to capture semantic regularities, we used GloVe to train word embeddings. We assembled a dataset of Armenian texts containing 79 million tokens from the articles of the Armenian Wikipedia, the Armenian Soviet Encyclopedia, a subcorpus of the Eastern Armenian National Corpus [13], and over a dozen Armenian news websites and blogs. The included texts covered topics such as economics, politics, weather forecasts, IT, law, and society, coming from non-fiction as well as fiction genres.

Similar to the original embeddings published for English, we release 50-, 100-, 200- and 300-dimensional word vectors for Armenian with a vocabulary size of 400000. Before training, all words in the dataset were lowercased. For the final models we used a window size of 15 and 20 training epochs.
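A minimal sketch of loading the released vectors is given below; the file name is hypothetical, and the standard GloVe text format (one word per line followed by its vector components) is assumed.

```python
import numpy as np

# Minimal sketch of loading the released Armenian GloVe vectors, assuming the
# standard GloVe text format (a word followed by its components on each line).

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# embeddings = load_glove("glove.hy.50d.txt")  # hypothetical file name
# The corpus was lowercased before training, so look up lowercased tokens:
# vector = embeddings.get("երևան")
```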

V Experiments

In this section we describe a number of experiments aimed at comparing the performance of popular named entity recognition algorithms on our data. We trained and evaluated Stanford NER⁴, spaCy 2.0⁵, and a recurrent model similar to [14], [15] that uses bidirectional LSTM cells for character-based feature extraction and a CRF, as described in Guillaume Genthial’s Sequence Tagging with Tensorflow blog post [16].

V-A Models

Stanford NER is a conditional random fields (CRF) classifier based on lexical and contextual features such as the current word, character-level n-grams of up to length 6 at the beginning and end of the word, previous and next words, word shape, and sequence features [17].

spaCy 2.0 uses a CNN-based transition system for named entity recognition. For each token, a Bloom embedding is calculated based on its lowercase form, prefix, suffix and shape; then, using residual CNNs, a contextual representation of that token is extracted that potentially draws information from up to 4 tokens on each side [18]. Each update of the transition system’s configuration is a classification task that uses the contextual representation of the top token on the stack, the preceding and succeeding tokens, the first two tokens of the buffer, and their leftmost, second-leftmost, rightmost, and second-rightmost children. The valid transition with the highest score is applied to the system. This approach reportedly performs within 1% of the current state-of-the-art for English⁶. In our experiments, we tried out 50-, 100-, 200- and 300-dimensional pre-trained GloVe embeddings. Due to time constraints, we did not tune the rest of the hyperparameters and used their default values.
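The sketch below shows how such a model can be trained with the spaCy 2.0 API. The single training example, the number of iterations, and the dropout rate are illustrative only, not the settings used in our experiments.

```python
import random
import spacy

# Hedged sketch of training spaCy 2.0's transition-based NER from scratch.
# The training example, iteration count, and dropout are illustrative only.

TRAIN_DATA = [
    ("Սերժ Սարգսյանը մեկնել է Մոսկվա",
     {"entities": [(0, 14, "PER"), (24, 30, "LOC")]}),
]

nlp = spacy.blank("xx")                      # blank multi-language pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("PER", "ORG", "LOC"):
    ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
```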

The main model that we focused on was the recurrent model with a CRF top layer; the above-mentioned methods served mostly as baselines. The distinctive feature of this approach is the way contextual word embeddings are formed. For each token separately, to capture its word shape features, a character-based representation is extracted using a bidirectional LSTM [19]. This representation is concatenated with a distributional word vector such as GloVe, forming an intermediate word embedding. Using another bidirectional LSTM cell over these intermediate word embeddings, the contextual representation of tokens is obtained (Figure 4). Finally, a CRF layer labels the sequence of these contextual representations. In our experiments, we used Guillaume Genthial’s implementation⁷ of the algorithm. We set the size of the character-based biLSTM to 100 and the size of the second biLSTM network to 300.

Fig. 4: The neural architecture for extracting contextual representations of tokens: character embeddings of the input token are passed through a biLSTM, the result is concatenated with the word embedding, and a second biLSTM produces the contextual representation.
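A shape-level TensorFlow 1.x sketch of this architecture is given below. It loosely follows the referenced implementation; the character-level biLSTM size (100) and the word-level biLSTM size (300) match our settings, while the character vocabulary, the embedding dimensions, and other details are illustrative assumptions.

```python
import tensorflow as tf

# Shape-level sketch of the Char-biLSTM + biLSTM + CRF tagger (TensorFlow 1.x).
# Only the biLSTM sizes (100 and 300) match the paper; other values are illustrative.

word_ids  = tf.placeholder(tf.int32, [None, None])        # (batch, max_sent_len)
char_ids  = tf.placeholder(tf.int32, [None, None, None])  # (batch, max_sent_len, max_word_len)
word_lens = tf.placeholder(tf.int32, [None, None])        # length of each word in characters
sent_lens = tf.placeholder(tf.int32, [None])               # length of each sentence in tokens
labels    = tf.placeholder(tf.int32, [None, None])         # gold IOB tag ids

# 1. Character-level representation: biLSTM over the characters of each word.
char_emb = tf.get_variable("char_emb", [200, 50])           # char vocab and dim are illustrative
chars = tf.nn.embedding_lookup(char_emb, char_ids)
s = tf.shape(chars)
chars = tf.reshape(chars, [s[0] * s[1], s[2], 50])
cell_fw = tf.nn.rnn_cell.LSTMCell(100)
cell_bw = tf.nn.rnn_cell.LSTMCell(100)
_, ((_, fw), (_, bw)) = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, chars,
    sequence_length=tf.reshape(word_lens, [-1]), dtype=tf.float32, scope="chars")
char_repr = tf.reshape(tf.concat([fw, bw], axis=-1), [s[0], s[1], 200])

# 2. Concatenate with pretrained GloVe word vectors (placeholder variable here).
glove = tf.get_variable("glove", [400000, 50])               # would be initialized from GloVe
word_repr = tf.concat([tf.nn.embedding_lookup(glove, word_ids), char_repr], axis=-1)

# 3. Contextual representation: word-level biLSTM of size 300.
cfw = tf.nn.rnn_cell.LSTMCell(300)
cbw = tf.nn.rnn_cell.LSTMCell(300)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cfw, cbw, word_repr, sequence_length=sent_lens, dtype=tf.float32, scope="words")
context = tf.concat([out_fw, out_bw], axis=-1)

# 4. Project to tag scores and score tag sequences with a linear-chain CRF.
logits = tf.layers.dense(context, 7)                         # 7 IOB tags
log_lik, trans = tf.contrib.crf.crf_log_likelihood(logits, labels, sent_lens)
loss = tf.reduce_mean(-log_lik)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```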

V-B Evaluation

Experiments were carried out using IOB tagging scheme, with a total of 7 class tags: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG.

We randomly selected 80% of generated annotated sentences for training and used the other 20% as a development set. The models with the best F1 score on the development set were tested on the manually annotated gold dataset.
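For reference, a hedged sketch of entity-level precision, recall, and F1 over IOB sequences is shown below; this is the standard CoNLL-style scoring, and the exact scorer used for the reported numbers may differ.

```python
# Hedged sketch of entity-level precision/recall/F1 over IOB tag sequences
# (standard CoNLL-style scoring; the scorer behind the reported numbers may differ).

def spans(tags):
    """Extract (start, end, type) entity spans from a well-formed IOB sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside:      # the current entity ends here
            out.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return out

def evaluate(gold_sents, pred_sents):
    gold = {(i,) + s for i, t in enumerate(gold_sents) for s in spans(t)}
    pred = {(i,) + s for i, t in enumerate(pred_sents) for s in spans(t)}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```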


Algorithm                  dev                         test
                           Precision  Recall  F1       Precision  Recall  F1
Stanford NER               76.86      70.62   73.61    78.46      46.52   58.41
spaCy 2.0                  68.19      71.86   69.98    64.83      55.77   59.96
Char-biLSTM+biLSTM+CRF     77.21      74.81   75.99    73.27      54.14   62.23

TABLE III: Evaluation results for named entity recognition algorithms

Word embeddings      train embeddings=False      train embeddings=True
                     dev F1     test F1          dev F1     test F1
GloVe (dim=50)       74.18      56.46            75.99      62.23
GloVe (dim=100)      73.94      58.52            74.83      61.54
GloVe (dim=200)      75.00      58.37            74.97      59.78
GloVe (dim=300)      74.21      59.66            74.75      59.92

TABLE IV: Evaluation results for Char-biLSTM+biLSTM+CRF

Word embeddings      dev                         test
                     Precision  Recall  F1       Precision  Recall  F1
GloVe (dim=50)       69.31      71.26   70.27    66.52      51.72   58.20
GloVe (dim=100)      70.12      72.91   71.49    66.34      53.35   59.14
GloVe (dim=200)      68.19      71.86   69.98    64.83      55.77   59.96
GloVe (dim=300)      70.08      71.80   70.93    66.61      52.94   59.00

TABLE V: Evaluation results for spaCy 2.0 NER

VI Discussion

Table III shows the average scores of the evaluated models. The highest F1 score was achieved by the recurrent model using a batch size of 8 and the Adam optimizer with an initial learning rate of 0.001. Updating word embeddings during training also noticeably improved the performance. GloVe word vector models of four different sizes (50, 100, 200, and 300) were tested, with vectors of size 50 producing the best results (Table IV).

For the spaCy 2.0 named entity recognizer, the same word embedding models were tested; however, in this case the 200-dimensional embeddings performed best (Table V). Unsurprisingly, both deep learning models outperformed the feature-based Stanford recognizer in recall; the latter, however, demonstrated noticeably higher precision.

It is clear that the development set of automatically generated examples was not an ideal indicator of the models’ performance on the gold-standard test set. Higher development set scores often corresponded to lower test scores, as seen in the evaluation results for spaCy 2.0 and Char-biLSTM+biLSTM+CRF (Tables V and IV). Analysis of errors on the development set revealed that many were caused by the incompleteness of annotations: the named entity recognizers correctly predicted entities that were absent from the annotations (e.g. [ԽՍՀՄ-ի LOC] (USSR’s), [Դինամոն ORG] (the_Dinamo), [Պիրենեյան թերակղզու LOC] (Iberian Peninsula’s), etc.). Similarly, the recognizers often correctly ignored non-entities that were incorrectly labeled in the data (e.g. [օսմանների PER], [կոնսերվատորիան ORG], etc.).

Generally, the tested models demonstrated relatively high precision in recognizing tokens that start named entities, but failed to do so with descriptor words for organizations and, to a certain degree, locations. The confusion matrix for one of the trained recurrent models illustrates that difference (Table VI). This can be partly attributed to the quality of the generated data: descriptor words are sometimes superfluously labeled (e.g. [Հավայան կղզիների տեղաբնիկները LOC] (the indigenous people of Hawaii)), which is likely caused by the inconsistent style of linking in the Armenian Wikipedia (in the article ԱՄՆ մշակույթ (Culture of the United States), the linked text fragment “Հավայան կղզիների տեղաբնիկները” (“the indigenous people of Hawaii”) leads to the article Հավայան կղզիներ (Hawaii)).

Actual          Predicted
                O      B-PER  B-ORG  B-LOC  I-ORG  I-PER  I-LOC
O               26707  100    57     249    150    78     129
B-PER           107    712    6      32     2      4      0
B-ORG           93     6      259    58     8      0      0
B-LOC           226    25     32     1535   5      3      20
I-ORG           67     1      5      3      289    3      19
I-PER           46     5      0      1      6      660    8
I-LOC           145    0      1      13     45     11     597
Precision (%)   97.5   83.86  71.94  81.17  57.23  86.95  77.23

TABLE VI: Confusion matrix on the development set

VII Conclusion

We release two named-entity annotated datasets for the Armenian language: a silver-standard corpus for training NER models, and a gold-standard corpus for testing. It is worth underlining the importance of the latter corpus, as we intend it to serve as a benchmark for future named entity recognition systems designed for the Armenian language. Along with the corpora, we publish GloVe word vector models trained on a collection of Armenian texts.

Additionally, to establish the applicability of Wikipedia-based approaches to the Armenian language, we provide evaluation results for 3 different named entity recognition systems trained and tested on our datasets. The results confirm the ability of deep learning approaches to achieve relatively high recall on this task, as well as the benefit of using character-extracted embeddings alongside conventional word embeddings.

There are several avenues of future work. Since Nothman et al. 2013, more efficient methods of exploiting Wikipedia have been proposed, namely WiNER [20], which could help increase both the quantity and quality of the training corpus. Another potential area of work is the further enrichment of the benchmark test set with additional annotation of other classes such as MISC or more fine-grained types (e.g. CITY, COUNTRY, REGION etc instead of LOC).

Footnotes

  1. https://github.com/ispras-texterra/pioner
  2. http://ilur.am/news/newsline.html
  3. https://catalog.ldc.upenn.edu/docs/LDC2005T33/BBN-Types-Subtypes.html
  4. https://nlp.stanford.edu/software/CRF-NER.shtml
  5. https://spacy.io/
  6. https://spacy.io/usage/v2#features-models
  7. https://github.com/guillaumegenthial/sequence_tagging

References

  1. David Yarowsky, Grace Ngai, Richard Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT’01, pp. 1–8. 2001.
  2. Imed Zitouni, Radu Florian. Mention detection crossing the language barrier. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 600–609. 2008.
  3. Maud Ehrmann, Marco Turchi, Ralf Steinberger. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of Recent Advances in Natural Language Processing. Association for Computational Linguistics, pp. 118–124. 2011.
  4. Klesti Hoxha, Artur Baxhaku. An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition. Cybernetics and Information Technologies, vol. 18, no. 1. 2018.
  5. Weber and Pötzl. NERU: Named Entity Recognition for German. Proceedings of GermEval 2014 Named Entity Recognition Shared Task, pp. 157-162. 2014.
  6. Nothman J., Ringland N., Radford W., Murphy T., Curran J. R. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, vol. 194, pp. 151–175. 2013.
  7. Sysoev A. A., Andrianov I. A. Named Entity Recognition in Russian: the Power of Wiki-Based Approach. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016". 2016.
  8. Turdakov D., Astrakhantsev N., Nedumov Y., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Texterra: A Framework for Text Analysis. Proceedings of the Institute for System Programming of RAS, vol. 26, Issue 1, pp. 421–438. 2014.
  9. Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2002, pp. 155–158. Taipei, Taiwan. 2002.
  10. Erik F. Tjong Kim Sang, Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2003, vol. 4, pp. 142–147. Association for Computational Linguistics. 2003.
  11. Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP) 2014, pp. 1532–1543. 2014.
  12. Marat M. Yavrumyan, Hrant H. Khachatrian, Anna S. Danielyan, Gor D. Arakelyan. ArmTDP: Eastern Armenian Treebank and Dependency Parser. XI International Conference on Armenian Linguistics, Abstracts. Yerevan. 2017.
  13. Khurshudian V.G., Daniel M.A., Levonian D.V., Plungian V.A., Polyakov A.E., Rubakov S.V. Eastern Armenian National Corpus. "Dialog 2009". 2009.
  14. Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. Neural Architectures for Named Entity Recognition. Proceedings of NAACL-2016, San Diego, California, USA. 2016.
  15. Xuezhe Ma, Eduard Hovy. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. Proceedings of ACL. 2016.
  16. Guillaume Genthial. Sequence Tagging with Tensorflow. 2017. https://guillaumegenthial.github.io/
  17. Jenny Rose Finkel, Trond Grenager, Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363–370. 2005.
  18. Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum. Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. 2017. https://arxiv.org/pdf/1702.02098.pdf
  19. Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, Isabel Trancoso. Finding function in form: Compositional character models for open vocabulary word representation. Proceedings of EMNLP. 2015.
  20. Abbas Ghaddar, Philippe Langlais. WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition. Proceedings of the The 8th International Joint Conference on Natural Language Processing, pp. 413–422. 2017.
  21. Toldova S. Y. Starostin A. S., Bocharov V. V., Alexeeva S. V., Bodrova A. A., Chuchunkov A. S., Dzhumaev S. S., Efimenko I. V., Granovsky D. V., Khoroshevsky V. F., Krylova I. V., Nikolaeva M. A., Smurov I. M. FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2016". 2016.