To What Extent are Name Variants Used as Named Entities in Turkish Tweets?

Abstract

Social media texts differ from regular texts in various aspects. One of the main differences is the common use of informal name variants in social media instead of the well-formed named entities found in regular texts. These name variants may come in the form of abbreviations, nicknames, contractions, and hypocoristic uses, in addition to names distorted due to capitalization and writing errors. In this paper, we present an analysis of the named entities in a publicly-available tweet dataset in Turkish with respect to their being name variants belonging to different categories. We also provide finer-grained annotations of the named entities as well-formed names and different categories of name variants, and these annotations are made publicly available. The analysis presented and the accompanying annotations will contribute to related research on the treatment of named entities in social media.

Keywords: named entity, named entity recognition, name variant, Turkish, Twitter

Dilek Küçük
Electrical Power Technologies Department
TÜBİTAK Energy Institute
Ankara–Turkey
dilek.kucuk@tubitak.gov.tr

1 Introduction

Automatic extraction and classification of named entities in natural language texts (i.e., named entity recognition (NER)) is a significant topic of natural language processing (NLP), both as a stand-alone research problem and as a subproblem to facilitate solutions of other related NLP problems. NER has been studied for a long time and in different domains, and there are several survey papers on NER including [Marrero et al. 2013].

Conducting NLP research (such as NER) on microblog texts like tweets poses further challenges due to the particular nature of this text genre. Contractions, writing/grammatical errors, and deliberate distortions of words are common in this informal genre, which is produced under character limitations and published without a formal review process. Several studies propose tweet normalization schemes [Han and Baldwin 2011] to alleviate the negative effects of such language use, so that other NLP tasks can then be performed on the normalized microblogs. Yet, particularly regarding Turkish content, a related study on NER on Turkish tweets [Küçük and Steinberger 2014] claims that normalization before the actual NER procedure may not guarantee improved NER performance.

Identification of name variants is an important research issue that can facilitate tasks including named entity linking [Weichselbraun et al. 2019] and NER, among others. Name variants can appear for several reasons, including the use of abbreviations, contracted forms, nicknames, hypocorism, and capitalization/writing errors [Weichselbraun et al. 2019]. The identification and disambiguation of name variants have been addressed in work such as [Driscoll and Yarowsky 2007] and [Weichselbraun et al. 2019], where resource-based and/or algorithmic solutions are proposed.

In this paper, we consider name variants from the perspective of a NER application and analyze an existing named entity-annotated tweet dataset in Turkish described in [Küçük et al. 2014], in order to further annotate the included named entities with respect to a proprietary name variant categorization. The original dataset includes named entity annotations for eight types: PERSON, LOCATION, ORGANIZATION, DATE, TIME, MONEY, PERCENT, and MISC [Küçük et al. 2014]. However, in this study, we target only the first three categories, which amount to a total of 980 annotations in 670 tweets in Turkish. We further annotate these 980 names with respect to a name variant categorization that we propose and present a rough estimate of the extent to which different name variants are used as named entities in Turkish tweets. The resulting annotations of named entities as different name variants are also made publicly available for research purposes. We believe that both the analysis described in the paper and the publicly-shared annotations (i.e., a tweet dataset annotated for name variants) will help improve research on NER, name disambiguation, and name linking on Turkish social media posts.

The rest of the paper is organized as follows: In Section 2, an analysis of the named entities in the publicly-available Turkish tweet dataset with respect to their being name variants or not is presented together with the descriptions of name variant categories. In Section 3, details and samples of the related finer-grained annotations of named entities are described and Section 4 concludes the paper with a summary of main points.

2 An Analysis of Turkish Tweets for Name Variants Included

Although NER is an NLP topic that has been studied for a long time, the target genre of the related studies has recently shifted from well-formed texts such as news articles to microblog texts like tweets [Ritter et al. 2011]. Following this trend (observed mostly on English content), NER research on other languages like Turkish has also started to target tweets [Küçük et al. 2014, Küçük and Steinberger 2014]. A named entity-annotated dataset consisting of Turkish tweets is described in [Küçük et al. 2014] and the results of NER experiments on Turkish tweets are presented in [Küçük and Steinberger 2014]. Interested readers are referred to [Küçük et al. 2017], which presents a survey of named entity recognition on Turkish, including related work on tweets.

In this study, we analyze the basic named entities (of type PERSON, LOCATION, and ORGANIZATION, henceforth, PLOs) in the annotated dataset compiled in [Küçük et al. 2014], with respect to their being well-formed canonical names or name variants. The dataset includes a total of 1,322 named entity annotations; however, only the 980 PLOs among them (457 PERSON, 282 LOCATION, and 241 ORGANIZATION names) are the main focus of this paper. These 980 PLOs were annotated within a total of 670 tweets.

We have extracted these PLO annotations from the dataset and further annotated them as belonging to one of the following eight name variant categories that we propose. We should note that a particular name can belong to several categories and therefore, there may be multiple category labels assigned to it. However, the number of category labels does not exceed two in our case, i.e., each name is annotated with either one or two labels in the resulting dataset.

  • WELL-FORMED: This category comprises those names which are written in their open and canonical form without any distortions, conforming to the capitalization and other writing rules of Turkish. In Turkish, each token of a name is written with its initial letter capitalized. However, names written entirely in uppercase are also considered within this category, as they cannot be considered writing errors.

  • ABBREVIATION: This category represents those names which are provided as abbreviations. This usually applies to named entities of ORGANIZATION type. However, these abbreviations can include writing errors due to capitalization or characters with diacritics, as will be explained below. Hence, those names annotated as ABBREVIATION can also have an additional category label of CAPITALIZATION or DIACRITICS.

  • CAPITALIZATION: This category includes those names distorted due to not conforming to the capitalization rules of Turkish. As pointed out above, initial letters of each of the tokens of a named entity are capitalized in Turkish. Additionally, abbreviations of names are generally all in uppercase. Those names not conforming to these rules are marked with the CAPITALIZATION label, denoting a capitalization issue.

  • DIACRITICS: There are six letters with diacritics in the Turkish alphabet {ç, ğ, ı, ö, ş, ü} which are sometimes replaced with their counterparts without diacritics {c, g, i, o, s, u} in informal texts like microblogs [Küçük and Steinberger 2014]. Very rarely, the opposite (and perhaps unintentional) replacement can also be observed in informal texts, where at least one character without diacritics in a word is replaced with a character having diacritics. Named entities including such writing errors are assigned the category label of DIACRITICS.

  • HASHTAG-LIKE: Another name variant type is the case where the whitespaces in the names are removed, so they appear like hashtags, and sometimes they are actually hashtags. Such phenomena are annotated with the category label of HASHTAG-LIKE.

  • CONTRACTED: This category represents those name variants in which the original name is contracted by leaving out some of its tokens. Since users like to produce and publish instantly on social media, they tend to contract long names, especially organization names, mostly by using only the initial token. Such name variants are annotated as CONTRACTED.

  • HYPOCORISM: Hypocorism or hypocoristic use is defined as the phenomenon of deliberately modifying a name, in the forms of nicknames, diminutives, and terms of endearment, to show familiarity and affection [Newman and Ahmad 1992, Driscoll 2013]. An example hypocoristic use in English is using Bobby instead of the name Bob [Newman and Ahmad 1992]. Such name variants observed in the tweet dataset are marked with the category label of HYPOCORISM.

  • ERROR: This category denotes those name variants which have some forms of writing errors, excluding issues related to capitalization, diacritics, hypocorism, and removing whitespaces to make names appear like hashtags. Hence, names conforming to this category are labelled with ERROR.
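
As a rough illustration of the DIACRITICS category, the following Python sketch folds the six Turkish letters listed above to their plain counterparts and checks whether an observed name differs from its canonical form only in diacritics. The function names, and the availability of a canonical form to compare against, are assumptions made for illustration; this is not the annotation procedure used to produce the dataset.

```python
# Hypothetical helper for illustration only: map Turkish letters with
# diacritics to their plain counterparts (lowercase and uppercase forms).
FOLD_TABLE = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def fold_diacritics(text: str) -> str:
    """Replace Turkish letters with diacritics by their plain counterparts."""
    return text.translate(FOLD_TABLE)


def differs_only_in_diacritics(observed: str, canonical: str) -> bool:
    """Return True if the two strings differ, but become identical once
    diacritics are folded on both sides (covers both replacement directions)."""
    return (
        observed != canonical
        and fold_diacritics(observed) == fold_diacritics(canonical)
    )


if __name__ == "__main__":
    for observed, canonical in [("Kutahya", "Kütahya"), ("Ankara", "Ankara")]:
        print(observed, canonical, differs_only_in_diacritics(observed, canonical))
```

Because both strings are folded before comparison, the check also covers the rare opposite case in which a plain character is replaced by one with diacritics.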

The following section includes examples of the above name variant categories in the Turkish tweet dataset analyzed, in addition to statistical information indicating the share of each category in the overall dataset.

3 Finer-Grained Annotation of Named Entities

We have annotated the PLOs in the tweet dataset (already annotated for named entities as described in [Küçük et al. 2014]) with the name variant category labels of WELL-FORMED, ABBREVIATION, CAPITALIZATION, DIACRITICS, HASHTAG-LIKE, CONTRACTED, HYPOCORISM, and ERROR, as described in the previous section. Although there are 980 PLOs in the dataset, since 44 names have two name variant category labels, the total number of name variant annotations is 1,024.

The percentages of the category labels in the final annotation file are provided as a bar graph in Figure 1. As indicated in the figure, about 60% of all named entities are well-formed and hence about 40% of them are not in their canonical open form or do not conform to the capitalization/writing rules regarding named entities in Turkish.

Figure 1: Statistical Information for Each Named Entity Variant Category in the Turkish Tweet Dataset.

The most common issue is the lack of proper capitalization of names in tweets: 22.56% of the names are annotated with the CAPITALIZATION label. For instance, people write istanbul instead of the correct form İstanbul and ankara instead of Ankara in their tweets.
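
The istanbul/İstanbul example matters for anyone checking capitalization programmatically, since generic case conversion does not follow the Turkish dotted/dotless i rule. The sketch below is a minimal, hypothetical helper (the names are ours, not from the dataset or any library) that capitalizes initial letters the Turkish way and flags names whose tokens do not start with a capital letter.

```python
# Hypothetical helpers for illustration; not part of the described annotation work.
# In Turkish, the uppercase of "i" is "İ" and the uppercase of "ı" is "I".
TURKISH_UPPER = {"i": "İ", "ı": "I"}


def turkish_capitalize(token: str) -> str:
    """Capitalize only the first letter of a token using Turkish casing rules."""
    if not token:
        return token
    first = TURKISH_UPPER.get(token[0], token[0].upper())
    return first + token[1:]


def turkish_title(name: str) -> str:
    """Capitalize the initial letter of every whitespace-separated token."""
    return " ".join(turkish_capitalize(tok) for tok in name.split())


def lacks_initial_capitals(name: str) -> bool:
    """Return True if at least one token does not start with a capital letter."""
    return name != turkish_title(name)


if __name__ == "__main__":
    print("istanbul".capitalize())    # built-in, locale-independent casing
    print(turkish_title("istanbul"))  # Turkish-aware casing
```

Python's built-in `str.capitalize` yields Istanbul with a dotless capital I, which is itself a misspelling in Turkish, so a Turkish-aware mapping of this kind is needed before comparing observed names with canonical ones.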

The number of names having issues with characters with diacritics is 45, and similarly there are 45 abbreviations (of mostly organization names) in the dataset. As examples of names having issues with diacritics, people use Kutahya instead of the correct form Kütahya, and similarly Besiktas instead of Beşiktaş. Abbreviations in the dataset include national corporations like TRT and SGK, and international organizations like UEFA.

Instances of the categories of HASHTAG-LIKE and CONTRACTED are observed in 38 and 35 names, respectively. A sample name variant marked with HASHTAG-LIKE is SabriSarıoğlu where this person name should have been written as Sabri Sarıoğlu. A contracted name instance in the dataset is Diyanet which is an organization name with the correct open form of Diyanet İşleri Başkanlığı.
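
One simple way to recover the open form of a HASHTAG-LIKE name such as SabriSarıoğlu is to split it at lowercase-to-uppercase boundaries. The following sketch illustrates this heuristic; the function name is hypothetical and the heuristic assumes that each token of the name starts with a capital letter, which will not hold for names that also have capitalization issues.

```python
def split_hashtag_like(name: str) -> str:
    """Insert a space wherever a lowercase letter is directly followed by an
    uppercase letter, after dropping a leading '#' if present."""
    name = name.lstrip("#")
    pieces = []
    for index, char in enumerate(name):
        # str.isupper/str.islower are Unicode-aware, so Turkish letters
        # such as 'ı', 'ğ', 'Ş' are handled without special casing.
        if index > 0 and char.isupper() and name[index - 1].islower():
            pieces.append(" ")
        pieces.append(char)
    return "".join(pieces)


if __name__ == "__main__":
    print(split_hashtag_like("SabriSarıoğlu"))
```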

The instances of HYPOCORISM and ERROR are comparatively few: 10 instances of hypocorism and 11 instances of other errors are seen in the dataset. An instance of the former category is Nazlış which is a hypocoristic use of the female person name Nazlı. An instance of the ERROR category is the use of FENEBAHÇE instead of the correct sports club name FENERBAHÇE.
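
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the 22.56% share of CAPITALIZATION is computed over the 1,024 category labels (rather than over the 980 names); this is our assumption, and the derived CAPITALIZATION and WELL-FORMED counts are estimates for a consistency check, not figures reported in the text.

```python
# Counts stated in the text.
TOTAL_NAMES = 980
NAMES_WITH_TWO_LABELS = 44
TOTAL_LABELS = TOTAL_NAMES + NAMES_WITH_TWO_LABELS  # 1,024 labels in total

reported_counts = {
    "DIACRITICS": 45,
    "ABBREVIATION": 45,
    "HASHTAG-LIKE": 38,
    "CONTRACTED": 35,
    "HYPOCORISM": 10,
    "ERROR": 11,
}

# ASSUMPTION: the 22.56% figure is a share of all labels, not of all names.
capitalization_estimate = round(0.2256 * TOTAL_LABELS)

# Whatever is left over would then be the WELL-FORMED labels.
well_formed_estimate = (
    TOTAL_LABELS - capitalization_estimate - sum(reported_counts.values())
)
well_formed_share = well_formed_estimate / TOTAL_LABELS

if __name__ == "__main__":
    print(capitalization_estimate, well_formed_estimate, f"{well_formed_share:.2%}")
```

Under this assumption the remaining share of WELL-FORMED labels comes out just under 60%, which agrees with the "about 60%" stated for Figure 1.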

Overall, this finer-granularity analysis of named entities as name variants in a common Turkish tweet dataset is significant due to the following reasons.

  • The analysis leads to a breakdown of different named entity variants into eight categories. Although about 60% of the names are in their correct and canonical forms, about 40% of them either appear as abbreviations or suffer from a deviation from the standard form due to multiple reasons including violations of the writing rules of the language. Hence, it provides an insight about the extent of the use of different name variants as named entities in Turkish tweets.

  • The use of different name variants is significant for several NLP tasks including NER on social media, name disambiguation and linking. A recent and popular research topic that may benefit from patterns governing name variants is stance detection, where the position of a post owner towards a target is explored, mostly using the content of the post [Mohammad et al. 2016]. A recent study reports that named entities can be used as improving features for the stance detection task [Küçük 2017]. Hence, an analysis of name variants can contribute to the algorithmic/learning-based proposals for these research problems.

The name variant annotations described in this study are made publicly available at https://github.com/dkucuk/Name-Variants-Turkish-Tweets as a text file, for research purposes. Each line in the annotation file is a triplet whose items are separated by semicolons. The first item is the tweet id, the second item is itself a triplet denoting the already-existing named entity boundaries and type, and the final item is a comma-separated list of name variant annotations for the named entity under consideration. Two sample lines from the annotation file are provided below. The first line indicates a person name (between the non-white-space characters 0 and 11 in the tweet text) annotated with the CAPITALIZATION category, as it lacks proper capitalization. The second line denotes an organization name (between the non-white-space characters 0 and 19 in the tweet) which has issues related to both characters with diacritics and proper capitalization.

360731728177922048;0,11,PERSON;CAPITALIZATION
360733236961349636;0,19,ORGANIZATION;DIACRITICS,CAPITALIZATION
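
Lines in this format can be read with a few lines of Python. The following is a minimal sketch based only on the format description and the two sample lines above; the class and field names are hypothetical, and the actual file may contain cases (such as blank lines or headers) that this sketch does not handle.

```python
from typing import List, NamedTuple


class VariantAnnotation(NamedTuple):
    """One line of the annotation file (field names are illustrative)."""
    tweet_id: str
    start: int          # start offset, counted in non-white-space characters
    end: int            # end offset, counted in non-white-space characters
    entity_type: str    # PERSON, LOCATION, or ORGANIZATION
    variants: List[str] # one or two name variant category labels


def parse_annotation_line(line: str) -> VariantAnnotation:
    """Parse 'tweet_id;start,end,TYPE;LABEL[,LABEL]' into a record."""
    tweet_id, entity, labels = line.strip().split(";")
    start, end, entity_type = entity.split(",")
    return VariantAnnotation(
        tweet_id, int(start), int(end), entity_type, labels.split(",")
    )


if __name__ == "__main__":
    sample = "360733236961349636;0,19,ORGANIZATION;DIACRITICS,CAPITALIZATION"
    print(parse_annotation_line(sample))
```

The tweet id is kept as a string so that it can be used directly as a lookup key without any risk of numeric reformatting.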

4 Conclusion

This paper focuses on named entity variants in Turkish tweets and presents the related analysis results on a common named-entity annotated tweet dataset in Turkish. The named entities of type person, location, and organization names are further categorized into eight proprietary name variant classes and the resulting annotations are made publicly available. The results indicate that about 40% of the considered names deviate from their standard canonical forms in these tweets and the categorizations for these cases can be used by researchers to devise solutions for related NLP problems. These problems include named entity recognition, name disambiguation and linking, and more recently, stance detection.

References

  1. Patricia Driscoll and David Yarowsky. 2007. Disambiguation of standardized personal name variants. In Proceedings of IWMMIES, pages 1–7.
  2. Patricia Driscoll. 2013. Computational methods for name normalization using hypocoristic personal name variants. In Multi-source, multilingual information extraction and summarization, pages 73–91.
  3. Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 368–378.
  4. Dilek Küçük. 2017. Joint named entity recognition and stance detection in tweets. arXiv preprint arXiv:1707.09611.
  5. Dilek Küçük and Ralf Steinberger. 2014. Experiments to improve named entity recognition on Turkish tweets. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 71–78.
  6. Dilek Küçük, Guillaume Jacquet, and Ralf Steinberger. 2014. Named entity recognition on Turkish tweets. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), pages 450–454.
  7. Doğan Küçük, Nursal Arıcı, and Dilek Küçük. 2017. Named entity recognition in Turkish: Approaches and issues. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, pages 176–181.
  8. Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and Juan Miguel Gómez-Berbís. 2013. Named entity recognition: fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5):482–489.
  9. Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41.
  10. Paul Newman and Mustapha Ahmad. 1992. Hypocoristic names in Hausa. Anthropological Linguistics, pages 159–172.
  11. Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the conference on empirical methods in natural language processing, pages 1524–1534.
  12. Albert Weichselbraun, Philipp Kuntschik, and Adrian MP Brasoveanu. 2019. Name variants for improving entity discovery and linking. In Proceedings of the 2nd Conference on Language, Data and Knowledge.