Annotating Cognates and Etymological Origin in Turkic Languages
Turkic languages exhibit extensive and diverse etymological relationships among lexical items. These relationships make the Turkic languages promising for exploring automated translation lexicon induction by leveraging cognate and other etymological information. However, due to the extent and diversity of the types of relationships between words, it is not clear how to annotate such information. In this paper, we present a methodology for annotating cognates and etymological origin in Turkic languages. Our method strives to balance the amount of research effort the annotator expends with the utility of the annotations for supporting research on improving automated translation lexicon induction.
Benjamin S. Mericli, Michael Bloodgood
University of Maryland, College Park, MD firstname.lastname@example.org
1 Introduction
Automated translation lexicon induction has been investigated in the literature and shown to be feasible for various language families and subgroups, such as the Romance languages and the Slavic languages [Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002]. Although some studies have used Swadesh lists of words to identify Turkic language groups and loanword candidates [van der Ark et al., 2007], we are not aware of any work yet on automated translation lexicon induction for the Turkic languages.
The Turkic languages are nonetheless well suited to exploring such technology, since they exhibit many diverse lexical relationships both within the family and, through loanwords, with languages outside it. For the Turkic languages, it is prudent to leverage both cognate information and other etymological information when automating translation lexicon induction. However, we are not aware of any corpora for the Turkic languages that have been annotated for this information in a way suitable for supporting automatic translation lexicon induction. Moreover, performing the annotation is not straightforward because of the range of relationships that exist. In this paper, we lay out a methodology for performing this annotation that is intended to balance the amount of effort expended by the annotators with the utility of the annotations for supporting computational linguistics research.
2 Main Annotation System
We obtained the dictionary of the Turkic languages [Öztopçu et al., 1996]. One section of this dictionary contains 1996 English glosses and, for each English gloss, a corresponding translation in the following eight Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Tatar, Turkish, Turkmen, Uyghur, and Uzbek. Table 1 shows an example for the English gloss ‘alive.’ When a language has an official Latin script, that script is used. Otherwise, the dictionary’s transliteration is shown in parentheses. Our annotation system assigns each Turkic word a two-character code. The first character is a number indicating which words are cognate with each other, and the second character indicates etymological information. Subsection 2.1 discusses how to define and annotate cognates and subsection 2.2 discusses how to define and annotate etymological information.
2.1 Cognates
According to the Oxford English Dictionary Online (http://www.oed.com/view/Entry/35870?redirectedFrom=cognate, accessed on February 2, 2012), ‘cognate’ is defined as: “…Of words: Coming naturally from the same root, or representing the same original word, with differences due to subsequent separate phonetic development; thus, English five, Latin quinque, Greek πέντε, are cognate words, representing a primitive *penke.” As this definition shows, shared genetic origin is key to the notion of cognateness. A word is only considered cognate with another if both words proceed from the same ancestor. Nonetheless, in line with the conventions of previous research in computational linguistics, we set a broader definition. We use the word ‘cognate’ to denote, as in [Kondrak, 2001]: “…words in different languages that are similar in form and meaning, without making a distinction between borrowed and genetically related words; for example, English ‘sprint’ and the Japanese borrowing ‘supurinto’ are considered cognate, even though these two languages are unrelated.” These broader criteria are motivated by the ways scientists develop and use cognate identification algorithms in natural language processing (NLP) systems. For cross-lingual applications, the advantage of such technology is the ability to identify words for which similarity in meaning can be accurately inferred from similarity in form; it does not matter whether the similarity in form arises from a strict genetic relationship or from later borrowing.
However, not every pair of apparently similar words will be annotated as cognate. For them to be considered cognates, the differences in form between them must meet a threshold of consistency within the data. We will explain the definitions and rules for the annotators to follow in order to establish such a threshold.
First, we elaborate on how our notion of cognate differs from that of strict genetic relation. At a high level, there are two cases to consider: (A) where the words involved are native Turkic words, and (B) where the words involved are shared loanwords from non-Turkic languages. Within case A, there are two subcases to consider: (A1) genetic cognates; and (A2) intra-family loans. Table 2 shows an example of case A1. This example shows the English gloss ‘one’ for all eight Turkic languages, descended from the same postulated form, *bir, in Proto-Turkic [Róna-Tas, 2006]. Case A1 is the strict definition of ‘cognate,’ and these are to be annotated as cognate.
Case A2 is for intra-family loans, i.e., a word of ultimately Turkic origin borrowed by one Turkic language from another Turkic language. These cases, contrary to the strict definition, are to be marked as cognate in our system. An example is the modern Turkish neologism almaş ‘alternation, permutation’, incorporated from the Kyrgyz (almaş) ‘change’ [Türk Dil Kurumu, 1942]. While rare, it is used today in Turkish scholarly literature to describe concepts in areas such as mathematics and botany. Processing genetic cognates (case A1) and intra-family loans (case A2) differently would have little impact on the success of a cross-dictionary lookup system. In fact, accounting for the difference might limit the efficacy of such a system. Also, the time depth of intra-Turkic borrowings may be centuries or mere decades. The more distant the borrowing, the more difficult it will be for annotators to distinguish between cases A1 and A2. Hence, instances of case A2 are to be annotated as cognate in our system. (For similar reasons, false cognates may be annotated as cognate if the annotator does not have readily available knowledge indicating that they are false cognates. Although this is a potential limitation of our system, it is not clear how to distinguish false cognates from true cognates without significant additional annotation expense.)
Case B is for situations of shared loanwords, where the source of the words is ultimately non-Turkic. There are three subcases: (B1) loanwords borrowed from the same non-Turkic language; (B2) loanwords borrowed from different non-Turkic languages, but of the same ultimate origin; and (B3) loanwords of non-Turkic origin borrowed via another Turkic language.
Table 3 shows an example of case B1, the word ‘book,’ borrowed from Arabic in all eight Turkic languages. Table 4 shows an example of case B2, the word ‘ballet,’ borrowed from Russian in all cases except Turkish, where it was borrowed directly from French. Table 5 shows an example of case B3: the word ‘benefit’ in Kyrgyz was most likely borrowed through Uzbek or Chaghatay [Kirchner, 2006], but the Uzbek word was borrowed from Persian, and ultimately from Arabic. It is difficult and time-consuming for annotators to make these fine-grained distinctions. Again, for computational processing, such distinctions are not expected to be helpful. Hence, all of cases B1, B2, and B3 are to be annotated as cognate in our system.
Recall that all our annotations are two-character codes; the first character is a number from one to eight indicating which words are cognate with each other. Table 6 shows the first character of the annotations for the example from Table 1. The words marked 1 are cognate with each other and the words marked 2 are cognate with each other.
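As a rough illustration of how the first character partitions an entry, the following sketch groups languages by their cognate number. The function name `cognate_sets` and the example assignment are hypothetical; the numbers below are invented for illustration and are not the values from Table 6.

```python
from collections import defaultdict


def cognate_sets(first_chars):
    """Group languages by the cognate number (1-8) assigned to their word.

    `first_chars` maps a language name to the first character of its
    annotation, given as an integer from 1 to 8.
    """
    sets = defaultdict(list)
    for language, number in first_chars.items():
        if number not in range(1, 9):
            raise ValueError(f"cognate number must be 1-8, got {number!r}")
        sets[number].append(language)
    return dict(sets)


# Hypothetical assignment for one English gloss (NOT taken from the dictionary).
example = {
    "Azerbaijani": 2, "Kazakh": 1, "Kyrgyz": 1, "Tatar": 1,
    "Turkish": 2, "Turkmen": 2, "Uyghur": 1, "Uzbek": 1,
}

for number, languages in sorted(cognate_sets(example).items()):
    print(number, languages)
```

Words that share a number form one cognate set, so an entry in which all eight words are cognate would yield a single set numbered 1.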
2.2 Etymology
The second character in a word’s annotation indicates a conjecture about etymological origin, e.g., T for Turkic. The decision to annotate word origin is motivated by its value for facilitating the development of technology for cross-language lookup of unknown forms. We therefore take a practical approach, balancing the value of the annotation for this purpose with the amount of effort required to perform the annotation. We have created the following code for annotating etymology:
T: Turkic origin. This includes compound forms and affixed forms whose constituents are all Turkic. For example, the Turkmen for ‘manager’, ýolbaşçy, is marked T because its compound base, ýol with baş, and affix -çy are all Turkic in origin.
A: Arabic origin, including words borrowed indirectly through another language such as Persian. For example, the word for ‘book’ is marked A in all eight Turkic languages. Because variations on the Arabic form /kita:b/ exist in every Turkic language, in Persian, and in other languages of the Islamic world, it is difficult to tease out the word’s trajectory into a language such as Kyrgyz. The burden of researching these fine distinctions is not placed on the annotator, as explained below.
P: Persian origin, not including Arabic words possibly borrowed through Persian. An example is the word for ‘color’ in many Turkic languages, from the Persian /ræng/.
R: Borrowed from Russian, including words that are ultimately of French origin.
F: French origin, not including ultimately French words borrowed from Russian. Direct French loans occur almost exclusively in Turkish. An example is the word for ‘station’ in Turkish, istasyon.
E: English origin. For example, the word for ‘basketball’ in every language.
I: Italian origin. Usually of importance only to specific domains in Turkish.
G: Greek origin. For example, the word in Azerbaijani, Turkish, Turkmen, Uyghur, and Uzbek for ‘box’ comes from Greek.
C: Chinese origin, usually Mandarin and usually of importance only to Uyghur. An example is the word for ‘mushroom’ in Uyghur, (mogu).
Q: Unknown or inconclusive origin.
The careful reader will have noticed that there is an inconsistency in that words of ultimately Arabic origin borrowed through Persian are marked as A, but words of ultimately French origin borrowed through Russian are marked as R. There are two reasons for this. The first is annotator efficiency. Making the judgment that a word is ultimately of Arabic origin is much easier than having to figure out whether it was borrowed from Arabic or indirectly from Persian. For the Russian/French situation, the distinction is much easier to make. To begin with, the Russian loanwords occur almost exclusively in former USSR languages and the French loanwords occur almost exclusively in Turkish. Also, the orthography often gives clear cues for making this distinction, as Russian loanwords consistently retain characteristically Russian letters.
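The two-character code can be decoded with a small lookup table. The sketch below is a minimal illustration, not part of the annotation tooling: the name `parse_annotation` is hypothetical, and any single-letter label that the text does not state explicitly (the text names T, A, R, and Q) is assumed here to be the first letter of the origin.

```python
# Etymology codes from subsection 2.2. Letters not named explicitly in the
# text are an assumption of this sketch (first letter of the origin).
ETYMOLOGY_CODES = {
    "T": "Turkic",
    "A": "Arabic (including loans via another language such as Persian)",
    "P": "Persian (excluding Arabic words possibly borrowed via Persian)",
    "R": "Russian (including words ultimately of French origin)",
    "F": "French (excluding French words borrowed via Russian)",
    "E": "English",
    "I": "Italian",
    "G": "Greek",
    "C": "Chinese",
    "Q": "unknown or inconclusive",
}


def parse_annotation(code):
    """Split a two-character annotation into (cognate number, origin)."""
    if len(code) != 2:
        raise ValueError(f"annotation must have two characters: {code!r}")
    number, letter = code[0], code[1]
    if number not in "12345678":
        raise ValueError(f"cognate number must be 1-8: {number!r}")
    if letter not in ETYMOLOGY_CODES:
        raise ValueError(f"unknown etymology code: {letter!r}")
    return int(number), ETYMOLOGY_CODES[letter]


# 'book' is cognate across all eight languages and of Arabic origin,
# so each of its eight words would carry the annotation "1A".
print(parse_annotation("1A"))
```

The multi-language exception codes of subsection 2.2.1 are deliberately left out of this table; adding them would only require further dictionary entries.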
2.2.1 Multi-Language Exceptions
We also define other codes that categorize certain complex words that do not fall into any of the categories described in subsection 2.2. Other etymological annotation studies, such as the Loanword Typology project and its World Loanword Database [Haspelmath and Tadmor, 2009], have instructed linguists to pass over such complex words and optionally flag them as “contains a borrowed base,” etc. Our annotation system requires that these words, which are very common in Turkic languages, be annotated according to more fine-grained categories.
The following are our multi-language exception codes:
X: Compound words where the constituents are from different origins. For example, the Tatar word for ‘truck’, (yök mashinası), is to be marked X since it contains Russian-origin (mashina), ‘machine, vehicle’ in compound with Tatar (yök), ‘baggage, cargo.’ In contrast, the Turkish compound word for ‘thunder’, gök gürlemesi, will be marked T because all of its constituents are Turkic.
V: A verb formed by combining a non-Turkic base with a Turkic auxiliary verb or denominal affix. For example, the verb ‘to repeat’ in Azerbaijani, Tatar, and Turkish is marked V because it consists of a noun borrowed from the Arabic /takra:r/ plus a Turkic auxiliary verb et- or it-.
N: A nominal consisting of a non-Turkic base bearing one or more Turkic affixes, in cases where removing the affixes results in a form that can plausibly be found elsewhere in the data or in a loan language dictionary. For example, the Kazakh word for ‘baker,’ (nawbayshı), is marked N because it is composed of a Persian-origin base, from /na:nva:/, ‘baker’, and a suffix that indicates a person associated with a profession, (-shı). The Turkmen word for ‘baker,’ (çörekçi), on the other hand, will be marked T, because both its base (çörek) and affix (-çi) are Turkic.
Table 7 shows an example of an entry that has been fully annotated for both cognates and etymology.
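The exception codes can be read as a simple decision rule over the origins of a word’s constituents. The following is a hedged sketch under stated assumptions: the function `etymology_code` and its arguments are hypothetical, the origins of the constituents are taken as given (in practice they are the annotator’s judgment), and the requirement for N that the bare base be plausibly attested elsewhere is not modeled.

```python
def etymology_code(base_origins, affix_origins=(), is_verb=False):
    """Choose the second annotation character for a possibly complex word.

    `base_origins` lists the origin letter of each base constituent and
    `affix_origins` the origin letter of each affix or auxiliary verb.
    """
    origins = set(base_origins) | set(affix_origins)
    if len(origins) == 1:
        # All constituents share one origin, e.g. an all-Turkic compound.
        return next(iter(origins))
    if len(set(base_origins)) > 1:
        # Compound whose bases come from different origins.
        return "X"
    base = base_origins[0]
    if base != "T" and affix_origins and all(a == "T" for a in affix_origins):
        # Non-Turkic base with Turkic auxiliary/affix: verb or nominal.
        return "V" if is_verb else "N"
    # Configurations not described in the text are left to the annotator.
    raise ValueError("combination not covered by this sketch")


# Examples from the text, with constituent origins as stated there.
print(etymology_code(["T", "T"], ["T"]))        # Turkmen 'manager'
print(etymology_code(["T", "R"], ["T"]))        # Tatar 'truck'
print(etymology_code(["A"], ["T"], is_verb=True))  # 'to repeat'
print(etymology_code(["P"], ["T"]))             # Kazakh 'baker'
```

Checking for a single shared origin first means that fully Turkic compounds and affixed forms fall through to T before any exception code is considered, which mirrors the contrast examples given for X and N.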
3 Inter-Annotator Agreement
We pilot-tested our annotation system with two annotators on 400 etymology annotations. (Table 8 has 392 entries because the annotators claimed eight entries had multiple translations for the same English gloss.) Both annotators have studied linguistics. Also, both are native English speakers with experience studying or speaking multiple Turkic languages, Persian, and Arabic. Training consisted of studying the authors’ annotation manual and asking any follow-up questions. Both annotators made approximately 240 annotations per hour.
Table 8 shows the contingency matrix for annotating the 400 entries. (We left out columns for English, Greek, Italian, and Chinese, which were not relevant for the 50 entries, according to unanimous agreement of our annotators.) From Table 8 it is clear that agreement is substantial, and that when there is disagreement it is largely for the difficult cases of inconclusive origin and the multi-language exceptions: Q, X, V, and N. We measured inter-annotator agreement using Cohen’s Kappa [Cohen, 1960] and found Kappa = 0.5927 (95% CI = 0.5192 to 0.6662). If we restrict attention to only the instances where neither of the annotators marked an inconclusive origin or multi-language exception, then Kappa is 0.9216, generally considered high agreement. This shows that our annotation system is feasible for use and also shows that to improve the system we might focus efforts on finding ways to increase agreement on the annotation of the exceptional cases (Q, X, V, and N).
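For readers unfamiliar with the statistic, Cohen’s Kappa compares observed agreement p_o with the agreement p_e expected by chance from each annotator’s label frequencies: Kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from a square contingency matrix. The function name `cohens_kappa` is hypothetical and the toy matrix is invented for illustration; it is not the data of Table 8.

```python
def cohens_kappa(matrix):
    """Cohen's Kappa for a square contingency matrix.

    matrix[i][j] counts the items that annotator 1 labeled with category i
    and annotator 2 labeled with category j.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement from the marginal totals of each annotator.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Toy two-category matrix (invented): p_o = 0.7, p_e = 0.5, Kappa is about 0.4.
toy = [
    [20, 5],
    [10, 15],
]
print(round(cohens_kappa(toy), 4))
```

Restricting the matrix to the rows and columns of the non-exceptional codes before calling the function corresponds to the restricted Kappa reported above.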
4 Conclusions and Future Work
The Turkic languages are a promising candidate family of languages to benefit from automated translation lexicon induction. A necessary step in that direction is the creation of annotated data for cognates and etymology. However, this annotation is not straightforward, as the Turkic languages exhibit extensive and diverse etymological relationships among words. Some distinctions are difficult for annotators to make and some are easier. Also, some distinctions are expected to be more useful than others for automating cross-lingual applications among the Turkic languages. We presented an annotation methodology that balances the research effort required of the annotator with the expected value of the annotations. We surveyed and explained the wide range of the most important relationships observed in the Turkic languages and how to annotate them. When we finish the annotations, we would like to make the annotated data available as long as it is legal under copyright laws for us to do so. Finally, we hope that our annotation system and the associated discussion can be useful for other teams that are annotating Turkic resources, and perhaps parts of it can be useful for annotating resources for other language families as well.
References
- [Cohen, 1960] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
- [Haspelmath and Tadmor, 2009] Martin Haspelmath and Uri Tadmor. 2009. The Loanword Typology project and the World Loanword Database. In Martin Haspelmath and Uri Tadmor, editors, Loanwords in the World’s Languages: A Comparative Handbook, pages 1–34, Berlin. Walter de Gruyter.
- [Kirchner, 2006] Mark Kirchner. 2006. Kirghiz. In Lars Johanson and Éva Á. Csató, editors, The Turkic Languages, pages 344–356, New York. Routledge.
- [Kondrak, 2001] Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL ’01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Mann and Yarowsky, 2001] Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL ’01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Öztopçu et al., 1996] Kurtuluş Öztopçu, Zhoumagaly Abuov, Nasir Kambarov, and Youssef Azemoun. 1996. Dictionary of the Turkic Languages. Routledge, New York.
- [Róna-Tas, 2006] András Róna-Tas. 2006. The reconstruction of Proto-Turkic and the genetic question. In Lars Johanson and Éva Á. Csató, editors, The Turkic Languages, pages 67–80, New York. Routledge.
- [Schafer and Yarowsky, 2002] Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Türk Dil Kurumu, 1942] Türk Dil Kurumu. 1942. Felsefe ve Gramer Terimleri. Cumhuriyet Basımevi, Istanbul.
- [van der Ark et al., 2007] René van der Ark, Philippe Mennecier, John Nerbonne, and Franz Manni. 2007. Preliminary identification of language groups and loan words in Central Asia. In Proceedings of the RANLP Workshop on Computational Phonology, pages 12–20, Borovetz, Bulgaria.