Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets
Sparsity is one of the major problems in natural language processing. The problem is even more severe in agglutinating languages, which are highly inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditional random fields (CRFs) and we employ the morpheme tags in part-of-speech (PoS) tagging by using hidden Markov models (HMMs) to mitigate sparsity. Results show that using morpheme tags in PoS tagging helps alleviate the sparsity in emission probabilities. Our model outperforms other hidden Markov model based PoS tagging models for small training datasets in Turkish. We obtain an accuracy of 94.1% in morpheme tagging and 89.2% in PoS tagging on a 5K training dataset.
Keywords: morphology, syntax, part-of-speech tagging, sparsity, conditional random fields (CRFs), hidden Markov models (HMMs)
Turkish is an agglutinating language that builds words by gluing together meaning-bearing units called morphemes. While gluing morphemes together, vowel harmony and consonant assimilation are applied extensively, leading to orthographic transformations in morphemes. For example, the suffix dir can be transformed into dır, dur, dür depending on the last vowel in the word to which it is being attached. This is called vowel harmony. Moreover, the same morpheme can be transformed into tir, tır, tur, tür, this time depending on the last consonant of the word. This is called consonant assimilation. Both vowel harmony and consonant assimilation introduce different realizations of the same morpheme, which are called allomorphs (e.g. dir, dır, dur, dür, tir, tır, tur, tür are all allomorphs).
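The two processes can be illustrated with a minimal sketch that picks the surface form of the suffix dir for a given stem. The function name `dir_allomorph` and the lookup tables are illustrative assumptions, not part of the model described in this paper; the sketch assumes a lowercase stem written in the Turkish alphabet and ignores exceptional words.

```python
# Four-way vowel harmony: the suffix vowel is chosen by the last stem vowel.
HARMONY = {
    "a": "ı", "ı": "ı",   # back, unrounded
    "e": "i", "i": "i",   # front, unrounded
    "o": "u", "u": "u",   # back, rounded
    "ö": "ü", "ü": "ü",   # front, rounded
}

# Voiceless consonants that turn the suffix-initial d into t.
VOICELESS = set("fstkçşhp")


def dir_allomorph(stem: str) -> str:
    """Return the allomorph of the suffix 'dir' that attaches to `stem`."""
    last_vowel = next((ch for ch in reversed(stem) if ch in HARMONY), None)
    if last_vowel is None:
        raise ValueError(f"no vowel found in stem: {stem!r}")
    onset = "t" if stem[-1] in VOICELESS else "d"   # consonant assimilation
    return onset + HARMONY[last_vowel] + "r"        # vowel harmony


if __name__ == "__main__":
    for stem in ["ev", "kitap", "göz", "okul", "süt"]:
        print(stem, "+", dir_allomorph(stem))
```

All eight surface forms listed above come out of the same two lookups, which is why a tagger that maps them to a single morpheme tag sees far fewer distinct symbols than one that works on surface forms.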
Agglutination already introduces a sparsity problem in natural language processing. The problem becomes more acute when a morpheme has several different realizations. Identifying morphemes that are realizations of each other is the starting point of this work.
Morphological segmentation systems normally provide only the segments of words without any morpheme tags. However, labeled segmentation is required for some natural language processing tasks. For example, in sentiment analysis the Turkish negation suffix ma (and its allomorph me) needs to be distinguished from the derivational suffix ma (and its allomorph me) that turns a verb into a noun in order to extract the correct sentiment. The same applies to machine translation, question answering, and other natural language processing applications.
Morpheme tagging has become a neglected aspect of morphological segmentation. In this paper, we use conditional random fields (CRFs) for morpheme tagging in a weakly-supervised setting. We use the obtained morpheme tags in part-of-speech tagging (PoS tagging) in order to mitigate sparsity when only a small amount of data is available. The sparsity problem is quite severe in PoS tagging, especially for agglutinating languages, and different methods (e.g. smoothing) have been applied to deal with it. The sparsity is alleviated significantly by using morpheme tags rather than lexical instances such as words or suffixes.
This paper is organized as follows: Section 2 reviews the related work in the literature, Section 3 describes the CRF model adopted in morpheme tagging and the HMMs used in PoS tagging, Section 4 presents the experimental results from both tasks, and finally Section 5 concludes the paper with the remaining future work.
2 Related Work
There has been a substantial amount of work on unsupervised morphological segmentation. Goldsmith [10] and Creutz and Lagus [5] build morphological segmentation systems based on minimum description length (MDL). Creutz and Lagus [6] introduce a hidden Markov model (HMM) that employs the probability distributions between different morpheme categories such as prefix, stem, and suffix. Poon et al. [15] introduce a log-linear model for unsupervised morphological segmentation that incorporates MDL-inspired priors.
All of these models provide only morphological segmentations of words and not any morphological tags that identify the morpheme roles within a word. Learning morpheme tags involves distinguishing homophonous morphemes (morphemes with the same surface forms but with different meanings) and learning allomorphs. Oflazer [14] introduces derivational boundaries and inflectional groups in Turkish morphological analysis. This is performed by two-level morphology (PC-KIMMO [2, 12]) that formulates morphological segmentation via a cascade of finite state transducers by employing morphophonemic alternations. All orthographic and morphophonemic rules are implemented by a set of finite-state automata (FSA) rules. This model gives a labeled morphological analysis based on these rules.
Allomorfessor [20] is one of the models that aims to perform morphological segmentation based on allomorphs by modeling mutations between different surface forms of morphemes, namely allomorphs. Can and Manandhar [3] develop an agglomerative hierarchical clustering algorithm to find the morpheme classes in an unsupervised setting.
To our knowledge, Cotterell et al. [4] introduced labeled morphological segmentation for the first time in a supervised learning framework without using any rules. They model morphotactics by a semi-Markov model. Different tagsets are introduced that capture different levels of granularity. Our model resembles theirs with respect to morphological tagging.
Morpheme tags have been used in many natural language processing tasks. El-Kahlout and Oflazer [7] employ morphological tags in statistical machine translation in order to alleviate sparsity by matching the Turkish morphemes that have the same morphological tag to the same English translation. They report that using morphological tags provides a substantial improvement in the BLEU score.
Morpheme tags have also been used in morphological/PoS disambiguation in Turkish. Ehsani et al. [8] use conditional random fields for disambiguating PoS tags in Turkish by utilizing the morphological tags. They introduce some dependencies between inflectional groups of morphemes in order to simplify the transition probabilities. Sak et al. [16] apply the perceptron algorithm to morphological disambiguation. Hakkani-Tür et al. [11] formulate a trigram HMM based on inflectional groups in order to disambiguate the morphological parses of a given word. The results show that using the dependencies between inflectional groups of adjacent words improves PoS tagging accuracy. Many of these models select a complete morphological analysis for each word rather than providing a single PoS tag.
Dincer et al. [19] formulate HMMs by emitting suffixes rather than emitting words in order to mitigate the sparsity. However, they do not use any morpheme tags. Our PoS model is most similar to their work in this respect. We use morpheme tags in order to cope with the sparsity in emission probabilities rather than using fixed-length endings of words.
3.1 Turkish Morphology
Turkish is an agglutinating language that has productive inflectional and derivational suffixation. This causes a sparsity problem in NLP tasks due to the large vocabulary that the language produces. The vocabulary size of a corpus of 1 million words reaches 106,547. In order to deal with the sparsity, a representation that shows inflectional groups and derivation boundaries of the morphological analysis of each word is introduced by Hakkani-Tür et al. [11]. The different morphological analyses of the word alındı are given as follows by a two-level morphological analyzer:
al+Verb^DB+Verb+Pass+Pos+Past+A3sg (it was taken)
al+Adj^DB+Noun+Zero+A3sg+P2sg+Nom^DB+Verb+Zero+Past+A3sg (it was your red)
al+Adj^DB+Noun+Zero+A3sg+Pnon+Gen^DB+Verb+Zero+Past+A3sg (it was the one of the red)
alın+Noun+A3sg+Pnon+Nom^DB+Verb+Zero+Past+A3sg (it was the forehead)
Here, ^DB denotes a derivation boundary and the rest of the morpheme tags form the inflectional groups (IGs). Most words in Turkish have more than one morphological analysis, and morphological disambiguation aims to find the right morphological analysis of a word in a specific context.
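As an illustration of this representation, the following sketch splits an analysis string into its root and its inflectional groups. The helper name `parse_analysis` is hypothetical and the parsing rule is an assumption based only on the notation shown above (`+` separates tags, `^DB` separates inflectional groups).

```python
from typing import List, Tuple


def parse_analysis(analysis: str) -> Tuple[str, List[List[str]]]:
    """Split 'root+Tag+...^DB+Tag+...' into (root, list of inflectional groups)."""
    groups = analysis.split("^DB")
    first = groups[0].split("+")
    root, first_tags = first[0], first[1:]
    # Later groups start with '+', so strip it before splitting into tags.
    igs = [first_tags] + [g.lstrip("+").split("+") for g in groups[1:]]
    return root, igs


if __name__ == "__main__":
    root, igs = parse_analysis("al+Verb^DB+Verb+Pass+Pos+Past+A3sg")
    print(root)         # the root of the word
    print(igs)          # one list of tags per inflectional group
    print(igs[-1][-1])  # the last morpheme tag of the analysis
```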
In this work, we are only using the morpheme tags (both derivational and inflectional) of words in order to find a single PoS tag for each word. We believe that morpheme tags give the best clue for a PoS tag. This is sufficient if we are only interested in syntax but not in the meaning. For example, the analyses of alındı that end with A3sg can be considered as verbs, whereas the only analysis ending with Nom can be considered as a noun. In order to find the morpheme tags we only use the morphotactic features of morphemes within the words, whereas we use contextual features and morphological features in PoS tagging.
3.2 Morphological Tagging by Using CRFs
Conditional Random Fields (CRFs) [13] are undirected graphical models that are generally used for segmenting and labeling a given sequence. Unlike HMMs, CRFs are discriminative models that define the conditional distribution $P(Y|X)$ rather than the joint probability distribution $P(X, Y)$, where $Y$ corresponds to the label sequence and $X$ corresponds to the input (i.e. observation) sequence. In our case, the label sequence refers to the morpheme tags and the observation sequence refers to the morphemes.
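The conditional distribution in our CRF model is given as follows:
$$P(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{N} \sum_{j=1}^{M_i} \sum_{k} \lambda_k f_k(y_{i,j-1}, y_{i,j}, X, j) \right)$$
that iterates over the morphemes of each word in the corpus with $N$ words, each having $M_i$ morphemes, defined on a feature set $F = \{f_k\}$. Here $Z(X)$ is the normalization factor:
$$Z(X) = \sum_{Y'} \exp\left( \sum_{i=1}^{N} \sum_{j=1}^{M_i} \sum_{k} \lambda_k f_k(y'_{i,j-1}, y'_{i,j}, X, j) \right)$$
Here $\lambda$ corresponds to the weight vector for the feature set $F$. Feature functions are of two types: state feature functions $s(y_{i,j}, X, j)$ and transition feature functions $t(y_{i,j-1}, y_{i,j}, X, j)$, where $j$ denotes the input position. A state feature function is non-zero when the label $y_{i,j}$ matches the label defined in the function, whereas transition functions depend on the label pair $(y_{i,j-1}, y_{i,j})$.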
Our model is given in Figure 1. We adopt a Naive model where an edge is built between every state pair. Therefore, morpheme tags within the same word are assumed to be dependent on each other, whereas each word is assumed to be independent from the others. Thus, we deal with only morphotactic rules within the same word for morpheme tagging task without using any contextual features.
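To make the roles of the state features, the transition features and the normalization factor concrete, here is a toy sketch of a linear-chain CRF over the morphemes of a single word. It is not the CRF package used in the experiments: the weights, the tag name `Plural`, and the function names are hypothetical, and the normalization is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math


def score(tags, morphemes, state_w, trans_w):
    """Sum of weighted state and transition features for one tag sequence."""
    total = 0.0
    for j, (morpheme, tag) in enumerate(zip(morphemes, tags)):
        total += state_w.get((morpheme, tag), 0.0)       # state feature
        if j > 0:
            total += trans_w.get((tags[j - 1], tag), 0.0)  # transition feature
    return total


def conditional_prob(tags, morphemes, labels, state_w, trans_w):
    """P(tags | morphemes), normalized over every possible tag sequence."""
    z = sum(
        math.exp(score(candidate, morphemes, state_w, trans_w))
        for candidate in itertools.product(labels, repeat=len(morphemes))
    )
    return math.exp(score(tags, morphemes, state_w, trans_w)) / z


if __name__ == "__main__":
    morphemes = ["ev", "ler", "de"]            # "in the houses"
    labels = ["Root", "Plural", "Location"]    # hypothetical tag set
    state_w = {("ev", "Root"): 2.0, ("ler", "Plural"): 2.0, ("de", "Location"): 2.0}
    trans_w = {("Root", "Plural"): 1.0, ("Plural", "Location"): 1.0}
    print(conditional_prob(("Root", "Plural", "Location"),
                           morphemes, labels, state_w, trans_w))
```

Because each word is scored on its own, the normalization runs only over the tag sequences of that word, which matches the assumption that words are independent in the morpheme tagging task.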
3.3 Adopting Morphological Tags in PoS Tagging
We use the morpheme tags obtained from the CRF model in order to infer the PoS tags of words. We find the PoS tag sequence that maximizes the probability for a given sequence of words:
$$\hat{T} = \arg\max_{T} P(T|W)$$
where $T = t_1, \dots, t_n$ denotes the PoS tags and $W = w_1, \dots, w_n$ denotes the sequence of words. Bayes' rule is applied to the posterior probability as follows:
$$\arg\max_{T} P(T|W) = \arg\max_{T} \frac{P(W|T)\,P(T)}{P(W)} = \arg\max_{T} P(W|T)\,P(T)$$
where $P(W)$ is discarded since it is the same for all tag assignments for the given word sequence.
We formulate the posterior probability as a trigram HMM by assuming that each PoS tag depends only on the previous two tags:
$$\arg\max_{T} P(W|T)\,P(T) \approx \arg\max_{T} \prod_{i=1}^{n} P(w_i|t_i)\,P(t_i|t_{i-2}, t_{i-1})$$
We apply interpolation to smooth the transition probabilities in order to rule out zeros in transitions with the equation given below:
which defines an nth-order smoothed model where corresponds to the transition probability after interpolation is applied recursively. We estimate the parameters by tuning our model on a development set.
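One common way to realize such smoothing is to mix the trigram, bigram and unigram maximum-likelihood estimates with weights that sum to one. The sketch below shows this flat form under stated assumptions: the function names, the `<s>` padding symbol and the example weights are hypothetical, and the recursive formulation and the tuned coefficients used in our experiments may differ.

```python
from collections import Counter

START = "<s>"  # hypothetical padding symbol for sentence starts


def count_ngrams(tag_sequences):
    """Collect n-gram counts and their context counts from tag sequences."""
    uni, bi, tri, ctx1, ctx2 = Counter(), Counter(), Counter(), Counter(), Counter()
    for tags in tag_sequences:
        padded = [START, START, *tags]
        for i in range(2, len(padded)):
            a, b, c = padded[i - 2], padded[i - 1], padded[i]
            uni[c] += 1
            bi[(b, c)] += 1
            tri[(a, b, c)] += 1
            ctx1[b] += 1        # how often b was seen as a bigram context
            ctx2[(a, b)] += 1   # how often (a, b) was seen as a trigram context
    return uni, bi, tri, ctx1, ctx2


def interpolated_transition(tag, prev, prev2, counts, lambdas=(0.6, 0.3, 0.1)):
    """lambda3 * P(tag|prev2,prev) + lambda2 * P(tag|prev) + lambda1 * P(tag)."""
    uni, bi, tri, ctx1, ctx2 = counts
    l3, l2, l1 = lambdas
    total = sum(uni.values())
    p1 = uni[tag] / total if total else 0.0
    p2 = bi[(prev, tag)] / ctx1[prev] if ctx1[prev] else 0.0
    p3 = tri[(prev2, prev, tag)] / ctx2[(prev2, prev)] if ctx2[(prev2, prev)] else 0.0
    return l3 * p3 + l2 * p2 + l1 * p1


if __name__ == "__main__":
    counts = count_ngrams([["Noun", "Verb"], ["Adj", "Noun", "Verb"]])
    print(interpolated_transition("Verb", "Noun", "Adj", counts))   # seen trigram
    print(interpolated_transition("Verb", "Noun", "Verb", counts))  # unseen trigram
```

An unseen trigram still receives probability mass from the bigram and unigram terms, which is what rules out zero transitions.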
The sparsity problem also emerges in the emission probabilities. We emit the tag of the last morpheme in the word if the word has more than two segments. Otherwise, the word itself is emitted from the PoS tag as seen in Figure 2. Therefore, the emission probabilities are estimated as follows:
where is the morpheme tag of the last suffix in the word. We apply interpolation to smooth for the words which do not exist in the corpus and cannot be segmented further:
Here corresponds to the smoothed emission probabilities, is the number of word tokens of type , is the vocabulary size, and is the interpolation coefficient.
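The choice of the emitted symbol described above can be sketched as follows. The function name `emission_symbol`, the `(kind, value)` return convention and the `min_segments` parameter are illustrative assumptions; the default follows the wording "more than two segments" literally, with the root counted as a segment.

```python
from typing import List, Tuple


def emission_symbol(word: str, morpheme_tags: List[str],
                    min_segments: int = 3) -> Tuple[str, str]:
    """Pick what the HMM state emits for a word.

    `morpheme_tags` holds one tag per segment of the word (root included).
    Words with enough segments emit the tag of their last morpheme;
    all other words emit their own surface form.
    """
    if len(morpheme_tags) >= min_segments:
        return ("TAG", morpheme_tags[-1])
    return ("WORD", word)


if __name__ == "__main__":
    print(emission_symbol("evlerde", ["Root", "Plural", "Location"]))
    print(emission_symbol("ev", ["Root"]))
```

Since the number of morpheme tags is far smaller than the number of word forms, the emission table built from `TAG` symbols is much denser than one built from words.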
The Viterbi algorithm is applied to find the PoS tag sequence that maximizes the posterior probability given in Equation 3.3.
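A minimal sketch of Viterbi decoding for a second-order (trigram) HMM is given below. It is an illustration rather than the exact implementation used in the experiments: `trans_prob` and `emit_prob` are assumed to be caller-supplied functions returning smoothed probabilities, and the `<s>` start symbol is a hypothetical convention.

```python
import math

START = "<s>"  # hypothetical padding tag for the two positions before the sentence


def viterbi_trigram(observations, tags, trans_prob, emit_prob):
    """Return the most probable tag sequence under a trigram HMM.

    trans_prob(tag, prev, prev2) -> P(tag | prev2, prev)
    emit_prob(obs, tag)          -> P(obs | tag)
    Raises ValueError if every path has zero probability.
    """
    if not observations:
        return []
    # best[(u, v)]: best log-probability of a prefix ending in tags u, v.
    best = {(START, START): 0.0}
    backpointers = []
    for obs in observations:
        new_best, pointers = {}, {}
        for (u, v), prefix_score in best.items():
            for t in tags:
                p = trans_prob(t, v, u) * emit_prob(obs, t)
                if p <= 0.0:
                    continue
                candidate = prefix_score + math.log(p)
                if candidate > new_best.get((v, t), -math.inf):
                    new_best[(v, t)] = candidate
                    pointers[(v, t)] = u   # remember the tag two steps back
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(best, key=best.get)
    path = [state[1]]
    for pointers in reversed(backpointers[1:]):
        state = (pointers[state], state[0])
        path.append(state[1])
    path.reverse()
    return path
```

The state space consists of tag pairs, so decoding a sentence of length n with K tags costs on the order of n·K³ operations, which is small for a 13-tag PoS set.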
4 Experiments & Results
We used several different corpora for the experiments. One of them is METU-Sabancı Turkish Treebank  that consists of 56k word tokens and 5600 sentences. The dataset includes PoS tags and morphological analyses of the words.
For the additional experiments, in order to compare our CRF model with the semi-Markov model by Cotterell et al. [4], we used their dataset, which consists of 3573 morphologically segmented and tagged word tokens, of which 1987 words belong to the train set and 1586 words belong to the test set.
In order to compare our PoS tagging model with Sak et al. [16], we used their training and test sets, which are collected from various newspaper archives. This dataset consists of 800k word tokens and 47.5k sentences.
For all of the experiments, we used a separate development set that consists of 6K words to tune the interpolation coefficients. We assigned for the emission probabilities, (bigram) and (unigram) for the bigram transition probabilities, and (trigram), (bigram), and (unigram) for the interpolation used in trigram transitions.
For the morpheme tagging experiments, we removed all punctuation from the datasets. We reintroduced the terminal punctuation for the PoS tagging task, since sentence boundaries are crucial for PoS tagging.
4.2 Experiments on Morphological Tagging
In the morpheme tagging task, we assume that morphological segmentations of words are provided. We obtained the segmentations and morpheme labels through an open-source morphological analyzer called Zemberek [1] in order to build a train set for the morpheme tagging task. Zemberek defines 84 different morpheme tags on the METU-Sabancı Treebank. We used an open-source CRF package to train our own model on our training set. Table 1 lists some of the morphemes that the trained CRF model assigned to the same morpheme tag on the test set. The final morpheme tags show that allomorphs can be learned by our model. For example, la, le, yla, and yle are all allomorphs.
|Morpheme tag||Example morphemes obtained by CRF|
|Location||da (7439), de (7633), nde (4877), te (1520), nda (8548), ta (1537), çi (2), un (1)|
|Infinitive||mek (2774), mak (3481)|
|Inst||la (2218), le (2299), yla (2186), yle (2518)|
|AfterDoing||yıp (141), yip (104), up (335), üp (106), yup (20), ün (12), ım (24), ümüz (2)|
|Dative||na (5877), e (7298), ne (4860), ye (2718), ya (2741)|
|Progressive||ıyor (4738), iyor (6022), uyor (1756), üyor (1343), ıcı (2)|
|Desire||se (301), sa (659), yacak (8), yecek (3)|
|Ablative||nden (1959), dan (3053), tan (1061), ndan (3514), ten (874), dik (1)|
|Narrative||mış (595), müş (276), miş (1640), muş (782), tür (7), tır (30), tikçe (3), ıver (2), tık (1), üver (1), ıcı (2), ün (1), yıver (1), tığ (1), sın (1), çi (1), dür (1), üm (1)|
|Train set size||Test set size||Number of tags||F1 Score|
We again used Zemberek to create gold sets for the morpheme tagging task. F1 scores for morphological tagging for different sizes of train and test sets obtained from the METU-Sabancı Turkish Treebank are given in Table 2. (Precision and recall values are the same because the gold sets and the results consist of the same number of morphemes, since we only perform morpheme tagging and not segmentation.) The F1 score increases as the training set grows from 500 words to 5K words, so the model improves significantly with the larger training sets.
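The equality of precision and recall in this setting follows directly from the definitions, as the short sketch below illustrates. The function name `precision_recall_f1` and the example tag sequences are hypothetical; the sketch assumes that gold and predicted tags are aligned one-to-one because the segmentation is given.

```python
def precision_recall_f1(gold, predicted):
    """Token-level precision, recall and F1 for aligned tag sequences."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    precision = correct / len(predicted)   # correct / number of predicted tags
    recall = correct / len(gold)           # correct / number of gold tags
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["Root", "Inst", "Dative", "Location"]
    predicted = ["Root", "Inst", "Location", "Location"]
    print(precision_recall_f1(gold, predicted))
```

When both sequences have the same length, the two denominators coincide, so precision, recall, F1 and accuracy all reduce to the fraction of correctly tagged morphemes.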
We also tested our model on the manually collected newspaper archives, which is much larger than Metu-Sabancı Turkish Treebank. We obtained an F1 score of on a 5K train set and 700K test set. This shows that the performance of our model does not drop significantly for larger test sets. The results are given in Table 3.
|Train set size||Test set size||Number of tags||F1 Score|
We compared our model with Chipmunk [4] by using their tag set and datasets. The results are given in Table 4. Our CRF model outperforms their model on accuracy, whereas their model outperforms ours on F1 score. However, it should be noted that the Chipmunk dataset lacks derivational morpheme tags, whereas our model also learns derivational morpheme tags.
|Train set size||Test set size||Accuracy||F1 Score|
4.3 Experiments on PoS Tagging
Our PoS tag set consists of 13 major PoS tags: Adj, Adv, Conj, Det, Interj, Noun, Num, Postp, Pron, Punc, Verb, Ques, and Dup.
|Train size||Test size||Tag num.||Word Emission Acc.||Suffix Acc.||Morpheme Tag Acc.|
We tested our model on two different datasets. The first set of experiments were performed on Metu-Sabancı Turkish Treebank. The results are given in Table 5 for different sizes of train/test sets and for different emission types. We provide results for word emissions, last suffix emissions, and the tag of the last morpheme’s emissions. For a 5K training set, word emission accuracy is , suffix emission accuracy is , and morpheme tag emission accuracy is . This shows that using morpheme tag emission outperforms both word emissions and last suffix emissions in smaller datasets. The accuracy increases on a 40K train set, but still using morpheme tag emissions outperforms using word and last suffix emissions.
The results obtained from the manually collected newspaper archives are given in Table 6. This time, word emissions outperform the last suffix and the last morpheme tag emissions because sparsity is no longer a problem with the larger train sets.
|Train size||Test size||Tag num.||Word Emission Acc.||Suffix Acc.||Morpheme Tag Acc.|
|With terminal punc - multiple HMM||88.98%||90.95%||91.05%|
|With terminal punc - single HMM||88.60%||90.86%||90.96%|
|Without terminal punc - multiple HMM||87.51%||89.74%||89.85%|
|Without terminal punc - single HMM||86.30%||88.19%||88.53%|
|Model||Train Set Size||Test Set Size||Accuracy|
|Suffix-based tagger [19]||5025||1017||84.25%|
|HMM with the last morpheme tag||5025||1017||88.98%|
|Suffix-based tagger [19]||18205||1017||88.90%|
|HMM with the last morpheme tag||18205||1017||90.95%|
In order to measure the impact of terminal punctuation on PoS tagging, we ran two sets of experiments on the METU-Sabancı Turkish Treebank. In the first set we included the terminal punctuation, whereas in the second set we excluded it. With terminal punctuation included, we first built one HMM for each sentence in the training set, and then built a single HMM for the entire corpus, in which all words are linked to each other and sentences are separated by terminal punctuation. On the 5K train set, we obtained an accuracy of 88.98% for the multiple HMM approach and 88.60% for the single HMM approach. In the second set of experiments, we excluded the terminal punctuation entirely and repeated the experiments for multiple HMMs and a single HMM. We obtained an accuracy of 87.51% for multiple HMMs and 86.30% for a single HMM. Results are given in Table 7. They show that terminal punctuation plays an important role in PoS tagging, and that treating each sentence as a separate HMM, i.e. assuming that sentences are independent of each other, leads to a slight increase in accuracy.
We compared our model with Sak et al. [16] and Dincer et al. [19] on the METU-Sabancı Turkish Treebank. We used the last 5 letters of each word with a second-order HMM to implement the suffix-based tagger by Dincer et al. [19], since their model gives the best scores for the last 5 letters. Results are given in Table 8. The results show that our model outperforms the other two models on smaller datasets (i.e. 5K and 18K).
Obtaining data is one major problem in natural language processing tasks. Using small datasets by reducing sparsity is one challenge in natural language processing. Here, we aimed to increase the accuracy of PoS tagging for an agglutinating language on smaller datasets when large datasets are not available. Our results show that it is possible to use a kind of class-based language model by grouping the morphemes according to their syntactic roles within a word by tagging them and then using it for PoS tagging to reduce the sparsity in smaller datasets.
5 Conclusion & Future Work
We introduced a CRF model to tag morphemes syntactically and an HMM for PoS tagging that uses these morpheme tags in order to reduce the sparsity in Turkish PoS tagging on smaller datasets. We obtained morpheme tags with an F1 score of 94.1% on a limited training set by using CRFs. Then, we trained a second-order HMM in which the last morpheme tag of each word is emitted from each HMM state in order to perform PoS tagging, contrary to the conventional approach of emitting the surface forms of words from HMM states. The results show that using the last morpheme tags helps deal with sparsity, especially on small train sets.
We believe that the morphological features of the context words will also be informative in the morpheme tagging task, because Eryiğit et al. [9] show that using inflectional groups as units in Turkish dependency parsing increases parsing performance. We leave the use of contextual information in morpheme tagging as future work.
This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) with the project number EEEAG-115E464 and we are grateful to TUBITAK for their financial support.
- [1] Akın, A.A., Akın, M.D.: Zemberek, an open source nlp framework for Turkic languages. Structure 10, 1–5 (2007)
- [2] Antworth, L.E.: PC-KIMMO: A two-level processor for morphological analysis. Occasional Publications in Academic Computing, Dallas (1990)
- [3] Can, B., Manandhar, S.: An agglomerative hierarchical clustering algorithm for morpheme labelling. In: Proceedings of the Recent Advances in Natural Language Processing 2013. RANLP 2013 (2013)
- [4] Cotterell, R., Müller, T., Fraser, A., Schütze, H.: Labeled morphological segmentation with semi-markov models. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning. pp. 164–174. Association for Computational Linguistics, Beijing, China (July 2015), http://www.aclweb.org/anthology/K15-1017
- [5] Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6. pp. 21–30. MPL ’02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002)
- [6] Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05) (2005)
- [7] Durgar El-Kahlout, I., Oflazer, K.: Initial explorations in English to Turkish statistical machine translation. In: Proceedings on the Workshop on Statistical Machine Translation. pp. 7–14. Association for Computational Linguistics, New York City (June 2006), http://www.aclweb.org/anthology/W06-3102
- [8] Ehsani, R., Alper, M.E., Eryiğit, G., Adalı, E.: Disambiguating main pos tags for Turkish. In: Proceedings of ROCLING - Conference on Computational Linguistics and Speech Processing. Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan (2012)
- [9] Eryiğit, G., Nivre, J., Oflazer, K.: Dependency parsing of Turkish. Comput. Linguist. 34(3), 357–389 (Sep 2008)
- [10] Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2), 153–198 (Jun 2001)
- [11] Hakkani-Tür, D.Z., Oflazer, K., Tür, G.: Statistical morphological disambiguation for agglutinative languages. Computers and the Humanities 36(4), 381–410 (2000)
- [12] Koskenniemi, K.: Two-level morphology: a general computational model for word-form recognition and production. Department of General Linguistics, University of Helsinki (1983)
- [13] Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289. ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
- [14] Oflazer, K.: Two-level description of Turkish morphology. In: Proceedings of the Sixth Conference on European Chapter of the Association for Computational Linguistics. pp. 472–472. EACL ’93, Association for Computational Linguistics, Stroudsburg, PA, USA (1993)
- [15] Poon, H., Cherry, C., Toutanova, K.: Unsupervised morphological segmentation with log-linear models. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 209–217. NAACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA (2009)
- [16] Sak, H., Güngör, T., Saraçlar, M.: Morphological disambiguation of Turkish text with perceptron algorithm. In: Computational Linguistics and Intelligent Text Processing: 8th International Conference, CICLing 2007. pp. 107–118. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
- [17] Say, B., Zeyrek, D., Oflazer, K., Özge, U.: Development of a corpus and a treebank for present-day written Turkish. In: Proceedings of the eleventh international conference of Turkish linguistics. pp. 183–192 (2002)
- [18] Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. pp. 134–141. Association for Computational Linguistics (2003)
- [19] Taner Dinçer, Bahar Karaoğlan, T.K.: A suffix based part-of-speech tagger for Turkish. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05) (2005)
- [20] Virpioja, S., Kohonen, O., Lagus, K.: Unsupervised morpheme analysis with allomorfessor. In: Multilingual Information Access Evaluation I. Text Retrieval Experiments: 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009. pp. 609–616. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)