LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs
We present LemmaTag, a featureless recurrent neural network architecture that jointly generates part-of-speech tags and lemmatizes sentences of languages with complex morphology, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network and from using the tagger output as an input to the lemmatizer. We evaluate our model across several morphologically-rich languages, surpassing state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.
Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Morphologically-rich languages are often difficult to process in many NLP tasks (Tsarfaty et al., 2010). As opposed to analytical languages like English, morphologically-rich languages encode diverse sets of grammatical information within each word using inflections, which convey characteristics such as case, gender, and tense. The addition of several inflectional variants across many words dramatically increases the vocabulary size, which results in data sparsity and out-of-vocabulary (OOV) issues.
Due to these issues, morphological part-of-speech (POS) tagging and lemmatization are heavily used in NLP tasks such as machine translation (Fraser et al., 2012) and sentiment analysis (Abdul-Mageed et al., 2014). In morphologically-rich languages, the POS tags typically consist of multiple morpho-syntactic categories providing additional information (see Figure 1). Closely related to POS tagging is lemmatization, which involves transforming each word to its root or dictionary form. Both tasks require context-sensitive awareness to disambiguate words with the same form but different syntactic or semantic structure. Furthermore, lemmatization of a word form can benefit substantially from the information present in morphological tags, as grammatical attributes often disambiguate word forms using context (Müller et al., 2015).
We address context-sensitive POS tagging and lemmatization using a neural network model that jointly performs both tasks on each input word in a given sentence (https://github.com/hyperparticle/LemmaTag). We train the model in a supervised fashion, requiring training data containing word forms, lemmas, and POS tags. Our model is related to the work of Müller et al. (2015), who use conditional random fields (CRF) to jointly tag and lemmatize words for morphologically-rich languages. In addition, we incorporate the ideas from Inoue et al. (2017) to optionally allow the network to predict the subcategories of each tag to improve accuracy.
Our model consists of three parts:
- The shared encoder, which creates an internal representation for every word based on its character sequence and the sentence context. We adopt the encoder architecture of Chakrabarty et al. (2017), utilizing character-level (Heigold et al., 2017) and word-level embeddings (Mikolov et al., 2013; Santos and Zadrozny, 2014) processed through several layers of bidirectional recurrent neural networks (BRNN) (Schuster and Paliwal, 1997; Chakrabarty et al., 2017).
- The tagger decoder, which applies a fully-connected layer to the outputs of the shared encoder to predict the POS tags.
- The lemmatizer decoder, which applies an RNN sequence decoder to the combined outputs of the shared encoder and tagger decoder similar to Bergmanis and Goldwater (2018), producing a sequence of characters that predict each lemma.
The main advantages over other proposed models are:
- The model is featureless, requiring little to no text preprocessing or morphological analysis postprocessing.
- We share the word embeddings, character embeddings, and RNN encoder weights in the tagger and lemmatizer, improving both tagging and lemmatization accuracy while reducing the number of parameters for both tasks.
- We provide the output of the tagger as features for the input of the lemmatizer, further improving lemmatizer accuracy.
We evaluate the accuracy of our model in POS tagging and lemmatization across several languages, including Czech, Arabic, German, and English. For each language, we also compare the performance of a completely separated tagger and lemmatizer to the proposed joint model. Our results show that our joint model improves accuracy on both tasks and achieves state-of-the-art performance in both POS tagging and lemmatization in Czech, German, and Arabic, while closely matching state-of-the-art performance for English.
2 The Joint LemmaTag Model
Given a sequence of words in a sentence, the task of the model is to produce an associated sequence of tags and lemmas, one tag and one lemma per word. We treat the word at each position as the sequence of characters that make it up, whose length may differ from position to position. Analogously, we treat each lemma as the sequence of characters that make it up.
Our proposed model (shown in Figures 2 and 3) is split into three parts: the shared encoder, the tagger, and the lemmatizer. The initial layers of the model are shared between the tagger and lemmatizer, encoding the words and characters in a given sentence. The encoder then passes its outputs to two networks: the tagger, which performs a classification task to predict tags, and the lemmatizer, which performs a sequence prediction task to output lemmas character by character.
2.1 Shared Encoder
In the encoder shown in Figure 2, each character of a word is indexed into an embedding layer to produce a fixed-length embedded vector for that character. These vectors are then passed into a BRNN layer composed of gated recurrent units (GRU) (Cho et al., 2014), which produces one output per character and whose final states are concatenated to produce the character-level embedding of the word. Similarly, we index the word itself into a word-level embedding layer to compute a word-level vector. We then sum the two vectors to produce the final word embedding.
We repeat this process independently for all the words in the sentence and feed the resulting sequence into another two BRNN layers composed of long short-term memory units (LSTM) with residual connections. This produces word-level outputs that encode sentence-level context in each output (we ignore the final hidden states).
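As a rough illustration of how the encoder composes a word representation, the following NumPy sketch sums a character-derived vector with a word-level embedding. It is not the authors' implementation: a plain tanh recurrence stands in for the GRU cells, the sizes are tiny, the weights are random, and all names (`run_rnn`, `char_brnn_embedding`, `embed_word`, the toy vocabulary) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

CHAR_DIM = 4              # size of each directional character RNN state (assumed)
WORD_DIM = 2 * CHAR_DIM   # must match the two concatenated final states

chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
vocab = {"<unk>": 0, "cats": 1, "sat": 2}   # toy vocabulary (assumed)

char_table = rng.normal(size=(len(chars), CHAR_DIM))
word_table = rng.normal(size=(len(vocab), WORD_DIM))
# One (input, recurrent) weight pair per direction; a tanh RNN replaces the GRU.
W_fwd, U_fwd = rng.normal(size=(2, CHAR_DIM, CHAR_DIM))
W_bwd, U_bwd = rng.normal(size=(2, CHAR_DIM, CHAR_DIM))


def run_rnn(inputs, W, U):
    """Run a tanh RNN over `inputs` and return the state after every step."""
    state = np.zeros(CHAR_DIM)
    states = []
    for x in inputs:
        state = np.tanh(x @ W + state @ U)
        states.append(state)
    return np.stack(states)


def char_brnn_embedding(word):
    """Return per-character outputs and the concatenated final states."""
    embedded = char_table[[chars[c] for c in word]]
    fwd = run_rnn(embedded, W_fwd, U_fwd)
    # Process the characters right to left, then restore left-to-right order.
    bwd = run_rnn(embedded[::-1], W_bwd, U_bwd)[::-1]
    outputs = np.concatenate([fwd, bwd], axis=1)   # one row per character
    final = np.concatenate([fwd[-1], bwd[0]])      # last state of each direction
    return outputs, final


def embed_word(word):
    """Sum the character-level and word-level embeddings of one word."""
    _, char_level = char_brnn_embedding(word)
    word_level = word_table[vocab.get(word, vocab["<unk>"])]
    return char_level + word_level
```

Because the character path does not depend on the vocabulary, a word that is mapped to `<unk>` still receives a distinct representation from its spelling.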
2.2 Tagger
The task of the tagger is to predict a tag for each word given the word and its context, where the tag is drawn from a fixed set of possible tags. As explained in the introduction, morphologically-rich languages typically subdivide tags further into several subcategories, each with its own set of possible values. See Figure 1 for an illustration of the Czech PDT tags.
Having the encoded words of a sentence available, the tagger consists of a fully-connected layer with one neuron per possible tag, whose input is the output of the word-level RNN of the encoder. This layer produces the logits of the tag values, and the predicted tag is the one with maximum likelihood.
To capture the categorical structure of each tag, we also predict every category of the tag independently (if categories exist in the dataset) with dense layers similar to Inoue et al. (2017). The layer for each category has one neuron per value of that category and outputs the logits for the category values. Although these layers are trained, their outputs are not used in tag prediction; instead, they are fed into the lemmatizer as an additional set of potentially useful features.
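The tagger described above can be sketched as a handful of dense layers on top of the encoder outputs. This is a minimal sketch under assumed sizes and random weights; the names `tagger` and `tagger_features` and the constants are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

ENC_DIM = 6               # size of one word-level encoder output (assumed)
NUM_TAGS = 5              # number of distinct whole tags (assumed)
CATEGORY_SIZES = [3, 4]   # number of values in each tag category (assumed)

W_tag = rng.normal(size=(ENC_DIM, NUM_TAGS))
b_tag = np.zeros(NUM_TAGS)
W_cat = [rng.normal(size=(ENC_DIM, size)) for size in CATEGORY_SIZES]
b_cat = [np.zeros(size) for size in CATEGORY_SIZES]


def tagger(encoder_outputs):
    """Return predicted whole tags, whole-tag logits, and per-category logits."""
    tag_logits = encoder_outputs @ W_tag + b_tag     # shape: (words, NUM_TAGS)
    predictions = tag_logits.argmax(axis=1)          # most likely whole tag per word
    # Each category gets its own independent dense layer.
    category_logits = [encoder_outputs @ W + b for W, b in zip(W_cat, b_cat)]
    return predictions, tag_logits, category_logits


def tagger_features(tag_logits, category_logits):
    """Concatenate all tagger outputs as extra features for the lemmatizer."""
    return np.concatenate([tag_logits] + category_logits, axis=1)
```

Only the whole-tag logits decide the predicted tag; the category logits contribute to training and to the lemmatizer input.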
2.3 Lemmatizer
The task of the lemmatizer is to produce, for each lemma, its sequence of characters and thereby its length. We use a recurrent sequence decoder, a setup typical of many sequence-to-sequence tasks such as neural machine translation (Sutskever et al., 2014).
The lemmatizer consists of a recurrent LSTM layer whose initial state is taken from the word-level output of the encoder and whose inputs consist of three parts. The first part is the embedding of the previous output character (initially a beginning-of-word character BOW).
The second part is the character-level attention mechanism (Bahdanau et al., 2014) on the outputs of the character-level BRNN. We employ the multiplicative attention mechanism described in Luong et al. (2015), which allows the LSTM cell to compute an attention vector that selectively weights this character-level information at each time step based on the input state of the cell.
The third and final part of the RNN input allows the network to receive information about the embedding of the word, the surrounding context of the sentence, and the output of the tagger. This input is the same for all time steps of a lemma and is a concatenation of the following: the output of the encoder, the embedded word, and the processed tag features. The tag features are obtained by projecting the concatenated outputs of the tagger through a fully-connected layer with ReLU activation. (During training, we do not pass gradients back through the tag features, to prevent distortion of the tagger output.)
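The character-level attention that forms the second part of the decoder input can be illustrated with a bilinear (multiplicative) score in the spirit of Luong et al. (2015). The sketch below is an assumption-laden simplification: the function name, the weight matrix `W`, and the shapes are hypothetical, and the exact variant used in LemmaTag may differ.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(x - x.max())
    return exp / exp.sum()


def multiplicative_attention(decoder_state, char_outputs, W):
    """Weight character-level outputs by a bilinear score against the decoder state.

    decoder_state: (state_dim,)       current input state of the decoder cell
    char_outputs:  (chars, char_dim)  outputs of the character-level BRNN
    W:             (char_dim, state_dim) learned score matrix (assumed shape)
    """
    scores = char_outputs @ (W @ decoder_state)   # one score per character
    weights = softmax(scores)                     # attention distribution
    context = weights @ char_outputs              # weighted sum of character outputs
    return context, weights
```

The returned `context` vector would be recomputed at every decoding step, since the decoder state changes after each generated character.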
The decoder runs until it produces the end-of-word character EOW or reaches a fixed character limit.
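The stopping rule of the decoder can be sketched as a greedy loop. In this illustration the LSTM cell is replaced by a hypothetical `step` callback, the `BOW` and `EOW` marker strings are arbitrary, and the character limit is a parameter because its value is not stated here.

```python
BOW, EOW = "<w>", "</w>"   # marker strings chosen for illustration only


def greedy_decode(step, initial_state, max_chars):
    """Generate characters one at a time until EOW or the character limit."""
    state, previous, lemma = initial_state, BOW, []
    while len(lemma) < max_chars:
        scores, state = step(previous, state)    # scores: dict of character -> score
        previous = max(scores, key=scores.get)   # greedy choice of the best character
        if previous == EOW:
            break
        lemma.append(previous)
    return "".join(lemma)


def make_toy_step(target):
    """Hypothetical stand-in for the decoder cell: favours the next character of `target`."""
    def step(previous_char, position):
        next_char = target[position] if position < len(target) else EOW
        scores = {EOW: 0.0}
        scores[next_char] = 1.0
        return scores, position + 1
    return step
```

For example, `greedy_decode(make_toy_step("moci"), 0, 10)` stops at the end-of-word marker, while a limit of 2 cuts the same lemma short.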
2.4 Loss Function
We define the final loss function as the weighted sum of the losses of the tagger and the lemmatizer, comparing the predicted outputs with the expected outputs for the whole tags, the tag components, and the lemma characters. The tagger and lemmatizer losses are separately computed as the softmax cross entropy of the output logits. The weight hyperparameters scale the training losses so that the subtag and lemmatizer losses do not overpower the unfactored tag predictor. The tagger weights comprise one weight for the whole tags and one for every tag component (when the dataset provides no components, only the whole-tag weight is used).
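One way to write such a weighted sum is sketched below. The notation is an assumption made for illustration and may differ from the authors' formulation: $k$ is the sentence length, $n$ the number of tag components, $m_i$ the length of the $i$-th lemma, $t_i$, $t_{i,j}$ and $c_{i,j}$ the expected whole tag, tag component and lemma character, hats mark the corresponding predictions, $H$ is the softmax cross entropy, and $\lambda_0$, $\lambda_1, \dots, \lambda_n$, $\lambda_{\mathrm{lem}}$ are the weights for the whole tags, the tag components and the lemmatizer.

```latex
\mathcal{L}
  = \sum_{i=1}^{k} \left[
      \lambda_0 \, H\!\left(\hat{t}_i, t_i\right)
      + \sum_{j=1}^{n} \lambda_j \, H\!\left(\hat{t}_{i,j}, t_{i,j}\right)
      + \lambda_{\mathrm{lem}} \sum_{j=1}^{m_i} H\!\left(\hat{c}_{i,j}, c_{i,j}\right)
    \right]
```

With $n = 0$ the middle term vanishes, which matches the case of datasets that provide only unfactored tags.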
In this section, we show the outcomes of evaluation when running our joint tagger and lemmatizer and compare with the current state of the art in Czech, German, Arabic, and English datasets. Additionally, we evaluate the lemmatizer and tagger separately to compare the relative increase in tagging and lemmatization accuracy.
Our datasets consist of the Czech Prague Dependency Treebank (PDT) (Bejček et al., 2013), the German TIGER corpus (Brants et al., 2004), the Universal Dependencies Prague Arabic Dependency Treebank (UD-PADT) (Hajic et al., 2004), the Universal Dependencies English Web Treebank (UD-EWT) (Silveira et al., 2014), and the WSJ portion of the English Penn Treebank (tags only) (Marcus et al., 1993). In all datasets, we use the tags specific to their respective language. Of these datasets, only Czech and Arabic provide subcategorical tags, and we use unfactored tags for the rest. See Table 1 for tagger and lemmatizer accuracies.
Note that the PDT dataset disambiguates lemmas with the same textual representation by appending a number as lemma sense indicator. For example, the dataset contains disambiguated lemmas moc-1 (as power) and moc-2 (as too much). About 17.5% of the PDT tokens have such disambiguated lemmas. LemmaTag predicts the lemmas including the disambiguations and the accuracies in Table 1 take that into account. Ignoring the disambiguations, the lemmatization accuracy of the joint LemmaTag model is 98.94%.
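The two ways of scoring lemmas, with and without sense indicators, can be illustrated with a small helper. This is a sketch that assumes the indicator is always a trailing hyphen followed by digits, as in the moc-1 and moc-2 examples; the function names are hypothetical and this is not the evaluation script used for Table 1.

```python
import re

# Trailing "-<digits>" is treated as the sense indicator (assumption).
SENSE_SUFFIX = re.compile(r"-\d+$")


def strip_sense(lemma):
    """Remove a trailing numeric sense indicator such as the "-1" in "moc-1"."""
    return SENSE_SUFFIX.sub("", lemma)


def lemma_accuracy(predicted, gold, ignore_sense=False):
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    if ignore_sense:
        predicted = [strip_sense(p) for p in predicted]
        gold = [strip_sense(g) for g in gold]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

A prediction of moc-2 for a gold lemma moc-1 counts as an error in the strict setting but as correct once `ignore_sense=True` is passed.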
We use separate loss weights for the whole tags, for the tag component losses, and for the lemmatizer loss. The RNNs and word-embedding tables have dimensionality 768, except for the character-level embeddings and the character-level RNN, which are of dimension 384. The fully-connected layer that projects the tagger outputs into tag features is of dimension 256.
We train the models for 40 epochs with random permutations of training sentences and batches of 16 sentences. We scale the starting learning rate by 0.25 at epochs 20 and 30. We train the network using the lazy variant of the Adam optimizer (Kingma and Ba, 2014), which only updates accumulators for variables that appear in the current batch (TensorFlow, 2018). We clip the global gradient norm to 3.0.
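The learning-rate schedule and the gradient clipping described above can be sketched in a few lines of NumPy. The base learning rate is left as a parameter because its value is not given here, epochs are assumed to be counted from zero, and both function names are hypothetical.

```python
import numpy as np


def learning_rate(epoch, base_rate, decay=0.25, milestones=(20, 30)):
    """Multiply the base rate by `decay` once for every milestone already reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_rate * decay ** passed


def clip_by_global_norm(gradients, max_norm=3.0):
    """Rescale all gradients together so their joint L2 norm is at most `max_norm`."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    scale = min(1.0, max_norm / global_norm) if global_norm > 0 else 1.0
    return [g * scale for g in gradients], global_norm
```

Clipping by the global norm preserves the direction of the combined update, unlike clipping each gradient tensor separately.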
To prevent the tagger from over-fitting, we use several regularization strategies. We apply dropout with rate 0.5 as indicated in Figures 2 and 3. Word dropout (WD) replaces 25% of words with the unknown token <unk> to force the network to rely more on context, combating data sparsity issues. Lastly, we employ label smoothing (Pereyra et al., 2017), which prevents the network from being too confident in any one class. We apply label smoothing to the tagger logits (both whole tags and the tag components).
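Word dropout and label smoothing are both simple to express in code. The sketch below assumes that each word is dropped independently with the given probability and that smoothing mixes the one-hot target with a uniform distribution; the smoothing strength is a parameter because its value is not given here, and the function names are hypothetical.

```python
import numpy as np


def word_dropout(words, rate, rng, unk="<unk>"):
    """Independently replace each word by the unknown token with probability `rate`."""
    mask = rng.random(len(words)) < rate
    return [unk if drop else word for word, drop in zip(words, mask)]


def smooth_labels(label, num_classes, smoothing):
    """Move `smoothing` probability mass from the gold class to a uniform distribution."""
    target = np.full(num_classes, smoothing / num_classes)
    target[label] += 1.0 - smoothing
    return target
```

A call such as `word_dropout(sentence, 0.25, np.random.default_rng())` would be applied afresh to every training batch, so the same sentence loses different words in different epochs.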
The evaluation results show that performing lemmatization and tagging jointly, by sharing encoder parameters and utilizing tag features, is mutually beneficial in morphologically-rich languages. In languages with poor morphology such as English, sharing the encoder parameters may even hurt the performance of the tagger. In all cases, however, the lemmatizer benefits from using the predictions of the tagger as extra features, and doing so does not harm accuracy. Finally, we have shown that incorporating these ideas results in excellent performance, surpassing state-of-the-art in Czech, German, and Arabic POS tagging and lemmatization by a substantial margin.
- Abdul-Mageed et al. (2014) Muhammad Abdul-Mageed, Mona Diab, and Sandra Kübler. 2014. SAMAR: Subjectivity and sentiment analysis for Arabic social media. Computer Speech & Language, 28(1):20–37.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Bejček et al. (2013) Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Magda Ševčíková, Jan Štěpánek, and Šárka Zikánová. 2013. Prague Dependency Treebank 3.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- Bergmanis and Goldwater (2018) Toms Bergmanis and Sharon Goldwater. 2018. Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1391–1400.
- Brants et al. (2004) Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.
- Chakrabarty et al. (2017) Abhisek Chakrabarty, Onkar Arun Pandit, and Utpal Garain. 2017. Context sensitive lemmatization using two successive bidirectional gated recurrent networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1481–1491.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Eger et al. (2016) Steffen Eger, Rüdiger Gleim, and Alexander Mehler. 2016. Lemmatization and morphological tagging in German and Latin: A comparison and a survey of the state-of-the-art. In LREC.
- Fraser et al. (2012) Alexander Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. 2012. Modeling inflection and word-formation in SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 664–674. Association for Computational Linguistics.
- Hajič et al. (2009) Jan Hajič, Jan Raab, Miroslav Spousta, et al. 2009. Semi-supervised training for the averaged perceptron POS tagger. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 763–771. Association for Computational Linguistics.
- Hajic et al. (2004) Jan Hajic, Otakar Smrz, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in data and tools. In Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools, pages 110–117.
- Heigold et al. (2017) Georg Heigold, Guenter Neumann, and Josef van Genabith. 2017. An extensive empirical evaluation of character-based morphological tagging for 14 languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 505–513.
- Inoue et al. (2017) Go Inoue, Hiroyuki Shindo, and Yuji Matsumoto. 2017. Joint prediction of morphosyntactic categories for fine-grained Arabic part-of-speech tagging exploiting tag dictionary information. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 421–431.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Ling et al. (2015) Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Müller et al. (2015) Thomas Müller, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze. 2015. Joint lemmatization and morphological tagging with Lemming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2268–2274.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
- Santos and Zadrozny (2014) Cicero D Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
- Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
- Silveira et al. (2014) Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).
- Straka et al. (2016) Milan Straka, Jan Hajic, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).
- Straková et al. (2014) Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland. Association for Computational Linguistics.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
- TensorFlow (2018) TensorFlow. 2018. tf.contrib.opt.LazyAdamOptimizer: Class LazyAdamOptimizer. TensorFlow documentation from tensorflow.org.
- Tsarfaty et al. (2010) Reut Tsarfaty, Djamé Seddah, Yoav Goldberg, Sandra Kübler, Marie Candito, Jennifer Foster, Yannick Versley, Ines Rehbein, and Lamia Tounsi. 2010. Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 1–12. Association for Computational Linguistics.
Appendix A Supplemental Material
We ran all the tests on a GTX 1080 Ti. The joint LemmaTag training takes about 3 hours for Arabic, 4.5 hours for English, 12 hours for German (TIGER), and 22 hours for Czech (PDT); the separate models take about 50% more time. After training, predicting the lemmas and tags of the 219,000 test tokens of the PDT takes about 100 seconds.
A.2 Other techniques
We briefly summarize some of the additional techniques we have tried but which do not improve the results. While some of those techniques do help on smaller models and earlier in the training, the effect on the fully trained network seems to be marginal or even detrimental.
Separate sense prediction. Instead of predicting the sense disambiguation with the decoder, we tried to predict the sense as an additional classification problem with one dense layer, but it seems to perform slightly worse (by 0.2%).
Beam search decoder. We implemented a beam search decoder to replace the standard greedy one, but the improvement was marginal (around 0.01%).
Variational dropout. While the dropouts in LemmaTag are completely random, variational dropout erases the same channels across all time steps of the RNN. Although this generally improves training in convolutional networks and RNNs, we saw no significant difference.
Layer normalization. Layer normalization applied to the encoding RNNs did not bring significant gain and also slowed down the training.
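The difference between the completely random dropout and the variational dropout mentioned above comes down to how the mask is sampled. The following NumPy sketch contrasts the two; it assumes inverted-dropout scaling, the function names are hypothetical, and it is not taken from the LemmaTag code.

```python
import numpy as np


def standard_dropout_mask(steps, channels, rate, rng):
    """Sample an independent keep/drop decision for every time step and channel."""
    keep = rng.random((steps, channels)) >= rate
    return keep / (1.0 - rate)          # rescale kept units (inverted dropout)


def variational_dropout_mask(steps, channels, rate, rng):
    """Sample one mask per channel and reuse it at every time step."""
    keep = rng.random(channels) >= rate
    per_channel = keep / (1.0 - rate)
    return np.tile(per_channel, (steps, 1))
```

Multiplying the RNN inputs or states elementwise by either mask applies the corresponding form of dropout; only the variational mask has identical rows.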