Are All Languages Equally Hard to Language-Model?
For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.
Modern natural language processing practitioners strive to create modeling techniques that work well on all of the world’s languages. Indeed, most methods are portable in the following sense: Given appropriately annotated data, they should, in principle, be trainable on any language. However, despite this crude cross-linguistic compatibility, it is unlikely that all languages are equally easy, or that our methods are equally good at all languages.
In this work, we probe the issue, focusing on language modeling. A fair comparison is tricky. Training corpora in different languages have different sizes, and reflect the disparate topics of discussion in different linguistic communities, some of which may be harder to predict than others. Moreover, bits per character, a standard metric for language modeling, depends on the vagaries of a given orthographic system. We argue for a fairer metric based on the bits per utterance using utterance-aligned multi-text. That is, we train and test on “the same” set of utterances in each language, modulo translation. To avoid discrepancies in out-of-vocabulary handling, we evaluate open-vocabulary models.
We find that under standard approaches, text tends to be harder to predict in languages with fine-grained inflectional morphology. Specifically, language models perform worse on these languages in our controlled comparison. Furthermore, this performance difference essentially vanishes when we remove the inflectional markings.
Thus, in highly inflected languages, either the utterances have more content or the models are worse. (1) Text in highly inflected languages may be inherently harder to predict (higher entropy per utterance) if its extra morphemes carry additional, unpredictable information. (2) Alternatively, perhaps the extra morphemes are predictable in principle—for example, redundant marking of grammatical number on both subjects and verbs, or marking of object case even when it is predictable from semantics or word order—and yet our current language modeling technology fails to predict them. This might happen because (2a) the technology is biased toward modeling words or characters and fails to discover intermediate morphemes, or because (2b) it fails to capture the syntactic and semantic predictors that govern the appearance of the extra morphemes. We leave it to future work to tease apart these hypotheses.
2 Language Modeling
A traditional closed-vocabulary, word-level language model operates as follows: Given a fixed set of words V, the model provides a probability distribution over sequences of words, with parameters to be estimated from data. Most fixed-vocabulary language models employ a distinguished symbol unk that represents all words not present in V; these words are termed out-of-vocabulary (OOV).
Choosing the set V is something of a black art: Some practitioners choose the most frequent words up to a fixed vocabulary size (e.g., Mikolov et al., 2010) and others use all those words that appear at least twice in the training corpus. In general, replacing more words with unk artificially improves the perplexity measure but produces a less useful model. OOVs present something of a challenge for the cross-linguistic comparison of language models, especially in morphologically rich languages, which simply have more word forms.
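As a concrete illustration of the closed-vocabulary preprocessing described above, the sketch below builds V from words appearing at least twice and maps everything else to unk. The function names and the <unk> token spelling are ours, not the paper's:

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    """Keep words appearing at least min_count times; the rest are OOV."""
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def unk_replace(sent, vocab, unk="<unk>"):
    """Map out-of-vocabulary words to the distinguished unk symbol."""
    return [w if w in vocab else unk for w in sent]

corpus = [["the", "book", "is", "here"],
          ["the", "books", "are", "here"]]
vocab = build_vocab(corpus)                     # {"the", "here"}
print(unk_replace(["the", "book", "kitap"], vocab))
# ['the', '<unk>', '<unk>']
```

Note how a morphologically rich language, with more distinct word forms per lexeme, pushes more forms below any fixed frequency threshold and thus into unk.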
2.1 The Role of Inflectional Morphology
Inflectional morphology can explode the base vocabulary of a language. Compare, for instance, English and Turkish. The nominal inflectional system of English distinguishes two forms: a singular and a plural. The English lexeme book has the singular form book and the plural form books. In contrast, Turkish distinguishes at least 12: kitap, kitaplar, kitabı, kitabın, etc.
To compare the degree of morphological inflection in our evaluation languages, we use counting complexity (Sagot, 2013). This crude metric counts the number of inflectional categories distinguished by a language (e.g., English includes a category of 3rd-person singular present-tense verbs). We count the categories annotated in the language’s UniMorph (Kirov et al., 2018) lexicon. See Table 1 for the counting complexity of the evaluated languages.
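A minimal sketch of the counting-complexity metric: count the distinct inflectional feature bundles attested in a UniMorph-style lexicon of (lemma, form, features) triples. The entries below are toy illustrations, not real UniMorph data:

```python
# Toy UniMorph-style lexicon: (lemma, inflected form, feature bundle).
toy_lexicon = [
    ("book", "book",   "N;SG"),
    ("book", "books",  "N;PL"),
    ("walk", "walks",  "V;3;SG;PRS"),
    ("walk", "walked", "V;PST"),
]

def counting_complexity(lexicon):
    """Number of distinct inflectional categories (feature bundles)
    attested in the lexicon."""
    return len({feats for _, _, feats in lexicon})

print(counting_complexity(toy_lexicon))  # 4
```

A language like Turkish, with many more attested bundles per lemma, would score far higher under this count than English.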
2.2 Open-Vocabulary Language Models
To ensure comparability across languages, we require our language models to predict every character in an utterance, rather than skipping some characters because they appear in words that were (arbitrarily) designated as OOV in that language. Such models are known as “open-vocabulary” LMs.
Let ⊔ denote disjoint union, i.e., A = B ⊔ C iff A = B ∪ C and B ∩ C = ∅.
Let Σ be a discrete alphabet of characters, including a distinguished unknown-character symbol unk.
Baseline n-gram LM.
We train “flat” hybrid word/character open-vocabulary n-gram models (Bisani and Ney, 2005), defined over strings from a vocabulary with mutually disjoint subsets: full words W, single characters Σ, and the special symbols eow and eos, which mark end-of-word and end-of-string, respectively. Single characters σ ∈ Σ are distinguished in the model from single-character full words w ∈ W, e.g., the character a versus the word a. N-gram histories are either word-boundary or word-internal (corresponding to a whitespace tokenization). String-internal word boundaries are always separated by a single whitespace character.
Symbols are generated from a multinomial given the history h, leading to a new history that now includes the symbol and is truncated to the Markov order. Word-boundary histories can generate full words w ∈ W, characters σ ∈ Σ, or eos. If eos, the string is ended. If w ∈ W, it has an implicit eow and the model transitions to a word-boundary history. If σ ∈ Σ, it transitions to a word-internal history. Word-internal histories can generate symbols σ ∈ Σ or eow, and transition to a word-boundary history if eow, otherwise to a word-internal history.
We use standard Kneser and Ney (1995) model training, with distributions at word-internal histories constrained so as to only provide probability mass for symbols σ ∈ Σ and eow. We train 7-gram models, but prune n-grams with long word-only histories, i.e., 6- and 7-gram histories must include at least one character σ ∈ Σ. To establish the vocabularies W and Σ, we replace exactly one instance of each word type with its spelled-out version. Singleton words are thus excluded from W, and character-sequence observations from all types are included in training. Note any word can also be generated as a character sequence. For perplexity calculation, we sum the probabilities for each way of generating the word.
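One way to implement the vocabulary setup just described is sketched below. The function names, the <eow> token spelling, and the choice to spell out the first occurrence of each type are our illustrative assumptions:

```python
def spell_out(word, eow="<eow>"):
    """Represent a word as its character sequence plus an explicit
    end-of-word symbol."""
    return list(word) + [eow]

def hybrid_training_stream(tokens):
    """Replace exactly one instance of each word type with its
    spelled-out version.  Word types seen only once therefore occur
    purely as character sequences and end up excluded from the word
    vocabulary W."""
    seen = set()
    out = []
    for w in tokens:
        if w in seen:
            out.append(w)             # later instances stay word-level
        else:
            seen.add(w)
            out.extend(spell_out(w))  # first instance is spelled out
    return out

print(hybrid_training_stream(["the", "cat", "saw", "the", "cat"]))
```

Here "saw" occurs once, so it appears only as a character sequence, while "the" and "cat" contribute both word-level and character-level observations.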
While neural language models can also take a hybrid approach (Hwang and Sung, 2017; Kawakami et al., 2017), recent advances indicate that full character-level modeling is now competitive with word-level modeling. A large part of this is due to the use of recurrent neural networks (Mikolov et al., 2010), which can generalize about how the distribution over the next character depends on the preceding context.
We use a long short-term memory (LSTM) LM (Sundermeyer et al., 2012), identical to that of Zaremba et al. (2014), but at the character level. To obtain the hidden state h_t at time step t, one feeds the left context to the LSTM: h_t = LSTM(h_{t-1}, e(c_{t-1})), where the model uses a learned embedding vector e(c) to represent each character type; the LSTM recursion is the procedure described in Hochreiter and Schmidhuber (1997). Then, the probability distribution over the next character is p(c_t | c_1, ..., c_{t-1}) = softmax(W h_t + b), where W and b are parameters.
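The output layer of this model can be sketched as follows. The LSTM recursion itself is omitted; h_t simply stands in for the hidden state, and all sizes are toy values, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_chars = 4, 6              # toy sizes, not the paper's

# W and b are the output-layer parameters from the equation above;
# h_t stands in for the LSTM hidden state at time step t.
W = rng.normal(size=(n_chars, hidden_size))
b = rng.normal(size=n_chars)
h_t = rng.normal(size=hidden_size)

def softmax(z):
    z = z - z.max()                      # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

# Distribution over the character alphabet at time step t.
p_next_char = softmax(W @ h_t + b)
```

The resulting vector is a proper probability distribution over the alphabet, from which per-character log-probabilities (and hence bits) are read off.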
Parameters for all models are estimated on the training portion and model selection is performed on the development portion. The neural models are trained with SGD (Robbins and Monro, 1951) with gradient clipping, such that each component of the gradient is limited to a fixed maximum absolute value. We optimize for 100 iterations and perform early stopping (on the development portion). We employ a fixed-size character embedding and 2 hidden layers.
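The component-wise gradient clipping mentioned above amounts to the following one-liner. The threshold of 1.0 is illustrative; the paper's actual value is not reproduced here:

```python
import numpy as np

def clip_components(grad, max_abs=1.0):
    """Element-wise gradient clipping: each component of the gradient
    is limited to the interval [-max_abs, +max_abs].  The threshold
    is an illustrative assumption."""
    return np.clip(grad, -max_abs, max_abs)

g = np.array([0.3, -2.5, 7.0])
print(clip_components(g))        # components clipped into [-1, 1]
```

Unlike clipping by global norm, this variant can change the gradient's direction, but it guarantees a hard bound on every component's magnitude.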
3 A Fairer Evaluation: Multi-Text
Effecting a cross-linguistic study of LMs is complicated because different models could be trained and tested on incomparable corpora. To avoid this problem, we use multi-text: 21-way translations of the same semantic content.
What’s wrong with bits per character?
Open-vocabulary language modeling is most commonly evaluated under bits per character (BPC): BPC(u) = −log2 p(u) / |u|, where u is the character sequence of the utterance and |u| its length.
Bits per English Character.
Multi-text allows us to compute a fair metric that is invariant to the orthographic (or phonological) changes discussed above: bits per English character (BPEC). BPEC(u) = −log2 p(u) / |u_EN|, where u_EN is the English character sequence in the utterance aligned to u. The choice of English is arbitrary, as any other choice of language would simply scale the values by a constant factor.
Note that this metric is essentially capturing the overall bits per utterance, and that normalizing using English characters only makes numbers independent of the overall utterance length; it is not critical to the analysis we perform in this paper.
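The two metrics differ only in the normalizer, as this sketch makes explicit (the log-probability and lengths below are hypothetical numbers, not results from the paper):

```python
def bpc(logprob2, n_chars):
    """Bits per character: negative log2-probability of the utterance,
    normalized by its own length in characters."""
    return -logprob2 / n_chars

def bpec(logprob2, n_english_chars):
    """Bits per English character: same numerator, but normalized by
    the length of the aligned English utterance."""
    return -logprob2 / n_english_chars

# Hypothetical utterance: the model assigns log2 p(u) = -120 bits;
# u has 60 characters, while its English alignment has 50.
print(bpc(-120, 60))    # 2.0
print(bpec(-120, 50))   # 2.4
```

Because the numerator is the total bits for the utterance, comparing BPEC across languages on multi-text compares how many bits each language's models spend on (approximately) the same content.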
A Potential Confound: Translationese.
Working with multi-text, however, does introduce a new bias: each utterance in the corpus has a source language, along with 20 translations of that source utterance into the other languages. The characteristics of translated language have been widely studied and exploited, one prominent characteristic of translations being simplification (Baker, 1993).
Note that a significant fraction of the original utterances in the corpus are English. Our analysis may then have underestimated the BPEC for other languages, to the extent that their sentences consist of simplified “translationese.” Even so, English had the lowest BPEC of all the languages evaluated.
4 Experiments and Results
Our experiments are conducted on the 21 languages of the Europarl corpus (Koehn, 2005). The corpus consists of utterances made in the European Parliament, aligned cross-linguistically by a unique utterance id. With the exceptions (noted in Table 1) of Finnish, Hungarian, and Estonian, which are Uralic, the languages are Indo-European.
While Europarl does not contain quite our desired breadth of typological diversity, it serves our purpose by providing large collections of aligned data across many languages.
To create our experimental data, we extract all utterances and randomly sort them into train-development-test splits such that roughly 80% of the data are in train and 10% each in development and test.
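Splitting by the shared utterance id keeps the same utterances in the same split for every language, which is what makes the cross-linguistic comparison controlled. A minimal sketch, with an illustrative seed and thresholds:

```python
import random
from collections import Counter

def assign_split(utterance_ids, seed=0):
    """Randomly assign utterance ids to train/dev/test at roughly
    80/10/10.  Because the assignment keys on the language-independent
    id, every language's corpus is split identically."""
    rng = random.Random(seed)
    split = {}
    for uid in utterance_ids:
        r = rng.random()
        split[uid] = "train" if r < 0.8 else ("dev" if r < 0.9 else "test")
    return split

split = assign_split(range(10000))
print(Counter(split.values()))   # roughly 8000 / 1000 / 1000
```

Any language-specific utterance file can then be routed by looking up its id in this single shared mapping.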
Experimentally, we want to show: (i) when evaluating models in a controlled environment (multi-text under BPEC), the models achieve lower performance on certain languages, and (ii) inflectional morphology is the primary culprit for the performance differences. However, we repeat that we do not in this paper tease apart whether the models are at fault or whether certain languages inherently encode more information.
5 Discussion and Analysis
The Effect of BPEC.
The first major take-away is that BPEC offers a cleaner cross-linguistic comparison than BPC. Were we to rank the languages by BPC (lowest to highest), we would find that English was in the middle of the pack, which is surprising as new language models are often only tuned on English itself. For example, BPC surprisingly suggests that French is easier to model than English. However, ranking under BPEC shows that the LSTM has the easiest time modeling English itself. Scandinavian languages Danish and Swedish have BPEC closest to English; these languages are typologically and genetically similar to English.
n-gram versus LSTM.
As expected, the LSTM outperforms the baseline n-gram models across the board. In addition, however, n-gram modeling yields relatively poor performance on some languages, such as Dutch, with only modestly more complex inflectional morphology than English. Other phenomena (e.g., perhaps, compounding) may also be poorly modeled by n-grams.
The Impact of Inflectional Morphology.
Another major take-away is that rich inflectional morphology is a difficulty for both n-gram and LSTM LMs. In this section we give numbers for the LSTMs. Studying Fig. 1(a), we find that Spearman’s rank correlation between a language’s BPEC and its counting complexity (§ 2.1) is quite high and statistically significant. This clear correlation between the level of inflectional morphology and LSTM performance indicates that character-level models do not automatically fix the problem of morphological richness. If we lemmatize the words, however (Fig. 1(b)), the correlation becomes insignificant and in fact slightly negative. The difference of the two previous graphs (Fig. 1(c)) shows more clearly that the LM penalty for modeling inflectional endings is greater for languages with higher counting complexity. Indeed, this penalty is arguably a more appropriate measure of the complexity of the inflectional system. See also Fig. 2.
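Spearman's rank correlation is just the Pearson correlation computed on ranks, as the following self-contained sketch shows. The complexity/BPEC numbers below are purely illustrative, not the paper's measurements:

```python
def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1           # average rank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy data: counting complexity vs. BPEC (illustrative numbers only).
complexity = [6, 12, 30, 54, 118]
bpec_vals  = [1.1, 1.3, 1.2, 1.5, 1.6]
print(spearman(complexity, bpec_vals))  # 0.9
```

Because only ranks matter, the statistic is insensitive to the scale of either counting complexity or BPEC, which suits a comparison across such different measures.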
The differences in BPEC among languages are reduced when we lemmatize, with the standard deviation across languages dropping substantially. Zooming in on Finnish (see Table 1), we see that Finnish forms are harder to model than English forms, but Finnish lemmata are easier to model than English ones. This is strong evidence that it was primarily the inflectional morphology, which lemmatization strips, that caused the difference in the model’s performance on these two languages.
6 Related Work
Recurrent neural language models can effectively learn complex dependencies, even in open-vocabulary settings (Hwang and Sung, 2017; Kawakami et al., 2017). Whether the models are able to learn particular syntactic interactions is an intriguing question, and some methodologies have been presented to tease apart under what circumstances variously-trained models encode attested interactions (Linzen et al., 2016; Enguehard et al., 2017). While the sort of detailed, construction-specific analyses in these papers is surely informative, our evaluation is language-wide.
MT researchers have investigated whether an English sentence contains enough information to predict the fine-grained inflections used in its foreign-language translations (see Kirov et al., 2017).
Sproat et al. (2014) present a corpus of close translations of sentences in typologically diverse languages along with detailed morphosyntactic and morphosemantic annotations, as a means for assessing linguistic complexity for comparable messages, though they expressly do not take an information-theoretic approach to measuring complexity. In the linguistics literature, McWhorter (2001) argues that certain languages are less complex than others: he claims that Creoles are simpler. Müller et al. (2012) compare LMs on Europarl, but do not compare performance across languages.
7 Conclusion
We have presented a clean method for the cross-linguistic comparison of language modeling: We assess whether a language modeling technique can compress a sentence and its translations equally well. We show an interesting correlation between the morphological richness of a language and the performance of the model. In an attempt to explain causation, we also run our models on lemmatized versions of the corpora, showing that, upon the removal of inflection, no such correlation between morphological richness and LM performance exists. It is still unclear, however, whether the performance difference originates from the inherent difficulty of the languages or from the models.
- One might have expected a priori that some difference would remain, because most highly inflected languages can also vary word order to mark a topic-focus distinction, and this (occasional) marking is preserved in our experiment.
- The set of graphemes in these languages can be assumed to be closed, but external graphemes may on rare occasion appear in random text samples. These are rare enough to not materially affect the metrics.
- The model can be extended to handle consecutive whitespace characters or punctuation at word boundaries; for this paper, the tokenization split punctuation from words and reduced consecutive whitespaces to one, hence the simpler model.
- As Zaremba et al. (2014) indicate, increasing the number of parameters may allow us to achieve better performance.
- To aggregate this over an entire test corpus, we replace the numerator and also the denominator by summations over all utterances u.
- Why not work with phonological characters, rather than orthographic ones, obtaining /putʃ/ for both Czech and German? Sadly, this option is also fraught with problems, as many languages have perfectly predictable phonological elements that will artificially lower the score.
- Characters appearing fewer than a threshold number of times in train are replaced by the unknown-character symbol unk.
- Mona Baker. 1993. Corpus linguistics and translation studies: Implications and applications. Text and Technology: In Honour of John Sinclair 233:250.
- Maximilian Bisani and Hermann Ney. 2005. Open vocabulary speech recognition with flat hybrid models. In INTERSPEECH. pages 725--728.
- Émile Enguehard, Yoav Goldberg, and Tal Linzen. 2017. Exploring the syntactic abilities of RNNs with multi-task learning. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, pages 3--14. https://doi.org/10.18653/v1/K17-1003.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735--1780.
- Kyuyeon Hwang and Wonyong Sung. 2017. Character-level language modeling with hierarchical recurrent neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pages 5720--5724.
- Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, Vancouver, Canada, pages 1492--1502. http://aclweb.org/anthology/P17-1137.
- Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Arya McCarthy, Sabrina J. Mielke, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. Unimorph 2.0: Universal morphology. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA).
- Christo Kirov, John Sylak-Glassman, Rebecca Knowles, Ryan Cotterell, and Matt Post. 2017. A rich morphological tagger for English: Exploring the cross-linguistic tradeoff between morphology and syntax. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, pages 112--117. http://aclweb.org/anthology/E17-2018.
- Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing, ICASSP. pages 181--184.
- Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit. volume 5, pages 79--86.
- Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association of Computational Linguistics 4:521--535. http://www.aclweb.org/anthology/Q16-1037.
- John McWhorter. 2001. The world’s simplest grammars are creole grammars. Linguistic Typology 5(2):125--66.
- Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association. pages 1045--1048. http://www.isca-speech.org/archive/interspeech_2010/i10_1045.html.
- Thomas Müller, Hinrich Schütze, and Helmut Schmid. 2012. A comparative investigation of morphological language modeling for the languages of the European Union. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Montréal, Canada, pages 386--395. http://www.aclweb.org/anthology/N12-1043.
- Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics pages 400--407.
- Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity.
- Richard Sproat, Bruno Cartoni, HyunJeong Choe, David Huynh, Linne Ha, Ravindran Rajakumar, and Evelyn Wenzel-Grondie. 2014. A database for measuring linguistic information content. In LREC.
- Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In LREC.
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR abs/1409.2329. http://arxiv.org/abs/1409.2329.