The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years’ inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year’s strong baselines or highly ranked systems from previous years’ shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.
While producing a sentence, humans combine various types of knowledge to produce fluent output—various shades of meaning are expressed through word selection and tone, while the language is made to conform to underlying structural rules via syntax and morphology. Native speakers are often quick to identify disfluency, even if the meaning of a sentence is mostly clear.
Automatic systems must also consider these constraints when constructing or processing language. Strong enough language models can often reconstruct common syntactic structures, but are insufficient to properly model morphology. Many languages implement large inflectional paradigms that mark both function and content words with a varying levels of morphosyntactic information. For instance, Romanian verb forms inflect for person, number, tense, mood, and voice; meanwhile, Archi verbs can take on thousands of forms (Kibrik, 1998). Such complex paradigms produce large inventories of words, all of which must be producible by a realistic system, even though a large percentage of them will never be observed over billions of lines of linguistic input. Compounding the issue, good inflectional systems often require large amounts of supervised training data, which is infeasible in many of the world’s languages.
This year’s shared task is concentrated on encouraging the construction of strong morphological systems that perform two related but different inflectional tasks. The first task asks participants to create morphological inflectors for a large number of under-resourced languages, encouraging systems that use highly-resourced, related languages as a cross-lingual training signal. The second task welcomes submissions that invert this operation in light of contextual information: Given an unannotated sentence, lemmatize each word, and tag them with a morphosyntactic description. Both of these tasks extend upon previous morphological competitions, and the best submitted systems now represent the state of the art in their respective tasks.
2 Tasks and Evaluation
2.1 Task 1: Cross-lingual transfer for morphological inflection
Annotated resources for the world’s languages are not distributed equally—some languages simply have more as they have more native speakers willing and able to annotate more data. We explore how to transfer knowledge from high-resource languages that are genetically related to low-resource languages.
The first task iterates on last year’s main task: morphological inflection (Cotterell et al., 2018). Instead of giving some number of training examples in the language of interest, we provided only a limited number in that language. To accompany it, we provided a larger number of examples in either a related or unrelated language. Each test example asked participants to produce some other inflected form when given a lemma and a bundle of morphosyntactic features as input. The goal, thus, is to perform morphological inflection in the low-resource language, having hopefully exploited some similarity to the high-resource language. Models which perform well here can aid downstream tasks like machine translation in low-resource settings. All datasets were resampled from UniMorph, which makes them distinct from past years.
The mode of the task is inspired by Zoph et al. (2016), who fine-tune a model pre-trained on a high-resource language to perform well on a low-resource language. We do not, though, require that models be trained by fine-tuning. Joint modeling or any number of methods may be explored instead.
The model will have access to type-level data in a low-resource target language, plus a high-resource source language. We give an example here of Asturian as the target language with Spanish as the source language.
We score the output of each system in terms of its predictions’ exact-match accuracy and the average Levenshtein distance between the predictions and their corresponding true forms.
2.2 Task 2: Morphological analysis in context
Although inflection of words in a context-agnostic manner is a useful evaluation of the morphological quality of a system, people do not learn morphology in isolation.
In 2018, the second task of the CoNLL–SIGMORPHON Shared Task Cotterell et al. (2018) required submitting systems to complete an inflectional cloze task (Taylor, 1953) given only the sentential context and the desired lemma – an example of the problem is given in the following lines: A successful system would predict the plural form “dogs”. Likewise, a Spanish word form \sayayuda may be a feminine noun or a third-person verb form, which must be disambiguated by context.
This year’s task extends the second task from last year. Rather than inflect a single word in context, the task is to provide a complete morphological tagging of a sentence: for each word, a successful system will need to lemmatize and tag it with a morphsyntactic description (MSD).
width= The dogs are barking . the dog be bark . DET N;PL V;PRS;3;PL V;V.PTCP;PRS PUNCT
Context is critical—depending on the sentence, identical word forms realize a large number of potential inflectional categories, which will in turn influence lemmatization decisions. If the sentence were instead “The barking dogs kept us up all night”, “barking” is now an adjective, and its lemma is also “barking”.
3.1 Data for Task 1
We presented data in 100 language pairs spanning 79 unique languages. Data for all but four languages (Basque, Kurmanji, Murrinhpatha, and Sorani) are extracted from English Wiktionary, a large multi-lingual crowd-sourced dictionary with morphological paradigms for many lemmata.
For each language, the basic data consists of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb). For each language pair, separate files contain the high- and low-resource training examples.
Extraction from Wiktionary
For each of the Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. As in the previous iteration, tables were extracted using a template annotation procedure described in (Kirov et al., 2018).
Sampling data splits
From each language’s collection of paradigms, we sampled the training, development, and test sets as in 2018.
Our first step was to construct probability distributions over the (lemma, feature bundle, inflected form) triples in our full dataset.
For each triple, we counted how many tokens the inflected form has in the February 2017 dump of Wikipedia for that language.
To distribute the counts of an observed form over all the triples that have this token as its form, we follow the method used in the previous shared task (Cotterell et al., 2018), training a neural network on unambiguous forms to estimate the distribution over all, even ambiguous, forms.
We then sampled 12,000 triples without replacement from this distribution. The first 100 were taken as training data for low-resource settings. The first 10,000 were used as high-resource training sets. As these sets are nested, the highest-count triples tend to appear in the smaller training sets.
The final 2000 triples were randomly shuffled and then split in half to obtain development and test sets of 1000 forms each.
We further adopted some changes to increase compatibility. Namely, we corrected some annotation errors created while scraping Wiktionary for the 2018 task, and we standardized Romanian t-cedilla and t-comma to t-comma. (The same was done with s-cedilla and s-comma.)
3.2 Data for Task 2
Our data for task 2 come from the Universal Dependencies treebanks (UD; Nivre et al., 2018, v2.3), which provides pre-defined training, development, and test splits and annotations in a unified annotation schema for morphosyntax and dependency relationships. Unlike the 2018 cloze task which used UD data, we require no manual data preparation and are able to leverage all 107 monolingual treebanks. As is typical, data are presented in CoNLL-U format,
The morphological annotations for the 2019 shared task were converted to the UniMorph schema (Kirov et al., 2018) according to McCarthy et al. (2018), who provide a deterministic mapping that increases agreement across languages. This also moves the part of speech into the bundle of morphological features. We do not attempt to individually correct any errors in the UD source material. Further, some languages received additional pre-processing. In the Finnish data, we removed morpheme boundaries that were present in the lemmata (e.g., puhe#kieli puhekieli ‘spoken+language’). Russian lemmata in the GSD treebank were presented in all uppercase; to match the 2018 shared task, we lowercased these. In development and test data, all fields except for form and index within the sentence were struck.
4.1 Task 1 Baseline
We include four neural sequence-to-sequence models mapping lemma into inflected word forms: soft attention Luong et al. (2015), non-monotonic hard attention Wu et al. (2018), monotonic hard attention and a variant with offset-based transition distribution Wu and Cotterell (2019). Neural sequence-to-sequence models with soft attention Luong et al. (2015) have dominated previous SIGMORPHON shared tasks Cotterell et al. (2017). Wu et al. (2018) instead models the alignment between characters in the lemma and the inflected word form explicitly with hard attention and learns this alignment and transduction jointly. Wu and Cotterell (2019) shows that enforcing strict monotonicity with hard attention is beneficial in tasks such as morphological inflection where the transduction is mostly monotonic. The encoder is a biLSTM while the decoder is a left-to-right LSTM. All models use multiplicative attention and have roughly the same number of parameters. In the model, a morphological tag is fed to the decoder along with target character embeddings to guide the decoding. During the training of the hard attention model, dynamic programming is applied to marginalize all latent alignments exactly.
4.2 Task 2 Baselines
(Müller et al., 2015): The Lemming model is a log-linear model that performs joint morphological tagging and lemmatization. The model is globally normalized with the use of a second order linear-chain CRF. To efficiently calculate the partition function, the choice of lemmata are pruned with the use of pre-extracted edit trees.
(Malaviya et al., 2019): This is a state-of-the-art neural model that also performs joint morphological tagging and lemmatization, but also accounts for the exposure bias with the application of maximum likelihood (MLE). The model stitches the tagger and lemmatizer together with the use of jackknifing Agić and Schluter (2017) to expose the lemmatizer to the errors made by the tagger model during training. The morphological tagger is based on a character-level biLSTM embedder that produces the embedding for a word, and a word-level biLSTM tagger that predicts a morphological tag sequence for each word in the sentence. The lemmatizer is a neural sequence-to-sequence model Wu and Cotterell (2019) that uses the decoded morphological tag sequence from the tagger as an additional attribute. The model uses hard monotonic attention instead of standard soft attention, along with a dynamic programming based training scheme.
The SIGMORPHON 2019 shared task received 30 submissions—14 for task 1 and 16 for task 2—from 23 teams. In addition, the organizers’ baseline systems were evaluated.
5.1 Task 1 Results
Five teams participated in the first Task, with a variety of methods aimed at leveraging the cross-lingual data to improve system performance.
The University of Alberta (UAlberta) performed a focused investigation on four language pairs, training cognate-projection systems from external cognate lists. Two methods were considered: one which trained a high-resource neural encoder-decoder, and projected the test data into the HRL, and one that projected the HRL data into the LRL, and trained a combined system. Results demonstrated that certain language pairs may be amenable to such methods.
The Tuebingen University submission (Tuebingen) aligned source and target to learn a set of edit-actions with both linear and neural classifiers that independently learned to predict action sequences for each morphological category. Adding in the cross-lingual data only led to modest gains.
AX-Semantics combined the low- and high-resource data to train an encoder-decoder seq2seq model; optionally also implementing domain adaptation methods to focus later epochs on the target language.
The CMU submission first attends over a decoupled representation of the desired morphological sequence before using the updated decoder state to attend over the character sequence of the lemma. Secondly, in order to reduce the bias of the decoder’s language model, they hallucinate two types of data that encourage common affixes and character copying. Simply allowing the model to learn to copy characters for several epochs significantly out-performs the task baseline, while further improvements are obtained through fine-tuning. Making use of an adversarial language discriminator, cross lingual gains are highly-correlated to linguistic similarity, while augmenting the data with hallucinated forms and multiple related target language further improves the model.
The system from IT-IST also attends separately to tags and lemmas, using a gating mechanism to interpolate the importance of the individual attentions. By combining the gated dual-head attention with a SparseMax activation function, they are able to jointly learn stem and affix modifications, improving significantly over the baseline system.
The relative system performance is described in Table 5, which shows the average per-language accuracy of each system. The table reflects the fact that some teams submitted more than one system (e.g. Tuebingen-1 & Tuebingen-2 in the table).
5.2 Task 2 Results
Nine teams submitted system papers for Task 2, with several interesting modifications to either the baseline or other prior work that led to modest improvements.
Charles-Saarland achieved the highest overall tagging accuracy by leveraging multi-lingual BERT embeddings fine-tuned on a concatenation of all available languages, effectively transporting the cross-lingual objective of Task 1 into Task 2. Lemmas and tags are decoded separately (with a joint encoder and separate attention); Lemmas are a sequence of edit-actions, while tags are calculated jointly. (There is no splitting of tags into features; tags are atomic.)
CBNU instead lemmatize using a transformer network, while performing tagging with a multilayer perceptron with biaffine attention. Input words are first lemmatized, and then pipelined to the tagger, which produces atomic tag sequences (i.e., no splitting of features).
The team from Istanbul Technical University (ITU) jointly produces lemmatic edit-actions and morphological tags via a two level encoder (first word embeddings, and then context embeddings) and separate decoders. Their system slightly improves over the baseline lemmatization, but significantly improves tagging accuracy.
The team from the University of Groningen (RUG) also uses separate decoders for lemmatization and tagging, but uses ELMo to initialize the contextual embeddings, leading to large gains in performance. Furthermore, joint training on related languages further improves results.
CMU approaches tagging differently than the multi-task decoding we’ve seen so far (baseline is used for lemmatization). Making use of a hierarchical CRF that first predicts POS (that is subsequently looped back into the encoder), they then seek to predict each feature separately. In particular, predicting POS separately greatly improves results. An attempt to leverage gold typological information led to little gain in the results; experiments suggest that the system is already learning the pertinent information.
The team from Ohio State University (OHIOSTATE) concentrates on predicting tags; the baseline lemmatizer is used for lemmatization. To that end, they make use of a dual decoder that first predicts features given only the word embedding as input; the predictions are fed to a GRU seq2seq, which then predicts the sequence of tags.
The UNT HiLT+Ling team investigates a low-resource setting of the tagging, by using parallel Bible data to learn a translation matrix between English and the target language, learning morphological tags through analogy with English.
The UFAL-Prague team extends their submission from the UD shared task (multi-layer LSTM), replacing the pretrained embeddings with BERT, to great success (first in lemmatization, 2nd in tagging). Although they predict complete tags, they use the individual features to regularize the decoder. Small gains are also obtained from joining multi-lingual corpora and ensembling.
CUNI–Malta performs lemmatization as operations over edit actions with LSTM and ReLU. Tagging is a bidirectional LSTM augmented by the edit actions (i.e., two-stage decoding), predicting features separately.
The Edinburgh system is a character-based LSTM encoder-decoder with attention, implemented in OpenNMT. It can be seen as an extension of the contextual lemmatization system Lematus Bergmanis and Goldwater (2018) to include morphological tagging, or alternatively as an adaptation of the morphological re-inflection system MED Kann and Schütze (2016) to incorporate context and perform analysis rather than re-inflection. Like these systems it uses a completely generic encoder-decoder architecture with no specific adaptation to the morphological processing task other than the form of the input. In the submitted version of the system, the input is split into short chunks corresponding to the target word plus one word of context on either side, and the system is trained to output the corresponding lemmas and tags for each three-word chunk.
Several teams relied on external resources to improve their lemmatization and feature analysis. Several teams made use of pre-trained embeddings. CHARLES-SAARLAND-2 and UFALPRAGUE-1 used pretrained contextual embeddings (BERT) provided by Google (Devlin et al., 2019). CBNU-1 used a mix of pre-trained embeddings from the CoNLL 2017 shared task and fastText. Further, some teams trained their own embeddings to aid performance.
6 Future Directions
In general, the application of typology to natural language processing (e.g., Gerz et al., 2018; Ponti et al., 2018) provides an interesting avenue for multilinguality. Further, our shared task was designed to only leverage a single helper language, though many may exist with lexical or morphological overlap with the target language. Techniques like those of Neubig and Hu (2018) may aid in designing universal inflection architectures. Neither task this year included unannotated monolingual corpora. Using such data is well-motivated from an L1-learning point of view, and may affect the performance of low-resource data settings.
In the case of inflection an interesting future topic could involve departing from orthographic representation and using more IPA-like representations, i.e. transductions over pronunciations. Different languages, in particular those with idiosyncratic orthographies, may offer new challenges in this respect.
Only one team tried to learn inflection in a multilingual setting—i.e. to use all training data to train one model. Such transfer learning is an interesting avenue of future research, but evaluation could be difficult. Whether any cross-language transfer is actually being learned vs. whether having more data better biases the networks to copy strings is an evaluation step to disentangle.
Creating new data sets that accurately reflect learner exposure (whether L1 or L2) is also an important consideration in the design of future shared tasks. One pertinent facet of this is information about inflectional categories—often the inflectional information is insufficiently prescribed by the lemma, as with the Romanian verbal inflection classes or nominal gender in German.
As we move toward multilingual models for morphology, it becomes important to understand which representations are critical or irrelevant for adapting to new languages; this may be probed in the style of (Thompson et al., 2018), and it can be used as a first step toward designing systems that avoid \saycatastrophic forgetting as they learn to inflect new languages (Thompson et al., 2019).
Future directions for Task 2 include exploring cross-lingual analysis—in stride with both Task 1 and Malaviya et al. (2018)—and leveraging these analyses in downstream tasks.
The SIGMORPHON 2019 shared task provided a type-level evaluation on 100 language pairs in 79 languages and a token-level evaluation on 107 treebanks in 66 languages, of systems for inflection and analysis. On task 1 (low-resource inflection with cross-lingual transfer), 14 systems were submitted, while on task 2 (lemmatization and morphological feature analysis), 16 systems were submitted. All used neural network models, completing a trend in past years’ shared tasks and other recent work on morphology.
In task 1, gains from cross-lingual training were generally modest, with gains positively correlating with the linguistic similarity of the two languages.
In the second task, several methods were implemented by multiple groups, with the most successful systems implementing variations of multi-headed attention, multi-level encoding, multiple decoders, and ELMo and BERT contextual embeddings.
We have released the training, development, and test sets, and expect these datasets to provide a useful benchmark for future research into learning of inflectional morphology and string-to-string transduction.
MS has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113).
- The Basque language data was extracted from a manually designed finite-state morphological analyzer (Alegria et al., 2009). Murrinhpatha data was donated by John Mansfield; it is discussed in Mansfield (2019). Data for Kurmanji Kurdish and Sorani Kurdish were created as part of the Alexina project (Walther et al., 2010; Walther and Sagot, 2010).
- These datasets can be obtained from https://sigmorphon.github.io/sharedtasks/2019/
- Several high-resource languages had necessarily fewer, but on a similar order of magnitude. Bengali, Uzbek, Kannada, Swahili. Likewise, the low-resource language Telugu had fewer than 100 forms.
- When sufficient data are unavailable, we instead use 50 or 100 examples.
- This mimics a realistic setting, as supervised training is usually employed to generalize from frequent words that appear in annotated resources to less frequent words that do not. Unsupervised learning methods also tend to generalize from more frequent words (which can be analyzed more easily by combining information from many contexts) to less frequent ones.
- Although some work suggests that working with IPA or phonological distinctive features in this context yields very similar results to working with graphemes Wiemerslage et al. (2018).
- This has been addressed by \newcitejin-kann-2017-exploring.
- How (not) to train a dependency parser: the curious case of jackknifing part-of-speech taggers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 679–684. External Links: Cited by: §4.2.
- Porting Basque morphological grammars to foma, an open-source tool. In International Workshop on Finite-State Methods and Natural Language Processing, pp. 105–113. Cited by: footnote 1.
- Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1391–1400. External Links: Cited by: §5.2.
- The CoNLL–SIGMORPHON 2018 shared task: universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, Brussels, pp. 1–27. External Links: Cited by: §2.1, §2.2, §3.1.
- CoNLL-SIGMORPHON 2017 shared task: universal morphological reinflection in 52 languages. Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pp. 1–30. Cited by: §4.1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Cited by: §5.2.
- Language modeling for morphologically rich languages: character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics 6, pp. 451–465. External Links: Cited by: §6.
- MED: the LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Berlin, Germany, pp. 62–70. External Links: Cited by: §5.2.
- Archi. In The Handbook of Morphology, A. Spencer and A. M. Zwicky (Eds.), pp. 455–476. Cited by: §1.
- UniMorph 2.0: Universal Morphology. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan. External Links: Cited by: §3.1, §3.2.
- Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Cited by: §4.1.
- Neural factor graph models for cross-lingual morphological tagging. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2653–2663. External Links: Cited by: §6.
- A simple joint model for improved contextual neural lemmatization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1517–1528. External Links: Cited by: §4.2.
- Murrinhpatha morphology and phonology. Vol. 653, Walter de Gruyter GmbH & Co KG. Cited by: footnote 1.
- Marrying Universal Dependencies and Universal Morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Brussels, Belgium, pp. 91–101. External Links: Cited by: §3.2.
- Joint lemmatization and morphological tagging with Lemming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2268–2274. External Links: Cited by: §4.2.
- Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 875–880. External Links: Cited by: §6.
- Universal dependencies 2.3. Note: LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University External Links: Cited by: §3.2.
- Modeling language variation and universals: A survey on typological linguistics for natural language processing. CoRR abs/1807.00914. External Links: Cited by: §6.
- A universal feature schema for rich morphological annotation and fine-grained cross-lingual part-of-speech tagging. In Proceedings of the 4th Workshop on Systems and Frameworks for Computational Morphology (SFCM), C. Mahlow and M. Piotrowski (Eds.), Communications in Computer and Information Science, pp. 72–93. External Links: Cited by: §3.1.
- A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 674–680. External Links: Cited by: §3.1.
- “Cloze procedure”: a new tool for measuring readability. Journalism Bulletin 30 (4), pp. 415–433. Cited by: §2.2.
- Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2062–2068. External Links: Cited by: §6.
- Freezing subnetworks to analyze domain adaptation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, pp. 124–132. External Links: Cited by: §6.
- Fast development of basic NLP tools: towards a lexicon and a POS tagger for Kurmanji Kurdish. In International conference on lexis and grammar, Cited by: footnote 1.
- Developing a large-scale lexicon for a less-resourced language: general methodology and preliminary experiments on Sorani Kurdish. In Proceedings of the 7th SaLTMiL Workshop on Creation and use of basic lexical resources for less-resourced languages (LREC 2010 Workshop), Valetta, Malta. External Links: Cited by: footnote 1.
- Phonological features for morphological inflection. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, Brussels, Belgium, pp. 161–166. External Links: Cited by: footnote 7.
- Exact hard monotonic attention for character-level transduction. arXiv preprint arXiv:1905.06319v1. External Links: Cited by: §4.1, §4.2.
- Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4425–4438. External Links: Cited by: §4.1.
- Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1568–1575. External Links: Cited by: §2.1.