# The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

## 1 Introduction

Some of a word’s syntactic and semantic properties are expressed on the word form through a process termed morphological inflection. For example, each English count noun has both singular and plural forms (robot/robots, process/processes), known as the inflected forms of the noun. Some languages display little inflection, while others possess a proliferation of forms. A Polish verb can have nearly 100 inflected forms and an Archi verb has thousands (Kibrik, 1998).

Natural language processing systems must be able to analyze and generate these inflected forms. Fortunately, inflected forms tend to be systematically related to one another. This is why English speakers can usually predict the singular form from the plural and vice versa, even for words they have never seen before: given a novel noun wug, an English speaker knows that the plural is wugs.

We conducted a competition on generating inflected forms. This “shared task” consisted of two separate scenarios. In Task 1, participating systems must inflect word forms based on labeled examples. In English, an example of inflection is the conversion of a citation form run to its present participle, running. The system is provided with the source form and the morphosyntactic description (MSD) of the target form, and must generate the actual target form. Task 2 is a harder version of Task 1, where the system must infer the appropriate MSD from a sentential context. This is essentially a cloze task, asking participants to provide the correct form of a lemma in context.

The first task was identical to sub-task 1 from the CoNLL–SIGMORPHON 2017 shared task (Cotterell et al., 2017), but the language selection was extended from 52 languages to 103. The data sets for the overlapping languages between 2017 and 2018 were also resampled and are not identical. The task consists of morphological generation with sparse training data, something that can be practically useful for MT and other downstream tasks in NLP. Here, participants were given examples of inflected forms as shown in Table 1. Each test example asked participants to produce some other inflected form when given a lemma and a bundle of morphosyntactic features as input.

The training data was sparse in the sense that it included only a few inflected forms from each lemma. That is, as in human L1 learning, the learner does not necessarily observe any complete paradigms in a language where the paradigms are large (e.g., dozens of inflected forms per lemma).

Key points:

1. The task is inflection: Given an input lemma and desired output tags, participants had to generate the correct output inflected form (a string).

2. The supervised training data consisted of individual forms (see Table 1) that were sparsely sampled from a large number of paradigms.

3. Forms that are empirically more frequent were more likely to appear in both training and test data (see § 3 for details).

4. Systems were evaluated after training on 100 (low), 1,000 (medium), and 10,000 (high) lemma/MSD/inflected form triplets.

### 2.2 Task 2: Inflection in Context

The cloze test is a common exercise in an L2 instruction setting. In the cloze test, a number of words are deleted from a text and students are required to fill in the gaps with contextually plausible forms, often working from the knowledge about which lemma should be inflected. The second task of the morphology shared task presents two variations of this traditional cloze test in two tracks specifically aimed at data-driven morphology learning.

Solving a cloze test well requires integration of many types of evidence beyond the pure capacity to inflect a word on demand. Since our training sets were gathered from actual textual resources, a good solver that accurately determines the most plausible form must implicitly combine knowledge of morphology, morphosyntax, semantics, and pragmatics. Potentially, even textual register and genre may affect the choice of correct form. Hence, the task is both intrinsically interesting from a linguistic point of view and carries potential to support many downstream NLP applications.

As shown in Figure 1, both tracks supply the lemma of the omitted target word form and ask the competitors to inflect the lemma in a contextually appropriate way. In the first track, the competitors additionally see the lemmata and MSDs for all context words, whereas in the second track only the context words are available. In contrast to task 1, the MSD for the target lemma is never observed in either the first or the second track. This means that successful inflection requires the competitors to identify relevant contextual cues.

As training data, the first track supplies a full morphosyntactically annotated corpus of sentences: every token is annotated with a lemma and MSD as shown in Figure 2. In the second track, the training data identifies a number of target tokens. Lemmata are supplied for these tokens but the remaining tokens receive no MSD annotation.

Similarly to task 1, both tracks in task 2 provide three different training data settings providing varying amounts of data: low (ca.  tokens), medium (ca.  tokens) and high (ca.  tokens). The token counts refer to the total number of tokens in the training sets. In the first track, this allows competitors to train their systems on all available tokens. In the second track, however, only a number of tokens supply the input lemma as explained above. Thus, the effective number of training examples is smaller in the second track than in the first track. In both tracks, competitors were restricted to using only the provided training sets. For example, semi-supervised training using external data was forbidden.

Key points:

1. The task is inflection in context. Given an input lemma in sentential context, participants generate the correct inflected output form.

2. Two degrees of supervision are provided. In track 1, participants see context word forms and their lemmata, as well as their MSDs. In track 2, participants only witness context word forms.

3. The supervised training data, the development data, and the test data consist of sampled sentences from Universal Dependencies (UD) treebanks (Nivre et al., 2017) together with UD-provided lemmata as well as MSDs, which were converted to the UniMorph format, in track 1.

## 3 Data

### 3.1 Data for Task 1

#### Languages

The data for the shared task was highly multilingual, comprising 103 unique languages. Of these, 52 were shared with the 2017 shared task (Cotterell et al., 2017). As with all but 5 of the 2017 languages (Khaling, Kurmanji Kurdish, Sorani Kurdish, Haida, and Basque), the 34 remaining 2018 languages were sourced from the English edition of Wiktionary, a large multi-lingual crowd-sourced dictionary containing morphological paradigms for many lemmata.

The shared task language set is genealogically diverse, including languages from 20 language stocks. Although the majority of the languages are Indo-European, we also include two language isolates (Haida and Basque) along with languages from Athabaskan (Navajo), Kartvelian (Georgian), Quechua, Semitic (Arabic, Hebrew), Sino-Tibetan (Khaling), Turkic (Turkish), and Uralic (Estonian, Finnish, Hungarian, and Northern Sami) language families. The shared task language set is also diverse in terms of morphological structure, with languages which use primarily prefixes (Navajo), suffixes (Quechua and Turkish), and a mix, with Spanish exhibiting internal vowel variations along with suffixes and Georgian using both infixes and suffixes. The language set also exhibits features such as templatic morphology (Arabic, Hebrew), vowel harmony (Turkish, Finnish, Hungarian), and consonant harmony (Navajo) which require systems to learn non-local alternations. Finally, the resource level of the languages in the shared task set varies greatly, from major world languages (e.g. Arabic, English, French, Spanish, Russian) to languages with few speakers (e.g. Haida, Khaling). Typologically, the majority of the languages are agglutinating or fusional, with three polysynthetic languages: Haida, Greenlandic, and Navajo.

#### Data Format

For each language, the basic data consists of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb).

All features in the bundle are coded according to the UniMorph Schema, a cross-linguistically consistent universal morphological feature set (Sylak-Glassman et al., 2015a,b).
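As an illustration of the triple format, the following minimal sketch parses one data line into a lemma, a feature bundle, and an inflected form. The tab-separated layout, the column order, and the names `Triple` and `parse_triple` are assumptions made for this example; only the `;`-separated bundle with the part of speech first comes from the description above.

```python
from typing import NamedTuple, Tuple


class Triple(NamedTuple):
    lemma: str
    features: Tuple[str, ...]  # UniMorph features; the first one is the core part of speech
    form: str


def parse_triple(line: str) -> Triple:
    # Assumed layout (hypothetical): lemma <TAB> inflected form <TAB> feature bundle
    lemma, form, bundle = line.rstrip("\n").split("\t")
    return Triple(lemma, tuple(bundle.split(";")), form)


example = parse_triple("koti\tkodista\tN;IN+ABL;SG")
print(example.features[0])  # N
```

Keeping the bundle as a tuple of individual features, rather than one opaque string, makes it easy for a system to generalize over single features.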

#### Extraction from Wiktionary

For each of the Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. These tables were extracted using a template annotation procedure described in Kirov et al. (2018).

Within a language, different paradigms may have different shapes. To prepare the shared task data, each language’s parsed tables from Wiktionary were grouped according to their tabular structure and number of cells. Each group represents a different type of paradigm (e.g., verb). We used only groups with a large number of lemmata, relative to the number of lemmata available for the language as a whole. For each group, we associated a feature bundle with each cell position in the table, by manually replacing the prose labels describing grammatical features (e.g. “accusative case”) with UniMorph features (e.g. ACC). This allowed us to extract triples as described in the previous section. The dataset produced by this process was sampled to create appropriately-sized data for the shared task, as described in § 3.1. The dataset sizes by language are given in Table 2 and Table 3.

#### Sampling the Train-Dev-Test Splits

From each language’s collection of paradigms, we sampled the training, development, and test sets as follows.

Our first step was to construct probability distributions over the (lemma, feature bundle, inflected form) triples in our full dataset. For each triple, we counted how many tokens the inflected form has in the February 2017 dump of Wikipedia for that language. To distribute the counts of an observed form over all the triples that have this token as their form, we use the syncretism resolution method of Cotterell et al. (2018), training a neural network on unambiguous forms to estimate the distribution over all, even ambiguous, forms. We then sampled 12,000 triples without replacement from this distribution. The first 100 were taken as the low-resource training set for sub-task 1, the first 1,000 as the medium-resource training set, and the first 10,000 as the high-resource training set. Note that these training sets are nested, and that the highest-count triples tend to appear in the smaller training sets.

The final 2,000 triples were randomly shuffled and then split in half to obtain development and test sets of 1,000 forms each. The final shuffling was performed to ensure that the development set is similar to the test set. By contrast, the development and test sets tend to contain lower-count triples than the training set. Note that for languages that do not have enough triples for this process, we settle for omitting the higher-resource training regimes and scale down the other sizes. Details for all languages are found in Tables 2 and 3.
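The sampling procedure can be sketched as follows. This is a simplified illustration, not the organizers' script: the function name `sample_splits`, its parameters, and the use of successive weighted draws from Python's `random` module are assumptions, and every weight is assumed to be positive (for example, a smoothed corpus count).

```python
import random


def sample_splits(weights, train_sizes=(100, 1000, 10000), heldout=2000, seed=0):
    """weights maps each (lemma, bundle, form) triple to a positive sampling weight."""
    rng = random.Random(seed)
    pool = list(weights)
    w = [weights[t] for t in pool]
    total = max(train_sizes) + heldout
    if len(pool) < total:
        raise ValueError("not enough triples for the requested split sizes")
    drawn = []
    for _ in range(total):
        # Draw one triple in proportion to its weight, then remove it (no replacement).
        i = rng.choices(range(len(pool)), weights=w, k=1)[0]
        drawn.append(pool.pop(i))
        w.pop(i)
    # Nested training sets: each smaller set is a prefix of the larger ones.
    train = {n: drawn[:n] for n in train_sizes}
    # The remaining draws are shuffled and split in half into dev and test.
    held = drawn[max(train_sizes):]
    rng.shuffle(held)
    half = len(held) // 2
    return train, held[:half], held[half:]
```

Because high-weight triples tend to be drawn early, they concentrate in the smaller training sets, while the held-out portion at the end of the draw tends to contain lower-count triples.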

### 3.2 Data for Task 2

All task 2 data sets are based on Universal Dependencies (UD) v2 treebanks (Nivre et al., 2017). We used the data sets prepared for the 2017 CoNLL shared task on Multilingual Dependency Parsing (Zeman et al., 2017) because those were available before the official UD v2 data sets. For contextual inflection data sets, we retained only word forms, lemmata, part-of-speech tags and morphosyntactic feature descriptions. Dependency trees were discarded along with all other annotations present in the treebanks.

Task 2 submissions are evaluated with regard to two distinct criteria: (1) the ability of the system to reconstruct the original word form in the UD test set and (2) the ability of the system to find a contextually plausible form even if the form differs from the original one. Evaluation on plausible forms is based on manually identifying the set of contextually plausible forms for each test example. Because of the need for manual annotation, task 2 covers a more limited set of languages than task 1. In total, there are seven languages: English, Finnish, French, German, Russian, Spanish and Swedish. Token counts for the training, development and test sets are given in Table 4.

#### Data Conversion

Some of the UD treebanks required slight modifications in order to be suitable for reinflection. In the Finnish data sets, lemmata for compound words included morpheme boundaries, for example muisti#kapasiteetti ‘memory capacity’. The morpheme boundary symbols were deleted. In the Russian treebanks, all lemmata were written completely in upper case letters. These were converted to lower case.
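The two lemma normalizations described above amount to simple string operations. The sketch below is illustrative only; the function name `normalize_lemma` and the language labels are assumptions, not part of the released data pipeline.

```python
def normalize_lemma(lemma: str, language: str) -> str:
    if language == "finnish":
        # Delete morpheme boundary symbols in compound lemmata.
        return lemma.replace("#", "")
    if language == "russian":
        # Lemmata were written entirely in upper case; convert to lower case.
        return lemma.lower()
    return lemma


print(normalize_lemma("muisti#kapasiteetti", "finnish"))  # muistikapasiteetti
```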

#### Manual annotation

To produce the complete list of “plausible forms” annotators were given complete UniMorph inflection tables for the center lemma for each sentence and were asked to check off all forms that are “grammatically plausible” in the particular context. For example, given an original sentence We saw the dog, the form dogs would be contextually plausible and would be annotated into the test set. For pro-drop languages and short sentences, it is sometimes the case that all or most indicative, conditional, and future forms of a verb are acceptable when the subject is omitted and agreement is unknown. For example, consider the Spanish sentence from the test data:

| ___ (ser) | la | mejor | de | Primera |
|---|---|---|---|---|
| ‘to be’ | ‘the’ | ‘best’ | ‘of’ | ‘premier (league)’ |

Obviously, almost any person, tense, and aspect of the verb ‘to be’ will be appropriate for this limited context (sería ‘I would be’, fue ‘he/she/it was’, eres ‘you are’, …). Of course, depending on the genre of the text, some would be highly implausible, but the annotation intends to capture morphosyntactic rather than semantic and pragmatic felicity.

We had one annotator for each test set, with the exception of French, in which, due to practical difficulties in finding a native speaker annotator, we did not annotate the plausible forms and instead used the original sentences.

When forming the final test sets, all test examples with more than contextually plausible word form alternatives were filtered out. This was done because a large number of plausible word forms was deemed to raise the risk of annotation errors. A threshold of plausible forms was chosen because it means that all languages have test sets greater than examples. The test set for French is smaller but this is not due to manual annotations.

#### Sampling examples

The data sets for each language are based on UD treebanks for the given language. We preserved UD splits into training, development and test data.

For each UD treebank, we first formed sets of training, development and test candidate sentences. A sentence was a candidate for the shared task data set if it contained a token found in the UniMorph resource for the relevant language; or more precisely, a token whose word form, lemma and MSD occur in the same UniMorph inflection table.

We limited target tokens to tokens present in the UniMorph resource in order to facilitate manual annotation of data sets. In particular, we limited the set of possible target MSDs to MSDs which occur in the UniMorph resource. This was necessary to avoid a prohibitively large number of contextually plausible inflections in certain languages. For example, Finnish includes a number of clitics (ko/kä, kin, han/hän, pa/pä, s, kaan/kään) which can be appended relatively freely to word forms. Combinations of clitics are also possible. This easily leads to hundreds of word forms which can be contextually plausible. Restricting the MSDs of a possible output form to the more limited set of MSDs occurring in the UniMorph resource made the selection of plausible forms far more manageable from an annotation perspective.

Training data sets were formed from candidate sentences simply by sampling a suitable number of sentences from the candidate sets in order to achieve the desired token counts , , and for the low, medium, and high data settings, respectively. For German and Russian, all candidate sentences were used in the high data setting, although this was not sufficient to create a training set of tokens. The training sets for German and Russian are, therefore, smaller than those for the other languages. For the development sets, we used all available candidate sentences for all of the languages.

For the test data, we first formed a set of candidate sentences so that the combined number of target tokens in the test sets was 1,000. Target tokens in these initial test sets were then manually annotated with additional contextually plausible word forms.

#### MSD conversion

Sampling of training, development and test examples was based on comparing UD word forms, lemmata and MSDs to equivalents in UniMorph paradigms. Therefore, it was necessary to convert the morphosyntactic annotation in the UD data sets into UniMorph morphosyntactic annotation. We used deterministic tag conversion rules to accomplish this. An example of a source UD sentence and a target UniMorph sentence is shown in Figure 3.

Since the selection of languages in task 2 is small and we do not attempt to correct annotation errors in the UD source materials, conversion between UD and UniMorph morphosyntactic descriptions is generally straightforward. However, UD descriptions are more fine-grained than their UniMorph equivalents. For example, UD denotes lexical features such as noun gender which are inherent features of a lexeme possessed by all of its word forms. Such inherent features are missing from UniMorph, which exclusively annotates inflectional morphology (McCarthy et al., 2018). Therefore, UD features which lack correspondents in the UniMorph tagging schema were simply dropped during conversion.
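A deterministic conversion of this kind can be sketched as a table lookup in which unmapped features are silently dropped. The mapping tables below are a small illustrative subset chosen for this example and are not the conversion rules actually used; the names `POS_MAP`, `FEAT_MAP` and `ud_to_unimorph` are hypothetical.

```python
# Illustrative subset only (assumption): UD part-of-speech tags and features
# mapped to UniMorph tags. Features without an entry are dropped.
POS_MAP = {"NOUN": "N", "VERB": "V", "ADJ": "ADJ"}
FEAT_MAP = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Case", "Nom"): "NOM",
    ("Case", "Acc"): "ACC",
    ("Tense", "Past"): "PST",
    ("Tense", "Pres"): "PRS",
}


def ud_to_unimorph(upos: str, feats: str) -> str:
    tags = [POS_MAP[upos]]  # the part of speech always comes first
    if feats and feats != "_":
        for item in feats.split("|"):
            name, value = item.split("=", 1)
            tag = FEAT_MAP.get((name, value))
            if tag is not None:  # e.g. inherent noun gender has no entry here
                tags.append(tag)
    return ";".join(tags)


print(ud_to_unimorph("NOUN", "Case=Acc|Gender=Fem|Number=Sing"))  # N;ACC;SG
```

In this sketch the gender feature is dropped simply because the illustrative table has no entry for it, mirroring how features without a UniMorph correspondent are discarded.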

## 4 Baselines

The baseline system provided for task 1 was based on the observation that, for a large number of languages, producing an inflected form from an input citation form can often be done by memorizing the suffix changes that occur in doing so, assuming enough examples are seen (Liu and Mao, 2016). For example, in witnessing a Finnish inflection of the noun koti ‘home’ in the singular elative case as kodista, a number of transformation rules can be extracted that may apply to previously unseen nouns:

    $koti$    $kodista$    N;IN+ABL;SG


In this example, the following transformation rules are extracted:

    $ → sta$
    i$ → ista$
    ti$ → dista$
    oti$ → odista$
    koti$ → kodista$

Such rules are then extracted from each example inflection in the training data. At generation time, the longest matching left hand side of a rule is identified and applied to the citation form. For example, if the Finnish noun luoti ‘bullet’ were to be inflected in the elative (N;IN+ABL;SG) using only the extracted rules given above, the transformation oti$ → odista$ would be triggered, producing the output luodista. In case there are multiple candidate rules with equally long left hand sides that all match, ties are broken by frequency, i.e. the rule that has been witnessed most often in the training data applies.
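The suffix-rule mechanism can be sketched in a few lines. This is a simplified illustration rather than the released baseline: it assumes that lemma and form are aligned position by position from the left, which stands in for the proper alignment step of the real system, and it handles suffix rules only. The names `extract_suffix_rules` and `SuffixRuleBaseline` are hypothetical.

```python
from collections import Counter, defaultdict


def extract_suffix_rules(lemma, form):
    # Assumption: character i of the lemma is aligned with character i of the form.
    # "$" marks the end of the word, as in the rules listed above.
    return [(lemma[i:] + "$", form[i:] + "$") for i in range(len(lemma), -1, -1)]


class SuffixRuleBaseline:
    def __init__(self):
        self.rules = defaultdict(Counter)  # MSD -> counts of (lhs, rhs) rules

    def train(self, triples):
        for lemma, msd, form in triples:
            self.rules[msd].update(extract_suffix_rules(lemma, form))

    def inflect(self, lemma, msd):
        word = lemma + "$"
        candidates = [
            (len(lhs), count, lhs, rhs)
            for (lhs, rhs), count in self.rules[msd].items()
            if word.endswith(lhs)
        ]
        if not candidates:
            return lemma  # no rule applies: fall back to copying the lemma
        # Longest left hand side wins; ties are broken by training frequency.
        _, _, lhs, rhs = max(candidates)
        return (word[: len(word) - len(lhs)] + rhs)[:-1]  # drop the final "$"


model = SuffixRuleBaseline()
model.train([("koti", "N;IN+ABL;SG", "kodista")])
print(model.inflect("luoti", "N;IN+ABL;SG"))  # luodista
```

With the single training example koti/kodista, the longest left hand side matching luoti$ is oti$, so the sketch reproduces the luodista example from the text.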

Since languages may also use prefixing as an inflectional strategy, a similar process is applied to any identified prefix changes. Identifying which parts of a change in a word form correspond to a prefix and which are considered suffixes requires alignment of the citation form and the output form, which is performed as a preliminary step. We refer the reader to Cotterell et al. (2017) for a detailed description of the baseline system.

#### Neural Baseline

The neural baseline system is an encoder-decoder reinflection system with attention inspired by Kann and Schütze (2016). The crucial difference is that the reinflection is conditioned on sentence context. This is accomplished by conditioning the encoder on embeddings of context words in track 2 and context words, their lemmata and their MSDs in track 1.

The neural baseline system takes as input

1. a lemma,

2. a left and right context word form,

3. a left and right context lemma (only in track 1), and

4. a left and right context MSD (only in track 1).

The neural baseline system produces an inflected form of the lemma as output.

The input characters are first embedded. Then, the context words (for both tracks), as well as the context lemmata and MSDs (for track 1), are also embedded. The system also uses a whole token embedding of the input lemma.

A bidirectional LSTM encoder is used to encode the lemma into representation vectors. In order to condition the encoder on the sentence context of the lemma, the encoder input vector for each character is

1. a concatenation of embeddings for the context word forms, context lemmata, context MSDs, input lemma and input character for track 1, and

2. a concatenation of embeddings for the context word forms, input lemma and input character for track 2.

The input vectors are then encoded into representations by a bidirectional LSTM encoder. Finally, a decoder with additive attention (Vaswani et al., 2017) is used for generating the output word form based on these representations.

The baseline system uses 100-dimensional embeddings and the LSTM hidden dimension for both the encoder and decoder is of size 100. Both encoder and decoder LSTM networks are single layer networks. The additive attention network is a 2-layer feed-forward network with hidden dimension 100 and nonlinearity.

The baseline system is trained for 20 epochs in both tracks and under all data settings using Adam (Kingma and Ba, 2014). During training, 30% dropout is applied on all input and recurrent connections in the encoder and decoder LSTM networks. Whole token embeddings for the input lemma, context word forms, lemmata and MSDs are dropped with a probability of 10%.

#### Copy Baseline

The second baseline is very straightforward. It simply copies the input lemma into the output. The system is based on the observation that in many languages the lemma form is quite common. In some languages, such as English, this baseline is in fact quite difficult to beat when the training set is small.

## 5 Results

The CoNLL–SIGMORPHON 2018 shared task received submissions from 15 teams with members from 17 universities or institutes (Table 7). Many of the teams submitted more than one system, yielding a total of 33 unique systems entered—27 for task 1, and 6 for task 2. In addition, baseline systems provided by the organizers for both tasks were also evaluated.

The relative system performance is described in Table 8, which shows the average per-language accuracy of each system by resource condition. The table reflects the fact that some teams submitted more than one system (e.g. UZH-1 & UZH-2 in the table). Learning curves for each language across conditions are shown in Tables 9 and 10, which indicate the best per-form accuracy achieved by a submitted system. Full results can be found in Appendix A. Newer approaches led to better overall results in 2018 compared to 2017. In the low-resource condition, 41 (80%) of the 52 languages shared across years saw improvement in top system performance.

In the lower data conditions, encoder-decoder models are known to perform worse than the baseline model due to data sparsity. One way to work around this weakness is to learn sequences of edit operations instead of a standard string-to-string transduction, a strategy which was used by teams last year and this year (AX SEMANTICS, UZH, HAMBURG, MSU, RACAI). Another strategy is to create artificial training data that biases the neural model toward copying (Kann and Schütze, 2017; Bergmanis et al., 2017; Silfverberg et al., 2017; Zhou and Neubig, 2017; Nicolai et al., 2017), which was also employed this year (TUEBINGEN-OSLO, WASEDA). Learning edit sequences requires input/output alignment, often as a preliminary step. The UZH submissions, which attained the highest average accuracy on the higher data conditions, built upon ideas in their last year’s submission (Makarov et al., 2017), which had used such a separate alignment step followed by the application of an edit sequence. Their 2018 submission included edit distance alignment as part of the training loss function in the model, producing an end-to-end model. Another alternative to the edit sequence model is to use pointer generator networks, introduced by See et al. (2017) for text summarization, which also allow for copying parts of the input. This was employed by IITBHU. BME used a modified attention model that attended to both the lemma sequence and the tag sequence, which worked well in the high data condition, but, being without models of data augmentation or edit sequences, it suffered in the low data setting. In general, systems that included edit sequence generation or data augmentation fared significantly better in the low data settings.
The HAMBURG submission attempted to learn similarities between characters based on rendering them visually using a font, with the intent to discover similarities such as those between a and ä, where the former is usually a low back vowel, and the latter a fronted version. Ensembling was also a popular choice to improve system performance. The UA system combined multiple models, both neural and non-neural, and focused on performance in the low data setting.

Even though the top-ranked systems used some form of ensembling to improve performance, different teams relied on different overall approaches. As a result, submissions may contain some amount of complementary information, so that a global ensemble may improve accuracy. As in 2017, we present an upper bound on the possible performance of such an ensemble. Table 8 includes an “Ensemble Oracle” system (oracle-e) that gives the correct answer if any of the submitted systems is correct. The oracle performs significantly better than any one system in both the Medium (10%) and Low (25%) conditions. This suggests that the different strategies used by teams to “bias” their systems in an effort to make up for sparse data lead to substantially different generalization patterns.
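The ensemble oracle is simple to compute from per-system outputs. The following minimal sketch assumes that each system's predictions are stored as a list aligned with the gold forms; the function name `ensemble_oracle_accuracy` and the toy data are hypothetical.

```python
def ensemble_oracle_accuracy(gold, system_outputs):
    """Fraction of test items that at least one system gets right.

    gold: list of correct forms.
    system_outputs: dict mapping a system name to its list of predictions,
    aligned with gold (an assumption about the data layout).
    """
    hits = sum(
        any(preds[i] == target for preds in system_outputs.values())
        for i, target in enumerate(gold)
    )
    return hits / len(gold)


gold = ["wugs", "running", "kodista"]
systems = {
    "A": ["wugs", "runing", "kotista"],
    "B": ["wug", "running", "kotista"],
}
print(ensemble_oracle_accuracy(gold, systems))  # 0.666...
```

In the toy data each system alone scores one out of three, while the oracle scores two out of three, which is the kind of gap that indicates complementary errors.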

As in 2017, we also present a second “Feature Combination” Oracle (oracle-fc) that gives the correct answer for a given test triple iff its feature bundle appeared in training (with any lemma). Thus, oracle-fc provides an upper bound on the performance of systems that treat a feature bundle such as V;SBJV;FUT;3;PL as atomic. In the low-data condition, this upper bound was 77%, meaning that 23% of the test bundles had never been seen in training data. Nonetheless, systems should be able to make some accurate predictions on this 23% by decomposing each test bundle into individual morphological features such as FUT (future) and PL (plural), and generalizing from training examples that involve those features. For example, a particular feature or sub-bundle might be realized as a particular affix. For systems to succeed at this type of generalization, they must treat each individual feature separately, rather than treating feature bundles as holistic. In the medium data condition for some languages, some submissions far surpassed oracle-fc. As in 2017, the most notable example of this is Basque, where oracle-fc produced a 44% accuracy while six of the submitted systems produced an accuracy of 80% or above. Basque is an extreme example with very large paradigms for the few verbs that inflect in the language, so the problem of generalizing correctly to unseen feature combinations is amplified.
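The feature combination oracle depends only on which feature bundles occur in training. A minimal sketch follows, with bundles represented as plain strings; the function name `feature_combination_oracle` and the toy bundles are assumptions for this example.

```python
def feature_combination_oracle(train_bundles, test_bundles):
    """Accuracy of an oracle that is correct iff the test bundle was seen in training."""
    seen = set(train_bundles)  # the lemma is irrelevant: any lemma counts
    return sum(bundle in seen for bundle in test_bundles) / len(test_bundles)


train_bundles = ["V;PST", "N;PL"]
test_bundles = ["V;PST", "V;SBJV;FUT;3;PL", "N;PL", "N;PL"]
print(feature_combination_oracle(train_bundles, test_bundles))  # 0.75
```

A system that scores above this value on a language must be composing individual features such as FUT and PL rather than memorizing whole bundles.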

All systems submitted for task 2 were neural systems. All but one of the systems were encoder-decoder systems reminiscent of Kann and Schütze (2016). The exception, Makarov and Clematide (2018), used a neural transition-based transducer with a designated copy action, which edits the input lemma into an output form. Table 6 details some of the design features in task 2 systems.

Predict MSD systems predicted the MSD of the target word form based on contextual cues and used the MSD to improve performance. The system by Kementchedjhieva et al. (2018) used MSD prediction as an auxiliary task. The system by Liu et al. (2018) instead converted the contextual reinflection problem into ordinary morphological reinflection. They first predicted the MSD of the target word form based on sentence context and then generated the target word form using the input lemma and the predicted MSD.

Several systems improved upon the context model in the neural baseline system. Three systems (BME-HAS, NYU, and ZHU) used subword context models, for example, character-level models to encode context word forms, lemmata and MSDs. Many systems (Ács, 2018; Kementchedjhieva et al., 2018; Kann et al., 2018) also used a context RNN for encoding sentence context exceeding the immediate neighboring words. Kann et al. (2018) used context attention, which refers to an attention mechanism directed at contextual information.

The system by Kementchedjhieva et al. (2018) was multilingual in the sense that it combined training data for all task 2 languages. Finally, the system by Makarov and Clematide (2018) used beam search for decoding.

Overall performance for all data settings in tracks 1 and 2 of task 2 is described in Table 11. For evaluation with regard to original forms, the evaluation criterion is accuracy; that is, how often a system correctly predicted the original UD form. For evaluation with regard to plausible forms, the evaluation criterion is relaxed accuracy given the set of contextually plausible forms. In other words, we measure how often the prediction was one of the variants in the set of plausible forms.
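The two evaluation criteria differ only in what counts as a hit. The sketch below is a minimal illustration under the assumption that predictions, original forms, and plausible-form sets are stored in aligned lists; the function names `accuracy` and `relaxed_accuracy` and the toy data are hypothetical.

```python
def accuracy(predictions, originals):
    # Strict criterion: the prediction must equal the original UD form.
    return sum(p == o for p, o in zip(predictions, originals)) / len(originals)


def relaxed_accuracy(predictions, plausible):
    # Relaxed criterion: the prediction must be one of the plausible forms.
    return sum(p in forms for p, forms in zip(predictions, plausible)) / len(plausible)


predictions = ["dogs", "ran"]
originals = ["dog", "ran"]
plausible = [{"dog", "dogs"}, {"ran"}]
print(accuracy(predictions, originals))          # 0.5
print(relaxed_accuracy(predictions, plausible))  # 1.0
```

Since the original form is itself contextually plausible, relaxed accuracy is never lower than strict accuracy for the same predictions.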

In track 1, the COPENHAGEN system is the clear winner in the high and medium data settings, whereas the UZH system is the clear winner in the low data setting. In fact, UZH is the only system which can beat the lemma copying baseline COPY-BL in the low setting. In track 2, the COPENHAGEN system and the neural baseline system NEURAL-BL deliver comparable performance in the high data setting. In the medium and low setting, the UZH system is the clear winner. Once again, the UZH system is the only system which can beat the lemma copying baseline COPY-BL in the low setting.

Table 11 shows that the best track 1 system outperforms the best track 2 system for every data setting, meaning that the additional supervision offered by context lemmata and MSDs is useful. Moreover, this effect seems to strengthen with increasing amounts of training data: the difference in performance between the best track 1 and track 2 systems for original forms in the low data setting is 3.8%-points, in the medium setting 7.8%-points, and in the high setting 13.6%-points. A further observation is that it seems to be more difficult to deliver improvements over the neural baseline system NEURAL-BL in the high setting in track 2, where NEURAL-BL in fact is one of the top two systems. This may be a result of the relatively small training sets: even in the high data setting, the training sets only contain approximately tokens.

The results on original and plausible forms show strong agreement. In all but one case, the same systems deliver the strongest performance for both evaluation criteria. The only exception is the Track 2 high setting where COPENHAGEN is the top system with regard to original forms and NEURAL-BL with regard to plausible forms. However, the performance of these systems is very similar. This strong agreement indicates that evaluation on plausible forms might not be necessary.

The best-performing systems for each language, track, and data setting in task 2 are given in Table 12. In track 1, COPENHAGEN achieves the strongest results for most languages in the high and medium data settings, whereas UZH delivers the best performance on all languages in the low setting. In track 2, COPENHAGEN and NEURAL-BL deliver the best performance on an equal number of languages in the high setting, whereas UZH delivers the best performance for most languages in the low and medium settings, and COPENHAGEN performs best for the remaining languages.

## 6 Future Directions

In the case of inflection, an interesting future topic could involve departing from orthographic representations and using more IPA-like representations, i.e., transductions over pronunciations. Different languages, in particular those with idiosyncratic orthographies, may offer new challenges in this respect.

Neither task this year included unannotated monolingual corpora. Using such data is well-motivated from an L1-learning point of view, and may affect performance in the low-resource data settings, especially for the cloze task. In the inflection task, however, Zhou and Neubig (2017) did not see significant gains from using extra data last year.

Only one team tried to learn inflection in a multilingual setting, i.e., to use all training data to train one model. Such transfer learning is an interesting avenue of future research, but evaluation could be difficult. It remains to be disentangled whether cross-language transfer is actually being learned or whether having more data simply biases the networks toward copying strings.

Creating new data sets that accurately reflect learner exposure (whether L1 or L2) is also an important consideration in the design of future shared tasks.

The results for task 2 show that evaluation against the original test form and evaluation against the set of plausible forms result in very similar rankings of systems, justifying the use of the former, much simpler, method in future shared tasks. No manual annotation would then be required for the creation of test sets, allowing the inclusion of a wider variety of languages.

In track 2 of task 2, it turned out to be difficult to achieve clear improvements over the neural baseline system. This may be a consequence of the limited amount of training data. Increasing the amount of training data is an obvious solution, but encouraging the use of external datasets for semi-supervised learning could also be an interesting direction to pursue. Such semi-supervised methods could take the form of pretrained embeddings from monolingual corpora or more expressive models dedicated to improving morphological inflection, e.g., Wolf-Sonkin et al. (2018).

## 7 Conclusion

The CoNLL–SIGMORPHON 2018 shared task introduced a new cloze-test task with data sets for 7 languages and extended the existing inflection task to 103 languages. In task 1 (inflection), 27 systems were submitted, while 6 systems were submitted in task 2 (cloze test). Neural network models prevailed in both, although significant modifications to standard architectures were required to beat a simple baseline in the low data settings in both tasks.

As in previous years, we compared inflection system performance to oracle ensembles, showing that systems possessed complementary strengths. We released the training, development, and test sets for each task, and expect these to be useful for future endeavors in morphological learning, both in sentential context and in the case of isolated word inflection.

## Acknowledgements

The first author would like to acknowledge the support of an NDSEG fellowship. MS was supported by a grant from the Society of Swedish Literature in Finland (SLS). Several authors (CK, DY, JSG, MH) were supported in part by the Defense Advanced Research Projects Agency (DARPA) in the program Low Resource Languages for Emergent Incidents (LORELEI) under contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). NVIDIA Corp. donated a Titan Xp GPU used for this research.

## Appendix A Detailed Task 1 Results

This section contains detailed results for each submitted system on each language. Systems are ordered by average per-form accuracy for each sub-task and data condition. The following metrics are presented for each system/language combination.

1. Per-Form Accuracy: Percentage of test forms inflected correctly.

2. Levenshtein Distance: Average Levenshtein distance of system-predicted form from gold inflected form.
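The two metrics above can be sketched as follows. This is an illustrative implementation, not the official evaluation script; the function names and the toy data are hypothetical, and the edit distance assumes unit costs for insertion, deletion, and substitution.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def per_form_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of test forms that match the gold form exactly."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def mean_levenshtein(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Average edit distance between predicted and gold forms."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(levenshtein(p, g) for p, g in zip(predicted, gold)) / len(gold)


# Toy example (invented data): one prediction is missing a single character.
predicted = ["running", "wugs", "walkd"]
gold = ["running", "wugs", "walked"]

print(per_form_accuracy(predicted, gold))
print(mean_levenshtein(predicted, gold))
```

Levenshtein distance gives partial credit that per-form accuracy does not: a prediction that is one character away from the gold form counts as fully wrong under accuracy but contributes only 1 to the distance total.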

Scores in bold include the highest scoring non-oracle system for each language as well as any other systems that did not differ significantly in terms of per-form accuracy according to a sign test (). Scores marked with a indicate submissions that were significantly better than the feature combination oracle (), showing per-feature generalization. Scores marked with did not differ significantly from the ensemble oracle, suggesting minimal complementary information across systems.
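A paired sign test on per-form correctness can be sketched as below. The exact test variant used for the tables is not specified here, so this is an assumption: a two-sided exact binomial sign test in which forms that both systems get right, or both get wrong, are discarded as ties. The function name and example data are hypothetical, and the significance threshold is left to the caller.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact sign test on paired per-form outcomes.

    Only forms where exactly one system is correct are informative.
    Under the null hypothesis, each such form favours either system
    with probability 0.5.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same forms")
    a_wins = sum(a and not b for a, b in zip(a_correct, b_correct))
    b_wins = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no informative pairs, so no evidence of a difference
    k = min(a_wins, b_wins)
    # Probability of a split at least as lopsided as observed, doubled
    # for the two-sided alternative and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy example (invented data): system A alone is right on 8 forms,
# system B alone is right on 2 forms, and 5 forms are ties.
a_correct = [True] * 8 + [False] * 2 + [True] * 5
b_correct = [False] * 8 + [True] * 2 + [True] * 5

print(sign_test_p_value(a_correct, b_correct))
```

Two systems would be treated as not differing significantly when the returned p-value exceeds the chosen threshold.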