IMS at the PolEval 2018:
A Bulky Ensemble Dependency Parser meets 12 Simple Rules for Predicting Enhanced Dependencies in Polish
This paper presents the IMS contribution to \polevalFull (http://poleval.pl). We submitted systems for both Subtasks of Task 1. In Subtask (A), which concerned dependency parsing, we used our ensemble system from \conllstFull. The system first preprocesses the sentences with a CRF POS/morphological tagger and predicts supertags with a neural tagger. Then, it employs multiple instances of three different parsers and merges their outputs by blending. The system achieved second place out of four participating teams. In this paper we show which components of the system contributed most to its final performance.
The goal of Subtask (B) was to predict enhanced graphs. Our approach consisted of two steps: parsing the sentences with our ensemble system from Subtask (A), and applying 12 simple rules to obtain the final dependency graphs. The rules introduce additional enhanced arcs only for tokens with “conj” heads (conjuncts). They do not predict semantic relations at all. The system ranked first out of three participating teams. In this paper we show examples of rules we designed and analyze the relation between the quality of automatically parsed trees and the accuracy of the enhanced graphs.
Keywords: Dependency Parsing · Enhanced Dependencies · Ensemble Parsers.
1 Introduction
This paper presents the IMS contribution to \polevalFull (\poleval). The Shared Task consisted of three Tasks: (1) Dependency Parsing, (2) Named Entity Recognition, and (3) Language Models. Our team took part only in Task (1) and submitted systems for both of its Subtasks, (A) and (B).
The goal of Subtask (A) was to predict morphosyntactic analyses and dependency trees for given sentences. The IMS submission was based on our ensemble system from \conllstFull [17]. The system (described in detail in [1] and henceforth referred to as \imsoriginal) relies on established techniques for improving the accuracy of dependency parsers. It performs its own preprocessing with a CRF tagger, incorporates supertags into the feature model of a dependency parser [10], and combines multiple parsers through blending (also known as reparsing; [12]).
The original system needed only a few modifications to be applied in the PolEval18-ST setting. First, the organizers provided gold-standard tokenization, so we excluded the tokenization modules from the system. Second, one of the metrics used in the PolEval18-ST was BLEX. Since this metric takes lemmas into consideration, we added a lemmatizer to the preprocessing steps. Finally, \imsoriginal was designed to run on the TIRA platform [11], where only a limited amount of CPU time was available to parse a multitude of test sets. The maximal number of instances of individual parsers thus had to be limited to ensure that parsing would end within the given time. Since the parsing time was not limited in the \poleval setting, we removed the time constraint from the search procedure of the system. We call the modified version \imsnew.
The aim of Subtask (B) was to predict enhanced dependency graphs and additional semantic labels. Our approach consisted of two steps: parsing the sentences to surface dependency trees with our system from Subtask (A), and applying a rule-based system to extend the trees with enhanced arcs. Since the \poleval data contains enhanced dependencies only for conjuncts, our set of manually designed rules is small and introduces new relations only for tokens with “conj” heads (it does not predict semantic labels at all).
All components of both submitted systems (including POS tagger, morphological analyzers, and lemmatizer) were trained only on the training treebank. Out of all the additional resources allowed by the organizers we used only the pre-trained word embeddings prepared for \conllstFull (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989). We did not employ any Polish-specific tools, as they (or the data their models were trained on) were not among the resources allowed by the organizers.
The remainder of this paper is organized as follows. Section 2 discusses our submission to Subtask (A) and analyzes which components of the system were the most responsible for its final performance. In Section 3 we describe our submission to Subtask (B), show examples of the designed rules, and analyze the relation between the quality of automatically parsed trees and the accuracy of the enhanced graphs. Our official test set results are shown in Section 4 and Section 5 concludes.
2 Subtask (A): Morphosyntactic prediction of dependency trees
The focus of Subtask (A) was morphosyntactic prediction and dependency parsing. The training and development data contained information about gold-standard tokenization, universal part-of-speech tags (UPOS), Polish-specific tags (XPOS), universal morphological features (UFeats), lemmas, and dependency trees. The dependency trees were annotated with the Universal Dependencies (UD) [9] according to the guidelines of UD v. 2 (http://universaldependencies.org/). To make the Shared Task more accessible to participants, the test data was released with baseline predictions for all preprocessing steps produced by the baseline \udpipe 1.2 system [14].
2.1 System description
Figure 1 shows an overview of the \imsnew system architecture. The architecture can be divided into two steps: preprocessing and parsing. The system uses its own preprocessing tools, so we did not utilize the baseline \udpipe predictions provided by the ST organizers. All the preprocessing tools annotate the training data via 5-fold jackknifing. The parsing step consists of running multiple instances of three different baseline parsers and combining them into an ensemble system. All the trained models for both steps, as well as the code developed during this Shared Task, will be made available on the first author’s page.
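The 5-fold jackknifing mentioned above can be sketched as follows. This is a minimal illustration and not the code of the submitted system: `train_fn` and `predict_fn` are hypothetical stand-ins for training and applying one preprocessing tool, and the round-robin assignment of sentences to folds is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # a gold-annotated training sentence
M = TypeVar("M")  # a trained model
P = TypeVar("P")  # a predicted annotation


def jackknife(
    sentences: Sequence[S],
    train_fn: Callable[[List[S]], M],   # hypothetical: trains a tagger
    predict_fn: Callable[[M, S], P],    # hypothetical: annotates one sentence
    k: int = 5,
) -> List[P]:
    """Annotate every sentence with a model that did not see it in training."""
    predictions: list = [None] * len(sentences)
    for fold in range(k):
        # Assumption: sentence i belongs to fold i % k.
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        model = train_fn(train)
        for i in range(fold, len(sentences), k):
            predictions[i] = predict_fn(model, sentences[i])
    return predictions
```

The point of the procedure is that the parsers are later trained on predicted, rather than gold, tags whose error rate resembles what they will see at test time.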
Below we give a summary of all the components of the system and describe changes introduced to the \imsoriginal system needed to adapt it to the \poleval setting.
Lemmatization is not performed by \imsoriginal. Since BLEX, one of the metrics used in the PolEval18-ST, takes lemmas into consideration, we added a lemmatizer to the preprocessing steps. For this purpose we used the lemmatizer from the mate-tools with default hyperparameters (https://code.google.com/archive/p/mate-tools/).
Part-of-Speech and Morphological Tagging is performed within \imsoriginal by \marmot, a morphological CRF tagger [8] (http://cistern.cis.lmu.de/marmot/). UPOS and UFeats are predicted jointly. Since \imsoriginal did not use XPOS tags, we added an additional CRF tagger predicting only XPOS tags (separately from other preprocessing steps). We used \marmot with default hyperparameters.
Supertags [7] are labels for tokens which encode syntactic information, e.g., the head direction or the subcategorization frame. \imsoriginal follows [10] and extracts supertags from the training treebank. Then, it incorporates them into the feature models of all baseline dependency parsers. Supertags are predicted with an in-house neural-based tagger (\xiangtag) [15] (https://github.com/EggplantElf/sclem2017-tagger).
Baseline parsers used by \imsoriginal differ in terms of architecture and employed training methods. The system uses three baseline parsers:
(1) The graph-based perceptron parser from mate-tools [3], henceforth referred to as \mate. The parser has been slightly modified to handle features based on supertags and to shuffle training instances between epochs. Since there are no time constraints in the \poleval (unlike the CoNLL 2017 Shared Task), \mate is applied to all sentences; cf. [1] for details on how some sentences were skipped to save time in the \imsoriginal system.
(2) An in-house transition-based beam-perceptron parser [2], henceforth referred to as \abt.
(3) An in-house transition-based greedy neural parser [16], henceforth referred to as \xiang.
We use the default hyperparameters during training and testing of all the three baseline parsers.
Blending, i.e., combining outputs of multiple different baseline parsers, can lead to improved performance [12]. \imsoriginal parses every sentence with each baseline parser and combines all the predicted trees into one graph. It assigns scores to arcs depending on how frequent they are in the predicted trees. Then it uses the Chu-Liu-Edmonds algorithm [5, 6] to find the maximum spanning tree in the combined graph. For every resulting arc it selects the most frequent label across all the labels previously assigned to it.
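The blending step described above can be illustrated with the following sketch. It is not the code of \imsoriginal: the tree representation (a map from a token to its head and label, with node 0 as ROOT), the use of raw vote counts as arc scores, and all function names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Arc = Tuple[int, int]               # (head, dependent); node 0 is ROOT
Tree = Dict[int, Tuple[int, str]]   # dependent -> (head, label)


def find_cycle(head_of: Dict[int, int]) -> Optional[List[int]]:
    """Return the nodes of one cycle in a dependent -> head map, if any."""
    for start in head_of:
        path, node = [], start
        while node in head_of and node not in path:
            path.append(node)
            node = head_of[node]
        if node in path:
            return path[path.index(node):]
    return None


def chu_liu_edmonds(scores: Dict[Arc, float], root: int = 0) -> Dict[int, int]:
    """Maximum spanning arborescence, returned as a dependent -> head map."""
    arcs = {a: w for a, w in scores.items() if a[1] != root and a[0] != a[1]}
    best: Dict[int, int] = {}
    for (h, d), w in arcs.items():  # greedily pick the best head per node
        if d not in best or w > arcs[(best[d], d)]:
            best[d] = h
    cycle = find_cycle(best)
    if cycle is None:
        return best
    in_cycle = set(cycle)
    merged = max(n for a in arcs for n in a) + 1  # id of the contracted node
    contracted: Dict[Arc, float] = {}
    origin: Dict[Arc, Arc] = {}
    for (h, d), w in arcs.items():
        if h in in_cycle and d in in_cycle:
            continue
        if d in in_cycle:    # entering arc: pay for breaking the cycle at d
            key, val = (h, merged), w - arcs[(best[d], d)]
        elif h in in_cycle:  # leaving arc
            key, val = (merged, d), w
        else:
            key, val = (h, d), w
        if key not in contracted or val > contracted[key]:
            contracted[key], origin[key] = val, (h, d)
    result: Dict[int, int] = {}
    for d, h in chu_liu_edmonds(contracted, root).items():
        orig_h, orig_d = origin[(h, d)]
        result[orig_d] = orig_h
    for d in cycle:  # remaining cycle nodes keep their greedy heads
        result.setdefault(d, best[d])
    return result


def blend(trees: List[Tree]) -> Tree:
    """Vote on arcs, decode a tree, then vote on the label of each arc."""
    arc_votes: Counter = Counter()
    label_votes: Dict[Arc, Counter] = {}
    for tree in trees:
        for dep, (head, label) in tree.items():
            arc_votes[(head, dep)] += 1
            label_votes.setdefault((head, dep), Counter())[label] += 1
    heads = chu_liu_edmonds(dict(arc_votes))
    return {d: (h, label_votes[(h, d)].most_common(1)[0][0])
            for d, h in heads.items()}
```

Decoding with a spanning-tree algorithm, rather than taking the majority head for each token independently, guarantees that the blended output is a well-formed tree even when the individual votes form a cycle.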
To enlarge the number of parsers taking part in the final ensemble, \imsoriginal trains multiple instances of each baseline parser using different random seeds: (1) eight \mate instances, (2) eight \abt instances which differ in the direction of parsing – four parse from left to right (\abt-l2r) and four from right to left (\abt-r2l), (3) eight \xiang instances which differ in the direction of parsing and in the word embeddings used – four use pre-trained embeddings (\xiang-l2r-embed, \xiang-r2l-embed) and four use randomly initialized embeddings (\xiang-l2r-rand, \xiang-r2l-rand).
The final component of the \imsoriginal system (\blend) selects the best possible blending setting. It checks all the possible combinations of the above-mentioned instances ( possibilities) and selects the one which achieves the highest LAS score on the development set. The original \imsoriginal limits the maximal number of instances of individual parsers to ensure that parsing will end within a restricted time. Since in the \poleval setting the parsing time was not limited we removed the time constraint from the search procedure \blend.
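A possible shape of such an exhaustive search is sketched below. It assumes that a blending setting is fully described by how many instances of each parser type take part; `dev_las` is a hypothetical callable that blends the chosen instances and returns the LAS on the development set. Neither name comes from the original system.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple


def search_best_blend(
    group_sizes: Dict[str, int],                 # parser type -> trained instances
    dev_las: Callable[[Dict[str, int]], float],  # hypothetical dev-set scorer
) -> Tuple[Optional[Dict[str, int]], float]:
    """Try every non-empty choice of instance counts and keep the best one."""
    names = sorted(group_sizes)
    best_combo: Optional[Dict[str, int]] = None
    best_score = float("-inf")
    for counts in product(*(range(group_sizes[n] + 1) for n in names)):
        if sum(counts) == 0:
            continue  # an ensemble needs at least one parser
        combo = dict(zip(names, counts))
        score = dev_las(combo)
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score
```

Under this assumption, a time limit such as the one in the original setting would correspond to capping the upper bound of each `range`.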
Finally, since the UD guidelines do not allow multiple root nodes, we re-attach all excessive root dependents in a chain manner, i.e., every root dependent is attached to the previous one.
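The re-attachment can be sketched in a few lines. The list-of-heads representation is an assumption made for the example, and the sketch leaves the labels of the re-attached arcs untouched, since the text above does not specify them.

```python
from typing import List


def reattach_extra_roots(heads: List[int]) -> List[int]:
    """heads[i] is the head of token i + 1; head 0 denotes the artificial root.

    The first root dependent is kept; every further one is attached to the
    previous root dependent, forming a chain.
    """
    fixed = list(heads)
    previous_root = None
    for index, head in enumerate(heads):
        if head != 0:
            continue
        if previous_root is not None:
            fixed[index] = previous_root
        previous_root = index + 1  # token ids are 1-based
    return fixed
```

For instance, a sentence in which tokens 1, 3, and 4 are all attached to the root ends up with token 3 attached to token 1 and token 4 attached to token 3.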
2.2 Evaluation of the components of the system
In this section we evaluate all the components of the submitted \imsnew system with the evaluation script provided by the ST organizers. We use the \udpipe 1.2 system (as provided by the ST organizers) as a baseline throughout all the steps.
Preprocessing and Supertags. We begin by evaluating the preprocessing components of our system on the development data (see Table 1). We find that \udpipe is much better at predicting lemmas than mate-tools, surpassing it by more than 10 points. In contrast, \marmot outperforms \udpipe on all the other tagging tasks, with the highest gain of more than two points on the task of predicting morphological features.
To see how the above-mentioned differences influence the parsing accuracy we run the baseline parsers (\mate, \abt, and \xiang) in four incremental settings: (1) using UPOS and morphological features predicted by \udpipe, (2) replacing UPOS and morphological features with \marmot’s predictions, (3) adding lemmas, (4) adding supertags. Table 2 shows LAS scores for the three baseline parsers for the consecutive experiments. Replacing \udpipe’s UPOS and morphological features with the predictions from \marmot improves accuracy by 0.42 points on average. The introduction of lemmas improves only the \mate parser and leads to minuscule improvements for the other two. The step which influences the final accuracy the most is the addition of supertags. It brings an additional 0.9 points on average (with the biggest gain for \abt of 1.54 points).
Parsing and Blending. Table 3 shows parsing results on the development set. The relation between baseline parsers (rows , , and ) is the same as in [1]: \mate is the strongest method, \abt ranks second, and \xiang performs the worst. All the baseline parsers surpass the \udpipe parser (row ) in terms of the LAS and MLAS measures. Since the measure BLEX uses lemmas and \udpipe is much better in terms of lemmatization, it achieves higher BLEX than the baseline parsers (in fact it achieves the highest BLEX across all the compared methods).
Rows and show results of two separate blends. \blendBl (row ) is an arbitrarily selected combination of 4+4+4 instances: four \mate instances, four \abt instances (two \abt-l2r and two \abt-r2l), and four \xiang instances (\xiang-l2r-rand, \xiang-r2l-rand, \xiang-l2r-embed, \xiang-l2r-embed). Comparing rows – with row we see that blending parsers yields a strong boost over the baselines, which corroborates the findings of [12, 1]. The blended accuracy surpasses the strongest baseline parser \mate by more than one point.
Finally, searching for the optimal combination yields an additional small improvement of 0.2 points. The best combination selected by the search contains: seven instances of \mate, three instances of \abt (two \abt-l2r and one \abt-r2l) and all the instances of \xiang.
3 Subtask (B): Beyond dependency tree
The goal of Subtask (B) was to predict labeled dependency graphs and semantic labels. The dependency graphs used in the ST were UD dependency trees extended with additional enhanced arcs. The arcs encoded shared dependents and shared governors of conjuncts. The semantic labels (e.g. Experiencer, Place, Condition) were used to annotate additional semantic meanings of tokens.
3.1 System description
Our submission to the Subtask (B) followed [13, 4] and carried out rule-based augmentation. The method consisted of two steps. First, we parsed all sentences to obtain surface dependency trees. Since the training data for Subtasks (A) and (B) was the same, we performed parsing with the same \blend system as described in Section 2.1. In the second step, we applied 12 simple rules to the predicted trees and augmented them with enhanced relations.
The rules of the system were designed manually and guided by intuition of a Polish native speaker while analyzing gold-standard graphs from the training part of the treebank. As the enhanced relations in the treebank mostly apply to conjuncts, i.e., tokens connected with the relation “conj” to their heads, our rules only apply to such tokens. We define two main rules: , which predicts additional heads, and , which adds enhanced children. The remaining 10 out of the 12 rules serve as filtering steps to improve the accuracy of the rule.
The rule introduces enhanced arcs for all the tokens whose head is “conj” and connects them to their grandparents (see Figure 1(a)). Figure 1(b) shows an example of a sentence where an enhanced arc was introduced by the rule. The word pracują (eng. they-work) received an additional head ROOT.
When introducing enhanced heads for “conj” tokens, this rule achieves an F-score of 99.40 on the gold-standard trees from the training data.
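The rule for enhanced heads can be sketched as follows. The tree representation is an assumption made for the example, as is the choice of label for the new arc: the sketch copies the relation that links the token's head to the grandparent, which the description above does not state explicitly.

```python
from typing import Dict, List, Tuple

Tree = Dict[int, Tuple[int, str]]   # token id -> (head id, relation); 0 is ROOT
EnhancedArc = Tuple[int, int, str]  # (head, dependent, relation)


def enhanced_heads(tree: Tree) -> List[EnhancedArc]:
    """Connect every conj token to its grandparent."""
    arcs: List[EnhancedArc] = []
    for token, (head, relation) in sorted(tree.items()):
        if relation != "conj" or head not in tree:
            continue
        grandparent, head_relation = tree[head]
        # Assumption: the new arc reuses the relation of the first conjunct.
        arcs.append((grandparent, token, head_relation))
    return arcs
```

When the first conjunct is the root of the sentence, the grandparent is the artificial ROOT node, which corresponds to the example above in which the second conjunct receives ROOT as an additional head.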
The rule adds all the siblings of a “conj” token as its dependents (see Figure 2(a)). Figure 2(b) shows an example of a sentence where an enhanced arc was introduced by the rule. The word zawsze (eng. always) is a sibling of the “conj” token przerażały (eng. terrified) and therefore got attached to it by an “advmod” arc.
When introducing enhanced children of “conj” tokens this rule alone is too generous. Although it has perfect recall on gold trees from the training data, it introduces a lot of incorrect arcs. It achieves a precision of only , resulting in an F-score of . We tackled this problem by designing 10 additional filtering rules which remove some suspicious arcs. Combined with the 10 filtering rules the rule achieves an F-score of on the gold trees from the training data. Below we give examples of three such rules: , , .
removes all the enhanced arcs with labels that are not among the ten most common ones: case, nsubj, mark, obl, advmod, amod, cop, obj, discourse:comment, advcl.
is the first of four filtering rules that remove enhanced arcs with label “advmod”. It applies to tokens which have their own “advmod” basic modifiers (see Figure 3(a)). The intuition is that if the token has its own adverbial modifier then most likely the modifier of its head does not refer to it. Figure 3(b) shows an example of a sentence where correctly removed an arc. Since the word miauknął (eng. meowed) has its own adverbial modifier znowu (eng. again) the enhanced arc to obok (eng. nearby) was removed.
When applied to the training data, this filter removed 105 enhanced arcs with an accuracy of 93%.
is the only filter which removes arcs with label “obj”. It applies when the enhanced “obj” modifier appears before the token in the sentence (see Figure 4(a)). The intuition is that in Polish “obj” modifiers tend to appear after both of the conjuncts. For example, in sentence Podziwiali i doceniali ją też uczniowie (id train-s4812; eng. Admired and appreciated her also students) the “obj” modifier ją (eng. her) appears after both of Podziwiali (eng. admired) and doceniali (eng. appreciated) and modifies both of them. In contrast, Figure 4(b) shows an example of a sentence where the filter correctly removed an arc. The rule introduced an arc from the token śpiewają (eng. they-sing) to ręce (eng. hands). But since the word ręce appears before śpiewają the arc was removed.
When applied to the training data, this filter removed 854 enhanced arcs with an accuracy of 96%.
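The rule for enhanced children and the three filters described above can be combined into one short sketch. It covers only these three of the ten filters, and the tree representation and function name are assumptions made for the example rather than the submitted implementation.

```python
from typing import Dict, List, Tuple

Tree = Dict[int, Tuple[int, str]]   # token id -> (head id, relation); 0 is ROOT
EnhancedArc = Tuple[int, int, str]  # (head, dependent, relation)

# The ten most common labels listed in the text above.
ALLOWED_LABELS = {"case", "nsubj", "mark", "obl", "advmod", "amod", "cop",
                  "obj", "discourse:comment", "advcl"}


def enhanced_children(tree: Tree) -> List[EnhancedArc]:
    """Attach the siblings of each conj token to it, then filter the arcs."""
    arcs: List[EnhancedArc] = []
    for token, (head, relation) in sorted(tree.items()):
        if relation != "conj":
            continue
        # Relations of the token's own basic dependents.
        own = {rel for (h, rel) in tree.values() if h == token}
        for sibling, (sib_head, sib_rel) in sorted(tree.items()):
            if sibling == token or sib_head != head:
                continue
            if sib_rel not in ALLOWED_LABELS:
                continue  # filter: label is not among the ten most common
            if sib_rel == "advmod" and "advmod" in own:
                continue  # filter: token has its own adverbial modifier
            if sib_rel == "obj" and sibling < token:
                continue  # filter: object precedes the conj token
            arcs.append((token, sibling, sib_rel))
    return arcs
```

Note that the label filter also discards siblings attached with "conj", "cc", or "punct", so other conjuncts and coordinating material are never shared.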
3.2 Evaluation of the rules
In this section we evaluate the rules on the development set to test if they generalize well. As a baseline we use the system without any rules, i.e., we run the evaluation script on trees without any enhanced arcs.
We start with oracle experiments and apply the rules to gold-standard trees (see Table 4; Column 2). In this scenario the baseline achieves a very high accuracy of ELAS. Adding the rule gives a big boost of almost 4 points, resulting in an ELAS of . As expected, the pure rule introduces too many incorrect arcs and considerably deteriorates the performance. All the consecutive filters (, , ) give small improvements, but together (see Table 4; the final row) they not only recover the drop caused by the rule but also improve the total accuracy by additional points.
Next, we analyze the situation when enhanced arcs are introduced on automatically parsed trees. We apply the rules to outputs of two systems: the strongest parsing baseline \mate and the full ensemble system \blend. As expected, replacing gold-standard trees with a parser’s predictions results in a big drop in performance: baseline accuracy decreases from to for \mate and for \blend. Apart from the lower starting point, the rules behave similarly to the setting with gold-standard trees: gives a big boost, causes a big drop in accuracy, while the 12 rules together perform better than alone. Finally, comparing the accuracy of \mate and \blend shows that the parsing accuracy directly translates into enhanced parsing accuracy – \blend surpasses \mate by in terms of LAS (cf. Table 3) and the advantage stays the same in terms of ELAS ( points).
4 Test Results
The final results on the test set are shown in Table 5. In Subtask (A) we ranked second in terms of LAS score () and MLAS score () and were behind the COMBO team by and points respectively. We achieved the third best result in terms of BLEX score due to our poor lemmatization accuracy. In Subtask (B) we ranked first with an ELAS score of . Since we did not predict any semantic labels our SLAS score can be treated as a baseline result of running the evaluation script only on trees.
5 Conclusion
We have presented the IMS contribution to \polevalFull.
In Subtask (A) we re-used our system from \conllstFull. We confirmed our previous findings that strong preprocessing, supertags, and the use of diverse parsers for blending are important factors influencing the parsing accuracy. We extended those findings to the PolEval treebank which was a new test case for the system. The treebank differs from traditional treebanks since it is mostly built from selected sentences containing difficult syntactic constructions, instead of being sampled from some source at random.
In Subtask (B) we extended the bulky ensemble system from Subtask (A) by a set of 12 simple rules predicting enhanced arcs. We showed that a successful rule-based augmentation strongly depends on the employed parsing system. As we have demonstrated, if perfect parsing is assumed (by using gold trees), the simple rules we have developed are able to achieve an extremely high ELAS, leaving little space for further improvements. However, since the rules are not built to handle parsing errors, the parsing accuracy directly translates into performance on predicting the enhanced arcs.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via the SFB 732, project D8.
References
1. Björkelund, A., Falenska, A., Yu, X., Kuhn, J.: IMS at the CoNLL 2017 UD Shared Task: CRFs and Perceptrons Meet Neural Networks. In: Proc. of CoNLL 2017 Shared Task. pp. 40–51 (2017)
2. Björkelund, A., Nivre, J.: Non-deterministic oracles for unrestricted non-projective transition-based dependency parsing. In: Proceedings of the 14th International Conference on Parsing Technologies. pp. 76–86. Association for Computational Linguistics, Bilbao, Spain (July 2015), http://www.aclweb.org/anthology/W15-2210
3. Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). pp. 89–97. Coling 2010 Organizing Committee, Beijing, China (August 2010), http://www.aclweb.org/anthology/C10-1011
4. Candito, M., Guillaume, B., Perrier, G., Seddah, D.: Enhanced UD dependencies with neutralized diathesis alternation. In: Depling 2017 - Fourth International Conference on Dependency Linguistics (2017)
5. Chu, Y., Liu, T.: On the shortest arborescence of a directed graph. Science Sinica 14, 1396–1400 (1965)
6. Edmonds, J.: Optimum branchings. Journal of Research of the National Bureau of Standards 71(B), 233–240 (1967)
7. Joshi, A.K., Bangalore, S.: Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing. In: Proceedings of the 15th Conference on Computational Linguistics - Volume 1. pp. 154–160. COLING ’94, Association for Computational Linguistics, Stroudsburg, PA, USA (1994). https://doi.org/10.3115/991886.991912
8. Müller, T., Schmid, H., Schütze, H.: Efficient Higher-Order CRFs for Morphological Tagging. In: Proceedings of EMNLP (2013)
9. Nivre, J., de Marneffe, M.C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., Zeman, D.: Universal Dependencies v1: A multilingual treebank collection. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). pp. 1659–1666. European Language Resources Association, Portorož, Slovenia (2016)
10. Ouchi, H., Duh, K., Matsumoto, Y.: Improving Dependency Parsers with Supertags. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers. pp. 154–158. Association for Computational Linguistics, Gothenburg, Sweden (April 2014), http://www.aclweb.org/anthology/E14-4030
11. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the reproducibility of PAN’s shared tasks: Plagiarism detection, author identification, and author profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014). https://doi.org/10.1007/978-3-319-11382-1_22
12. Sagae, K., Lavie, A.: Parser combination by reparsing. In: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. pp. 129–132. Association for Computational Linguistics, New York City, USA (June 2006), http://www.aclweb.org/anthology/N/N06/N06-2033
13. Schuster, S., Manning, C.D.: Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In: LREC. pp. 23–28. Portorož, Slovenia (2016)
14. Straka, M., Hajič, J., Straková, J.: UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association (ELRA), Paris, France (May 2016), http://www.lrec-conf.org/proceedings/lrec2016/pdf/873_Paper.pdf
15. Yu, X., Falenska, A., Vu, N.T.: A general-purpose tagger with convolutional neural networks. In: arXiv preprint arXiv:1706.01723 (2017)
16. Yu, X., Vu, N.T.: Character composition model with convolutional neural networks for dependency parsing on morphologically rich languages. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada (August 2017)
17. Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., et al.: CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics (2017)