# Global Transition-based Non-projective Dependency Parsing

###### Abstract

Shi, Huang, and Lee (2017a) obtained state-of-the-art results for English and Chinese dependency parsing by combining dynamic-programming implementations of transition-based dependency parsers with a minimal set of bidirectional LSTM features. However, their results were limited to projective parsing. In this paper, we extend their approach to support non-projectivity by providing the first practical implementation of the $MH_4$ algorithm, an $O(n^4)$ mildly non-projective dynamic-programming parser with very high coverage on non-projective treebanks. To make $MH_4$ compatible with minimal transition-based feature sets, we introduce a transition-based interpretation of it in which parser items are mapped to sequences of transitions. We thus obtain the first implementation of global decoding for non-projective transition-based parsing, and demonstrate empirically that it is more effective than its projective counterpart in parsing a number of highly non-projective languages.

## 1 Introduction

Transition-based dependency parsers are a popular approach to natural language parsing, as they achieve good results in terms of accuracy and efficiency (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Zhang and Nivre, 2011; Chen and Manning, 2014; Dyer et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016). Until very recently, practical implementations of transition-based parsing were limited to approximate inference, mainly in the form of greedy search or beam search. While cubic-time exact inference algorithms for several well-known projective transition systems had been known since the work of Huang and Sagae (2010) and Kuhlmann et al. (2011), they had been considered of theoretical interest only due to their incompatibility with rich feature models: incorporating complex features raised the asymptotic runtime complexity to impractical levels.

However, the recent popularization of bidirectional long short-term memory networks (bi-LSTMs; Hochreiter and Schmidhuber, 1997) to derive feature representations for parsing, given their capacity to capture long-range information, has demonstrated that one may not need to use complex feature models to obtain good accuracy (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016). In this context, Shi et al. (2017a) presented an implementation of the exact inference algorithms of Kuhlmann et al. (2011) with a minimal set of only two bi-LSTM-based feature vectors. This not only kept the complexity cubic, but also obtained state-of-the-art results in English and Chinese parsing.

While their approach provides both accurate parsing and the flexibility to use any of greedy, beam, or exact decoding with the same underlying transition systems, it does not support non-projectivity. Trees with crossing dependencies make up a significant portion of many treebanks, going as high as 63% for the Ancient Greek treebank in the Universal Dependencies (UD; http://universaldependencies.org/) dataset version 2.0 and averaging around 12% over all languages in UD 2.0. In this paper, we extend Shi et al.’s (2017a) approach to mildly non-projective parsing in what, to our knowledge, is the first implementation of exact decoding for a non-projective transition-based parser.

As in the projective case, a mildly non-projective decoder has been known for several years (Cohen et al., 2011), corresponding to a variant of the transition-based parser of Attardi (2006). However, its $O(n^7)$ runtime, or the $O(n^6)$ of a recently introduced improved-coverage variant (Shi et al., 2018), is still prohibitively costly in practice. Instead, we seek a more efficient algorithm to adapt, and thus develop a transition-based interpretation of Gómez-Rodríguez et al.’s (2011) $MH_4$ dynamic programming parser, which has been shown to provide very good non-projective coverage in $O(n^4)$ time (Gómez-Rodríguez, 2016). While the $MH_4$ parser was originally presented as a non-projective generalization of the dynamic program that later led to the arc-hybrid transition system (Gómez-Rodríguez et al., 2008; Kuhlmann et al., 2011), its own relation to transition-based parsing was not known. Here, we show that $MH_4$ can be interpreted as exploring a subset of the search space of a transition-based parser that generalizes the arc-hybrid system, under a mapping that differs from the “push computation” paradigm used by the previously-known dynamic-programming decoders for transition systems. This allows us to extend Shi et al.’s (2017a) work to non-projective parsing, by implementing $MH_4$ with a minimal set of transition-based features.

Experimental results show that our approach outperforms the projective approach of Shi et al. (2017a) and maximum-spanning-tree non-projective parsing on the most highly non-projective languages in the CoNLL 2017 shared-task data that have a single treebank. We also compare with the third-order 1-Endpoint-Crossing (1EC) parser of Pitler (2014), the only other practical implementation of an exact mildly non-projective decoder that we know of, which also runs in $O(n^4)$ but without a transition-based interpretation. We obtain comparable results for these two algorithms, in spite of the fact that the $MH_4$ algorithm is notably simpler than 1EC. The $MH_4$ parser remains effective in parsing projective treebanks, while our baseline parser, the fully non-projective maximum spanning tree algorithm, falls behind due to its unnecessarily large search space in parsing these languages. Our code, including our re-implementation of the third-order 1EC parser with neural scoring, is available at https://github.com/tzshi/mh4-parser-acl18.

## 2 Non-projective Dependency Parsing

In dependency grammar, syntactic structures are modeled as word-word asymmetrical subordinate relations among lexical entries (Kübler et al., 2009). These relations can be represented in a graph. For a sentence $w = w_1, \ldots, w_n$, we first define a corresponding set of nodes $\{0, 1, 2, \ldots, n\}$, where $0$ is an artificial node denoting the root of the sentence. Dependency relations are encoded by edges of the form $(h, m)$, where $h$ is the head and $m$ the modifier of the bilexical subordinate relation. (To simplify exposition here, we only consider the unlabeled case. We use a separately-trained labeling module to obtain labeled parsing results in §5.)

As is conventional, we assume two more properties on dependency structures. First, each word has exactly one syntactic head, and second, the structure is acyclic. As a consequence, the edges form a directed tree rooted at node $0$.

We say that a dependency structure is projective if it has no crossing edges. In the CoNLL and Stanford conversions of the English Penn Treebank, over 99% of the sentences are projective (Chen and Manning, 2014; see Fig. 1 for a non-projective English example), but for other languages’ treebanks, non-projectivity is a common occurrence (see Table 3 for some statistics). This paper is targeted at learning parsers that can handle non-projective dependency trees.
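The crossing-edge definition of projectivity can be checked directly. The following is a minimal sketch, assuming a tree is given as a list `heads` where `heads[m]` is the head of node `m` and index 0 is the artificial root; the function names are illustrative and not taken from the authors' code.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def crossing_pairs(heads: List[int]) -> List[Tuple[Edge, Edge]]:
    """Return all pairs of crossing edges; heads[0] (the root) is ignored."""
    # Store each edge as (left endpoint, right endpoint), direction is irrelevant here.
    edges = [(min(h, m), max(h, m)) for m, h in enumerate(heads) if m > 0]
    crossings = []
    for x in range(len(edges)):
        for y in range(x + 1, len(edges)):
            (a, b), (c, d) = edges[x], edges[y]
            # Two edges cross when exactly one endpoint of one lies strictly
            # inside the span of the other.
            if a < c < b < d or c < a < d < b:
                crossings.append((edges[x], edges[y]))
    return crossings


def is_projective(heads: List[int]) -> bool:
    """A structure is projective if it has no crossing edges."""
    return not crossing_pairs(heads)
```

For instance, `is_projective([-1, 2, 0, 2])` holds for a three-word sentence whose second word is the root's dependent, while a tree with arcs 0 → 2 and 3 → 1 is reported as non-projective.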

## 3 Deduction System and Its Underlying Transition System

### 3.1 The Deduction System

The $MH_4$ parser is the instantiation for $k = 4$ of Gómez-Rodríguez et al.’s (2011) more general $MH_k$ parser. $MH_k$ stands for “multi-headed with at most $k$ heads per item”: items in its deduction system take the form $[h_1, \ldots, h_m]$ for $m \leq k$, indicating the existence of a forest of $m$ dependency subtrees headed by $h_1, \ldots, h_m$ such that their yields are disjoint and the union of their yields is the contiguous substring $h_1 \ldots h_m$ of the input. Deduction steps, shown in Figure 2, can be used to join two such forests that have an endpoint in common via graph union (Combine); or to add a dependency arc to a forest that attaches an interior head as a dependent of any of the other heads (Link).

In the original formulation by Gómez-Rodríguez et al. (2011), all valid items of the form $[i, i+1]$ are considered to be axioms. In contrast, we follow Kuhlmann et al.’s (2011) treatment of $MH_3$: we consider $[0, 1]$ as the only axiom and include an extra Shift step to generate the rest of the items of that form. Both formulations are equivalent, but including this Shift rule facilitates giving the parser a transition-based interpretation.

Higher values of $k$ provide wider coverage of non-projective structures at an asymptotic runtime complexity of $O(n^k)$. When $k$ is at its minimum value of 3, the parser covers exactly the set of projective trees, and in fact, it can be seen as a transformation of the deduction system described in Gómez-Rodríguez et al. (2008) that gave rise to the projective arc-hybrid parser (Kuhlmann et al., 2011); formally, it is a step refinement (see Gómez-Rodríguez et al., 2011). For $k \geq 4$, the parser covers an increasingly larger set of non-projective structures. A simple characterization of these sets has been lacking, which is a common issue with parsers based on the general idea of arcs between non-contiguous heads, such as those deriving from Attardi (2006). Nevertheless, empirical evaluation on a large number of treebanks (Gómez-Rodríguez, 2016) has shown $MH_k$ to provide the best known tradeoff between asymptotic complexity and efficiency for $k > 4$. When $k = 4$, its coverage is second only to the 1-Endpoint-Crossing parser of Pitler (2014). Both parsers fully cover well over 80% of the non-projective trees observed in the studied treebanks.
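To make the deduction system concrete, the sketch below fills a chart with Shift, Combine, and Link for $k = 4$ and returns the best-scoring tree it can derive. It is a simplification under stated assumptions: arcs are scored directly by a matrix `score[h][m]` rather than by the transition-based scoring described later, every item $[i, i+1]$ is treated as an axiom, an extra node `n + 1` stands for the end of the input, and all names are hypothetical.

```python
from itertools import combinations

NEG_INF = float("-inf")


def mh4_decode(score):
    """score[h][m] is the score of arc h -> m over nodes 0..n (0 is the root).

    Returns (best total score, heads) over trees derivable with items of at
    most four heads; heads[m] is the head of m and heads[0] is None.
    """
    n = len(score) - 1
    end = n + 1  # sentinel for the end of the input (an assumption of this sketch)
    best, back = {}, {}

    def relax(item, value, pointer):
        if value > best.get(item, NEG_INF):
            best[item], back[item] = value, pointer

    def combine(left, right):
        # Join two forests that share an endpoint.
        if left in best and right in best:
            relax(left + right[1:], best[left] + best[right], ("combine", left, right))

    def link(item):
        # Attach an interior head to any other head of the item, then drop it.
        for j in range(1, len(item) - 1):
            for i in range(len(item)):
                head, dep = item[i], item[j]
                if i == j or head == end:
                    continue
                rest = item[:j] + item[j + 1:]
                relax(rest, best[item] + score[head][dep], ("link", item, head, dep))

    # Items are processed by increasing span; within a span, four-head items
    # come first so that their Link conclusions are final before reuse.
    for width in range(1, end + 1):
        for a in range(0, end - width + 1):
            d = a + width
            if width == 1:
                relax((a, d), 0.0, ("shift",))
            inner = range(a + 1, d)
            for b, c in combinations(inner, 2):
                combine((a, b), (b, c, d))
                combine((a, b, c), (c, d))
                if (a, b, c, d) in best:
                    link((a, b, c, d))
            for b in inner:
                combine((a, b), (b, d))
                if (a, b, d) in best:
                    link((a, b, d))

    goal = (0, end)
    if goal not in best:
        return NEG_INF, None
    heads = [None] * (n + 1)
    agenda = [goal]
    while agenda:  # follow back-pointers to read off the arcs
        pointer = back[agenda.pop()]
        if pointer[0] == "link":
            _, item, head, dep = pointer
            heads[dep] = head
            agenda.append(item)
        elif pointer[0] == "combine":
            agenda.extend(pointer[1:])
    return best[goal], heads
```

The four nested index loops give the $O(n^4)$ behavior discussed above, since each item triggers only a constant number of Combine and Link applications.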

### 3.2 The Transition System

Kuhlmann et al. (2011) show how the items of a variant of $MH_3$ can be given a transition-based interpretation under the “push computation” framework, yielding the arc-hybrid projective transition system. However, such a derivation has not been made for the non-projective case ($k > 3$), and the known techniques used to derive previous associations between tabular and transition-based parsers do not seem to be applicable in this case. The specific issue is that the deduction systems of Kuhlmann et al. (2011) and Cohen et al. (2011) have in common that the structure of their derivations is similar to that of a Dyck (or balanced-brackets) language, where steps corresponding to shift transitions are balanced with those corresponding to reduce transitions. This makes it possible to group derivation subtrees, and the transition sequences that they yield, into “push computations” that increase the length of the stack by a constant amount. However, this does not seem possible in $MH_4$.

Instead, we derive a transition-based interpretation of $MH_4$ by a generalization of that of $MH_3$ that departs from push computations.

To do so, we start with the $MH_3$ interpretation of an item $[i, j]$ given by Kuhlmann et al. (2011). This item represents a set of computations (transition sequences) that start from a configuration of the form $(\sigma, i \mid \beta, A)$ (where $\sigma$ is the stack and $i \mid \beta$ is the buffer, with $i$ being the first buffer node) and take the parser to a configuration of the form $(\sigma \mid i, j \mid \beta', A')$. That is, the computation has the net effect of placing node $i$ on top of the previous contents of the stack, and it ends in a state where the first buffer element is $j$.

Under this item semantics, the Combine deduction step of the $MH_3$ parser (i.e., the instantiation of the one in Fig. 2 for $k = 3$) simply concatenates transition sequences. The Shift step generates a sequence with a single arc-hybrid $\mathsf{sh}$ transition, and the two possible instantiations of the Link step when $k = 3$ take the antecedent transition sequence and add a transition to it, namely, one of the two arc-hybrid reduce transitions.

When these reduce transitions are written in the context of the node indexes used in Figure 2, the node that takes no part in the new arc (the second stack node and the first buffer node, respectively) can be simplified out to obtain the well-known arc-hybrid transitions.
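As a reference point, the arc-hybrid transitions can be written out as follows. This is a sketch in the usual configuration notation $(\sigma, \beta, A)$; the head names $h_1, h_2, h_3$ are assumed to match the indexes of a three-head item and are not copied from the original figure. With the full context of an item $[h_1, h_2, h_3]$:

$$
\begin{aligned}
\mathsf{sh}:\ & (\sigma,\ h \mid \beta,\ A) \vdash (\sigma \mid h,\ \beta,\ A)\\
\mathsf{la}:\ & (\sigma \mid h_1 \mid h_2,\ h_3 \mid \beta,\ A) \vdash (\sigma \mid h_1,\ h_3 \mid \beta,\ A \cup \{h_3 \to h_2\})\\
\mathsf{ra}:\ & (\sigma \mid h_1 \mid h_2,\ h_3 \mid \beta,\ A) \vdash (\sigma \mid h_1,\ h_3 \mid \beta,\ A \cup \{h_1 \to h_2\})
\end{aligned}
$$

Dropping the node that plays no role in each arc gives the familiar forms:

$$
\begin{aligned}
\mathsf{la}:\ & (\sigma \mid h_2,\ h_3 \mid \beta,\ A) \vdash (\sigma,\ h_3 \mid \beta,\ A \cup \{h_3 \to h_2\})\\
\mathsf{ra}:\ & (\sigma \mid h_1 \mid h_2,\ \beta,\ A) \vdash (\sigma \mid h_1,\ \beta,\ A \cup \{h_1 \to h_2\})
\end{aligned}
$$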

Now, we assume the following generalization of the item semantics: an item $[h_1, \ldots, h_m]$ represents a set of computations that start from a configuration of the form $(\sigma, h_1 \mid \beta, A)$ and lead to a configuration of the form $(\sigma \mid h_1 \mid \cdots \mid h_{m-1}, h_m \mid \beta', A')$. Note that this generalization no longer follows the “push computation” paradigm of Kuhlmann et al. (2011) and Cohen et al. (2011) because the number of nodes pushed onto the stack depends on the value of $m$.

Under this item semantics, the Shift and Combine steps have the same interpretation as for $MH_3$. In the case of the Link step, following the same reasoning as for the $MH_3$ case, we obtain one reduce transition for each way of attaching an interior head of the item to one of its other heads.

These transitions give us the $MH_4$ transition system: a parser with four projective reduce transitions ($\mathsf{la}$, $\mathsf{ra}$, $\mathsf{la'}$, $\mathsf{ra'}$) and two Attardi-like, non-adjacent-arc reduce transitions ($\mathsf{la_2}$ and $\mathsf{ra_2}$).
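Spelled out for an item $[h_1, h_2, h_3, h_4]$, whose computations end with $h_1, h_2, h_3$ on the stack and $h_4$ at the front of the buffer, the six reduce transitions can be sketched as below. They are derived here from the item semantics and the Link step; the assignment of names to individual transitions is an assumption based on arc direction and adjacency.

$$
\begin{aligned}
\mathsf{la}:\ & (\sigma \mid h_3,\ h_4 \mid \beta,\ A) \vdash (\sigma,\ h_4 \mid \beta,\ A \cup \{h_4 \to h_3\})\\
\mathsf{ra}:\ & (\sigma \mid h_2 \mid h_3,\ \beta,\ A) \vdash (\sigma \mid h_2,\ \beta,\ A \cup \{h_2 \to h_3\})\\
\mathsf{la'}:\ & (\sigma \mid h_2 \mid h_3,\ \beta,\ A) \vdash (\sigma \mid h_3,\ \beta,\ A \cup \{h_3 \to h_2\})\\
\mathsf{ra'}:\ & (\sigma \mid h_1 \mid h_2 \mid h_3,\ \beta,\ A) \vdash (\sigma \mid h_1 \mid h_3,\ \beta,\ A \cup \{h_1 \to h_2\})\\
\mathsf{la_2}:\ & (\sigma \mid h_2 \mid h_3,\ h_4 \mid \beta,\ A) \vdash (\sigma \mid h_3,\ h_4 \mid \beta,\ A \cup \{h_4 \to h_2\})\\
\mathsf{ra_2}:\ & (\sigma \mid h_1 \mid h_2 \mid h_3,\ \beta,\ A) \vdash (\sigma \mid h_1 \mid h_2,\ \beta,\ A \cup \{h_1 \to h_3\})
\end{aligned}
$$

The first four create arcs between neighboring positions of the configuration, while the last two skip over one stack node, which is what allows crossing arcs.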

It is worth mentioning that the transition system we have obtained is the same as one of the variants of Attardi’s algorithm introduced by Shi et al. (2018), there called All. However, in that paper they show that it can be tabularized in $O(n^6)$ using the push computation framework. Here, we have derived it as an interpretation of the $MH_4$ parser.

However, in this case the dynamic programming algorithm does not cover the full search space of the transition system: while each item in the $MH_4$ parser can be mapped into a computation of this $MH_4$ transition-based parser, the opposite is not true. Consider the tree over nodes 0 to 5 with arcs $0 \to 2$, $2 \to 4$, $4 \to 5$, $5 \to 3$, and $3 \to 1$. It can be parsed by the transition system, but it is not covered by the dynamic programming algorithm, as no deduction sequence will yield an item representing the corresponding transition sequence. As we will see, this issue will not prevent us from implementing a dynamic-programming parser with transition-based scoring functions, or from achieving good practical accuracy.
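A small simulator makes it easy to check that the transition system reaches this tree. The sketch below assumes the reading of the six reduce transitions used in this section (each names a dependent position and a head position among $s_2$, $s_1$, $s_0$, $b_0$); the transition sequence in the usage note was derived by hand for illustration and is not necessarily the one the authors list.

```python
# Each reduce transition maps to (dependent position, head position).
REDUCE = {
    "la": ("s0", "b0"),
    "ra": ("s0", "s1"),
    "la'": ("s1", "s0"),
    "ra'": ("s1", "s2"),
    "la2": ("s1", "b0"),
    "ra2": ("s0", "s2"),
}


def node_at(position, stack, buffer):
    """Look up the node at 'b0' or at stack position 's0', 's1', 's2'."""
    if position == "b0":
        return buffer[0]
    return stack[-1 - int(position[1:])]  # raises IndexError if the stack is too short


def run(transitions, n):
    """Run a transition sequence on nodes 0..n; returns (stack, buffer, arcs).

    arcs maps each dependent to its head.
    """
    stack, buffer, arcs = [], list(range(n + 1)), {}
    for transition in transitions:
        if transition == "sh":
            stack.append(buffer.pop(0))
            continue
        dep_pos, head_pos = REDUCE[transition]
        dep = node_at(dep_pos, stack, buffer)
        head = node_at(head_pos, stack, buffer)
        arcs[dep] = head
        stack.remove(dep)  # the dependent leaves the stack once attached
    return stack, buffer, arcs
```

Running `run("sh sh sh la2 sh sh la2 sh ra ra ra".split(), 5)` ends with only the root on the stack, an empty buffer, and exactly the five arcs of the example tree.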

## 4 Model

Given the transition-based interpretation of the $MH_4$ system, the learning objective becomes to find a computation that gives the gold-standard parse. For each sentence $w$, we train parsers to produce the transition sequence $\mathbf{t}^*$ that corresponds to the annotated dependency structure. Thus, the model consists of two components: a parameterized scorer $S(\cdot)$, and a decoder that finds a sequence $\hat{\mathbf{t}}$ as prediction based on the scoring.

As discussed by Shi et al. (2017a), there exists some tension between rich-feature scoring models and choices of decoders. Ideally, a globally-optimal decoder finds the maximum-scoring transition sequence without brute-force searching the exponentially-large output space. To keep the runtime of our exact decoder at a practical low-order polynomial, we want its feature set to be minimal, consulting as few stack and buffer positions as possible. In what follows, we use $s_0$ and $s_1$ to denote the top two stack items and $b_0$ and $b_1$ to denote the first two buffer items.

### 4.1 Scoring and Minimal Features

This section empirically explores the lower limit on the number of necessary positional features. We experiment with both local and global decoding strategies. The parsers take features extracted from parser configuration $c$, and score each valid transition $t$ with $S(t; c)$. The local parsers greedily take transitions with the highest score until termination, while the global parsers use the scores to find the globally-optimal solutions $\hat{\mathbf{t}} = \arg\max_{\mathbf{t}} S(\mathbf{t})$, where $S(\mathbf{t})$ is the sum of scores for the component transitions.

Following prior work, we employ bi-LSTMs for compact feature representation. A bi-LSTM runs in both directions on the input sentence, and assigns a context-sensitive vector encoding to each token in the sentence. When we need to extract features from a particular stack or buffer position, say $s_0$, we directly use the bi-LSTM vector of the token at index $i_{s_0}$, where $i_{s_0}$ gives the index of the subroot of $s_0$ into the sentence.
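The lookup described above amounts to indexing per-token vectors by the nodes currently at the chosen positions. A minimal sketch follows, in which plain lists of floats stand in for the bi-LSTM outputs and missing positions are padded with zeros; both choices, and the function names, are assumptions made for illustration.

```python
def node_at(position, stack, buffer):
    """Return the node at a position such as 's0', 's1' or 'b0', or None if absent."""
    source, offset = position[0], int(position[1:])
    if source == "s":
        return stack[-1 - offset] if offset < len(stack) else None
    return buffer[offset] if offset < len(buffer) else None


def extract_features(token_vectors, stack, buffer, positions=("s0", "b0")):
    """Concatenate the context vectors of the nodes at the requested positions.

    token_vectors[i] stands in for the bi-LSTM output at sentence index i.
    """
    dim = len(token_vectors[0])
    features = []
    for position in positions:
        node = node_at(position, stack, buffer)
        # Zero padding for empty positions is an assumption of this sketch.
        features.extend(token_vectors[node] if node is not None else [0.0] * dim)
    return features
```

With the default positions, the feature vector has twice the dimension of a single token vector, independent of sentence length or stack depth.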

Shi et al. (2017a) showed that feature vectors $\{s_0, b_0\}$ suffice for $MH_3$. Table 1 and Table 2 show the use of small feature sets for $MH_4$, for local and global parsing models, respectively. For a local parser to exhibit decent performance, we need at least $\{s_1, s_0, b_0\}$, but adding further positions on top of that does not show any significant impact on the performance. Interestingly, in the case of global models, the two-vector feature set $\{s_0, b_0\}$ already suffices. Adding $s_1$ to the global setting (column “Hybrid” in Table 2) seems attractive, but entails resolving a technical challenge that we discuss in the following section.

Table 1: UAS of local $MH_4$ parsing models with small feature sets.

### 4.2 Global Decoder

Table 2: UAS of global $MH_4$ parsing models with small feature sets, including the “Hybrid” setting.

In our transition-system interpretation of $MH_4$, $\mathsf{sh}$ transitions correspond to Shift and reduce transitions reflect the Link steps. Since the Shift conclusions lose the contexts needed to score the transitions, we set the scores for all Shift rules to zero and delegate the scoring of the $\mathsf{sh}$ transitions to the Combine steps, as in Shi et al. (2017a). In a Combine step, the transition sequence denoted by the second antecedent starts from a $\mathsf{sh}$, and the two heads taking the $s_0$ and $b_0$ positions at that point are both indexed by the first antecedent. If we further wish to access $s_1$, such information is not readily available in the deduction step, apparently requiring extra bookkeeping that pushes the space and time complexity to impractical levels. But consider the scoring for the reduce transitions in the Link steps.

The deduction steps already keep an index for $s_1$ among the heads of the antecedent item and thus provide direct access without any modification. To resolve the conflict between including $s_1$ for richer representations and the unavailability of $s_1$ in scoring the $\mathsf{sh}$ transitions in the Combine steps, we propose a hybrid scoring approach: we use features $\{s_0, b_0\}$ when scoring a $\mathsf{sh}$ transition, and features $\{s_1, s_0, b_0\}$ for consideration of reduce transitions. We call this method $MH_4$-hybrid, in contrast to $MH_4$-two, where we simply take $\{s_0, b_0\}$ for scoring all transitions.
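The bookkeeping behind the hybrid scheme can be sketched as follows: which positions feed the scorer for each transition type, and which item heads occupy those positions in a Combine or Link step. Items are represented as tuples of head indexes; the helper names and the mode strings `"two"` and `"hybrid"` are hypothetical labels for the two settings described above.

```python
SHIFT = "sh"


def feature_positions(transition, mode="hybrid"):
    """Positions consulted when scoring a transition under each scheme."""
    if mode == "two" or transition == SHIFT:
        return ("s0", "b0")
    return ("s1", "s0", "b0")  # reduce transitions in the hybrid scheme


def shift_context(left, right):
    """Heads at s0 and b0 when the sequence of `right` starts with a shift.

    `left` and `right` are the antecedents of a Combine step and must share
    an endpoint; both positions are indexed by `left` alone.
    """
    if left[-1] != right[0]:
        raise ValueError("Combine needs a shared endpoint")
    return {"s0": left[-2], "b0": left[-1]}


def reduce_context(item):
    """Heads at s1, s0 and b0 for a Link step applied to `item` (3 or 4 heads)."""
    return {"s1": item[-3], "s0": item[-2], "b0": item[-1]}
```

For a four-head item $[h_1, h_2, h_3, h_4]$ the $s_1$ position is $h_2$, and for a three-head item $[h_1, h_2, h_3]$ it is $h_1$, so no extra index has to be stored in either case.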

### 4.3 Large-Margin Training

We train the greedy parsers with hinge loss, and the global parsers with its structured version (Taskar et al., 2005). The loss function for each sentence compares the score of the gold-standard sequence $\mathbf{t}^*$ with that of the highest-scoring margin-augmented sequence $\hat{\mathbf{t}}$, where the margin counts the number of mis-attached nodes for taking sequence $\hat{\mathbf{t}}$ instead of $\mathbf{t}^*$. Minimizing this loss can be thought of as optimizing for the attachment scores.

The calculation of the above loss function can be solved as efficiently as the deduction system if the margin function decomposes into the dynamic program. We achieve this by replacing the scoring of each reduce step by its cost-augmented version, which adds the margin contributed by the arc that the step creates. This loss function encourages the model to give higher contrast between gold-standard and wrong predictions, yielding better generalization results.
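One standard way to write such a loss is $\max_{\hat{\mathbf{t}}} \left[ S(\hat{\mathbf{t}}) + \mathrm{cost}(\mathbf{t}^*, \hat{\mathbf{t}}) \right] - S(\mathbf{t}^*)$; this form is assumed here rather than quoted. The sketch below illustrates it with an explicit list of candidate trees standing in for the decoder, and shows the per-arc cost augmentation that lets the margin decompose over reduce steps. All names are illustrative.

```python
def cost_augmented(arc_score, head, dep, gold_heads):
    """Score of a reduce step creating head -> dep, plus 1 if the arc is wrong."""
    return arc_score + (0.0 if gold_heads[dep] == head else 1.0)


def attachment_cost(gold_heads, pred_heads):
    """Number of mis-attached nodes (index 0 is the root and is skipped)."""
    return sum(1 for g, p in zip(gold_heads[1:], pred_heads[1:]) if g != p)


def structured_hinge(gold_score, gold_heads, candidates):
    """Margin-augmented maximum over candidates minus the gold score.

    candidates is a list of (model score, heads) pairs and is assumed to
    include the gold tree, which keeps the loss non-negative.
    """
    augmented = max(s + attachment_cost(gold_heads, h) for s, h in candidates)
    return augmented - gold_score
```

Because `attachment_cost` is a sum over individual arcs, adding `cost_augmented` at each reduce step inside the dynamic program yields the same maximum as scoring whole trees and adding the cost afterwards.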

## 5 Experiments

#### Data and Evaluation

We experiment with the Universal Dependencies (UD) 2.0 dataset used for the CoNLL 2017 shared task (Zeman et al., 2017). We restrict our choice of languages to those with only one training treebank, for a better comparison with the shared task results. (When multiple treebanks are available, one can develop domain transfer strategies, which is not the focus of this work.) Among these languages, we pick the 10 most non-projective ones. Their basic statistics are listed in Table 3. For all development-set results, we assume gold-standard tokenization and sentence delimitation. When comparing to the shared task results on test sets, we use the provided baseline UDPipe (Straka et al., 2016) segmentation. Our models do not use part-of-speech tags or morphological tags as features, but rather leverage such information via stack propagation (Zhang and Weiss, 2016), i.e., we learn to predict them as a secondary training objective. We report unlabeled attachment F1-scores (UAS) on the development sets for better focus on comparing our (unlabeled) parsing modules. We report its labeled variant (LAS), the main metric of the shared task, on the test sets. For each experiment setting, we ran the model with different random initializations, and report the mean and standard deviation. We provide implementation details in the supplementary material.
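For reference, the two metrics can be computed as below when system and gold tokens align one-to-one, as with gold-standard tokenization, in which case the F1-score coincides with plain accuracy. This is a simplified sketch with a hypothetical input format of `(head, label)` pairs per word; it is not the shared task's official evaluation script.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for aligned lists of (head, label) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    # UAS counts correct heads; LAS additionally requires the correct label.
    head_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    both_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * both_hits / total
```

When tokenization is predicted rather than given, precision and recall differ and an alignment step is needed before counting, which this sketch leaves out.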

Table 3: Selected languages and their codes (columns in the full table: # Sent., # Words, and sentence and edge coverage (%) for Proj. and 1EC).

Language | Code |
---|---|
Basque | eu |
Urdu | ur |
Gothic | got |
Hungarian | hu |
Old Church Slavonic | cu |
Danish | da |
Greek | el |
Hindi | hi |
German | de |
Romanian | ro |

#### Baseline Systems

For comparison, we include three baseline systems with the same underlying feature representations and scoring paradigm. All the following baseline systems are trained with the cost-augmented large-margin loss function.

The $MH_3$ parser is the projective instantiation of the $MH_k$ parser family. This corresponds to the global version of the arc-hybrid transition system (Kuhlmann et al., 2011). We adopt the minimal feature representation $\{s_0, b_0\}$, following Shi et al. (2017a). For this model, we also implement a greedy incremental version.

The edge-factored non-projective maximal spanning tree (MST) parser allows arbitrary non-projective structures. This decoding approach has been shown to be very competitive in parsing non-projective treebanks (McDonald et al., 2005), and was deployed in the top-performing system at the CoNLL 2017 shared task (Dozat et al., 2017). We score each edge individually, with the features being the bi-LSTM vectors of $h$ and $m$, where $h$ is the head, and $m$ the modifier of the edge.

The crossing-sensitive third-order 1EC parser provides a hybrid dynamic program for parsing 1-Endpoint-Crossing non-projective dependency trees with higher-order factorization (Pitler, 2014). Depending on whether an edge is crossed, we can access the modifier’s grandparent, head, and sibling. We take their corresponding bi-LSTM features for scoring each edge. This is a re-implementation of Pitler (2014) with neural scoring functions.

#### Main Results

Global Models | Greedy Models | ||||||
---|---|---|---|---|---|---|---|

Lan. | MST | $MH_4$-two | $MH_4$-hybrid | 1EC | |||

eu | |||||||

ur | |||||||

got | |||||||

hu | |||||||

cu | |||||||

da | |||||||

el | |||||||

hi |