A Transition-based Algorithm for Unrestricted AMR Parsing
Non-projective parsing can be useful to handle cycles and reentrancy in amr graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition system itself. The model is evaluated on the LDC2015E86 corpus, obtaining results close to the state of the art, including a Smatch of 64%, and showing good behavior on reentrant edges.
David Vilares Universidade da Coruña FASTPARSE Lab, LyS Group Departamento de Computación Campus de A Elviña s/n, 15071 A Coruña, Spain email@example.com Carlos Gómez-Rodríguez Universidade da Coruña FASTPARSE Lab, LyS Group Departamento de Computación Campus de A Elviña s/n, 15071 A Coruña, Spain firstname.lastname@example.org
Abstract Meaning Representation (amr) is a semantic representation language that maps the meaning of English sentences into directed, cyclic, labeled graphs (banarescu2013abstract). Graph vertices are concepts inferred from words. The concepts can be represented by the words themselves (e.g. dog), PropBank framesets (palmer2005proposition) (e.g. eat-01), or keywords (like named entities or quantities). The edges denote relations between pairs of concepts (e.g. eat-01 :ARG0 dog). amr parsing integrates tasks that have usually been addressed separately in natural language processing (nlp), such as named entity recognition (nadeau2007survey), semantic role labeling (palmer2010semantic) or co-reference resolution (ng2002improving; lee2017scaffolding). Figure 1 shows an example of an amr graph.
Several transition-based dependency parsing algorithms have been extended to generate amr. wang2015transition describe a two-stage model, where they first obtain the dependency parse of a sentence and then transform it into a graph. damonte-cohen-satta:2017:EACLlong propose a variant of the arc-eager algorithm to identify labeled edges between concepts. These concepts are identified using a lookup table and a set of rules. A restricted subset of reentrant edges is supported by an additional classifier. A similar configuration is used by gildea-satta-cl17 and peng-aaai18, but relying on a cache data structure to handle reentrancy, cycles and restricted non-projectivity; a feed-forward network and additional hooks are used to build the concepts. ballesteros-alonaizan:2017:EMNLP2017 use a modified arc-standard algorithm, where the oracle is trained using stack-lstms (dyer-EtAl:2015:ACL-IJCNLP). Reentrancy is handled through swap (nivre2009non), and they define additional transitions intended to detect concepts, entities and polarity nodes.
This paper explores unrestricted non-projective amr parsing and introduces amr-covington, inspired by covington2001fundamental. It handles arbitrary non-projectivity, cycles and reentrancy in a natural way, as there is no need for specific transitions, but just the removal of restrictions from the original algorithm. The algorithm has full coverage and keeps transitions simple, which is a matter of concern in recent studies (peng-aaai18).
2 Preliminaries and Notation
We use typewriter font for concepts and their indexes (e.g. dog or 1), regular font for raw words (e.g. dog or 1), and a bold font for vectors and matrices (e.g. $\mathbf{w}$, $\mathbf{W}$).
covington2001fundamental describes a fundamental algorithm for unrestricted non-projective dependency parsing. The algorithm can be implemented as a left-to-right transition system (nivre2008algorithms). The key idea is intuitive. Given a word to be processed at a particular state, the word is compared against the words that have previously been processed, deciding whether or not to establish a syntactic dependency arc from/to each of them. The process continues until all previous words are checked or until the algorithm decides no more connections with previous words need to be built, and then the next word is processed. The runtime is $O(n^2)$ in the worst scenario. To guarantee the single-head and acyclicity conditions that are required in dependency parsing, explicit tests are added to the algorithm to check for transitions that would break the constraints. These are then disallowed, making the implementation less straightforward.
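As a sketch, the left-to-right strategy with the explicit single-head and acyclicity tests might look like the following (a hypothetical illustration with a toy oracle, not Covington's original code):

```python
def covington_parse(words, oracle):
    """Covington-style left-to-right parse: each new word j is compared
    against every earlier word i, and the oracle decides whether to add
    an arc between them and in which direction."""
    arcs = set()   # (head, dependent) index pairs
    head = {}      # dependent -> head, to enforce the single-head constraint

    def creates_cycle(h, d):
        # Adding h -> d closes a cycle iff d is already an ancestor of h.
        node = h
        while node in head:
            node = head[node]
            if node == d:
                return True
        return False

    for j in range(len(words)):
        for i in range(j - 1, -1, -1):
            action = oracle(words, arcs, i, j)
            # Explicit checks keep the structure a forest; it is exactly
            # these checks that unrestricted amr parsing can drop.
            if action == "arc-to-left" and i not in head and not creates_cycle(j, i):
                arcs.add((j, i))   # head j, dependent i
                head[i] = j
            elif action == "arc-to-right" and j not in head and not creates_cycle(i, j):
                arcs.add((i, j))   # head i, dependent j
                head[j] = i
    return arcs

def chain_oracle(words, arcs, i, j):
    # toy oracle: attach each word to its left neighbour
    return "arc-to-right" if i == j - 1 else "no-arc"

# sorted(covington_parse(["the", "dog", "barks"], chain_oracle)) -> [(0, 1), (1, 2)]
```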
3 The amr-Covington algorithm
The acyclicity and single-head constraints are not needed in amr, as arbitrary graphs are allowed. Cycles and reentrancy are used to model semantic relations between concepts (as shown in Figure 1) and to identify co-references. By removing the constraints from the Covington transition system, we achieve a natural way to deal with them.¹
¹This is roughly equivalent to going back to the naive parser called ESH in (covington2001fundamental), which has not seen practical use in parsing due to the lack of these constraints.
Also, amr parsing requires words to be transformed into concepts. Dependency parsing operates on a constant-length sequence, where each word becomes exactly one node. But in amr, words can be removed, generate a single concept, or generate several concepts. In this paper, additional lookup tables and transitions are defined to create concepts when needed, following the current trend (damonte-cohen-satta:2017:EACLlong; ballesteros-alonaizan:2017:EMNLP2017; gildea-satta-cl17).
Let $G = (V, E)$ be an edge-labeled directed graph where $V$ is the set of concepts and $E$ is the set of labeled edges. We will denote a connection between a head concept $h$ and a dependent concept $d$ as $h \xrightarrow{l} d$, where $l$ is the semantic label connecting them.
The parser will process sentences from left to right. Each decision leads to a new parsing configuration, which can be abstracted as a 4-tuple $(\lambda_1, \lambda_2, \beta, E)$ where:
$\beta$ is a buffer that contains unprocessed words. They await being transformed into a concept, becoming part of a larger concept, or being removed. In $w|\beta$, $w$ represents the head of $\beta$, and it can optionally be a concept; in that case, it will be denoted as b.
$\lambda_1$ is a list of previously created concepts that are waiting to determine their semantic relation with respect to b. Elements in $\lambda_1$ are concepts. In $\lambda_1|i$, i denotes its last element.
$\lambda_2$ is a list that contains previously created concepts for which the relation with b has already been determined. Elements in $\lambda_2$ are concepts. In $j|\lambda_2$, j denotes the head of $\lambda_2$.
$E$ is the set of the created edges.
Given an input sentence, the parser starts at an initial configuration $c_0 = ([\,], [\,], w_1|\ldots|w_n, \{\})$ and will apply valid transitions until a final configuration $c_f = (\lambda_1, \lambda_2, [\,], E)$ is reached. The set of transitions is formally defined in Table 1:
left-arc: Creates an edge $b \xrightarrow{l} i$. i is moved to $\lambda_2$.
right-arc: Creates an edge $i \xrightarrow{l} b$. i is moved to $\lambda_2$.
shift: Pops b from $\beta$. $\lambda_2$ and b are appended to $\lambda_1$, and $\lambda_2$ is emptied.
no-arc: It is applied when the algorithm determines that there is no semantic relationship between i and b, but there is a relationship between some other node in $\lambda_1$ and b. i is moved to $\lambda_2$.
confirm: Pops $w$ from $\beta$ and puts the concept b in its place. This transition is called to handle words that only need to generate one (more) concept.
breakdown: Creates a concept b from $w$, and places it on top of $\beta$, but $w$ is not popped, so the new buffer state is $b|w|\beta'$. It is used to handle a word that is going to be mapped to multiple concepts. To guarantee termination, breakdown is parametrized with a constant $k$, banning the generation of more than $k$ consecutive concepts by using this operation. Otherwise, concepts could be generated indefinitely without emptying $\beta$.
reduce: Pops $w$ from $\beta$. It is used to remove words that do not add any meaning to the sentence and are not part of the amr graph.
left-arc and right-arc handle cycles and reentrancy, with the exception of cycles of length 2 (which involve only i and b). To assure full coverage, we include an additional transition, multiple-arc, that creates two edges $b \xrightarrow{l} i$ and $i \xrightarrow{l'} b$ and moves i to $\lambda_2$. multiple-arcs are marginal and will not be learned in practice. amr-covington can be implemented without multiple-arc, by keeping i in $\lambda_1$ after creating an arc and using no-arc once the parser has finished creating connections between i and b, at a cost to efficiency, as transition sequences would be longer. Multiple edges in the same direction between i and b are handled by representing them as a single edge that merges the labels.
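The arc transitions above can be sketched as list operations over a configuration. The following is our own simplified illustration (not the authors' implementation): it omits confirm, breakdown and reduce, and assumes the buffer already holds concepts. Because no single-head or acyclicity check is performed, reentrant edges arise naturally:

```python
class Config:
    """Parser configuration (lambda_1, lambda_2, beta, E)."""
    def __init__(self, buffer):
        self.l1 = []                # concepts still to compare with b
        self.l2 = []                # concepts already compared with b
        self.buffer = list(buffer)  # simplification: already-built concepts
        self.edges = set()          # (head, label, dependent) triples

def left_arc(c, label):
    i, b = c.l1[-1], c.buffer[0]
    c.edges.add((b, label, i))      # head b, dependent i; no single-head check
    c.l2.insert(0, c.l1.pop())      # move i to lambda_2

def right_arc(c, label):
    i, b = c.l1[-1], c.buffer[0]
    c.edges.add((i, label, b))      # head i, dependent b; cycles are allowed
    c.l2.insert(0, c.l1.pop())

def no_arc(c):
    c.l2.insert(0, c.l1.pop())      # skip i, keep comparing b with lambda_1

def shift(c):
    b = c.buffer.pop(0)
    c.l1 = c.l1 + c.l2 + [b]        # new lambda_1 = lambda_1 . lambda_2 . b
    c.l2 = []

# "the dog wants to eat": dog is ARG0 of both want-01 and eat-01 (reentrancy)
c = Config(["dog", "eat-01", "want-01"])
shift(c)                  # dog -> lambda_1
left_arc(c, "ARG0")       # eat-01 :ARG0 dog
shift(c)
left_arc(c, "ARG1")       # want-01 :ARG1 eat-01
left_arc(c, "ARG0")       # want-01 :ARG0 dog (a second head for dog)
shift(c)
```

After this sequence, `c.edges` contains three edges, two of which share dog as dependent; no extra machinery was needed for the reentrancy.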
[Example transition sequence: at each step, the contents of $\lambda_1$, $\lambda_2$ and $\beta$ are shown together with the transition applied (shift, confirm, breakdown, reduce, left-arc, right-arc, no-arc).]
3.2 Training the classifiers
The algorithm relies on three classifiers: (1) a transition classifier, which learns the set of transitions introduced in §3.1; (2) a relation classifier, to predict the label(s) of an edge when the selected action is a left-arc, right-arc or multiple-arc; and (3) a hybrid process (a concept classifier plus a rule-based system) that determines which concept to create when the selected action is a confirm or breakdown.
Sentences are tokenized and aligned with the concepts using Jamr (flanigan2014discriminative). For lemmatization, tagging and dependency parsing we used UDpipe (straka2016udpipe) and its English pre-trained model (zeman2017conll). Named Entity Recognition is handled by Stanford CoreNLP (manning-EtAl:2014:P14-5).
We use feed-forward neural networks to train the three classifiers. The transition classifier uses 2 hidden layers (400 and 200 neurons) and the relation and concept classifiers use 1 hidden layer (200 neurons). The activation function in the hidden layers is a rectifier, $relu(x) = \max(0, x)$, and their output is computed as $relu(W_i x_i + b_i)$, where $W_i$ and $b_i$ are the weight and bias tensors to be learned and $x_i$ is the input to the $i$th hidden layer. The output layer uses a softmax function, computed as $P(y|x) = e^{z_y} / \sum_{y'} e^{z_{y'}}$. All classifiers are trained in mini-batches (size=32), using Adam (kingma2014adam) (learning rate set to ), early stopping (no patience) and dropout (srivastava2014dropout) (40%). The classifiers are fed with features extracted from the preprocessed texts; the features differ per classifier and are summarized in Appendix A (Table 5), which also describes (Appendix B) other design decisions not shown here due to space reasons.
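A minimal numpy sketch of such a feed-forward classifier follows. The layer sizes and activations come from the text; the random initialization, the toy input dimension of 64 and the number of output classes are our own illustrative choices, and training (Adam, dropout, early stopping) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(n_in, n_out):
    # small random weights; the real system learns these with Adam
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def mlp_forward(x, layers):
    """Forward pass: relu(W_i x_i + b_i) per hidden layer, softmax output."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W, b = layers[-1]
    return softmax(h @ W + b)

# e.g. a transition classifier: 64 input features -> 400 -> 200 -> 7 transitions
layers = [init_layer(64, 400), init_layer(400, 200), init_layer(200, 7)]
probs = mlp_forward(rng.normal(size=(1, 64)), layers)
# probs is a distribution over the 7 transition classes (rows sum to 1)
```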
3.3 Running the system
At each parsing configuration, we first try to find a multiword concept or entity that matches the head elements of $\beta$, to reduce the number of breakdowns, which turned out to be a difficult transition to learn (see §4.1). This is done by looking up a table of multiword concepts² seen in the training set and a set of rules, as introduced in (damonte-cohen-satta:2017:EACLlong; gildea-satta-cl17).
²The most frequent subgraph.
We then invoke the transition classifier and call the corresponding subprocess whenever an additional concept- or edge-label identification task is needed.
Concept identification
If the word at the head of $\beta$ occurred more than 4 times in the training set, we call the supervised concept classifier to predict the concept. Otherwise, we first look for a word-to-concept mapping in a lookup table. If none is found, we generate the concept lemma-01 if the word is a verb, and lemma otherwise.
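This fallback cascade can be sketched as follows (function and parameter names are ours; `classifier` and `lookup` stand in for the trained concept classifier and the word-to-concept table):

```python
def predict_concept(word, lemma, is_verb, train_count, classifier, lookup):
    """Concept-identification fallback cascade (sketch).
    Frequent words (> 4 training occurrences) go to the supervised
    classifier; rare words fall back to a word-to-concept lookup table,
    and finally to a lemma-based rule."""
    if train_count > 4:
        return classifier(word)      # supervised concept classifier
    if word in lookup:
        return lookup[word]          # word-to-concept mapping from training data
    return lemma + "-01" if is_verb else lemma

# a rare verb not in the table yields a propbank-style frame from its lemma
assert predict_concept("strolled", "stroll", True, 1, None, {}) == "stroll-01"
```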
Edge label identification
The relation classifier is invoked every time an edge is created. We use the list of valid ARGs allowed in propbank framesets by damonte-cohen-satta:2017:EACLlong. Also, if p and o are a propbank and a non-propbank concept, respectively, we restore inverted edges of the form $o \xrightarrow{ARGn} p$ as $p \xrightarrow{ARGn\text{-}of} o$.
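The inversion this step relies on is the standard AMR equivalence between an edge and its "-of" inverse; a small helper of our own (not the system's code) makes the rewrite explicit:

```python
def invert_edge(head, label, dep):
    """AMR equivalence: (h :REL d) == (d :REL-of h).
    Applying the helper twice returns the original edge."""
    if label.endswith("-of"):
        return dep, label[:-3], head   # drop the "-of" suffix and swap
    return dep, label + "-of", head    # add the "-of" suffix and swap

# invert_edge("eat-01", "ARG0", "dog") -> ("dog", "ARG0-of", "eat-01")
```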
4 Methods and Experiments
We use the LDC2015E86 corpus and its official splits: 16 833 graphs for training, 1 368 for development and 1 371 for testing. The final model is only trained on the training split.
We use Smatch (cai2013smatch) and the metrics from damonte-cohen-satta:2017:EACLlong.³
³It is worth noting that the calculation of Smatch and metrics derived from it suffers from a random component, as they involve finding an alignment between predicted and gold graphs with an approximate algorithm that can produce a suboptimal solution. Thus, as in previous work, reported Smatch scores may slightly underestimate the real score.
The code and the pretrained model used in this paper can be found at https://github.com/aghie/tb-amr.
4.1 Results and discussion
Table 3 shows the accuracy of the transition classifier on the development set. confirm and reduce are the easiest transitions, as local information such as POS-tags and words is discriminative enough to distinguish between content and function words. breakdown is the hardest action.⁴ right-arcs are harder than left-arcs, although the reason for this remains an open question for us. The performance for no-arcs is high, but it would be interesting to achieve a higher recall at the cost of a lower precision, as predicting no-arcs makes the transition sequence longer but could help identify more distant reentrancies. The overall accuracy of the transition classifier is 86%. The accuracy of the relation classifier is 79%; we do not show detailed per-label results since the number of classes is too high. The concept classifier was trained on concepts occurring more than once in the training set, obtaining an accuracy of 83%; its accuracy on the development set with all concepts was 77%.
⁴This transition was trained/evaluated for non-named-entity words that generate multiple nodes, e.g. father, which maps to have-rel-role-91 :ARG2 father. In early stages of this work, we observed that this transition could learn to correctly generate multiple-term concepts for named entities that are not sparse (e.g. countries or people), but failed with sparse entities (e.g. dates or percent quantities). Low performance in identifying them negatively affects the edge metrics, which require both concepts of an edge to be correct. Because of this, and to identify them properly, we use the aforementioned complementary rules to handle named entities.
Table 4 compares the performance of our systems with state-of-the-art models on the test set. amr-covington obtains state-of-the-art results for all the standard metrics. It outperforms the rest of the models when handling reentrant edges. It is worth noting that D requires an additional classifier to handle a restricted set of reentrancy and P uses up to five classifiers to build the graph.
In contrast to related work that relies on ad-hoc procedures, the proposed algorithm handles cycles and reentrant edges natively. This is done by just removing the original constraints of the arc transitions in the original covington2001fundamental algorithm. The main drawback of the algorithm is its computational complexity. The transition system is expected to run in $O(n^2)$, like the original Covington parser. There are also collateral issues that impact the real speed of the system, such as predicting the concepts in a supervised way, given the large number of output classes (even discarding the least frequent concepts, the classifier needs to discriminate among more than 7 000 concepts). In line with previous discussions (damonte-cohen-satta:2017:EACLlong), it seems that using a supervised feed-forward network to predict the concepts does not lead to better overall concept identification with respect to the use of simple lookup tables that pick the most common node/subgraph. Currently, every node is kept in $\lambda_1$ and is available to be part of new edges. We wonder whether storing in $\lambda_1$ only the head node of words that generate multiple-node subgraphs (e.g. for the word father, which maps to have-rel-role-91 :ARG2 father, keeping only the concept have-rel-role-91 in $\lambda_1$) could be beneficial for amr-covington.
As a side note, current amr evaluation involves elements such as neural network initialization, hooks and the (sub-optimal) alignments of evaluation metrics (e.g. Smatch) that introduce random effects which we found difficult to quantify.
5 Conclusion
We introduce amr-covington, a non-projective transition-based parser for unrestricted amr. The set of transitions handles reentrancy natively. Experiments on the LDC2015E86 corpus show that our approach obtains results close to the state of the art and good behavior on reentrant edges.
As future work, amr-covington produces long sequences of no-arcs, which could be shortened by using non-local transitions (qi-manning:2017:Short; 2017arXiv171009340F). Sequential models have shown that fewer hooks and lookup tables are needed to deal with the high sparsity of amr (ballesteros-alonaizan:2017:EMNLP2017). Similarly, bist-covington (vilares2017non) could be adapted for this task.
Acknowledgments
This work is funded by the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), by the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and by Xunta de Galicia (ED431B 2017/01). We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU.
Appendix A Supplemental Material
Table 5 indicates the features used to train the different classifiers. During training, concept features are randomly mapped to a special index that refers to an unknown concept. This helps learn a generic embedding for unseen concepts in the test phase. Also, concepts that occur only once in the training set are not considered as output classes by the concept classifier.
[Table 5: features used by each classifier, including labels from the predicted dependency tree.]
Internal and external (from GloVe) word embedding sizes are set to 100. The size of the concept embedding is set to 50. The sizes of the remaining embeddings are set to 20. The weights are initialized to zero.
Appendix B Additional design decisions
We describe in more detail the main hooks and design decisions used in this work to mitigate the high sparsity of Abstract Meaning Representation, which, at least in our experience, was a significant challenge. These decisions mainly affect the mapping from words to multiple-concept subgraphs.
We identify named entities and nationalities, and update the training configurations to generate the corresponding subgraph by applying a set of hooks.⁵ The intermediate training configurations are not fed as samples to the classifier. In early experiments, we observed that the breakdown transition could acceptably learn non-sparse named entities (e.g. countries and nationalities), but failed on sparse ones (e.g. dates or money amounts). By processing the named entities with hooks instead, the aim was to make the parser familiar with the parsing configurations that are obtained after applying the hooks.
⁵The hooks are based on the text-file resources for countries, cities, etc. released by damonte-cohen-satta:2017:EACLlong and on an analysis of how named entities are generated in the training/development set.
Additionally, named-entity subgraphs and subgraphs coming from multiword phrases (involving two or more terms) in the training set are saved into a lookup table. The latter had little impact.
We store in a lookup table some single-word expressions that generated multiple-concept subgraphs in the training set, based on simple heuristics: words that denote a negative expression (e.g. undecided, which maps to decide-01 :polarity -); words that always generated the same subgraph and occurred more than 5 times; and capitalized single words that were not previously identified as named entities.
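The "same subgraph, more than 5 occurrences" heuristic can be sketched as follows (our own simplification, serializing subgraphs as strings; names are hypothetical):

```python
from collections import defaultdict

def build_word_lookup(alignments, min_count=5):
    """Build a word -> subgraph table from aligned training occurrences.
    A word is stored only if it always produced the same subgraph and
    was seen more than min_count times."""
    seen = defaultdict(set)     # word -> distinct subgraphs observed
    counts = defaultdict(int)   # word -> number of occurrences
    for word, subgraph in alignments:
        seen[word].add(subgraph)
        counts[word] += 1
    return {w: next(iter(g)) for w, g in seen.items()
            if len(g) == 1 and counts[w] > min_count}

aligns = ([("undecided", "decide-01 :polarity -")] * 6   # frequent, consistent
          + [("run", "run-01")] * 3                      # too rare
          + [("bank", "bank")] * 6 + [("bank", "bank-01")])  # ambiguous
# build_word_lookup(aligns) keeps only "undecided"
```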
We use the verbalization list from wang2015transition (another lookup table).
When predicting a confirm or breakdown for an uncommon word, we check if that word was mapped to a concept in the training set. If not, we generate the concept lemma-01 if it is a verb, otherwise lemma.
Dates formatted as YYMMDD or YYYYMMDD are identified using a simple criterion (a sequence of 6 or 8 digits) and transformed into YYYY-MM-DD in the test phase, as they were consistently misclassified as integer numbers in the development set.
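This normalization might look like the following (a hypothetical sketch; the expansion of two-digit years to 20YY is our own assumption, not stated in the text):

```python
import re

def normalize_date(token):
    """Rewrite 6- or 8-digit tokens (YYMMDD / YYYYMMDD) as YYYY-MM-DD;
    anything else is returned unchanged."""
    if re.fullmatch(r"\d{8}", token):                      # YYYYMMDD
        return f"{token[:4]}-{token[4:6]}-{token[6:]}"
    if re.fullmatch(r"\d{6}", token):                      # YYMMDD (assume 20YY)
        return f"20{token[:2]}-{token[2:4]}-{token[4:]}"
    return token

# normalize_date("20170102") -> "2017-01-02"
# normalize_date("170102")   -> "2017-01-02"
```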
We apply a set of hooks similar to those of damonte-cohen-satta:2017:EACLlong to determine whether the predicted label is valid for a given edge.
We forbid generating the same concept twice consecutively. Also, we set the constant $k$ that bounds consecutive breakdowns.
If a node is created but is not attached to any head node, we post-process the graph and connect it to the root node.
We assume multi-sentence graphs should contain sentence punctuation symbols. If we predict a multi-sentence graph but there is no punctuation symbol that splits sentences, we post-process the graph and transform the root node into an and node.
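A sketch of this post-processing rule (function name and the punctuation set are our assumptions; a trailing final punctuation mark is not counted as a sentence splitter):

```python
def fix_multi_sentence_root(root_concept, tokens):
    """Demote a predicted multi-sentence root to an 'and' node when the
    input contains no sentence-splitting punctuation."""
    splitters = {".", ";", "!", "?"}
    # punctuation only "splits" if it occurs before the last token
    has_split = any(t in splitters for t in tokens[:-1])
    if root_concept == "multi-sentence" and not has_split:
        return "and"
    return root_concept

# fix_multi_sentence_root("multi-sentence", ["it", "rains", "."]) -> "and"
```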