A Transition-based Algorithm for Unrestricted AMR Parsing


David Vilares
Universidade da Coruña
FASTPARSE Lab, LyS Group
Departamento de Computación
Campus de A Elviña s/n, 15071
A Coruña, Spain
david.vilares@udc.es

Carlos Gómez-Rodríguez
Universidade da Coruña
FASTPARSE Lab, LyS Group
Departamento de Computación
Campus de A Elviña s/n, 15071
A Coruña, Spain
carlos.gomez@udc.es
Abstract

Non-projective parsing can be useful to handle cycles and reentrancy in amr graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition system itself. The model is evaluated on the LDC2015E86 corpus, obtaining results close to the state of the art, including a Smatch of 64%, and showing good behavior on reentrant edges.


1 Introduction

Abstract Meaning Representation (amr) is a semantic representation language that maps the meaning of English sentences into directed, possibly cyclic, labeled graphs (banarescu2013abstract). Graph vertices are concepts inferred from words. The concepts can be represented by the words themselves (e.g. dog), PropBank framesets (palmer2005proposition) (e.g. eat-01), or keywords (like named entities or quantities). The edges denote relations between pairs of concepts (e.g. eat-01 :ARG0 dog). amr parsing integrates tasks that have usually been addressed separately in natural language processing (nlp), such as named entity recognition (nadeau2007survey), semantic role labeling (palmer2010semantic) or co-reference resolution (ng2002improving; lee2017scaffolding). Figure 1 shows an example of an amr graph.
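For instance, the relation mentioned above, eat-01 :ARG0 dog, can be written in the PENMAN notation commonly used to serialize amr graphs as follows (the variable names e and d are arbitrary and chosen only for this illustration):

(e / eat-01
   :ARG0 (d / dog))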


Figure 1: amr graph for ‘When the prince arrived on the Earth, he was surprised not to see any people’. Words can refer to concepts by themselves (green), be mapped to PropBank framesets (red) or be broken down into multiple-term/non-literal concepts (blue). Prince plays different semantic roles.

Several transition-based dependency parsing algorithms have been extended to generate amr. wang2015transition describe a two-stage model, where they first obtain the dependency parse of a sentence and then transform it into a graph. damonte-cohen-satta:2017:EACLlong propose a variant of the arc-eager algorithm to identify labeled edges between concepts. These concepts are identified using a lookup table and a set of rules. A restricted subset of reentrant edges is supported by an additional classifier. A similar configuration is used in (gildea-satta-cl17; peng-aaai18), but relying on a cache data structure to handle reentrancy, cycles and restricted non-projectivity; a feed-forward network and additional hooks are used to build the concepts. ballesteros-alonaizan:2017:EMNLP2017 use a modified arc-standard algorithm, where the oracle is trained using stack-lstms (dyer-EtAl:2015:ACL-IJCNLP). Reentrancy is handled through swap (nivre2009non), and additional transitions are defined to detect concepts, entities and polarity nodes.

This paper explores unrestricted non-projective amr parsing and introduces amr-covington, inspired by covington2001fundamental. It handles arbitrary non-projectivity, cycles and reentrancy in a natural way, as there is no need for specific transitions, but just the removal of restrictions from the original algorithm. The algorithm has full coverage and keeps transitions simple, which is a matter of concern in recent studies (peng-aaai18).

2 Preliminaries and Notation

Notation

We use typewriter font for concepts and their indexes (e.g. dog or 1), regular font for raw words (e.g. dog or 1), and bold font for vectors and matrices (e.g. w, W).

covington2001fundamental describes a fundamental algorithm for unrestricted non-projective dependency parsing. The algorithm can be implemented as a left-to-right transition system (nivre2008algorithms). The key idea is intuitive: given the word being processed at a particular state, it is compared against the words that have been processed previously, deciding whether or not to establish a syntactic dependency arc from/to each of them. The process continues until all previous words have been checked or until the algorithm decides that no more connections to previous words need to be built; then the next word is processed. The runtime is O(n^2) in the worst case. To guarantee the single-head and acyclicity conditions required in dependency parsing, explicit tests are added to the algorithm to check for transitions that would break these constraints. Such transitions are then disallowed, making the implementation less straightforward.
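The following Python sketch is provided only as an illustration of this left-to-right scheme for dependency parsing: the oracle function stands in for a trained classifier, and the two *_ok helpers are the explicit single-head and acyclicity tests mentioned above (the ones that amr-covington will later drop).

def covington_parse(words, oracle):
    # Sketch of Covington's left-to-right dependency parsing loop.
    arcs = set()                       # (head, dependent) pairs
    for j in range(len(words)):        # word currently being processed
        for i in reversed(range(j)):   # compare against previously processed words
            action = oracle(i, j, arcs)
            if action == "left-arc" and single_head_ok(arcs, i) and acyclic_ok(arcs, j, i):
                arcs.add((j, i))       # j becomes the head of i
            elif action == "right-arc" and single_head_ok(arcs, j) and acyclic_ok(arcs, i, j):
                arcs.add((i, j))       # i becomes the head of j
            elif action == "shift":
                break                  # no more links for j; move to the next word
            # otherwise ("no-arc"): keep scanning earlier words
    return arcs

def single_head_ok(arcs, dependent):
    # dependency-parsing constraint: a word may have at most one head
    return all(d != dependent for _, d in arcs)

def acyclic_ok(arcs, head, dependent):
    # dependency-parsing constraint: adding head -> dependent must not close a cycle,
    # i.e. dependent must not already be an ancestor of head
    frontier, seen = {head}, set()
    while frontier:
        node = frontier.pop()
        if node == dependent:
            return False
        seen.add(node)
        frontier |= {h for h, d in arcs if d == node and h not in seen}
    return True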

3 The amr-Covington algorithm

The acyclicity and single-head constraints are not needed in amr, as arbitrary graphs are allowed. Cycles and reentrancy are used to model semantic relations between concepts (as shown in Figure 1) and to identify co-references. By removing these constraints from the Covington transition system, we obtain a natural way to deal with them. (This is roughly equivalent to going back to the naive parser called ESH in covington2001fundamental, which has not seen practical use in parsing due to the lack of these constraints.)

Also, amr parsing requires words to be transformed into concepts. While dependency parsing operates on a fixed-length sequence (one node per word), in amr a word can be removed, generate a single concept, or generate several concepts. In this paper, additional lookup tables and transitions are defined to create concepts when needed, following the current trend (damonte-cohen-satta:2017:EACLlong; ballesteros-alonaizan:2017:EMNLP2017; gildea-satta-cl17).

3.1 Formalization

Let G = (V, E) be an edge-labeled directed graph where V is the set of concepts and E is the set of labeled edges. We will denote a connection between a head concept h and a dependent concept d as h →_l d, where l is the semantic label connecting them.

The parser will process sentences from left to right. Each decision leads to a new parsing configuration, which can be abstracted as a 4-tuple (λ1, λ2, β, E) where:

  • β is a buffer that contains the unprocessed words. They await being transformed into a concept, into part of a larger concept, or removed. In b|β', b represents the head of β, and it can optionally already be a concept; in that case it will be denoted as b (typewriter font).

  • λ1 is a list of previously created concepts that are waiting to determine their semantic relation with respect to b. Elements in λ1 are concepts. In λ1'|i, i denotes its last element.

  • λ2 is a list that contains previously created concepts for which the relation with b has already been determined. Elements in λ2 are concepts. In j|λ2', j denotes the head of λ2.

  • E is the set of created edges.

left-arc_l:        (λ1|i, λ2, b|β, E) ⇒ (λ1, i|λ2, b|β, E ∪ {b →_l i})

right-arc_l:       (λ1|i, λ2, b|β, E) ⇒ (λ1, i|λ2, b|β, E ∪ {i →_l b})

multiple-arc_l,l': (λ1|i, λ2, b|β, E) ⇒ (λ1, i|λ2, b|β, E ∪ {i →_l b, b →_l' i})

shift:             (λ1, λ2, b|β, E) ⇒ (λ1·λ2·b, [], β, E)

no-arc:            (λ1|i, λ2, b|β, E) ⇒ (λ1, i|λ2, b|β, E)

confirm:           (λ1, λ2, w|β, E) ⇒ (λ1, λ2, b|β, E)

breakdown:         (λ1, λ2, w|β, E) ⇒ (λ1, λ2, b|w|β, E)

reduce:            (λ1, λ2, w|β, E) ⇒ (λ1, λ2, β, E)

Table 1: Transitions for amr-covington (b denotes the concept created for the word w at the head of β)

Given an input sentence w1 … wn, the parser starts at an initial configuration c0 = ([], [], w1|…|wn, ∅) and applies valid transitions until a final configuration cf = (λ1, λ2, [], E) is reached, i.e. until the buffer is empty. The set of transitions is formally defined in Table 1:

  • left-arc_l: Creates an edge b →_l i. i is moved to λ2.

  • right-arc_l: Creates an edge i →_l b. i is moved to λ2.

  • shift: Pops b from β. λ1, λ2 and b are appended (in that order) to form the new λ1, and λ2 is emptied.

  • no-arc: It is applied when the algorithm determines that there is no semantic relationship between i and b, but there is a relationship between some other node in λ1 and b. i is moved to λ2.

  • confirm: Pops the word at the head of β and puts the concept b in its place. This transition is called to handle words that only need to generate one (more) concept.

  • breakdown: Creates a concept b from the word w at the head of β and places it on top of β, but w is not popped, so the new buffer state is b|w|β'. It is used to handle a word that is going to be mapped to multiple concepts. To guarantee termination, breakdown is parametrized with a constant k, banning the generation of more than k consecutive concepts by this operation; otherwise, concepts could be generated indefinitely without ever emptying β.

  • reduce: Pops the word at the head of β. It is used to remove words that do not add any meaning to the sentence and are not part of the amr graph.

left-arc and right-arc handle cycles and reentrancy, with the exception of cycles of length 2 (which only involve i and b). To ensure full coverage, we include an additional transition, multiple-arc, which creates two edges i →_l b and b →_l' i and moves i to λ2. multiple-arcs are marginal and will not be learned in practice. amr-covington can be implemented without multiple-arc, by keeping i in λ1 after creating an arc and using no-arc once the parser has finished creating connections between i and b, at a cost to efficiency, as transition sequences would become longer. Multiple edges in the same direction between i and b are handled by representing them as a single edge that merges the labels.
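To make the transition semantics concrete, here is a minimal Python sketch of the amr-covington configurations and transitions (our illustration, not the authors' implementation). Concepts and words are plain strings, edges are (head, label, dependent) triples, and oracle decisions and concept creation are left abstract; note that, unlike the dependency-parsing version, no single-head or acyclicity check is performed.

from collections import namedtuple

# configuration (lambda1, lambda2, beta, E); the lists are ordered left to right
Config = namedtuple("Config", ["l1", "l2", "buffer", "edges"])

def initial(words):
    return Config([], [], list(words), frozenset())

def is_final(c):
    return not c.buffer          # parsing ends when the buffer is empty

def left_arc(c, label):          # edge b -l-> i; i moves to lambda2
    i, b = c.l1[-1], c.buffer[0]
    return Config(c.l1[:-1], [i] + c.l2, c.buffer, c.edges | {(b, label, i)})

def right_arc(c, label):         # edge i -l-> b; i moves to lambda2
    i, b = c.l1[-1], c.buffer[0]
    return Config(c.l1[:-1], [i] + c.l2, c.buffer, c.edges | {(i, label, b)})

def multiple_arc(c, l_out, l_in):  # marginal transition covering length-2 cycles
    i, b = c.l1[-1], c.buffer[0]
    return Config(c.l1[:-1], [i] + c.l2, c.buffer,
                  c.edges | {(i, l_out, b), (b, l_in, i)})

def no_arc(c):                   # i moves to lambda2 without creating an edge
    return Config(c.l1[:-1], [c.l1[-1]] + c.l2, c.buffer, c.edges)

def shift(c):                    # lambda1, lambda2 and b form the new lambda1
    return Config(c.l1 + c.l2 + [c.buffer[0]], [], c.buffer[1:], c.edges)

def confirm(c, concept):         # replace the word at the head of beta by its concept
    return Config(c.l1, c.l2, [concept] + c.buffer[1:], c.edges)

def breakdown(c, concept):       # push a concept without popping the word
    return Config(c.l1, c.l2, [concept] + c.buffer, c.edges)

def reduce_(c):                  # drop a word that maps to no concept
    return Config(c.l1, c.l2, c.buffer[1:], c.edges)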

Example

Table 2 illustrates a valid transition sequence to obtain the amr graph of Figure 1.

λ1   λ2   β   Action (times)
w, t, p reduce
p, a, o confirm
p, a, o shift
p a, o, t confirm
p a, o, t left-arc
p a, o, t shift
p, a o, t, E reduce
p, a E, h, w breakdown
p, a ‘E’, E, h shift
p, a, ‘E’ E, h, w breakdown
p, a, ‘E’ ‘E’, h, w shift
a, ‘E’, ‘E’ E, h, w breakdown
a, ‘E’, ‘E’ n, h, w left-arc
a, ‘E’ ‘E’ n, h, w shift
‘E’, ‘E’, n E, h, w confirm
‘E’, ‘E’, n p2, h, w left-arc
a, ‘E’, ‘E’ n, p2, h, w no-arc
p, a, ‘E’ ‘E’, n, p2, h, w left-arc
p, a ‘E’, ‘E’, n, p2, h, w shift
‘E’, n, p2 h, w, s reduce
‘E’, n, p2 s, n, t confirm
‘E’, n, p2 s, n, t left-arc
‘E’, ‘E’, n p2 s, n, t no-arc
p, a ‘E’, ‘E’, n s, n, t left-arc
p, a, ‘E’ s, n, t shift
n, p2, s n, t, s2 confirm
n, p2, s -, t, s2 shift
p2, s, - t, s2, a2 reduce
p2, s, - s2, a2, p3 confirm
p2, s, - s2, a2, p3 left-arc
n, p2, s - s2, a2, p3 no-arc
p a, ‘E’, ‘E’ s2, a2, p3 left-arc
p, a, ‘E’ s2, a2, p3 shift
s, -, s2 a2, p3 confirm
s, -, s2 a2, p3 shift
-, s2, a2 p3 confirm
-, s2, a2 p3 left-arc
s, -, s2 a2 p3 right-arc
s2, a2, p3 shift
Table 2: Sequence of gold transitions to obtain the amr graph for the sentence ‘When the prince arrived on the Earth, he was surprised not to see any people’, introduced in Figure 1. For brevity, we represent words (and concepts) by their first character (plus an index if it is duplicated) and we only show the top three elements of λ1, λ2 and β. Steps from 20 to 23(2) and from 28 to 31 manage the reentrant edges for prince (p) from surprise-01 (s) and see-01 (s2).

3.2 Training the classifiers

The algorithm relies on three classifiers: (1) a transition classifier that learns the set of transitions introduced in §3.1, (2) a relation classifier that predicts the label(s) of an edge when the selected action is a left-arc, right-arc or multiple-arc, and (3) a hybrid process (a concept classifier plus a rule-based system) that determines which concept to create when the selected action is a confirm or breakdown.

Preprocessing

Sentences are tokenized and aligned with the concepts using Jamr (flanigan2014discriminative). For lemmatization, tagging and dependency parsing we used UDpipe (straka2016udpipe) and its English pre-trained model (zeman2017conll). Named Entity Recognition is handled by Stanford CoreNLP (manning-EtAl:2014:P14-5).

Architecture

We use feed-forward neural networks to train the three classifiers. The transition classifier uses 2 hidden layers (of 400 and 200 neurons) and the relation and concept classifiers use 1 hidden layer (200 neurons). Each hidden layer applies a non-linear activation a, so its output is computed as a(W_i x_i + b_i), where W_i and b_i are the weight and bias tensors to be learned and x_i is the input to the i-th hidden layer. The output layer uses a softmax function, softmax(x)_k = e^{x_k} / Σ_j e^{x_j}. All classifiers are trained in mini-batches (size 32), using Adam (kingma2014adam), early stopping (no patience) and dropout (srivastava2014dropout) (40%). The classifiers are fed with features extracted from the preprocessed texts; the features differ depending on the classifier and are summarized in Appendix A (Table 5). Appendix B describes other design decisions that are not shown here due to space reasons.
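As an illustration of the architecture just described, the sketch below shows a transition classifier with two hidden layers of 400 and 200 units, 40% dropout, a softmax output and Adam training on mini-batches of size 32, written in PyTorch (not the authors' code). The input dimensionality, the ReLU activation and the random features are assumptions made only for the example.

import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    def __init__(self, input_dim, num_transitions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 400), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(400, 200), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(200, num_transitions),  # logits; softmax is applied by the loss
        )

    def forward(self, x):
        return self.net(x)

model = TransitionClassifier(input_dim=500, num_transitions=8)  # 8 transitions in Table 1
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# one illustrative mini-batch of 32 random feature vectors
features = torch.randn(32, 500)
gold = torch.randint(0, 8, (32,))
optimizer.zero_grad()
loss = loss_fn(model(features), gold)
loss.backward()
optimizer.step()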

3.3 Running the system

At each parsing configuration, we first try to find a multiword concept or entity that matches the head elements of β, to reduce the number of breakdowns, which turned out to be a difficult transition to learn (see §4.1). This is done by looking at a lookup table of multiword concepts (mapping each to its most frequent subgraph) seen in the training set and a set of rules, as introduced in (damonte-cohen-satta:2017:EACLlong; gildea-satta-cl17).

We then invoke the transition classifier and call the corresponding subprocess when an additional concept or edge-label identification task is needed.

Concept identification

If the word at the head of β occurred more than 4 times in the training set, we call the supervised concept classifier to predict the concept. Otherwise, we first look for a word-to-concept mapping in a lookup table; if none is found, we generate the concept lemma-01 if the word is a verb, and lemma otherwise.
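A hedged sketch of this fallback scheme is shown below; train_count, word_to_concept and concept_classifier are hypothetical stand-ins for the training-set frequency counts, the word-to-concept lookup table and the supervised classifier, and the verb test assumes Penn Treebank-style POS tags.

def identify_concept(word, lemma, pos, train_count, word_to_concept, concept_classifier):
    if train_count.get(word, 0) > 4:
        # frequent word: trust the supervised classifier
        return concept_classifier(word, lemma, pos)
    if word in word_to_concept:
        # otherwise, fall back to the most common mapping seen in training
        return word_to_concept[word]
    # unseen word: PropBank-style frameset for verbs, plain lemma otherwise
    return f"{lemma}-01" if pos.startswith("VB") else lemma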

Edge label identification

The relation classifier is invoked every time an edge is created. We use the list of valid ARGs allowed in PropBank framesets by damonte-cohen-satta:2017:EACLlong. Also, if p and o are a PropBank and a non-PropBank concept, respectively, we restore inverted edges: an ARG edge of the form o →_ARGN p is rewritten as o →_ARGN-of p.
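Under this reading, the label post-processing can be sketched as follows (an illustration only; is_propbank and the naming pattern for framesets are simplifications, and the actual rule set may differ).

import re

def is_propbank(concept):
    # e.g. eat-01 or have-rel-role-91 (simplified pattern)
    return re.fullmatch(r".+-\d\d", concept) is not None

def restore_inverted(head, label, dependent):
    # an ARG label whose head is not a PropBank frameset but whose dependent is
    # one is rewritten with the equivalent inverted (-of) role
    if label.startswith(":ARG") and not label.endswith("-of") \
            and not is_propbank(head) and is_propbank(dependent):
        return head, label + "-of", dependent
    return head, label, dependent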

4 Methods and Experiments

Corpus

We use the LDC2015E86 corpus and its official splits: 16 833 graphs for training, 1 368 for development and 1 371 for testing. The final model is only trained on the training split.

Metrics

We use Smatch (cai2013smatch) and the metrics from damonte-cohen-satta:2017:EACLlong. (It is worth noting that the calculation of Smatch, and of metrics derived from it, suffers from a random component, as it involves finding an alignment between predicted and gold graphs with an approximate algorithm that can produce a suboptimal solution. Thus, as in previous work, reported Smatch scores may slightly underestimate the real score.)

Sources

The code and the pretrained model used in this paper can be found at https://github.com/aghie/tb-amr.

4.1 Results and discussion

Table 3 shows the accuracy of the transition classifier on the development set. confirm and reduce are the easiest transitions, as local information such as POS-tags and words is discriminative enough to distinguish between content and function words. breakdown is the hardest action. (This transition was trained/evaluated for non named-entity words that generate multiple nodes, e.g. father, which maps to have-rel-role-91 :ARG2 father. In early stages of this work, we observed that this transition could learn to correctly generate multiple-term concepts for named entities that are not sparse (e.g. countries or people), but failed with sparse entities (e.g. dates or percent quantities).) Low performance on identifying these concepts negatively affects the edge metrics, which require both concepts of an edge to be correct. Because of this, and to identify them properly, we use the aforementioned complementary rules to handle named entities. right-arcs are harder than left-arcs, although the reason remains an open question for us. The performance for no-arcs is high, but it would be interesting to achieve a higher recall at the cost of a lower precision: predicting no-arcs makes the transition sequence longer, but could help identify more distant reentrancies. The relation and concept classifiers reach accuracies of 86% and 79%, respectively; we do not show detailed results, since the number of classes is too high. The concept classifier was trained on concepts occurring more than once in the training set, obtaining an accuracy of 83%; its accuracy on the development set with all concepts was 77%.

Action Prec. Rec. F-score
left-arc 81.62 87.73 84.57
right-arc 75.53 78.71 77.08
multiple-arc 00.00 00.00 00.00
shift 80.44 81.11 80.77
no-arc 89.71 86.71 88.18
confirm 84.91 96.11 90.16
reduce 96.77 91.53 94.08
breakdown 85.09 50.23 63.17
Table 3: scores on the development set.

Table 4 compares the performance of our system with state-of-the-art models on the test set. amr-covington obtains results competitive with the state of the art across the standard metrics, and it outperforms the rest of the models when handling reentrant edges. It is worth noting that D requires an additional classifier to handle a restricted subset of reentrancies, and P uses up to five classifiers to build the graph.

Metric F W F’ D P Ours
Smatch 58 63 67 64 64 64
Unlabeled 61 69 69 69 - 68
No-WSD 58 64 68 65 - 65
NER 75 75 79 83 - 83
Wiki 0 0 75 64 - 70
Negations 16 18 45 48 - 47
Concepts 79 80 83 83 - 83
Reentrancy 38 41 42 41 - 44
SRL 55 60 60 56 - 57
Table 4: F-score comparison with F (flanigan2014discriminative), W (wang2015transition), F’ (flanigan2016cmu), D (damonte-cohen-satta:2017:EACLlong), P (peng-aaai18). D, P and our system are left-to-right transition-based.

Discussion

In contrast to related work that relies on ad-hoc procedures, the proposed algorithm handles cycles and reentrant edges natively. This is done by simply removing the original constraints on the arc transitions of the original covington2001fundamental algorithm. The main drawback of the algorithm is its computational complexity: the transition system is expected to run in O(n^2), like the original Covington parser. There are also collateral issues that impact the real speed of the system, such as predicting the concepts in a supervised way, given the large number of output classes (even discarding the less frequent concepts, the classifier needs to discriminate among more than 7 000 concepts). In line with previous discussions (damonte-cohen-satta:2017:EACLlong), it seems that using a supervised feed-forward network to predict the concepts does not lead to better overall concept identification than the use of simple lookup tables that pick the most common node/subgraph. Currently, every node is kept in λ1 and remains available to be part of new edges. We wonder whether storing in λ1 only the head node of words that generate multiple-node subgraphs (e.g. for the word father, which maps to have-rel-role-91 :ARG2 father, keeping in λ1 only the concept have-rel-role-91) could be beneficial for amr-covington.

As a side note, current amr evaluation involves elements such as neural network initialization, hooks and the (sub-optimal) alignments of evaluation metrics (e.g. Smatch) that introduce random effects that were difficult to quantify for us.

5 Conclusion

We introduce amr-covington, a non-projective transition-based parser for unrestricted amr. The set of transitions handles reentrancy natively. Experiments on the LDC2015E86 corpus show that our approach obtains results close to the state of the art and good behavior on reentrant edges.

As future work, the long sequences of no-arcs produced by amr-covington could be shortened by using non-local transitions (qi-manning:2017:Short; 2017arXiv171009340F). Sequential models have shown that fewer hooks and lookup tables are needed to deal with the high sparsity of amr (ballesteros-alonaizan:2017:EMNLP2017). Similarly, bist-covington (vilares2017non) could be adapted for this task.

Acknowledgments

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B 2017/01). We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU.

References

Appendix A Supplemental Material

Table 5 indicates the features used to train the different classifiers. During training, concept features are randomly mapped to a special index that refers to an unknown concept; this helps learn a generic embedding for concepts unseen in the test phase. Also, concepts that occur only once in the training set are not considered as output classes by the concept classifier.

Features: pos, w, ew, c, entity; lm and rm (applied to the head, child and grand-child concepts); nh, nc; depth; npunkt; hl; ct; g; d (labels from the predicted dependency tree).
Table 5: Set of proposed features for each classifier. pos, w, c and entity are part-of-speech tag, word, concept and entity embeddings. ew are pre-trained external word embeddings, fine-tuned during the training phase (http://nlp.stanford.edu/data/glove.6B.zip, 100 dimensions). lm and rm are the leftmost and rightmost functions, and h, c, cc represent the head, child and grand-child concepts of a concept; so, for example, lm(c) stands for the leftmost child of the concept. nh and nc are the number of heads and children of a concept. npunkt indicates the number of ‘.’, ‘;’, ‘:’, ‘?’, ‘!’ that have already been processed. hl denotes the labels of the last assigned head. ct indicates the type of concept (constant, PropBank frameset, other). g indicates whether a concept was generated by a confirm or a breakdown. d denotes the dependency label existing in the predicted dependency tree between a word in β and a word in λ1 (and vice versa). The word that generated a concept is still accessible after creating the concept.

Internal and external (GloVe) word embedding sizes are set to 100. The size of the concept embeddings is set to 50. The sizes of the remaining embeddings are set to 20. The weights are initialized to zero.

Appendix B Additional design decisions

We list in more detail the main hooks and design decisions followed in this work to mitigate the high sparsity of Abstract Meaning Representation which, at least in our experience, was a challenging issue. These decisions mainly affect the mapping from words to multiple-concept subgraphs.

  • We identify named entities and nationalities, and update the training configurations to generate the corresponding subgraph by applying a set of hooks. (The hooks are based on the text file resources for countries, cities, etc. released by damonte-cohen-satta:2017:EACLlong and on an analysis of how named entities are generated in the training/development set.) The intermediate training configurations are not fed as samples to the classifier. In early experiments we observed that the breakdown transition could acceptably learn non-sparse named entities (e.g. countries and nationalities), but failed on the sparse ones (e.g. dates or money amounts). By processing the named entities with hooks instead, the aim was to make the parser familiar with the parsing configurations that are obtained after applying the hooks.

  • Additionally, named-entity subgraphs and subgraphs coming from phrases (involving two or more terms) from the training set are saved into a lookup table. The latter ones had little impact.

  • We store in a lookup table some single-word expressions that generated multiple-concept subgraphs in the training set, based on simple heuristics. We store words that denote a negative expression (e.g. undecided that maps to decide-01 :polarity -). We store words that always generated the same subgraph and occurred more than 5 times. We also store capitalized single words that were not previously identified as named entities.

  • We use the verbalization list from wang2015transition (another lookup table).

  • When predicting a confirm or breakdown for an uncommon word, we check if that word was mapped to a concept in the training set. If not, we generate the concept lemma-01 if it is a verb, otherwise lemma.

  • Dates formatted as YYMMDD or YYYYMMDD are identified using a simple criterion (a sequence of 6 or 8 digits) and transformed into YYYY-MM-DD at test time, as they were consistently misclassified as integer numbers in the development set (a sketch of this heuristic is given after this list).

  • We apply a set of hooks similar to (damonte-cohen-satta:2017:EACLlong) to determine if the predicted label is valid for that edge.

  • We forbid generating the same concept twice consecutively. We also set the constant that bounds consecutive breakdown transitions (see §3.1).

  • If a node is created but it is not attached to any head node, we post-process it and connect it to the root node.

  • We assume multi-sentence graphs should contain sentence punctuation symbols. If we predict a multi-sentence graph, but there is no punctuation symbol that splits sentences, we post-process the graph and transform the root node into an and node.
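As referenced in the date item above, a minimal sketch of that normalization heuristic might look as follows; the century completion for two-digit years is a hypothetical choice made only for illustration.

import re

def normalize_date(token):
    # 8 digits: interpret as YYYYMMDD
    if re.fullmatch(r"\d{8}", token):
        return f"{token[:4]}-{token[4:6]}-{token[6:]}"
    # 6 digits: interpret as YYMMDD (the century cut-off below is hypothetical)
    if re.fullmatch(r"\d{6}", token):
        century = "19" if int(token[:2]) > 30 else "20"
        return f"{century}{token[:2]}-{token[2:4]}-{token[4:]}"
    return token

print(normalize_date("20180601"))  # 2018-06-01
print(normalize_date("180601"))    # 2018-06-01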
