Hierarchical Pointer Net Parsing
Transition-based top-down parsing with pointer networks has achieved state-of-the-art results in multiple parsing tasks, while having a linear time complexity. However, the decoder of these parsers has a sequential structure, which does not yield the most appropriate inductive bias for deriving tree structures. In this paper, we propose hierarchical pointer network parsers, and apply them to dependency and sentence-level discourse parsing tasks. Our results on standard benchmark datasets demonstrate the effectiveness of our approach, outperforming existing methods and setting a new state-of-the-art.
Parsing of sentences is a core natural language understanding task, where the goal is to construct a tree structure that best describes the relationships between the tree constituents (e.g., words, phrases). For example, Figure 1 shows examples of a dependency tree and a sentence-level discourse tree that respectively represent how the words and clauses are related in a sentence. Such parse trees are directly useful in numerous NLP applications, and also serve as intermediate representations for further language processing tasks such as semantic and discourse processing.
Existing approaches to parsing can be distinguished based on whether they employ a greedy transition-based algorithm Marcu99; zhang-nivre-2011-transition; Wang-acl-2017 or a globally optimized algorithm such as graph-based methods for dependency parsing Eisner:1996:TNP:992628.992688 or chart parsing for discourse Marcu03; joty-carenini-ng-cl-15. Transition-based parsers build the tree incrementally by making a series of shift-reduce decisions. The advantage of this method is that the parsing time is linear with respect to the sequence length. The limitation, however, is that the decisions made at each step are based on local information, which prevents the model from capturing long-distance dependencies and causes errors to propagate to subsequent steps. Recent methods attempt to address this issue using neural networks capable of remembering long-range relationships such as Stacked LSTMs dyer-etal-2015-transition; ballesteros-etal-2015-improved or using globally normalized models andor-etal-2016-globally.
The globally optimized methods, on the other hand, learn scoring functions for subtrees and perform search over all possible trees to find the most probable tree for a text. Recent graph-based methods use neural models as scoring functions kiperwasser-goldberg-2016-simple; DozatMann17. Despite being more accurate than greedy parsers, these methods are generally slow, with a time complexity that is polynomial in the sequence length.
Recently, transition-based top-down parsing with Pointer Networks Vinyals_NIPS2015 has attained state-of-the-art results in both dependency and discourse parsing tasks with the same computational efficiency Xuezhe18; Xiang19, thanks to the encoder-decoder architecture that makes it possible to capture information from the whole text and the previously derived subtrees, while limiting the number of parsing steps to linear. However, the decoder of these parsers has a sequential structure, which may not yield the most appropriate inductive bias for deriving a hierarchical structure. For example, when decoding “pens” in Figure 1 in a top-down depth-first manner, the decoder state is directly conditioned on “and” as opposed to the states representing its parent “sell”. On the one hand, this may introduce irrelevant information into the current state; on the other hand, as the text gets longer, the decoder state at later steps tends to forget more relevant information because of the long distance. Having an explicit hierarchical inductive bias should allow the model to receive more relevant information and help with the long-term dependency problem by providing shortcuts for gradient back-propagation.
In this paper, we propose a Hierarchical Pointer Network (H-PtrNet) parser to address the above-mentioned limitations. In addition to the sequential dependencies, our parser also directly models the parent-child and sibling relationships in the decoding process. We apply our proposed method to both dependency and discourse parsing tasks. To verify the effectiveness of our approach, we conduct extensive experiments and analysis on both tasks. Our results demonstrate that in dependency parsing, our model outperforms the baseline on most of the languages. In discourse parsing, we push forward the state-of-the-art in all evaluation metrics. Furthermore, our results on the hardest task of relation labeling come very close to the human agreement score on this task. We have released our code at https://ntunlpsg.github.io/project/parser/ptrnet-depparser/ for research purposes.
2.1 Dependency Parsing
Dependency parsing is the task of predicting the existence and type of linguistic dependency relations between words in a sentence (Figure 1a). Given an input sentence, the output is a tree that shows relationships (e.g., Nominal Subject (NSUBJ), Determiner (DET)) between head words and words that modify those heads, called dependents or modifiers.
Approaches to dependency parsing can be divided into two main categories: greedy transition-based parsing and graph-based parsing. In both paradigms, neural models have proven to be more effective than feature-based models, where selecting the composition of features is a major challenge. kiperwasser-goldberg-2016-simple proposed graph-based and transition-based dependency parsers with a Bi-LSTM feature representation. Since then, much work has been done to improve these two parsers. DozatMann17 proposed a bi-affine classifier for the prediction of arcs and labels on top of a graph-based model, and achieved state-of-the-art performance. DBLP:jPTDB adopted a joint modeling approach by adding a Bi-LSTM POS tagger to generate POS tags for the graph-based dependency parser. Though transition-based methods are superior in terms of time complexity, they fail to capture the global dependency information when making decisions. To address this issue, andor-etal-2016-globally proposed a globally optimized transition-based model. Recently, by incorporating a stack within a pointer network, Xuezhe18 proposed a transition-based model and achieved state-of-the-art performance across many languages.
2.2 Discourse Parsing
Rhetorical Structure Theory or RST Mann88 is one of the most influential theories of discourse, which posits a tree structure (called discourse tree) to represent a text (Fig. 1b). The leaves of a discourse tree represent contiguous text spans called Elementary Discourse Units (EDUs). The adjacent EDUs and larger units are recursively connected by coherence relations (e.g., Condition, Attribution). Furthermore, the discourse units connected by a relation are distinguished based on their relative importance: Nucleus refers to the core part(s) while Satellite refers to the peripheral one. Coherence analysis in RST consists of two subtasks: (a) identifying the EDUs in a text, referred to as Discourse Segmentation, and (b) building a discourse tree by linking the EDUs hierarchically, referred to as Discourse Parsing. This work focuses on the more challenging task of discourse parsing, assuming that EDUs have already been identified. In fact, the state-of-the-art segmenter Xiang19 has already achieved an F1 of 95.6 on the RST discourse treebank, where the human agreement is 98.3 F1.
Earlier methods have mostly utilized hand-crafted lexical and syntactic features Marcu03; Feng-14-ACL; joty-carenini-ng-cl-15; Wang-acl-2017. Recent approaches have shown competitive results with neural models that are able to automatically learn the feature representations in an end-to-end fashion ji-eisenstein:2014:P14-1; Li-2014-acl. Very recently, Xiang19 propose a parser based on pointer networks and achieve state-of-the-art performance.
Although related, the dependency and RST tree structures (hence the parsing tasks) are different. RST structure is similar to constituency structure. Therefore, the differences between constituency and dependency structures also hold here. (There are also studies that use dependency structure to directly represent the relations between the EDUs; see muller-etal-2012-constrained; Li-2014-acl; morey-etal-2018-dependency.) First, while dependency relations can only be between words, discourse relations can be between elementary units, between larger units or both. Second, in dependency parsing, any two words can be linked, whereas RST allows connections only between two adjacent units. Third, in dependency parsing, a head word can have multiple modifier words, whereas in discourse parsing, a discourse unit can be associated with only one connection. The parsing algorithm needs to be adapted to account for these differences.
2.3 Pointer Networks.
Pointer networks Vinyals_NIPS2015 are a class of encoder-decoder models that can tackle problems where the output vocabulary depends on the input sequence. They use attention as pointers to the input elements. An encoder network first converts the input sequence $X = (x_1, \ldots, x_n)$ into a sequence of hidden states $H = (h_1, \ldots, h_n)$. At each time step $t$, the decoder takes the input from the previous step, generates a decoder state $d_t$, and uses it to attend over the input elements. The attention gives a distribution over the input elements:
$$s_t^i = \sigma(d_t, h_i); \quad a_t = \mathrm{softmax}(s_t) \tag{1}$$
where $\sigma(\cdot, \cdot)$ is a scoring function for attention, which can be a neural network or an explicit formula like dot product. The model uses $a_t$ to infer the output: $\hat{y}_t = \arg\max(a_t) = \arg\max_{y_t} p(y_t \mid y_{<t}, X, \theta)$, where $\theta$ is the set of parameters. To condition on $y_{t-1}$, the corresponding input $x_{y_{t-1}}$ is copied as the input to the decoder.
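The pointing step described above can be illustrated with a minimal sketch. It assumes dot-product scoring (one of the options mentioned in the text) and uses made-up toy vectors; the function names `dot`, `softmax` and `pointer_step` are hypothetical and are not taken from any released implementation.

```python
import math


def dot(u, v):
    """Dot product, used here as the attention scoring function."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_step(decoder_state, encoder_states, score=dot):
    """Score every encoder state against the decoder state.

    Returns the attention distribution over input positions and the
    index of the position that receives the highest probability.
    """
    scores = [score(decoder_state, h) for h in encoder_states]
    attention = softmax(scores)
    pointed = max(range(len(attention)), key=attention.__getitem__)
    return attention, pointed


# Toy example: three encoder states and one decoder state (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attention, pointed = pointer_step([0.2, 2.0], encoder_states)
```

In this toy run the second encoder state has the largest dot product with the decoder state, so the pointer selects index 1. A learned scoring function would simply replace `dot`.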
3 Hierarchical Pointer Networks
Before presenting our proposed hierarchical pointer networks, we first revisit how pointer networks have been used for parsing tasks.
3.1 Pointer Networks for Parsing.
Xuezhe18 and Xiang19 both use a pointer network as the backbone of their parsing models and achieve state-of-the-art performance in dependency and discourse parsing tasks, respectively. As shown in Figures 2(a) and 3, in both cases, the parsing algorithm is implemented in a top-down depth-first order. They share the same encoder-decoder structure. A bi-directional Recurrent Neural Network (RNN) encodes a sequence of word embeddings $X = (x_1, \ldots, x_n)$ into a sequence of hidden states $H = (h_1, \ldots, h_n)$. The decoder implements a uni-directional RNN to greedily generate the corresponding tree. It maintains a stack to keep track of the units that still need to be parsed, i.e., head words for dependency parsing and larger units for discourse parsing. At each step $t$, the decoder takes out an element from the stack and generates a decoder state $d_t$, which is in turn used in the pointer layer to compute the attention over the relevant input elements. In the case of dependency parsing, the representation of the head word is used to find its dependent. For discourse parsing, it uses the representation of the span to identify the break position that splits the text span into two subspans.
In addition to the tree structure, the parser also deals with the corresponding labelling tasks. Whenever the pointer network yields a newly created pair (i.e., head-dependent in dependency parsing, two sub-spans in discourse parsing), a separate classifier is applied to predict the corresponding relation between them.
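The stack-based top-down depth-first procedure for dependency parsing can be sketched as follows. This is an illustrative sketch only: the pointer network is replaced by a hypothetical `point` callback, the oracle built from a gold tree stands in for the learned model, and the child ordering used by the oracle (left to right) is an assumption rather than the ordering of any particular parser.

```python
def decode_top_down(point):
    """Greedy top-down depth-first decoding with a stack.

    Index 0 is an artificial root and real words are numbered from 1.
    point(head, heads) returns the index of the next dependent of `head`,
    or `head` itself when it has no dependents left.
    Returns the predicted head of every word and the number of steps used.
    """
    stack = [0]
    heads = {}
    steps = 0
    while stack:
        head = stack[-1]
        child = point(head, heads)
        steps += 1
        if child == head:
            stack.pop()            # head is finished, go back to its parent
        else:
            heads[child] = head    # new head-dependent arc
            stack.append(child)    # descend: depth-first
    return heads, steps


def make_oracle(gold_heads):
    """Stand-in for the pointer network: reads decisions off a gold tree."""
    def point(head, heads):
        for child in sorted(gold_heads):
            if gold_heads[child] == head and child not in heads:
                return child
        return head
    return point


# Toy tree over three words: word 2 is the root, words 1 and 3 depend on it.
gold = {1: 2, 2: 0, 3: 2}
predicted, n_steps = decode_top_down(make_oracle(gold))
```

Every word is attached exactly once and every stack element is popped exactly once, so this sketch takes 2n + 1 pointing steps for n words, which is linear in the sentence length.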
3.2 Limitations of Existing Methods
One crucial limitation of the existing models is that the decoder has a linear structure, although the task is to construct a hierarchical structure. This can be seen in Figures 2 and 3, where the current decoder state is conditioned on the previous state (see horizontal blue lines), but not on the decoder state of its parent or of its siblings at the time they were pointed to from their head. This can introduce irrelevant information if the previous decoding state corresponds to an element that is not relevant to the current element. For example, in Figure 2, the decoder state for pointing to “pens” is conditioned on the state used for pointing to “and”, but not the one used for pointing to “sell”, which is more relevant according to the dependency structure. Also, the decoder state for “sell” is far apart from the one for “pens”. Therefore, more relevant information could be diminished in a sequential decoder, especially for long-range dependencies.
3.3 Hierarchical Decoder
To address the above issues, we propose the hierarchical pointer network (H-PtrNet), which employs a hierarchical decoder that reflects the underlying tree structure. H-PtrNet has the same encoder-decoder architecture as the original pointer network except that each decoding state is conditioned directly on its parent’s decoder state and its immediate sibling’s decoder state in addition to the previous decoder state and parent’s encoder state (from input). Formally, the pointing mechanism in H-PtrNet can be defined as:
$$d_t = f(d_{p(t)}, d_{s(t)}, d_{t-1}, h_{p(t)}) \tag{2}$$
$$s_t^i = \sigma(d_t, h_i) \tag{3}$$
$$a_t = \mathrm{softmax}(s_t) \tag{4}$$
where $f(\cdot)$ is a fusion function to combine the four components into a decoder state, $d_{p(t)}$ and $d_{s(t)}$ are the decoder states of the parent and the immediate sibling, $d_{t-1}$ is the previous decoder state, $h_{p(t)}$ is the encoder state of the parent, and other terms are similarly defined as before for Eq. 1. Figure 2(b) shows an example of H-PtrNet decoder connections for dependency parsing.
The fusion function can be implemented in multiple ways and may depend on the specific parsing task. More variants of the fusion function will be discussed in Section 4.
Decoder Time Complexity.
Given a sentence of length $n$, the number of decoding steps to build a parse tree is linear in $n$. The attention mechanism at each decoding step computes an attention vector of length $n$. The overall decoding complexity is therefore $O(n^2)$, which is the same as that of the StackPointer Parser Xuezhe18.
If we look at the decoding steps of the StackPointer Parser Xuezhe18 more closely, we notice that it also takes the decoder state of the immediate sibling (when it points to itself). This decoder state represents the state after all its children are generated. Thus it contains information about its children. In contrast, in our model we consider the decoder state when the sibling was first generated from its parent. Therefore this state contains the sibling’s parent information, which helps with capturing long term dependencies.
3.4 Model Specifics for Dependency Parsing
Figure 2(b) shows the encoding and decoding steps of H-PtrNet for dependency parsing. We use the same encoder as Xuezhe18 (red color; code at https://github.com/XuezheMax/NeuroNLP2). Given a sentence, a convolutional neural network (CNN) is used to encode a character-level representation of each word, which is then concatenated with word embedding and POS embedding vectors to generate the input sequence $X = (x_1, \ldots, x_n)$. Then a three-layer bi-directional LSTM encodes $X$ into a sequence of hidden states $H = (h_1, \ldots, h_n)$. The decoder (blue color) is a single-layer uni-directional LSTM, which also maintains a stack to keep track of the decoding status. At each decoding step $t$, the decoder receives the encoder state $h_{p(t)}$ of the parent from the stack. In addition, it gets decoder states from three different sources: the previous decoding step ($d_{t-1}$), the parent ($d_{p(t)}$) and the immediate sibling ($d_{s(t)}$).
Instead of simply feeding these three components to the decoder, we incorporate a gating mechanism to generalize the ability of our model to extract the most useful information. Eventually, the fusion function in Eq. 2 is defined with a gating mechanism. We experimented with two different gating functions:
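$$g_t = \mathrm{sigm}(W_{gp} d_{p(t)} + W_{gs} d_{s(t)} + W_{gd} d_{t-1} + b_g) \tag{5}$$
$$g_t = \mathrm{sigm}(W_{gp} (d_{t-1} \odot d_{p(t)}) + W_{gs} (d_{t-1} \odot d_{s(t)}) + b_g) \tag{6}$$
where $W_{gp}$, $W_{gs}$, $W_{gd}$ and $b_g$ are the gating weights. The fusion function is then defined as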
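$$h'_t = \tanh(W_p d_{p(t)} + W_s d_{s(t)} + W_d d_{t-1}) \tag{7}$$
$$h_t = g_t \odot h'_t \tag{8}$$
$$d_t = \mathrm{LSTM}(h_t, h_{p(t)}) \tag{9}$$
where $W_p$, $W_s$, $W_d$ are the weights to get the intermediate hidden state $h'_t$, and $g_t$ is a gate to control the information flow from the three decoder states. LSTM is the LSTM layer that accepts $h_t$ as the hidden state and $h_{p(t)}$ as its input. The LSTM decoder state $d_t$ is then used to compute the attention distribution over the encoder states in the pointer layer.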
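To make the gated fusion concrete, the following sketch combines the parent, sibling and previous decoder states into one gated vector. It assumes a sigmoid gate over a linear combination of the three states and a tanh intermediate state; this parameterization, the parameter names in `params`, and the toy identity weights are all assumptions made for illustration. The recurrent step that would consume the fused vector together with the parent's encoder state is left out.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(d_parent, d_sibling, d_prev, params):
    """Fuse three decoder states into a single gated hidden vector."""
    # Gate in (0, 1): decides how much of each dimension passes through.
    gate = [sigmoid(x) for x in add(
        matvec(params["W_gp"], d_parent),
        matvec(params["W_gs"], d_sibling),
        matvec(params["W_gd"], d_prev),
        params["b_g"],
    )]
    # Intermediate hidden state in (-1, 1) built from the same three states.
    candidate = [math.tanh(x) for x in add(
        matvec(params["W_p"], d_parent),
        matvec(params["W_s"], d_sibling),
        matvec(params["W_d"], d_prev),
    )]
    return [g * c for g, c in zip(gate, candidate)]


# Toy parameters: 2-dimensional states, identity weights, zero bias.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {name: identity for name in
          ("W_gp", "W_gs", "W_gd", "W_p", "W_s", "W_d")}
params["b_g"] = [0.0, 0.0]

fused = gated_fusion([0.5, -0.2], [0.1, 0.3], [-0.4, 0.8], params)
fused_zero = gated_fusion([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], params)
```

Because the gate lies in (0, 1) and the intermediate state in (-1, 1), every fused value stays strictly inside (-1, 1); with all-zero inputs the fused vector is zero.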
Pointer and Classifier.
As in Xuezhe18, the pointer and the label classifier are implemented as bi-affine layers. Formally, the scoring function in Eq. 3 is defined as:
$$s_t^i = g_1(h_i)^\top W g_2(d_t) + U^\top g_1(h_i) + V^\top g_2(d_t) \tag{10}$$
where $W$, $U$ and $V$ are the weights, and $g_1(\cdot)$ and $g_2(\cdot)$ are two single-layer MLPs with ELU activations. The dependency label classifier has the same structure as the pointer. More specifically,
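$$p(l \mid h_k, d_t) = \mathrm{softmax}\big(g'_1(h_k)^\top W' g'_2(d_t) + U' g'_1(h_k) + V' g'_2(d_t)\big) \tag{11}$$
where $h_k$ is the encoder state of the dependent word, $d_t$ is the decoder state of the head word, $W'$, $U'$ and $V'$ are the weights, and $g'_1(\cdot)$ and $g'_2(\cdot)$ are two single-layer MLPs with ELU activations.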
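A bi-affine score of the kind used for pointing can be sketched in a few lines. The sketch assumes the common form of a bilinear term plus two linear terms, applied to the outputs of single-layer MLPs with ELU activations; the function names and the toy weights are hypothetical, and no bias term is included.

```python
import math


def elu(x):
    """Exponential linear unit with alpha = 1."""
    return x if x > 0 else math.exp(x) - 1.0


def mlp(W, b, v):
    """Single-layer MLP with ELU activation."""
    return [elu(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


def biaffine(x, y, W, u, v):
    """Bi-affine score: x^T W y + u . x + v . y."""
    bilinear = sum(x[i] * W[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    linear_x = sum(a * b for a, b in zip(u, x))
    linear_y = sum(a * b for a, b in zip(v, y))
    return bilinear + linear_x + linear_y


# Toy example with identity MLP weights, so the MLPs leave positive inputs
# unchanged and the score can be checked by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
enc = mlp(identity, zeros, [1.0, 2.0])   # transformed encoder state
dec = mlp(identity, zeros, [3.0, 4.0])   # transformed decoder state
score = biaffine(enc, dec, identity, [1.0, 0.0], [0.0, 1.0])
```

Here the bilinear term contributes 1*3 + 2*4 = 11 and the two linear terms contribute 1 and 4, giving a score of 16. For label prediction the same structure would be evaluated once per label and normalized with a softmax.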
Partial Tree Information.
Similar to Xuezhe18, we provide the decoder at each step with higher order information about the parent and the sibling of the current node.
3.5 Model Specifics for Discourse Parsing.
For discourse parsing, our model uses the same structure as Xiang19 (https://ntunlpsg.github.io/project/parser/pointer-net-parser). The encoder is a 5-layer bidirectional RNN based on Gated Recurrent Units (BiGRU) ChoGRU. As shown in Figure 3, after obtaining a sequence of encoder hidden states representing the words, the last hidden state of each EDU is taken as its representation, generating a sequence of EDU representations for the input sentence.
Our hierarchical decoder is based on a 5-layer unidirectional GRU. The decoder maintains a stack to keep track of the spans that need to be parsed further. At time step $t$, the decoder takes the representation of the text span at the top of the stack and receives the corresponding parent decoder state $d_{p(t)}$, sibling decoder state $d_{s(t)}$ and previous decoder state $d_{t-1}$ as the input to generate the current decoder state $d_t$. For discourse parsing, we apply Eq. 7 and 9 (with GRU) to implement the fusion function and to get the decoder state $d_t$ (adding the gating mechanism did not give any gain, and only increased the number of parameters). The decoder state is then used in the pointer layer to compute the attention scores over the current text span in order to find the position of the new split. The parser then applies a relation classifier to predict the relation and the nuclearity labels for the new split.
For pointing, the parser uses a simple dot product attention. For labeling, it uses a bi-affine classifier similar to the one in Eq. 11. It takes the representations of the two spans as input and predicts the corresponding relation between them. Whenever the length of any of the newly created spans is larger than two, the parser pushes it onto the stack for further processing. Similar to dependency parsing, the decoder is also provided with partial tree information: the representations of the parent and the immediate sibling.
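The span-splitting loop for discourse parsing can be sketched as follows. The pointer layer is replaced by a hypothetical `choose_split` callback, spans are pairs of inclusive EDU indices, and the left-before-right processing order is an assumption. Following the description above, only spans with more than two EDUs are pushed onto the stack; a two-EDU span has a single possible split position, so it needs no pointing decision.

```python
def parse_spans(n_edus, choose_split):
    """Top-down depth-first splitting of an EDU sequence.

    choose_split(start, end) returns k with start <= k < end, meaning the
    span is split into [start..k] and [k+1..end].
    Returns the list of (left span, right span) pairs in decoding order.
    """
    stack = [(0, n_edus - 1)] if n_edus > 2 else []
    splits = []
    while stack:
        start, end = stack.pop()
        k = choose_split(start, end)
        if not start <= k < end:
            raise ValueError("split position outside the current span")
        left, right = (start, k), (k + 1, end)
        splits.append((left, right))
        # Push right first so that the left subspan is processed next.
        for sub in (right, left):
            if sub[1] - sub[0] + 1 > 2:
                stack.append(sub)
    return splits


# Toy stand-in for the pointer: always split after the first EDU,
# which yields a right-branching tree.
splits = parse_spans(5, lambda start, end: start)
```

With five EDUs and this toy splitter, pointing is needed for the spans (0, 4), (1, 4) and (2, 4), giving three pointing steps.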
3.6 Objective Function
As in Xuezhe18 and Xiang19, our parsers are trained to minimize the total loss (cross entropy) for building the right tree structure for a given sentence $X$. The structure loss is the pointing loss for the pointer network:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, X) \tag{12}$$
where $\theta$ denotes the model parameters, $y_{<t}$ represents the subtrees that have been generated by our parser at previous steps, and $T$ is the number of steps needed for parsing the whole sentence (i.e., number of words in dependency parsing and spans containing more than two EDUs in discourse parsing). The label classifiers are trained simultaneously, so the final loss function is the sum of the structure loss (Eq. 12) and the loss for the label classifier.
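As a small illustration of the pointing loss, the sketch below sums the negative log-probability that each decoding step assigns to its gold position. The function name and the toy attention values are hypothetical; in training, the label classifier's cross-entropy would be added to this quantity.

```python
import math


def pointing_loss(attention_per_step, gold_indices):
    """Cross-entropy of the pointer over all decoding steps.

    attention_per_step[t] is the attention distribution at step t and
    gold_indices[t] is the input position that should be pointed to.
    """
    return -sum(math.log(attention[gold])
                for attention, gold in zip(attention_per_step, gold_indices))


# Two decoding steps over two input positions (made-up probabilities).
loss = pointing_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

For these values the loss equals -(log 0.5 + log 0.75), i.e. log 2 + log(4/3).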
In this section, we describe the experimental details about dependency parsing and discourse parsing, as well as the analysis on both tasks.
Apart from the two gating-based fusion functions described in Section 3 (Eq. 5-6), we experimented with three different versions of our model depending on which connections are considered in the decoder. We append suffixes P for parent, S for sibling and T for temporal to the model name (H-PtrNet) to denote different versions.
H-PtrNet-P: The H-PtrNet model with fusion function $f(d_{p(t)}, h_{p(t)})$, where the decoder receives the hidden (decoder) state only from the parent ($d_{p(t)}$) in each decoding step. Note that $h_{p(t)}$ is the encoder state of the parent.
H-PtrNet-PS: The H-PtrNet model with fusion function $f(d_{p(t)}, d_{s(t)}, h_{p(t)})$, where the decoder receives the hidden states from both the parent and the sibling in each decoding step.
H-PtrNet-PST: This is the full model with fusion function $f(d_{p(t)}, d_{s(t)}, d_{t-1}, h_{p(t)})$ (Eq. 2). In this model, the decoder receives the hidden states from its parent, sibling and previous step in each decoding step.
4.1 Dependency Parsing
We evaluate our model on the English Penn Treebank (PTB v3.0) Marcus93, which is converted to Stanford Dependencies format with Stanford Dependency Converter 3.3.0 Schuster2016EnhancedEU. To make a thorough empirical comparison with previous studies, we also evaluate our system on seven (7) languages from the Universal Dependency (UD) Treebanks (version 2.3; http://universaldependencies.org/).
We evaluate the performance of our models with unlabeled attachment score (UAS) and labeled attachment score (LAS). We ignore punctuation in the evaluation for English.
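For reference, the two attachment scores can be computed as in the following minimal sketch, where every token is represented by a (head index, label) pair. The function name and the toy annotations are hypothetical, and the filtering of punctuation tokens is assumed to have happened beforehand.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as fractions of correctly attached tokens.

    gold and pred are equally long lists of (head, label) pairs.
    UAS counts tokens with the correct head; LAS additionally requires
    the correct dependency label.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    total = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / total
    las = sum(g == p for g, p in zip(gold, pred)) / total
    return uas, las


# Toy sentence with three tokens: all heads are right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this example UAS is 1.0 while LAS is 2/3, since the last token has the correct head but the wrong label.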
Table 1: Dependency parsing results on the UD Treebanks for StackPtr (code), H-PtrNet-PST (Gate) and H-PtrNet-PST (SGate).
We use the same setup as Xuezhe18 in the experiments for English Penn Treebank and UD Treebanks. For a fair comparison, we rerun their model with the hyperparameters provided by the authors on the same machine as our experiments. For all the languages, we follow the standard split for training, validation and testing. It should be noted that Xuezhe18 used UD Treebanks 2.1, which is not the most up-to-date version. Therefore, we rerun their code with UD Treebanks 2.3 to match our experiments. To be specific, we use structured-skipgram ling2015two embeddings for English and German, and Polyglot embeddings al2013polyglot for the other languages. The Adam optimizer Kingma2015AdamAM is used as the optimization algorithm. We apply a dropout rate of 0.33 between the layers of the encoder and to the word embeddings, as well as to Eq. 8. We use a beam size of 10 for English Penn Treebank, and a beam size of 1 for UD Treebanks. The gold-standard POS tags are used for English Penn Treebank. We also use the universal POS tags petrov12 provided in the dataset for UD Treebanks. See Appendix for a complete list of hyperparameters.
Results on UD Treebanks.
We evaluate on 7 different languages from the UD Treebanks: 4 major ones, English (en), German (de), French (fr), and Italian (it), and 3 relatively minor ones, Bulgarian (bg), Catalan (ca), and Romanian (ro). Table 1 shows the results. We refer to the results of our run of the code released by Xuezhe18 as StackPtr (code). (We do not directly report the results from their paper because we use a different version of the UD Treebanks.) StackPtr (code) and our models are trained in identical settings, making them comparable. H-PtrNet-PST (Gate) (Eq. 5) and H-PtrNet-PST (SGate) (Eq. 6) are H-PtrNet models with a gating mechanism. The element-wise product in Eq. 6 has the effect of a similarity comparison, so we denote it as SGate. With the gating mechanism, our model shows consistent improvements over the baseline on bg, en, de, fr, it and ro. We also tested H-PtrNet-PS on these 7 languages, but its performance is worse than that of StackPtr.
Results on English Penn Treebank.
Table 2 presents the results on English Penn Treebank. StackPtr (paper) refers to the results reported by Xuezhe18, and StackPtr (code) is our run of their code in identical settings as ours. Our model H-PtrNet-PST (Gate) outperforms the baseline by 0.09 and 0.08 in terms of UAS and LAS, respectively. The performance of H-PtrNet-PST (SGate) is close to that of H-PtrNet-PST (Gate), though we see a slight improvement. We also test H-PtrNet-PS (Gate), the model with parent and sibling connections only, which further improves the performance to 96.09 and 95.03 in UAS and LAS, respectively.
To make a thorough analysis of our model, we break down UAS in terms of sentence lengths to compare the performance of our model and StackPtr. We first take the performance on UD German as an example, which is shown in Figure 4. The blue line shows the performance of StackPtr, and the orange line shows the performance of our model. From Figure 4(a) we can see that our model without the gate performs better on relatively short sentences (10 to 29 words). However, the accuracy drops on longer sentences. The reason could be that adding parent and sibling hidden states to the decoder may amplify error accumulation from early parsing mistakes.
Figure 4(b) shows the performance of our model with SGate (Eq. 6), where we can see that the performance on long sentences has improved significantly. Meanwhile, it still maintains higher accuracy than StackPtr on the short sentences (10 to 29 words). Figure 5 shows two more examples, from which we can again see that our model with SGate tends to outperform StackPtr on longer sentences.
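A breakdown of UAS by sentence length, as used in this analysis, can be computed along the following lines. The bin width of 10 words, the function name and the toy data are assumptions made for illustration; each sentence is given as a pair of gold and predicted head lists.

```python
from collections import defaultdict


def uas_by_length(sentences, bin_size=10):
    """Group sentences into length bins and compute UAS per bin.

    sentences is a list of (gold_heads, pred_heads) pairs, one pair per
    sentence. The returned dict maps the lower bound of each length bin
    to the fraction of correctly attached tokens in that bin.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_heads, pred_heads in sentences:
        lower = (len(gold_heads) // bin_size) * bin_size
        correct[lower] += sum(g == p for g, p in zip(gold_heads, pred_heads))
        total[lower] += len(gold_heads)
    return {lower: correct[lower] / total[lower] for lower in sorted(total)}


# Toy data: a 3-word sentence with one wrong head and a 12-word sentence
# (a simple chain where each word attaches to the previous one) parsed
# perfectly.
short = ([2, 0, 2], [2, 0, 1])
chain = list(range(12))
breakdown = uas_by_length([short, (chain, chain)])
```

The short sentence falls into the 0 to 9 bin with UAS 2/3, and the long one into the 10 to 19 bin with UAS 1.0.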
4.2 Discourse Parsing
We use the standard RST Discourse Treebank (RST-DT) Carlson02, which contains discourse annotations for 385 news articles from Penn Treebank Marcus93. We evaluate our model in sentence-level parsing, for which we extract all the well-formed sentence-level discourse trees from document-level trees. In all, the training data contains 7321 sentences, and the testing data contains 951 sentences. These numbers match the statistics reported by Xiang19. We follow the same settings as in their experiments and randomly choose 10% of the training data for hyperparameter tuning.
Metric and Relation Labels.
Following the standard in RST parsing, we use the unlabeled (Span) and labeled (Nuclearity, Relation) metrics proposed by Marcu00. We only present the F1-score because of space limitations. Following previous work, we attach the nuclearity labels (NS, SN, NN) to 18 discourse relations, together giving 39 distinctive relation labels.
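The F1-score over tree constituents can be sketched as below. This is a simplified illustration rather than the exact evaluation procedure of Marcu00: constituents are represented as plain tuples, such as (start, end) for Span or (start, end, label) for the labeled metrics, and the function name and toy spans are hypothetical.

```python
def f1_score(gold, pred):
    """Harmonic mean of precision and recall over two sets of items."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of the three predicted spans match the gold spans.
gold_spans = [(0, 4), (1, 4), (2, 4)]
pred_spans = [(0, 4), (1, 4), (1, 3)]
span_f1 = f1_score(gold_spans, pred_spans)
```

Precision and recall are both 2/3 here, so the F1-score is also 2/3. Adding nuclearity or relation labels to the tuples makes a match require the correct label as well.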
Since our goal is to evaluate our parsing method, we conduct the experiments based on gold EDU segmentations. We compare our results with the recently proposed pointer network based parser of Xiang19 (Pointer Net). However, unlike their paper, we report results for both cases: (i) when the model was selected based on the best performance on Span identification; and (ii) when it was selected based on the relation labeling performance on the development set. We retrain their model for both settings. We also use the Adam optimizer as the optimization algorithm and ELMo Peters:2018 with a dropout rate of 0.5 as word embeddings.
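Table 3: Sentence-level discourse parsing results (F1) on RST-DT with gold EDU segmentation.
|Model||Span||Nuclearity||Relation|
|Pointer Net Xiang19 (selected on Span)||97.39||91.01||81.08|
|Pointer Net Xiang19 (selected on Relation)||97.14||91.00||81.29|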
We present the results in Table 3. In discourse parsing, the number of EDUs in a sentence is relatively small compared to the sentence lengths (in words) in dependency parsing. In dependency parsing, we observed that the performance of H-PtrNet may drop for longer sentences due to parent error accumulation. We expect that this should not be the case in discourse parsing, since the number of parsing steps is much smaller than that of dependency parsing.
We first consider the models that were selected based on Span performance. H-PtrNet-P, with only the parent connection, outperforms the baseline in all three tasks. It achieves an absolute improvement of 0.29 in span identification compared to the baseline. Considering that the performance has already exceeded the human agreement of 95.7 F1, this gain is remarkable. Thanks to the higher accuracy in finding the right spans, we also achieve 0.85 and 0.74 absolute improvements in the Nuclearity and Relation tasks, respectively. By adding the sibling and temporal connections, we test the performance of our full model, H-PtrNet-PST. Its performance on Span is 0.17 higher than the baseline, but it is not on par with our H-PtrNet-P. This is not surprising, since we adopt binary tree structures in discourse parsing, which means the sibling information could be redundant in most cases. This also accords with our previous assumption that parent connections may bring enough information to decode RST trees.
Now we consider the models that were selected based on Relation labeling performance. We achieve a significant improvement in Relation compared to the baseline. Eventually the parser yields an F1 of 82.77, which is very close to the human agreement (83.0 F1). We observe that the performance of H-PtrNet-PS and H-PtrNet-PST is better than that of H-PtrNet-P. As the relation classifier and the pointer network share the same encoder information (Sec. 3), we believe that richer decoder information leads the model to learn better representations of the text spans (encoder states), which in turn leads to better performance in relation labeling.
We further analyze the performance of our proposed model in terms of the number of EDUs. We present the F1 score in Span of H-PtrNet-P and H-PtrNet-PST as well as the baseline in Figure 6. It can be observed that both H-PtrNet-P and H-PtrNet-PST outperform the baseline for almost every number of EDUs. Moreover, we can see that H-PtrNet-P performs better in most of the cases, which once again conforms to our assumption that parent information is enough to decode RST trees. However, as discussed in our dependency parsing experiments, when the number of words (EDUs for discourse parsing) increases, the model may suffer from error accumulation from early parsing. Hence, H-PtrNet-PST tends to perform better when the number of EDUs becomes large.
In this paper, we propose hierarchical pointer network parsers and apply them to dependency and discourse parsing tasks. Our parsers address the limitation of previous methods, where the decoder has a sequential structure while it is decoding a hierarchical tree structure, by allowing more flexible information flow to help the decoder receive the most relevant information. For both tasks, our parsers outperform existing methods and set a new state-of-the-art. The broken-down analysis illustrates that our parsers perform better for long sequences, in line with the motivation of our model.
This research is partly supported by the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University (NTU), Singapore. Shafiq Joty would like to thank the funding support from his Start-up Grant (M4082038.020).