Parsing All: Syntax and Semantics, Dependencies and Spans

Junru Zhou{}^{1,2,3} , Zuchao Li {}^{1,2,3}, Hai Zhao{}^{1,2,3}
{}^{1}Department of Computer Science and Engineering, Shanghai Jiao Tong University
{}^{2}Key Laboratory of Shanghai Education Commission for Intelligent Interaction
and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
{}^{3}MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
{zhoujunru,charlee}@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn
Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of the National Natural Science Foundation of China (No. U1836222 and No. 61733011).
Abstract

Both syntactic and semantic structures are key linguistic contextual clues, and parsing the latter has been shown to benefit from parsing the former. However, few works have attempted to let semantic parsing help syntactic parsing. As linguistic representation formalisms, both syntax and semantics may be represented in either span (constituent/phrase) or dependency form, and joint learning over both forms has also seldom been explored. In this paper, we propose a novel joint model of syntactic and semantic parsing on both span and dependency representations, which incorporates syntactic information effectively into the neural network encoder and benefits from the two representation formalisms in a uniform way. The experiments show that semantics and syntax can benefit each other by optimizing joint objectives. Our single model achieves new state-of-the-art or competitive results on both span and dependency semantic parsing on Propbank benchmarks and on both dependency and constituent syntactic parsing on the Penn Treebank.

1 Introduction

This work makes the first attempt to fill the gaps in syntactic and semantic parsing by jointly considering their representation forms and their linguistic processing layers. First, both span (constituent) and dependency are effective formal representations for both semantics and syntax, and they have been well studied and discussed from both linguistic and computational perspectives, though few works have comprehensively considered the impact of either or both representation styles on the respective parsing tasks chomsky1981lectures; Li-aaai-19. Second, as semantics is usually considered a higher linguistic layer than syntax, most previous studies focus on how the latter helps the former, although syntactic clues have shown less impact on enhancing semantic parsing since neural models were introduced marcheggiani-titov-2017-encoding; li-etal-2018-unified. In fact, recent works he-etal-2017-deep; marcheggiani-etal-2017-simple propose syntax-agnostic models for semantic parsing and achieve competitive and even state-of-the-art results. However, not only may semantics benefit from syntax, which is well known, but syntax may also benefit from semantics; this is an obvious gap in explicit linguistic structure parsing, with few reported attempts. To the best of our knowledge, only shi-etal-2016-exploiting ever made a brief attempt, on the Chinese Semantic Treebank, to show the mutual benefits between dependency syntax and semantic roles.

To fill this gap, in this work, we further exploit the strengths of both the span and dependency representations of semantic role labeling (SRL) lewis-etal-2015-joint; strubell-etal-2018-linguistically and syntax, and propose a joint model with multi-task learning Caruana1993Multitask in a balanced mode which improves both semantic and syntactic parsing. Moreover, in our model, semantics is learned in an end-to-end way with a uniform representation, and syntactic parsing is represented as a joint span structure zhou-zhao-2019-head related to head-driven phrase structure grammar (HPSG) pollard1994head, which incorporates both the head and phrase information of dependency and constituent syntactic parsing.

We verify the effectiveness and applicability of the proposed model on Propbank semantic parsing (also called semantic role labeling, SRL, for the semantic parsing task over the Propbank) in both span style (CoNLL-2005) carreras-marquez-2005-introduction and dependency style (CoNLL-2009) hajic-etal-2009-conll, and on the Penn Treebank (PTB) MarcusJ93-2004 for both constituent and dependency syntactic parsing. Our empirical results show that semantics and syntax can indeed benefit each other, and our single model reaches new state-of-the-art or competitive performance on all four tasks: span and dependency SRL, and constituent and dependency syntactic parsing.

2 Structure Representation

In this section, we introduce a preprocessing method to handle the span and dependency representations, which have a strong inherent linguistic relation for both syntax and semantics.

For syntactic representation, we use a formal structure called joint span, following zhou-zhao-2019-head, to cover both the constituent and head information of the syntactic tree, based on HPSG, which is a highly lexicalized, constraint-based grammar pollard1994head. For semantic (SRL) representation, we propose a unified structure to simplify the training process and employ SRL constraints on span arguments to enforce exact inference.

(a) Constituent and dependency.
(b) Joint span structure.
Figure 1: Constituent, dependency, and joint span structures from zhou-zhao-2019-head, indexed from 1 to 9 with an interval range assigned to each node. The dotted box marks the shared part. The special category \# is assigned to divide phrases with multiple heads. The joint span structure contains constituent phrases and dependency arcs. Categ in each node represents the category of the constituent, and HEAD indicates the head word.
Figure 2: The framework of our joint parsing model.

2.1 Syntactic Representation

The joint span structure, which is related to the HEAD FEATURE PRINCIPLE (HFP) of HPSG pollard1994head, consists of all the child phrases in the constituent tree and all dependency arcs between the head and children in the dependency tree.

For example, in the constituent tree of Figure 1(a), Federal Paper Board is a phrase (1,3) assigned the category NP, and in the dependency tree, Board is the parent of Federal and Paper; thus, in our joint span structure, the head of phrase (1,3) is Board. The node S{}_{H}(1, 9) in Figure 1(b) as a joint span is: S{}_{H}(1, 9) = { S{}_{H}(1, 3), S{}_{H}(4, 8), S{}_{H}(9, 9), l(1, 9, <S>), d(Board, sells), d(., sells) }, where l(i, j, <S>) denotes that span (i, j) has category S and d(r, h) indicates the dependency between the word r and its parent h. Finally, the entire syntactic tree T, being a joint span, can be represented as:

S_{H}(T) = {S_{H}(1, 9), d(sells, root)}. (For the dependency label of each word, we train a separate multi-class classifier simultaneously with the parser by optimizing the sum of their objectives.)

Following most recent work, we apply the PTB-SD representation converted by version 3.3.0 of the Stanford parser. However, this dependency representation results in around 1% of phrases containing two or three head words. As shown in Figure 1(a), the phrase (5,8), assigned the category NP, contains two head words, paper and products, in the dependency tree. To deal with this problem, we introduce a special category \# to divide phrases with multiple heads so that each phrase has exactly one head word. After this conversion, only about 50 heads in PTB remain as errors.

Moreover, to simplify the syntactic parsing algorithm, we add a special empty category \O to spans to binarize the n-ary nodes and apply a unary atomic category to handle nodes in unary chains, as is popularly adopted in constituent syntactic parsing SternD17b; Gaddy.

2.2 Semantic Representation

Similar to the semantic representation of Li-aaai-19, we use predicate-argument-relation tuples \mathcal{Y}\in\mathcal{P}\times\mathcal{A}\times\mathcal{R}, where \mathcal{P}=\{w_{1},w_{2},...,w_{n}\} is the set of all possible predicate tokens, \mathcal{A}=\{(w_{i},\dots,w_{j})|1\leq i\leq j\leq n\} includes all the candidate argument spans and dependencies, and \mathcal{R} is the set of semantic roles, including a null label \epsilon that indicates no relation for a predicate-argument pair candidate. The difference from Li-aaai-19 is that our model predicts span and dependency arguments at the same time, which requires distinguishing single-word span arguments from dependency arguments. Thus, we represent every span argument (w_{i},\dots,w_{j}) as span S(i-1,j) and every dependency argument w_{i} as span S(i,i), and we set a special start token at the beginning of the sentence.
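To make this uniform argument representation concrete, the following minimal Python sketch maps arguments to and from the shared span indexing; the helper names and the 1-based word positions (with the special start token at position 0) are assumptions for illustration, not part of any released implementation.

```python
def encode_argument(arg_start, arg_end, is_dependency):
    """Map an argument to the uniform span indexing.

    Assumes 1-based word positions and a special start token at position 0,
    so a span argument (w_i, ..., w_j) becomes the fenced span S(i-1, j),
    while a dependency argument w_i becomes the degenerate span S(i, i).
    The two never collide: a span argument always has start < end after the shift.
    """
    if is_dependency:
        return (arg_start, arg_start)      # S(i, i)
    return (arg_start - 1, arg_end)        # S(i-1, j)


def decode_argument(span):
    """Inverse mapping: a degenerate span is a dependency argument."""
    start, end = span
    if start == end:
        return ("dependency", start, start)
    return ("span", start + 1, end)


if __name__ == "__main__":
    # Span argument covering words 2..4 and dependency argument on word 3.
    assert encode_argument(2, 4, is_dependency=False) == (1, 4)
    assert encode_argument(3, 3, is_dependency=True) == (3, 3)
    assert decode_argument((1, 4)) == ("span", 2, 4)
    assert decode_argument((3, 3)) == ("dependency", 3, 3)
```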

3 Our Model

3.1 Overview

As shown in Figure 2, our model includes four modules: token representation, a self-attention encoder, a scorer module, and two decoders. Using an encoder-decoder backbone, we apply the self-attention encoder of Vaswani17 modified with position partitioning Kitaev-2018-SelfAttentive. We take a multi-task learning (MTL) approach Caruana1993Multitask, sharing the parameters of the token representation and the self-attention encoder. Since we convert the two syntactic representations into the joint span structure and apply a uniform semantic representation, we only need two decoders: one for the syntactic tree based on the joint span syntactic parsing algorithm zhou-zhao-2019-head, and another for uniform SRL.

3.2 Token Representation

In our model, the token representation x_{i} is composed of character, word, and part-of-speech (POS) representations. For the character-level representation, we use a CharLSTM ling-etal-2015-finding. For the word-level representation, we concatenate randomly initialized and pre-trained word embeddings. We concatenate the character, word, and POS representations as our token representation x_{i}=[x_{char};x_{word};x_{POS}].

In addition, we also augment our model with ELMo PetersN18-1202, BERT Jacobbert, or XLNet XLNet-Zhilin-2019 as the sole token representation to compare with other pre-training models. Since BERT and XLNet are based on sub-words, we take only the last sub-word vector of each word in the last layer of BERT or XLNet as our sole token representation x_{i}.
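As an illustration of the concatenated token representation, here is a minimal PyTorch sketch assuming a CharLSTM along the lines of ling-etal-2015-finding; all module names and dimensions are placeholders rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class TokenRepresentation(nn.Module):
    """Concatenate character-, word-, and POS-level representations:
    x_i = [x_char; x_word; x_POS].  Dimensions are illustrative only."""

    def __init__(self, n_chars, n_words, n_pos,
                 char_dim=64, word_dim=100, pos_dim=32, pretrained=None):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Bidirectional CharLSTM over the characters of each word.
        self.char_lstm = nn.LSTM(char_dim, char_dim // 2,
                                 batch_first=True, bidirectional=True)
        self.word_emb = nn.Embedding(n_words, word_dim)   # randomly initialized
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        # Optional frozen pre-trained word embeddings, concatenated with the random ones.
        self.pre_emb = (nn.Embedding.from_pretrained(pretrained, freeze=True)
                        if pretrained is not None else None)

    def forward(self, char_ids, word_ids, pos_ids):
        # char_ids: (n_words, max_chars); word_ids, pos_ids: (n_words,)
        char_out, _ = self.char_lstm(self.char_emb(char_ids))
        x_char = char_out[:, -1]          # output at the last character position
        x_word = self.word_emb(word_ids)
        if self.pre_emb is not None:
            x_word = torch.cat([x_word, self.pre_emb(word_ids)], dim=-1)
        x_pos = self.pos_emb(pos_ids)
        return torch.cat([x_char, x_word, x_pos], dim=-1)   # x_i
```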

3.3 Self-Attention Encoder

The encoder in our model is adapted from Vaswani17 and factors explicit content and position information in the self-attention process. The input matrix X=[x_{1},x_{2},\dots,x_{n}], in which each x_{i} is concatenated with a position embedding, is transformed by the self-attention encoder. We factor the model between content and position information both in the self-attention sub-layer and the feed-forward network, with setting details following Kitaev-2018-SelfAttentive.

3.4 Scorer Module

Since span and dependency SRL share a uniform representation, we only need three types of scores: syntactic constituent span scores, syntactic dependency head scores, and semantic role scores.

We first introduce the span representation s_{ij} used for both the constituent span and semantic role scores. We define the left end-point vector as the concatenation over adjacent tokens \overleftarrow{pl_{i}}=[\overleftarrow{y_{i}};\overleftarrow{y_{i+1}}], where \overleftarrow{y_{i}} is constructed by splitting the outputs from the self-attention encoder in half. Similarly, the right end-point vector is \overrightarrow{pr_{i}}=[\overrightarrow{y_{i+1}};\overrightarrow{y_{i}}]. Then, the span representation s_{ij} is the difference of the right and left end-point vectors, s_{ij}=[\overrightarrow{pr_{j}}-\overleftarrow{pl_{i}}]. (Since we use the same end-point span s_{ij}=[\overrightarrow{pr_{j}}-\overleftarrow{pl_{i}}] to represent dependency arguments in our uniform SRL, we distinguish the left and right end-point vectors \overleftarrow{pl_{i}} and \overrightarrow{pr_{i}} to avoid a zero span representation s_{ij}.)
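For clarity, a small PyTorch sketch of this span representation follows; the shape conventions (an extra start and end position so that indices i and j+1 are always valid) and the assignment of the two halves of each encoder output to the forward and backward parts are assumptions for illustration.

```python
import torch

def span_representation(y, i, j):
    """Span representation s_ij built from self-attention encoder outputs.

    y: (n + 2, d) encoder outputs, with a start token at position 0 and an
    extra end position so y[j + 1] exists.  The first half of each output is
    treated as the "forward" part and the second half as the "backward" part.
    """
    d = y.size(-1) // 2
    fwd, bwd = y[:, :d], y[:, d:]            # assumed →y and ←y halves
    pl_i = torch.cat([bwd[i], bwd[i + 1]])   # left end-point [←y_i ; ←y_{i+1}]
    pr_j = torch.cat([fwd[j + 1], fwd[j]])   # right end-point [→y_{j+1} ; →y_j]
    return pr_j - pl_i                       # s_ij = pr_j - pl_i
```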

Constituent Span Score  Following constituent syntactic parsing work zhou-zhao-2019-head; Kitaev-2018-SelfAttentive; Gaddy, we train a constituent span scorer. We apply a one-layer feedforward network to generate the span score vector, taking the span vector s_{ij} as input:

S(i,j)=W_{2}g(LN(W_{1}s_{ij}+b_{1}))+b_{2},

where LN denotes Layer Normalization and g is the Rectified Linear Unit (ReLU) nonlinearity. The individual score of category \ell is denoted by

S_{categ}(i,j,\ell)=[S(i,j)]_{\ell},

where [\,]_{\ell} indicates the element of the score vector corresponding to category \ell. The score s(T) of the constituent parse tree T is obtained by summing the scores of all spans (i, j) with their categories \ell:

s(T)=\sum_{(i,j,\ell)\in T}S_{categ}(i,j,\ell).

The goal of constituent syntactic parsing is to find the tree with the highest score: \hat{T}=\arg\max_{T}s(T). We use a CKY-style algorithm Gaddy to obtain the tree \hat{T} in O(n^{3}) time. This structured prediction problem is handled by enforcing the margin constraint:

s(T^{*})\geq s(T)+\Delta(T,T^{*}),

where T^{*} denotes the correct parse tree and \Delta is the Hamming loss on category spans, with a slight modification during the dynamic programming search. The objective function is the hinge loss,

J_{1}(\theta)=\max(0,\max_{T}[s(T)+\Delta(T,T^{*})]-s(T^{*})).
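A minimal PyTorch sketch of the constituent span scorer and this hinge objective is given below; the hidden sizes are placeholders, and the loss-augmented tree score is assumed to come from a CKY search that already adds the Hamming term \Delta.

```python
import torch
import torch.nn as nn

class ConstituentSpanScorer(nn.Module):
    """S(i, j) = W2 g(LN(W1 s_ij + b1)) + b2, one score per category."""

    def __init__(self, span_dim, hidden_dim, n_categories):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(span_dim, hidden_dim),   # W1 s_ij + b1
            nn.LayerNorm(hidden_dim),          # LN
            nn.ReLU(),                         # g
            nn.Linear(hidden_dim, n_categories),  # W2 (.) + b2
        )

    def forward(self, s_ij):
        return self.ffn(s_ij)                  # vector of S_categ(i, j, ·)


def hinge_loss(score_augmented_best_tree, score_gold_tree):
    """J1 = max(0, max_T [s(T) + Δ(T, T*)] - s(T*)).

    Assumes `score_augmented_best_tree` is the score of the best tree found by
    loss-augmented CKY decoding (Hamming term already included), and both
    arguments are scalar tensors summing span category scores over the tree.
    """
    return torch.clamp(score_augmented_best_tree - score_gold_tree, min=0.0)
```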

Dependency Head Score  We predict a distribution over the possible heads for each word and use the biaffine attention mechanism Dozat2017Deep to calculate the score as follows:

\alpha_{ij}=h_{i}^{T}Wg_{j}+U^{T}h_{i}+V^{T}g_{j}+b,

where \alpha_{ij} indicates the child-parent score, W denotes the weight matrix of the bilinear term, U and V are the weight vectors of the linear terms, b is the bias term, and h_{i} and g_{j} are calculated by distinct one-layer perceptron networks.

We minimize the negative log-likelihood of the golden dependency tree Y, which is implemented as a cross-entropy loss:

J_{2}(\theta)=-\left(\log P_{\theta}(h_{i}|x_{i})+\log P_{\theta}(l_{i}|x_{i},h_{i})\right),

where P_{\theta}(h_{i}|x_{i}) is the probability of the correct parent node h_{i} for x_{i}, and P_{\theta}(l_{i}|x_{i},h_{i}) is the probability of the correct dependency label l_{i} for the child-parent pair (x_{i},h_{i}).
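The following PyTorch sketch illustrates the biaffine head scorer and the head term of this loss; the dimensions, initialization, and the separate label classifier (the second term of J_{2}) are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiaffineHeadScorer(nn.Module):
    """α_ij = h_i^T W g_j + U^T h_i + V^T g_j + b.

    h_i (dependent view) and g_j (head view) come from two distinct
    one-layer perceptrons over the encoder outputs; sizes are illustrative.
    """

    def __init__(self, enc_dim, mlp_dim=1024):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, mlp_dim), nn.ReLU())
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, mlp_dim), nn.ReLU())
        self.W = nn.Parameter(torch.zeros(mlp_dim, mlp_dim))  # bilinear term
        self.U = nn.Parameter(torch.zeros(mlp_dim))           # linear term (dependent)
        self.V = nn.Parameter(torch.zeros(mlp_dim))           # linear term (head)
        self.b = nn.Parameter(torch.zeros(1))                 # bias

    def forward(self, y):
        # y: (n, enc_dim) encoder outputs; returns (n, n) child-parent scores.
        h = self.dep_mlp(y)                    # dependents
        g = self.head_mlp(y)                   # candidate heads
        bilinear = h @ self.W @ g.t()          # h_i^T W g_j
        linear = (h @ self.U).unsqueeze(1) + (g @ self.V).unsqueeze(0)
        return bilinear + linear + self.b      # α_ij


def head_loss(alpha, gold_heads):
    """Negative log-likelihood of the gold head for each word; the dependency
    label term of J2 would use an analogous multi-class classifier."""
    return F.cross_entropy(alpha, gold_heads)
```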

Semantic Role Score  To distinguish the currently considered predicate from its candidate arguments in context, we employ a one-layer perceptron to obtain contextualized representations for argument candidates a_{ij} (when i=j, this is the uniform representation of a dependency semantic role argument):

a_{ij}=g(W_{3}s_{ij}+b_{1}),

where g is the Rectified Linear Unit (ReLU) nonlinearity and s_{ij} denotes the span representation.

Predicate candidates p_{k} are simply represented by the outputs from the self-attention encoder: p_{k}=y_{k}.

For the semantic role score, different from Li-aaai-19, we simply concatenate the predicate and argument representations and apply a one-layer feedforward network to generate the score:

\Phi_{r}(p,a)=W_{5}g(LN(W_{4}[p_{k};a_{ij}]+b_{4}))+b_{5}, and the individual score of semantic role label r is denoted by: \Phi_{r}(p,a,r)=[\Phi_{r}(p,a)]_{r}.

Since the total number of predicate-argument pairs is O(n^{3}), which is computationally impractical, we apply the candidate pruning method of Li-aaai-19; he-etal-2018-jointly. First, we train separate scorers (\phi_{p} and \phi_{a}) for predicates and arguments using two one-layer feedforward networks. Then, the predicate and argument candidates are ranked according to their predicted scores (\phi_{p} and \phi_{a}), and we select the top n_{p} predicate and top n_{a} argument candidates, respectively:

n_{p}=\min(\lambda_{p}n,m_{p}),n_{a}=\min(\lambda_{a}n,m_{a}),

where \lambda_{p} and \lambda_{a} are pruning rates, and m_{p} and m_{a} are the maximal numbers of candidates.
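A short sketch of this candidate pruning step is shown below, assuming flat score vectors from the predicate and argument scorers; the flat indexing of argument candidates is an illustrative assumption.

```python
import torch

def prune_candidates(pred_scores, arg_scores, n_words,
                     lambda_p=0.4, lambda_a=0.6, m_p=30, m_a=300):
    """Keep the top-scoring predicate and argument candidates.

    pred_scores: (n_words,) scores φ_p from the predicate scorer.
    arg_scores:  (n_candidates,) scores φ_a over candidate argument spans.
    n_p = min(λ_p · n, m_p) and n_a = min(λ_a · n, m_a), as in the text.
    """
    n_p = min(int(lambda_p * n_words), m_p)
    n_a = min(int(lambda_a * n_words), m_a)
    top_preds = torch.topk(pred_scores, min(n_p, pred_scores.numel())).indices
    top_args = torch.topk(arg_scores, min(n_a, arg_scores.numel())).indices
    return top_preds, top_args
```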

Finally, the semantic role scorer is trained to optimize the probability \textit{P}_{\theta}(\hat{y}|s) of the predicate-argument-relation tuples \hat{y}_{(p,a,r)}\in\mathcal{Y} given the sentence s, which can be factorized as:

J_{3}(\theta)=\sum_{p\in\mathcal{P},a\in\mathcal{A},r\in\mathcal{R}}-\log P_{\theta}(y_{(p,a,r)}|s)=\sum_{p\in\mathcal{P},a\in\mathcal{A},r\in\mathcal{R}}-\log\frac{\exp{\phi(p,a,r)}}{\sum_{\hat{r}\in\mathcal{R}}\exp{\phi(p,a,\hat{r})}},

where \theta represents the model parameters and \phi(p,a,r)=\phi_{p}+\phi_{a}+\Phi_{r}(p,a,r) is the score of the predicate-argument-relation tuple, including the predicate score \phi_{p}, the argument score \phi_{a}, and the semantic role label score \Phi_{r}(p,a,r). In addition, we fix the score of the null label to \phi(p,a,\epsilon)=0.

Finally, we train our scorers by minimizing the overall loss:

J_{overall}(\theta)=J_{1}(\theta)+J_{2}(\theta)+J_{3}(\theta).

3.5 Decoder Module

Algorithm 1 Joint span syntactic parsing algorithm
Input: sentence length n; span and dependency scores s(i,j,\ell), d(r,h), 1\leq i\leq j\leq n, \forall r,h,\ell
Output: maximum score S_{H}(T) of tree T
  Initialization: s_{c}[i][j][h]=s_{i}[i][j][h]=0, \forall i,j,h
  for len=1 to n do
    for i=1 to n-len+1 do
      j=i+len-1
      if len=1 then
        s_{c}[i][j][i]=s_{i}[i][j][i]=\max_{\ell}s(i,j,\ell)
      else
        for h=i to j do
          split_{l}=\max_{i\leq r<h}\{\ \max_{r\leq k<h}\{\ s_{c}[i][k][r]+s_{i}[k+1][j][h]\ \}+d(r,h)\ \}
          split_{r}=\max_{h<r\leq j}\{\ \max_{h\leq k<r}\{\ s_{i}[i][k][h]+s_{c}[k+1][j][r]\ \}+d(r,h)\ \}
          s_{c}[i][j][h]=\max\{\ split_{l},split_{r}\ \}+\max_{\ell\neq\varnothing}s(i,j,\ell)
          s_{i}[i][j][h]=\max\{\ split_{l},split_{r}\ \}+\max_{\ell}s(i,j,\ell)
        end for
      end if
    end for
  end for
  S_{H}(T)=\max_{1\leq h\leq n}\{\ s_{c}[1][n][h]+d(h,root)\ \}

Decoder for Joint Span Syntax

As the joint span is defined recursively, scoring the root joint span is equivalent to scoring all spans and dependencies in the syntactic tree.

During testing, we apply the joint span CKY-style algorithm of zhou-zhao-2019-head, shown in Algorithm 1, to explicitly find the joint span syntactic tree T with the globally highest score S_{H}(T). (For further details, see zhou-zhao-2019-head, which discusses the differences from the constituent-only CKY-style algorithm, how to binarize the joint span tree, and the time and space complexity.)

Also, to control the effect of combining the span and dependency scores, we apply a weight \lambda_{H} (we also tried incorporating head information directly in the constituent syntactic training process, i.e., a max-margin loss over both scores, but this made training more complex and unstable; we therefore instead employ a parameter that balances the two scores in the joint decoder, which is easy to implement and performs better):

s(i,j,\ell)=\lambda_{H}S_{categ}(i,j,\ell),d(i,j)=(1-\lambda_{H})\alpha_{ij},

where \lambda_{H} is in the range 0 to 1. In addition, we can generate a constituent-only or dependency-only syntactic parse tree by setting \lambda_{H} to 1 or 0, respectively.
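To make the joint decoding concrete, the following NumPy sketch is a direct, unoptimized transcription of Algorithm 1 combined with the \lambda_{H} weighting above; the index conventions (label index 0 reserved for the empty category \varnothing and an extra column for the root) are assumptions, and a practical implementation would additionally keep backpointers to recover the tree and use the more efficient search discussed in zhou-zhao-2019-head.

```python
import numpy as np

def joint_span_cky(span_scores, dep_scores, lambda_h=0.8):
    """Unoptimized transcription of Algorithm 1.

    span_scores: (n+1, n+1, L) array, span_scores[i][j][l] for 1 <= i <= j <= n,
                 with label index 0 reserved for the empty category.
    dep_scores:  (n+1, n+2) array, dep_scores[r][h] = score of head h for word r,
                 with column n+1 used for the root.
    Returns only the best joint score S_H(T).
    """
    n = span_scores.shape[0] - 1
    s = lambda_h * span_scores          # s(i, j, l) = λ_H · S_categ(i, j, l)
    d = (1.0 - lambda_h) * dep_scores   # d(r, h)    = (1 - λ_H) · α_rh
    NEG = -1e18
    s_c = np.full((n + 1, n + 1, n + 1), NEG)   # complete spans
    s_i = np.full((n + 1, n + 1, n + 1), NEG)   # incomplete spans

    for i in range(1, n + 1):                   # length-1 spans
        s_c[i][i][i] = s_i[i][i][i] = s[i][i].max()

    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for h in range(i, j + 1):
                best = NEG
                # left children: head r of the left part attaches to h
                for r in range(i, h):
                    for k in range(r, h):
                        best = max(best, s_c[i][k][r] + s_i[k + 1][j][h] + d[r][h])
                # right children: head r of the right part attaches to h
                for r in range(h + 1, j + 1):
                    for k in range(h, r):
                        best = max(best, s_i[i][k][h] + s_c[k + 1][j][r] + d[r][h])
                s_c[i][j][h] = best + s[i][j][1:].max()   # best non-empty category
                s_i[i][j][h] = best + s[i][j].max()       # any category, incl. empty

    return max(s_c[1][n][h] + d[h][n + 1] for h in range(1, n + 1))
```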

Decoder for Uniform Semantic Role  Since we apply a uniform span representation for both dependency and span semantic roles, we use a single dynamic programming decoder to generate the two semantic forms, following the non-overlapping constraint: span semantic arguments for the same predicate do not overlap punyakanok-2008-importance.
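The non-overlapping constraint can be illustrated with the simplified sketch below; note that our model uses a dynamic programming decoder, whereas this greedy selection over scored candidates is only meant to show how degenerate (dependency) spans and genuine span arguments are treated under the constraint.

```python
def decode_arguments(candidates):
    """Select arguments for one predicate under the non-overlapping constraint.

    candidates: list of (start, end, role, score) tuples for a single predicate,
    already filtered to exclude the null label.  Degenerate spans (start == end)
    are dependency arguments in the uniform representation and are kept as-is;
    only genuine span arguments must not overlap each other.
    """
    selected = []
    for start, end, role, score in sorted(candidates, key=lambda c: -c[3]):
        if start == end:                          # dependency argument
            selected.append((start, end, role, score))
            continue
        overlaps = any(s <= end and start <= e    # interval intersection test
                       for s, e, _, _ in selected if s != e)
        if not overlaps:
            selected.append((start, end, role, score))
    return selected
```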

4 Experiments

We evaluate our model on the CoNLL-2009 shared task hajic-etal-2009-conll for dependency-style SRL and the CoNLL-2005 shared task carreras-marquez-2005-introduction for span-style SRL, both using the Propbank convention palmer-etal-2005-proposition, and on the English Penn Treebank (PTB) MarcusJ93-2004 for constituent syntactic parsing, with the Stanford basic dependencies (SD) representation Marieffe06generatingtyped converted by the Stanford parser (http://nlp.stanford.edu/software/lex-parser.html) for dependency syntactic parsing. We follow the standard data split: both semantic (SRL) and syntactic parsing take sections 2-21 of the Wall Street Journal (WSJ) data as the training set; SRL takes section 24 as the development set while syntactic parsing takes section 22; SRL takes section 23 of WSJ together with 3 sections from the Brown corpus as the test set while syntactic parsing takes only section 23. POS tags are predicted using the Stanford tagger Toutanova:2003. In addition, we use two SRL setups: end-to-end and pre-identified predicates.

For the predicate disambiguation task in dependency SRL, we follow marcheggiani-titov-2017-encoding and use the off-the-shelf disambiguator from roth-lapata-2016-neural. For constituent syntactic parsing, we use the standard evalb tool (http://nlp.cs.nyu.edu/evalb/) to evaluate the F1 score. For dependency syntactic parsing, following previous work Dozat2017Deep, we report labeled and unlabeled attachment scores (LAS, UAS) excluding punctuation.

4.1 Setup

Hyperparameters  In our experiments, we use 100D GloVe PenningtonD14-1162 pre-trained embeddings. For the self-attention encoder, we use 12 self-attention layers and otherwise the same hyperparameter settings as Kitaev-2018-SelfAttentive. For the semantic role scorer, we use 512-dimensional MLP layers and 256-dimensional feed-forward networks. For candidate pruning, we set \lambda_{p} = 0.4 and \lambda_{a} = 0.6 for pruning predicates and arguments, and m_{p} = 30 and m_{a} = 300 as the maximum numbers of predicates and arguments, respectively. For the constituent span scorer, we apply 250-dimensional feed-forward networks. For the dependency head scorer, we employ two 1024-dimensional MLP layers with ReLU activation for learning specific representations and a 1024-dimensional parameter matrix for biaffine attention.

In addition, when augmenting our model with ELMo, BERT, or XLNet, we use 4 self-attention layers for ELMo and 2 self-attention layers for BERT and XLNet.

Training Details  We use a dropout rate of 0.33 for biaffine attention and MLP layers. All models are trained for up to 150 epochs with batch size 150 on a single NVIDIA GeForce GTX 1080Ti GPU with an Intel i7-7800X CPU. We use the same training settings as Kitaev-2018-SelfAttentive and kitaev2018multilingual.

Figure 3: Syntactic parsing performance for different values of the parameter \lambda_{H} on the PTB dev set.
Model                            F1      UAS     LAS
separate constituent             93.98   -       -
  converted dependency           -       95.38   94.06
separate dependency              -       95.80   94.40
joint span \lambda_{H} = 1.0     93.89   -       -
joint span \lambda_{H} = 0.0     -       95.90   94.50
joint span \lambda_{H} = 0.8     93.98   95.99   94.53
  converted dependency           -       95.70   94.60
Table 1: PTB dev set performance of joint span syntactic parsing. Converted means the corresponding dependency syntactic parsing results are obtained from the corresponding constituent parse tree using head rules.

4.2 Joint Span Syntactic Parsing

This subsection examines the joint span syntactic parsing decoder (Algorithm 1) trained jointly with both dependency and span semantic parsing. The weight parameter \lambda_{H} plays an important role in balancing the syntactic span and dependency scores. When \lambda_{H} is set to 0 or 1, the joint span parser works as a dependency-only or constituent-only parser, respectively; \lambda_{H} between 0 and 1 gives general joint span syntactic parsing, providing both constituent and dependency structure predictions. We vary \lambda_{H} from 0 to 1 in steps of 0.1, as shown in Figure 3. The best results come when \lambda_{H} is set to 0.8, which achieves the best performance on both syntactic parsing formalisms.

In addition, we compare the joint span syntactic parsing decoder with a separately learned constituent syntactic parsing model that uses the same token representation, self-attention encoder, and joint learning setting with semantic parsing, on the PTB dev set. The constituent syntactic parsing results are also converted into dependency ones by PTB-SD for comparison.

Table 1 shows that the joint span decoder benefits both constituent and dependency syntactic parsing. Besides, the comparison also shows that the dependencies directly predicted by our model are better in terms of UAS than those converted from the predicted constituent parse trees.

System                           SEM{}_{span} F{}_{1}   SEM{}_{dep} F{}_{1}   SYN{}_{con} F{}_{1}   SYN{}_{dep} UAS   SYN{}_{dep} LAS
End-to-end
SEM{}_{span} 82.27 - - - -
SEM{}_{dep} - 84.90 - - -
SEM{}_{span,dep} 83.50 84.92 - - -
SEM{}_{span,dep}, SYN{}_{con} 83.81 84.95 93.98 - -
SEM{}_{span,dep}, SYN{}_{dep} 83.13 84.24 - 95.80 94.40
SYN{}_{con,dep} - - 93.78 95.92 94.49
SEM{}_{span,dep}, SYN{}_{con,dep} 83.12 83.90 93.98 95.95 94.51
Given predicate
SEM{}_{span} 83.16 - - - -
SEM{}_{dep} - 88.23 - - -
SEM{}_{span,dep} 84.74 88.32 - - -
SEM{}_{span,dep}, SYN{}_{con} 84.46 88.40 93.78 - -
SEM{}_{span,dep}, SYN{}_{dep} 84.76 87.58 - 95.94 94.54
SEM{}_{span,dep}, SYN{}_{con,dep} 84.43 87.58 94.07 96.03 94.65
Table 2: Joint learning analysis on CoNLL-2005, CoNLL-2009, and PTB dev sets.

4.3 Joint Learning Analysis

Table 2 compares different joint settings of semantic (SRL) and syntactic parsing to examine whether semantics and syntax benefit from joint learning.

In the end-to-end mode, we find that constituent syntactic parsing can boost both styles of semantics while dependency syntactic parsing cannot. Moreover, the results of the last two rows indicate that semantics can benefit syntax simply by optimizing the joint objectives. In the given predicate mode, both constituent and dependency syntactic parsing can enhance SRL. In addition, joint learning with our uniform SRL performs better than separate learning of either dependency or span SRL in both modes.

Overall, joint semantic and constituent syntactic parsing achieves relatively better SRL results than the other settings. Thus, the remaining experiments use multi-task learning of semantics and constituent syntactic parsing (wo/dep). Since semantics benefits both syntactic formalisms, we also report results of joint learning with semantics and both syntactic parsing formalisms (w/dep).

4.4 Syntactic Parsing Results

In the wo/dep setting, we convert the constituent syntactic parsing results into dependency ones by PTB-SD for comparison and set \lambda_{H} (introduced for the joint span decoder in Section 3.5) to 1 to generate constituent syntactic parses only.

Compared to the existing state-of-the-art models without pre-training, our performance exceeds zhou-zhao-2019-head by nearly 0.2 LAS in dependency parsing and 0.3 F1 in constituent syntactic parsing, which are considerable improvements over such strong baselines. The comparison with strubell-etal-2018-linguistically shows that our joint model setting boosts both syntactic parsing and SRL, consistent with shi-etal-2016-exploiting in that syntactic parsing and SRL benefit from each other.

We augment our parser with ELMo, a larger version of BERT, or XLNet as the sole token representation to compare with other models. Our single model in the XLNet setting achieves a 96.18 F1 score for constituent syntactic parsing, and 97.23% UAS and 95.65% LAS for dependency syntactic parsing.

Model                                    UAS     LAS
Dozat2017Deep 95.74 94.08
Ma2018Stack 95.87 94.19
strubell-etal-2018-linguistically 94.92 91.87
Daniel-2019-naacl-left 96.04 94.43
zhou-zhao-2019-head 96.09 94.68
Ours converted (wo/dep) 95.20 93.90
Ours (w/dep) 96.15 94.85
Pre-training
strubell-etal-2018-linguistically(ELMo) 96.48 94.40
zhou-zhao-2019-head(ELMo) 96.76 94.68
zhou-zhao-2019-head(BERT) 97.00 95.43
Our converted (wo/dep) + ELMo 96.21 95.02
Our (w/dep) + ELMo 96.72 95.00
Ours converted (wo/dep) + BERT 96.77 95.72
Ours (w/dep) + BERT 96.90 95.32
Ours converted (wo/dep) + XLNet 97.21 96.25
Ours (w/dep) + XLNet 97.23 95.65
Table 3: Dependency syntactic parsing on WSJ test set.
Model                                    LR      LP      F1
Gaddy 91.76 92.41 92.08
SternD17b 92.57 92.56 92.56
Kitaev-2018-SelfAttentive 93.20 93.90 93.55
zhou-zhao-2019-head 93.64 93.92 93.78
Ours (wo/dep) 93.56 94.01 93.79
Ours (w/dep) 93.94 94.20 94.07
Pre-training
Kitaev-2018-SelfAttentive(ELMo) 94.85 95.40 95.13
kitaev2018multilingual(BERT) 95.46 95.73 95.59
zhou-zhao-2019-head(ELMo) 95.04 95.39 95.22
zhou-zhao-2019-head(BERT) 95.70 95.98 95.84
Ours (wo/dep) + ELMo 94.73 95.13 94.93
Ours (w/dep) + ELMo 95.07 95.40 95.23
Ours (wo/dep) + BERT 95.27 95.76 95.51
Ours (w/dep) + BERT 95.39 95.64 95.52
Ours (wo/dep) + XLNet 96.01 96.36 96.18
Ours (w/dep) + XLNet 96.10 96.26 96.18
Table 4: Constituent syntactic parsing on WSJ test set
System                                   WSJ P   WSJ R   WSJ F{}_{1}   Brown P   Brown R   Brown F{}_{1}
End-to-end
he-etal-2018-jointly 81.2 83.9 82.5 69.7 71.9 70.8
Li-aaai-19 - - 83.0 - - -
Tan-Deep-Semantic 84.5 85.2 84.8 73.5 74.6 74.1
strubell-etal-2018-linguistically 84.07 83.16 83.61 73.32 70.56 71.91
strubell-etal-2018-linguistically* 85.53 84.45 84.99 75.8 73.54 74.66
Ours (wo/dep) 83.65 85.48 84.56 72.02 73.08 72.55
Ours (w/dep) 83.54 85.30 84.41 71.84 72.07 71.95
+ Pre-training
he-etal-2018-jointly 84.8 87.2 86.0 73.9 78.4 76.1
Li-aaai-19 85.2 87.5 86.3 74.7 78.1 76.4
strubell-etal-2018-linguistically 86.69 86.42 86.55 78.95 77.17 78.05
strubell-etal-2018-linguistically* 87.13 86.67 86.90 79.02 77.49 78.25
Ours (wo/dep) + ELMo 85.30 87.70 86.48 76.07 78.27 77.15
Ours (w/dep) + ELMo 85.33 87.70 86.50 75.95 78.30 77.11
Ours (wo/dep) + BERT 86.77 88.49 87.62 79.06 81.67 80.34
Ours (w/dep) + BERT 86.46 88.23 87.34 77.26 80.20 78.70
Ours (wo/dep) + XLNet 87.65 89.66 88.64 80.77 83.92 82.31
Ours (w/dep) + XLNet 87.48 89.51 88.48 80.46 84.15 82.26
Given predicate
he-etal-2018-jointly - - 83.9 - - 73.7
ouchi-etal-2018-span 84.7 82.3 83.5 76.0 70.4 73.1
strubell-etal-2018-linguistically 84.72 84.57 84.64 74.77 74.32 74.55
strubell-etal-2018-linguistically* 86.02 86.05 86.04 76.65 76.44 76.54
Ours (wo/dep) 85.93 85.76 85.84 76.92 74.55 75.72
Ours (w/dep) 85.61 85.39 85.50 73.9 73.22 73.56
+ Pre-training
he-etal-2018-jointly - - 87.4 - - 80.4
ouchi-etal-2018-span 88.2 87.0 87.6 79.9 77.5 78.7
Li-aaai-19 87.9 87.5 87.7 80.6 80.4 80.5
Ours (wo/dep) + ELMo 87.76 88.29 88.02 79.59 78.64 79.11
Ours (w/dep) + ELMo 87.75 87.91 87.82 80.81 79.51 80.15
Ours (wo/dep) + BERT 89.04 88.79 88.91 81.89 80.98 81.43
Ours (w/dep) + BERT 88.94 88.53 88.73 81.66 80.80 81.23
Ours (wo/dep) + XLNet 89.89 89.74 89.81 85.35 84.57 84.96
Ours (w/dep) + XLNet 89.62 89.82 89.72 85.08 84.84 84.96
Table 5: Span SRL results on CoNLL-2005 test sets. * represents injecting state-of-the-art predicted parses.
System                                   WSJ P   WSJ R   WSJ F{}_{1}   Brown P   Brown R   Brown F{}_{1}
End-to-end
Li-aaai-19 - - 85.1 - - -
Ours (wo/dep) 84.24 87.55 85.86 76.46 78.52 77.47
Ours (w/dep) 83.73 86.94 85.30 76.21 77.89 77.04
+ Pre-training
he-etal-2018-syntax 83.9 82.7 83.3 - - -
cai-etal-2018-full 84.7 85.2 85.0 - - 72.5
Li-aaai-19 84.5 86.1 85.3 74.6 73.8 74.2
Ours(wo/dep) + ELMo 85.21 88.17 86.66 78.62 80.76 79.68
Ours (w/dep) + ELMo 84.85 88.21 86.50 78.43 80.52 79.46
Ours (wo/dep) + BERT 87.40 88.96 88.17 80.32 82.89 81.58
Ours (w/dep) + BERT 86.77 89.14 87.94 79.71 82.40 81.03
Ours (wo/dep) + XLNet 86.58 90.40 88.44 80.96 85.31 83.08
Ours (w/dep) + XLNet 86.35 90.16 88.21 80.90 85.38 83.08
Given predicate
kasai-2019-naacl-syntax 89.0 88.2 88.6 78.0 77.2 77.6
Ours (wo/dep) 88.73 89.83 89.28 82.46 83.20 82.82
Ours (w/dep) 88.02 89.03 88.52 80.98 82.10 81.54
+ Pre-training
he-etal-2018-syntax 89.7 89.3 89.5 81.9 76.9 79.3
cai-etal-2018-full 89.9 89.2 89.6 79.8 78.3 79.0
Li-aaai-19 90.0 90.0 90.0 81.7 81.4 81.5
kasai-2019-naacl-syntax 90.3 90.0 90.2 81.0 80.5 80.8
Ours (wo/dep) + ELMo 89.71 90.90 90.30 83.94 85.04 84.49
Ours (w/dep) + ELMo 89.38 90.26 89.82 83.96 84.80 84.38
Ours (wo/dep) + BERT 91.21 91.19 91.20 85.65 86.09 85.87
Ours (w/dep) + BERT 91.14 91.03 91.09 85.18 85.41 85.29
Ours (wo/dep) + XLNet 91.16 91.60 91.38 87.04 87.54 87.29
Ours (w/dep) + XLNet 90.80 91.74 91.27 86.43 87.25 86.84
Table 6: Dependency SRL results on CoNLL-2009 Propbank test sets.

4.5 Semantic Parsing Results

We present all results using the official evaluation scripts from the CoNLL-2005 and CoNLL-2009 shared tasks, and compare our model with previous state-of-the-art models in Tables 5 and 6. The upper part of each table presents results in end-to-end mode, while the lower part shows results in given predicate mode to compare with more previous works that use pre-identified predicates. In given predicate mode, we simply replace the predicate candidates with the gold predicates, without any other modification to the input or encoder.

Span SRL Results  Table 5 shows results on the CoNLL-2005 in-domain (WSJ) and out-of-domain (Brown) test sets. It is worth noting that strubell-etal-2018-linguistically injects state-of-the-art predicted parses, following the setting of Dozat2017Deep, at test time and aims to use syntactic information to help SRL. In contrast, our model not only requires no auxiliary information at test time but also benefits both syntax and semantics. We obtain results comparable to the latest state-of-the-art method strubell-etal-2018-linguistically and outperform all recent models that use no additional information at test time.

After incorporating pre-trained contextual representations, our model achieves new state-of-the-art results in both end-to-end and given predicate modes, on both in-domain and out-of-domain text.

Dependency SRL Results  Table 6 presents the results on CoNLL-2009. We obtain new state-of-the-art results in both end-to-end and given predicate modes, on both in-domain and out-of-domain text. These results demonstrate that our uniform SRL representation can be adapted to dependency SRL and achieves impressive performance gains.

5 Related Work

In early work on SRL, most researchers focused on feature engineering over the training corpus. Traditional approaches to SRL developed rich sets of linguistic feature templates and then employed linear classifiers such as SVMs zhao-etal-2009-multilingual-dependency. Recently, especially with the impressive success of neural networks, considerable attention has been paid to syntactic features strubell-etal-2018-linguistically; kasai-2019-naacl-syntax; he-etal-2018-syntax; li-etal-2018-unified.

Besides, both span and dependency are effective formal representations for both semantics and syntax. On one hand, researchers are interested in whether the two forms of SRL models may benefit from each other rather than being developed separately, which was briefly discussed in johansson-nugues-2008-dependency. he-etal-2018-jointly was the first to apply a span-graph structure based on contextualized span representations to span SRL, and Li-aaai-19, built on these span representations, achieves state-of-the-art results on both span and dependency SRL using the same model but training each individually.

On the other hand, researchers have discussed how to encode lexical dependencies in phrase structures, as in lexicalized tree adjoining grammar (LTAG) SCHABESC88-2121 and head-driven phrase structure grammar (HPSG) pollard1994head, a constraint-based, highly lexicalized, non-derivational generative grammar framework.

6 Conclusions

This paper presents the first joint learning model evaluated on four tasks: span and dependency SRL, and constituent and dependency syntactic parsing. We exploit the relationship between semantics and syntax and conclude that not only can syntax help semantics, but semantics can also improve syntactic parsing performance. Besides, we propose two structure representations, uniform SRL and the joint span syntactic structure, to combine the span and dependency forms. In experiments on these four parsing tasks, our single model achieves state-of-the-art or competitive results.

References
