Syntax-aware Multilingual Semantic Role Labeling


Shexia He, Zuchao Li, Hai Zhao
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Key Laboratory of Shanghai Education Commission for Intelligent Interaction
and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
{heshexia, charlee},
These authors contributed equally. Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (No. U1836222 and No. 61733011).

Recently, semantic role labeling (SRL) has seen a series of successes with ever higher performance, which can be mainly attributed to syntactic integration and enhanced word representation. However, most of these efforts focus on English, while SRL for languages other than English has received relatively little attention and remains underdeveloped. This paper therefore intends to fill the gap in multilingual SRL, with special focus on the impact of syntax and contextualized word representation. Unlike existing work, we propose a novel method guided by a syntactic rule to prune arguments, which enables us to integrate syntax into a multilingual SRL model simply and effectively. We present a unified SRL model designed for multiple languages together with the proposed uniform syntax enhancement. Our model achieves new state-of-the-art results on the CoNLL-2009 benchmarks of all seven languages. Besides, we discuss the role of syntax across different languages and verify the effectiveness of deep enhanced representation for multilingual SRL.

1 Introduction

Semantic role labeling (SRL) aims to derive a meaning representation, such as an instantiated predicate-argument structure, for a sentence. The currently popular formalisms for representing the semantic predicate-argument structure are based on dependencies and spans; their main difference is that dependency SRL annotates the syntactic head of the argument rather than the entire constituent (span). This paper focuses on dependency-based SRL. Be it dependency or span, SRL plays a critical role in many natural language processing (NLP) tasks, including information extraction Christensen et al. (2011), machine translation Xiong et al. (2012) and question answering Yih et al. (2016).

Almost all traditional SRL methods relied heavily on syntactic features, which suffered the risk of erroneous syntactic input, leading to undesired error propagation. To alleviate this inconvenience, researchers as early as Zhou and Xu (2015) proposed neural SRL models without syntactic input. Cai et al. (2018) employed the biaffine attention mechanism Dozat and Manning (2017) for dependency-based SRL. In the meantime, a series of studies Roth and Lapata (2016); Marcheggiani and Titov (2017); Strubell et al. (2018); Li et al. (2018) have introduced syntactic clues in creative ways for further performance improvement, achieving favorable results. However, applying the k-order syntactic tree pruning of He et al. (2018) to the biaffine SRL model Cai et al. (2018) does not boost performance as expected, which indicates that exploiting syntactic clues in state-of-the-art SRL models still deserves deep exploration.

Besides, most of the SRL literature is dedicated to impressive performance gains on English and Chinese, while other languages have received relatively little attention. We even observe that, to date, the best reported results for some languages (Catalan and Japanese) are still from the initial CoNLL-2009 shared task Hajič et al. (2009). Therefore, we launch this multilingual SRL study to fill this long-ignored gap. Specifically, we attempt to improve the overall performance of multilingual SRL by incorporating syntax and introducing contextualized word representation, and we explore the effect of syntax on languages beyond English and Chinese.

Multilingual SRL needs to be carefully handled because of the diversity of syntactic and semantic representations across quite different languages. Despite such diversity, in this paper we manage to develop a simple and effective neural SRL model that integrates syntactic information by applying an argument pruning method in a uniform way. Specifically, we introduce a new pruning rule based on the syntactic parse tree, unlike the k-order pruning of He et al. (2018), which is determined simply by the relative distance between predicate and argument. Furthermore, different from existing work, we propose a novel method guided by this syntactic rule to prune arguments for dependency SRL. With the help of the proposed pruning method, our model can effectively alleviate the imbalanced distribution of arguments and non-arguments, achieving faster convergence during training.

To verify the effectiveness and applicability of the proposed method, we evaluate the model on all seven languages of the CoNLL-2009 datasets. Experimental results indicate that our argument pruning method is generally effective for multilingual SRL within our unified model. Moreover, our model using contextualized word representation achieves new best results on all seven datasets, the first overall update since 2009. To the best of our knowledge, this is the first attempt to study all seven languages comprehensively with deep learning models.

Figure 1: Overall architecture of our SRL model. Red denotes the given predicate, and gray indicates units dropped according to the syntactic rule. The bottom shows the syntactic dependencies.

2 Model

Given a sentence, SRL can be decomposed into four classification subtasks: predicate identification, predicate disambiguation, argument identification, and argument classification. Since the CoNLL-2009 shared task indicates all predicates beforehand, we focus on identifying arguments and labeling them with semantic roles. Our model builds on a recent syntax-agnostic SRL model Cai et al. (2018) by introducing argument pruning and enhanced word representation. In this work, we handle argument identification and classification in one shot, treating the SRL task as predicate-argument pair classification over words. Figure 1 illustrates the overall architecture of our model, which consists of three modules: (1) a bidirectional LSTM (BiLSTM) encoder, (2) an argument pruning layer that takes the BiLSTM representations as input, and (3) a biaffine scorer that takes as input the predicate and its argument candidates.

2.1 BiLSTM Encoder

Given a sentence with marked predicates, we adopt a bidirectional Long Short-Term Memory network (BiLSTM) Hochreiter and Schmidhuber (1997) to encode the sentence, taking the word representations as input. Following Cai et al. (2018), the word representation is the concatenation of five vectors: a randomly initialized word embedding, lemma embedding, part-of-speech (POS) tag embedding, pre-trained word embedding, and predicate-specific indicator embedding.

Besides, the latest work Li et al. (2018) has demonstrated that the contextualized representation ELMo (Embeddings from Language Models) Peters et al. (2018) can boost the performance of dependency SRL models on English and Chinese. To explore whether deep enhanced representation can also help other languages, we further enhance the word representation by concatenating an external embedding from recent successful language models, ELMo and BERT (Bidirectional Encoder Representations from Transformers) Devlin et al. (2018), both of which are contextualized representations. It is worth noting that we use ELMo or BERT only to obtain pre-trained contextual embeddings rather than fine-tuning them; the resulting contextual representations are kept fixed.
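As a concrete sketch of this token representation (function and argument names are ours, not from the released code; dimensions follow the experimental setup in Section 3.1), the concatenation can be written as:

```python
import numpy as np

def word_representation(word_e, lemma_e, pos_e, pretrained_e, indicator_e, ctx_e=None):
    """Concatenate the five base embeddings of one token, optionally appending
    a frozen contextual vector (ELMo or BERT), which is not fine-tuned."""
    parts = [word_e, lemma_e, pos_e, pretrained_e, indicator_e]
    if ctx_e is not None:
        parts.append(ctx_e)
    return np.concatenate(parts)

# 100-dim word/lemma/POS/pre-trained + 16-dim indicator (+ 1024-dim contextual)
rep = word_representation(*[np.zeros(100)] * 4, np.zeros(16), ctx_e=np.zeros(1024))
```

The contextual vector simply extends the input dimension of the BiLSTM encoder; no other part of the model changes.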

2.2 Argument Pruning Layer

For word pair classification modeling, one major performance bottleneck is caused by unbalanced data, especially for SRL, where more than 90% of argument candidates are non-arguments. A series of pruning methods have been proposed to alleviate this imbalanced distribution, such as the k-order pruning of He et al. (2018). However, it does not extend well to other languages, and even hinders the syntax-agnostic SRL model, as Cai et al. (2018) found when experimenting with different values of k on English. The reason might be that this pruning method breaks up the whole sentence, leading the BiLSTM encoder to take an incomplete sentence as input and fail to learn the sentence representation sufficiently.

To alleviate such drawbacks of previous syntax-based pruning methods, we propose a novel pruning rule extraction method based on the syntactic parse tree, which suits multilingual cases well. In the model implementation, we add an argument pruning layer guided by the syntactic rule after the BiLSTM layers, which absorbs the syntactic clue simply and effectively.

Syntactic Rule

Considering that all arguments are predicate-specific, it has been generally observed that for most languages the distances between a predicate and its arguments in the syntactic tree lie within a certain range. Therefore, we introduce a language-specific rule based on syntactic dependency parses to prune unlikely arguments, henceforth the syntactic rule. Specifically, given a predicate p and a candidate argument a, we define d_p and d_a to be the distances from p and a, respectively, to their nearest common ancestor node (namely, the root of the minimal subtree that includes both p and a). For example, d_p = 0 denotes that the predicate itself is the nearest common ancestor, while d_a = 1 indicates that the nearest common ancestor is the parent of the argument. We then use the distance tuple (d_p, d_a) as the relative position representation of the pair inside the parse tree. Finally, we rank all tuples by how many times each occurs in the training data, counted for each language independently.

It is worth noting that our syntactic rule is determined by the top-k frequent distance tuples. During training and inference, the syntactic rule takes effect by excluding all candidate arguments whose predicate-argument relative position in the parse tree is not in the list of top-k frequent tuples.
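The rule extraction just described amounts to a frequency count over distance tuples; a minimal sketch (function names are ours, not from the paper's code):

```python
from collections import Counter

def build_syntactic_rule(train_tuples, k):
    """Keep the top-k most frequent (d_p, d_a) distance tuples observed in
    the training data of one language (rules are counted per language)."""
    return {t for t, _ in Counter(train_tuples).most_common(k)}

def survives_pruning(rule, d_p, d_a):
    """A candidate argument is kept iff its relative position is in the rule."""
    return (d_p, d_a) in rule

# Toy counts: (0, 1) and (0, 2) dominate, so they form the top-2 rule.
rule = build_syntactic_rule([(0, 1)] * 5 + [(0, 2)] * 3 + [(1, 1)], k=2)
```

In the full model the same rule set is consulted once per candidate during both training and inference.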

Figure 2 shows simplified examples of syntactic dependency trees. Given the English sentence in (a), the current predicate is likes, whose arguments are cat and fish. For likes and cat, the predicate (likes) is their nearest common ancestor (i.e., d_p = 0) according to the syntax tree, so the relative position representation of the pair is (0, 1); the same holds for likes and fish. As for the tree in (b), suppose the marked predicate p has two arguments a1 and a2, and the nearest common ancestors of the predicate with each argument are c1 and c2, respectively. In this case, the relative position representations are obtained by counting the distances from p and from each argument up to c1 and c2.

Figure 2: Syntactic parse tree examples (dependency relations are omitted). Red represents the current predicate, and blue indicates its arguments.
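The worked examples above can be reproduced by walking head pointers up to the nearest common ancestor; below is a minimal sketch assuming a 0-indexed head array with -1 marking the root (names and the toy sentence are illustrative):

```python
from typing import List, Tuple

def distance_tuple(heads: List[int], pred: int, arg: int) -> Tuple[int, int]:
    """Relative position (d_p, d_a): distances from predicate and argument
    to their nearest common ancestor in the dependency tree.
    heads[i] is the index of token i's head; the root's head is -1."""
    def ancestors(n):                 # n plus all its ancestors, bottom-up
        path = [n]
        while heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        return path
    p_path, a_path = ancestors(pred), ancestors(arg)
    a_set = set(a_path)
    nca = next(n for n in p_path if n in a_set)  # lowest shared node
    return p_path.index(nca), a_path.index(nca)

# Toy sentence "The cat likes fish": The->cat, cat->likes, likes=root, fish->likes
heads = [1, 2, -1, 2]
```

With these heads, likes (index 2) paired with cat (1) or fish (3) yields (0, 1), matching the example in (a).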

Argument Pruning Method

To maintain the integrity of the sequential input from the whole sentence, we propose a novel syntax-based method to prune arguments, unlike most existing work Xue and Palmer (2004); Zhao et al. (2009a); He et al. (2018), which prunes argument candidates in the pre-processing stage. As shown in Figure 1, the argument pruning strategy is straightforward. In the argument pruning layer, our model drops those candidate arguments (more exactly, their BiLSTM representations) that do not comply with the syntactic rule. In other words, only the predicates and arguments that satisfy the syntactic rule are passed to the next layer.

For example (in Figure 1), given the sentence Keep your heart and mind open, the predicate Keep and the corresponding syntactic dependencies (bottom), suppose the syntactic rule on this occasion contains only the tuple (0, 1), i.e., direct children of the predicate. Then the candidate arguments your, and, and mind, whose relative positions violate the rule, are pruned by the argument pruning layer.
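In implementation terms, the layer reduces to index selection over the encoder output; here is a sketch with a NumPy array standing in for the BiLSTM states (shapes and names are our assumptions):

```python
import numpy as np

def prune_arguments(hidden, rel_positions, rule):
    """Drop the BiLSTM representations of candidates whose (d_p, d_a) tuple
    violates the syntactic rule; only surviving candidates reach the scorer.

    hidden        : (seq_len, dim) array of BiLSTM outputs
    rel_positions : per-token (d_p, d_a) tuples w.r.t. the current predicate
    rule          : set of allowed distance tuples
    """
    keep = [i for i, t in enumerate(rel_positions) if t in rule]
    return keep, hidden[keep]

# Four tokens; the rule keeps the predicate itself (0, 0) and its children (0, 1).
h = np.arange(12.0).reshape(4, 3)
keep, kept = prune_arguments(h, [(0, 1), (0, 2), (0, 0), (2, 1)], {(0, 0), (0, 1)})
```

Because the whole sentence is still encoded first, pruning here does not break the sequential input the way pre-processing-stage pruning does.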

2.3 Biaffine Scorer

As mentioned above, our model treats the SRL task as a word pair classification problem, tackling argument identification and classification in one shot. To label the arguments of given predicates, we employ a scorer with biaffine attention Dozat and Manning (2017) (biaffine scorer for short) as the role classifier on top of the argument pruning layer for the final prediction, similar to Cai et al. (2018). The biaffine scorer takes as input the BiLSTM hidden states of the predicate and the candidate arguments filtered by the argument pruning layer, denoted by h_p and h_a respectively, and then computes the probability of the corresponding semantic labels using a biaffine transformation as follows:

Φ(p, a) = h_p^T W^(1) h_a + W^(2) (h_p ⊕ h_a) + b

where ⊕ represents the concatenation operator, W^(1) and W^(2) denote the weight matrices of the bilinear and the linear terms respectively, and b is the bias term. Note that the predicate itself is also included in its own argument candidate list and scored, because a nominal predicate sometimes takes itself as its own argument.
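A minimal NumPy sketch of this scoring for a single predicate-candidate pair (the weight shapes are our assumption of the standard per-label biaffine layout):

```python
import numpy as np

def biaffine_score(h_p, h_a, W1, W2, b):
    """Score every semantic label for one (predicate, candidate) pair:
        s_l = h_p^T W1[l] h_a + W2[l] . [h_p ; h_a] + b[l]
    W1: (labels, d, d) bilinear weights; W2: (labels, 2d) linear weights."""
    bilinear = np.einsum('d,ldk,k->l', h_p, W1, h_a)       # per-label bilinear term
    linear = W2 @ np.concatenate([h_p, h_a])               # linear term on [h_p ; h_a]
    return bilinear + linear + b

# Tiny deterministic example: d = 2 hidden units, 3 candidate labels.
d, L = 2, 3
h_p, h_a = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W1 = np.zeros((L, d, d)); W1[0, 0, 1] = 2.0
W2 = np.zeros((L, 2 * d)); b = np.ones(L)
scores = biaffine_score(h_p, h_a, W1, W2, b)
```

A softmax over the returned vector then gives the label probabilities (including a null label for non-arguments).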

3 Experiments

Our model (the code is publicly available) is evaluated on the CoNLL-2009 benchmark datasets, covering Catalan, Chinese, Czech, English, German, Japanese and Spanish; statistics of the training sets are given in Table 1. For the predicate disambiguation task, we follow previous work, using the models of Zhao et al. (2009a) for Catalan and Spanish and those of Björkelund et al. (2009) for the other languages. Besides, we use the officially predicted POS tags and syntactic parses provided by the CoNLL-2009 shared task for all languages. (The shared task had two tracks, SRL-only and joint: in the former, participants did not have to develop their own syntactic parsers and could focus on SRL model development, while in the latter they had to build their own parser as well. To keep the focus on SRL, this work takes the official syntax provided by CoNLL-2009.) For the contextualized representations, we employ the multilingual ELMo of Che et al. (2018) and the BERT-Base Multilingual Cased model Devlin et al. (2018). For the syntactic rule in the argument pruning layer, to ensure more than 99% coverage of true arguments in the pruning output, we use the top-120 distance tuples for Japanese and the top-20 for the other languages as a trade-off between computation and coverage.

Dataset #sent #token #pred #arg
Catalan 13,200 390,302 37,431 84,367
Chinese 22,277 609,060 102,813 231,869
Czech 38,727 652,544 414,237 365,255
English 39,279 958,167 179,014 393,699
German 36,020 648,677 17,400 34,276
Japanese 4,393 112,555 25,712 43,957
Spanish 14,329 427,442 43,824 99,054
Table 1: Training data statistics of sentences, tokens, predicates and arguments. # denotes numbers.
Model English (P R F1) Chinese (P R F1)
Zhao et al. (2009a) 86.2 80.4 75.2 77.7
Björkelund et al. (2009) 88.6 85.2 86.9 82.4 75.1 78.6
FitzGerald et al. (2015) 87.3
Roth and Lapata (2016) 90.0 85.5 87.7 83.2 75.9 79.4
Marcheggiani et al. (2017) 88.7 86.8 87.7 83.4 79.1 81.2
Marcheggiani and Titov (2017) 89.1 86.8 88.0 84.6 80.4 82.5
He et al. (2018) (with ELMo) 89.7 89.3 89.5 84.2 81.5 82.8
Cai et al. (2018) 89.9 89.2 89.6 84.7 84.0 84.3
Li et al. (2018) (with ELMo) 90.3 89.3 89.8 84.8 81.2 83.0
Li et al. (2019) (with ELMo) 89.6 91.2 90.4
Our baseline 89.30 89.93 89.61 82.88 85.26 84.05
+ AP 89.96 89.96 89.96 84.60 84.50 84.55
+ BERT 89.80 91.20 90.50 85.76 86.50 86.13
+ AP + ELMo 90.00 90.65 90.32 84.44 84.95 84.70
+ AP + BERT 90.41 91.32 90.86 86.15 86.70 86.42
Table 2: Precision, recall and semantic F-score on CoNLL-2009 English in-domain data and Chinese test set.

3.1 Model Setup

In our experiments, all embeddings are randomly initialized, including 100-dimensional word, lemma and POS tag embeddings and a 16-dimensional predicate-specific indicator embedding He et al. (2018). The pre-trained word embeddings are 100-dimensional GloVe vectors Pennington et al. (2014) for English and 300-dimensional fastText vectors Grave et al. (2018) trained on Common Crawl and Wikipedia for the other languages, while the ELMo or BERT word embeddings are 1024-dimensional. Besides, we use a 3-layer BiLSTM with 400-dimensional hidden states, applying dropout with an 80% keep probability between time-steps and layers. For the biaffine scorer, we employ two 300-dimensional affine transformations with ReLU non-linear activation, also setting the dropout probability to 0.2. During training, we use categorical cross-entropy as the objective with the Adam optimizer Kingma and Ba (2015). All models are trained for up to 500 epochs with batch size 64.
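For reference, the hyperparameters above can be gathered into a single configuration sketch (the key names are ours, not from the released code):

```python
# Hyperparameters from Section 3.1, collected as one config (key names are ours).
CONFIG = {
    "word_dim": 100, "lemma_dim": 100, "pos_dim": 100,   # randomly initialized
    "indicator_dim": 16,                                  # predicate indicator
    "pretrained_dim": {"english": 100, "other": 300},     # GloVe / fastText
    "contextual_dim": 1024,                               # ELMo or BERT (frozen)
    "bilstm_layers": 3, "bilstm_hidden": 400,
    "dropout_keep": 0.8,                                  # between steps and layers
    "affine_dim": 300, "scorer_dropout": 0.2,
    "optimizer": "adam", "max_epochs": 500, "batch_size": 64,
}
```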

Model Catalan Chinese Czech English German Japanese Spanish
CoNLL-2009 ST best system 80.3 78.6 85.4 85.6 79.7 78.2 80.5
Zhao et al. (2009a) 80.3 77.7 85.2 86.2 76.0 78.2 80.5
Roth and Lapata (2016) 79.4 87.7 80.1 80.2
Marcheggiani et al. (2017) 81.2 86.0 87.7 80.3
Li et al. (2019) 90.4
The best previously published 80.3 84.3 86.0 90.4 80.1 78.2 80.5
Our baseline 84.07 84.05 88.35 89.61 78.36 83.08 83.47
+ AP 84.35 84.55 88.76 89.96 78.54 83.12 83.70
+ BERT 84.88 86.13 89.06 90.50 80.68 83.57 84.50
+ AP + ELMo 84.35 84.70 89.52 90.32 78.65 83.43 83.82
+ AP + BERT 85.14 86.42 89.66 90.86 80.87 83.76 84.60
Table 3: Semantic F-score on the CoNLL-2009 in-domain test sets. The first row is the best result of the CoNLL-2009 shared task Hajič et al. (2009). The best previously published results for Catalan and Japanese are from Zhao et al. (2009a), for Chinese from Cai et al. (2018), for Czech from Marcheggiani et al. (2017), for English from Li et al. (2019), and for German and Spanish from Roth and Lapata (2016).

3.2 Results and Discussion

In Table 2, we compare our single model (AP stands for argument pruning) against previous work on the English in-domain data and the Chinese test set. Our baseline is a modification of the model of Cai et al. (2018), which handled predicate disambiguation uniformly. For English, our baseline gives slightly weaker performance than the work of Li et al. (2019), which used ELMo and employed a sophisticated span selection model for predicting predicates and arguments jointly. Our model with the proposed argument pruning layer (+ AP) brings absolute improvements of 0.35% and 0.5% F1 on English and Chinese respectively, on par with the best published scores. Moreover, we introduce deep enhanced representations on top of the argument pruning. Our model utilizing BERT (+ AP + BERT) achieves new best results on the English and Chinese benchmarks.

Model Catalan (PD F1) Chinese (PD F1) English (PD F1) German (PD F1) Spanish (PD F1)
Our baseline 87.50 84.07 94.92 84.05 95.59 89.61 81.45 78.36 86.53 83.47
Biaffine SRL 89.10 84.70 95.60 84.56 95.04 89.60 81.64 78.45 87.44 83.85
+ AP 89.52 84.90 95.60 84.76 95.38 89.88 81.65 78.50 87.56 83.92
+ AP + BERT 90.08 86.04 96.17 86.90 96.37 91.00 82.36 81.14 88.27 85.15
Table 4: Results of full end-to-end model. PD denotes the accuracy of predicate disambiguation. + AP represents biaffine SRL+Argument Pruning Layer, while the last row indicates biaffine SRL+Argument Pruning Layer+BERT.

Table 3 presents all test results on the seven languages of the CoNLL-2009 datasets. So far, the best previously reported results for Catalan, Japanese and Spanish are still from the CoNLL-2009 shared task. Compared with previous methods, our baseline yields strong performance on all datasets except German. Especially for Catalan, Czech, Japanese and Spanish, our baseline outperforms existing methods by a large margin of 3.5% F1 on average. Nevertheless, applying our argument pruning to this strong syntax-agnostic baseline can still boost model performance, which demonstrates the effectiveness of the proposed method. It also indicates that syntax is generally beneficial across languages and can enhance multilingual SRL performance given effective syntactic integration.

Besides, we report the scores when leveraging ELMo and BERT for multiple languages (the last three rows in Table 3). The use of contextualized word representations further improves model performance, overwhelmingly outperforming the previously published best results and achieving a new state of the art in multilingual SRL for the first time. Furthermore, we find that ELMo promotes the overall performance of the SRL model, but BERT gives a more significant performance increase on all languages, which suggests that BERT is better at enriching contextual information. More interestingly, we observe that the performance gains from the proposed argument pruning method and from these deep enhanced representations are relatively marginal on Japanese; one possible reason is the relatively small size of its training set. This observation also indicates that ELMo and BERT are more suitable for learning on large annotated corpora.

3.3 End-to-end SRL

As mentioned above, we combine the predicate sense output of previous work to make our results directly comparable, since the official evaluation script includes this prediction in the F-score calculation. However, predicate disambiguation is a relatively simple subtask that yields high scores, and the fully end-to-end setting deserves further research. To this end, we present a full end-to-end neural model for multilingual SRL, namely Biaffine SRL, following Cai et al. (2018).

Unlike most SRL work, which treats predicate sense disambiguation and semantic role assignment as independent tasks, we jointly handle predicate disambiguation and argument labeling in one shot by introducing a virtual node VR as the nominal semantic head of the predicate. It should be noted that the predicate sense annotation of Czech and Japanese is simply the lemmatized token of the predicate, a one-to-one predicate-sense mapping. Therefore, we exclude these two languages and conduct experiments on the other five.

Table 4 shows the results of the end-to-end setting. Compared to the baseline, our full end-to-end model (Biaffine SRL) yields slightly higher predicate disambiguation accuracy overall, which gives rise to a corresponding gain in semantic F1. Moreover, our model using argument pruning and BERT reaches the highest scores on all five benchmarks. Besides, the experiments indicate that argument pruning promotes role labeling performance, while BERT significantly improves predicate disambiguation.

4 Analysis

In this section, we perform further analysis to better understand our model, exploring the impact of language features, the syntactic rule and the syntactic contribution to multilingual SRL. Since recent work has studied dependency SRL on English and Chinese thoroughly, we focus on the other five languages; these analyses are performed on the CoNLL-2009 test sets without ELMo or BERT embeddings.

4.1 Effectiveness of Language Feature

As Marcheggiani et al. (2017) point out, POS tag information is highly beneficial for English. Consequently, we conduct an ablation study on the in-domain test sets to explore how these language features impact our model. Table 5 reports the F1 scores when POS tag or lemma embeddings are removed from the baseline. Results show that omitting POS tags or lemmas leads to slight performance degradation (0.49% and 0.32% F1 on average, respectively), indicating that both help improve multilingual SRL performance. Interestingly, we see a drop of 1.06% F1 for Japanese when POS tags are not used, which demonstrates their importance for this language.

Dataset baseline w/o POS tag w/o lemma
Catalan 84.07 83.83 (-0.24) 83.60 (-0.47)
Czech 88.35 88.10 (-0.25) 88.20 (-0.15)
German 78.36 77.80 (-0.56) 78.12 (-0.24)
Japanese 83.08 82.02 (-1.06) 82.80 (-0.28)
Spanish 83.47 83.15 (-0.32) 83.00 (-0.47)
Table 5: Ablation of POS tag and lemma on test set.
Figure 3: F1 scores on test sets by top-k argument pruning for German, Catalan and Japanese.

4.2 Effectiveness of Syntactic Rule

According to our statistics of the syntactic rule on the training data for the seven languages, based on the automatically predicted parses provided by the CoNLL-2009 shared task, the total number of distinct distance tuples in the syntactic rule is no more than 120 for all of these languages except Japanese, where it is about 260. This supports our earlier hypothesis that the relative distances between a predicate and its arguments lie within a certain range. Besides, (0, 1) is the most frequently occurring tuple in all languages except Japanese, which indicates that arguments most frequently appear as children of their predicate in the dependency syntax tree.

Figure 3 shows F1 scores on the test sets under top-k argument pruning for German, Catalan and Japanese, where k = 0 represents the baseline without pruning. (Note that the maximum values of k for German, Catalan and Japanese are 36, 50 and 260, respectively; we show only the top 50 for Japanese due to limited space.) We observe that k = 20 yields the best performance on German and Catalan. As for Japanese, pruning falls short of the baseline within the plotted range, but our experiments show that the top-120 setting achieves the best results.
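The coverage/computation trade-off behind the choice of k can be measured directly on gold arguments; a sketch with toy counts (names are illustrative):

```python
from collections import Counter

def rule_coverage(gold_arg_tuples, k):
    """Fraction of gold arguments whose distance tuple is in the top-k rule."""
    counts = Counter(gold_arg_tuples)
    rule = {t for t, _ in counts.most_common(k)}
    kept = sum(c for t, c in counts.items() if t in rule)
    return kept / sum(counts.values())

# Toy distribution: (0, 1) covers 80% of gold arguments on its own.
tuples = [(0, 1)] * 8 + [(0, 2)] * 1 + [(1, 1)] * 1
```

One would sweep k upward until coverage exceeds the desired threshold (more than 99% in our setup), then stop to keep the candidate set small.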

Dataset syntactic rule k-order ΔF1
Catalan 84.35 84.18 0.17
Czech 88.76 88.75 0.01
German 78.54 78.42 0.12
Japanese 83.12 82.61 0.51
Spanish 83.70 83.53 0.17
Table 6: Comparison of our model with the syntactic rule and with k-order argument pruning (ΔF1 is the difference).
Dataset syntax-agnostic syntax-aware (predicted parse, UAS) syntax-aware (gold parse)
Catalan 84.07 84.35 (89.43) 85.50
Czech 88.35 88.76 (85.69) 88.92
German 78.36 78.54 (88.91) 78.56
Japanese 83.08 83.12 (92.29) 83.20
Spanish 83.47 83.70 (89.39) 84.82
Table 7: Syntactic contribution to multilingual SRL. predicted and gold denote which syntactic parse is used; the UAS of the predicted syntax is given in parentheses.

To reveal the strengths of the proposed syntactic rule, we conduct further experiments, replacing our syntactic-rule-based pruning with the k-order argument pruning of He et al. (2018). Following their setting, we use tenth-order pruning for the best performance. Table 6 shows the performance gaps between the two pruning methods. Compared with the syntactic rule, k-order pruning degrades model performance by 0.2% F1 on average, showing that our syntactic-rule-based pruning method is more effective and extends well to multiple languages, especially Japanese.

4.3 Syntactic Impact

In this part, we attempt to explore the syntactic impact on the other five languages. To investigate the upper bound of the syntactic contribution to multilingual SRL, we perform experiments using the gold syntactic parses, also officially provided by the CoNLL-2009 shared task, instead of the predicted ones. (We use gold syntax rather than the output of a stronger parser to probe the greatest possible syntactic contribution, considering that state-of-the-art syntactic parsers are still improving rapidly.) To be precise, the syntactic rule is counted on the gold syntactic trees and applied in the argument pruning layer. The corresponding results of our syntax-agnostic and syntax-aware models are summarized in Table 7. We also report the unlabeled attachment score (UAS) of the predicted syntax as a measure of syntactic accuracy, since our method does not use dependency labels.

Results indicate that high-quality syntax can further improve model performance, showing that syntactic information is generally effective for multilingual SRL. In particular, based on gold syntax, the top-1 argument pruning for Catalan and Spanish reaches 100% coverage (namely, for Catalan and Spanish, all arguments are children of their predicates in the gold dependency syntax tree), and hence our syntax-aware model obtains significant gains of 1.43% and 1.35% F1, respectively. In addition, combining the results of Tables 6 and 7, we find that applying k-order pruning to the syntax-agnostic model yields better performance on most languages, even though Cai et al. (2018) argue that k-order pruning does not boost performance for English. One reason for this discrepancy may be the lack of effective approaches for incorporating syntactic information into sequential neural networks. Nevertheless, the syntactic contribution is overall limited for multilingual SRL in this work, owing to the strong syntax-agnostic baseline. More effective methods to incorporate syntax into neural SRL models are therefore worth exploring, which we leave for future work.

5 Related Work

In early work on semantic role labeling, most researchers were dedicated to feature engineering Pradhan et al. (2005); Punyakanok et al. (2008); Zhao et al. (2009b, 2013). The first neural SRL model was proposed by Collobert et al. (2011) using convolutional neural networks, though its performance fell short. Later, Foland and Martin (2015) effectively extended this line of work by using syntactic features as input, and Roth and Lapata (2016) introduced syntactic paths to guide neural architectures for dependency SRL.

However, syntax-agnostic modeling has sparked much research interest since Zhou and Xu (2015) employed deep BiLSTMs for span SRL, and a series of neural SRL models without syntactic input were proposed. Marcheggiani et al. (2017) applied a simple LSTM model with effective word representations, achieving encouraging results on English, Chinese, Czech and Spanish. Cai et al. (2018) built a fully end-to-end SRL model with biaffine attention and obtained strong performance on English and Chinese. Li et al. (2019) also proposed an end-to-end model for both dependency and span SRL with a unified argument representation, obtaining favorable results on English.

Despite the success of syntax-agnostic SRL models, more recent work attempts to further improve performance by integrating syntactic information, following the impressive success of deep neural networks in dependency parsing Zhang et al. (2016); Zhou and Zhao (2019). Marcheggiani and Titov (2017) used a graph convolutional network to encode syntax for dependency SRL. He et al. (2018) proposed an extended k-order argument pruning algorithm based on the syntactic tree and boosted SRL performance. Li et al. (2018) presented a unified neural framework offering multiple methods for syntactic integration. Our method is closely related to that of He et al. (2018), being likewise designed to prune as many unlikely arguments as possible.

Multilingual SRL

To promote NLP applications, the CoNLL-2009 shared task advocated performing SRL for multiple languages. Among the participating systems, Zhao et al. (2009a) proposed an integrated approach exploiting a large-scale feature set, while Björkelund et al. (2009) used a generic feature selection procedure. Until now, only a few works Lei et al. (2015); Swayamdipta et al. (2016); Mulcaire et al. (2018) have seriously considered multilingual SRL. Among them, Mulcaire et al. (2018) built a polyglot model (training one model on multiple languages) for multilingual SRL, but their results were far from satisfactory. Therefore, this work aims to complete the overall upgrade since the CoNLL-2009 shared task and leaves polyglot training as future work.

6 Conclusion

This paper is dedicated to filling the long-standing performance gap in multilingual SRL with a newly proposed syntax-based argument pruning method. Experimental results demonstrate its effectiveness and shed light on a new perspective for incorporating syntax simply and effectively in many NLP tasks. Besides, our model substantially boosts multilingual SRL performance by introducing deep enhanced representations, achieving new state-of-the-art results on the in-domain CoNLL-2009 benchmarks for Catalan, Chinese, Czech, English, German, Japanese and Spanish. These results further show that syntactic information and deep enhanced representations benefit multiple languages, not just English.


  • Björkelund et al. (2009) Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 43–48.
  • Cai et al. (2018) Jiaxun Cai, Shexia He, Zuchao Li, and Hai Zhao. 2018. A full end-to-end semantic role labeler, syntactic-agnostic over syntactic-aware? In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 2753–2765.
  • Che et al. (2018) Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64.
  • Christensen et al. (2011) Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2011. An analysis of open information extraction based on semantic role labeling. In Proceedings of the 6th International Conference on Knowledge Capture, pages 113–120.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(1):2493–2537.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of 5th International Conference on Learning Representations (ICLR).
  • FitzGerald et al. (2015) Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 960–970.
  • Foland and Martin (2015) William Foland and James Martin. 2015. Dependency-based semantic role labeling using convolutional neural networks. In Joint Conference on Lexical and Computational Semantics, pages 279–288.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).
  • Hajič et al. (2009) Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18.
  • He et al. (2018) Shexia He, Zuchao Li, Hai Zhao, Hongxiao Bai, and Gongshen Liu. 2018. Syntax for semantic role labeling, to be, or not to be. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2061–2071.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Lei et al. (2015) Tao Lei, Yuan Zhang, Lluís Màrquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL: HLT), pages 1150–1160.
  • Li et al. (2018) Zuchao Li, Shexia He, Jiaxun Cai, Zhuosheng Zhang, Hai Zhao, Gongshen Liu, Linlin Li, and Luo Si. 2018. A unified syntax-aware framework for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2401–2411.
  • Li et al. (2019) Zuchao Li, Shexia He, Hai Zhao, Yiqing Zhang, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dependency or span, end-to-end uniform semantic role labeling. arXiv preprint arXiv:1901.05280.
  • Marcheggiani et al. (2017) Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 411–420.
  • Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1506–1515.
  • Mulcaire et al. (2018) Phoebe Mulcaire, Swabha Swayamdipta, and Noah A. Smith. 2018. Polyglot semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 667–672.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL: HLT).
  • Pradhan et al. (2005) Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Daniel Jurafsky. 2005. Semantic role labeling using different syntactic views. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 581–588.
  • Punyakanok et al. (2008) Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.
  • Roth and Lapata (2016) Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1192–1202.
  • Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5027–5038.
  • Swayamdipta et al. (2016) Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Greedy, joint syntactic-semantic parsing with stack LSTMs. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 187–197.
  • Xiong et al. (2012) Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Modeling the translation of predicate-argument structure for SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 902–911.
  • Xue and Palmer (2004) Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 88–94.
  • Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 201–206.
  • Zhang et al. (2016) Zhisong Zhang, Hai Zhao, and Lianhui Qin. 2016. Probabilistic graph-based dependency parsing with convolutional neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1382–1392.
  • Zhao et al. (2009a) Hai Zhao, Wenliang Chen, Jun’ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. 2009a. Multilingual dependency learning: Exploiting rich features for tagging syntactic and semantic dependencies. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning - Shared Task (CoNLL), pages 61–66.
  • Zhao et al. (2009b) Hai Zhao, Wenliang Chen, and Chunyu Kit. 2009b. Semantic dependency parsing of NomBank and PropBank: An efficient integrated approach via a large-scale feature selection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 30–39.
  • Zhao et al. (2013) Hai Zhao, Xiaotian Zhang, and Chunyu Kit. 2013. Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research, 46:203–233.
  • Zhou and Xu (2015) Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1127–1137, Beijing, China.
  • Zhou and Zhao (2019) Junru Zhou and Hai Zhao. 2019. Head-driven phrase structure grammar parsing on Penn treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).