Deep Semantic Role Labeling with Self-Attention
Semantic Role Labeling (SRL) is believed to be a crucial step towards natural language understanding and has been widely studied. In recent years, end-to-end SRL with recurrent neural networks (RNNs) has gained increasing attention. However, it remains a major challenge for RNNs to handle structural information and long-range dependencies. In this paper, we present a simple and effective architecture for SRL that aims to address these problems. Our model is based on self-attention, which can directly capture the relationship between two tokens regardless of their distance. Our single model achieves F on the CoNLL-2005 shared task dataset and F on the CoNLL-2012 shared task dataset, which outperforms the previous state-of-the-art results by 1.8 and 1.0 F score respectively. Besides, our model is computationally efficient, and the parsing speed is 50K tokens per second on a single Titan X GPU.
Semantic Role Labeling is a shallow semantic parsing task, whose goal is to determine essentially “who did what to whom”, “when” and “where”. Semantic roles indicate the basic event properties and relations among relevant entities in the sentence and provide an intermediate level of semantic representation, thus benefiting many NLP applications, such as Information Extraction, Question Answering, Machine Translation and Multi-document Abstractive Summarization.
Semantic roles are closely related to syntax. Therefore, traditional SRL approaches rely heavily on the syntactic structure of a sentence, which brings intrinsic complexity and restricts these systems to specific domains. Recently, end-to-end models for SRL without syntactic inputs achieved promising results on this task. As the pioneering work, Zhou and Xu introduced a stacked long short-term memory network (LSTM) and achieved state-of-the-art results. He et al. reported further improvements by using deep highway bidirectional LSTMs with constrained decoding. These successes involving end-to-end models reveal the potential ability of LSTMs to handle the underlying syntactic structure of sentences.
Despite recent successes, these RNN-based models have limitations. RNNs treat each sentence as a sequence of words and recursively compose each word with its previous hidden state. The recurrent connections make RNNs applicable to sequential prediction tasks of arbitrary length. However, several challenges remain in practice. The first is the memory compression problem. As the entire history is encoded into a single fixed-size vector, the model requires larger memory capacity to store information for longer sentences. This unbalanced way of dealing with sequential information leads the network to perform poorly on long sentences while wasting memory on shorter ones. The second concerns the inherent structure of sentences. RNNs lack a way to handle the tree structure of the inputs. Because the inputs are processed sequentially, the network remains deep in time, and the number of nonlinearities depends on the number of time steps.
To address the problems above, we present a deep attentional neural network (DeepAtt) for the task of SRL.
Although DeepAtt is fairly simple, it gives remarkable empirical results. Our single model outperforms the previous state-of-the-art systems on the CoNLL-2005 shared task dataset and the CoNLL-2012 shared task dataset by 1.8 and 1.0 F score respectively. It is also worth mentioning that on the out-of-domain dataset, we achieve an improvement upon the previous end-to-end approach by F score. The feed-forward variant of DeepAtt allows significantly more parallelization, and the parsing speed is 50K tokens per second on a single Titan X GPU.
2 Semantic Role Labeling
Given a sentence, the goal of SRL is to identify and classify the arguments of each target verb into semantic roles. For example, for the sentence “Marry borrowed a book from John last week.” and the target verb borrowed, SRL yields the following outputs:
[ARG0 Marry] [V borrowed] [ARG1 a book] [ARG2 from John] [AM-TMP last week]
Here ARG0 represents the borrower, ARG1 represents the thing borrowed, ARG2 represents the entity borrowed from, AM-TMP is an adjunct indicating the timing of the action and V represents the verb.
Generally, semantic role labeling consists of two steps: identifying and classifying arguments. The former step decides, for a given predicate, whether each candidate is a semantic argument or a non-argument, while the latter assigns a specific semantic role to each identified argument. It is also common to prune obvious non-candidates before the first step and to apply a post-processing procedure to fix inconsistent predictions after the second step. Finally, a dynamic programming algorithm is often applied at the inference stage to find the globally optimal solution for this typical sequence labeling problem.
In this paper, we treat SRL as a BIO tagging problem. Our approach is extremely simple. As illustrated in Figure 1, the original utterances and the corresponding predicate masks are first projected into real-valued vectors, namely embeddings, which are fed to the next layer. After that, we design a deep attentional neural network which takes the embeddings as inputs to capture the nested structures of the sentence and the latent dependency relationships among the labels. At the inference stage, only the topmost outputs of the attention sub-layer are passed to a logistic regression layer to make the final decision.
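To make the BIO tagging view concrete, the following is a minimal sketch of how a predicted tag sequence might be converted back into labeled argument spans. The function name `bio_to_spans` and the tag strings are illustrative assumptions, not taken from the released code. The repair of a mismatched I- tag follows the document's footnote on BIO violations (the span keeps the label of its B- tag); treating an I- tag with no open span as outside is an additional assumption.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans, with end inclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new span begins; close the currently open span, if any.
            if label is not None:
                spans.append((label, start, i - 1))
            label, start = tag[2:], i
        elif tag.startswith("I-") and label is not None:
            # Inside an open span. A mismatched I- label is ignored, so the
            # whole span keeps the label of its B- tag.
            continue
        else:
            # "O", or an I- tag with no open span (assumed to be outside).
            if label is not None:
                spans.append((label, start, i - 1))
            label, start = None, None
    if label is not None:
        spans.append((label, start, len(tags) - 1))
    return spans


words = ["Marry", "borrowed", "a", "book", "from", "John", "last", "week"]
tags = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1",
        "B-ARG2", "I-ARG2", "B-AM-TMP", "I-AM-TMP"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(words[start:end + 1]))
```

Run on the example sentence, this prints one line per role, for instance `ARG1 a book`.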
3 Deep Attentional Neural Network for SRL
In this section, we will describe DeepAtt in detail. The main component of our deep network consists of identical layers. Each layer contains a nonlinear sub-layer followed by an attentional sub-layer. The topmost layer is the softmax classification layer.
Self-attention, or intra-attention, is a special case of the attention mechanism that only requires a single sequence to compute its representation. Self-attention has been successfully applied to many tasks, including reading comprehension, abstractive summarization, textual entailment, learning task-independent sentence representations, machine translation and language understanding.
In this paper, we adopt the multi-head attention formulation by Vaswani et al. Figure 2 depicts the computation graph of the multi-head attention mechanism. The center of the graph is the scaled dot-product attention, which is a variant of dot-product (multiplicative) attention. Compared with the standard additive attention mechanism, which is implemented using a one-layer feed-forward neural network, dot-product attention relies on matrix multiplication, which allows faster computation. Given a matrix of query vectors $Q$, keys $K$ and values $V$, the scaled dot-product attention computes the attention scores based on the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
where $d$ is the number of hidden units of our network.
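As an illustration of scaled dot-product attention, here is a minimal pure-Python sketch. It scales the scores by the square root of the query width, which is taken here as the hidden size; the helper names (`matmul`, `transpose`, `softmax`) and the toy inputs are assumptions made for this example only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Return (outputs, attention weights) for queries q, keys k, values v."""
    d = len(q[0])  # width of each query vector, used as the scaling factor
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy self-attention input: two positions with two channels each.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` is a probability distribution over positions, and each output row is the corresponding weighted sum of the value vectors.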
The multi-head attention mechanism first maps the matrix of input vectors to query, key and value matrices by using different linear projections. Then $h$ parallel heads are employed to focus on different parts of the channels of the value vectors. Formally, for the $i$-th head, we denote the learned linear maps by $W_i^Q$, $W_i^K$ and $W_i^V$, which correspond to queries, keys and values respectively. Then the scaled dot-product attention is used to compute the relevance between queries and keys, and to output mixed representations. The mathematical formulation is shown below:
$$M_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Finally, all the vectors produced by the parallel heads are concatenated together to form a single vector. Again, a linear map is used to mix different channels from different heads:
$$M = \mathrm{Concat}(M_1, \ldots, M_h), \qquad Y = MW$$
where $M \in \mathbb{R}^{n \times d}$ for a sequence of $n$ tokens and $W \in \mathbb{R}^{d \times d}$.
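The multi-head computation can be sketched as follows, again in plain Python. This is a toy illustration rather than the authors' implementation: each head is assumed to project to `d // h` channels, the weights are random, and the scores are scaled by the per-head key width (an assumption; the scaling constant may differ in the original model).

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    scale = math.sqrt(len(k[0]))  # assumed: scale by the per-head key width
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, heads, w_out):
    """Self-attention over x; heads is a list of (w_q, w_k, w_v) triples."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the head outputs along the channel axis, position by position.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(x))]
    # Mix channels from different heads with a final linear map.
    return matmul(concat, w_out)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, h = 5, 8, 2  # toy sizes, not the paper's settings
x = random_matrix(n, d, rng)
heads = [tuple(random_matrix(d, d // h, rng) for _ in range(3))
         for _ in range(h)]
w_out = random_matrix(d, d, rng)
y = multi_head_attention(x, heads, w_out)
```

The output `y` has the same shape as the input `x`, which is what allows these layers to be stacked.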
The self-attention mechanism has many appealing aspects compared with RNNs or CNNs. Firstly, the distance between any input and output positions is 1, whereas in RNNs it can be as large as the sequence length $n$. Unlike CNNs, self-attention is not limited to fixed window sizes. Secondly, the attention mechanism uses a weighted sum to produce output vectors. As a result, gradient propagation is much easier than in RNNs or CNNs. Finally, dot-product attention is highly parallel. In contrast, RNNs are hard to parallelize owing to their recursive computation.
The success of neural networks is rooted in their highly flexible nonlinear transformations. Since the attention mechanism uses a weighted sum to generate output vectors, its representational power is limited. To further increase the expressive power of our attentional network, we employ a nonlinear sub-layer to transform the inputs from the bottom layers. In this paper, we explore three kinds of nonlinear sub-layers, namely recurrent, convolutional and feed-forward sub-layers.
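We use bidirectional LSTMs to build our recurrent sub-layer. Given a sequence of input vectors $\{x_t\}$, two LSTMs process the inputs in opposite directions. To maintain the same dimension between inputs and outputs, we use the sum operation to combine the two representations:
$$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}}), \qquad y_t = \overrightarrow{h_t} + \overleftarrow{h_t}$$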
For the convolutional sub-layer, we use the Gated Linear Unit (GLU) proposed by Dauphin et al. Compared with the standard convolutional neural network, GLU is much easier to learn and achieves impressive results on both language modeling and machine translation tasks. Given two filters $W$ and $V$, the output activations of GLU are computed as follows:
$$\mathrm{GLU}(X) = (X * W) \odot \sigma(X * V)$$
The filter width is set to 3 in all our experiments.
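A small sketch of a gated linear unit over a sequence may help here. It assumes zero padding so that the output length equals the input length, and it omits bias terms; both choices, along with the helper names, are assumptions for illustration. The filter width of 3 matches the setting stated above.

```python
import math


def conv1d_same(x, filt):
    """1-D convolution over time with zero padding.

    x: [n][d_in]; filt: [k][d_in][d_out]; returns [n][d_out].
    """
    k, d_in, d_out = len(filt), len(filt[0]), len(filt[0][0])
    pad = k // 2
    out = []
    for t in range(len(x)):
        row = [0.0] * d_out
        for j in range(k):
            s = t + j - pad  # input position covered by filter tap j
            if 0 <= s < len(x):
                for a in range(d_in):
                    for b in range(d_out):
                        row[b] += x[s][a] * filt[j][a][b]
        out.append(row)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def glu(x, w, v):
    """Gated linear unit: linear path times a sigmoid gate, elementwise."""
    linear = conv1d_same(x, w)
    gate = conv1d_same(x, v)
    return [[a * sigmoid(g) for a, g in zip(lin_row, gate_row)]
            for lin_row, gate_row in zip(linear, gate)]


# Toy input: 3 positions, 1 channel. The linear filter copies the centre
# position; the gate filter is all zeros, so every gate equals sigmoid(0) = 0.5.
x = [[1.0], [2.0], [3.0]]
w = [[[0.0]], [[1.0]], [[0.0]]]
v = [[[0.0]], [[0.0]], [[0.0]]]
out = glu(x, w, v)
```

With these toy filters each output is simply half of the corresponding input, which makes the gating effect easy to check by hand.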
The feed-forward sub-layer is quite simple. It consists of two linear layers with a hidden ReLU nonlinearity in the middle. Formally, we have the following equation:
$$\mathrm{FFN}(X) = \mathrm{ReLU}(XW_1)W_2$$
where $W_1$ and $W_2$ are trainable matrices. Unless otherwise noted, we set in all our experiments.
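The feed-forward sub-layer amounts to two matrix products with a ReLU in between. The sketch below uses tiny hand-picked weights so the result can be checked by hand; the matrices and sizes are illustrative assumptions, and bias terms are omitted.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def ffn(x, w1, w2):
    """Position-wise feed-forward sub-layer: ReLU(x @ w1) @ w2."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


x = [[1.0, -1.0]]                            # one position, two channels
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]     # 2 -> 3 hidden units
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 -> 2 channels
y = ffn(x, w1, w2)
```

Here the hidden pre-activations are `[1, -1, -2]`, the ReLU keeps only the first one, and the output is `[[1.0, 0.0]]`.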
Previous works pointed out that deep topology is essential to achieve good performance. In this work, we use the residual connections proposed by He et al. to ease the training of our deep attentional neural network. Specifically, the output of each sub-layer is computed by the following equation:
$$Y = X + \mathrm{SubLayer}(X)$$
We then apply layer normalization after the residual connection to stabilize the activations of the deep neural network.
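The residual connection followed by layer normalization can be sketched as a small wrapper around any sub-layer. This version omits the learned gain and bias of layer normalization and uses an assumed epsilon of 1e-6; the names `residual_block` and `double` are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one position's activations to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) row by row."""
    y = sublayer(x)
    return [layer_norm([a + b for a, b in zip(x_row, y_row)])
            for x_row, y_row in zip(x, y)]


def double(rows):
    """Stand-in sub-layer for the example; any shape-preserving function works."""
    return [[2.0 * v for v in row] for row in rows]


x = [[1.0, 2.0, 3.0, 4.0]]
out = residual_block(x, double)
```

Because the sub-layer output is added to its input, the sub-layer must preserve the channel dimension, which is why the recurrent variant sums its two directions instead of concatenating them.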
The attention mechanism itself cannot distinguish between different positions, so it is crucial to encode the position of each input word. There are various ways to encode positions, and the simplest one is to use an additional position embedding. In this work, we try the timing signal approach proposed by Vaswani et al., which is formulated as follows:
$$\mathrm{timing}(t, 2i) = \sin\left(t / 10000^{2i/d}\right), \qquad \mathrm{timing}(t, 2i+1) = \cos\left(t / 10000^{2i/d}\right)$$
The timing signals are simply added to the input embeddings. Unlike the position embedding approach, this approach does not introduce additional parameters.
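A minimal sketch of the sinusoidal timing signal follows, using the interleaved sine and cosine formulation of Vaswani et al. Real implementations sometimes arrange the sine and cosine channels differently, so the exact layout here is an assumption, as is the requirement that `d` be even.

```python
import math


def timing_signal(length, d, base=10000.0):
    """Return a [length][d] matrix of sinusoidal position signals (d even)."""
    signal = []
    for t in range(length):
        row = [0.0] * d
        for i in range(d // 2):
            angle = t / (base ** (2 * i / d))
            row[2 * i] = math.sin(angle)      # even channels use sine
            row[2 * i + 1] = math.cos(angle)  # odd channels use cosine
        signal.append(row)
    return signal


def add_timing(embeddings):
    """Add the timing signal to a [length][d] matrix of input embeddings."""
    signal = timing_signal(len(embeddings), len(embeddings[0]))
    return [[e + s for e, s in zip(e_row, s_row)]
            for e_row, s_row in zip(embeddings, signal)]


signal = timing_signal(3, 4)
shifted = add_timing([[0.0] * 4 for _ in range(3)])
```

Since the signal is a fixed function of the position, it adds no trainable parameters, in line with the remark above.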
The first step of using neural networks to process symbolic data is to represent them by distributed vectors, also called embeddings. We take the original utterances and the corresponding predicate masks as the input features. The mask $m_t$ is set to 1 if the corresponding word is a predicate, or 0 if not.
Formally, in the SRL task, we have a word vocabulary $\mathcal{V}$ and a mask vocabulary $\mathcal{C} = \{0, 1\}$. Given a word sequence $\{w_1, w_2, \ldots, w_n\}$ and a mask sequence $\{m_1, m_2, \ldots, m_n\}$, each word $w_t \in \mathcal{V}$ and its corresponding predicate mask $m_t \in \mathcal{C}$ are projected into real-valued vectors $e(w_t)$ and $e(m_t)$ through the corresponding lookup table layer, respectively. The two embeddings are then concatenated together as the output feature maps of the lookup table layers. Formally speaking, we have $x_t = [e(w_t); e(m_t)]$.
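The lookup-and-concatenate step can be sketched with two small embedding tables. The table sizes, the Gaussian initialization, the `<unk>` token and the function names are all assumptions for illustration; the paper's settings use larger embeddings.

```python
import random


def build_table(keys, dim, rng):
    """Create a toy lookup table mapping each key to a random vector."""
    return {key: [rng.gauss(0.0, 0.1) for _ in range(dim)] for key in keys}


def embed(words, predicate_index, word_table, mask_table, unk="<unk>"):
    """Concatenate each word embedding with its predicate-mask embedding."""
    features = []
    for t, word in enumerate(words):
        mask = 1 if t == predicate_index else 0  # 1 marks the target verb
        word_vec = word_table.get(word, word_table[unk])
        features.append(word_vec + mask_table[mask])  # list concatenation
    return features


rng = random.Random(0)
words = ["Marry", "borrowed", "a", "book", "from", "John", "last", "week"]
word_dim, mask_dim = 4, 2  # toy sizes
word_table = build_table(words + ["<unk>"], word_dim, rng)
mask_table = build_table([0, 1], mask_dim, rng)
features = embed(words, 1, word_table, mask_table)  # "borrowed" is the predicate
```

A sentence with several predicates would be embedded once per predicate, each time with a different mask sequence.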
We then build our deep attentional neural network to learn the sequential and structural information of a given sentence based on the feature maps from the lookup table layer. Finally, we take the outputs of the topmost attention sub-layer as inputs to make the final predictions.
Since there are dependencies between semantic labels, most previous neural network models introduced a transition model for measuring the probability of jumping between the labels. Different from these works, we perform SRL as a typical classification problem. Latent dependency information is embedded in the topmost attention sub-layer learned by our deep models. This approach is simpler and easier to implement compared to previous works.
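Formally, given an input sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$, the log-likelihood of the corresponding correct label sequence $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ is
$$\log p(\mathbf{y} \mid \mathbf{x}; \theta) = \sum_{t=1}^{n} \log p(y_t \mid \mathbf{x}; \theta)$$
Our model predicts the corresponding label $y_t$ based on the representation $z_t$ produced by the topmost attention sub-layer of DeepAtt:
$$p(y_t \mid \mathbf{x}; \theta) = p(y_t \mid z_t; \theta) = \mathrm{softmax}(W_o z_t)^{\top} \delta_{y_t}$$
where $W_o$ is the softmax matrix and $\delta_{y_t}$ is the Kronecker delta with a dimension for each output symbol, so $\mathrm{softmax}(W_o z_t)^{\top} \delta_{y_t}$ is exactly the $y_t$-th element of the distribution defined by the softmax. Our training objective is to maximize the log probabilities of the correct output labels given the input sequence over the entire training set.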
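The per-token classification objective and the argmax decoding it allows can be sketched as follows. The weight matrix, the representations and the function names are toy assumptions; labels are integer indices into the tag set.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def token_logits(z_t, w_o):
    """Apply the softmax matrix w_o ([num_labels][d]) to one representation."""
    return [sum(w * v for w, v in zip(row, z_t)) for row in w_o]


def sequence_log_likelihood(z, w_o, labels):
    """Sum of per-token log-probabilities of the gold labels."""
    return sum(log_softmax(token_logits(z_t, w_o))[y_t]
               for z_t, y_t in zip(z, labels))


def argmax_decode(z, w_o):
    """Pick the most probable label independently at every position."""
    decoded = []
    for z_t in z:
        logits = token_logits(z_t, w_o)
        decoded.append(max(range(len(logits)), key=logits.__getitem__))
    return decoded


z = [[1.0, 0.0], [0.0, 1.0]]                  # two positions, d = 2
w_o = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]    # three labels
ll = sequence_log_likelihood(z, w_o, [0, 1])
pred = argmax_decode(z, w_o)
```

Because each position is scored independently, decoding is a single pass with no transition model or dynamic programming.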
We report our empirical studies of DeepAtt on the two commonly used datasets from the CoNLL-2005 shared task and the CoNLL-2012 shared task.
The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as the training set, and section 24 as the development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus. The CoNLL-2012 dataset is extracted from the OntoNotes v5.0 corpus. The description and separation of the training, development and test sets can be found in Pradhan et al.
Initialization We initialize the weights of all sub-layers as random orthogonal matrices. For other parameters, we initialize them by sampling each element from a Gaussian distribution with mean and variance . The embedding layer can be initialized randomly or using pre-trained word embeddings. We will discuss the impact of pre-training in the analysis subsection.
Settings and Regularization The settings of our models are described as follows. The dimension of word embeddings and predicate mask embeddings is set to 100 and the number of hidden layers is set to 10. We set the number of hidden units to . The number of heads is set to 8. We apply dropout to prevent the networks from over-fitting. Dropout layers are added before residual connections with a keep probability of 0.8. Dropout is also applied before the attention softmax layer and the feed-forward ReLU hidden layer, and the keep probabilities are set to 0.9. We also employ the label smoothing technique with a smoothing value of 0.1 during training.
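Label smoothing with a value of 0.1 can be illustrated with a short sketch. It follows the common formulation in which the smoothing mass is spread uniformly over all labels, including the correct one; the exact variant used in the experiments is not specified here, so this choice and the function names are assumptions.

```python
import math


def smoothed_targets(label, num_labels, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all labels."""
    uniform = epsilon / num_labels
    return [(1.0 - epsilon) + uniform if k == label else uniform
            for k in range(num_labels)]


def cross_entropy(target, log_probs):
    """Cross entropy between a target distribution and log-probabilities."""
    return -sum(q * lp for q, lp in zip(target, log_probs))


target = smoothed_targets(label=2, num_labels=4)
# A uniform prediction over 4 labels, as log-probabilities.
uniform_log_probs = [math.log(0.25)] * 4
loss = cross_entropy(target, uniform_log_probs)
```

With 4 labels the correct class receives 0.925 and every other class 0.025, so the model is discouraged from producing fully saturated predictions.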
Learning Parameter optimization is performed using stochastic gradient descent. We adopt Adadelta ( and ) as the optimizer. To avoid the exploding gradients problem, we clip the norm of gradients with a predefined threshold . Each SGD step uses a mini-batch of approximately 4096 tokens for the CoNLL-2005 dataset and 8192 tokens for the CoNLL-2012 dataset. The learning rate is initialized to 1.0. After training for 400K steps, we halve the learning rate every 100K steps. We train all models for 600K steps. For DeepAtt with FFN sub-layers, the whole training stage takes about two days to finish on a single Titan X GPU, which is 2.5 times faster than the previous approach.
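The learning-rate schedule and gradient clipping can be sketched as below. The text leaves it open whether the first halving happens exactly at step 400K or one interval later; this sketch assumes the former. The clipping threshold in the example is a placeholder, since its value is not given here, and the function names are hypothetical.

```python
import math


def learning_rate(step, base_lr=1.0, decay_start=400_000, decay_every=100_000):
    """Constant rate until decay_start, then halved every decay_every steps.

    Assumption: the first halving takes effect at step == decay_start.
    """
    if step < decay_start:
        return base_lr
    halvings = (step - decay_start) // decay_every + 1
    return base_lr * 0.5 ** halvings


def clip_by_norm(grads, threshold):
    """Rescale a flat list of gradients so its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]


rates = [learning_rate(s) for s in (0, 399_999, 400_000, 500_000, 599_999)]
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)  # placeholder threshold
```

Under this reading a 600K-step run finishes at a learning rate of 0.25.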
In Table ? and ?, we give the comparisons of DeepAtt with previous approaches. On the CoNLL-2005 dataset, the single model of DeepAtt with RNN, CNN and FFN nonlinear sub-layers achieves an F score of , and respectively. The FFN variant outperforms the previous best performance by 1.8 F score. Remarkably, we get 74.1 F score on the out-of-domain dataset, which outperforms the previous state-of-the-art system by F score. On the CoNLL-2012 dataset, the single model of the FFN variant also outperforms the previous state-of-the-art by 1.0 F score. When ensembling 5 models with FFN nonlinear sub-layers, our approach achieves an F score of 84.6 and 83.9 on the two datasets respectively, an absolute improvement of 1.4 and 0.5 over the previous state-of-the-art. These results are consistent with our intuition that the self-attention layers are helpful for capturing structural information and long-distance dependencies.
In this subsection, we discuss the main factors that influence our results. We analyze the experimental results on the development set of CoNLL-2005 dataset.
Model Depth Previous works show that model depth is the key to the success of end-to-end SRL approaches. Our observations coincide with previous works. Rows 1-5 of Table ? show the effects of different numbers of layers. For DeepAtt with 4 layers, our model only achieves 79.9 F score. Increasing depth consistently improves the performance on the development set, and our best model consists of 10 layers. For DeepAtt with 12 layers, we observe a slight performance drop of 0.1 F.
Model Width We also conduct experiments with different model widths. We increase the number of hidden units from to and to as listed in rows 1, 6 and 7 of Table ?, and the corresponding hidden size of FFN sub-layers is increased to 1600 and 2400 respectively. Increasing model widths improves the F slightly, and the model with 600 hidden units achieves an F of 83.4. However, the training and parsing speed are slower as a result of larger parameter counts.
Word Embedding Previous works found that the performance can be improved by pre-training the word embeddings on large unlabeled data. We use the GloVe embeddings pre-trained on Wikipedia and Gigaword. The embeddings are used to initialize our networks, but are not fixed during training. Rows 1 and 8 of Table ? show the effects of additional pre-trained embeddings. When using pre-trained GloVe embeddings, the F score increases from 79.6 to 83.1.
Position Encoding From rows 1, 9 and 10 of Table ? we can see that the position encoding plays an important role in the success of DeepAtt. Without position encoding, the DeepAtt with FFN sub-layers only achieves F score on the CoNLL-2005 development set. When using position embedding approach, the F score boosts to . The timing approach is surprisingly effective, which outperforms the position embedding approach by F score.
Nonlinear Sub-Layers DeepAtt requires nonlinear sub-layers to enhance its expressive power. Row 11 of Table ? shows the performance of DeepAtt without nonlinear sub-layers. We can see that the performance of 10 layered DeepAtt without nonlinear sub-layers only matches the 4 layered DeepAtt with FFN sub-layers, which indicates that the nonlinear sub-layers are the essential components of our attentional networks.
Constrained Decoding Table 1 shows the effects of constrained decoding on top of DeepAtt with FFN sub-layers. We observe a slight performance drop when using constrained decoding. Moreover, adding constrained decoding slows down decoding significantly. This suggests that DeepAtt on its own is powerful enough to capture the relationships among labels.
|He et al.||91.87||87.10|
Detailed Scores We list the detailed performance on frequent labels in Table ?. The results of the previous state-of-the-art are also shown for comparison. Compared with He et al., our model shows improvement on all labels except AM-PNC, where He’s model performs better. Table 2 shows the results of identifying and classifying semantic roles. Our model improves on the previous state-of-the-art both in identifying correct spans and in correctly classifying them into semantic roles. However, the majority of improvements come from classifying semantic roles. This indicates that finding the right constituents remains a bottleneck of our model.
Labeling Confusion Table ? shows a confusion matrix of our model for the most frequent labels. We only consider predicted arguments that match gold span boundaries. Compared with the previous work, our model still confuses ARG2 with AM-DIR, AM-LOC and AM-MNR, but to a lesser extent. This indicates that our model has some advantages on such difficult adjunct distinctions.
SRL Gildea and Jurafsky developed the first automatic semantic role labeling system based on FrameNet. Since then the task has received a tremendous amount of attention. The focus of traditional approaches is devising appropriate feature templates to describe the latent structure of utterances. Pradhan et al.; Surdeanu et al.; Palmer, Gildea, and Xue explored syntactic features for capturing the overall sentence structure. Combining different syntactic parsers to reduce prediction risk was also proposed by Surdeanu et al.; Koomen et al.; Pradhan et al.
Beyond these traditional methods, Collobert et al. proposed a convolutional neural network for SRL to reduce feature engineering. The pioneering work on building an end-to-end system was proposed by Zhou and Xu, who applied an 8-layer LSTM model which outperformed the previous state-of-the-art system. He et al. improved further with highway LSTMs and constrained decoding. They used simplified input and output layers compared with Zhou and Xu. Marcheggiani, Frolov, and Titov also proposed a bidirectional LSTM based model. Without using any syntactic information, their approach achieved the state-of-the-art result on the CoNLL-2009 dataset.
Our method differs from them significantly. We choose self-attention as the key component in our architecture instead of LSTMs. Like He et al., our system takes the original utterances and predicate masks as the inputs without context windows. At the inference stage, we apply argmax decoding on top of a simple logistic regression, while Zhou and Xu chose a CRF approach and He et al. chose constrained decoding. This approach is much simpler and faster than the previous approaches.
Self-Attention Self-attention has been successfully used in several tasks. Cheng, Dong, and Lapata used LSTMs and self-attention to facilitate the task of machine reading. Parikh et al. applied self-attention to the task of natural language inference. Lin et al. proposed self-attentive sentence embeddings and applied them to author profiling, sentiment analysis and textual entailment. Paulus, Xiong, and Socher combined reinforcement learning and self-attention to capture the long-distance dependencies inherent in abstractive summarization. Vaswani et al. applied self-attention to neural machine translation and achieved state-of-the-art results. Very recently, Shen et al. applied self-attention to language understanding tasks and achieved the state-of-the-art on various datasets. Our work follows this line in applying self-attention to learn long-distance dependencies. Our experiments also show the effectiveness of the self-attention mechanism on the sequence labeling task.
We proposed a deep attentional neural network for the task of semantic role labeling. We trained our SRL models with a depth of 10 and evaluated them on the CoNLL-2005 shared task dataset and the CoNLL-2012 shared task dataset. Our experimental results indicate that our models substantially improve SRL performance, leading to the new state-of-the-art.
This work was done during the first author’s internship at Tencent Technology. This work is supported by the Natural Science Foundation of China (Grant No. 61573294, 61303082, 61672440), the Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20130121110040), the Foundation of the State Language Commission of China (Grant No. WT135-10) and the Natural Science Foundation of Fujian Province (Grant No. 2016J05161). We also thank the anonymous reviewers for their valuable suggestions.
- Our source code is available at https://github.com/XMUNLP/Tagger
- In case of BIO violations, we simply treat the argument of the B tags as the argument of the whole span.
- To be strictly comparable to previous work, we use the same vocabularies and pre-trained embeddings as He et al.
Ba, J. L.; Kiros, J. R.; and Hinton, G. E. Layer normalization.
Bahdanau, D.; Cho, K.; and Bengio, Y. Neural machine translation by jointly learning to align and translate.
Bastianelli, E.; Castellucci, G.; Croce, D.; and Basili, R. Textual inference and meaning representation in human robot interaction.
Bengio, Y.; Ducharme, R.; Vincent, P.; and Janvin, C. A neural probabilistic language model.
Carreras, X., and Màrquez, L. Introduction to the CoNLL-2005 shared task: Semantic role labeling.
Cheng, J.; Dong, L.; and Lapata, M. Long short-term memory-networks for machine reading.
Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. Natural language processing (almost) from scratch.
Dan, S., and Lapata, M. Using semantic roles to improve question answering.
Dauphin, Y. N.; Fan, A.; Auli, M.; and Grangier, D. Language modeling with gated convolutional networks.
FitzGerald, N.; Täckström, O.; Ganchev, K.; and Das, D. Semantic role labeling with neural network factors.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. Convolutional sequence to sequence learning.
Genest, P.-E., and Lapalme, G. Framework for abstractive summarization using text-to-text generation.
Gildea, D., and Jurafsky, D. Automatic labeling of semantic roles.
He, K.; Zhang, X.; Ren, S.; and Sun, J. Deep residual learning for image recognition.
He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. Deep semantic role labeling: What works and what’s next.
Kingsbury, P.; Palmer, M.; and Marcus, M. Adding semantic annotation to the Penn Treebank.
Knight, K., and Luk, S. K. Building a large-scale knowledge base for machine translation.
Koomen, P.; Punyakanok, V.; Roth, D.; and Yih, W.-t. Generalized inference with multiple semantic role labeling systems.
Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. A structured self-attentive sentence embedding.
Luong, M.-T.; Pham, H.; and Manning, C. D. Effective approaches to attention-based neural machine translation.
Marcheggiani, D.; Frolov, A.; and Titov, I. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling.
Moschitti, A.; Morarescu, P.; and Harabagiu, S. M. Open domain information extraction via automatic semantic labeling.
Nair, V., and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines.
Palmer, M.; Gildea, D.; and Xue, N. Semantic Role Labeling.
Parikh, A. P.; Täckström, O.; Das, D.; and Uszkoreit, J. A decomposable attention model for natural language inference.
Pascanu, R.; Gulcehre, C.; Cho, K.; and Bengio, Y. How to construct deep recurrent neural networks.
Paulus, R.; Xiong, C.; and Socher, R. A deep reinforced model for abstractive summarization.
Pennington, J.; Socher, R.; and Manning, C. D. GloVe: Global vectors for word representation.
Pradhan, S.; Hacioglu, K.; Ward, W.; Martin, J. H.; and Jurafsky, D. Semantic role chunking combining complementary syntactic views.
Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H. T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; and Zhong, Z. Towards robust linguistic analysis using OntoNotes.
Punyakanok, V.; Roth, D.; and Yih, W.-t. The importance of syntactic parsing and inference in semantic role labeling.
Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Pan, S.; and Zhang, C. DiSAN: Directional self-attention network for RNN/CNN-free language understanding.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting.
Surdeanu, M.; Harabagiu, S.; Williams, J.; and Aarseth, P. Using predicate-argument structures for information extraction.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. Rethinking the inception architecture for computer vision.
Täckström, O.; Ganchev, K.; and Das, D. Efficient inference and structured learning for semantic role labeling.
Toutanova, K.; Haghighi, A.; and Manning, C. D. A global joint model for semantic role labeling.
Ueffing, N.; Haffari, G.; and Sarkar, A. Transductive learning for statistical machine translation.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. Attention is all you need.
Wu, D., and Fung, P. Semantic roles for SMT: a hybrid two-pass model.
Zeiler, M. D. Adadelta: an adaptive learning rate method.
Zhou, J., and Xu, W. End-to-end learning of semantic role labeling using recurrent neural networks.