Joint RNN Model for Argument Component Boundary Detection
Argument Component Boundary Detection (ACBD) is an important sub-task in argumentation mining; it aims at identifying the word sequences that constitute argument components, and is usually considered the first sub-task in the argumentation mining pipeline. Existing ACBD methods heavily depend on task-specific knowledge and require considerable human effort in feature engineering. To tackle these problems, in this work we formulate ACBD as a sequence labeling problem and propose a variety of Recurrent Neural Network (RNN) based methods, which use no domain-specific or hand-crafted features beyond the relative position of the sentence in the document. In particular, we propose a novel joint RNN model that predicts whether sentences are argumentative or not, and uses the predicted results to more precisely detect argument component boundaries. We evaluate our techniques on two corpora from two different genres; results suggest that our joint RNN model obtains state-of-the-art performance on both datasets.
Argumentation mining aims at automatically extracting arguments from natural language texts. An argument is a basic unit people use to persuade their audience to accept a particular state of affairs, and it usually consists of one or more argument components, for example a claim and some premises offered in support of the claim. As a concrete example, consider the essay excerpt below (obtained from the persuasive essay corpus):
: Furthermore, []. That is the reason why [[sectors such as medical care and education deserve more governmental support]], because [].
The above example includes three argument components ([[ ]] mark their boundaries): one claim (in bold face) and two premises (underlined). The premises support the claim. As argumentation mining reveals the discourse relations between clauses, it can potentially be used in applications like decision making, document summarisation and essay scoring, and has thus received growing research interest in recent years.
A typical argumentation mining pipeline consists of three consecutive subtasks: i) separating argument components from non-argumentative texts; ii) classifying the type (e.g. claim, premise or other) of argument components; and iii) predicting the relations (e.g. support or attack) between argument components. The first subtask is also known as argument component boundary detection (ACBD); it aims at finding the exact boundary of a consecutive token subsequence that constitutes an argument component, thus separating it from non-argumentative texts. In this work, we focus on the ACBD subtask, because ACBD's performance significantly influences the performance of downstream argumentation mining subtasks, yet there exists relatively little research on ACBD.
Most existing ACBD techniques require sophisticated hand-crafted features (e.g. syntactic, structural and lexical features) and domain-specific resources (e.g. indicator gazetteers), resulting in poor cross-domain applicability. To combat these problems, in this work we consider ACBD as a sequence labeling task at the token level and propose several novel neural network based ACBD methods, so that no domain-specific or hand-crafted features beyond the relative location of sentences are used. Although neural network based approaches have recently been used in related Natural Language Processing (NLP) tasks, such as linguistic sequence labelling and named entity recognition (NER), applying neural networks to ACBD is challenging because an argument component is much longer than a name/location in NER: it has been reported that an argument component includes 24.25 words on average, while a name/location usually consists of only 2 to 5 words. In fact, it has also been reported that separating argumentative and non-argumentative texts is often subtle even for human annotators.
In particular, our neural network models are designed to capture two intuitions. First, since an argument component often consists of a considerable number of words, it is essential to jointly consider multiple words' labels so as to detect argument components' boundaries; hence, we propose a bidirectional Recurrent Neural Network (RNN) with a Conditional Random Field (CRF) layer above it, as both RNN and CRF are widely recognised as effective methods for considering contextual information. Second, we believe that if the argumentative-or-not information of each sentence is available, it can guide boundary detection; hence, we propose a joint model that predicts each sentence's argumentative status and uses this prediction to detect boundaries more precisely.
The contributions of this work are threefold: i) we present the first deep-learning based ACBD technique, so that the feature-engineering demand is greatly reduced and the technique's cross-domain applicability is significantly improved; ii) we propose a novel joint RNN model that classifies the argumentative status of sentences and separates argument components from non-argumentative texts simultaneously, which significantly improves the performance of ACBD; and iii) we test our ACBD methods on two different text genres, and results suggest that our approach outperforms state-of-the-art techniques in both domains.
In this section, we first review ACBD techniques, and then review works that apply RNN to applications related to ACBD, e.g. sequence labeling and text classification.
Most existing ACBD methods consist of two consecutive subtasks: identifying argumentative sentences (i.e. sentences that include some argument components) and detecting the component boundaries. Levy et al. identify context-dependent claims in Wikipedia articles by using a cascade of classifiers. They first use logistic regression to identify sentences containing topic-related claims (the topic is provided a priori), and then detect the boundaries of claims and rank the candidate boundaries, so as to identify the claims most relevant to the topic. However, the importance of topic information is questionable, as Lippi and Torroni achieve a similar result on the first subtask without using the topic information. Goudas et al. propose an ACBD technique and test it on a corpus constructed from social media texts. They first use a variety of classifiers to perform the first subtask, and then employ a feature-rich CRF to perform the second subtask.
Besides the two-stage models presented above, some works consider ACBD as a sequence labeling task at the token level. Stab and Gurevych employ a CRF model with four kinds of hand-crafted features (structural, syntactic, lexical and probability features) to perform ACBD on their persuasive essay corpus. Unlike texts in Wikipedia, persuasive essays are structurally well-organised, and the percentage of argumentative sentences is much higher (77.1% of sentences in persuasive essays include argument components). The performance of this ACBD technique (in terms of macro F1) is .867.
2.2 RNN on Similar Tasks
RNN techniques, especially Long Short-Term Memory (LSTM) , have recently been successfully applied to sequence labeling and text classification tasks in various NLP problems. Graves et al.  propose a bidirectional RNN for speech recognition, which takes the context on both sides of each word into account. However, in sequential labeling tasks with strong dependencies between output labels, the performance of RNN is not ideal. To tackle this problem, instead of modeling tagging decisions independently, Huang et al. and Lample et al.  apply a sequential CRF to jointly decode labels for the whole sequence.
RNN has also been successfully used in text classification. Lai et al. propose a Recurrent Convolutional Neural Network, which adds a max-pooling layer after a bidirectional RNN. The purpose of the pooling layer is to capture the most important latent semantic factors in the document. The softmax function is then used to predict the class distribution. Results show the method's effectiveness on text classification tasks.
Some RNN-based techniques are developed for the spoken language understanding task, in which both text classification and sequence labeling are involved: intent detection is a classification problem, while slot filling is a sequence labeling problem. Liu and Lane  propose an attention-based bidirectional RNN model to perform these two tasks simultaneously; the method achieves state-of-the-art performance on both tasks.
We consider a sentence in the document as a sequence of tokens/words and label the argument boundaries using the IOB tagset: a word is labelled "B" if it is the first token of an argument component, "I" if it is inside, but does not begin, an argument component, and "O" if it is not included in any argument component. In this section, we first review some widely used techniques for sequence labeling; then, in Section 3.5, we present our joint RNN model, which can distinguish argumentative and non-argumentative sentences and use this information to detect boundaries.
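As an illustration, the IOB labeling scheme above can be sketched in a few lines of Python (the function and the span format are ours, not from the paper):

```python
def iob_labels(tokens, component_spans):
    """Label tokens with the IOB tagset: component_spans is a list of
    (start, end) token-index pairs, with end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in component_spans:
        labels[start] = "B"              # first token of a component
        for i in range(start + 1, end):
            labels[i] = "I"              # inside, but not beginning, a component
    return labels

# hypothetical sentence with one argument component covering tokens 3..8
tokens = "That is why medical care deserves more governmental support .".split()
print(iob_labels(tokens, [(3, 9)]))
# -> ['O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'O']
```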
RNN is a neural architecture designed for dealing with sequential data. It takes as input a sequence of vectors $(x_1, \ldots, x_n)$ and returns a feature vector $h_t$ at every time step $t$. In this work we use the Long Short-Term Memory (LSTM) variant, whose gates and states are computed as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $b_g$ is the bias vector for gate $g$ (where $g$ can be $i$, $f$ or $o$), $\sigma$ is the element-wise sigmoid function, and $\odot$ is the element-wise multiplication operator. $W$, $U$ and $b$ are the network parameters.
The LSTM presented above is known as a single-direction LSTM: it only considers the preceding states and ignores the states following the current one; thus, it fails to consider "future" information. Bidirectional LSTM (Bi-LSTM) is proposed to combat this problem. Bi-LSTM includes a forward LSTM and a backward LSTM, and can thus capture both past and future information. The final output of Bi-LSTM is the concatenation of the past and future context representations: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the outputs of the forward and backward LSTM, resp.
CRF is widely used in sequence labeling tasks. For a given sequence $X = (x_1, \ldots, x_n)$ and its labels $y = (y_1, \ldots, y_n)$, CRF gives a real-valued score as follows:

$$s(X, y) = \sum_{t=1}^{n} \psi_u(y_t) + \sum_{t=1}^{n-1} \psi_p(y_t, y_{t+1})$$
where $\psi_u(y_t)$ is the unary potential for the label at position $t$ and $\psi_p(y_t, y_{t+1})$ is the pairwise potential of the labels at $t$ and $t+1$. The probability of $y$ given $X$ can be obtained from the score:

$$p(y \mid X) = \frac{\exp(s(X, y))}{\sum_{\tilde{y}} \exp(s(X, \tilde{y}))}$$
Given a new input $X$, the goal of CRF is to find a label sequence $y^*$ for $X$ whose conditional probability is maximised:

$$y^* = \arg\max_{y} p(y \mid X)$$
The process of obtaining the optimal label sequence is termed decoding. For a linear-chain CRF as described above, which only models bigram interactions between outputs, both training and decoding can be solved efficiently by dynamic programming.
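For illustration, the decoding step can be sketched with the standard Viterbi algorithm (a generic implementation, not the authors' code); `unary` and `pairwise` correspond to the potentials above:

```python
import numpy as np

def viterbi(unary, pairwise):
    """unary: (n, k) per-position label scores; pairwise: (k, k) transition
    scores. Returns the highest-scoring tag sequence by dynamic programming."""
    n, k = unary.shape
    score = unary[0].copy()              # best score ending in each tag
    back = np.zeros((n, k), dtype=int)   # back-pointers
    for t in range(1, n):
        # score of extending every previous tag to every current tag
        total = score[:, None] + pairwise + unary[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because only adjacent labels interact, the running time is O(n k^2) rather than exponential in the sequence length.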
In sequence labeling tasks where there are strong dependencies between neighbouring labels, the performance of Bi-LSTM alone is not ideal. To tackle this problem, Huang et al. propose the Bi-LSTM-CRF method, which adds a CRF layer on top of the output of Bi-LSTM, so as to explicitly model the dependencies between the output labels. Fig. ? illustrates the structure of the Bi-LSTM-CRF network.
For a given input sentence $X = (x_1, \ldots, x_n)$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ is the output of Bi-LSTM, where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the outputs of the forward and backward LSTM, resp. The connection layer is used to connect the structure feature $s$ and the output of Bi-LSTM, yielding the feature representations $[h_t; s]$. Note that $s$ is the relative position of the input sentence in the document, and is not shown in Fig. ?. The output of the connection layer is a matrix of scores, denoted by $P$. $P$ is of size $n \times k$, where $k$ is the number of distinct tags, and $P_{t,j}$ corresponds to the score of the $j$-th tag of the $t$-th word in a sentence. The score of a sentence $X$ along with a path of tags $y = (y_1, \ldots, y_n)$ is then defined as follows:

$$s(X, y) = \sum_{t=0}^{n} A_{y_t, y_{t+1}} + \sum_{t=1}^{n} P_{t, y_t}$$
where $A$ is the transition matrix, which gives the transition scores between tags such that $A_{i,j}$ is the score of a transition from tag $i$ to tag $j$. We add two special tags $y_0$ and $y_{n+1}$ at the beginning and end of the sequence, so that $A$ is a square matrix of size $k+2$. The conditional probability for a label sequence $y$ given a sentence $X$ can thus be obtained as follows:

$$p(y \mid X) = \frac{\exp(s(X, y))}{\sum_{\tilde{y} \in Y_X} \exp(s(X, \tilde{y}))}$$
where $Y_X$ represents all possible tag sequences for an input sentence $X$. The network is trained by minimising the negative log-probability of the correct tag sequence $y$. Dynamic programming can be used to efficiently compute the normalisation term during training and the optimal tag sequence during inference.
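To make the score and probability above concrete, here is a brute-force sketch (our own illustration; the special start/end tags are omitted for brevity, and the exhaustive enumeration is only feasible for tiny inputs — in practice the partition function is computed by the forward algorithm):

```python
import itertools
import numpy as np

def sequence_score(P, A, y):
    """P: (n, k) emission score matrix; A: (k, k) transition matrix;
    y: a tag sequence of length n. Sums emission and transition scores."""
    s = P[0, y[0]]
    for t in range(1, len(y)):
        s += A[y[t - 1], y[t]] + P[t, y[t]]
    return s

def log_prob(P, A, y):
    """Log-probability of y: score minus the log partition function,
    computed here by enumerating every possible tag sequence."""
    n, k = P.shape
    all_scores = [sequence_score(P, A, z)
                  for z in itertools.product(range(k), repeat=n)]
    return sequence_score(P, A, y) - np.logaddexp.reduce(all_scores)
```

By construction, exponentiating `log_prob` over all tag sequences sums to one, matching the normalised probability above.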
3.4 Attention-based RNN for Classification
Besides sequence labeling, RNN is also widely used in text classification tasks. Lai et al. combine the word embeddings and the representation output by Bi-LSTM as the feature representation for text classification, weighting each input word equally. However, as the importance of words differs, such an equal-weighting strategy fails to highlight the truly important information. The attention mechanism is proposed to tackle this problem. As the name suggests, the attention mechanism computes a weight vector to measure the importance of each word and aggregates the informative words into a sentence vector. Specifically, it computes weights $\alpha_t$ over the words and forms the sentence vector

$$v = \sum_{t=1}^{n} \alpha_t x_t$$
The sentence vector $v$ is the weighted sum of the word embeddings $x_t$, weighted by $\alpha_t$. Vector $v$ gives additional supporting information, especially information that requires longer-term dependencies; such information can hardly be fully captured by the hidden states.
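A minimal sketch of such an attention layer follows; the score function $\tanh(W h_t + b) \cdot u$ is a standard choice we assume here and may differ from the paper's exact formulation, and all parameter names are illustrative:

```python
import numpy as np

def attention_sentence_vector(X, H, W, b, u):
    """X: (n, d) word embeddings; H: (n, 2h) Bi-LSTM hidden states.
    W: (2h, a), b: (a,), u: (a,) are the (assumed) attention parameters.
    Computes one weight per word and returns the weighted sum of the
    word embeddings as the sentence vector v."""
    scores = np.tanh(H @ W + b) @ u    # one scalar score per word
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()               # softmax: the weights sum to 1
    return alpha @ X                   # v = sum_t alpha_t * x_t
```

When the scores are identical (e.g. with `u` set to zeros), the weights are uniform and `v` degenerates to the mean word embedding; training the attention parameters moves the weights toward the informative words.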
The architecture of the RNN for classification is illustrated in Fig. ?. Recall that Bi-LSTM can capture both past and future context information, converting the tokens of each sentence into feature representations $h_1, \ldots, h_n$. A max-pooling operation extracts the maximum values over the time-step dimension of these representations to obtain a sentence-level representation $m$. The sentence's argumentative status is then predicted from the concatenation of the context feature $m$, the weighted sentence vector $v$ and the structure feature $s$ (the relative location of the sentence):

$$p = \mathrm{softmax}(W [m; v; s] + b)$$
where $p$ is the output of the softmax function, which represents the probability distribution over the sentence's argumentative status.
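The classification step can be sketched as follows (shapes and parameter names are our assumptions, not the paper's):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_status(H, v, s, W, b):
    """H: (n, 2h) Bi-LSTM outputs; v: attention sentence vector;
    s: relative location of the sentence (a scalar); W, b: softmax-layer
    parameters. Max-pools H over time steps, concatenates [m; v; s],
    and returns the argumentative-or-not probability distribution."""
    m = H.max(axis=0)                        # context feature m
    features = np.concatenate([m, v, [s]])   # [m; v; s]
    return softmax(W @ features + b)
```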
3.5 Joint RNN Model
The joint model for argumentative sentence classification and sequence labeling in boundary detection is shown in Fig. ?. In the proposed model, a Bi-LSTM reads the source sentence in both forward and backward directions and creates the hidden states $h_1, \ldots, h_n$. For argumentative sentence classification, as mentioned in Section 3.4, an attention mechanism aggregates the input words into a sentence vector $v$, and a max-pooling operation is applied to capture the key components of the latent information. The sentence's argumentative status is then predicted from the combination of the vector $v$, the vector $m$ output by the max-pooling operation, and the relative location feature $s$.
For sequence labeling in boundary detection, we reuse the pre-computed hidden states of the Bi-LSTM. At each time step, we combine each hidden state with the relative location feature $s$ and the sentence's predicted argumentative status $p$ created by the above-mentioned classification operation: $r_t = [h_t; s; p]$. These representations yield the score matrix described in Section 3.3, which is passed to the CRF layer.
The sequence labeling operation is the same as in Section 3.3. The joint model is trained to find the parameters that jointly minimise the cross-entropy between the predicted and true argumentative status of the sentence and the negative log-probability of the sentence's labels.
We first present the argumentation corpora on which we test our techniques in Section 4.1, introduce our experimental settings in Section 4.2, and present and analyse the empirical results in Section 4.3.
We evaluate the neural network based ACBD techniques on two different corpora: the persuasive essay corpus and the Wikipedia corpus. The persuasive essay corpus has three types of argument components: major claims, claims and premises. The corpus contains 402 English essays on a variety of topics, consisting of 7116 sentences and 147271 tokens (words). The Wikipedia corpus contains 315 Wikipedia articles grouped into 33 topics, and 1392 context-dependent claims have been annotated in total. A context-dependent claim is "a general, concise statement that directly supports or contests the given topic"; thus claims that do not support/attack the given topic are not annotated. Note that the Wikipedia corpus is very imbalanced: only 2% of sentences are argumentative (i.e. contain some argument components).
On the persuasive essay corpus, in line with previous work, we use precision (P), recall (R) and macro-F1 as evaluation metrics, and use the same train/test split: 322 essays are used for training, and the remaining 80 essays are used for testing. On the Wikipedia corpus, in line with previous work, a predicted claim is considered a True Positive if and only if it precisely matches a labeled claim. For all the articles across 33 topics, we randomly select 1/33 of all the sentences to serve as the test set, and the remaining sentences are used for training. As the corpus is very imbalanced, we apply random under-sampling on the training set to ensure that the ratio between non-argumentative and argumentative sentences is 4:1.
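The under-sampling step can be sketched as follows (a generic implementation, not the authors' code):

```python
import random

def undersample(sentences, labels, ratio=4, seed=0):
    """Keep all argumentative sentences (label 1) and randomly keep at
    most `ratio` times as many non-argumentative ones (label 0)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    keep = sorted(pos + neg)
    return [sentences[i] for i in keep], [labels[i] for i in keep]
```

Note that under-sampling is applied only to the training set; the test set keeps its natural class distribution.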
In experiments on both corpora, we randomly select 10% of the training data to serve as the validation set. Training runs for 200 epochs. Only the models that perform best (in terms of F1) on the validation set are tested on the test set. The RNN-based methods read texts sentence by sentence, and each sentence is represented by concatenating all its component words' embeddings. All RNNs are trained using the Adam update rule with initial learning rate 0.001. We let the batch size be 50 and the attention hidden size be 150. To mitigate overfitting, we apply both dropout and L2 weight regularisation. We let the dropout rate be 0.5 throughout our experiments for all dropout layers on both the input and output vectors of Bi-LSTM. The regularisation weight of the parameters is 0.001. Some hyper-parameter settings of the RNNs depend on the dataset being used. In experiments on persuasive essays, we use Google's 300-dimensional word2vec embeddings as the pre-trained word embeddings, set the hidden size to 150 and use only one hidden layer in the LSTM; on Wikipedia articles, we use GloVe's 300-dimensional embeddings, and let the hidden size of the LSTM be 80.
4.3 Results and Discussion
| CRF (Stab et al.) | 0.867 | 0.873 | 0.861 | 0.809 | 0.934 | 0.857 |
| Human Upper Bound | 0.886 | 0.887 | 0.885 | 0.821 | 0.941 | 0.892 |
| Method | P | R | F1 |
| Levy et al. | 0.09 | 0.73 | 0.16 |
| TK + Topic | 0.105 | 0.629 | 0.180 |
| Joint RNN Model | 0.156 | 0.630 | 0.250 |
| Levy et al. | 0.120 | 0.160 | 0.200 |
| Joint RNN Model | 0.190 | 0.122 | 0.435 |
The performance of different methods on the persuasive essays is presented in Table ?. Note that the performance of CRF is obtained from the original paper. Bi-LSTM achieves .825 macro F1 thanks to the context information captured by the LSTM layer. Adding a CRF layer to Bi-LSTM significantly improves the performance and achieves results comparable with the CRF method that uses a number of hand-crafted features. The third row in Table ? gives the performance of Bi-LSTM-CRF with ground-truth argumentative-or-not information for each sentence, i.e. the feature in Figure 3 is replaced by ground-truth labels; surprisingly, this method even outperforms the "human upper bound" performance.
The performances on the Wikipedia articles are presented in Table ? and Table ?. The upper part of these two tables gives the performances of some existing ACBD methods, and we can see that the performance metrics used for existing methods and for our RNN-based methods differ: our RNN-based methods output a unique boundary and component type for the input sentence, so the performance metric is P/R/F1; however, existing ACBD methods produce a ranked list of candidate argument component boundaries, so their performance metrics are, e.g., precision@200, i.e. the probability that the true boundary is included in the top 200 predicted boundaries (recall@200 and F1@200 are defined similarly). Also note that the results reported for the existing methods are obtained from a slightly older version of the dataset, containing only 32 topics (instead of 33) and 976 claims (instead of 1332).
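For clarity, the @k-style metrics can be sketched as follows (our reading of the metrics; the exact definitions in the cited work may differ):

```python
def precision_at_k(ranked_predictions, gold_boundaries, k=200):
    """Fraction of the top-k ranked predicted boundaries that exactly
    match some gold boundary."""
    top = ranked_predictions[:k]
    return sum(1 for p in top if p in gold_boundaries) / len(top)

def recall_at_k(ranked_predictions, gold_boundaries, k=200):
    """Fraction of gold boundaries that appear among the top-k predictions."""
    top = set(ranked_predictions[:k])
    return sum(1 for g in gold_boundaries if g in top) / len(gold_boundaries)
```

Because a predicted boundary counts only on an exact match, long argument components make these metrics very strict, which partly explains the low absolute scores on the Wikipedia corpus.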
From Table ?, we find that for argumentative sentence classification, our joint model significantly outperforms all the other techniques. From Table ?, we find that the joint RNN model prevails over the other Bi-LSTM based models, again confirming that the argumentative-or-not information can further improve boundary detection performance. Note that performances on the Wikipedia corpus are not high in general. One reason is that argument components are long and the performance metrics we use are strict. In addition, only topic-dependent claims are annotated in the Wikipedia corpus; our RNN-based approaches do not consider topic information and thus identify some topic-irrelevant claims, which are treated as False Positives. Similar observations have been made in previous work.
In this work, we present the first deep-learning based family of algorithms for the argument component boundary detection (ACBD) task. In particular, we propose a novel joint model that combines an attention-based classification RNN, which predicts the argumentative-or-not information, with a Bi-LSTM-CRF network, which identifies the exact boundary. We empirically compare the joint model with Bi-LSTM, Bi-LSTM-CRF and some state-of-the-art ACBD methods on two benchmark corpora; results suggest that our joint model outperforms all the other methods, indicating that it can effectively use the argumentative-or-not information to improve boundary detection performance. As for future work, a natural next step is to apply deep learning techniques to other sub-tasks of argumentation mining; in addition, a deep-learning-based end-to-end argumentation mining tool is also worthy of further investigation.
- In this work, we assume that an argument component cannot span multiple sentences. This assumption is valid in most existing argumentation corpora.
- In this work, we let $x_{1:n}$ be the shorthand notation for the vector $(x_1, \ldots, x_n)$, where $n$ is the length of the vector.
- The human upper-bound performance is obtained by averaging the evaluation scores of all three annotator pairs on the test data. Note that sentences' argumentative-or-not information is not used in obtaining the human upper-bound performance.
- Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund, and Noam Slonim. A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68, 2014.
- Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- Judith Eckle-Kohler, Roland Kluge, and Iryna Gurevych. On the role of discourse markers for discriminating claims and premises in argumentative discourse. In EMNLP, pages 2236–2242, 2015.
- Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.
- Theodosis Goudas, Christos Louizos, Georgios Petasis, and Vangelis Karkaletsis. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence, pages 287–299. Springer, 2014.
- Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013.
- Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, volume 1, pages 282–289, 2001.
- Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent convolutional neural networks for text classification. In AAAI, pages 2267–2273, 2015.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.
- Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. Context dependent claim detection. 2014.
- Marco Lippi and Paolo Torroni. Context-independent claim detection for argument mining. In IJCAI, volume 15, pages 185–191, 2015.
- Marco Lippi and Paolo Torroni. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT), 16(2):10, 2016.
- Bing Liu and Ian Lane. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454, 2016.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
- Marie-Francine Moens. Argumentation mining: Where are we now, where do we want to be and how do we get there? In Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation, page 2. ACM, 2013.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
- Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
- Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Christian Stab and Iryna Gurevych. Annotating argument components and relations in persuasive essays. In COLING, pages 1501–1510, 2014.
- Christian Stab and Iryna Gurevych. Identifying argumentative discourse structures in persuasive essays. In EMNLP, pages 46–56, 2014.
- Christian Stab and Iryna Gurevych. Parsing argumentation structures in persuasive essays. arXiv preprint arXiv:1604.07370, 2016.