Neural Discourse Structure for Text Categorization

Yangfeng Ji and Noah A. Smith
Paul G. Allen School of Computer Science & Engineering
University of Washington
Seattle, WA 98195, USA

We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization. Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task. Experiments consider variants of the approach and illustrate its strengths and weaknesses.

1 Introduction

Advances in text categorization have the potential to improve systems for analyzing sentiment, inferring authorship or author attributes, making predictions, and many more. Several past researchers have noticed that methods that reason about the relative salience or importance of passages within a text can lead to improvements (Ko et al., 2004). Latent variables (Yessenalina et al., 2010), structured-sparse regularizers (Yogatama and Smith, 2014), and neural attention models (Yang et al., 2016) have all been explored.

Figure 1: A manually constructed example of the RST (Mann and Thompson, 1988) discourse structure on a text.

Discourse structure, which represents the organization of a text as a tree (for an example, see Figure 1), might provide cues for the importance of different parts of a text. Some promising results on sentiment classification tasks support this idea: Bhatia et al. (2015) and Hogenboom et al. (2015) applied hand-crafted weighting schemes to the sentences in a document, based on its discourse structure, and showed benefit to sentiment polarity classification.

In this paper, we investigate the value of discourse structure for text categorization more broadly, considering five tasks, through the use of a recursive neural network built on an automatically-derived document parse from a top-performing, open-source discourse parser, DPLP (Ji and Eisenstein, 2014). Our models learn to weight the importance of a document’s sentences, based on their positions and relations in the discourse tree. We introduce a new, unnormalized attention mechanism to this end.

Experimental results show that variants of our model outperform prior work on four out of five tasks considered. Our method unsurprisingly underperforms on the fifth task, making predictions about legislative bills—a genre in which discourse conventions are quite different from those in the discourse parser’s training data. Further experiments show the effect of discourse parse quality on text categorization performance, suggesting that future improvements to discourse parsing will pay off for text categorization, and validate our new attention mechanism.

Our implementation is available at

2 Background: Rhetorical Structure Theory

Rhetorical Structure Theory (RST; Mann and Thompson, 1988) is a theory of discourse that has enjoyed popularity in NLP. RST posits that a document can be represented by a tree whose leaves are elementary discourse units (EDUs, typically clauses or sentences). Internal nodes in the tree correspond to spans of sentences that are connected via discourse relations such as Contrast and Elaboration. In most cases, a discourse relation links adjacent spans denoted “nucleus” and “satellite,” with the former more essential to the writer’s purpose than the latter. (There are also a few exceptions in which a relation can be realized with multiple nuclei.)

An example of a manually constructed RST parse for a restaurant review is shown in Figure 1. The review is segmented into six indexed EDUs; the discourse tree organizes them hierarchically into increasingly larger spans, with the last Contrast relation resulting in a span that covers the whole review. Within each relation, the RST tree marks the nucleus with an arrow pointing to it from its satellite, as in the Elaboration relation of the example.

The information embedded in RST trees has motivated many applications in NLP research, including document summarization (Marcu, 1999), argumentation mining (Azar, 1999), and sentiment analysis (Bhatia et al., 2015). In most applications, RST trees are built by automatic discourse parsing, because manual annotation is expensive. In this work, we use a state-of-the-art open-source RST-style discourse parser, DPLP (Ji and Eisenstein, 2014).

We follow recent work that suggests transforming the RST tree into a dependency structure (Yoshida et al., 2014). (The transformation is trivial and deterministic given the nucleus-satellite mapping for each relation. The procedure is analogous to the transformation of a headed phrase-structure parse in syntax into a dependency tree; e.g., Yamada and Matsumoto, 2003.) Figure 2(a) shows the corresponding dependency structure of the RST tree in Figure 1. The tree has a single root EDU, and in fact this clause summarizes the review and suffices to categorize it as negative. This dependency representation of the RST tree offers a form of inductive bias for our neural model, helping it to discern the most salient parts of a text in order to assign it a label.
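The conversion can be illustrated with a minimal head-percolation sketch. It assumes a hypothetical `RSTNode` container, takes the leftmost nucleus as the head of each span, and attaches the heads of the remaining children to it; this is one common scheme and not necessarily the exact procedure of Yoshida et al. (2014). The toy tree and its EDU ids are made up and do not reproduce Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Arc = Tuple[str, str, str]  # (head EDU, dependent EDU, relation label)


@dataclass
class RSTNode:
    """Hypothetical container for one node of an RST constituency tree."""
    edu: Optional[str] = None                       # EDU id; set only on leaves
    relation: Optional[str] = None                  # relation joining the children
    children: List["RSTNode"] = field(default_factory=list)
    nuclearity: List[str] = field(default_factory=list)  # "N" or "S" per child


def to_dependencies(node: RSTNode, arcs: List[Arc]) -> str:
    """Return the head EDU of `node`; append one arc per non-head child."""
    if node.edu is not None:                        # a leaf is its own head
        return node.edu
    heads = [to_dependencies(child, arcs) for child in node.children]
    head_pos = node.nuclearity.index("N")           # leftmost nucleus (assumption)
    for pos, dependent in enumerate(heads):
        if pos != head_pos:
            arcs.append((heads[head_pos], dependent, node.relation))
    return heads[head_pos]


# Toy tree with made-up EDU ids, not the tree from Figure 1.
tree = RSTNode(
    relation="Contrast",
    nuclearity=["S", "N"],
    children=[
        RSTNode(relation="Elaboration", nuclearity=["N", "S"],
                children=[RSTNode(edu="e1"), RSTNode(edu="e2")]),
        RSTNode(relation="Explanation", nuclearity=["N", "S"],
                children=[RSTNode(edu="e3"), RSTNode(edu="e4")]),
    ],
)
arcs: List[Arc] = []
root = to_dependencies(tree, arcs)
```

In this toy case the nucleus of the top-level relation supplies the root of the dependency tree, and every satellite head becomes a labeled dependent of the nucleus head it modifies.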

3 Model

Our model is a recursive neural network built on a discourse dependency tree. It includes a distributed representation computed for each EDU, and a composition function that combines EDUs and partial trees into larger trees. At the top of the tree, the representation of the complete document is used to make a categorization decision. Our approach is analogous to (and inspired by) the use of recursive neural networks on syntactic dependency trees, with word embeddings at the leaves (Socher et al., 2014).

(a) dependency structure
(b) recursive neural network structure
Figure 2: The dependency discourse tree derived from the example RST tree in Figure 1 (a) and the corresponding recursive neural network model on the tree (b).

3.1 Representation of Sentences

Let $\mathbf{e}$ be the distributed representation of an EDU. We use a bidirectional LSTM on the words’ embeddings within each EDU (details of word embeddings are given in section 4), concatenating the last hidden state vector from the forward LSTM ($\overrightarrow{\mathbf{e}}$) with that of the backward LSTM ($\overleftarrow{\mathbf{e}}$) to get $\mathbf{e}$.

There is extensive recent work on architectures for embedding representations of sentences and other short pieces of text, including, for example, (bi)recursive neural networks (Paulus et al., 2014) and convolutional neural networks (Kalchbrenner et al., 2014). Future work might consider alternatives; we chose the bidirectional LSTM due to its effectiveness in many settings.

3.2 Full Recursive Model

Given the discourse dependency tree for an input text, our recursive model builds a vector representation through composition at each arc in the tree. Let $\mathbf{v}_i$ denote the vector representation of EDU $i$ and its descendants, and let $\mathbf{e}_i$ be the representation of EDU $i$ alone. For the base case where EDU $i$ is a leaf in the tree, we let $\mathbf{v}_i = \tanh(\mathbf{e}_i)$, where $\tanh$ is the elementwise hyperbolic tangent function.

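For an internal node $i$, the composition function considers a parent and all of its children, whose indices are denoted by $\mathrm{children}(i)$. In defining this composition function, we seek for (i) the contribution of the parent node $\mathbf{e}_i$ to be central; and (ii) the contribution of each child node $\mathbf{v}_j$ to be determined by its content as well as the discourse relation it holds with the parent. We therefore define

$$\mathbf{v}_i = \tanh\Big(\mathbf{e}_i + \sum_{j \in \mathrm{children}(i)} \alpha_{i,j}\, \mathbf{W}_{r_{i,j}} \mathbf{v}_j\Big), \tag{1}$$

where $\mathbf{W}_{r_{i,j}}$ is a relation-specific composition matrix indexed by the relation between $i$ and $j$, $r_{i,j}$.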

$\alpha_{i,j}$ is an “attention” weight, computed from the parent's EDU representation $\mathbf{e}_i$ and the child's subtree representation $\mathbf{v}_j$ and defined as

$$\alpha_{i,j} = \sigma\big(\mathbf{e}_i^{\top} \mathbf{W}_{\alpha} \mathbf{v}_j\big), \tag{2}$$

where $\sigma$ is the elementwise sigmoid and $\mathbf{W}_{\alpha}$ contains attention parameters (these are relation-independent). Our attention mechanism differs from prior work (Bahdanau et al., 2015), in which attention weights are normalized to sum to one across competing candidates for attention. Here, $\alpha_{i,j}$ does not depend on node $i$’s other children. This is motivated by RST, in which the presence of a node does not signify lesser importance to its siblings. Consider, for example, the EDU and the text span in Figure 1 that in parallel provide Explanation for the same EDU. This scenario differs from machine translation, where attention is used to implicitly and softly align output-language words to relatively few input-language words. It also differs from attention in composition functions used in syntactic parsing (Kuncoro et al., 2017), where attention can mimic head rules that follow from an endocentricity hypothesis of syntactic phrase representation.
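A minimal pure-Python sketch of unnormalized attention follows. It assumes the score is a bilinear form between the parent's EDU vector and each child's vector, passed through a sigmoid; the function names, the identity attention matrix, and the toy vectors are all hypothetical. The point it illustrates is that adding a sibling leaves the weights of the other children unchanged.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bilinear(e_parent: Vector, W_alpha: Matrix, v_child: Vector) -> float:
    """Scalar score e_parent^T W_alpha v_child."""
    return sum(
        e_parent[a] * W_alpha[a][b] * v_child[b]
        for a in range(len(e_parent))
        for b in range(len(v_child))
    )


def unnormalized_attention(e_parent: Vector, W_alpha: Matrix,
                           child_vectors: List[Vector]) -> List[float]:
    """One weight in (0, 1) per child; each is computed independently."""
    return [sigmoid(bilinear(e_parent, W_alpha, v)) for v in child_vectors]


# Toy values (hypothetical): 2-dimensional vectors, identity attention matrix.
W_alpha = [[1.0, 0.0], [0.0, 1.0]]
e_parent = [1.0, 0.0]
one_child = unnormalized_attention(e_parent, W_alpha, [[0.0, 0.0]])
two_children = unnormalized_attention(e_parent, W_alpha, [[0.0, 0.0], [3.0, 1.0]])
```

Because each weight is squashed separately, the weights of a node's children need not sum to one, and several children can receive high weight at once.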

Our recursive composition function, through the attention mechanism and the relation-specific weight matrices, is designed to learn how to weight EDUs differently for the categorization task. The idea of combining a weighting scheme with discourse structure was explored in prior work (Bhatia et al., 2015; Hogenboom et al., 2015), although those schemes are manually designed rather than learned from training data.

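Once we have $\mathbf{v}_{\mathrm{root}}$ of a text, the prediction of its category is given by $\mathrm{softmax}(\mathbf{W}_o \mathbf{v}_{\mathrm{root}} + \mathbf{b})$.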

We refer to this model as the Full model, since it makes use of the entire discourse dependency tree.
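The bottom-up recursion can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the released implementation: each node's vector is taken to be the tanh of its own EDU vector plus the attention-weighted, relation-transformed vectors of its children, the attention score is assumed to be a sigmoid of a bilinear form, and the dictionary-based tree encoding, parameter values, and names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def compose(node: int,
            edu_vecs: Dict[int, Vector],
            children: Dict[int, List[Tuple[int, str]]],
            W_rel: Dict[str, Matrix],
            W_alpha: Matrix) -> Vector:
    """Representation of `node` and its descendants (bottom-up recursion)."""
    e_i = edu_vecs[node]
    total = list(e_i)                                    # the parent enters unweighted
    for child, relation in children.get(node, []):
        v_j = compose(child, edu_vecs, children, W_rel, W_alpha)
        alpha = sigmoid(dot(e_i, matvec(W_alpha, v_j)))  # unnormalized weight
        transformed = matvec(W_rel[relation], v_j)       # relation-specific map
        total = [t + alpha * c for t, c in zip(total, transformed)]
    return [math.tanh(t) for t in total]                 # leaves reduce to tanh(e_i)


# Toy tree (hypothetical values): EDU 0 is the root with one Elaboration child.
edu_vecs = {0: [1.0, 0.0], 1: [1.0, 1.0]}
children = {0: [(1, "Elaboration")]}
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
W_rel = {"Elaboration": identity}
doc_vec = compose(0, edu_vecs, children, W_rel, W_alpha=zeros)
```

With an all-zero attention matrix every child receives weight 0.5, which makes the toy output easy to check by hand. Fixing every relation matrix to the identity would give a composition that uses attention weights alone.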

3.3 Unlabeled Model

The Full model based on Equation 1 uses a dependency discourse tree with relations. Because alternate discourse relation labels have been proposed (e.g., Prasad et al., 2008), we seek to measure the effect of these labels. We therefore consider an Unlabeled model based only on the tree structure, without the relations:

$$\mathbf{v}_i = \tanh\Big(\mathbf{e}_i + \sum_{j \in \mathrm{children}(i)} \alpha_{i,j}\, \mathbf{v}_j\Big). \tag{3}$$

Here, only attention weights are used to compose the children nodes’ representations, significantly reducing the number of model parameters.

This Unlabeled model is similar to the depth weighting scheme introduced by Bhatia et al. (2015), which also uses an unlabeled discourse dependency tree, but our attention weights are computed by a function whose parameters are learned. This approach sits squarely between Bhatia et al. (2015) and the flat document structure used by Yang et al. (2016); the Unlabeled model still uses discourse to bias the model toward some content (that which is closer to the tree’s root).

3.4 Simpler Variants

We consider two additional baselines that are even simpler. The first, Root, uses the discourse dependency structure only to select the root EDU, which is used to represent the entire text; no composition function is needed. This model variant is motivated by work on document summarization (Yoshida et al., 2014), where the most central EDU is used to represent the whole text.

The second variant, Additive, uses all the EDUs with a simple composition function, and does not depend on discourse structure at all: $\mathbf{v}_{\mathrm{root}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{e}_i$, where $N$ is the total number of EDUs. This serves as a baseline to test the benefits of discourse, controlling for other design decisions and implementation choices. Although sentence representations are built in a different way from the work of Yang et al. (2016), this model is quite similar to their HN-AVE model in how it builds document representations.
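The two simpler variants can be sketched as follows, assuming the EDU vectors are already available in a dictionary keyed by EDU index and that Additive averages them, as its similarity to HN-AVE suggests. The function names and toy vectors are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def root_representation(edu_vecs: Dict[int, Vector], root: int) -> Vector:
    """Root variant: the document is represented by the root EDU alone."""
    return list(edu_vecs[root])


def additive_representation(edu_vecs: Dict[int, Vector]) -> Vector:
    """Additive variant: average the EDU vectors, ignoring the tree."""
    vectors = list(edu_vecs.values())
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy EDU vectors (hypothetical values).
edu_vecs = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 4.0]}
root_vec = root_representation(edu_vecs, root=0)
avg_vec = additive_representation(edu_vecs)
```

Root needs the parse only to pick an index, while Additive never consults the parse, which is what makes it a discourse-free control.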

4 Implementation Details

The parameters of all components of our model (top-level classification, composition, and EDU representation) are learned end-to-end using standard methods. We implement our learning procedure with the DyNet package (Neubig et al., 2017).


Preprocessing.

For all datasets, we used the same preprocessing steps, mostly following recent work on language modeling (e.g., Mikolov et al., 2010). We lowercased all the tokens and removed tokens that contain only punctuation symbols. We replaced numbers in the documents with a special number token. Low-frequency word types were replaced by unk; we reduced the vocabulary for each dataset until approximately 5% of tokens were mapped to unk. The vocabulary sizes after preprocessing are shown in Table 1.
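The steps above can be sketched as a small pipeline. The details are assumptions made for illustration: the spelling of the number token (`<num>`), the regular expression that decides what counts as a number, the use of ASCII punctuation only, and the greedy rule that keeps the most frequent types until about 95% of tokens are covered.

```python
import re
import string
from collections import Counter
from typing import Iterable, List, Set

NUM = "<num>"   # hypothetical spelling of the special number token
UNK = "unk"
NUMBER_RE = re.compile(r"[+-]?\d[\d,]*(\.\d+)?")


def normalize(tokens: Iterable[str]) -> List[str]:
    """Lowercase, drop punctuation-only tokens, and map numbers to NUM."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if all(ch in string.punctuation for ch in tok):
            continue                      # punctuation-only (or empty) token
        out.append(NUM if NUMBER_RE.fullmatch(tok) else tok)
    return out


def build_vocab(docs: Iterable[List[str]], unk_rate: float = 0.05) -> Set[str]:
    """Keep the most frequent types until about (1 - unk_rate) of tokens are covered."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab: Set[str] = set()
    covered = 0
    for word, count in counts.most_common():
        if covered >= (1.0 - unk_rate) * total:
            break
        vocab.add(word)
        covered += count
    return vocab


def apply_vocab(doc: List[str], vocab: Set[str]) -> List[str]:
    return [tok if tok in vocab else UNK for tok in doc]


cleaned = normalize(["The", "Food", ",", "cost", "12.50", "!!!"])
corpus = [["a"] * 90 + ["b"] * 8 + ["c"] * 2]
vocab = build_vocab(corpus)
mapped = apply_vocab(["a", "c"], vocab)
```

On the toy corpus the rarest type falls outside the vocabulary, so 2% of its tokens would map to unk; on real data the cutoff lands near the target rate rather than exactly on it.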

Discourse parsing.

Our model requires the discourse structure for each document. We used DPLP, the RST parser from Ji and Eisenstein (2014), which is one of the best discourse parsers on the RST discourse treebank benchmark (Carlson et al., 2001). It employs a greedy decoding algorithm for parsing, producing 2,000 parses per minute on average on a single CPU. DPLP provides discourse segmentation, breaking a text into EDUs, typically clauses or sentences, based on syntactic parses provided by Stanford CoreNLP. RST trees are converted to dependencies following the method of Yoshida et al. (2014). DPLP as distributed is trained on 347 Wall Street Journal articles from the Penn Treebank (Marcus et al., 1993).

Word embeddings.

In cases where there are 10,000 or fewer training examples, we used pretrained GloVe word embeddings (Pennington et al., 2014), following previous work on neural discourse processing (Ji and Eisenstein, 2015). For larger datasets, we randomly initialized word embeddings and trained them alongside other model parameters.

Learning and hyperparameters.

Online learning was performed with the optimization method and initial learning rate treated as hyperparameters. To avoid the exploding gradient problem, we clipped gradient norms at a fixed threshold. In addition, a dropout rate of 0.3 was used on both input and hidden layers to avoid overfitting. We performed grid search over the word vector representation dimensionality and the LSTM hidden state dimensionality (both drawn from the same set of candidate values), the initial learning rate, and the update method (SGD and Adam; Kingma and Ba, 2015). For each corpus, the highest-accuracy combination of these hyperparameters is selected using development data or ten-fold cross validation, as specified in section 5.
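Norm clipping can be sketched in pure Python as below. This shows the general technique (rescaling all gradients when their global L2 norm exceeds a threshold), not the toolkit's own implementation; the threshold value and the nested-list gradient layout are hypothetical.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], threshold: float) -> List[List[float]]:
    """Rescale all gradients so that their joint L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= threshold:
        return [list(vec) for vec in grads]          # already small enough
    scale = threshold / norm
    return [[g * scale for g in vec] for vec in grads]


grads = [[3.0, 0.0], [0.0, 4.0]]                     # global L2 norm is 5.0
clipped = clip_by_global_norm(grads, threshold=1.0)  # threshold is hypothetical
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
untouched = clip_by_global_norm(grads, threshold=10.0)
```

Rescaling preserves the direction of the update and only shortens it, which is why the trick limits exploding gradients without otherwise changing the optimizer.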

5 Datasets

                                          Number of docs.
Dataset   Task        Classes   Total   Training   Development   Test   Vocab. size
Yelp      Sentiment   5         700K    650K       -             50K    10K
MFC       Frames      15        4.2K    -          -             -      7.5K
Debates   Vote        2         1.6K    1,135      105           403    5K
Movies    Sentiment   2         2.0K    -          -             -      5K
Bills     Survival    2         52K     46K        -             6K     10K
Table 1: Information about the five datasets used in our experiments. To compare with prior work, we use different experimental settings. For the Yelp and Bills corpora, we use 10% of the training examples as development data. For the MFC and Movies corpora, we use 10-fold cross validation and report averages across all folds.

We selected five datasets of different sizes and corresponding to varying categorization tasks. Some information about these datasets is summarized in Table 1.

Sentiment analysis on Yelp reviews.

Originally from the Yelp Dataset Challenge in 2015, this dataset contains 1.5 million examples. We used the preprocessed dataset from Zhang et al. (2015), which has 650,000 training and 50,000 test examples. The task is to predict an ordinal rating (1–5) from the text of the review. To select the best combination of hyperparameters, we randomly sampled 10% of the training examples as development data. We compared with hierarchical attention networks (Yang et al., 2016), which use a normalized attention mechanism on both word and sentence layers with a flat document structure, and provide the state-of-the-art result on this corpus.

Framing dimensions in news articles.

The Media Frames Corpus (MFC; Card et al., 2015) includes around 4,200 news articles about immigration from 13 U.S. newspapers over the years 1980–2012. The annotations of these articles are in terms of a set of 15 general-purpose labels, such as Economics and Morality, designed to categorize the emphasis framing applied to the immigration issue within the articles. We focused on predicting the single primary frame of each article. The state-of-the-art result on this corpus is from Card et al. (2016), who used logistic regression with unigrams, bigrams, and Bamman-style personas (Bamman et al., 2014) as features. The best feature combination in their model, alongside other hyperparameters, was identified by a Bayesian optimization method (Bergstra et al., 2015). To select hyperparameters, we used a small set of examples from the corpus as a development set. We then report average accuracy across 10-fold cross validation, as in Card et al. (2016).

Congressional floor debates.

The corpus was originally collected by Thomas et al. (2006), and the data split we used was constructed by Yessenalina et al. (2010). The goal is to predict the vote (“yea” or “nay”) for the speaker of each speech segment. The most recent work on this corpus is from Yogatama and Smith (2014), who proposed structured regularization methods based on linguistic components, e.g., sentences, topics, and syntactic parses. Each regularization method induces a linguistic bias to improve text classification accuracy; the best result, which we repeat here, comes from the model with sentence regularizers.

Movie reviews.

This classic movie review corpus was constructed by Pang and Lee (2004) and includes 1,000 positive and 1,000 negative reviews. On this corpus, we used the standard ten-fold data split for cross validation and reported the average accuracy across folds. We compared with Bhatia et al. (2015) and Hogenboom et al. (2015), two recent works on discourse for sentiment analysis. Bhatia et al. (2015) used a hand-crafted weighting scheme to bias the bag-of-words representations of sentences. Hogenboom et al. (2015) also considered manually-designed weighting schemes and a lexicon-based model as classifier, achieving performance inferior to fully-supervised methods like Bhatia et al. (2015) and ours.

Congressional bill corpus.

This corpus, collected by Yano et al. (2012), includes 51,762 legislative bills from the 103rd to 111th U.S. Congresses. The task is to predict whether a bill will survive based on its content. We randomly sampled 10% of the training examples as development data to search for the best hyperparameters. To our knowledge, the best published results are due to Yogatama and Smith (2014), which is the same baseline as for the congressional floor debates corpus.

6 Experiments

Method                          Yelp    MFC     Debates   Movies   Bills
Prior work
1. Yang et al. (2016)           71.0    -       -         -        -
2. Card et al. (2016)           -       56.8    -         -        -
3. Yogatama and Smith (2014)    -       -       74.0      -        88.5
4. Bhatia et al. (2015)         -       -       -         82.9     -
5. Hogenboom et al. (2015)      -       -       -         71.9     -
Variants of our model
6. Additive                     68.5    57.6*   69.0      82.7     80.1
7. Root                         54.3    51.2    60.3      68.7     70.5
8. Unlabeled                    71.3*   58.4*   75.7*     83.1*    78.4
9. Full                         71.8*   56.3    74.2*     79.5     77.0
Table 2: Test-set accuracy across five datasets. Results from prior work are reprinted from the corresponding publications. An asterisk marks performance stronger than the previous state of the art.

We evaluated all variants of our model on the five datasets presented in section 5, comparing in each case to the published state of the art as well as the most relevant works.


Results.

See Table 2. On four out of five datasets, our Unlabeled model (line 8) outperforms past methods. In the case of the very large Yelp dataset, our Full model (line 9) gives even stronger performance, but not elsewhere, suggesting that it is overparameterized for the smaller datasets. Indeed, on the MFC and Movies tasks, the discourse-ignorant Additive model outperforms the Full model. On these datasets, the selected Full model had nearly 20 times as many parameters as the Unlabeled model, which in turn had twice as many parameters as the Additive model.

This finding demonstrates the benefit of explicit discourse structure, even the output from an imperfect parser, for text categorization in some genres. This benefit is supported by both Unlabeled and Full, since both use the discourse structure of texts. The advantage of using discourse information varies across genres and corpus sizes. Even though the discourse parser is trained on news text, it still offers benefit to restaurant and movie reviews and to the genre of congressional debates. Even for news text, if the training dataset is small (e.g., MFC), a lighter-weight variant of discourse (Unlabeled) is preferred.

Legislative bills, which have technical legal content and highly specialized conventions (see the supplementary material for an example), are arguably the most distant genre from news among those we considered. On that task, we see discourse working against accuracy. Note that the corpus of bills is more than ten times larger than the corpora in three of the cases where our Unlabeled model outperformed past methods, suggesting that the drop in performance is not due to lack of data.

It is also important to notice that the Root model performs quite poorly in all cases. This implies that discourse structure is not simply helping by finding a single EDU upon which to make the categorization decision.

Qualitative analysis.

Figure 3 shows some example texts from the Yelp Review corpus with their discourse structures produced by DPLP, where the weights were generated with the Full model. Figures 3(a) and 3(b) are two successful examples of the Full model. Figure 3(a) shows a simple case with respect to the discourse structure. Figure 3(b) is slightly different: the text in this example may have more than one reasonable discourse structure, e.g., one of its EDUs could plausibly attach to a different parent. In both cases, the discourse structure helps bias the Full model toward the important sentences.

Figure 3(c), on the other hand, presents a negative example, where DPLP failed to identify the most salient sentence. In addition, the weights produced by the Full model do not make much sense; we suspect that the model was confused by the structure. Figure 3(c) also presents a manually-constructed discourse structure on the same text for reference. We would expect a more accurate prediction with this manually-constructed discourse structure, because it has the appropriate dependencies between sentences. In addition, the annotated discourse relations would select the right relation-specific composition matrices in the Full model, consistent with the training examples.

(a) true label: 2, predicted label: 2
(b) true label: 5, predicted label: 5
(c) true label: 1, predicted label: 3
Figure 3: Some example texts (with light revision for readability) from the Yelp Review corpus and their corresponding dependency discourse parses from DPLP (Ji and Eisenstein, 2014). The numbers on dependency edges are attention weights produced by the Full model.

Effect of parsing performance.

A natural question is whether further improvements to RST discourse parsing would lead to even greater gains in text categorization. While advances in discourse parsing are beyond the scope of this paper, we can gain some insight by exploring degradation to the DPLP parser. An easy way to do this is to train it on subsets of the RST discourse treebank. We repeated the conditions described above for our Full model, training DPLP on 25%, 50%, and 75% of the training set (randomly selected in each case) before re-parsing the data for the sentiment analysis task. We did not repeat the hyperparameter search. In Figure 4, we plot accuracy of the classifier (y-axis) against the performance of the discourse parser (x-axis). Unsurprisingly, lower parsing performance implies lower classification accuracy. Notably, if the RST discourse treebank were reduced to 25% of its size, our method would underperform the discourse-ignorant model of Yang et al. (2016). While we cannot extrapolate with certainty, these findings suggest that further improvements to discourse parsing, through larger annotated datasets or improved models, could lead to greater gains.

Figure 4: Varying the amount of training data for the discourse parser, we can see how parsing performance affects accuracy on the Yelp review task.

Attention mechanism.

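In section 3, we contrasted our new attention mechanism (Equation 2), which is inspired by RST’s lack of “competition” for salience among satellites, with the attention mechanism used in machine translation (Bahdanau et al., 2015). We consider here a variant of our model with normalized attention:

$$\boldsymbol{\alpha}_i' = \mathrm{softmax}\Big(\big[\mathbf{e}_i^{\top} \mathbf{W}_{\alpha} \mathbf{v}_j\big]_{j \in \mathrm{children}(i)}\Big). \tag{4}$$

The result here is a vector $\boldsymbol{\alpha}_i'$, with one element for each child node $j \in \mathrm{children}(i)$, and which sums to one.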

On the Yelp dataset, this variant of the Full model achieves 70.3% accuracy (1.5 points absolute behind our Full model), giving empirical support to our theoretically-motivated design decision not to normalize attention. Of course, further architecture improvements may yet be possible.
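The difference between the two mechanisms shows up in a toy comparison. The sketch below assumes raw attention scores have already been computed for each child (the score values are hypothetical) and contrasts a softmax over siblings with an independent sigmoid per child.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores: List[float]) -> List[float]:
    """Normalized attention: weights compete and sum to one."""
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]


scores_before = [2.0, 2.0]          # two children with equal raw scores
scores_after = [2.0, 2.0, 2.0]      # a third, equally scored sibling is added

normalized_before = softmax(scores_before)
normalized_after = softmax(scores_after)
unnormalized_before = [sigmoid(s) for s in scores_before]
unnormalized_after = [sigmoid(s) for s in scores_after]
```

Under normalization the first child's weight falls from one half to one third when the sibling arrives; under the unnormalized scheme it is unchanged, matching the intuition that satellites do not compete for salience.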


Discussion.

Our findings in this work show the benefit of using discourse structure for text categorization. Although discourse structure strongly improves performance on most of the corpora in our experiments, its benefit is limited by two factors in particular: (1) the state-of-the-art performance on RST discourse parsing; and (2) domain mismatch between the training corpus for a discourse parser and the domain where the discourse parser is used. For the first factor, discourse parsing is still an active research topic in NLP, and may yet improve. The second factor suggests exploring domain adaptation methods or even direct discourse annotation for genres of interest.

7 Related Work

Early work on text categorization often treated text as a bag of words (e.g., Joachims, 1998; Yang and Pedersen, 1997). Representation learning, for example through matrix decomposition (Deerwester et al., 1990) or latent topic variables (Ramage et al., 2009), has been considered to avoid overfitting in the face of sparse data.

The assumption that all parts of a text should influence categorization equally persists even as more powerful representation learners are considered. Zhang et al. (2015) treat a text as a sequence of characters, proposing a deep convolutional neural network to build the text representation. Xiao and Cho (2016) extended that architecture by inserting a recurrent neural network layer between the convolutional layer and the classification layer.

In contrast, our contributions follow Ko et al. (2004), who sought to weight the influence of different parts of an input text on the task. Two works that sought to learn the importance of sentences in a document are Yessenalina et al. (2010) and Yang et al. (2016). The former used a latent variable for the informativeness of each sentence, and the latter used a neural network to learn an attention function. Neither used any linguistic bias, relying only on task supervision to discover the latent variable distribution or attention function. Our work builds the neural network directly on a discourse dependency tree, favoring the most central EDUs over the others but giving the model the ability to overcome this bias.

Another way to use linguistic information was presented by Yogatama and Smith (2014), who used a bag-of-words model. The novelty in their approach was a data-driven regularization method that encouraged the model to collectively ignore groups of features found to co-occur. Most related to our work is their “sentence regularizer,” which encouraged the model to try to ignore training-set sentences that were not informative for the task. Discourse structure was not considered.

Discourse for sentiment analysis.

Recently, discourse structure has been considered for sentiment analysis, which can be cast as a text categorization problem. Bhatia et al. (2015) proposed two discourse-motivated models for sentiment polarity prediction. One of the models is also based on discourse dependency trees, but uses a hand-crafted weighting scheme. Our method’s attention mechanism automates the weighting.

8 Conclusion

We conclude that automatically-derived discourse structure can be helpful to text categorization, and the benefit increases with the accuracy of discourse parsing. We did not see a benefit for categorizing legislative bills, a text genre whose discourse structure diverges from that of news. These findings motivate further improvements to discourse parsing, especially for new genres.


Acknowledgments

We thank anonymous reviewers and members of Noah’s ARK for helpful feedback on this work. We thank Dallas Card and Jesse Dodge for helping prepare the Media Frames Corpus and the Congressional bill corpus. This work was made possible by a University of Washington Innovation Award.


References

  • Azar (1999) Moshe Azar. 1999. Argumentative text as rhetorical structure: An application of rhetorical structure theory. Argumentation 13(1):97–114.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Bamman et al. (2014) David Bamman, Brendan O’Connor, and Noah A Smith. 2014. Learning latent personas of film characters. In ACL.
  • Bergstra et al. (2015) James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D. Cox. 2015. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery 8(1).
  • Bhatia et al. (2015) Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In EMNLP.
  • Card et al. (2015) Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The Media Frames Corpus: Annotations of frames across issues. In ACL.
  • Card et al. (2016) Dallas Card, Justin Gross, Amber E. Boydstun, and Noah A. Smith. 2016. Analyzing framing through the casts of characters in the news. In EMNLP.
  • Carlson et al. (2001) Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Proceedings of Second SIGdial Workshop on Discourse and Dialogue.
  • Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391.
  • Hogenboom et al. (2015) Alexander Hogenboom, Flavius Frasincar, Franciska de Jong, and Uzay Kaymak. 2015. Using rhetorical structure in sentiment analysis. Communications of the ACM 58(7):69–77.
  • Ji and Eisenstein (2014) Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for document-level discourse parsing. In ACL.
  • Ji and Eisenstein (2015) Yangfeng Ji and Jacob Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics 3:329–344.
  • Joachims (1998) Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In ECML.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv:1404.2188.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Ko et al. (2004) Youngjoong Ko, Jinwoo Park, and Jungyun Seo. 2004. Improving text categorization using the importance of sentences. Information Processing & Management 40(1):65–79.
  • Kuncoro et al. (2017) Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In EACL.
  • Mann and Thompson (1988) William Mann and Sandra Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3):243–281.
  • Marcu (1999) Daniel Marcu. 1999. Discourse trees are good indicators of importance in text. In Inderjeet Mani and Mark T. Maybury, editors, Advances in Automatic Text Summarization, pages 123–136.
  • Marcus et al. (1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.
  • Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv:1701.03980.
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 271.
  • Paulus et al. (2014) Romain Paulus, Richard Socher, and Christopher D Manning. 2014. Global belief recursive neural networks. In NIPS.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • Prasad et al. (2008) Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In LREC.
  • Ramage et al. (2009) Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP.
  • Socher et al. (2014) Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2:207–218.
  • Thomas et al. (2006) Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP.
  • Xiao and Cho (2016) Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv:1602.00367.
  • Yamada and Matsumoto (2003) H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In IWPT.
  • Yang and Pedersen (1997) Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL.
  • Yano et al. (2012) Tae Yano, Noah A. Smith, and John D. Wilkerson. 2012. Textual predictors of bill survival in congressional committees. In NAACL.
  • Yessenalina et al. (2010) Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured models for document sentiment classification. In EMNLP.
  • Yogatama and Smith (2014) Dani Yogatama and Noah A. Smith. 2014. Linguistic structured sparsity in text categorization. In ACL.
  • Yoshida et al. (2014) Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. In EMNLP.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.

Appendix A Supplementary Material: An example text from the Bill corpus

4449 IH
2d Session
H. R. 4449
To amend part A of title IV of the Social Security Act to enable States to construct, rehabilitate, purchase or rent permanent housing for homeless AFDC families, using funds that would otherwise be used to provide emergency assistance for such families.
MAY 18, 1994
Mr. PETERSON of Minnesota (for himself, Mr. FLAKE, Mr. FRANK of Massachusetts, Mr. VENTO, and Mr. RANGEL) introduced the following bill; which was referred jointly to the Committees on Ways and Means and Banking, Finance and Urban Affairs
To amend part A of title IV of the Social Security Act to enable States to construct, rehabilitate, purchase or rent permanent housing for homeless AFDC families, using funds that would otherwise be used to provide emergency assistance for such families.
Be it enacted by the Senate and House of Representatives of the United States of America in Congress assembled,
This Act may be cited as the ‘Permanent Housing for Homeless Families Act’.
(a) IN GENERAL- Section 406 of the Social Security Act (42 U.S.C. 606) is amended by inserting after subsection (c) the following:
‘(d)(1) The term ‘emergency assistance to needy families with children’ includes the qualified expenditures of an eligible State.
‘(2) As used in paragraph (1):
‘(A) The term ‘eligible State’ means, with respect to a fiscal year, a State that meets the following requirements:
‘(i) The State plan approved under this part for the fiscal year includes provision for emergency assistance as described in subsection (e) or this subsection.
‘(ii) The State has provided assurances to the Secretary that the average amount that the State intends to expend per family for such emergency assistance for the fiscal year would not exceed such average amount for the immediately preceding fiscal year. The Secretary shall prescribe in regulations standards for determining the period over which capital expenditures incurred in the provision of such emergency assistance are to be amortized.