Interpretable Structure-aware Document Encoders with Hierarchical Attention

Khalil Mrini
Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093

Claudiu Musat, Michael Baeriswyl
AI Group
Swisscom AG
Lausanne, Switzerland

Martin Jaggi
Machine Learning and
Optimization Lab, EPFL
Lausanne, Switzerland

We propose a method to create document representations that reflect their internal structure. We modify Tree-LSTMs to hierarchically merge basic elements such as words and sentences into blocks of increasing complexity. Our Structure Tree-LSTM implements a hierarchical attention mechanism over individual components and combinations thereof. We thus emphasize the usefulness of Tree-LSTMs for texts larger than a sentence. We show that structure-aware encoders can be used to improve the performance of document classification. We demonstrate that our method is resilient to changes to the basic building blocks, as it performs well with both sentence and word embeddings. The Structure Tree-LSTM outperforms all the baselines on two datasets by leveraging structural clues. We show our model’s interpretability by visualizing how it distributes attention inside a document. On a third dataset from the medical domain, our model achieves competitive performance with the state of the art. This result shows the Structure Tree-LSTM can leverage dependency relations other than text structure, such as a set of reports on the same patient. Our code is released in a GitHub repository, along with our Wikipedia dataset.

1 Introduction

Humans use structure to better represent information, and within that structure, elements vary in importance. For example, a table of contents helps define a document’s global structure and focuses the reader’s attention on what matters.

Long, unstructured sequences are hard to process for humans and machines alike. Even though neural network techniques have recently brought significant improvements to text classification, Long Short-Term Memory (LSTM) networks Hochreiter and Schmidhuber (1997) perform poorly on long sequences Cheng et al. (2016). Proposed solutions to overcome the problems that stem from long, flattened sequences include LSTM variants such as the bidirectional variant Graves and Schmidhuber (2005); Huang et al. (2015); Chiu and Nichols (2015) and the attention-based one Wang et al. (2016). The latter focuses on relevant sections independent of their location. Flat attention cannot, however, cope with long sequences. Splitting a long text into smaller sections has the advantage of being able to flush the attention when the local context ends. A first step in this direction has been taken by Yang et al. (2016), who apply attention at both the word level and the sentence level, but documents are still viewed as a flat sequence of sentences.

Our first contribution is to adapt tools thus far used only in the context of a single sentence, the tree-structured LSTM networks Tai et al. (2015); Le and Zuidema (2015); Zhu et al. (2015), to fit the document structure instead.

Our second contribution is to extend previous attention mechanisms to all the structural levels of a document, and make them applicable in a hierarchical structure. To do so, we apply different initialization mechanisms to our Tree-LSTM model, leading to a transformation of the LSTM forget gates into a de facto attention mechanism. We show our model is interpretable and makes semantically relevant choices using a visualisation of this hierarchical attention mechanism.

By including hierarchical structure in the document representation and leveraging several layers of visualizable attention, we create interpretable structure-aware attention-based document encoders.

Our third contribution is to show that the structure-aware encoders are useful. We choose the task of supervised multi-class document classification as a first target, where each document has to be assigned to exactly one category. We first show that hierarchical documents can be better classified using tree-structured LSTMs. We obtain improved document classification results over two datasets with varying structure depths.

Then we show that our model can leverage structure beyond a single document, in settings where a training sample is a set of related documents. On a benchmark dataset from the medical domain, we model patients using their health reports to predict mortality. We obtain competitive results with the state of the art, emphasizing that our model can efficiently model dependency relations other than simple textual structure.

Finally, we also release our code along with our new dataset of hierarchical documents.

The rest of the paper is organised as follows: we summarize the literature on document embedding and classification in Section 2. Section 3 describes the proposed method and Section 4 details the experiments and results. We draw conclusions and outline suggestions for future work in Section 5.

2 Related Work

Text Embeddings. Among the most popular context-based word embeddings models are word2vec Mikolov et al. (2013a, b), FastText Bojanowski et al. (2016) and GloVe Pennington et al. (2014).

More recently, embedding methods for sentences have emerged. Zhao et al. (2015) introduce AdaSent, a self-adaptive hierarchical sentence model. Kiros et al. (2015) propose Skip-Thought: an extension of the skip-gram model to sentences. Hill et al. (2016) try to address computationally expensive training with their FastSent model. Arora et al. (2016) compute a sentence’s embedding as the average of the embeddings of its words, minus the first principal component. Conneau et al. (2017) propose a supervised sentence embedding model trained on the SNLI dataset Bowman et al. (2015). The sent2vec embeddings Pagliardini et al. (2017) are trained unsupervised, and represent sentences by looking at unigrams, as well as n-grams that compose them, in a similar fashion to FastText. Devlin et al. (2018) introduce the Bidirectional Encoder Representations from Transformers (BERT).

Document Encoders. Yang et al. (2016) introduce a hierarchical document encoder, used as a document classification model. It decomposes a document into a sentence and a word level. At each level, it applies an encoder with a bi-GRU and an attention mechanism. As a higher level structure is not used, they tested the method on shorter documents including reviews and answers.

Tree-structured LSTM networks. Tai et al. (2015) adapt the standard LSTM to tree-like structures with two kinds of Tree-LSTM architectures. The first one is the Child-Sum variant: for a unit $j$, the hidden variable $h_{j-1}$, that would be carried on from the previous LSTM unit in a standard architecture, is replaced by the sum of the hidden variables of its children units, $\tilde{h}_j = \sum_{k \in C(j)} h_k$, with $C(j)$ being the children units of unit $j$. In addition, there is one forget gate $f_{jk}$ per child $k$ of unit $j$. The parameter matrices enable the unit to determine the contributions of its children units in each gate. The second variant is the $N$-ary Tree-LSTM, where each non-leaf unit should have a branching factor of at most $N$ and have ordered children. This variant allocates one parameter matrix per child, enabling it to learn conditioning based on the child’s position from $1$ to $N$. However, it is not as modular as the Child-Sum Tree-LSTM as there is a constraint on the branching factor. Any Tree-LSTM unit still has to get an input $x_j$, whether it is a leaf unit or not.

Le and Zuidema (2015) develop the LSTM-RNN, a binary tree-structured LSTM architecture, such that each non-leaf unit has exactly two children, with the corresponding pairs of input and forget gates. Zhu et al. (2015) introduce a similar binary tree-structured LSTM architecture, the S-LSTM, in which there is one input gate per non-leaf unit, but still two forget gates. A non-leaf unit in these two architectures does not have an input of its own, but it takes the inputs of its children units for the LSTM-RNN model, and their outputs for the S-LSTM model.

In addition to sentence-level sentiment classification, Tai et al. (2015) test their model on semantic relatedness between sentence pairs. Eriguchi et al. (2016) extend the Tree-LSTM model to introduce a tree-to-sequence attentional neural machine translation model. Chen et al. (2017) use it in conjunction with a bi-LSTM to form a hybrid model for natural language inference.

3 Structure-aware Attention-based Document Encoders

3.1 Structure Awareness

Figure 1: The Structure Tree corresponding to a document: the basis to form the document’s Tree-LSTM.
Figure 2: The Structure Tree-LSTM with Zero Vectors applied to the document in Figure 1, with the numbers in circles referring to the same input. The empty set symbols indicate a vector of zeros.
Figure 3: The Structure Tree-LSTM with Hierarchical Average applied to the document in Figure 1, with the numbers in circles referring to the same input. A given unit’s input is the average of the inputs of its children.

The starting assumption is that common documents have a hierarchical structure. Words are grouped in sentences, sentences in paragraphs, which in turn form subsections, sections and so on. From this observation, we derive the hypothesis that hierarchical attention over a document’s structure allows the resulting representation to highlight the document’s important aspects.

Our Structure Tree-LSTM captures a document’s hierarchical structure by mirroring the corresponding document tree. For example, the document tree in Figure 1 corresponds to a document with the following outline: (1) Introduction: 1 paragraph with 3 sentences; (2) History: 2 subsections; (2.1) 19th Century: 2 paragraphs, with 2 and 1 sentences respectively; (2.2) 20th Century: 1 paragraph with 2 sentences. The structure granularity can be adjusted according to the downstream task and size of the dataset. Large datasets can have coarse-grained structure for the model to be less computationally expensive, whereas smaller datasets can have Tree-LSTMs include components all the way down to words.

The first major difference with respect to the existing Tree-LSTM is that, as seen in the example, there is no imposed order on the semantic components. The Dependency Tree-LSTM of Tai et al. (2015) uses sentence-level word dependencies, as in the example in Figure 4. This is generally not extensible to the document level. Our Structure Tree-LSTM relaxes this assumption, making it more generally applicable.

The second major difference is the distinction between leaf and non-leaf units in a Structure Tree-LSTM. A leaf unit is the smallest component of the document (i.e., a word or a sentence), and has as input the component’s embedding. A non-leaf (parent) unit represents a larger component of the document: a sentence (group of words), a paragraph (group of sentences), or a section (group of sections and/or paragraphs). This generalization of the node contents allows for the extension of the method to more general contexts. In the original models of Tai et al. (2015), all units of a Tree-LSTM represent an original input (i.e., a word).

This distinction between unit types leads to the creation of two variants of Structure Tree-LSTMs, that differ in the strategy for filling the non-leaf units. We investigate two main strategies:

Figure 4: Example of a sentence encoding using the Dependency Tree-LSTM of Tai et al. (2015). Each LSTM unit receives a word embedding as input. This sentence’s reordering is based on its dependency tree.

(1) Structure Tree-LSTM with Zero Vectors: non-leaf units get zero vectors as input (Figure 2).

(2) Structure Tree-LSTM with Hierarchical Average: each non-leaf unit has as input the average of the input vectors of its children (Figure 3).

The Structure Tree-LSTM is easily extensible with additional initialization methods. One such method could be replacing the hierarchical averaging by a sum of the children’s inputs. Another one could be using section titles as input for the non-leaf units representing sections. This underlines the power of the Structure Tree-LSTM to incorporate all the information available. However, for the sake of fairness in comparison with the baselines, we did not use section titles in any of our models.
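The two initialization strategies can be illustrated with a minimal sketch. The `Node` class, the `node_input` helper and the toy two-dimensional embeddings below are hypothetical names and values chosen for illustration; they are not taken from the released code.

```python
from typing import List, Optional

Vector = List[float]


class Node:
    """A unit of the structure tree; only leaves carry an embedding."""

    def __init__(self, children: Optional[List["Node"]] = None,
                 embedding: Optional[Vector] = None) -> None:
        self.children = children or []
        self.embedding = embedding


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def node_input(node: Node, strategy: str, dim: int) -> Vector:
    """Input vector fed to a unit under the chosen non-leaf strategy."""
    if not node.children:
        return node.embedding  # leaf: the word or sentence embedding
    if strategy == "zero":
        return [0.0] * dim
    if strategy == "hierarchical_average":
        return mean([node_input(c, strategy, dim) for c in node.children])
    raise ValueError(f"unknown strategy: {strategy}")


# Toy section: one paragraph with two sentences, one with a single sentence.
s1 = Node(embedding=[1.0, 0.0])
s2 = Node(embedding=[3.0, 0.0])
s3 = Node(embedding=[0.0, 4.0])
section = Node(children=[Node(children=[s1, s2]), Node(children=[s3])])

zero_input = node_input(section, "zero", dim=2)
avg_input = node_input(section, "hierarchical_average", dim=2)
```

In this toy case the hierarchical average gives each paragraph the same weight at the section level, regardless of how many sentences it contains.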

3.2 Hierarchical Attention

We use the same transition equations as in the Child-Sum Tree-LSTM described in Tai et al. (2015). We analyse them to explain the attention mechanisms of our proposed models.

For a unit $j$ of a Child-Sum Tree-LSTM, the hidden state $h_{j-1}$, that is carried on from the previous LSTM unit in a standard architecture, is replaced by the sum of the hidden states of its children units, $\tilde{h}_j = \sum_{k \in C(j)} h_k$, with $C(j)$ being the children units of unit $j$. In addition, there is one forget gate $f_{jk}$ per child $k$ of unit $j$.

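However, as the leaf units have no children units, we have that $C(j) = \emptyset$, and as such $\tilde{h}_j = 0$. Therefore, the only contribution comes from the input $x_j$ (the word or sentence embeddings), without influence from other inputs. This changes the equations in practice, as for example the formula for the input gate:

\[ i_j = \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right) \tag{1} \]

becomes for leaf units:

\[ i_j = \sigma\left(W^{(i)} x_j + b^{(i)}\right) \tag{2} \]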

The model with Zero Vectors de facto changes the formulas for the non-leaf units as well. Because a non-leaf unit’s input is $x_j = 0$, the only contribution comes from the children units. This makes the Structure Tree-LSTM with zero vectors similar to a joint hierarchical attention mechanism. The formula for the forget gate for child unit $k$:

\[ f_{jk} = \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right) \tag{3} \]

becomes for non-leaf units:

\[ f_{jk} = \sigma\left(U^{(f)} h_k + b^{(f)}\right) \tag{4} \]

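making the forget gate an attention mechanism over individual child units. Likewise, the formulas of the memory cell $c_j$, the input gate $i_j$ and output gate $o_j$ change in practice. For example, the formula for the output gate:

\[ o_j = \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right) \tag{5} \]

becomes for non-leaf units:

\[ o_j = \sigma\left(U^{(o)} \tilde{h}_j + b^{(o)}\right) = \sigma\Big(U^{(o)} \sum_{k \in C(j)} h_k + b^{(o)}\Big) \tag{6} \]

such that the output gate can now be assimilated to an attention over linear combinations of individual children.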

We can thus view the model with Zero Vectors as a generalization of hierarchical attention mechanisms.
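To make the effect of a zero input concrete, here is a minimal pure-Python sketch of one Child-Sum Tree-LSTM step, following the transition equations of Tai et al. (2015). The function name `child_sum_unit`, the dictionary-based parameter layout and the toy weights are assumptions made for illustration, not the released implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vectors: Vector) -> Vector:
    """Element-wise sum of one or more vectors."""
    return [sum(dims) for dims in zip(*vectors)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def child_sum_unit(x: Vector, children: List[Tuple[Vector, Vector]],
                   W: Dict[str, Matrix], U: Dict[str, Matrix],
                   b: Dict[str, Vector]):
    """One step; children holds (h_k, c_k) pairs. Returns (h_j, c_j, forget gates)."""
    size = len(b["i"])
    # Sum of the children's hidden states; zero for a leaf.
    h_tilde = add(*(h for h, _ in children)) if children else [0.0] * size

    def pre(gate: str, h: Vector) -> Vector:
        return add(matvec(W[gate], x), matvec(U[gate], h), b[gate])

    i = sigmoid(pre("i", h_tilde))
    o = sigmoid(pre("o", h_tilde))
    u = [math.tanh(a) for a in pre("u", h_tilde)]
    f = [sigmoid(pre("f", h_k)) for h_k, _ in children]  # one gate per child
    c = [i_d * u_d for i_d, u_d in zip(i, u)]
    for f_k, (_, c_k) in zip(f, children):
        c = [c_d + f_d * ck_d for c_d, f_d, ck_d in zip(c, f_k, c_k)]
    h = [o_d * math.tanh(c_d) for o_d, c_d in zip(o, c)]
    return h, c, f


# Toy parameters shared by all gates (illustrative values only).
W = {g: [[0.5, -0.2], [0.1, 0.3]] for g in "ifou"}
U = {g: [[0.4, 0.0], [0.0, -0.6]] for g in "ifou"}
b = {g: [0.1, -0.1] for g in "ifou"}

leaf_a = child_sum_unit([1.0, 2.0], [], W, U, b)
leaf_b = child_sum_unit([-1.0, 0.5], [], W, U, b)
children = [(leaf_a[0], leaf_a[1]), (leaf_b[0], leaf_b[1])]
# Non-leaf unit of the Zero Vectors variant: its own input is all zeros.
h_parent, c_parent, f_parent = child_sum_unit([0.0, 0.0], children, W, U, b)
```

With a zero input, the `W` terms vanish, so each forget gate in `f_parent` depends only on the hidden state of the corresponding child.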

4 Experiments

4.1 Experimental Setup

We design two document classification experiments and one mortality prediction experiment for the Structure Tree-LSTM. To evaluate our model, we focus the analysis on datasets having a hierarchical structure. Therefore, we could not use the datasets of reviews on which the Hierarchical Attention Networks Yang et al. (2016) are evaluated.

In the datasets of reviews in Yang et al. (2016), a training sample is only one paragraph with about 5 to 14 sentences on average. In our three selected datasets, we keep all information on internal structure: where each document part (paragraphs, sections, subsections…) starts and ends. We therefore evaluate our models on datasets with at least three levels of hierarchy, with the highest hierarchy level being larger than a paragraph. More concretely, the highest hierarchy level in our datasets is an article, an email or a patient’s medical record.

4.2 Text Structure in Document Classification

4.2.1 Document Classification Datasets

The Enron Email Dataset. This UC Berkeley-labelled dataset contains 1,700 tagged emails with many overlapping categories. We select the two most common categories, and assign a third label for their intersection, and a fourth label for the emails that do not belong to any category. These real-world documents present meaningful, yet minimalist structure expressed as paragraphs.
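The four-way labelling described above can be sketched as follows. The category names `"A"` and `"B"`, the integer label values and the reading of "any category" as "neither of the two selected categories" are assumptions made for illustration.

```python
from typing import Set


def enron_label(categories: Set[str], first: str, second: str) -> int:
    """Map an email's set of tags to one of four disjoint labels."""
    in_first = first in categories
    in_second = second in categories
    if in_first and in_second:
        return 2  # intersection of the two most common categories
    if in_first:
        return 0
    if in_second:
        return 1
    return 3  # belongs to neither selected category


labels = [enron_label(tags, "A", "B")
          for tags in ({"A"}, {"B", "C"}, {"A", "B"}, {"C"}, set())]
```

Turning overlapping tags into disjoint labels in this way yields a single-label multi-class task.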

The Wikipedia Dataset. We collect an English Wikipedia dataset of 494,657 articles. These are relatively long articles, split into 24 disjoint categories and released as an open resource. More information is available in Appendix A. Given that this is a large dataset, we set leaf nodes to represent sentences to mitigate computational complexity: leaf nodes take sent2vec embeddings as input.

4.2.2 Baselines

For the Enron Email and Wikipedia datasets, we compare the Structure Tree-LSTM with the following baselines:

(3) MLP with Unweighted Average: a Multi-Layer Perceptron (MLP) with one hidden layer having a rectifier activation function. In this model, a document is represented by an unweighted average of all input embeddings.

(4) MLP with Hierarchical Average: same architecture as model (3), but the document representation is a hierarchically-weighted average of input embeddings.

(5) Sequential LSTM: an LSTM layer that takes input embeddings in sequential order. The sets of input embeddings are not padded nor truncated. This model shares the same input as the Tree-LSTM, but processes them sequentially rather than hierarchically.

(6) Hierarchical Attention Networks: the model designed by Yang et al. (2016). To remain faithful to the original model, this model is only tested on experiments with word embeddings as inputs. This model uses bidirectional GRUs, making it a hierarchical bidirectional RNN model. The model’s hierarchy has two levels: one for words and one for sentences. Each level has separate encoders and attention weights, as well as a fixed number of elements. This means there is a fixed number of words (resp. sentences) per sentence (resp. document), and as such padding or truncating are applied where necessary.
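The difference between the document representations of baselines (3) and (4) can be illustrated with a short sketch. It assumes that "hierarchically-weighted" means averaging recursively, level by level; the nested-list encoding of a document and the helper names are hypothetical.

```python
from typing import List


def is_leaf(tree: list) -> bool:
    """A leaf is a plain vector of numbers; anything else is a list of subtrees."""
    return isinstance(tree[0], (int, float))


def mean(vectors: List[List[float]]) -> List[float]:
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def leaves(tree: list) -> List[List[float]]:
    if is_leaf(tree):
        return [tree]
    return [leaf for sub in tree for leaf in leaves(sub)]


def unweighted_average(tree: list) -> List[float]:
    """Baseline (3): every input embedding counts equally."""
    return mean(leaves(tree))


def hierarchical_average(tree: list) -> List[float]:
    """Baseline (4), as assumed here: average the children at each level."""
    if is_leaf(tree):
        return tree
    return mean([hierarchical_average(sub) for sub in tree])


# A document with two paragraphs: two sentences, then one sentence.
document = [[[1.0, 0.0], [3.0, 0.0]], [[0.0, 4.0]]]
flat = unweighted_average(document)
hier = hierarchical_average(document)
```

The two representations differ whenever sibling components contain different numbers of leaves, as in this toy document.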

We compare the number of parameters for each model in Table 1. We take into account the hidden layer, as well as the softmax output layer. Our Structure Tree-LSTM models have as many parameters to learn as a sequential LSTM. Ignoring the output layer and the HAN attention weights, an MLP model has about 4 times fewer parameters to learn than an LSTM model, and the HAN model has about 3 times more parameters.

4.2.3 Document Classification Results

We evaluate the Enron Email and Wikipedia datasets using the Macro-F1 score, computed by averaging the individual F1 scores of each class. These two datasets are about multi-class document classification, and the Macro-F1 score takes into account class imbalance. We show our results in Table 2.
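As a reminder of how the metric behaves, the following sketch computes the Macro-F1 score as the unweighted mean of per-class F1 scores. The function name and the toy labels are illustrative; the evaluation code actually used for the paper is not shown in the text.

```python
from typing import List


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy example the minority class is half wrong, which lowers Macro-F1 (7/9) more than accuracy (5/6), since every class contributes equally to the mean.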

For the Enron Emails, the Structure Tree-LSTM model with zero vectors obtains the highest F1 scores, outperforming all other models. The model’s performance is resilient regardless of the kind of building block the leaves represent (words or sentences). The large difference in macro-F1 scores with respect to the base LSTM underlines the importance of structure when fewer data points are available. We note the lacklustre macro-F1 of the Hierarchical Attention Networks, which suggests that our structure-oriented attention model requires less data to train.

Likewise, for the Wikipedia dataset, the two Structure Tree-LSTM models score visibly higher in both macro-F1 and accuracy than the baselines, confirming the efficiency of a document embedding inclusive of structure. The absolute gain is higher than for the previous dataset, suggesting that our models successfully leverage the additional structure in Wikipedia articles. In all cases, the best Structure Tree-LSTM variant is the one with zero vectors, showing that attention over children units is sufficient.

We analyse the classification errors of the Structure Tree-LSTM with Zero Vectors. The predictions of the actors category are correct 83.12% of the time. The two most common incorrect predictions for actors are actresses (3.89%) and directors (2.82%). Another example is the airlines category, with 85.54% accuracy and most commonly confused with aircraft (12.90%). Although our model has high accuracy, its errors seem to stem from confusing semantically related categories.

Model Number of Parameters to Learn
Structure Tree-LSTM
Sequential LSTM
Hierarchical Attention Networks (HAN)
Table 1: Number of parameters to learn for each model in the document classification datasets. is the embedding dimension, is the hidden layer dimension, is the number of labels. For the HAN model, is the number of words per sentence and is the number of sentences per document.
Figure 5: Visualisation of the hierarchical attention mechanism of our Structure Tree-LSTM with Zero vectors. The example is the Wikipedia article of Hugo Bastidas, correctly predicted as an artist. All cells, except the top one, are colored in the same red, with the opacity toned to the corresponding attention weights.
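| Dataset | Leaves | Model | Macro-F1 | Accuracy |
| --- | --- | --- | --- | --- |
| Enron Emails | Word Embeddings (word2vec) | Structure Tree-LSTM with Zero Vectors | 0.4455 | 0.5235 |
| Enron Emails | Word Embeddings (word2vec) | Structure Tree-LSTM with Hierarchical Average | 0.4099 | 0.4824 |
| Enron Emails | Word Embeddings (word2vec) | MLP with Unweighted Average | 0.4063 | 0.4529 |
| Enron Emails | Word Embeddings (word2vec) | MLP with Hierarchical Average | 0.3934 | 0.4941 |
| Enron Emails | Word Embeddings (word2vec) | Sequential LSTM | 0.3429 | 0.4176 |
| Enron Emails | Word Embeddings (word2vec) | Hierarchical Attention Networks | 0.3632 | 0.5078 |
| Enron Emails | Sentence Embeddings (sent2vec) | Structure Tree-LSTM with Zero Vectors | 0.4533 | 0.5118 |
| Enron Emails | Sentence Embeddings (sent2vec) | Structure Tree-LSTM with Hierarchical Average | 0.4278 | 0.4941 |
| Enron Emails | Sentence Embeddings (sent2vec) | MLP with Unweighted Average | 0.4164 | 0.4588 |
| Enron Emails | Sentence Embeddings (sent2vec) | MLP with Hierarchical Average | 0.3822 | 0.4706 |
| Enron Emails | Sentence Embeddings (sent2vec) | Sequential LSTM | 0.3002 | 0.4059 |
| Wikipedia | Sentence Embeddings (sent2vec) | Structure Tree-LSTM with Zero Vectors | 0.8538 | 0.8877 |
| Wikipedia | Sentence Embeddings (sent2vec) | Structure Tree-LSTM with Hierarchical Average | 0.8430 | 0.8814 |
| Wikipedia | Sentence Embeddings (sent2vec) | MLP with Unweighted Average | 0.7870 | 0.8534 |
| Wikipedia | Sentence Embeddings (sent2vec) | MLP with Hierarchical Average | 0.7790 | 0.8476 |
| Wikipedia | Sentence Embeddings (sent2vec) | Sequential LSTM | 0.6405 | 0.7802 |

Table 2: Results of the multi-class classification experiments.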

| Dataset | Leaves | Model | AUC score |
| --- | --- | --- | --- |
| MIMIC-III | Sentence Embeddings (sent2vec) | Structure Tree-LSTM with Zero Vectors | 0.958 |
| MIMIC-III | - | LDA | 0.930 |
| MIMIC-III | - | doc2vec | 0.930 |
| MIMIC-III | - | CNN | 0.963 |

Table 3: Results of the binary classification experiment in comparison with the baselines of Grnarova et al. (2016).

4.2.4 Analysis of the Hierarchical Attention Mechanism

To visualize the attention weights given by a node to its children nodes, we compute the product of each child’s candidate vector with the forget gate. Then, we divide the value for each dimension of the resulting vector by the sum of the values for all children for the same dimension. We therefore obtain, for each child, a vector of contribution percentages for each dimension. We average this vector, and get the individual attention weights of the children nodes. More formally, for the $k$-th child of node $j$, with the child’s candidate vector $c_k$ of dimension $n$ and corresponding forget gate $f_{jk}$, we compute the corresponding attention weight $\alpha_{jk}$ as follows:

\[ \alpha_{jk} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left(f_{jk} \odot c_k\right)_i}{\sum_{l \in C(j)} \left(f_{jl} \odot c_l\right)_i} \tag{7} \]

We analyse how our Structure Tree-LSTM with Zero vectors applies its hierarchical attention mechanism to documents through an example: the Wikipedia article of Hugo Bastidas, correctly predicted by our model as an article about an artist. The attention weights are visualized in Figure 5.

Our Structure Tree-LSTM displays semantically relevant choices in its distribution of attention: to predict that the article is about an artist, it pays more attention to the Career section. The figure shows that a model needs more than just the introduction of a document for efficient classification. We illustrate how our model attributes attention weights to all levels of the hierarchy, including subsections (under Career), and all the way down to paragraphs and sentences (under Introduction). We did not display all levels of the hierarchy for lack of space. Using this attention visualisation, we can interpret our model, and see which parts of the document influence the prediction.
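The attention-weight computation described at the start of this section (element-wise product with the forget gate, per-dimension normalisation across children, then averaging over dimensions) can be sketched as follows. The function name and the toy values are illustrative, and the sketch assumes that the per-dimension sums are non-zero.

```python
from typing import List

Vector = List[float]


def attention_weights(forget_gates: List[Vector],
                      candidates: List[Vector]) -> List[float]:
    """One attention weight per child of a node.

    forget_gates[k] and candidates[k] are the forget gate and the
    candidate vector associated with the k-th child.
    """
    # Element-wise product of each child's vector with its forget gate.
    products = [[f * c for f, c in zip(f_k, c_k)]
                for f_k, c_k in zip(forget_gates, candidates)]
    # Sum over children, separately for each dimension.
    totals = [sum(dims) for dims in zip(*products)]
    # Share of each child in each dimension, then mean over dimensions.
    shares = [[p / t for p, t in zip(prod, totals)] for prod in products]
    return [sum(share) / len(share) for share in shares]


weights = attention_weights(
    forget_gates=[[0.5, 0.5], [0.5, 1.0]],
    candidates=[[2.0, 1.0], [2.0, 1.5]],
)
```

Because each dimension is normalised across children, the resulting weights of the children of a node sum to one.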

4.3 Extended Structure: Modelling Sets of Documents

We use a dataset from the medical domain, where the documents are medical reports of different categories. The reports can be grouped by patient to form sets of documents modelled as one Structure Tree-LSTM. By comparing to baseline models, we inquire whether our model can leverage additional structural knowledge, such as the types of reports and links between them.

4.3.1 The MIMIC-III Dataset

To compare our model to existing benchmarks, we use the MIMIC-III dataset Johnson et al. (2016). It is a freely accessible database on critical medical care.

It contains over 2 million unstructured textual medical reports corresponding to 46,520 hospitalised patients, and information about whether a patient has eventually recovered. In case the patient died, it is indicated whether the death occurred at the hospital, within one month after leaving hospital care, or within a year afterwards.

Grnarova et al. (2016) use this dataset to predict patient mortality. It contains 31,244 patients with 812,158 notes. Grnarova et al. (2016) approach this task as multiple binary classification problems: to predict mortality during the hospital stay, within one month later, or within a year later. We obtain the filtered dataset from Grnarova et al. (2016), and focus on in-hospital mortality prediction.

In this experiment, the root node of the Structure Tree represents a patient, and its children units are the patient’s reports. Each report is divided into paragraphs and sentences. Similarly to the Wikipedia dataset, we use sentence embeddings for this experiment. Like Grnarova et al. (2016), we implement our Structure Tree-LSTM with categories encoded as one integer appended to sentence embeddings.

The training time for one epoch is over 24 hours. We choose to focus on the Zero-vector variant, as it performed best in the two first experiments. The performance is evaluated using the Area under the ROC Curve or AUC Hanley and McNeil (1982), the same metric used in the benchmark baselines of Grnarova et al. (2016).
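For readers unfamiliar with the metric, the AUC can be computed directly from its probabilistic interpretation (Hanley and McNeil, 1982): the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition on toy data; it is quadratic in the number of examples and is meant as an illustration rather than as the evaluation code used in the experiments.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = in-hospital death, 0 = survival (illustrative values).
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6.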

4.3.2 Baselines

Grnarova et al. (2016) devise a model using a word-level CNN for words within a sentence, and a sentence-level CNN processing each sentence sequentially. They append the information about the corresponding medical report category as a vector to each sentence embedding.

They compare their CNN model to two baselines. The first one is the LDA-based Retrospective Topic Model Ghassemi et al. (2014), a linear kernel SVM trained on the per-report topic distributions. This model is the state-of-the-art model for mortality prediction in the MIMIC-II dataset Saeed et al. (2011). The second one is a linear SVM trained on doc2vec Le and Mikolov (2014) representations of the reports.

4.3.3 Target Replication

The authors also use target replication Lipton et al. (2015); Dai and Le (2015). The intuition behind target replication is that the model learns better by replicating the loss at intermediate steps.

More formally, we add to the learning objective a cross entropy term for every intermediate step $t$ of a training sample $s$, where $y_s$ is the label associated to the corresponding training sample, $\hat{y}_t$ is the predicted label at the intermediate step $t$, and $h_t$ is the hidden state of step $t$ from which a softmax probability is computed. The loss function for the training sample $s$ becomes:

\[ \mathcal{L}(s) = \mathrm{CE}\left(y_s, \hat{y}_s\right) + \lambda \sum_{t} \mathrm{CE}\left(y_s, \hat{y}_t\right) \tag{8} \]

In Equation 8, $\lambda$ is a regularization parameter, and $\hat{y}_s$ is the label predicted using the hidden state of the training sample $s$.

In the CNN model, target replication means predicting at each sentence and computing the corresponding loss. However, in our Structure Tree-LSTM, it means that we replicate the loss at hierarchy levels: we can predict at all units one level below the root (the root’s children), two levels (the children of the root’s children), or more. Here, we use 1-level target replication. Intuitively, we are replicating the loss at the report nodes.
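A minimal sketch of 1-level target replication is given below. It assumes that the replicated cross-entropy terms are simply summed and scaled by a single weight `lam`; the exact normalisation, the function names and the toy probabilities are assumptions made for illustration.

```python
import math
from typing import List


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def target_replication_loss(root_probs: List[float],
                            report_probs: List[List[float]],
                            label: int, lam: float) -> float:
    """Loss at the root plus weighted losses replicated at the root's children."""
    loss = cross_entropy(root_probs, label)
    loss += lam * sum(cross_entropy(p, label) for p in report_probs)
    return loss


# One patient (root) with two reports; the true label is class 1.
root_probs = [0.2, 0.8]
report_probs = [[0.5, 0.5], [0.4, 0.6]]
loss = target_replication_loss(root_probs, report_probs, label=1, lam=0.5)
plain = target_replication_loss(root_probs, report_probs, label=1, lam=0.0)
```

Setting `lam` to zero recovers the plain cross-entropy at the root, which makes it easy to check how much the replicated terms contribute.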

4.3.4 Mortality Prediction Results

Our results are reported in Table 3. Our model only came 0.005 short of the CNN baseline, and beat the other two baselines. This can be explained by the difference between the CNN baseline and our Structure Tree-LSTM model: whereas the CNN baseline processes a patient’s reports in the temporal order in which they were issued by the hospital, our model does not incorporate this temporal information. This is important as there is a difference between a good health report coming after a bad health report (recovering patient) and the reverse situation (worsening health). More generally, the children of a Child-Sum Tree-LSTM unit are not processed in any sequential order, and as such sequential order is not preserved. Given that our model nonetheless gives competitive results without temporal information, our future work will focus on modelling sequential order in Child-Sum Tree-LSTMs. It also indicates that our model can efficiently model dependency relations other than structure, such as a group of reports on the same patient.

Additional examples of applications of this ability include author or user modelling, using an author’s writings as the children of the root Tree-LSTM unit. Here, the dependency relation would be authorship.

5 Conclusions

To the best of our knowledge, the Structure Tree-LSTM is the first attempt to use Tree-LSTMs for texts larger than sentences. We show that our proposed structure-aware document encoder, in its zero-vector variant, applies attention to all structural levels of a document. We apply the method to document classification and obtain an average of 9.00% relative improvement in macro-F1 score with respect to the best baseline score. We also show that our model’s hierarchical attention mechanism can be visualised, thereby making the predictions interpretable.
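The average relative improvement can be checked against Table 2 with a few lines of arithmetic. The sketch assumes that the figure is the mean, over the three dataset and embedding settings, of the relative macro-F1 gain of the Structure Tree-LSTM with Zero Vectors over the best baseline in each setting.

```python
# (ours, best baseline) macro-F1 pairs read from Table 2.
results = {
    "Enron Emails (word2vec)": (0.4455, 0.4063),
    "Enron Emails (sent2vec)": (0.4533, 0.4164),
    "Wikipedia (sent2vec)": (0.8538, 0.7870),
}

relative_gain = {name: (ours - base) / base
                 for name, (ours, base) in results.items()}
average_pct = 100 * sum(relative_gain.values()) / len(relative_gain)
```

Under this assumption the per-setting gains are roughly 9.6%, 8.9% and 8.5%, and their mean rounds to 9.00%.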

We also test this model on the MIMIC-III dataset, by modelling a patient as the root unit of the Tree-LSTM, and the corresponding medical reports as the children units. We obtain comparable results to the state of the art, coming only 0.005 short in AUC score. We hypothesize that the difference is because the Tree-LSTM cannot encode temporal order, but we note that it successfully modelled structure larger than a single document. This ability could have multiple practical applications, such as modelling people based on their writings.

Finally, we publish a novel document classification dataset of structured Wikipedia articles and release our code to encourage further research on long document encoders.


  • S. Arora, Y. Liang, and T. Ma (2016) A simple but tough-to-beat baseline for sentence embeddings. Cited by: §2.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1657–1668. Cited by: §2.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. Cited by: §1.
  • J. P. Chiu and E. Nichols (2015) Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308. Cited by: §1.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364. Cited by: §2.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087. Cited by: §4.3.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • A. Eriguchi, K. Hashimoto, and Y. Tsuruoka (2016) Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 823–833. Cited by: §2.
  • M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits (2014) Unfolding physiological state: mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 75–84. Cited by: §4.3.2.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610. Cited by: §1.
  • P. Grnarova, F. Schmidt, S. L. Hyland, and C. Eickhoff (2016) Neural document embeddings for intensive care patient mortality prediction. arXiv preprint arXiv:1612.00467. Cited by: §4.3.1, §4.3.1, §4.3.1, §4.3.2, Table 3.
  • J. A. Hanley and B. J. McNeil (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), pp. 29–36. Cited by: §4.3.1.
  • F. Hill, K. Cho, and A. Korhonen (2016) Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367–1377. Cited by: §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §1.
  • A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific data 3, pp. 160035. Cited by: §4.3.1.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302. Cited by: §2.
  • P. Le and W. Zuidema (2015) Compositional distributional semantics with long short term memory. Lexical and Computational Semantics (*SEM 2015), pp. 10. Cited by: §1, §2.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196. Cited by: §4.3.2.
  • Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel (2015) Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677. Cited by: §4.3.3.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §2.
  • M. Pagliardini, P. Gupta, and M. Jaggi (2017) Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507. Cited by: §2.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark (2011) Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Critical care medicine 39 (5), pp. 952. Cited by: §4.3.2.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §1, §2, §2, Figure 4, §3.1, §3.1, §3.2.
  • Y. Wang, M. Huang, L. Zhao, et al. (2016) Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615. Cited by: §1.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §1, §2, §4.1, §4.1, §4.2.2.
  • H. Zhao, Z. Lu, and P. Poupart (2015) Self-adaptive hierarchical sentence model. In IJCAI, pp. 4069–4076. Cited by: §2.
  • X. Zhu, P. Sobihani, and H. Guo (2015) Long short-term memory over recursive structures. In International Conference on Machine Learning, pp. 1604–1612. Cited by: §1, §2.

Appendix A Wikipedia Dataset Details

The Wikipedia dataset is collected from the English Wikipedia dump of February 1st, 2018 (all Wikipedia dumps are freely available online). We use a slightly modified version of the WikiExtractor (the code used is available online) to extract articles from the unzipped .xml file.

The articles are filtered by length. We first compute the number of sentences, paragraphs and sections for each article to get an idea of the typical length of Wikipedia articles, and then set lower limits to filter out stubs. After checking the percentiles, we filter at the 25th percentile. In practice, this corresponds to filtering out articles with fewer than 2 sections, 3 paragraphs and 5 sentences.
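The length filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `ArticleStats` record and the helper names are hypothetical, and the sketch assumes that an article is treated as a stub as soon as it falls below any one of the three minima (the text does not say whether the conditions are combined with "and" or "or").

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class ArticleStats:
    """Hypothetical per-article length statistics."""
    title: str
    n_sections: int
    n_paragraphs: int
    n_sentences: int


# Minimum counts reported in the text (the 25th-percentile cut-offs).
MIN_SECTIONS, MIN_PARAGRAPHS, MIN_SENTENCES = 2, 3, 5


def first_quartile(values):
    """Return the 25th percentile of a list of counts (needs >= 2 values)."""
    return quantiles(values, n=4, method="inclusive")[0]


def is_stub(article):
    # Assumption: falling below ANY single minimum marks the article as a stub.
    return (
        article.n_sections < MIN_SECTIONS
        or article.n_paragraphs < MIN_PARAGRAPHS
        or article.n_sentences < MIN_SENTENCES
    )


def filter_stubs(articles):
    """Keep only the articles that meet all three minimum lengths."""
    return [a for a in articles if not is_stub(a)]
```

In this sketch, `first_quartile` would be applied to each of the three count lists to inspect the cut-offs before fixing them as constants.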

Afterwards, we keep only the articles that belong to exactly one of 24 categories; the per-category counts are detailed in Table 4. We determine an article's category by looking at keywords in its Wikipedia-tagged categories, not in the article itself, and these category tags are excluded from the body of the articles. The resulting dataset has 494,657 articles and is released as an open resource (the dataset link is to be added in the camera-ready version).
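The "exactly one category" rule can be illustrated with the sketch below. The keyword lists are hypothetical placeholders, since the actual keywords are not given in the text, and substring matching on lower-cased tags is an assumption about how the keywords are applied.

```python
# Hypothetical keyword lists: the paper does not list the actual keywords.
CATEGORY_KEYWORDS = {
    "Footballers": ["footballers"],
    "Novelists": ["novelists"],
    "Airlines": ["airlines"],
}


def match_categories(wiki_tags):
    """Return the set of dataset categories whose keywords occur in the tags."""
    matched = set()
    for tag in wiki_tags:
        lowered = tag.lower()
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                matched.add(category)
    return matched


def assign_label(wiki_tags):
    """Return the single matching category, or None if zero or several match."""
    matched = match_categories(wiki_tags)
    return next(iter(matched)) if len(matched) == 1 else None
```

Articles for which `assign_label` returns `None` would be dropped, so every remaining article carries one unambiguous label.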

| Category | Number of articles |
|---|---|
| Actors | 28,007 |
| Actresses | 22,208 |
| Aircraft | 12,278 |
| Airlines | 2,496 |
| Artists | 39,618 |
| Cities | 27,090 |
| Comedy | 14,680 |
| Directors | 19,218 |
| Documentaries | 3,848 |
| Drama | 20,523 |
| Footballers | 69,151 |
| Horror | 4,875 |
| Journalists | 15,363 |
| Languages | 6,779 |
| Military Personnel | 17,910 |
| Musicians | 17,603 |
| Novelists | 14,964 |
| Novels | 25,247 |
| Political Parties | 4,233 |
| Politicians | 56,130 |
| Singers | 17,055 |
| Television | 33,434 |
| Video Games | 20,059 |
| Wars | 1,888 |
| Total | 494,657 |
Table 4: Number of articles per category in the Wikipedia dataset used in the first experiment.

Appendix B Training Details and Hyperparameters

For all experiments, we used an Adam optimizer with a weight decay of and a learning rate of . The batch size and hidden layer dimensions are respectively 64 and 128 for the sent2vec-based experiments, and 32 and 64 for the word2vec-based experiments. All models were trained using PyTorch.

We use the 300-dimensional word2vec model pre-trained on the Google News corpus, and the 700-dimensional sent2vec model pre-trained on Wikipedia.
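For reference, the two experimental settings above can be collected in a small configuration sketch. The `ExperimentConfig` class and its field names are hypothetical; the numeric values are the ones stated in this appendix, and the learning rate and weight decay are left unset because their values are not recoverable from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the hyperparameters listed in this appendix."""
    embedding: str
    embedding_dim: int
    batch_size: int
    hidden_dim: int
    optimizer: str = "Adam"
    # Not recoverable from the text, so deliberately left unset.
    learning_rate: Optional[float] = None
    weight_decay: Optional[float] = None


CONFIGS = {
    # Sentence-level inputs: 700-dimensional sent2vec embeddings.
    "sent2vec": ExperimentConfig("sent2vec", 700, batch_size=64, hidden_dim=128),
    # Word-level inputs: 300-dimensional word2vec embeddings.
    "word2vec": ExperimentConfig("word2vec", 300, batch_size=32, hidden_dim=64),
}
```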
