Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network

Sunil Kumar Sahu National Centre for Text Mining,
School of Computer Science, The University of Manchester, United Kingdom
Fenia Christopoulou National Centre for Text Mining,
School of Computer Science, The University of Manchester, United Kingdom
Makoto Miwa Sophia Ananiadou Corresponding author.

Inter-sentence relation extraction deals with a number of complex semantic relationships in documents, which require local, non-local, syntactic and semantic dependencies. Existing methods do not fully exploit such dependencies. We present a novel inter-sentence relation extraction model that builds a labelled edge graph convolutional neural network model on a document-level graph. The graph is constructed using various inter- and intra-sentence dependencies to capture local and non-local dependency information. In order to predict the relation of an entity pair, we utilise multi-instance learning with bi-affine pairwise scoring. Experimental results show that our model achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. Our analysis shows that all the types in the graph are effective for inter-sentence relation extraction.

1 Introduction

Semantic relationships between named entities often span across multiple sentences. In order to extract inter-sentence relations, most approaches utilise distant supervision to automatically generate document-level corpora (peng2017; song2018n). Recently, Patrick2018 introduced multi-instance learning (MIL) (riedel2010; surdeanu2012) to treat multiple mentions of target entities in a document.

Inter-sentential relations depend not only on local but also on non-local dependencies. Dependency trees are often used to extract local dependencies of semantic relations (culotta2004dependency; liu2015dependency) in intra-sentence relation extraction (RE). However, such dependencies are not adequate for inter-sentence RE, since different sentences have different dependency trees. Figure 1 illustrates such a case between Oxytocin and hypotension. To capture their relation, it is essential to connect the co-referring entities Oxytocin and Oxt. RNNs and CNNs, which are often used for intra-sentence RE (Zeng14; dos2015; zhou2016; lin2016neural), are not effective on longer sequences (sunil2017) thus failing to capture such non-local dependencies.

Figure 1: Sentences with non-local dependencies between named entities. The red arrow represents a relation between co-referred entities and yellow arrows represent semantically dependent relations. Example adapted from the CDR dataset (biocreative2015overview).
Figure 2: Proposed model architecture. The input word sequence is mapped to a graph structure, where nodes are words and edges correspond to dependencies. We omit several edges, such as self-node edges of all words and syntactic dependency edges of different labels, for brevity. GCNN is employed to encode the graph and a bi-affine layer aggregates all mention pairs.

We propose a novel inter-sentence RE model that builds a labelled edge Graph CNN (GCNN) model (Marcheggiani2017) on a document-level graph. The graph nodes correspond to words and the edges represent local and non-local dependencies among them. The document-level graph is formed by connecting words with local dependencies from syntactic parsing and sequential information, as well as non-local dependencies from coreference resolution and other semantic dependencies (peng2017). We infer relations between entities using an MIL-based bi-affine pairwise scoring function (Patrick2018) on the entity node representations.

Our contribution is threefold. Firstly, we propose a novel model for inter-sentence RE using GCNN to capture local and non-local dependencies. Secondly, we apply the model on two biochemistry corpora and show its effectiveness. Finally, we developed a novel, distantly supervised dataset with chemical reactant-product relations from PubMed abstracts.[1]

[1] The dataset is publicly available at

2 Proposed Model

We formulate the inter-sentence, document-level RE task as a classification problem. Let $[w_1, w_2, \dots, w_n]$ be the words in a document $t$ and $e_1$ and $e_2$ be the entity pair of interest in $t$. We name the multiple occurrences of these entities in the document entity mentions. A relation extraction model takes a triple ($e_1$, $e_2$, $t$) as input and returns a relation for the pair, including the “no relation” category, as output. We assume that the relationship of the target entities in $t$ can be inferred based on all their mentions. We thus apply multi-instance learning on $t$ to combine all mention-level pairs and predict the final relation category of a target pair.

We describe the architecture of our proposed model in Figure 2. In the input layer, the model takes the entire abstract of a scientific article and two target entities with all their mentions. It then constructs a graph structure with words as nodes and labelled edges that correspond to local and non-local dependencies. Next, it encodes the graph structure using stacked GCNN layers and classifies the relation between the target entities by applying MIL (Patrick2018) to aggregate all mention pair representations.

2.1 Input Layer

In the input layer, we map each word $i$ and its relative positions to the first and second target entities into real-valued vectors, $\mathbf{w}_i$, $\mathbf{p}^1_i$, $\mathbf{p}^2_i$, respectively. As entities can have more than one mention, we calculate the relative position of a word from the closest target entity mention. For each word $i$, we concatenate the word and position representations into an input representation, $\mathbf{x}_i = [\mathbf{w}_i; \mathbf{p}^1_i; \mathbf{p}^2_i]$.
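The relative-position feature described above can be illustrated with a small sketch. It assumes that each mention is reduced to a single token index, and `closest_relative_positions` is a hypothetical helper name rather than anything from the authors' implementation; the resulting offsets would then be looked up in a position embedding table.

```python
def closest_relative_positions(num_words, mention_indices):
    """Signed offset from each word to the closest mention of one entity."""
    if not mention_indices:
        raise ValueError("an entity needs at least one mention")
    positions = []
    for i in range(num_words):
        # Pick the mention with the smallest absolute distance to word i
        # (on a tie, the mention listed first wins).
        nearest = min(mention_indices, key=lambda m: abs(i - m))
        positions.append(i - nearest)
    return positions


# Toy document of 8 words: entity 1 is mentioned at tokens 1 and 6,
# entity 2 at token 4 (illustrative values only).
p1 = closest_relative_positions(8, [1, 6])
p2 = closest_relative_positions(8, [4])
```

Each word then receives one offset per target entity, which matches the two position vectors that are concatenated with the word embedding.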

2.2 Graph Construction

In order to build a document-level graph for an entire abstract, we use the following categories of inter- and intra-sentence dependency edges, as shown with different colours in Figure 2.

Syntactic dependency edge: The syntactic structure of a sentence reveals helpful clues for intra-sentential RE (miwa-bansal2016). We thus use labelled syntactic dependency edges between the words of each sentence, by treating each syntactic dependency label as a different edge type.

Coreference edge: As coreference is an important indicator of local and non-local dependencies (ma2016unsupervised), we connect co-referring phrases in a document using coreference type edges.

Adjacent sentence edge: We connect the syntactic root of a sentence with the roots of the previous and next sentences with adjacent sentence type edges (peng2017) for non-local dependencies between neighbouring sentences.

Adjacent word edge: In order to keep sequential information among the words of a sentence, we connect each word with its previous and next words with adjacent word type edges.

Self-node edge: GCNN learns a node representation based solely on its neighbour nodes and their edge types. Hence, to include the node information itself into the representation, we form self-node type edges on all the nodes of the graph.
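The five edge categories above can be sketched as a single edge-list builder. This is only an illustration under stated assumptions: each sentence is given as a list of `(head, label)` pairs with one root marked by `-1`, coreference links are given as pairs of document-level token ids, and the edge type names (`"self"`, `"dep:..."`, `"adj_word"`, `"adj_sent"`, `"coref"`) are hypothetical. Only one direction per edge is stored here.

```python
def build_document_graph(sentences, coref_pairs):
    """Return (source, target, edge_type) triples over document-level token ids.

    sentences: list of sentences; each sentence is a list of (head, label)
               pairs, where head is a sentence-local index or -1 for the root.
    coref_pairs: list of (token_id, token_id) pairs of co-referring tokens.
    """
    edges, roots, offset = [], [], 0
    for sent in sentences:
        for i, (head, label) in enumerate(sent):
            node = offset + i
            edges.append((node, node, "self"))            # self-node edge
            if head == -1:
                roots.append(node)                        # syntactic root
            else:
                edges.append((offset + head, node, "dep:" + label))
            if i > 0:                                     # stays inside the sentence
                edges.append((node - 1, node, "adj_word"))
        offset += len(sent)
    # Link the root of each sentence to the root of the next sentence.
    for prev_root, next_root in zip(roots, roots[1:]):
        edges.append((prev_root, next_root, "adj_sent"))
    for a, b in coref_pairs:
        edges.append((a, b, "coref"))
    return edges


# Two toy sentences (3 and 2 tokens) and one coreference link.
toy_sentences = [
    [(1, "nsubj"), (-1, "root"), (1, "dobj")],
    [(-1, "root"), (0, "advmod")],
]
toy_edges = build_document_graph(toy_sentences, coref_pairs=[(0, 4)])
```

Keeping the edge type as a plain string makes it straightforward to count type frequencies later and to give each type its own parameters.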

2.3 GCNN Layer

We compute the representation of each input word by applying GCNN (kipf2016semi; defferrard2016) on the constructed document graph. GCNN is an advanced version of CNN for graph encoding that learns semantic representations for the graph nodes, while preserving its structural information. In order to learn edge type-specific representations, we use a labelled edge GCNN, which keeps separate parameters for each edge type (Shikhar2018). The GCNN iteratively updates the representation of each input word as follows:

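$$\mathbf{x}_i^{k+1} = f\Big(\sum_{u \in \nu(i)} \big(\mathbf{W}^{k}_{l(i,u)}\,\mathbf{x}_u^{k} + \mathbf{b}^{k}_{l(i,u)}\big)\Big)$$

where $\mathbf{x}_i^{k+1}$ is the $i$-th word representation resulting from the $k$-th GCNN block, $\nu(i)$ is the set of neighbouring nodes of $i$, $\mathbf{W}^{k}_{l(i,u)}$ and $\mathbf{b}^{k}_{l(i,u)}$ are the parameters of the $k$-th block for the edge type $l(i,u)$ between nodes $i$ and $u$, and $f$ is a non-linear activation function. We stack $K$ GCNN blocks to accumulate information from distant neighbouring nodes and use edge-wise gating to control information from neighbouring nodes.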
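As a rough, dependency-free sketch of one such block, the code below sums a type-specific linear transformation over the incoming edges of every node. It is a simplification under assumptions, not the authors' implementation: the non-linearity is taken to be ReLU, and edge-wise gating and direction-specific parameters are omitted. All names are hypothetical.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gcnn_block(x, edges, W, b):
    """One labelled-edge graph convolution block.

    x: list of node vectors.
    edges: (source, target, edge_type) triples; the source is the neighbour.
    W[edge_type], b[edge_type]: weight matrix and bias of that edge type.
    """
    dim_out = len(next(iter(b.values())))
    summed = [[0.0] * dim_out for _ in x]
    for u, i, edge_type in edges:
        # Message from neighbour u to node i through this edge type's parameters.
        message = matvec(W[edge_type], x[u])
        summed[i] = [s + m + bias
                     for s, m, bias in zip(summed[i], message, b[edge_type])]
    return [relu(row) for row in summed]


# Two nodes with self-node edges and one adjacent-word edge from node 0 to node 1.
x0 = [[1.0, -1.0], [0.0, 2.0]]
toy_graph = [(0, 0, "self"), (1, 1, "self"), (0, 1, "adj_word")]
W = {"self": [[1.0, 0.0], [0.0, 1.0]], "adj_word": [[2.0, 0.0], [0.0, 2.0]]}
b = {"self": [0.0, 0.0], "adj_word": [0.0, 0.0]}
h1 = gcnn_block(x0, toy_graph, W, b)
h2 = gcnn_block(h1, toy_graph, W, b)  # stacking a second block
```

Calling the function repeatedly corresponds to stacking blocks, so that after two calls a node has received information from nodes two edges away.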

Similar to Marcheggiani2017, we maintain separate parameters for each edge direction. We, however, limit the number of model parameters by keeping separate parameters only for the top-$N$ most frequent edge types and sharing one set of parameters across all the remaining edge types, which we call “rare” type edges. This helps avoid possible overfitting due to over-parameterisation for different edge types.
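The grouping of infrequent edge types can be sketched as follows. The function name and the toy edge type labels are hypothetical; the only idea taken from the text is that the most frequent types keep their own label while all others collapse into a shared “rare” label.

```python
from collections import Counter


def map_rare_edge_types(edge_types, top_n):
    """Keep the top_n most frequent edge types and relabel the rest as 'rare'."""
    counts = Counter(edge_types)
    frequent = {edge_type for edge_type, _ in counts.most_common(top_n)}
    return [t if t in frequent else "rare" for t in edge_types]


# Toy frequency profile: 5 self-node, 3 adjacent-word, and 4 less frequent edges.
toy_types = (["self"] * 5 + ["adj_word"] * 3
             + ["dep:nsubj"] * 2 + ["dep:dobj", "coref"])
mapped_types = map_rare_edge_types(toy_types, top_n=2)
```

In practice the counts would be computed once on the training set and the resulting set of frequent types reused for the development and test sets.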

2.4 MIL-based Relation Classification

Since each target entity can have multiple mentions in a document, we employ a multi-instance learning (MIL)-based classification scheme to aggregate the predictions of all target mention pairs using bi-affine pairwise scoring (Patrick2018). As shown in Figure 2, each word is firstly projected into two separate latent spaces using two-layered feed-forward neural networks (FFNN), which correspond to the first (head) or second (tail) argument of the target pair.

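$$\mathbf{x}_i^{head} = \mathbf{W}_{head}^{(1)}\,\mathrm{ReLU}\big(\mathbf{W}_{head}^{(0)}\,\mathbf{x}_i^{K}\big), \qquad \mathbf{x}_i^{tail} = \mathbf{W}_{tail}^{(1)}\,\mathrm{ReLU}\big(\mathbf{W}_{tail}^{(0)}\,\mathbf{x}_i^{K}\big)$$

where $\mathbf{x}_i^{K}$ corresponds to the representation of the $i$-th word after $K$ blocks of GCNN encoding, $\mathbf{W}^{(0)}$, $\mathbf{W}^{(1)}$ are the parameters of the two FFNNs for head and tail respectively and $\mathbf{x}_i^{head}$, $\mathbf{x}_i^{tail}$ are the resulting head/tail representations for the $i$-th word.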

Then, mention-level pairwise confidence scores are generated by a bi-affine layer and aggregated to obtain the entity-level pairwise score.

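$$\mathrm{score}(e_1, e_2) = \log \sum_{i \in E_1,\; j \in E_2} \exp\big((\mathbf{x}_i^{head}\,\mathbf{R})\,\mathbf{x}_j^{tail}\big)$$

where $\mathbf{R} \in \mathbb{R}^{d \times r \times d}$ is a learned bi-affine tensor, with $d$ the dimension of the head/tail representations and $r$ the number of relation categories, and $E_1$, $E_2$ denote the sets of mentions for the first (head) entity $e_1$ and the second (tail) entity $e_2$ respectively.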
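The scoring and aggregation steps can be sketched in plain Python. This is an illustrative simplification: the bi-affine tensor is stored as one $d \times d$ matrix per relation category (a layout chosen here for readability), mention-pair scores are aggregated with log-sum-exp, and all function names are hypothetical.

```python
import math


def biaffine_scores(head_vec, R, tail_vec):
    """Score every relation category for one mention pair: head^T R[r] tail."""
    scores = []
    for R_r in R:  # one d x d matrix per relation category
        projected = [sum(h * R_r[a][c] for a, h in enumerate(head_vec))
                     for c in range(len(tail_vec))]
        scores.append(sum(p * t for p, t in zip(projected, tail_vec)))
    return scores


def entity_pair_scores(head_mentions, tail_mentions, R):
    """Aggregate all mention-pair scores with log-sum-exp, per relation."""
    pair_scores = [biaffine_scores(h, R, t)
                   for h in head_mentions for t in tail_mentions]
    aggregated = []
    for r in range(len(R)):
        column = [scores[r] for scores in pair_scores]
        top = max(column)  # subtract the maximum for numerical stability
        aggregated.append(
            top + math.log(sum(math.exp(s - top) for s in column)))
    return aggregated


# Two relation categories over 2-dimensional head/tail representations.
R_toy = [[[1.0, 0.0], [0.0, 1.0]],   # relation 0: identity matrix
         [[0.0, 0.0], [0.0, 0.0]]]   # relation 1: all zeros
head_toy = [[1.0, 0.0], [0.0, 1.0]]  # two mentions of the head entity
tail_toy = [[1.0, 0.0]]              # one mention of the tail entity
entity_scores = entity_pair_scores(head_toy, tail_toy, R_toy)
```

Log-sum-exp acts as a smooth maximum, so a single confident mention pair can dominate the entity-level score while the other pairs still contribute to the gradient.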

3 Experimental Settings

We first briefly describe the datasets where the proposed model is evaluated along with their pre-processing. We then introduce the baseline models we use for comparison. Finally, we show the training settings.

3.1 Data Sets

We evaluated our model on two biochemistry datasets.
Chemical-Disease Relations dataset (CDR): The CDR dataset is a document-level, inter-sentence relation extraction dataset developed for the BioCreative V challenge (biocreative2015overview).

CHemical Reactions dataset (CHR): We created a document-level dataset with relations between chemicals using distant supervision. Firstly, we used the back-end of the semantic faceted search engine Thalia (Thalia2018) to obtain abstracts annotated with several biomedical named entities from PubMed. We selected chemical compounds from the annotated entities and aligned them with the graph database Biochem4j (Biochem4j). Biochem4j is a freely available database that integrates several resources such as UniProt, KEGG and NCBI Taxonomy. If two chemical entities have a relation in Biochem4j, we consider them as positive instances in the dataset, otherwise as negative.
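The distant labelling rule can be sketched as a lookup against a set of known relations. The sketch assumes that the knowledge base is available as a set of ordered `(reactant_id, product_id)` tuples, which is a simplification of a graph database query; the identifiers are made up for illustration. The ordered pairs and the skipping of self-relations follow the pre-processing described in Section 3.2.

```python
def label_candidate_pairs(chemicals, kb_relations):
    """Distantly label every ordered pair of distinct chemicals in one abstract.

    chemicals: KB IDs of the chemicals found in the abstract.
    kb_relations: set of (reactant_id, product_id) tuples from the KB.
    """
    labels = {}
    for first in chemicals:
        for second in chemicals:
            if first == second:
                continue  # self-relations are not candidates
            labels[(first, second)] = 1 if (first, second) in kb_relations else 0
    return labels


# Hypothetical identifiers, for illustration only.
toy_kb = {("chem:A", "chem:B")}
toy_labels = label_candidate_pairs(["chem:A", "chem:B", "chem:C"], toy_kb)
```

Because the pairs are ordered, a reaction from A to B yields a positive instance for (A, B) and a negative one for (B, A).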

3.2 Data Pre-processing

Table 1 shows the statistics for the CDR and CHR datasets. For both datasets, the annotated entities can have more than one associated Knowledge Base (KB) ID. If mentions shared at least one KB ID, we considered all of them to belong to the same entity. This technique results in fewer negative pairs. We ignored entities that were not grounded to a known KB ID and removed relations between the same entity (self-relations). For the CDR dataset, we performed hypernym filtering similar to gu2017 and Patrick2018. In the CHR dataset, both directions were generated for each candidate chemical pair, as chemicals can be either a reactant (first argument) or a product (second argument) in an interaction.
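The merging of mentions that share a KB ID can be sketched with a small union-find structure. One assumption goes beyond the text: merging is applied transitively, so mentions A and C end up in the same entity if both share an ID with mention B. The function name and IDs are hypothetical.

```python
def group_mentions_by_kb_id(mentions):
    """Group mentions that share at least one KB ID (transitively).

    mentions: list of sets of KB IDs, one set per mention.
    Returns a sorted list of mention-index lists, one list per entity.
    """
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # KB ID -> index of the first mention carrying it
    for idx, kb_ids in enumerate(mentions):
        for kb_id in kb_ids:
            if kb_id in first_seen:
                parent[find(idx)] = find(first_seen[kb_id])
            else:
                first_seen[kb_id] = idx

    groups = {}
    for idx in range(len(mentions)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Mentions 0-2 are chained through shared IDs; mention 3 stands alone.
toy_mentions = [{"D1"}, {"D1", "D2"}, {"D2"}, {"C9"}]
toy_entities = group_mentions_by_kb_id(toy_mentions)
```

Merging mentions this way reduces the number of distinct entities per document, which in turn reduces the number of candidate pairs labelled as negative.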

| Data | Count | Train | Dev. | Test |
|------|-------|-------|------|------|
| CDR | # Articles | 500 | 500 | 500 |
| CDR | # Positive pairs | 1,038 | 1,012 | 1,066 |
| CDR | # Negative pairs | 4,198 | 4,069 | 4,119 |
| CHR | # Articles | 7,298 | 1,182 | 3,614 |
| CHR | # Positive pairs | 19,643 | 3,185 | 9,578 |
| CHR | # Negative pairs | 69,843 | 11,466 | 33,339 |

Table 1: Statistics of the CDR and CHR datasets.

We processed the datasets using the GENIA Sentence Splitter and GENIA tagger (tsuruoka2005developing) for sentence splitting and word tokenisation, respectively. Syntactic dependencies were obtained using the Enju syntactic parser (miyao-tsujii-2008-feature) with predicate-argument structures. Coreference type edges were constructed using the Stanford CoreNLP software (corenlp:2014).

3.3 Baseline Models

For the CDR dataset, we compare with five state-of-the-art models: SVM (xu-EtAl:2016), an ensemble of feature-based and neural-based models (zhou2016cdr), CNN and Maximum Entropy (gu2017), Piece-wise CNN (Li2018) and Transformer (Patrick2018). We additionally prepare and evaluate the following models: CNN-RE, a re-implementation from kim2014 and zhou2016cdr, and RNN-RE, a re-implementation from sunil2017. In all of our models, we use bi-affine pairwise scoring to detect relations.

3.4 Model Training

We used 100-dimensional word embeddings trained on PubMed with GloVe (pennington2014glove; th2015evaluating). Unlike Patrick2018, we used the pre-trained word embeddings in place of sub-word embeddings to align with our word graphs. Due to the size of the CDR dataset, we merged the training and development sets to train the models, similarly to Jun2016 and gu2017. We report the performance as the average of five runs with different parameter initialisation seeds in terms of precision (P), recall (R) and F1-score. We used the frequencies of the edge types in the training set to choose the top-$N$ edge types in Section 2.3. We refer to the supplementary materials for the details of the training and hyper-parameter settings.
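For reference, the three reported metrics can be computed from sets of predicted and gold entity pairs as in the sketch below. The pair identifiers are toy values, and the helper name is hypothetical; per-seed scores computed this way would then be averaged over the five runs.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of predicted and gold positive pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 predicted pairs, 4 gold pairs, 2 of them in common.
toy_predicted = {("c1", "d1"), ("c1", "d2"), ("c2", "d1")}
toy_gold = {("c1", "d1"), ("c2", "d1"), ("c3", "d3"), ("c3", "d4")}
p, r, f1 = precision_recall_f1(toy_predicted, toy_gold)
```

Note that the F1 of averaged precision and recall generally differs from the average of per-run F1 scores, so the two should not be mixed when comparing systems.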

4 Results

We show the results of our model for the CDR and CHR datasets in Table 2. We report the performance of state-of-the-art models without any additional enhancements, such as joint training with NER, model ensembling and heuristic rules, to avoid any effects from the enhancements in the comparison. We observe that the GCNN outperforms the baseline models (CNN-RE/RNN-RE) on both datasets. However, on the CDR dataset, the F1-score of GCNN is 1.6 percentage points lower than that of the best performing system, gu2017 (58.6 vs. 60.2). In fact, gu2017 incorporates two separate neural and feature-based models for intra- and inter-sentence pairs, respectively, whereas we utilise a single model for both pairs. Additionally, GCNN performs comparably to the second-best state-of-the-art neural model, Li2018, which requires a two-step process for mention aggregation, unlike our unified approach.

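| Data | Model | P (%) | R (%) | F1 (%) |
|------|-------|-------|-------|--------|
| CDR | Jun2016 | 59.6 | 44.0 | 50.7 |
| CDR | zhou2016cdr | 64.8 | 49.2 | 56.0 |
| CDR | gu2017 | 60.9 | 59.5 | 60.2 |
| CDR | Li2018 | 55.1 | 63.6 | 59.1 |
| CDR | Patrick2018 | 49.9 | 63.8 | 55.5 |
| CDR | CNN-RE | 51.5 | 65.7 | 57.7 |
| CDR | RNN-RE | 52.6 | 62.9 | 57.3 |
| CDR | GCNN | 52.8 | 66.0 | 58.6 |
| CHR | CNN-RE | 81.2 | 87.3 | 84.1 |
| CHR | RNN-RE | 83.0 | 90.1 | 86.4 |
| CHR | GCNN | 84.7 | 90.5 | 87.5 |

Table 2: Performance on the CDR and CHR test sets in comparison with the state-of-the-art.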

Figure 3 illustrates the performance of our model on the CDR development set when using a varying number $N$ of most frequent edge types. While tuning $N$, we observed that performance peaked at a particular value of $N$ and deteriorated slightly when more edge types were given separate parameters. We used this best-performing $N$ in the other experiments.

Figure 3: Performance of the GCNN model on the CDR development set when using the top-$N$ most frequent edge types and considering the rest as a single “rare” type.

We perform ablation analysis on the CDR dataset by separating the development set into intra- and inter-sentence pairs (approximately 70% and 30% of pairs, respectively). Table 3 shows the performance when removing one edge category at a time. In general, all dependency types have positive effects on inter-sentence RE and the overall performance, although self-node and adjacent sentence edges slightly harm the performance on intra-sentence relations. Additionally, coreference edges have almost no effect on intra-sentence pairs.

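| Model | Overall | Intra | Inter |
|-------|---------|-------|-------|
| GCNN (best) | 57.19 | 63.43 | 36.90 |
| − Adjacent word | 55.75 | 62.53 | 35.61 |
| − Syntactic dependency | 56.12 | 62.89 | 34.75 |
| − Coreference | 56.44 | 63.27 | 35.65 |
| − Self-node | 56.85 | 63.84 | 33.20 |
| − Adjacent sentence | 57.00 | 63.99 | 35.20 |

Table 3: Ablation analysis on the CDR development set, in terms of F1-score (%), for intra- (Intra) and inter-sentence (Inter) pairs. Each row marked with − removes one edge category from the best model.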

5 Related Work

Inter-sentence RE is a recently introduced task. peng2017 and song2018n used graph-based LSTM networks for $n$-ary RE in multiple sentences for protein-drug-disease associations. They restricted the relation candidates in up to two-span sentences. Patrick2018 considered multi-instance learning for document-level RE. Our work differs from Patrick2018 in that we replace the Transformer with a GCNN model for full-abstract encoding using non-local dependencies such as entity coreference.

GCNN was first proposed by kipf2016semi and applied to citation networks and knowledge graph datasets. It was later used for semantic role labelling (Marcheggiani2017), multi-document summarization (Yasunaga2017) and temporal relation extraction (Shikhar2018). zhang2018graph used a GCNN on a dependency tree for intra-sentence RE. Unlike previous work, we introduce a GCNN on a document-level graph, with both intra- and inter-sentence dependencies, for inter-sentence RE.

6 Conclusion

We proposed a novel graph-based method for inter-sentence RE using a labelled edge GCNN model on a document-level graph. The graph is constructed with words as nodes and multiple intra- and inter-sentence dependencies between them as edges. A GCNN model is employed to encode the graph structure and MIL is incorporated to aggregate the multiple mention-level pairs. We show that our method achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. We tuned the number of labelled edge types to control the number of parameters in the labelled edge GCNN. Analysis showed that all edge types are effective for inter-sentence RE.

Although the model is applied to biochemistry corpora for inter-sentence RE, our method is also applicable to other relation extraction tasks. As future work, we plan to incorporate joint named entity recognition training as well as sub-word embeddings in order to further improve the performance of the proposed model.


This research was supported with funding from BBSRC, Enriching Metabolic PATHwaY models with evidence from the literature (EMPATHY) [Grant ID: BB/M006891/1] and AIRC/AIST. Results were obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).


Appendix A Training and Hyper-parameter Settings

We implemented all models using TensorFlow. The development set was used for hyper-parameter tuning. For all models, parameters were optimised using the Adam optimisation algorithm with exponential moving average (kingma2014adam), together with learning rate decay and gradient clipping. We used early stopping with a fixed patience, measured in epochs, in order to determine the best training epoch. For other hyper-parameters, we performed a non-exhaustive hyper-parameter search based on the development set. We used the same hyper-parameters for both the CDR and CHR datasets. The best hyper-parameter values are shown in Table 4.
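Early stopping with patience can be sketched as follows. The development scores and the patience value in the example are made up for illustration and are not the settings used in the experiments; the function name is hypothetical.

```python
def best_epoch_with_early_stopping(dev_scores, patience):
    """Return (best_epoch, best_score), stopping once `patience` epochs
    have passed without an improvement on the development set."""
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Illustrative development F1 scores per epoch (not real results).
toy_dev_scores = [50.1, 54.3, 56.0, 55.2, 55.9, 57.5]
stopped_early = best_epoch_with_early_stopping(toy_dev_scores, patience=2)
more_patient = best_epoch_with_early_stopping(toy_dev_scores, patience=3)
```

The example shows the usual trade-off: a short patience saves training time but can stop before a later improvement, while a longer patience reaches it at the cost of extra epochs.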

Hyper-parameter Value
Batch size 32
Learning rate 5 10
Word dimension 100
Position dimension 20
GCNN dimension 140
Number of GCNN blocks (K) 2
MIL feed-forward layer dimension 140
Dropout rate (input layer) 0.1
Dropout rate (GCNN layer) 0.05
Dropout rate (MIL feed-forward layer) 0.05
Residual connection on GCNN layer yes
Table 4: Best performing hyper-parameters used in the proposed model.