Inter-sentence Relation Extraction with Document-level
Graph Convolutional Neural Network
Inter-sentence relation extraction deals with a number of complex semantic relationships in documents, which require local, non-local, syntactic and semantic dependencies. Existing methods do not fully exploit such dependencies. We present a novel inter-sentence relation extraction model that builds a labelled edge graph convolutional neural network model on a document-level graph. The graph is constructed using various inter- and intra-sentence dependencies to capture local and non-local dependency information. In order to predict the relation of an entity pair, we utilise multi-instance learning with bi-affine pairwise scoring. Experimental results show that our model achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. Our analysis shows that all the types in the graph are effective for inter-sentence relation extraction.
1 Introduction
Semantic relationships between named entities often span multiple sentences. In order to extract inter-sentence relations, most approaches utilise distant supervision to automatically generate document-level corpora (peng2017; song2018n). Recently, Patrick2018 introduced multi-instance learning (MIL) (riedel2010; surdeanu2012) to treat multiple mentions of target entities in a document.
Inter-sentential relations depend not only on local but also on non-local dependencies. Dependency trees are often used to extract local dependencies of semantic relations (culotta2004dependency; liu2015dependency) in intra-sentence relation extraction (RE). However, such dependencies are not adequate for inter-sentence RE, since different sentences have different dependency trees. Figure 1 illustrates such a case between Oxytocin and hypotension. To capture their relation, it is essential to connect the co-referring entities Oxytocin and Oxt. RNNs and CNNs, which are often used for intra-sentence RE (Zeng14; dos2015; zhou2016; lin2016neural), are not effective on longer sequences (sunil2017) and thus fail to capture such non-local dependencies.
We propose a novel inter-sentence RE model that builds a labelled edge Graph CNN (GCNN) model (Marcheggiani2017) on a document-level graph. The graph nodes correspond to words, and the edges represent local and non-local dependencies among them. The document-level graph is formed by connecting words with local dependencies from syntactic parsing and sequential information, as well as non-local dependencies from coreference resolution and other semantic dependencies (peng2017). We infer relations between entities by applying a MIL-based bi-affine pairwise scoring function (Patrick2018) to the entity node representations.
Our contribution is threefold. Firstly, we propose a novel model for inter-sentence RE using GCNN to capture local and non-local dependencies. Secondly, we apply the model to two biochemistry corpora and show its effectiveness. Finally, we develop a novel, distantly supervised dataset with chemical reactant-product relations from PubMed abstracts. The dataset is publicly available at http://nactem.ac.uk/CHR/.
2 Proposed Model
We formulate the inter-sentence, document-level RE task as a classification problem. Let $[w_1, w_2, \ldots, w_n]$ be the words in a document $t$ and $e_1$ and $e_2$ be the entity pair of interest in $t$. We name the multiple occurrences of these entities in the document entity mentions. A relation extraction model takes a triple ($e_1$, $e_2$, $t$) as input and returns a relation for the pair, including the “no relation” category, as output. We assume that the relationship of the target entities in $t$ can be inferred based on all their mentions. We thus apply multi-instance learning on $t$ to combine all mention-level pairs and predict the final relation category of a target pair.
We describe the architecture of our proposed model in Figure 2. In the input layer, the model takes the entire abstract of a scientific article and two target entities with all their mentions. It then constructs a graph structure with words as nodes and labelled edges that correspond to local and non-local dependencies. Next, it encodes the graph structure using stacked GCNN layers and classifies the relation between the target entities by applying MIL (Patrick2018) to aggregate all mention pair representations.
2.1 Input Layer
In the input layer, we map each word $i$ and its relative positions to the first and second target entities into real-valued vectors, $\mathbf{w}_i$, $\mathbf{d}^1_i$, $\mathbf{d}^2_i$, respectively. As entities can have more than one mention, we calculate the relative position of a word from the closest target entity mention. For each word $i$, we concatenate the word and position representations into an input representation, $\mathbf{x}_i = [\mathbf{w}_i; \mathbf{d}^1_i; \mathbf{d}^2_i]$.
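The input layer can be illustrated with a minimal NumPy sketch. It assumes that mentions are given as token spans with an exclusive end offset, that every entity has at least one mention, that distances are clipped to a maximum value, and that each target entity has its own position embedding table. The function names, the clipping and the separate tables are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def relative_positions(n_words, mention_spans):
    """Signed distance from each word to the closest mention of one entity.

    mention_spans: non-empty list of (start, end) token offsets, end exclusive.
    """
    positions = np.zeros(n_words, dtype=int)
    for i in range(n_words):
        best = None
        for start, end in mention_spans:
            if start <= i < end:
                dist = 0                # word is inside the mention
            elif i < start:
                dist = i - start        # word precedes the mention
            else:
                dist = i - (end - 1)    # word follows the mention
            if best is None or abs(dist) < abs(best):
                best = dist
        positions[i] = best
    return positions


def input_representation(word_ids, pos1, pos2, word_emb, pos_emb1, pos_emb2, max_dist):
    """Concatenate word and position vectors: x_i = [w_i; d1_i; d2_i]."""
    # Clip distances and shift them to non-negative row indices (assumption).
    idx1 = np.clip(pos1, -max_dist, max_dist) + max_dist
    idx2 = np.clip(pos2, -max_dist, max_dist) + max_dist
    return np.concatenate(
        [word_emb[word_ids], pos_emb1[idx1], pos_emb2[idx2]], axis=1
    )
```

Each row of the returned matrix is the input representation of one word; its width is the word embedding size plus twice the position embedding size.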
2.2 Graph Construction
In order to build a document-level graph for an entire abstract, we use the following categories of inter- and intra-sentence dependency edges, as shown with different colours in Figure 2.
Syntactic dependency edge: The syntactic structure of a sentence reveals helpful clues for intra-sentential RE (miwa-bansal2016). We thus use labelled syntactic dependency edges between the words of each sentence, by treating each syntactic dependency label as a different edge type.
Coreference edge: As coreference is an important indicator of local and non-local dependencies (ma2016unsupervised), we connect co-referring phrases in a document using coreference type edges.
Adjacent sentence edge: We connect the syntactic root of a sentence with the roots of the previous and next sentences with adjacent sentence type edges (peng2017) for non-local dependencies between neighbouring sentences.
Adjacent word edge: In order to keep sequential information among the words of a sentence, we connect each word with its previous and next words with adjacent word type edges.
Self-node edge: GCNN learns a node representation based solely on its neighbour nodes and their edge types. Hence, to include the node information itself into the representation, we form self-node type edges on all the nodes of the graph.
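The five edge categories above can be assembled into one labelled edge list, as in the following sketch. It assumes that parsing and coreference resolution have already been run and that their output is available as plain Python lists over document-level word indices. The input format, the edge type names and the simplification of co-referring phrases to single word indices are assumptions made for illustration only.

```python
def build_document_graph(sentences, dependencies, roots, coref_pairs):
    """Return labelled edges (source, target, edge_type) over word indices.

    sentences:    list of (start, end) word offsets per sentence, end exclusive
    dependencies: list of (head, dependent, label) from a syntactic parser
    roots:        syntactic root word index of each sentence, in document order
    coref_pairs:  list of (i, j) word indices of co-referring phrases
    """
    edges = []
    # Syntactic dependency edges: one edge type per dependency label.
    for head, dependent, label in dependencies:
        edges.append((head, dependent, "dep:" + label))
    # Coreference edges between co-referring phrases.
    for i, j in coref_pairs:
        edges.append((i, j, "coref"))
    # Adjacent sentence edges between roots of neighbouring sentences.
    for prev_root, next_root in zip(roots, roots[1:]):
        edges.append((prev_root, next_root, "adj_sent"))
    # Adjacent word edges inside each sentence.
    for start, end in sentences:
        for i in range(start, end - 1):
            edges.append((i, i + 1, "adj_word"))
    # Self-node edges on every word.
    n_words = sentences[-1][1] if sentences else 0
    for i in range(n_words):
        edges.append((i, i, "self"))
    return edges
```

Only one direction is stored per edge here; reverse edges with their own type names could be appended in the same way if direction-specific parameters are needed.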
2.3 GCNN Layer
We compute the representation of each input word by applying GCNN (kipf2016semi; defferrard2016) on the constructed document graph. GCNN is an advanced version of CNN for graph encoding that learns semantic representations for the graph nodes, while preserving its structural information. In order to learn edge type-specific representations, we use a labelled edge GCNN, which keeps separate parameters for each edge type (Shikhar2018). The GCNN iteratively updates the representation of each input word as follows:
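$$\mathbf{x}_i^{k+1} = f\Big(\sum_{u \in \nu(i)} \big(\mathbf{W}^{k}_{l(i,u)}\,\mathbf{x}_u^{k} + \mathbf{b}^{k}_{l(i,u)}\big)\Big)$$
where $\mathbf{x}_i^{k+1}$ is the $i$-th word representation resulting from the $k$-th GCNN block, $\nu(i)$ is the set of neighbouring nodes of $i$, $\mathbf{W}^{k}_{l(i,u)}$ and $\mathbf{b}^{k}_{l(i,u)}$ are the parameters of the $k$-th block for the edge type $l$ between nodes $i$ and $u$, and $f$ is a non-linear activation function. We stack $K$ GCNN blocks to accumulate information from distant neighbouring nodes and use edge-wise gating to control information from neighbouring nodes.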
Similar to Marcheggiani2017, we maintain separate parameters for each edge direction. We, however, tune the number of model parameters by keeping separate parameters only for the top-$N$ types and using the same parameters for all the remaining edge types, named “rare” type edges. This can avoid possible overfitting due to over-parameterisation for different edge types.
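A single labelled edge GCNN block with shared parameters for rare edge types might look as follows in NumPy. This is a sketch under several assumptions: ReLU is used as the activation, edge-wise gating and residual connections are left out for brevity, and edge direction is expected to be encoded in the edge type name by the caller. The function and argument names are hypothetical.

```python
import numpy as np


def gcnn_block(x, edges, weights, biases, top_types):
    """One labelled edge GCNN block (no gating, no residual connection).

    x:         (n_words, d_in) node representations from the previous block
    edges:     list of (source, target, edge_type); information flows
               from source to target
    weights:   dict edge_type -> (d_in, d_out) matrix, with a "rare" entry
    biases:    dict edge_type -> (d_out,) vector, with a "rare" entry
    top_types: set of edge types that keep their own parameters
    """
    d_out = weights["rare"].shape[1]
    out = np.zeros((x.shape[0], d_out))
    for source, target, edge_type in edges:
        # Edge types outside the top-N set share the "rare" parameters.
        key = edge_type if edge_type in top_types else "rare"
        out[target] += x[source] @ weights[key] + biases[key]
    return np.maximum(out, 0.0)  # ReLU activation (assumption)


def gcnn_encode(x, edges, blocks, top_types):
    """Stack several blocks; `blocks` is a list of (weights, biases) pairs."""
    for weights, biases in blocks:
        x = gcnn_block(x, edges, weights, biases, top_types)
    return x
```

Because every block only aggregates direct neighbours, stacking several blocks is what lets a word receive information from nodes that are several edges away.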
2.4 MIL-based Relation Classification
Since each target entity can have multiple mentions in a document, we employ a multi-instance learning (MIL)-based classification scheme to aggregate the predictions of all target mention pairs using bi-affine pairwise scoring (Patrick2018). As shown in Figure 2, each word is firstly projected into two separate latent spaces using two-layered feed-forward neural networks (FFNN), which correspond to the first (head) or second (tail) argument of the target pair.
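$$\mathbf{x}_i^{\mathrm{head}} = \mathbf{W}^{(1)}_{\mathrm{head}}\big(\mathrm{ReLU}(\mathbf{W}^{(0)}_{\mathrm{head}}\,\mathbf{x}_i^{K})\big), \qquad \mathbf{x}_i^{\mathrm{tail}} = \mathbf{W}^{(1)}_{\mathrm{tail}}\big(\mathrm{ReLU}(\mathbf{W}^{(0)}_{\mathrm{tail}}\,\mathbf{x}_i^{K})\big)$$
where $\mathbf{x}_i^{K}$ corresponds to the representation of the $i$-th word after $K$ blocks of GCNN encoding, $\mathbf{W}^{(0)}$, $\mathbf{W}^{(1)}$ are the parameters of two FFNNs for head and tail respectively and $\mathbf{x}_i^{\mathrm{head}}$, $\mathbf{x}_i^{\mathrm{tail}}$ are the resulting head/tail representations for the $i$-th word.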
Then, mention-level pairwise confidence scores are generated by a bi-affine layer and aggregated to obtain the entity-level pairwise score.
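$$\mathrm{score}(e^{\mathrm{head}}, e^{\mathrm{tail}}) = \log \sum_{i \in E^{\mathrm{head}},\, j \in E^{\mathrm{tail}}} \exp\big((\mathbf{x}_i^{\mathrm{head}}\,\mathbf{R})\,\mathbf{x}_j^{\mathrm{tail}}\big)$$
where $\mathbf{R} \in \mathbb{R}^{d \times r \times d}$ is a learned bi-affine tensor with $r$ the number of relation categories, and $E^{\mathrm{head}}$, $E^{\mathrm{tail}}$ denote the sets of mentions for entities $e^{\mathrm{head}}$ and $e^{\mathrm{tail}}$ respectively.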
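The scoring step can be sketched in NumPy as below. The sketch assumes that each mention is represented by a single word index, that the two-layer FFNNs have no bias terms, and that mention-level scores are aggregated with a log-sum-exp over all mention pairs. The parameter names in `params` are hypothetical.

```python
import numpy as np


def ffnn(x, w0, w1):
    """Two-layer feed-forward projection with a ReLU in between."""
    return np.maximum(x @ w0, 0.0) @ w1


def entity_pair_scores(x, head_mentions, tail_mentions, params, R):
    """Entity-level score for every relation category.

    x:             (n_words, d_x) word representations after GCNN encoding
    head_mentions: word indices of the mentions of the first (head) entity
    tail_mentions: word indices of the mentions of the second (tail) entity
    R:             (d, r, d) bi-affine tensor, r = number of relation categories
    """
    x_head = ffnn(x, params["head_w0"], params["head_w1"])
    x_tail = ffnn(x, params["tail_w0"], params["tail_w1"])
    h = x_head[head_mentions]                      # (m_head, d)
    t = x_tail[tail_mentions]                      # (m_tail, d)
    # Mention-level pairwise scores, shape (m_head, m_tail, r).
    mention_scores = np.einsum("id,drk,jk->ijr", h, R, t)
    flat = mention_scores.reshape(-1, R.shape[1])  # (m_head * m_tail, r)
    # Numerically stable log-sum-exp over all mention pairs.
    peak = flat.max(axis=0)
    return peak + np.log(np.exp(flat - peak).sum(axis=0))
```

The log-sum-exp acts as a smooth maximum, so a single confident mention pair can dominate the entity-level score while the other pairs still contribute.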
3 Experimental Settings
We first briefly describe the datasets on which the proposed model is evaluated, along with their pre-processing. We then introduce the baseline models we use for comparison. Finally, we show the training settings.
3.1 Data Sets
We evaluated our model on two biochemistry datasets.
Chemical-Disease Relations dataset (CDR): The CDR dataset is a document-level, inter-sentence relation extraction dataset developed for the BioCreative V challenge (biocreative2015overview).
CHemical Reactions dataset (CHR): We created a document-level dataset with relations between chemicals using distant supervision. Firstly, we used the back-end of the semantic faceted search engine Thalia (http://www.nactem.ac.uk/Thalia/; Thalia2018) to obtain abstracts annotated with several biomedical named entities from PubMed. We selected chemical compounds from the annotated entities and aligned them with the graph database Biochem4j (Biochem4j). Biochem4j (http://biochem4j.org) is a freely available database that integrates several resources such as UniProt, KEGG and NCBI Taxonomy. If two chemical entities have a relation in Biochem4j, we consider them as positive instances in the dataset, otherwise as negative.
3.2 Data Pre-processing
Table 1 shows the statistics for the CDR and CHR datasets. For both datasets, the annotated entities can have more than one associated Knowledge Base (KB) ID. If mentions shared at least one KB ID, we considered all of them to belong to the same entity. This technique results in fewer negative pairs. We ignored entities that were not grounded to a known KB ID and removed relations between the same entity (self-relations). For the CDR dataset, we performed hypernym filtering similar to gu2017 and Patrick2018. In the CHR dataset, both directions were generated for each candidate chemical pair, as chemicals can be either a reactant (first argument) or a product (second argument) in an interaction.
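Table 1: Statistics of the CDR and CHR datasets.
|Data||Count||Train||Dev||Test|
|CDR||# Positive pairs||1,038||1,012||1,066|
|CDR||# Negative pairs||4,198||4,069||4,119|
|CHR||# Positive pairs||19,643||3,185||9,578|
|CHR||# Negative pairs||69,843||11,466||33,339|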
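The entity merging and candidate pair generation described in the pre-processing paragraph can be sketched as follows. The sketch assumes that each mention carries a set of KB IDs, that merging is transitive (two mentions end up in the same entity if they are linked through a chain of shared IDs), and that candidate pairs are generated in both directions as for the CHR dataset. The function names and the input format are hypothetical.

```python
from itertools import permutations


def merge_mentions(mentions):
    """Group mentions that share at least one KB ID into entities.

    mentions: list of (mention_id, set_of_kb_ids).
    Returns a list of sets of mention ids, one set per entity.
    """
    groups = []  # list of (kb_ids, mention_ids); KB ID sets stay disjoint
    for mention_id, kb_ids in mentions:
        if not kb_ids:
            continue  # ignore mentions not grounded to a known KB ID
        merged_ids, merged_mentions = set(kb_ids), {mention_id}
        remaining = []
        for group_ids, group_mentions in groups:
            if group_ids & merged_ids:
                merged_ids |= group_ids
                merged_mentions |= group_mentions
            else:
                remaining.append((group_ids, group_mentions))
        groups = remaining + [(merged_ids, merged_mentions)]
    return [mention_ids for _, mention_ids in groups]


def candidate_pairs(entities):
    """All ordered pairs of distinct entities (both directions, no self-relations)."""
    return list(permutations(range(len(entities)), 2))
```

For a dataset with a fixed argument order, such as chemical-disease pairs, only one direction per pair would be kept instead of both.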
We processed the datasets using the GENIA Sentence Splitter (http://www.nactem.ac.uk/y-matsu/geniass/) and GENIA tagger (tsuruoka2005developing) for sentence splitting and word tokenisation, respectively. Syntactic dependencies were obtained using the Enju syntactic parser (miyao-tsujii-2008-feature) with predicate-argument structures. Coreference type edges were constructed using the Stanford CoreNLP software (corenlp:2014).
3.3 Baseline Models
For the CDR dataset, we compare with five state-of-the-art models: SVM (xu-EtAl:2016), an ensemble of feature-based and neural-based models (zhou2016cdr), CNN and Maximum Entropy (gu2017), Piece-wise CNN (Li2018) and Transformer (Patrick2018). We additionally prepare and evaluate the following models: CNN-RE, a re-implementation from kim2014 and zhou2016cdr, and RNN-RE, a re-implementation from sunil2017. In all models we use bi-affine pairwise scoring to detect relations.
3.4 Model Training
We used 100-dimensional word embeddings trained on PubMed with GloVe (pennington2014glove; th2015evaluating). Unlike Patrick2018, we used the pre-trained word embeddings in place of sub-word embeddings to align with our word graphs. Due to the size of the CDR dataset, we merged the training and development sets to train the models, similarly to Jun2016 and gu2017. We report the performance as the average of five runs with different parameter initialisation seeds in terms of precision (P), recall (R) and F1-score. We used the frequencies of the edge types in the training set to choose the top-$N$ edge types in Section 2.3. We refer to the supplementary materials for the details of the training and hyper-parameter settings.
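Choosing the most frequent edge types from the training set amounts to a simple frequency count. The sketch below assumes that each training document graph is a list of (source, target, edge type) triples; that format and the function name are illustrative.

```python
from collections import Counter


def select_top_edge_types(training_graphs, n):
    """Return the n most frequent edge types over all training graphs.

    training_graphs: iterable of edge lists, each edge a
                     (source, target, edge_type) triple.
    """
    counts = Counter(
        edge_type
        for edges in training_graphs
        for _, _, edge_type in edges
    )
    return {edge_type for edge_type, _ in counts.most_common(n)}
```

Every edge type outside the returned set would then share the parameters of the rare type described in Section 2.3.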
4 Results
We show the results of our model for the CDR and CHR datasets in Table 2. We report the performance of state-of-the-art models without any additional enhancements, such as joint training with NER, model ensembling and heuristic rules, to avoid any effects from the enhancements in the comparison. We observe that the GCNN outperforms the baseline models (CNN-RE/RNN-RE) on both datasets. However, on the CDR dataset, the performance of the GCNN is lower than that of the best performing system (gu2017). In fact, gu2017 incorporates two separate neural and feature-based models for intra- and inter-sentence pairs, respectively, whereas we utilise a single model for both kinds of pairs. Additionally, the GCNN performs comparably to the second-best state-of-the-art neural model, Li2018, which requires a two-step process for mention aggregation, unlike our unified approach.
|Data||Model||P (%)||R (%)||F1 (%)|
Figure 3 illustrates the performance of our model on the CDR development set when using a varying number $N$ of most frequent edge types. While tuning $N$, we observed that the performance peaked at one value of $N$ and slightly deteriorated when more edge types kept their own parameters. We used this best-performing value of $N$ in the other experiments.
We perform an ablation analysis on the CDR dataset by separating the development set into intra- and inter-sentence pairs (approximately 70% and 30% of pairs, respectively). Table 3 shows the performance when removing one edge category at a time. In general, all dependency types have positive effects on inter-sentence RE and the overall performance, although self-node and adjacent sentence edges slightly harm the performance on intra-sentence relations. Additionally, coreference does not affect intra-sentence pairs.
5 Related Work
Inter-sentence RE is a recently introduced task. peng2017 and song2018n used graph-based LSTM networks for $n$-ary RE in multiple sentences for protein-drug-disease associations. They restricted the relation candidates to those spanning up to two sentences. Patrick2018 considered multi-instance learning for document-level RE. Our work differs from Patrick2018 in that we replace the Transformer with a GCNN model for full-abstract encoding using non-local dependencies such as entity coreference.
GCNN was first proposed by kipf2016semi and applied to citation networks and knowledge graph datasets. It was later used for semantic role labelling (Marcheggiani2017), multi-document summarization (Yasunaga2017) and temporal relation extraction (Shikhar2018). zhang2018graph used a GCNN on a dependency tree for intra-sentence RE. Unlike previous work, we introduced a GCNN on a document-level graph, with both intra- and inter-sentence dependencies for inter-sentence RE.
6 Conclusion
We proposed a novel graph-based method for inter-sentence RE using a labelled edge GCNN model on a document-level graph. The graph is constructed with words as nodes and multiple intra- and inter-sentence dependencies between them as edges. A GCNN model is employed to encode the graph structure and MIL is incorporated to aggregate the multiple mention-level pairs. We show that our method achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets. We tuned the number of labelled edge types to limit the number of parameters in the labelled edge GCNN. Analysis showed that all edge types are effective for inter-sentence RE.
Although the model is applied to biochemistry corpora for inter-sentence RE, our method is also applicable to other relation extraction tasks. As future work, we plan to incorporate joint named entity recognition training as well as sub-word embeddings in order to further improve the performance of the proposed model.
Acknowledgments
This research was supported with funding from BBSRC, Enriching Metabolic PATHwaY models with evidence from the literature (EMPATHY) [Grant ID: BB/M006891/1] and AIRC/AIST. Results were obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
Appendix A Training and Hyper-parameter Settings
We implemented all models using TensorFlow (https://www.tensorflow.org). The development set was used for hyper-parameter tuning. For all models, parameters were optimised using the Adam optimisation algorithm with exponential moving average (kingma2014adam), together with learning rate decay and gradient clipping. We used early stopping with a fixed patience (in epochs) in order to determine the best training epoch. For other hyper-parameters, we performed a non-exhaustive hyper-parameter search based on the development set. We used the same hyper-parameters for both the CDR and CHR datasets. The best hyper-parameter values are shown in Table 4.
|Learning rate||5 10|
|Number of GCNN blocks ($K$)||2|
|MIL feed-forward layer dimension||140|
|Dropout rate (input layer)||0.1|
|Dropout rate (GCNN layer)||0.05|
|Dropout rate (MIL feed-forward layer)||0.05|
|Residual connection on GCNN layer||yes|