Recurrent Event Network: Global Structure Inference over Temporal Knowledge Graph

Woojeong Jin, He Jiang, Meng Qu, Tong Chen, Changlin Zhang, Pedro Szekely, Xiang Ren
Department of Computer Science, University of Southern California
Mila-Quebec Institute for Learning Algorithms
School of Computer Science, Carnegie Mellon University
Information Sciences Institute, University of Southern California
{woojeong.jin, jian567, changlin.zhang, xiangren}@usc.edu
meng.qu@umontreal.ca, tongc2@andrew.cmu.edu, pszekely@isi.edu
Abstract

Modeling dynamically-evolving, multi-relational graph data has received a surge of interest with the rapid growth of heterogeneous event data. However, predicting future events on such data requires global structure inference over time and the ability to integrate temporal and structural information, which are not yet well understood. We present Recurrent Event Network (RE-Net), a novel autoregressive architecture for modeling temporal sequences of multi-relational graphs (e.g., temporal knowledge graphs), which can perform sequential, global structure inference over future time stamps to predict new events. RE-Net employs a recurrent event encoder to model the temporally conditioned joint probability distribution for the event sequences, and equips the event encoder with a neighborhood aggregator for modeling the concurrent events within a time window associated with each entity. We apply teacher forcing for model training over historical data, and infer graph sequences over future time stamps by sampling from the learned joint distribution in a sequential manner. We evaluate the proposed method via temporal link prediction on five public datasets. Extensive experiments demonstrate the strength of RE-Net, especially on multi-step inference over future time stamps. Code and data are published at https://github.com/INK-USC/RE-Net.

1 Introduction

Representation learning on dynamically-evolving, graph-structured data has emerged as an important problem in a wide range of applications, including social network analysis (Zhou et al., 2018; Trivedi et al., 2019), knowledge graph reasoning (Trivedi et al., 2017; Nguyen et al., 2018; Kazemi et al., 2019), event forecasting (Du et al., 2016), and recommender systems (Kumar et al., 2019; You et al., 2019). Previous methods over dynamic graphs mainly focus on learning time-sensitive structure representations for node classification and link prediction in single-relational graphs. However, the rapid growth of heterogeneous event data (Mahdisoltani et al., 2014; Boschee et al., 2015) has created new challenges on modeling temporal, complex interactions between entities (i.e., viewed as a temporal knowledge graph or a TKG), and calls for approaches that can predict new events in different future time stamps based on the history—i.e., structure inference of a TKG over time.

Recent attempts on learning over temporal knowledge graphs have focused on either predicting missing events (facts) for the observed time stamps (García-Durán et al., 2018; Dasgupta et al., 2018; Leblay & Chekol, 2018), or estimating the conditional probability of observing a future event using temporal point processes (Trivedi et al., 2017, 2019). However, the former group of methods adopts an interpolation problem formulation over TKGs and thus cannot predict future events, as representations of unseen time stamps are unavailable. The latter group of methods, including Know-Evolve and its extension, DyRep, computes the probability of future events using ground truths of the preceding events during inference time, and cannot model concurrent events occurring within the same time window—which often happens when event time stamps are discrete. It is thus desirable to have a principled method that can infer graph structure sequentially over time and can incorporate local structural information (e.g., concurrent events) during temporal modeling.

To this end, we propose a sequential structure inference architecture, called Recurrent Event Network (RE-Net), for modeling heterogeneous event data in the form of temporal knowledge graphs. Key ideas of RE-Net are based on the following observations: (1) predicting future events can be viewed as a sequential (multi-step) inference of multi-relational interactions between entities over time; (2) temporally adjacent events may carry related semantics and informative patterns, which can further help inform future events (i.e., temporal information); and (3) multiple events may co-occur within the same time window and exhibit structural dependencies as they share entities (i.e., local structural information). To incorporate these ideas, RE-Net defines the joint probability distribution of all the events in a TKG in an autoregressive fashion, where it models the probability distribution of the concurrent events at the current time step conditioned on all the preceding events (see Fig. 1(b) for an illustration). Specifically, a recurrent event encoder, parametrized by RNNs, is used to summarize information of the past event sequences, and a neighborhood aggregator is employed to aggregate the information of concurrent events for the related entity within each time stamp. With the summarized information of the past event sequences, our decoder defines the joint probability of a current event. Such an autoregressive model can be effectively trained by using teacher forcing. Global structure inference for predicting future events can be achieved by performing sampling in a sequential manner.

We evaluate our proposed method on temporal link prediction task, by testing the performance of multi-step inference over time on five public temporal knowledge graph datasets. Experimental results demonstrate that RE-Net outperforms state-of-the-art models of both static and temporal graph reasoning, showing its better capacity to model temporal, multi-relational graph data with concurrent events. We further show that RE-Net can perform effective multi-step inference to predict unseen entity relationships in a distant future.

(a) An example of a temporal KG
(b) Overview of the RE-Net architecture
Figure 1: Illustration of (a) a temporal knowledge graph and (b) the Recurrent Event Network architecture. RE-Net employs an RNN to capture $s$-related interactions $N_t^{(s)}$ (modeled by a neighborhood aggregator) at different times $t$. The global information from $G_t$ is also used to capture the global graph structures. The recurrent event encoder updates its state with graph sequences in an autoregressive manner. The decoder defines the probability of events at the current time step conditioned on the preceding events.

2 Related Work

Our work is related to representation learning methods for static, multi-relational graphs, previous studies on temporal knowledge graph reasoning, and recent advancements in recurrent graph models.

Temporal KG Reasoning and Link Prediction. There are some recent attempts at incorporating temporal information in modeling dynamic knowledge graphs. Trivedi et al. (2017) presented Know-Evolve, which models the occurrence of a fact as a temporal point process. However, this method is built on a problematic formulation when dealing with concurrent events, as shown in Section E. Several embedding-based methods have been proposed (García-Durán et al., 2018; Leblay & Chekol, 2018; Dasgupta et al., 2018) to model time information. They embed the associated time information into a low-dimensional space, e.g., as relation embeddings with an RNN over the text of time (García-Durán et al., 2018), time embeddings (Leblay & Chekol, 2018), and temporal hyperplanes (Dasgupta et al., 2018). However, these models do not capture temporal dependency and cannot generalize to unobserved time stamps.

Static KG Completion and Embedding. Extensive studies have been done on modeling static, multi-relational graph data for link prediction. There are methods which embed entities and relations into low-dimensional spaces (Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2016; Dettmers et al., 2018). Among these methods, Relational Graph Convolutional Networks (RGCN) (Schlichtkrull et al., 2018) generalized previous GCN work (Kipf & Welling, 2016) to directed, multi-relational graphs such as knowledge graphs. These methods achieve high accuracy in reasoning over static knowledge graphs. However, they cannot deal with the temporal evolution of knowledge graphs.

Recurrent Graph Neural Models. There have been some studies on recurrent graph neural models for sequential or temporal graph-structured data (Sanchez-Gonzalez et al., 2018; Battaglia et al., 2018; Palm et al., 2018; Seo et al., 2017; Pareja et al., 2019). These methods adopt a message-passing framework for aggregating nodes’ neighborhood information (e.g., via graph convolutional operations). GN (Sanchez-Gonzalez et al., 2018; Battaglia et al., 2018) and RRN (Palm et al., 2018) update node representations by a message-passing scheme between time stamps. EvolveGCN (Pareja et al., 2019) and GCRN (Seo et al., 2017) introduce an RNN to update node representations across different time stamps. In contrast, our proposed method, RE-Net, augments an RNN with a message-passing procedure over entity neighborhoods to encode the temporal dependency between (concurrent) events (i.e., entity interactions), instead of using the RNN to memorize historical information about the node representations.

3 Proposed Method: RE-Net

We consider a temporal knowledge graph (TKG) as a multi-relational, directed graph with time-stamped edges (relationships) between nodes (entities). An event is defined as a time-stamped edge, i.e., (subject entity, relation, object entity, time), and is denoted by a quadruple $(s, r, o, t)$. We denote the set of events at time $t$ as $G_t$. A TKG is built upon a sequence of event quadruples ordered ascending by their time stamps, i.e., $\{(s_i, r_i, o_i, t_i)\}_i$ (with $t_i \le t_j$ for $i < j$), where each time-stamped edge has a direction pointing from the subject entity to the object entity. (The same triple $(s, r, o)$ may occur multiple times in different time stamps, yielding different event quadruples.) The goal of learning generative models of events is to learn a distribution over temporal knowledge graphs, based on a set of observed event sets $\{G_1, G_2, \ldots, G_T\}$. To model lasting events which span over a time range, i.e., $(s, r, o, [t_1, t_2])$, we simply partition such an event into a sequence of time-stamped events $\{(s, r, o, t_1), (s, r, o, t_1 + 1), \ldots, (s, r, o, t_2)\}$. We leave more sophisticated modeling of lasting events as future work.

3.1 Recurrent Event Network

Sequential Structure Inference in TKG. The key idea in RE-Net is to define the joint distribution of all the events $G_{1:T}$ in an autoregressive manner, i.e., $p(G_{1:T}) = \prod_{t} p(G_t \mid G_{t-m:t-1})$. Basically, we decompose the joint distribution into a sequence of conditional distributions (e.g., $p(G_t \mid G_{t-m:t-1})$), where we assume the probability of the events at a time step, e.g., $G_t$, only depends on the events at the previous $m$ steps, e.g., $G_{t-m:t-1}$. For each conditional distribution $p(G_t \mid G_{t-m:t-1})$, we further assume that the events in $G_t$ are mutually independent given the previous events $G_{t-m:t-1}$. In this way, the joint distribution can be rewritten as follows.

$p(G_{1:T}) = \prod_{t} p(G_t \mid G_{t-m:t-1}) = \prod_{t} \prod_{(s, r, o) \in G_t} p(s, r, o \mid G_{t-m:t-1})$   (1)

Intuitively, the generation process of each triplet $(s, r, o)$ is defined as below. Given all the past events $G_{t-m:t-1}$, we first generate a subject entity $s$ through the distribution $p(s \mid G_{t-m:t-1})$. Then we further generate a relation $r$ with $p(r \mid s, G_{t-m:t-1})$, and finally the object entity $o$ is generated by defining $p(o \mid s, r, G_{t-m:t-1})$.

In this work, we assume that $p(o \mid s, r, G_{t-m:t-1})$ and $p(r \mid s, G_{t-m:t-1})$ depend only on the events that are related to $s$, and focus on modeling the following joint probability:

$p(s, r, o \mid G_{t-m:t-1}) = p(o \mid s, r, N^{(s)}_{t-m:t-1}) \cdot p(r \mid s, N^{(s)}_{t-m:t-1}) \cdot p(s \mid G_{t-m:t-1}),$   (2)

where $G_{t-m:t-1}$ becomes $N^{(s)}_{t-m:t-1}$, which is the set of neighboring entities that interacted with subject entity $s$ under all relations at time stamps $\{t-m, \ldots, t-1\}$. For the third probability, the whole event sets $G_{t-m:t-1}$ should be considered since the subject is not given. Next, we introduce how we parameterize these distributions.

Recurrent Event Encoder. RE-Net parameterizes $p(o \mid s, r, N^{(s)}_{t-m:t-1})$ in the following way:

$p(o_t \mid s, r, N^{(s)}_{t-m:t-1}) = \mathrm{softmax}\big(\mathbf{W}_{o}\,[\mathbf{e}_s : \mathbf{e}_r : \mathbf{h}_{t-1}(s, r)]\big),$   (3)

where $\mathbf{e}_s, \mathbf{e}_r \in \mathbb{R}^d$ are learnable embedding vectors specified for subject entity $s$ and relation $r$. $\mathbf{h}_{t-1}(s, r)$ is a history vector which encodes the information from the neighbor sets that interacted with $s$ in the past, as well as the global information from the graph structures of $G_{t-m:t-1}$. Basically, $[\mathbf{e}_s : \mathbf{e}_r : \mathbf{h}_{t-1}(s, r)]$ is an encoding that summarizes all the past information. Based on that, we further compute the probability of different object entities by passing the encoding into a linear softmax classifier parameterized by $\mathbf{W}_o$.
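
As an illustration, the object decoder in equation 3 amounts to a linear softmax classifier over the concatenation of the subject embedding, relation embedding, and history vector. The following PyTorch sketch makes this concrete; the module, dimension, and variable names are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ObjectDecoder(nn.Module):
    """Sketch of equation 3: a linear softmax classifier over [e_s : e_r : h_{t-1}(s, r)]."""

    def __init__(self, num_entities: int, num_relations: int,
                 embed_dim: int = 200, hist_dim: int = 200):
        super().__init__()
        self.ent_embeds = nn.Embedding(num_entities, embed_dim)
        self.rel_embeds = nn.Embedding(num_relations, embed_dim)
        # W_o maps the concatenated encoding to one score per candidate object entity.
        self.w_o = nn.Linear(2 * embed_dim + hist_dim, num_entities)

    def forward(self, s_idx, r_idx, h_prev):
        # s_idx, r_idx: LongTensors of shape (batch,); h_prev: (batch, hist_dim)
        enc = torch.cat([self.ent_embeds(s_idx), self.rel_embeds(r_idx), h_prev], dim=-1)
        return torch.log_softmax(self.w_o(enc), dim=-1)  # log p(o | s, r, history)

# Toy usage: log-probabilities for 3 queries over 50 candidate objects.
decoder = ObjectDecoder(num_entities=50, num_relations=10)
s, r = torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5])
log_probs = decoder(s, r, torch.zeros(3, 200))  # shape (3, 50)
```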

Similarly, we define the probabilities for relations and subjects as follows:

$p(r_t \mid s, N^{(s)}_{t-m:t-1}) = \mathrm{softmax}\big(\mathbf{W}_{r}\,[\mathbf{e}_s : \mathbf{h}_{t-1}(s)]\big),$   (4)
$p(s_t \mid G_{t-m:t-1}) = \mathrm{softmax}\big(\mathbf{W}_{s}\,\mathbf{H}_{t-1}\big),$   (5)

where $\mathbf{h}_{t-1}(s)$ captures all the local information about $s$ in the past, and $\mathbf{H}_{t-1}$ is a vector representation that encodes the global graph structures $G_{t-m:t-1}$.

For each time step $t$, the hidden vectors $\mathbf{h}_{t}(s, r)$, $\mathbf{h}_{t}(s)$, and $\mathbf{H}_{t}$ preserve the information from the past events, and we update them in the following recurrent way:

$\mathbf{h}_t(s, r) = \mathrm{RNN}^{1}\big(g(N_t^{(s)}), \mathbf{H}_t, \mathbf{h}_{t-1}(s, r)\big),$   (6)
$\mathbf{h}_t(s) = \mathrm{RNN}^{2}\big(g(N_t^{(s)}), \mathbf{H}_t, \mathbf{h}_{t-1}(s)\big),$   (7)
$\mathbf{H}_t = \mathrm{RNN}^{3}\big(g(G_t), \mathbf{H}_{t-1}\big),$   (8)

where $g(\cdot)$ is an aggregation function, and $N_t^{(s)}$ stands for all the events related to $s$ at the current time step $t$. Intuitively, we obtain the current information related to $s$ by aggregating all the related events at time $t$, i.e., $g(N_t^{(s)})$. Then we update the hidden vector $\mathbf{h}_t(s, r)$ by using the aggregated information at the current step, the past value $\mathbf{h}_{t-1}(s, r)$, and also the global hidden vector $\mathbf{H}_t$. The hidden vector $\mathbf{h}_t(s)$ is updated in a similar way. For the global hidden vector $\mathbf{H}_t$, we aggregate the information from all the events $G_t$ at time $t$ for the update.
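
A minimal sketch of these recurrent updates, using GRU cells as in Appendix A; the wiring of the inputs and the tensor shapes below are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

hidden_dim = 200

# One GRU cell per hidden state. Inputs are concatenations of the aggregated
# neighborhood vector g(N_t^(s)) and the global vector H_t where it is used.
rnn_local_sr = nn.GRUCell(input_size=2 * hidden_dim, hidden_size=hidden_dim)  # eq. 6
rnn_local_s  = nn.GRUCell(input_size=2 * hidden_dim, hidden_size=hidden_dim)  # eq. 7
rnn_global   = nn.GRUCell(input_size=hidden_dim,     hidden_size=hidden_dim)  # eq. 8

def step(g_neigh, g_graph, h_sr, h_s, H):
    """One encoder step; g_neigh, h_sr, h_s: (batch, hidden_dim); g_graph, H: (1, hidden_dim)."""
    H_new = rnn_global(g_graph, H)                      # eq. 8: update the global state
    H_b = H_new.expand(g_neigh.size(0), -1)             # broadcast to the batch of subjects
    h_sr_new = rnn_local_sr(torch.cat([g_neigh, H_b], dim=-1), h_sr)  # eq. 6
    h_s_new = rnn_local_s(torch.cat([g_neigh, H_b], dim=-1), h_s)     # eq. 7
    return h_sr_new, h_s_new, H_new

# Toy usage with random aggregated inputs for 4 subjects.
h_sr, h_s, H = torch.zeros(4, hidden_dim), torch.zeros(4, hidden_dim), torch.zeros(1, hidden_dim)
h_sr, h_s, H = step(torch.randn(4, hidden_dim), torch.randn(1, hidden_dim), h_sr, h_s, H)
```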

For each subject entity $s$, it can interact with multiple relations and object entities at each time step $t$. In other words, the set $N_t^{(s)}$ can contain multiple events. Designing effective aggregation functions to aggregate information from $N_t^{(s)}$ for $s$ is therefore a nontrivial problem. Next, we introduce how we design $g(\cdot)$ in RE-Net.

Figure 2: Illustration of the multi-relational graph (RGCN) aggregator. The blue node corresponds to node $s$, red nodes are 1-hop neighbors, and green nodes are 2-hop neighbors. Differently colored edges denote different relations. In this figure, we obtain the aggregated neighborhood representation for each graph from a two-layer RGCN aggregator.

3.2 Multi-relational Graph (RGCN) Aggregator

Here we discuss the aggregation function $g(\cdot)$, which captures different kinds of neighborhood information for each subject entity and relation, i.e., $(s, r)$. We first introduce two simple aggregation functions, i.e., a mean pooling aggregator and an attentive pooling aggregator. Then we introduce a more powerful aggregation function, i.e., a multi-relational aggregator.

Mean Pooling Aggregator. The baseline aggregator simply takes the element-wise mean of the vectors in $\{\mathbf{e}_o : o \in N_t^{(s, r)}\}$, where $N_t^{(s, r)}$ is the set of objects that interacted with $s$ under relation $r$ at time $t$. But the mean aggregator treats all neighboring objects equally, and thus ignores the different importance of each neighbor entity.

Attentive Pooling Aggregator. We define an attentive aggregator based on the additive attention introduced in (Bahdanau et al., 2015) to distinguish the important entities for $(s, r)$. The aggregate function is defined as $g(N_t^{(s, r)}) = \sum_{o \in N_t^{(s, r)}} \alpha_o \mathbf{e}_o$, where $\alpha_o = \mathrm{softmax}\big(\mathbf{v}^{\top} \tanh(\mathbf{W}_1 [\mathbf{e}_s : \mathbf{e}_r] + \mathbf{W}_2 \mathbf{e}_o)\big)$. $\mathbf{W}_1$ and $\mathbf{W}_2$ are trainable weight matrices. By adding the attention function of the subject and the relation, the weight $\alpha_o$ can determine how relevant each object entity is to the subject and relation.
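
A possible PyTorch realization of this additive-attention aggregator is sketched below; the exact scoring form and the weight shapes are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Additive-attention aggregator over object embeddings in N_t^{(s,r)} (a sketch)."""

    def __init__(self, embed_dim: int = 200):
        super().__init__()
        self.w_query = nn.Linear(2 * embed_dim, embed_dim, bias=False)  # acts on [e_s : e_r]
        self.w_key = nn.Linear(embed_dim, embed_dim, bias=False)        # acts on each e_o
        self.v = nn.Linear(embed_dim, 1, bias=False)

    def forward(self, e_s, e_r, neighbor_objs):
        # e_s, e_r: (embed_dim,); neighbor_objs: (num_neighbors, embed_dim)
        query = self.w_query(torch.cat([e_s, e_r], dim=-1))
        scores = self.v(torch.tanh(query + self.w_key(neighbor_objs)))  # (num_neighbors, 1)
        alpha = torch.softmax(scores, dim=0)                            # attention weights
        return (alpha * neighbor_objs).sum(dim=0)                       # weighted sum of e_o

agg = AttentivePooling()
pooled = agg(torch.randn(200), torch.randn(200), torch.randn(5, 200))  # shape (200,)
```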

Multi-Relational Aggregator. Here, we introduce a multi-relational graph aggregator based on (Schlichtkrull et al., 2018). This is a general aggregator that can incorporate information from multi-relational neighbors and multi-hop neighbors. Formally, the aggregator is defined as follows:

$\mathbf{h}_{s}^{(l+1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \sum_{o \in N_{r}^{(s)}} \frac{1}{c_s} \mathbf{W}_{r}^{(l)} \mathbf{h}_{o}^{(l)} + \mathbf{W}_{0}^{(l)} \mathbf{h}_{s}^{(l)} \Big).$   (9)

Basically, each relation $r$ can derive a local graph structure between entities, which further yields a message on each entity by aggregating the information from the neighbors of that entity under $r$, i.e., $\sum_{o \in N_{r}^{(s)}} \frac{1}{c_s} \mathbf{W}_{r}^{(l)} \mathbf{h}_{o}^{(l)}$. The overall message on each entity is further computed by aggregating all the relation-specific messages, i.e., $\sum_{r \in \mathcal{R}} \sum_{o \in N_{r}^{(s)}} \frac{1}{c_s} \mathbf{W}_{r}^{(l)} \mathbf{h}_{o}^{(l)}$. Finally, the aggregator is defined by combining both the overall message and the information of the entity itself from the previous layer, i.e., $\mathbf{W}_{0}^{(l)} \mathbf{h}_{s}^{(l)}$.

To distinguish between different relations, we introduce independent weight matrices $\mathbf{W}_{r}^{(l)}$ for each relation $r$. Furthermore, the aggregator collects representations of multi-hop neighbors by introducing multiple layers of the neural network, with each layer indexed by $l$. The number of layers determines the depth to which a node reaches to aggregate information from its local neighborhood. We depict this aggregator in Fig. 2.

The major issue of this aggregator is that the number of parameters grows rapidly with the number of relations. In practice, this can easily lead to overfitting on rare relations and models of very large size. Thus, we adopt the block-diagonal decomposition (Schlichtkrull et al., 2018), where each relation-specific weight matrix $\mathbf{W}_{r}^{(l)}$ is decomposed into low-dimensional blocks; $\mathbf{W}_{r}^{(l)}$ in equation 9 is defined as a block-diagonal matrix $\mathrm{diag}(\mathbf{A}_{1r}^{(l)}, \ldots, \mathbf{A}_{Br}^{(l)})$, where each $\mathbf{A}_{br}^{(l)}$ is a low-dimensional block and $B$ is the number of basis (block) matrices. The block decomposition reduces the number of parameters and helps to prevent overfitting.
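
To make the block-diagonal decomposition concrete, the sketch below builds each relation-specific weight from $B$ small blocks and applies one layer of relational message passing for a single node. The class name, shapes, and the mean-based neighbor normalization are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

class BlockDiagRGCNLayer(nn.Module):
    """One RGCN layer (equation 9) with block-diagonal relation weights (a sketch)."""

    def __init__(self, num_relations: int, dim: int = 200, num_blocks: int = 10):
        super().__init__()
        assert dim % num_blocks == 0
        self.num_blocks, self.block = num_blocks, dim // num_blocks
        # B small blocks per relation instead of a full dim x dim matrix per relation.
        self.blocks = nn.Parameter(
            0.01 * torch.randn(num_relations, num_blocks, self.block, self.block))
        self.w_self = nn.Linear(dim, dim, bias=False)  # W_0 for the self-loop term

    def rel_transform(self, r, h):
        # Apply the block-diagonal W_r: split h into blocks, transform, re-concatenate.
        n = h.size(0)
        h_blocks = h.view(n, self.num_blocks, self.block)
        out = torch.einsum("nbi,bij->nbj", h_blocks, self.blocks[r])
        return out.reshape(n, -1)

    def forward(self, h_node, neighbors):
        # neighbors: list of (relation_id, neighbor_embedding_matrix) pairs for one node.
        msgs = [self.rel_transform(r, h_nb).mean(dim=0) for r, h_nb in neighbors]
        agg = torch.stack(msgs).sum(dim=0) if msgs else torch.zeros_like(h_node)
        return torch.relu(agg + self.w_self(h_node))

layer = BlockDiagRGCNLayer(num_relations=4)
h_next = layer(torch.randn(200), [(0, torch.randn(3, 200)), (2, torch.randn(2, 200))])
```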

3.3 Parameter Learning and Inference of RE-Net

Parameter Learning via Event Prediction. The (object) entity prediction given $(s, r)$ can be viewed as a multi-class classification task, where each class corresponds to one object entity. Similarly, relation prediction given $s$ and subject entity prediction can be considered as multi-class classification tasks. Here we omit the notation for previous events. To learn weights and representations for entities and relations, we adopt a multi-class cross-entropy loss on the model's output. The loss function is comprised of three losses and is defined as:

$\mathcal{L} = \sum_{(s, r, o, t) \in G} \big( -\log p(o_t \mid s, r) - \lambda_1 \log p(r_t \mid s) - \lambda_2 \log p(s_t) \big),$   (10)

where $G$ is the set of events, and $\lambda_1$ and $\lambda_2$ are importance parameters that control the weight of each loss term. $\lambda_1$ and $\lambda_2$ can be chosen depending on the task. If the task aims to predict $o$ given $(s, r)$, then we can give small values to $\lambda_1$ and $\lambda_2$. Each probability is defined in equations 3, 4, and 5, respectively. We apply teacher forcing for model training over historical data.
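
Equation 10 is simply a weighted sum of three multi-class cross-entropy terms; a minimal sketch (function and variable names are placeholders, and the $\lambda$ values are illustrative):

```python
import torch
import torch.nn.functional as F

def renet_loss(obj_logits, rel_logits, subj_logits, o_true, r_true, s_true,
               lambda1: float = 0.1, lambda2: float = 0.1):
    """Weighted multi-class cross-entropy over objects, relations, and subjects (eq. 10)."""
    loss_o = F.cross_entropy(obj_logits, o_true)   # -log p(o_t | s, r)
    loss_r = F.cross_entropy(rel_logits, r_true)   # -log p(r_t | s)
    loss_s = F.cross_entropy(subj_logits, s_true)  # -log p(s_t)
    return loss_o + lambda1 * loss_r + lambda2 * loss_s

# Toy usage with random logits for a batch of 8 events, 50 entities, and 20 relations.
loss = renet_loss(torch.randn(8, 50), torch.randn(8, 20), torch.randn(8, 50),
                  torch.randint(0, 50, (8,)), torch.randint(0, 20, (8,)),
                  torch.randint(0, 50, (8,)))
```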

Multi-step Inference over Time. At inference time, RE-Net seeks to predict the forthcoming events based on the previous observations. Suppose that the current time is $t$ and we aim at predicting events at time $t + \Delta t$; then the problem of multi-step inference can be formalized as inferring the conditional probability $p(G_{t+\Delta t} \mid G_{:t})$. The problem is nontrivial as we need to integrate over all possible $G_{t+1:t+\Delta t-1}$. To achieve efficient inference, we draw a sample $\hat{G}_{t+1:t+\Delta t-1}$ of $G_{t+1:t+\Delta t-1}$, and estimate the conditional probability in the following way:

$p(G_{t+\Delta t} \mid G_{:t}) = \sum_{G_{t+1:t+\Delta t-1}} p(G_{t+\Delta t} \mid G_{t+1:t+\Delta t-1}, G_{:t})\, p(G_{t+1:t+\Delta t-1} \mid G_{:t}) \approx p(G_{t+\Delta t} \mid \hat{G}_{t+1:t+\Delta t-1}, G_{:t}).$   (11)

Such an inference procedure is intuitive. Basically, one starts with computing $p(G_{t+1} \mid G_{:t})$ and drawing a sample $\hat{G}_{t+1}$ from the conditional distribution. With this sample, one can further compute $p(G_{t+2} \mid \hat{G}_{t+1}, G_{:t})$. By iteratively computing the conditional distribution for each intermediate time step and drawing a sample from it, one can eventually estimate $p(G_{t+\Delta t} \mid G_{:t})$ as $p(G_{t+\Delta t} \mid \hat{G}_{t+1:t+\Delta t-1}, G_{:t})$. In practice, we can improve the estimation by drawing multiple samples at each step, but RE-Net already performs very well with a single sample, and thus we only draw one sample at each step for better efficiency. Based on the estimation of the conditional distribution, we can further predict events which are likely to form in the future. We summarize the detailed inference algorithm in Algorithm 1.

Input: Observed graph sequence $G_{:t}$; number of events to keep (cutoff) at each step $k$.
Output: An estimation of the conditional distribution $p(G_{t+\Delta t} \mid G_{:t})$.
1: $t' \leftarrow t + 1$
2: while $t' < t + \Delta t$ do
3:     Sample candidate events at time $t'$ by Equation 4.
4:     Pick the top-$k$ triples ranked by Equation 2 to form the sampled graph $\hat{G}_{t'}$.
5:     Add $\hat{G}_{t'}$ to the sampled history $\hat{G}_{t+1:t'}$.
6:     $t' \leftarrow t' + 1$
7: Estimate the probability of each event at time $t + \Delta t$ by Equation 2.
8: Estimate the joint distribution of all events at time $t + \Delta t$ by Equation 1.
return $p(G_{t+\Delta t} \mid \hat{G}_{t+1:t+\Delta t-1}, G_{:t})$ as the estimation.
Algorithm 1: Inference algorithm of RE-Net
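
The loop below sketches Algorithm 1 in plain Python: at each intermediate step the model scores candidate triples, keeps the top-$k$ as the sampled graph, and conditions the next step on it. The scorer is a placeholder standing in for the RE-Net encoder/decoder (equation 2), not the released implementation.

```python
import random

def multi_step_inference(score_triples, observed, candidates, delta_t: int, k: int = 1000):
    """Sketch of Algorithm 1: sequential sampling of intermediate graphs.

    `score_triples(history, candidates)` is a stand-in for the RE-Net scorer that
    returns one probability per candidate triple (equation 2).
    """
    history = list(observed)
    for _ in range(delta_t - 1):
        probs = score_triples(history, candidates)                  # rank all candidates
        ranked = sorted(zip(candidates, probs), key=lambda x: -x[1])
        g_hat = [triple for triple, _ in ranked[:k]]                # top-k sampled graph
        history.append(g_hat)                                       # condition next step on it
    return score_triples(history, candidates)  # estimates p(G_{t+dt} | sampled, observed)

# Toy usage with a random scorer over 5 dummy triples.
cands = [("s", "r", o) for o in range(5)]
probs = multi_step_inference(lambda h, c: [random.random() for _ in c],
                             observed=[], candidates=cands, delta_t=3, k=2)
```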

Computational Complexity Analysis. Here we analyze the time complexity of the graph generation in Algorithm 1. Assume that the maximum degree of entities is $D$ and we have $L$ layers of aggregation; then the time complexity of each aggregation operation is $O(D^{L})$. As we unroll over the previous $m$ time steps, which has linear time complexity in $m$, the time complexity for generating one example from $p(s, r, o \mid G_{t-m:t-1})$ is $O(m D^{L})$.

4 Experiments

Evaluating the quality of generated graphs is challenging, especially for knowledge graphs (Theis et al., 2015). Instead, we evaluate our proposed method on a link prediction task on temporal knowledge graphs. The task of predicting future links aims to predict unseen relationships, i.e., object entities given $(s, r, ?, t)$ (or subject entities given $(?, r, o, t)$), based on the observed events in the TKG. Essentially, the task is a ranking problem over all the candidate object (or subject) entities. RE-Net can approach this problem by computing the probability of each event in a distant future with the inference algorithm in Algorithm 1, and further ranking all the events according to their probabilities.

We evaluate our proposed method with three sets of experiments: (1) predicting future events on three event-based datasets; (2) predicting future facts on two knowledge graphs which include facts with time spans; and (3) studying parameter sensitivity and ablations of our proposed method. Section 4.1 summarizes the datasets, and the supplementary material contains additional information. In all these experiments, we perform predictions on time stamps that are not observed during training.

Method ICEWS18 - filtered GDELT - filtered ICEWS14 - filtered
MRR H@1 H@3 H@10 MRR H@1 H@3 H@10 MRR H@1 H@3 H@10

Static

TransE 17.56 2.48 26.95 43.87 16.05 0.00 26.10 42.29 18.65 1.21 31.34 47.07
DistMult 22.16 12.13 26.00 42.18 18.71 11.59 20.05 32.55 19.06 10.09 22.00 36.41
ComplEx 30.09 21.88 34.15 45.96 22.77 15.77 24.05 36.33 24.47 16.13 27.49 41.09
R-GCN 23.19 16.36 25.34 36.48 23.31 17.24 24.94 34.36 26.31 18.23 30.43 45.34
ConvE 36.67 28.51 39.80 50.69 35.99 27.05 39.32 49.44 40.73 33.20 43.92 54.35
RotatE 23.10 14.33 27.61 38.72 22.33 16.68 23.89 32.29 29.56 22.14 32.92 42.68

Temporal

HyTE 7.31 3.10 7.50 14.95 6.37 0.00 6.72 18.63 11.48 5.64 13.04 22.51
TTransE 8.36 1.94 8.71 21.93 5.52 0.47 5.01 15.27 6.35 1.23 5.80 16.65
TA-DistMult 28.53 20.30 31.57 44.96 29.35 22.11 31.56 41.39 20.78 13.43 22.80 35.26
Know-Evolve* 3.27 3.23 3.23 3.26 2.43 2.33 2.35 2.41 1.42 1.35 1.37 1.43
Know-Evolve+MLP 9.29 5.11 9.62 17.18 22.78 15.40 25.49 35.41 22.89 14.31 26.68 38.57
DyRep+MLP 9.86 5.14 10.66 18.66 23.94 15.57 27.88 36.58 24.61 15.88 28.87 39.34
R-GCRN+MLP 35.12 27.19 38.26 50.49 37.29 29.00 41.08 51.88 36.77 28.63 40.15 52.33
RE-Net w/o multi-step 40.05 33.32 42.60 52.92 38.10 29.34 41.26 51.61 42.72 35.42 46.06 56.15
RE-Net w/o agg. 33.46 26.64 35.98 46.62 38.72 30.57 42.52 52.78 42.23 34.73 45.61 56.07
RE-Net w. mean agg. 40.70 34.24 43.27 53.65 38.35 29.92 42.13 52.52 43.79 36.21 47.34 57.47
RE-Net 42.93 36.19 45.47 55.80 40.12 32.43 43.40 53.80 45.71 38.42 49.06 59.12
RE-Net w. GT 44.33 37.61 46.83 57.27 41.80 33.54 45.71 56.03 46.74 39.41 50.10 60.19
Table 1: Performance comparison on temporal link prediction (average metrics in % over 5 runs) on three event-based TKG datasets with filtered setting. RE-Net achieves the best results. Results with raw setting are in the supplementary material.

4.1 Experimental Set-up

Datasets. We use five datasets: 1) three event-based temporal knowledge graphs and 2) two knowledge graphs where temporally associated facts have meta-facts of the form $(s, r, o, [t_s, t_e])$, where $t_s$ is the starting time point and $t_e$ is the ending time point. The first group of graphs includes the Integrated Crisis Early Warning System (ICEWS18 (Boschee et al., 2015) and ICEWS14 (Trivedi et al., 2017)) and the Global Database of Events, Language, and Tone (GDELT) (Leetaru & Schrodt, 2013). The second group of graphs includes WIKI (Leblay & Chekol, 2018) and YAGO (Mahdisoltani et al., 2014). We preprocess the second group of datasets such that each fact is converted to a sequence $\{(s, r, o, t_s), (s, r, o, t_s + \Delta), \ldots, (s, r, o, t_e)\}$, where $\Delta$ is a unit time, to ensure each fact has a sequence of events. The details of the datasets are described in Section B.

Evaluation Setting and Metrics. For each dataset except ICEWS14, we split it into three subsets, i.e., train (80%) / valid (10%) / test (10%), by time stamps. Thus, (times of train) < (times of valid) < (times of test). We report Mean Reciprocal Rank (MRR) and Hits@1/3/10, using both the filtered version and the raw version of the datasets. Similar to the definition of the filtered setting in (Bordes et al., 2013), during evaluation we remove from the list of corrupted triplets all the triplets that appear either in the train, valid, or test set.
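
For clarity, the filtered ranking behind MRR and Hits@k can be sketched as follows: candidate objects that form a known triple (from train, valid, or test) other than the test triple itself are removed before ranking. This is a generic sketch with placeholder names, not the evaluation script of the released code.

```python
def filtered_rank(score_fn, s, r, true_o, num_entities, known_triples):
    """Rank of the true object after filtering out other known (s, r, o') triples."""
    scores = {o: score_fn(s, r, o) for o in range(num_entities)
              if o == true_o or (s, r, o) not in known_triples}
    ranking = sorted(scores, key=lambda o: -scores[o])
    return ranking.index(true_o) + 1

def mrr_and_hits(test_triples, score_fn, num_entities, known_triples, ks=(1, 3, 10)):
    ranks = [filtered_rank(score_fn, s, r, o, num_entities, known_triples)
             for (s, r, o) in test_triples]
    mrr = sum(1.0 / rank for rank in ranks) / len(ranks)
    hits = {k: sum(rank <= k for rank in ranks) / len(ranks) for k in ks}
    return mrr, hits

# Toy usage with a dummy scorer that prefers smaller object ids.
known = {(0, 0, 1), (0, 0, 2)}
mrr, hits = mrr_and_hits([(0, 0, 2)], lambda s, r, o: -o, num_entities=5, known_triples=known)
```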

Competitors. We compare our approach to baselines for static graphs and temporal graphs:

(1) Static Methods. By ignoring the edge time stamps, we construct a static, cumulative graph for all the training events, and apply multi-relational graph representation learning methods including TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), R-GCN (Schlichtkrull et al., 2018), ConvE (Dettmers et al., 2018), and RotatE (Sun et al., 2019).

(2) Temporal Reasoning Methods. We also compare with state-of-the-art temporal reasoning methods for knowledge graphs, including Know-Evolve (Trivedi et al., 2017), TA-DistMult (García-Durán et al., 2018), HyTE (Dasgupta et al., 2018), and TTransE (Leblay & Chekol, 2018). (*: We found a problematic formulation in Know-Evolve when dealing with concurrent events (Eq. (3) in its paper) and a flaw in its evaluation code. The performance dramatically drops after fixing the evaluation code. Details of these issues are discussed in Section E.) TA-DistMult, HyTE, and TTransE are designed for an interpolation task, which is to make predictions at time stamps observed during training, which is different from our setting; we give random values or embeddings for time stamps that are not observed during training. To see the effectiveness of our recurrent event encoder, we use encoders of previous work with our MLP decoder as baselines; we compare Know-Evolve, DyRep (Trivedi et al., 2019), and GCRN (Seo et al., 2017) combined with our MLP decoder, which are called Know-Evolve+MLP, DyRep+MLP, and R-GCRN+MLP. The GCRN utilizes a Graph Convolutional Network (Kipf & Welling, 2016); instead, we use RGCN (Schlichtkrull et al., 2018) to deal with relational graphs.

(3) Variants of RE-Net. To evaluate the importance of different components of RE-Net, we varied our base model in different ways: RE-Net w/o multi-step which does not update history during inference, RE-Net without the aggregator (RE-Net w/o agg.), and RE-Net with a mean aggregator. RE-Net w/o agg. takes a zero vector instead of a representation of the aggregator. RE-Net w. GT denotes RE-Net with ground truth history or interactions during multi-step inference, and thus the model knows all the interactions before the time for testing. Experiment settings and implementation details of RE-Net and baselines are described in Section C.

Method WIKI - filtered WIKI - raw YAGO - filtered YAGO - raw
MRR H@3 H@10 MRR H@3 H@10 MRR H@3 H@10 MRR H@3 H@10

Static

TransE 46.68 49.71 51.71 26.21 31.25 39.06 48.97 62.45 66.05 33.85 48.19 59.50
DistMult 46.12 49.81 51.38 27.96 32.45 39.51 59.47 60.91 65.26 44.05 49.70 59.94
ComplEx 47.84 50.08 51.39 27.69 31.99 38.61 61.29 62.28 66.82 44.09 49.57 59.64
R-GCN 37.57 39.66 41.90 13.96 15.75 22.05 41.30 44.44 52.68 20.25 24.01 37.30
ConvE 47.57 50.10 50.53 26.03 30.51 39.18 62.32 63.97 65.60 41.22 47.03 59.90
RotatE 50.67 50.74 50.88 26.08 31.63 38.51 65.09 65.67 66.16 42.08 46.77 59.39

Temporal

HyTE 43.02 45.12 49.49 25.40 29.16 37.54 23.16 45.74 51.94 14.42 39.73 46.98
TTransE 31.74 36.25 43.45 20.66 23.88 33.04 32.57 43.39 53.37 26.10 36.28 47.73
TA-DistMult 48.09 49.51 51.70 26.44 31.36 38.97 61.72 65.32 67.19 44.98 50.64 61.11
Know-Evolve* 0.09 0.03 0.10 0.03 0.00 0.04 0.07 0.00 0.04 0.02 0.00 0.01
Know-Evolve+MLP 12.64 14.33 21.57 10.54 13.08 20.21 6.19 6.59 11.48 5.23 5.63 10.23
DyRep+MLP 11.60 12.74 21.65 10.41 12.06 20.93 5.87 6.54 11.98 4.98 5.54 10.19
R-GCRN+MLP 47.71 48.14 49.66 28.68 31.44 38.58 53.89 56.06 61.19 43.71 48.53 56.98
RE-Net w/o multi-step 51.01 51.14 52.91 29.91 32.60 40.29 64.21 64.70 67.11 45.88 51.78 60.97
RE-Net w/o agg. 31.08 33.98 45.53 17.55 20.65 33.51 33.86 36.89 50.72 27.37 30.20 46.35
RE-Net w. mean agg. 51.13 51.37 53.01 30.19 32.94 40.57 65.10 65.24 67.34 46.33 52.49 61.21
RE-Net 51.97 52.07 53.91 30.87 33.55 41.27 65.16 65.63 68.08 46.81 52.71 61.93
RE-Net w. GT 53.57 54.10 55.72 32.44 35.42 43.16 66.80 67.23 69.77 48.60 54.20 63.59
Table 2: Performance comparison on temporal link prediction (average metrics in % over 5 runs) on two public temporal knowledge graphs, i.e., WIKI and YAGO.
(a) ICEWS18 (Hits@3)
(b) GDELT (Hits@3)
(c) WIKI (Hits@3)
(d) YAGO (Hits@3)
Figure 3: Performance of temporal link prediction over future time stamps with filtered Hits@3. RE-Net consistently outperforms the baselines.

4.2 Performance Comparison on Temporal Knowledge Graphs.

In this section we compare our proposed method with the other baselines. The test results are obtained by averaging metrics over the entire test set of each dataset.

Performances on Event-based TKGs. Table 1 summarizes results on the three event-based datasets: ICEWS18, GDELT, and ICEWS14. Our proposed RE-Net outperforms all other baselines on these datasets. Static methods show good results but they underperform our method since they do not consider temporal factors. RE-Net also outperforms all other temporal methods, which demonstrates the effectiveness of the proposed method. The modified Know-Evolve with our MLP decoder (Know-Evolve+MLP) achieves better performance than Know-Evolve, which shows the effectiveness of our MLP decoder, but there is still a large gap from our model. We notice that Know-Evolve and DyRep have a gradient exploding issue in their encoders since their RNN-like structures keep accumulating embeddings over time; this issue degrades their performance. The Graph Convolutional Recurrent Network (GCRN) is not designed for dynamic, multi-relational graphs and is not capable of link prediction. We modified the model to work on dynamic graphs in our problem setting by using RGCN instead of GCN, together with our MLP decoder. The modified model (R-GCRN+MLP) shows good performance but does not outperform our method. R-GCRN+MLP has a similar structure to ours in that it has a recurrent encoder and an RGCN aggregator, but it lacks multi-step inference, global information, and the sophisticated modeling of our recurrent encoder. These results of the combined models suggest that our recurrent event encoder yields better performance in link prediction. Importantly, none of these temporal methods is capable of multi-step inference, while RE-Net sequentially infers multi-step events.

Performances on Public KGs. Previous results have shown the effectiveness of RE-Net, and here we compare the methods on the public KGs: WIKI and YAGO. In Table 2, our proposed RE-Net outperforms all other baselines. On these datasets, the baselines show better results than on the event-based TKGs. This is due to the characteristics of the datasets: they have facts that are valid within a time span. However, our proposed method consistently outperforms the static and temporal methods, which implies that RE-Net effectively infers new events using a powerful event encoder and an aggregator, and provides accurate prediction results.

Performances of Prediction over Time. Next, we further study the performance of RE-Net over time. Fig. 3 shows the performance comparisons over different time stamps on the ICEWS18, GDELT, WIKI, and YAGO datasets with the filtered Hits@3 metric. RE-Net consistently outperforms baseline methods for all time stamps. We notice that with the increase of the time step, the difference between RE-Net and ConvE becomes smaller, as shown in Fig. 3. This is expected since events further in the future are harder to predict. Furthermore, the decline in performance can be attributed to the generation of a long graph sequence: to estimate the joint probability distribution of all events in a distant future, RE-Net must generate a long sequence of graphs, and the quality of generated graphs deteriorates as the sequence gets longer.

(a) RE-Net with different aggregators
(b) Effect of global representations
(c) Study of empirical $p(s)$ and $p(r \mid s)$
Figure 4: Performance study on model variations. We study the effects of (a) RE-Net with different aggregators, (b) the global representation from the global graph structure, and (c) empirical $p(s)$ and $p(r \mid s)$.

4.3 Ablation Study

In this section, we study the effect of variations of RE-Net. To evaluate the importance of different components of RE-Net, we varied our base model in different ways, measuring the change in performance on the link prediction task on the ICEWS18 dataset. We present the results in Tables 1 and 2 and Fig. 4.

Different Aggregators. We first analyze the effect of the aggregator. In Tables 1 and 2, we observe that removing the aggregator (RE-Net w/o agg.) hurts model quality. This suggests that introducing aggregators makes the model capable of dealing with concurrent events and that aggregators improve prediction performance. Fig. 4(a) shows the performance of RE-Net with different aggregators. Among them, the RGCN aggregator outperforms the other aggregators. This aggregator has the advantage of exploring multi-relational neighbors, not limited to neighbors under the same relation.

Global Information. We further observe that representations from global graph structures help the predictions. Fig. 4(b) shows the effectiveness of a representation of the global graph structures. We consider that global representations give information beyond local graph structures.

Empirical Probabilities. Here, we study the role of $p(s \mid G_{t-m:t-1})$ and $p(r \mid s, N^{(s)}_{t-m:t-1})$. We simply denote them as $p(s)$ and $p(r \mid s)$ for brevity; also, $p(s, r)$ is equivalent to $p(s) \cdot p(r \mid s)$. In Fig. 4(c), emp. $p(s)$ denotes a model with an empirical $p(s)$, which is defined as (# of $s$-related triples) / (total # of triples). Also, emp. $p(s, r)$ denotes a model with empirical $p(s)$ and $p(r \mid s)$, where $p(s, r)$ is defined as (# of $(s, r)$-related triples) / (total # of triples). RE-Net uses a trained $p(s)$ and $p(r \mid s)$. The results show that the trained $p(s)$ and $p(r \mid s)$ help RE-Net with multi-step predictions. Using the empirical $p(s)$ underperforms RE-Net, and using the empirical $p(s, r)$ shows the worst performance, which suggests that training each part of the probability in equation 2 gives better prediction performance.
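
The empirical baselines above are just frequency estimates over the training triples; a minimal sketch of how such estimates can be computed (variable names are placeholders):

```python
from collections import Counter

def empirical_probs(train_triples):
    """Frequency estimates: p(s) = #s-related triples / #triples, and p(s, r) likewise."""
    total = len(train_triples)
    s_counts = Counter(s for s, r, o in train_triples)
    sr_counts = Counter((s, r) for s, r, o in train_triples)
    p_s = {s: c / total for s, c in s_counts.items()}
    p_sr = {sr: c / total for sr, c in sr_counts.items()}
    return p_s, p_sr

# Toy usage: p_s[0] == 2/3 and p_sr[(0, 0)] == 1/3 for the three triples below.
p_s, p_sr = empirical_probs([(0, 0, 1), (0, 1, 2), (1, 0, 2)])
```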

4.4 Sensitivity Analysis

In this section, we study the parameter sensitivity of RE-Net, including the length of the past history for the event encoder and the cutoff position $k$ for events used to generate a graph. Furthermore, we study the number of layers of the RGCN aggregator. We report the performance change of RE-Net on the ICEWS18 dataset by varying these hyper-parameters in Fig. 5.

Length of Past History in Recurrent Event Encoder. The recurrent event encoder takes the sequence of past interactions up to $m$ previous graph sequences (histories). Fig. 5(a) shows the performance with varying lengths of past history. When RE-Net uses longer histories, MRR increases. However, MRR stops improving when the length of history reaches 5 or more, which implies that a longer history does not make a big difference.

Cut-off Position at Inference Time. To do multi-step prediction, RE-Net must generate graphs of triples. To generate graphs, we keep the top-$k$ triples from the ranking results. If $k$ is 0, RE-Net does not generate graphs for estimating the joint distribution, which means RE-Net does not perform multi-step predictions. Fig. 5(b) shows the performance when choosing different cutoff positions $k$. When $k = 0$, which means RE-Net performs single-step predictions, it shows the lowest result. As $k$ becomes larger, the performance improves and saturates after $k = 500$. We also notice that the conditional distribution $p(G_{t+\Delta t} \mid G_{:t})$ can be better approximated by using a larger cutoff position.

Layers of RGCN Aggregator. We examine the number of layers in the RGCN aggregator. The number of layers in the aggregator determines the depth to which a node reaches. Fig. 5(c) shows the performance with different numbers of RGCN layers. We notice that a 2-layer RGCN improves the performance considerably compared to a 1-layer RGCN since the 2-layer RGCN aggregates more information.

(a) Length of past history
(b) Cutoff position
(c) # layers of RGCN
Figure 5: Parameter sensitivity of RE-Net. We study the effects of (a) the length of RNN history in the event sequence encoder, (b) the cutoff position at inference time, and (c) the number of RGCN layers in neighborhood aggregation.

5 Conclusion

In this work, we studied sequential graph generation on temporal knowledge graphs. To tackle this problem, we proposed the Recurrent Event Network (RE-Net), which models temporal, multi-relational, and concurrent interactions between entities. A recurrent event encoder in RE-Net summarizes information of the past event sequences, and a neighborhood aggregator collects the information of concurrent events within each time stamp. RE-Net defines the joint probability of all events, and thus is capable of inferring global structures in a sequential manner. We tested the proposed model on a link prediction task on temporal knowledge graphs. The experiments revealed that the proposed RE-Net outperforms all the static and temporal methods, and our extensive experiments show its strength. Interesting future work includes modeling lasting events and performing inference on long-lasting graph structures.

References

Appendix A Recurrent Event Encoder

We define a recurrent event encoder based on an RNN as follows:

$\mathbf{h}_t(s, r) = \mathrm{RNN}(\mathbf{x}_t, \mathbf{h}_{t-1}(s, r)).$

We use Gated Recurrent Units (Cho et al., 2014) as the RNN:

$\mathbf{z}_t = \sigma(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1}),$
$\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1}),$
$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h \mathbf{x}_t + \mathbf{U}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1})),$
$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t,$

where $[\cdot : \cdot]$ denotes concatenation, $\sigma$ is an activation function, and $\odot$ is the Hadamard operator. The input $\mathbf{x}_t$ is a concatenation of three vectors: the subject embedding, the object embedding, and the aggregation of neighborhood representations. $\mathbf{h}_t(s)$ and $\mathbf{H}_t$ are similarly defined.

Appendix B Dataset

We use five datasets: 1) three event-based temporal knowledge graphs (ICEWS18, ICEWS14, and GDELT), and 2) two knowledge graphs (WIKI and YAGO). ICEWS18 is collected from 1/1/2018 to 10/31/2018, ICEWS14 is from 1/1/2014 to 12/31/2014, and GDELT is from 1/1/2018 to 1/31/2018. ICEWS14 is from (Trivedi et al., 2017). We did not use their version of the GDELT dataset since they did not release it.

The WIKI and YAGO datasets have temporally associated facts $(s, r, o, [t_s, t_e])$. We preprocess the datasets such that each fact is converted to a sequence $\{(s, r, o, t_s), (s, r, o, t_s + \Delta), \ldots, (s, r, o, t_e)\}$, where $\Delta$ is a unit time, to ensure each fact has a sequence of events. Noisy events of early years are removed (before 1786 for WIKI and 1830 for YAGO).
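
This conversion amounts to unrolling each span fact over its valid range with a unit step; a small sketch (the unit step and names are illustrative):

```python
def unroll_fact(s, r, o, t_start, t_end, unit=1):
    """Convert a fact valid over [t_start, t_end] into one event per unit time step."""
    return [(s, r, o, t) for t in range(t_start, t_end + 1, unit)]

# Toy usage: a WIKI-style fact valid for three years becomes three yearly events.
events = unroll_fact("A", "memberOf", "B", 2001, 2003)
# [("A", "memberOf", "B", 2001), ("A", "memberOf", "B", 2002), ("A", "memberOf", "B", 2003)]
```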

The difference between the first group and the second group is that facts in the first group (event-based knowledge graphs) happen multiple times (even periodically), while facts in the second group last a long time but are not likely to occur multiple times.

Dataset statistics are described in Table 3.

Dataset #Train #Valid #Test #Entities #Relations Time granularity
GDELT 1,734,399 238,765 305,241 7,691 240 15 mins
ICEWS18 373,018 45,995 49,545 23,033 256 24 hours
ICEWS14 323,895 - 341,409 12,498 260 24 hours
WIKI 539,286 67,538 63,110 12,554 24 1 year
YAGO 161,540 19,523 20,026 10,623 10 1 year
Table 3: Dataset Statistics.

Appendix C Detailed Experimental Settings

Model details for RE-Net. We use Gated Recurrent Units (Cho et al., 2014) as our recurrent event encoder, where the length of history is set as $m = 10$, which means saving the past 10 event sequences. If the events related to $s$ are sparse, we look further back in time until we obtain $m$ previous time steps related to the entity $s$. We pretrain the parameters related to equations 5 and 8 due to the large size of the training graphs, and freeze these parameters while learning the parameters for equations 3 and 4. At inference time, RE-Net performs multi-step prediction across the time stamps in the dev and test sets. In each time step, we save the top-1000 triples to use them as a generated graph. We set the size of entity/relation embeddings to 200, and embeddings that are unobserved during training are randomly initialized. We use a two-layer RGCN in the RGCN aggregator with block-diagonal decomposition of the weight matrices. The model is trained by the Adam optimizer (Kingma & Ba, 2014). We set $\lambda_1$ and $\lambda_2$ to 0.1 and the weight decay rate to 0.00001. All experiments were done on a GeForce GTX 1080 Ti.

Experimental Settings for Baseline Methods. In this section, we provide detailed settings for the baselines. We use public implementations of TransE and DistMult (https://github.com/jimmywangheng/knowledge_representation_pytorch). We implemented TTransE and TA-DistMult based on the implementations of TransE and DistMult, respectively. For TA-DistMult, we use temporal tokens with the vocabulary of year, month, and day on the ICEWS datasets and the vocabulary of year, month, day, hour, and minute on the GDELT dataset. We use a margin-based ranking loss with L1 norm for TransE and a binary cross-entropy loss for DistMult and TA-DistMult. We validate the embedding size between 100 and 200. We set the batch size to 1024, the margin to 1.0, and the negative sampling ratio to 1, and use the Adam optimizer.

We use the OpenKE implementation of ComplEx (https://github.com/thunlp/OpenKE) (Han et al., 2018). We validate the embedding size among 50, 100, and 200. The batch size is 100, the margin is 1.0, and the negative sampling ratio is 1. We use the Adagrad optimizer.

We use the implementation of HyTE (https://github.com/malllabiisc/HyTE). We use every timestamp as a hyperplane. The embedding size is set to 128, the negative sampling ratio to 5, and the margin to 1.0. We use time-agnostic negative sampling (TANS) for entity prediction, and the Adam optimizer.

We use the code for ConvE (https://github.com/TimDettmers/ConvE) and the R-GCN implementation from the Deep Graph Library (https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn). Embedding sizes are 200 for both methods. We use 1-to-all negative sampling for ConvE, a negative sampling ratio of 10 for R-GCN, and the Adam optimizer for both methods. We use the code for Know-Evolve (https://github.com/rstriv/Know-Evolve). For Know-Evolve, we fix the issue in their code; the issue is described in Section E. We follow their default settings.

We use the code for RotatE (https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding). The hidden layer/embedding size is set to 100 and the batch size to 256; other values follow the best configuration for the larger FB15K dataset supplied by the authors. The authors report filtered metrics only, so we added an implementation of the raw setting.

Method ICEWS18 - raw GDELT - raw ICEWS14 - raw
MRR H@1 H@3 H@10 MRR H@1 H@3 H@10 MRR H@1 H@3 H@10

Static

TransE 12.37 1.51 15.99 34.65 7.84 0.00 8.92 23.30 11.17 0.73 14.45 32.29
DistMult 13.86 5.61 15.22 31.26 8.61 3.91 8.27 17.04 9.72 3.23 10.09 22.53
ComplEx 15.45 8.04 17.19 30.73 9.84 5.17 9.58 18.23 11.20 5.68 12.11 24.17
R-GCN 15.05 8.13 16.49 29.00 12.17 7.40 12.37 20.63 15.03 7.17 16.12 31.47
ConvE 22.81 13.63 25.83 41.43 18.37 11.29 19.36 32.13 21.32 12.83 23.45 38.44
RotatE 11.63 4.21 12.31 28.03 3.62 0.52 2.26 8.37 9.79 3.77 9.37 22.24

Temporal

HyTE 7.41 3.10 7.33 16.01 6.69 0.01 7.57 19.06 7.72 1.65 7.94 20.16
TTransE 8.44 1.85 8.95 22.38 5.53 0.46 4.97 15.37 4.34 0.81 3.27 10.47
TA-DistMult 15.62 7.63 17.09 32.21 10.34 4.44 10.44 21.63 11.29 5.11 11.60 23.71
Know-Evolve* 0.11 0.00 0.00 0.47 0.11 0.00 0.02 0.10 0.05 0.00 0.00 0.10
Know-Evolve+MLP 7.41 3.31 7.87 14.76 15.88 11.66 15.69 22.28 16.81 9.95 18.63 29.20
DyRep+MLP 7.82 3.57 7.73 16.33 16.25 11.78 16.45 23.86 17.54 10.39 19.87 30.34
R-GCRN+MLP 23.46 14.24 26.62 41.96 18.63 11.53 19.80 32.42 21.39 12.74 23.60 38.96
RE-Net w/o agg. 23.11 14.46 26.45 39.96 18.90 11.69 20.07 32.93 21.43 12.25 24.12 40.09
RE-Net w/o multi-step 25.67 15.98 29.33 44.65 19.15 11.87 20.34 33.39 23.86 14.63 26.53 42.59
RE-Net (mean pool) 25.45 15.76 29.27 44.31 19.03 11.78 20.20 33.32 22.73 13.52 25.47 41.48
RE-Net 26.62 16.96 30.27 45.57 19.60 12.03 20.56 33.89 23.85 14.63 26.52 42.58
RE-Net w. GT 27.87 18.12 31.60 46.94 21.29 13.99 22.53 35.59 24.88 15.63 27.55 43.63
Table 4: Performance comparison on ICEWS and GDELT datasets with raw metrics. We observe our method outperforms all other methods.
(a) ICEWS18 (MRR)
(b) GDELT (MRR)
(c) WIKI (MRR)
(d) YAGO (MRR)
Figure 6: Performance of temporal link prediction over future time stamps. We report filtered MRR (average metrics in %) on the test sets of ICEWS18, GDELT, WIKI, and YAGO datasets.

Appendix D Additional Experiments

Table 4 shows the performance comparison on ICEWS18, GDELT, and ICEWS14 with the raw setting. Our proposed RE-Net outperforms all other baselines. Fig. 6 shows the performance comparisons over different time stamps on the ICEWS18, GDELT, WIKI, and YAGO datasets with filtered MRR. Our proposed RE-Net consistently outperforms the baselines over time.

Appendix E Implementation Issues of Know-Evolve

We found a problematic formulation in the Know-Evolve model and code. The intensity function (equation 3 in (Trivedi et al., 2017)) is defined as $\lambda_{r}^{s,o}(t \mid \bar{t}) = f\big(g_{r}^{s,o}(\bar{t})\big) \cdot (t - \bar{t})$, where $f(\cdot)$ is a score function, $t$ is the current time, and $\bar{t}$ is the most recent time point when either the subject or the object entity was involved in an event. This intensity function is used in inference to rank entity candidates. However, it does not consider concurrent events at the same time stamp, and thus $\bar{t}$ becomes $t$ after one event. For example, suppose we have two concurrent events $(s, r, o_1, t)$ and $(s, r, o_2, t)$. After the first event, $\bar{t}$ becomes $t$ (subject $s$'s most recent time point), and thus the value of the intensity function for the second event will be 0 since $t - \bar{t} = 0$. This is problematic at inference time: if $t = \bar{t}$, then the intensity function is always 0 regardless of the entity candidate. In inference, all object candidates are ranked by the intensity function, but all intensity scores will be 0 since $t = \bar{t}$, which means all candidates have the same score of 0. In their code, they give the highest rank (first rank) to all entities, including the ground-truth object, in this case. Thus, we fixed their code for a fair comparison; we give an average rank to entities that have the same score.
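
The fix described above, assigning tied candidates their average rank instead of the first rank, can be sketched as follows (a generic re-implementation for illustration, not the patched Know-Evolve code):

```python
def average_rank_of(true_idx, scores):
    """Rank of the true candidate, giving tied scores their average rank."""
    true_score = scores[true_idx]
    higher = sum(s > true_score for s in scores)
    ties = sum(s == true_score for s in scores)  # includes the true candidate itself
    return higher + (ties + 1) / 2.0

# With all-zero scores (the degenerate case described above), the true entity now
# receives the average rank (n + 1) / 2 = 5.5 instead of rank 1.
rank = average_rank_of(3, [0.0] * 10)
```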

Appendix F Theoretical Analysis

Here we analyze the capacity of RE-Net to capture complex time-invariant local structure, as in (Hamilton et al., 2017), as well as emerging global community structure, as in (You et al., 2018).

Theorem 1

Let $G_t$ be the snapshot of the temporal knowledge graph after $t$ time-steps, and let $x_v$ be the input feature representation for Algorithm 1 of each entity node $v$. Suppose that there exists a fixed positive constant $C$ such that $\|x_v - x_{v'}\|_2 > C$ for all pairs of entities $v, v'$. Then, for any $\epsilon > 0$, there exists a parameter setting for RE-Net such that after $L$ layers of aggregation

$|z_v - c_v| < \epsilon \quad \text{for all } v,$

where $z_v$ are output values generated by RE-Net and $c_v$ are the clustering coefficients of $G_t$.

Observation 1

Consider a temporal graph under the stochastic block model described in Section F.2. Let $x_v$ be the input feature representation for Algorithm 1 of each node. Suppose that a constant portion of the input representations can be linearly separated by a hyperplane, while the representations of the other nodes lie on the hyperplane. Then there exists a parameter setting of RE-Net that can output the probability that a new node is connected to node $v$.

F.1 Proof for Theorem 1

Using the pooling aggregator of GraphSAGE, we can copy its behavior with an appropriate setting of the recurrent weight matrices of the RNN model, so that the hidden state depends only on the current aggregated input. In this case, we lose all time-dependency of RE-Net and the representation model becomes time-invariant. However, RE-Net then has exactly the same model capacity as GraphSAGE.

F.2 Analysis for Observation 1

Here we define the generation process of our temporal graph. Assume that the generation process of the graph follows a stochastic block model and that there are two communities in the graph. Half of the nodes belong to community A and the other half belong to community B. Nodes within one community have probability $p$ of being connected, while pairs across communities have probability $q$ of being connected. The edges are introduced into the graph one at a time: over a sequence of time-steps, a new node is introduced to a community and each of its edges is added to the graph.

This observation follows from three facts: (1) For each node in the neighborhood, using the pooling aggregator, we can detect its community assignment; we assign the output for community A to be $1$ and the output for community B to be $-1$. (2) The error of incorrectly discerning the community of a node decreases exponentially with the number of links. For example, let the node be in community A and let the total number of nodes at time $t$ be $n$; by Hoeffding's inequality, the probability of misclassifying the node's community decays exponentially in the number of its links.

(3) Given the correct community classification, the relation classifier is able to predict the probability of linking nodes.

Combining these three facts, RE-Net is able to infer the community structure of the node.
