1 Introduction


We study the problem of learning Granger causality between event types from asynchronous, interdependent, multi-type event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. The key idea of CAUSE is to first implicitly capture the underlying event interdependency by fitting a neural point process, and then extract from the process a Granger causality statistic using an axiomatic attribution method. Across multiple datasets riddled with diverse event interdependency, we demonstrate that CAUSE achieves superior performance on correctly inferring the inter-type Granger causality over a range of state-of-the-art methods.


1 Introduction

Many real-world processes generate a massive number of asynchronous and interdependent events in real time. Examples include the diagnosis and drug prescription histories of patients in electronic health records, the posting and responding behaviors of users on social media, and the purchase and selling orders executed by traders in stock markets, among others. Such data can be generally viewed as multi-type event sequences, in which each event consists of both a timestamp and a type label, indicating when and what the event is, respectively.

In this paper, we focus on the fundamental problem of uncovering causal structure among event types from multi-type event sequence data. Since the question of “true causality” is deeply philosophical (schaffer2003metaphysics), and there are still massive debates on its definition (pearl2009causality; imbens2015causal), we consider a weaker notion of causality based on predictability—Granger causality. While Granger causality was initially used for studying the dependence structure for multivariate time series (Granger1969; Dahlhaus2003), it has also been extended to multi-type event sequences (Didelez2008). Intuitively, for event sequence data, an event type is said to be (strongly) Granger causal for another event type if the inclusion of historical events of the former type leads to better predictions of future events of the latter type.

Due to their asynchronous nature, in the literature, multi-type event sequences are often modeled by multivariate point process (MPP), a class of stochastic processes that characterize the random generation of points on the real line. Existing point process models for inferring inter-type Granger causality from multi-type event sequences appear to be limited to a particular case of MPPs—Hawkes process (Eichler2017; xu2016learning; achab2017uncovering), which assumes past events can only independently and additively excite the occurrence of future events according to a collection of pairwise kernel functions. While these Hawkes process-based models are very interpretable and many include favorable statistical properties, the strong parametric assumptions inherent in Hawkes processes render these models unsuitable for many real-world event sequences with potentially abundant inhibitive effects or event interactions. For example, maintenance events should reduce the chances of a system breaking down, and a patient who takes multiple medicines at the same time is more likely to experience unexpected adverse events.

Regarding event sequence modeling in general, a new class of MPPs, loosely referred to as neural point processes (NPPs), has recently emerged in the literature (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). NPPs use deep (mostly recurrent) neural networks to capture complex event dependencies, and thus excel at predicting future events due to their model flexibility. However, NPPs are uninterpretable and unable to provide insight into the Granger causality between event types.

To address this tension between model explainability and model flexibility in existing point process models, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a framework for obtaining Granger causality from multi-type event sequences using information captured by a highly predictive NPP model. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency, then it computes a novel Granger causality statistic by inspecting the trained NPP with an axiomatic attribution method. In this way, CAUSE is the first technique that brings model-agnostic explainability to NPPs and endows NPPs with the ability to discover Granger causality from multi-type event sequences exhibiting various types of event interdependencies.

Contributions. Our contributions are:

  • We bring model-agnostic explainability to NPPs and propose CAUSE, a novel framework for learning Granger causality from multi-type event sequences exhibiting various types of event interdependency.

  • We design a two-level batching algorithm that enables efficient computation of Granger causality scalable to millions of events.

  • We evaluate CAUSE on both synthetic and real-world datasets riddled with diverse event interdependency. Our experiments demonstrate that CAUSE outperforms other state-of-the-art methods.

Reproducibility. We publish our data and our code at https://github.com/razhangwei/CAUSE.

2 Background

In this section, we first establish some notation and then briefly introduce the background for several highly relevant topics.

2.1 Notations

Suppose there are subjects and each subject is associated with a multi-type event sequence , where is the timestamp of the -th event of the -th sequence, is the corresponding event type, and is the sequence length. We denote by the one-hot representation of each event type , and use to represent the set for any positive integer . To avoid clutter, we omit the subscript/superscript of index when we are discussing a single event sequence and no confusion arises.

2.2 Multivariate Point Process

Multivariate point processes (MPPs) (Daley2003) are a particular class of stochastic processes that characterize the dynamics of discrete events of multiple types in continuous time. The most common way to define an MPP is through a set of conditional intensity functions (CIFs), one for each event type. Specifically, let be the number of events of type that have occurred up to , and let be the history of all types of events before . The CIF for event type is defined as the expected instantaneous event occurrence rate conditioned on history, i.e.,

where the use of the asterisk is a notational convention to emphasize that intensity is conditioned upon .

Different parameterizations of CIFs lead to different MPPs. One classic example of MPP is the multivariate Hawkes process (hawkes1971point; hawkes1971spectra), which assumes each to be of the following form:


where is the baseline rate for event type , and for any is a non-negative-valued function (usually referred to as kernel function) that characterizes the excitation effect of event type on type .

More recently, a class of MPPs loosely referred to as neural point processes have emerged in the literature (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). These models parameterize CIFs with deep neural networks and generally follow an encoder-decoder design: an encoder successively embeds the event history into a vector for each , and a decoder then predicts with the future CIFs after time (until the next event is generated).

Most MPPs are trained by minimizing the negative log-likelihood (NLL):


where is the CIF of the -th sequence for the type . In (2), the first term corresponds to the NLL of an event of type being observed at for the -th sequence, whereas the second term is the NLL of the observation that no events of any type occurred during the window . When there are no closed-form expressions for the integrals , Monte-Carlo approximation needs to be used to approximate either the integrals themselves or their gradients with respect to the parameters. However, these approximation techniques are inefficient and generally suffer from large variances, resulting in low convergence rate.

2.3 Granger Causality for Multi-Type Event Sequences

The Granger causality definition for multi-type event sequences is established based on point process theory (Daley2003). To proceed formally, for any , we denote by the natural filtration expanded by the sub-process ; that is, the sequence of smallest -algebra expanded by the past event history of any type and , i.e., .1 We further write for any . \deletedSomesh: Describe measure theory in Appendix

Definition 1.

(Eichler2017) For a -dimensional MPP, event type is Granger non-causal for event type if is -measurable for all .

The above definition amounts to saying that a type is Granger non-causal for another type if, given the history of events other than type , historical events of type do not further contribute to future at any time. Otherwise, type is said to be Granger causal for type .

Uncovering Granger causality from event sequences generally is a very challenging task, as the underlying MPP may have rather complex CIFs with abundant event interaction and non-excitative effect. As a result, existing work tends to restrict consideration to certain classes of parametric MPPs, such as Hawkes processes (Eichler2017; xu2016learning; achab2017uncovering). Specifically, for multivariate Hawkes process, it is straightforward from (1) that a type is Granger non-causal for another type if and only if the corresponding kernel function . \deletedHowever, Hawkes process assumes that past events can only independently and additively excite (and not inhibit) the occurrence of future events; this strong parametric assumption can be easily violated on real-world data. For example, maintenance events should reduce the chances of a system breaking down, and a patient who takes multiple medicines at the same time is more likely to experience unexpected adverse events.

2.4 Attribution Methods

We view an attribution method for black-box functions as another “black box”, which takes in a function, an input, and a baseline, and outputs a set of meaningful attribution scores, one per feature. The following is a formal definition of attribution method.

Definition 2 (Attribution Method).

Suppose is a -dimensional input and a suitable baseline. Let be a class of functions from to . A functional is called an attribution method for if measures the contribution of to the prediction relative to for any , , and .

Since it is very challenging (and often subjective) to compare different attribution methods, Sundararajan2017 argue that attribution methods should ideally satisfy a number of axioms (i.e., desirable properties), which we re-state in Definition 3.

Definition 3.

An attribution method is said to satisfy the axiom of:

  1. Linearity, if for any , , and ,

  2. Completeness/Efficiency, if for any and ,

  3. Null player, if for any such that does not mathematically depend on a dimension ,


    for all .

  4. Implementation invariance, if for any , and any such that for all ,


Besides these four axioms, we also identify two other useful properties of attribution methods, which are less explicitly mentioned in the literature. We state these two properties in Definition 4.

Definition 4.

An attribution method is said to satisfy

  1. Fidelity-to-control, if for any , , and ,

  2. Batchability, if for any and any input/baseline pairs , there exists a function such that


    where and .

Many popular attribution methods satisfy most of these six properties, as we show in the Proposition 1 and 2.

Proposition 1.

Integrated Gradients (Sundararajan2017) satisfies all four axioms (A1A4) and two properties (P1P2), and DeepLIFT (shrikumar2017learning) satisfies all but the implementation invariance (A4). In particular, a choice of for both methods satisfying batchability (P2) is .

Proposition 2.

For any , let and define to be the spliced data point between and such that for any


Then Shapley values (shapley1953value) with a value function satisfies all four axioms (A1A4) and the fidelity-to-control (P1).


We include proofs for both propositions in Appendix B. ∎

3 Proposed: CAUSE

In this section, we formally present CAUSE, a novel framework for learning Granger causality from multi-type event sequences. Our framework consists of two steps: first, it trains a neural point process (NPP) to fit the underlying event sequence data; then it inspects the predictions of the trained NPP to compute a Granger causality statistic with some “well-behaved” attribution method , which we assume satisfies the following properties: linearity (A1), completeness (A2), null player (A3), fidelity-to-control (P1), and batchability (P2).

In what follows, we first describe the architecture of the used NPP in Section 3.1. Then we elaborate the intuition and the definition of our Granger causality statistic in Section 3.2. Section 3.3 explains the computational challenges and presents a highly efficient algorithm for computing such statistic. We conclude this section by discussing the choice of attribution methods for CAUSE in Section 3.4.


Somesh: Need a picture discribing the overall architecture of CAUSE.

3.1 A Semi-Parametric Neural Point Process

The design of our NPP follows the general encoder-decoder architecture of existing NPPs (Section 2.2), but we innovate the decoder part to enable both modeling flexibility and computational feasibility.

Encoder. First, we convert each event into an embedding vector that summarizes both the temporal and the type information for that event, as follows:


where is a pre-specified function that transforms the elapsed time into one or more temporal features (simply chosen to be identity function in our experiments), is the embedding matrix for event types, and recall that is the one-hot encoding of the even type .

We then obtain the embedding of a history from event embedding sequences by


where is a sequence encoder\replaced and chosen to be a Gated Recurrent Unit (GRU) (Cho2014) in our experiments.. It is chosen to be a Gated Recurrent Unit (GRU) (Cho2014) in our experiments, although alternative architectures, such as LSTMs, RNNs with attention, and transformers (Vaswani2017), are also applicable.

Decoder. An ideal decoder should fullfill the following two desiderata: (a) it should be flexible enough to produce from a wide variety of with complex time-varying patterns; and (b) it should also be computationally manageable, particularly in terms of computing the cumulative intensity , a key term in the log-likelihood-based training given in (2) and the definition of our Granger causality statistic in the subsequent subsections.

We propose a novel semi-parametric decoder that enjoys both the flexible modeling of CIF and computational feasibility. Specifically, for each , we define the CIF on to be a weighted sum of a set of basis functions, as follows:


where is a set of pre-specified positive-valued basis functions, and is a feedforward neural network that computes positive weights for each of the event types. In this way, by choosing to be a rich-enough function family, the CIFs defined in (6) are able to express a wide variety of time-varying patterns. More importantly, since the parts relevant to neural networks— and —are separated from the basis functions, we can evaluate the integral analytically, as follows:


where is generally available for many parametric basis functions.


We choose the basis functions to be the densities of a Gaussian family with increasing means and variances. This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. The details are given in Appendix B.3. Inspired by the dyadic interval bases used by Bao2017, we choose the basis functions to be the densities for a Gaussian family , whose means are given by


and the standard deviations by for . This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. As shown in Figure 4 for an example of and , the first a few bases, due to their small means and variances, capture the short-term effects, whereas the last several characterize the mid/long-term effects.

3.2 From Event Contributions to a Granger Causality Statistic

Now that we have trained a flexible NPP that can successively update the history embedding after each event occurrence and then predict the future CIFs after until ; we would like to ask: can we quantify the contribution of each past event to each prediction? Since in our case ’s are instantiated by two potentially highly nonlinear neural networks (i.e., and ), it is not as straightforward to obtain the past event’s contribution to current event occurrence as in the case of some parametric MPPs (e.g., Hawkes processes).

A natural idea for the aforementioned question would be applying some attribution method to ’s. To do so, however, there are two challenges. First, the predictions in our case are time-varying functions rather than static quantities (e.g., the probability of a class, as commonly seen in existing applications of attribution methods); thus it is unclear which target should be attributed. Second, as the input to ’s are multi-type event sequences with asynchronous timestamps, it is also unclear which baseline should be used.

We tackle the first challenge by setting the cumulative intensity to be the attribution target. This is not only because the cumulative intensity reflects the overall effect of on , but also because it has a clear meaning in the context of point processes: it is the rate of the Poisson distribution that the number of events of type on satisfies. More importantly, since the cumulative intensity has a closed form as in (7), its gradients with respect to its input can be computed both precisely and efficiently. Note that by adopting this target, the input now includes not only but also ; thus we define and write the target as .

As for the second challenge, we choose the baseline of an input to be ; that is, the one-hot encodings of all observed event types are replaced with zero vectors. Since and only differ in the dimensions corresponding to the event types, i.e., , by the fidelity-to-control (P1), then only these dimensions will have non-zero attributions. With completeness (A2), it further implies that for every type


where is the attribution to . Thus, the term can be viewed as the event contribution of the -th event to the cumulative intensity prediction relative to the baseline . Besides, event timestamps are identical in and , thus this contribution comes only from the event type and denotes how type contributes to the prediction of type for a specific event history .

A Granger Causality Statistic. We have established ’s as the past events’ contribution to the cumulative intensity on interval . A further question is: can we infer from these event contributions for individual predictions the population-level Granger causality among event types?

To answer this question, we define a novel statistic indicating the Granger causality for type to type as follows:


Here the numerator aggregates the event contributions for all event occurrences over the whole dataset, and denominator accounts for the fact that some event types may occur far more frequently than other types, which can lead to unreasonally large scores if used without such normalization. Note that an event contribution may be negative when the event exerts an inhibitive effect; thus can also be negative and characterize the Granger causality from type to type even when the influence is inhibitive.


Prasad: What’s the connection between this statistics and the definition of 1 of Granger causality?

Attribution Regularization. One caveat in (9) and (10) is that our chosen baselines have never appeared in the training procedure, thus the value of may be meaningless or even misleading. Ideally, we would like to be the cumulative intensity of type given that history prior to consists of “null” events at . Thus a natural prior reflecting this idea is to make nearly zero for any handcrafted baseline . Such an “invariance” property on can be achieved by adding an auxiliary regularization for each in the NLL given in (2), leading to a training objective


where is a hyperparameter.

3.3 Computing the Granger Causality Statistic

While (10) defines ’s analytically, it is rather challenging to compute them. This is because a naive implementation would require applying at each event occurrence, which is computationally prohibitive for a dataset of millions of events. Note that the normalization in (10) can be easily calculated; so if we write , the problem is reduced to how to efficiently compute .

We propose an efficient algorithm to compute , which is summarized in Algorithm LABEL:alg:batching-scheme. At the core of our algorithm are two levels of batching: (a) intra-sequence batching, which allow the computation of with only one call of ; and (b) inter-sequence batching, which enables batch computation of for a mini-batch of event sequences indexed by . We explain the details of these two levels of batching as follows.


algocf[!tbp]     \end@float

Intra-Sequence Batching. As this part only deals with a particular event sequence, to simplify the notation, we omit the sequence index for now. Note that and due to the recurrent nature of , all for can be computed in a single forward pass with the shared input . Denote by a matrix-valued function such that for any .

The equivalence between and means that,

which further implies that we can rewrite as a weighted sum of attribution scores for the same input and baseline . Since we are not interested in computing the individual attribution scores but their sum, we can leverage the linearity property (A1) to compute the attribution scores directly for the sum, as shown in the following proposition.

Proposition 3.

For an attribution method with the linearity (A1) and the null player (A3), it holds that


The proof is in Appendix B.4. ∎

Inter-Sequence Batching. We now discuss how to efficiently compute for a mini-batch of event sequences indexed by . The key idea for a significant computational speed-up here is that if satisfies batchability (P2), we can then batch the computation of different sequences with a single call of .

To simplify the discussion, we assume without loss of generality that and for all . Let and analogously the corresponding baselines . We further override our previous notation and denote by a new tensor-valued function such as that . Then with Proposition 1, we have that

Time Complexity Analysis. With our two-level batching scheme, Algorithm LABEL:alg:batching-scheme only requires invocations of , a significant reduction from the invocations required by a naive implementation that directly calculates ’s, where is the average sequence length. Since modern computation hardware (such as GPUs) enables calling with a batch of inputs being almost as fast as calling it with a single input, our algorithm can achieve up to three orders-of-magnitude speedup over a naive implementation on datasets with relatively large and . (See Section 4.2.3 for empirical evaluations.)

3.4 Choice of Attribution Methods

In our experiments, we choose the attribution method to be the Integrated Gradients, which is defined as follow:


where is the Hadamard product. Nevertheless, CAUSE does not depend on a specific attribution method but a set of properties that we have stated upfront; this means that any other attribution methods that satisfy these properties (e.g., DeepLIFT) should be applicable to CAUSE. Also note that batchability (P2) is only used in the inter-sequence batching for speeding up the computation; thus, if efficiency is less of a concern, or the computation of attributions for different inputs can be accelerated in alternative ways,2 attribution methods that only violate batchability, such as Shapley values, should also be applicable.

4 Experiments

In this section, we present the experiments that are designed to evaluate CAUSE and answer the following three questions:

  • Goodness-of-Fit: How good is CAUSE at fitting multi-type event sequences?

  • Causality Discovery: How accurate is CAUSE at discovering Granger causality between event types?

  • Scalability: How scalable is CAUSE?

The experimental results on five datasets show that CAUSE (a) outperforms state-of-the-art methods in both fitting and discovering Granger causality from event sequences of diverse event interdependency, (b) can identify Granger causality on real-world datasets that agrees with human intuition, and (c) can compute the Granger causality statistic three orders-of-magnitude faster due to our optimization.

4.1 Experimental Setup

Datasets. We designed three synthetic datasets to reflect various types of event interactions and temporal effects.

  • Excitation: This dataset was generated by a multivariate Hawkes process, whose CIFs are defined in (1). The exponential decay kernels were used, and a weighted ground-truth causality matrix was constructed with the norms of the kernel functions .

  • Inhibition: This dataset was generated by a multivariate self-correcting process (isham1979self), whose CIFs are of the form: , where and . A weighted ground-truth causality matrix was formed with the pairwise weights .

  • Synergy: Generated by a proximal graphical event model (PGEM) (Bhattacharjya2018), this dataset contains synergistic effects between a pair of event types to a third event type. A binary ground-truth causality matrix was constructed from the dependency graph of the PGEM.

We also included two real-world datasets used in existing literature.

  • IPTV (Luo2015): Each sequence records the history of TV watching behavior of a user, and the event types are the TV program categories. This dataset, however, does not contain ground-truth causality between TV program categories.

  • MemeTracker (MT):3 Each sequence represents how a phrase or quote appeared on various online websites over time during the period of August 2008 to April 2009, and the event types are the domains of the top websites. Like previous studies (achab2017uncovering; Xiao2019), a weighted ground-truth causality matrix was approximated by whether one site contains any URLs linking to another site.

The parameter settings for synthetic datasets, the preprocessing steps for real-world datasets, and the dataset statistics are detailed in Appendix C.1.

Methods for Comparsison. We compared our method to the following baselines:

  • HExp: Hawkes process with fixed exponential kernels.

  • HSG and NHPC: Hawkes process with sum of Gaussian kernels (xu2016learning) and nonparametric Hawkes process cumulant matching (achab2017uncovering). These two methods represent the state-of-the-art parametric and nonparametric methods for learning Granger causality for Hawkes process, respectively.

  • RPPN: Recurrent point process network (Xiao2019), to the best of our knowledge, the only NPP that can provide summary statistics for Granger causality, which is enabled by its use of an attention mechanism.

The implementation details and hyperparameter configurations for CAUSE and various baselines are provided in Appendix C.2

Evaluation Metrics. The hold-out negative log-likelihood (NLL) was used for evaluating the goodness-of-fit of each method on various datasets, and the Kendall’s coefficient and the area under the ROC curve (AUC) were used for evaluating the estimated Granger causality matrix against the ground truth. Non-binary ground-truth causality matrices were binarized at zero in the evaluation of AUCs. We performed five-fold cross-validation and report the average results.

4.2 Detailed Results


We start by examining the goodness-of-fit of various methods on various datasets, since if a method fails to fit the data, it is unlikely to detect the true Granger causality between event types. As shown in Figure 1, CAUSE attains smaller NLLs than all baselines on all datasets, suggesting that CAUSE consistently has a better fit than all baselines. Notably, on all three synthetic datasets, the NLLs of CAUSE nearly match those computed by the ground-truth models. These results confirm the flexibility of CAUSE in learning the various types of event interactions and temporal effects.

(a) Excitation
(b) Inhibition
(c) Synergy
(d) IPTV
(e) MT
Figure 1: Hold-out NLLs of various methods, where horizontal lines denote the ground-truth NLLs. CAUSE attains the best NLLs on all datasets.

Causality Discovery

We now examined the performance of CAUSE on Granger causality discovery, both quantitatively and qualitatively.

Quantitative Analsyis.
Excitation Inhibition Synergy MT
Method AUC Kendall’s AUC Kendall’s AUC Kendall’s AUC Kendall’s
HExp 0.8580.004 0.4530.005 0.5460.002 0.1020.002 0.8720.058 0.2510.039 0.4040.009 -0.0610.005
HSG 0.9970.001 0.6350.002 0.4900.002 -0.0130.002 0.8760.007 0.2540.039 0.5390.008 0.0240.005
NPHC 0.7820.007 0.3370.010 0.4000.054 -0.1380.067 0.7410.129 0.1630.087 N/A N/A
RPPN 0.5950.010 0.1360.012 0.4480.003 -0.0660.002 0.8910.043 0.2640.029 0.4920.004 -0.0050.002
CAUSE 0.9200.012 0.5330.013 0.9210.021 0.5320.021 0.9910.004 0.3310.003 0.6230.012 0.0750.007
Table 1: Results for Granger causality discovery on the four datasets with ground-truth causality available. The best and the second best results on each dataset are emboldened and italicized, respectively.

Table 1 shows values of AUC and Kendall’s of various methods on the four datasets that have ground-truth causality. The results in the table support the following conclusions.

First, CAUSE performs the best overall and is most robust to various types of event interactions. It not only significantly outperforms all baselines on three of the four datasets (i.e., Inhibition, Synergy, and MT), but also achieves a close-second on Excitation, in which events were generated by a Hawkes process, and CAUSE is supposed to have a disadvantage relative to Hawkes process-based baselines.

Second, once the underlying data generation process violates the assumptions of Hawkes process and exhibits complex event interactions other than excitation, Hawkes process-based methods tend to perform poorly, as seen from Inhibition and Synergy.

Finally, despite both being NPP-based methods, RPPN performs significantly worse than CAUSE on all datasets. We suspect that this underperformance is caused by two issues in RPPN’s construction of the Granger causality statistics with the attention weights. First, RPPN restricts all attention weights to be positive, thus cannot distinguish between excitative and inhibitive effects. Second, attention mechanism may not correctly attribute the model’s prediction to its inputs, as shown in several recent studies (jain2019attention; Serrano2019).

Qualitative Analysis. Figure 2 shows the heat map for the Granger causality matrix of IPTV dataset estimated by CAUSE. Almost all diagonal entries have large positive values, indicating that users, on average, exhibit strong tendencies to watch the TV programs of the same category. Several positive associations between different TV program categories are also observed, such as from military, laws, finance, and education to news, and from kids and music to drama. These results agree with common sense and are consistent with the findings of an existing study with HSG (xu2016learning). Our method also suggests several meaningful negative associations, including ads to drama and education to entertainment; such negative associations, however, can never be detected by models that only consider the excitations between events, such as HSG.


Appendix C.4 provides a detailed analysis of the estimated Granger causality matrix for MT dataset.

Figure 2: Visualization of the estimated Granger causality statistic matrices on IPTV \replaced. and MT (Figure (a)a(b)b). Better viewed on screen.


Finally, we investigate the scalability of CAUSE in computing the Granger causality statistic by Algorithm LABEL:alg:batching-scheme. Figure 3 shows how much speedup Algorithm LABEL:alg:batching-scheme achieves over a naive implementation with different sequence lengths and batch sizes. The results demonstrate that with batch size and average sequence length both being relatively large (i.e., greater or equal to and , respectively), our algorithm can achieve over three orders-of-magnitude speedup relative to a native implementation. Furthermore, the speedup scales almost linearly with sequence length and batch size when they do not exceed and , respectively, which is consistent with our analysis in Section 3.3. Beyond this regime, only a sublinear relationship between the speedup and batch size or sequence length is observed, which is because the GPU we tested on was reaching its maximum utilization.

Figure 3: The speedup achieved by Algorithm LABEL:alg:batching-scheme relative to a naive implementation with different average sequence lengths and batch sizes.

5 Conclusion

We have presented CAUSE, a novel framework for learning Granger causality between event types from multi-type event sequences. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency, then it computes a novel Granger causality statistic by inspecting the trained model with an axiomatic attribution method. A two-level batching algorithm is derived to compute the statistic efficiently. We evaluate CAUSE on both synthetic and real-world datasets abundant with diverse event interactions and show the effectiveness of CAUSE on identifying the Granger causality between event types.


Appendix A Additional Related Work

Event Sequence Modeling With the increasing availability of multi-type event sequences, there has been considerable interest in modeling such data for both prediction and inference. The majority of prior research in this direction has been based on the theory of point processes (Daley2003), a particular class of stochastic processes that characterize the distribution of random points on the real line. Notably, Hawkes process (hawkes1971point; hawkes1971spectra), a special class of point process, has been widely investigated, partly due to its ability to capture mutual excitation among events and its mathematical tractability. However, Hawkes process assumes that past events can only independently and additively influence the occurrence of future events, and that influence can only be excitative; these inherent limitations have restricted its modeling flexibility and render it unable to capture complex event interaction in real-world data.

As such, other more flexible models have been proposed, including the piecewise-constant conditional intensity model (PCIM) (Gunawardana2011) and its variants (Weiss2013; Bhattacharjya2018), and more recently a class of models loosely referred to as neural point processes (NPPs) (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). These models, particularly NPPs, generally enjoy better predictive performance than parametric point processes, since they use more expressive machine learning models (e.g., decision trees, random forests, or recurrent neural networks) to sequentially compute the conditional intensity until next event is generated. A significant weakness of these models, however, is that they are generally uninterpretable and thus unable to provide summary statistics for determining the Granger causality among event types.

Granger Causality Discovery In his seminal paper, Granger1969 first proposed the concept of Granger causality for time series data. Many approaches have been proposed for uncovering Granger causality for multivariate time series, including the Hiemstra-Jones test (Hiemstra1994) and its improved variant (Diks2006), Lasso-Granger method (Arnold2007), and approaches based on information-theoretic measures (Hlavackova-Schindler2007). However, as these methods are designed for the synchronous multivariate time series, they are not directly applicable to asynchronous multi-type event sequence data, since otherwise one has to discretize the continuous observation window.

Didelez2008 first established the Granger causality for event types in event sequences under the framework of marked point processes. Later, Eichler2017 shows that Granger causality for Hawkes processes is entirely encoded in the excitation kernel functions (also called impact function). To our best knowledge, existing research for Granger causality discovery from event sequences appears to be limited to the case of Hawkes process (Eichler2017; xu2016learning; achab2017uncovering), possibly because of this direct link between the process parameterization and Granger causality.

Prediction Attribution for Black-Box Models Prediction attribution, the task of assigning to each input feature a score for representing the feature’s contribution to model prediction, has been attracting considerable interest in the field due to its ability to provide insight into predictions made by black-box models such as neural networks. While various approaches have been proposed, there are two prominent groups of approaches: perturbation-based and gradient-based approaches. Perturbation approaches (Zeiler2014) typically comprise, first, removing, masking, or altering a feature, and then measuring the attribution score of that feature by the change of the model’s output. While perturbation-based methods are simple, intuitive, and applicable to almost all black-box models, the quality of the resultant scores is often sensitive to how the perturbation is performed. Moreover, as these methods scale linearly with the number of input features, they become computationally unaffordable for high-dimensional inputs.

In contrast, backpropagation-based methods construct the attribution scores based on the estimation of local gradients of the model around the input instance with backpropagation. The ordinary gradients, however, could suffer from a “saturation” problem for neural networks with activation functions that contain constant-valued regions (e.g., rectifier linear unit (ReLUs)); that is, the gradient coming into a ReLU during the backward pass is zero’d out if the input to the ReLU during the forward pass is in a constant region. One valid solution to this issue is to replace gradients with discrete gradients and use a modified form of backpropagation to compose discrete gradients into attributions, such as layer-wise relevance propagation (LRP) (Bach2015) and DeepLIFT (shrikumar2017learning). Another solution, proposed by Integrated Gradient (IG) (Sundararajan2017), is to use the line integral of the gradients along the path from the input to a chosen baseline. Sundararajan2017 show that IG satisfies many desirable properties, as detailed in Proposition 1.

It is worth mentioning that much existing work often uses the intermediate results, produced by certain intelligible neural network architecture, as the attribution scores for an input. A most notable example of such an idea is the use of attention weights induced by some attention mechanism as the importance of the input (Bahdanau2015; Xu2015). Recently, however, there are growing concerns on the validity of attention weights being used as the explanation of neural networks (jain2019attention; Serrano2019). In particular, jain2019attention show that across a variety of NLP tasks, the learned attention weights are frequently uncorrelated with feature importance produced by gradient-based prediction attribution methods, and random permutation of attention weights can nonetheless yield equivalent predictions.

Appendix B Additional Technical Details

b.1 Proof of Proposition 1


That both IG and DeepLIFT satisfy A1A4 has been established in (Sundararajan2017). P1 is straightforward from the definion of either method. Thus, we only prove that both methods satisfy batchability (P2) with .

To prove that IG satisfies batchability, we first rewrite the as follows:

where the second step is due to that summation and gradients are swapable, and the third step is because the gradients of different terms are separable. Thus, we have


which establishes the formula.

The proof of DeepLIFT satisfying batchability can be established in a similar way as IG. The key part, shown in the Proposition 2 of (Ancona2017), is that the attribution scores produced by DeepLIFT for a neural-network-like function , an input , and a baseline , i.e., can be viewed as the Hadamard product between and a modified gradient of at all its internal nonlinear layers. Since the last layer of is a simple linear addition of all ’s, the modified gradient of for input is the same as the one of for . Thus, we have


b.2 Proof of Proposition 2

We first briefly review Shapley values. Suppose there is a team of players working together to earn a certain amount of value. The value that every coalition achieves is , where is a value function. Shapley values, proposed by shapley1953value, provide a well-motivated way to decide how the total earning should be distributed among such players. Specifically, the Shapley value for each player is defined as


For any target function , input , and baseline , we define a value function for any , where is the spliced data point between and , defined in (3). Then the Shapley values can be viewed as an attribution method that provides the attribute scores for any , , and .

Now we prove that this attribution method based on Shapley valeus satisfies all four axioms (A1A4) and the fidelity-to-control (P1), as stated in Proposition 2.


First, it is clear from the definition of Shapley values in (16) that satisfies linearity (A1) and implementation variance (A4). Since shapley1953value shows that for any value function , the Shapley values satisfies that

substituting our definition of the value function into the above equation yields

which establishes the completeness (A2). For any and , we have

Note that and only potentially differ on the -th dimension. If does not depend on the -th dimension of its input or (which implies ), then and thereby . Thus, satisfies null player (A3) and fidelity-to-control (P1).

b.3 Dyadic Gaussian Basis

Inspired by the dyadic interval bases used by Bao2017, we choose the basis functions to be the densities for a Gaussian family , which we term dyadic Gaussian basis. The means of dyadic Gaussian basis are given by


and the standard deviations by for . This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. As shown in Figure 4 for an example of and , the first a few bases, due to their small means and variances, capture the short-term effects, whereas the last several characterize the mid/long-term effects.

Figure 4: An example of dyadic Gaussian bases for and . The first a few bases, due to their small means and variances, can capture the short-term effects, whereas the last several characterize the mid/long-term effects.

b.4 Proof of Proposition 3


We omit the index in this proof for brevity. First, we rewrite as

where in the step step, we replace with . Since , i.e., , does not depend on the events before the -th event, with null player (A3), we have for any , which further implies that

With linearity (A1), we have

which establishes the formula.

Appendix C Additional Experimental Details

c.1 The settings for synthetic and real-world datasets.

We describe below the setup and preprocessing details for the five datasets that we consider in this paper. The statistics of these datasets are summarized in Table 2.

  • Excitation. This dataset was generated by a multivariate Hawkes process, whose CIFs are of the form:

    We set , , , , and . To generate a sparse excitation weight matrix , we first selected all its diagonal entries and random off-diagonal entries, then generated the values for these entries from , and finally scaled the matrix to have a spectral radius of .

  • Inhibition. This dataset was generated by a multivariate self-correcting point process, whose CIFs take the form:

    We chose , , , and . To generated a sparse weight matrix , we first selected all its diagonal entries and random off-diagonal entries and further generated the values for these entries from .

  • Synergy. This dataset was generated by a proximal graphical event model (PGEM) (Bhattacharjya2018). PGEM assumes that the CIF of an event type depends only on whether or not its parent event types (specified by a dependency graph) have occurred in the most recent history. We designed a local dependency structure that consists of five event types labeled as A–E. Among these event types, type E is the outcome and can be excited by the occurrence of type A, B, or C; type A and B, only when both occurred in the most recent history, would incur a large synergistic effect on type E; type C has an isolated excitative effect on type E and does not interacts with other event types; and finally, type D does not have any excitative effect and is introduced to complicate the learning task. The dependency graph, together with the corresponding time windows and intensity tables, illustrated in Figure 5. To add more complexity to this dataset, we further replicated this local structure for another copy, leading to a total of event types. We generated event sequences with a maximum time span of .

  • IPTV. We obtained the dataset from4. We further normalized the timestamps into the days and splitted long event sequences so that the length of each sequence is smaller or equal to .

  • MT. We downloaded the raw MemeTracker phrase data from5. We filtered the phrase data that occurred from 2008-08-01 to 2008-09-30 and from the top-100 website domains. We further normalized the timestamps into hours and filtered out those event sequences (i.e., phrase cascades) whose lengths are not in between and .

Dataset # of events Ground truth
Excitation 1,000 10 250,447 Weighted
Inhibition 1,000 10 250,442 Weighted
Synergy 1,000 10 178,338 Binary
IPTV 1,869 16 966,338 N/A
MT 382,014 100 3,419,399 Weighted
Table 2: Statistics for various datasets.
Figure 5: The dependency graph, time windows, and intensity tables for the PGEM used in generating the Synergy dataset.

c.2 Implementation Details and Hyperparameter Configurations for Various Methods

For CAUSE, the and the were implemented by a single-layer GRU and a two-layer fully connected network with skip connections, respectively. The dimension of event type embeddings was fixed to , and the number of hidden units for GRU was set to be for synthetic datasets and for real-world datasets. The number of basis functions and the maximum mean were chosen by a rule of thumb such that and are of the same scale as the th and the th percentiles of the inter-event elapsed times, respectively. The optimization was conducted by Adam with an initial learning rate of . A hold-out validation set consisting of of the sequences of the training set was used for model selection; the model snapshot that attains the smallest validation loss was chosen. As events sequence lengths vary greatly on two real-world datasets, in constructing mini-batches for both training and inference, we adopted the bucketing technique to reduce the amount of wasted computation caused by padding. Finally, the line integral of IG, defined in (13), was approximated by steps for MT and steps for other datasets; a smaller number of steps, although may result in certain lose of accuracy, allows for a larger batch size and thus shorter execution time for attribution.

For the Hawkes process-based baselines—HExp, HSG, and NHPC—their implementation was provided by the package tick (bacry2017tick). The most relevant hyperparameters for each method were tuned by cross-validation.

As there is no publicly available codes for RPPN, we implemented it with our best effort. Its overall settings for architecture and optimization is similar to the ones for CAUSE.

c.3 Platform and Runtime

All experiments were conducted on a server with a 16-core CPU, 512G memory, and two Quadro P5000 GPUs. On the largest dataset, MT, the total runtime for CAUSE was less than 3 hours, including both training and computing the Granger causality statistic.

c.4 Qualitative Analysis on MT

Since there are too many event types in MT, instead of a heat map, we visualize the causality matrix as a graph and show in Figure (a)a and Figure (b)b the top-two communities of that graph, where the directed edges denote the estimated Granger causality between pairs of domains.6 In Figure (a)a, the domain news.google.com centers in the middle and is pointed by many sites, which is unsurprising because Google News aggregates articles from other publishers and websites. Our method also correctly identifies other major “information-consuming” domains such as bogleheads.org, an active forum for investment-related Q&A. In Figure (b)b, the then very popular social networking website blog.myspace.com sits in the center of the community. Our method also identifies credible excitative relationships between the subdomains of craigslist.org, a mega-website that hosts classified, local advertisements.

(a) MT (Community 1)
(b) MT (Community 2)
Figure 6: The top-two communities of the estimated Granger causality statistic matrices on MT (Figure (a)a(b)b). Better viewed on screen.

Appendix D A Primer on Measure and Probability Theory

In this section, we review some of the basic definitions of in measure theory, which may help the understanding of the definition of Granger causality for multivariate point processes. Most of the content in this section were adapted based on primarily based on the Chapter 1 of (durrett2019probability) and the Appendix 3 of (Daley2003),

Let be a set of “outcomes” and a nonempty collection of subsets of . The set is -algebra of , if it is closed under complement and countable unions; that is,

  1. if , then , and

  2. if is a countable sequence of sets, then .

With these two conditions, it’s easy to see that -algebra is also closed under arbitrary (possibly uncountable) intersections. From this, it follows that given a nonempty set and a collection of of subsets of , there is a smallest -algebra containing ; we denote such smallest -algebra by . One particular -algebra is of particular interst—Borel -algebra; that is, the smallest -algebra containing all open sets in , denoted by . Specifically, let be the empty set plus all sets of the form