Abstract
We study the problem of learning Granger causality between event types from asynchronous, interdependent, multitype event sequences. Existing work suffers from either limited model flexibility or poor model explainability and thus fails to uncover Granger causality across a wide variety of event sequences with diverse event interdependency. To address these weaknesses, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a novel framework for the studied task. The key idea of CAUSE is to first implicitly capture the underlying event interdependency by fitting a neural point process, and then extract from the process a Granger causality statistic using an axiomatic attribution method. Across multiple datasets riddled with diverse event interdependency, we demonstrate that CAUSE achieves superior performance on correctly inferring the intertype Granger causality over a range of stateoftheart methods.
1 Introduction
Many realworld processes generate a massive number of asynchronous and interdependent events in real time. Examples include the diagnosis and drug prescription histories of patients in electronic health records, the posting and responding behaviors of users on social media, and the purchase and selling orders executed by traders in stock markets, among others. Such data can be generally viewed as multitype event sequences, in which each event consists of both a timestamp and a type label, indicating when and what the event is, respectively.
In this paper, we focus on the fundamental problem of uncovering causal structure among event types from multitype event sequence data. Since the question of “true causality” is deeply philosophical (schaffer2003metaphysics), and there are still massive debates on its definition (pearl2009causality; imbens2015causal), we consider a weaker notion of causality based on predictability—Granger causality. While Granger causality was initially used for studying the dependence structure for multivariate time series (Granger1969; Dahlhaus2003), it has also been extended to multitype event sequences (Didelez2008). Intuitively, for event sequence data, an event type is said to be (strongly) Granger causal for another event type if the inclusion of historical events of the former type leads to better predictions of future events of the latter type.
Due to their asynchronous nature, in the literature, multitype event sequences are often modeled by multivariate point process (MPP), a class of stochastic processes that characterize the random generation of points on the real line. Existing point process models for inferring intertype Granger causality from multitype event sequences appear to be limited to a particular case of MPPs—Hawkes process (Eichler2017; xu2016learning; achab2017uncovering), which assumes past events can only independently and additively excite the occurrence of future events according to a collection of pairwise kernel functions. While these Hawkes processbased models are very interpretable and many include favorable statistical properties, the strong parametric assumptions inherent in Hawkes processes render these models unsuitable for many realworld event sequences with potentially abundant inhibitive effects or event interactions. For example, maintenance events should reduce the chances of a system breaking down, and a patient who takes multiple medicines at the same time is more likely to experience unexpected adverse events.
Regarding event sequence modeling in general, a new class of MPPs, loosely referred to as neural point processes (NPPs), has recently emerged in the literature (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). NPPs use deep (mostly recurrent) neural networks to capture complex event dependencies, and thus excel at predicting future events due to their model flexibility. However, NPPs are uninterpretable and unable to provide insight into the Granger causality between event types.
To address this tension between model explainability and model flexibility in existing point process models, we propose CAUSE (Causality from AttribUtions on Sequence of Events), a framework for obtaining Granger causality from multitype event sequences using information captured by a highly predictive NPP model. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency, then it computes a novel Granger causality statistic by inspecting the trained NPP with an axiomatic attribution method. In this way, CAUSE is the first technique that brings modelagnostic explainability to NPPs and endows NPPs with the ability to discover Granger causality from multitype event sequences exhibiting various types of event interdependencies.
Contributions. Our contributions are:

We bring modelagnostic explainability to NPPs and propose CAUSE, a novel framework for learning Granger causality from multitype event sequences exhibiting various types of event interdependency.

We design a twolevel batching algorithm that enables efficient computation of Granger causality scalable to millions of events.

We evaluate CAUSE on both synthetic and realworld datasets riddled with diverse event interdependency. Our experiments demonstrate that CAUSE outperforms other stateoftheart methods.
Reproducibility. We publish our data and our code at https://github.com/razhangwei/CAUSE.
2 Background
In this section, we first establish some notation and then briefly introduce the background for several highly relevant topics.
2.1 Notations
Suppose there are subjects and each subject is associated with a multitype event sequence , where is the timestamp of the th event of the th sequence, is the corresponding event type, and is the sequence length. We denote by the onehot representation of each event type , and use to represent the set for any positive integer . To avoid clutter, we omit the subscript/superscript of index when we are discussing a single event sequence and no confusion arises.
2.2 Multivariate Point Process
Multivariate point processes (MPPs) (Daley2003) are a particular class of stochastic processes that characterize the dynamics of discrete events of multiple types in continuous time. The most common way to define an MPP is through a set of conditional intensity functions (CIFs), one for each event type. Specifically, let be the number of events of type that have occurred up to , and let be the history of all types of events before . The CIF for event type is defined as the expected instantaneous event occurrence rate conditioned on history, i.e.,
where the use of the asterisk is a notational convention to emphasize that intensity is conditioned upon .
Different parameterizations of CIFs lead to different MPPs. One classic example of MPP is the multivariate Hawkes process (hawkes1971point; hawkes1971spectra), which assumes each to be of the following form:
(1) 
where is the baseline rate for event type , and for any is a nonnegativevalued function (usually referred to as kernel function) that characterizes the excitation effect of event type on type .
More recently, a class of MPPs loosely referred to as neural point processes have emerged in the literature (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). These models parameterize CIFs with deep neural networks and generally follow an encoderdecoder design: an encoder successively embeds the event history into a vector for each , and a decoder then predicts with the future CIFs after time (until the next event is generated).
Most MPPs are trained by minimizing the negative loglikelihood (NLL):
(2) 
where is the CIF of the th sequence for the type . In (2), the first term corresponds to the NLL of an event of type being observed at for the th sequence, whereas the second term is the NLL of the observation that no events of any type occurred during the window . When there are no closedform expressions for the integrals , MonteCarlo approximation needs to be used to approximate either the integrals themselves or their gradients with respect to the parameters. However, these approximation techniques are inefficient and generally suffer from large variances, resulting in low convergence rate.
2.3 Granger Causality for MultiType Event Sequences
The Granger causality definition for multitype event sequences is established based on point process theory (Daley2003).
To proceed formally, for any , we denote by the natural filtration expanded by the subprocess ; that is, the sequence of smallest algebra expanded by the past event history of any type and , i.e., .
Definition 1.
(Eichler2017) For a dimensional MPP, event type is Granger noncausal for event type if is measurable for all .
The above definition amounts to saying that a type is Granger noncausal for another type if, given the history of events other than type , historical events of type do not further contribute to future at any time. Otherwise, type is said to be Granger causal for type .
Uncovering Granger causality from event sequences generally is a very challenging task, as the underlying MPP may have rather complex CIFs with abundant event interaction and nonexcitative effect. As a result, existing work tends to restrict consideration to certain classes of parametric MPPs, such as Hawkes processes (Eichler2017; xu2016learning; achab2017uncovering). Specifically, for multivariate Hawkes process, it is straightforward from (1) that a type is Granger noncausal for another type if and only if the corresponding kernel function . \deletedHowever, Hawkes process assumes that past events can only independently and additively excite (and not inhibit) the occurrence of future events; this strong parametric assumption can be easily violated on realworld data. For example, maintenance events should reduce the chances of a system breaking down, and a patient who takes multiple medicines at the same time is more likely to experience unexpected adverse events.
2.4 Attribution Methods
We view an attribution method for blackbox functions as another “black box”, which takes in a function, an input, and a baseline, and outputs a set of meaningful attribution scores, one per feature. The following is a formal definition of attribution method.
Definition 2 (Attribution Method).
Suppose is a dimensional input and a suitable baseline. Let be a class of functions from to . A functional is called an attribution method for if measures the contribution of to the prediction relative to for any , , and .
Since it is very challenging (and often subjective) to compare different attribution methods, Sundararajan2017 argue that attribution methods should ideally satisfy a number of axioms (i.e., desirable properties), which we restate in Definition 3.
Definition 3.
An attribution method is said to satisfy the axiom of:

Linearity, if for any , , and ,
(A1) 
Completeness/Efficiency, if for any and ,
(A2) 
Null player, if for any such that does not mathematically depend on a dimension ,
(A3) for all .

Implementation invariance, if for any , and any such that for all ,
(A4)
Besides these four axioms, we also identify two other useful properties of attribution methods, which are less explicitly mentioned in the literature. We state these two properties in Definition 4.
Definition 4.
An attribution method is said to satisfy

Fidelitytocontrol, if for any , , and ,
(P1) 
Batchability, if for any and any input/baseline pairs , there exists a function such that
(P2) where and .
Many popular attribution methods satisfy most of these six properties, as we show in the Proposition 1 and 2.
Proposition 1.
Proposition 2.
Proof.
We include proofs for both propositions in Appendix B. ∎
3 Proposed: CAUSE
In this section, we formally present CAUSE, a novel framework for learning Granger causality from multitype event sequences. Our framework consists of two steps: first, it trains a neural point process (NPP) to fit the underlying event sequence data; then it inspects the predictions of the trained NPP to compute a Granger causality statistic with some “wellbehaved” attribution method , which we assume satisfies the following properties: linearity (A1), completeness (A2), null player (A3), fidelitytocontrol (P1), and batchability (P2).
In what follows, we first describe the architecture of the used NPP in Section 3.1. Then we elaborate the intuition and the definition of our Granger causality statistic in Section 3.2. Section 3.3 explains the computational challenges and presents a highly efficient algorithm for computing such statistic. We conclude this section by discussing the choice of attribution methods for CAUSE in Section 3.4.
Somesh: Need a picture discribing the overall architecture of CAUSE.
3.1 A SemiParametric Neural Point Process
The design of our NPP follows the general encoderdecoder architecture of existing NPPs (Section 2.2), but we innovate the decoder part to enable both modeling flexibility and computational feasibility.
Encoder. First, we convert each event into an embedding vector that summarizes both the temporal and the type information for that event, as follows:
(4) 
where is a prespecified function that transforms the elapsed time into one or more temporal features (simply chosen to be identity function in our experiments), is the embedding matrix for event types, and recall that is the onehot encoding of the even type .
We then obtain the embedding of a history from event embedding sequences by
(5) 
where is a sequence encoder\replaced and chosen to be a Gated Recurrent Unit (GRU) (Cho2014) in our experiments.. It is chosen to be a Gated Recurrent Unit (GRU) (Cho2014) in our experiments, although alternative architectures, such as LSTMs, RNNs with attention, and transformers (Vaswani2017), are also applicable.
Decoder. An ideal decoder should fullfill the following two desiderata: (a) it should be flexible enough to produce from a wide variety of with complex timevarying patterns; and (b) it should also be computationally manageable, particularly in terms of computing the cumulative intensity , a key term in the loglikelihoodbased training given in (2) and the definition of our Granger causality statistic in the subsequent subsections.
We propose a novel semiparametric decoder that enjoys both the flexible modeling of CIF and computational feasibility. Specifically, for each , we define the CIF on to be a weighted sum of a set of basis functions, as follows:
(6) 
where is a set of prespecified positivevalued basis functions, and is a feedforward neural network that computes positive weights for each of the event types. In this way, by choosing to be a richenough function family, the CIFs defined in (6) are able to express a wide variety of timevarying patterns. More importantly, since the parts relevant to neural networks— and —are separated from the basis functions, we can evaluate the integral analytically, as follows:
(7) 
where is generally available for many parametric basis functions.
We choose the basis functions to be the densities of a Gaussian family with increasing means and variances. This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. The details are given in Appendix B.3. Inspired by the dyadic interval bases used by Bao2017, we choose the basis functions to be the densities for a Gaussian family , whose means are given by
(8) 
and the standard deviations by for . This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. As shown in Figure 4 for an example of and , the first a few bases, due to their small means and variances, capture the shortterm effects, whereas the last several characterize the mid/longterm effects.
3.2 From Event Contributions to a Granger Causality Statistic
Now that we have trained a flexible NPP that can successively update the history embedding after each event occurrence and then predict the future CIFs after until ; we would like to ask: can we quantify the contribution of each past event to each prediction? Since in our case ’s are instantiated by two potentially highly nonlinear neural networks (i.e., and ), it is not as straightforward to obtain the past event’s contribution to current event occurrence as in the case of some parametric MPPs (e.g., Hawkes processes).
A natural idea for the aforementioned question would be applying some attribution method to ’s. To do so, however, there are two challenges. First, the predictions in our case are timevarying functions rather than static quantities (e.g., the probability of a class, as commonly seen in existing applications of attribution methods); thus it is unclear which target should be attributed. Second, as the input to ’s are multitype event sequences with asynchronous timestamps, it is also unclear which baseline should be used.
We tackle the first challenge by setting the cumulative intensity to be the attribution target. This is not only because the cumulative intensity reflects the overall effect of on , but also because it has a clear meaning in the context of point processes: it is the rate of the Poisson distribution that the number of events of type on satisfies. More importantly, since the cumulative intensity has a closed form as in (7), its gradients with respect to its input can be computed both precisely and efficiently. Note that by adopting this target, the input now includes not only but also ; thus we define and write the target as .
As for the second challenge, we choose the baseline of an input to be ; that is, the onehot encodings of all observed event types are replaced with zero vectors. Since and only differ in the dimensions corresponding to the event types, i.e., , by the fidelitytocontrol (P1), then only these dimensions will have nonzero attributions. With completeness (A2), it further implies that for every type
(9) 
where is the attribution to . Thus, the term can be viewed as the event contribution of the th event to the cumulative intensity prediction relative to the baseline . Besides, event timestamps are identical in and , thus this contribution comes only from the event type and denotes how type contributes to the prediction of type for a specific event history .
A Granger Causality Statistic. We have established ’s as the past events’ contribution to the cumulative intensity on interval . A further question is: can we infer from these event contributions for individual predictions the populationlevel Granger causality among event types?
To answer this question, we define a novel statistic indicating the Granger causality for type to type as follows:
(10) 
Here the numerator aggregates the event contributions for all event occurrences over the whole dataset, and denominator accounts for the fact that some event types may occur far more frequently than other types, which can lead to unreasonally large scores if used without such normalization. Note that an event contribution may be negative when the event exerts an inhibitive effect; thus can also be negative and characterize the Granger causality from type to type even when the influence is inhibitive.
Prasad: What’s the connection between this statistics and the definition of 1 of Granger causality?
Attribution Regularization. One caveat in (9) and (10) is that our chosen baselines have never appeared in the training procedure, thus the value of may be meaningless or even misleading. Ideally, we would like to be the cumulative intensity of type given that history prior to consists of “null” events at . Thus a natural prior reflecting this idea is to make nearly zero for any handcrafted baseline . Such an “invariance” property on can be achieved by adding an auxiliary regularization for each in the NLL given in (2), leading to a training objective
(11) 
where is a hyperparameter.
3.3 Computing the Granger Causality Statistic
While (10) defines ’s analytically, it is rather challenging to compute them. This is because a naive implementation would require applying at each event occurrence, which is computationally prohibitive for a dataset of millions of events. Note that the normalization in (10) can be easily calculated; so if we write , the problem is reduced to how to efficiently compute .
We propose an efficient algorithm to compute , which is summarized in Algorithm LABEL:alg:batchingscheme. At the core of our algorithm are two levels of batching: (a) intrasequence batching, which allow the computation of with only one call of ; and (b) intersequence batching, which enables batch computation of for a minibatch of event sequences indexed by . We explain the details of these two levels of batching as follows.
algocf[!tbp] \end@float
IntraSequence Batching. As this part only deals with a particular event sequence, to simplify the notation, we omit the sequence index for now. Note that and due to the recurrent nature of , all for can be computed in a single forward pass with the shared input . Denote by a matrixvalued function such that for any .
The equivalence between and means that,
which further implies that we can rewrite as a weighted sum of attribution scores for the same input and baseline . Since we are not interested in computing the individual attribution scores but their sum, we can leverage the linearity property (A1) to compute the attribution scores directly for the sum, as shown in the following proposition.
Proposition 3.
Proof.
The proof is in Appendix B.4. ∎
InterSequence Batching. We now discuss how to efficiently compute for a minibatch of event sequences indexed by . The key idea for a significant computational speedup here is that if satisfies batchability (P2), we can then batch the computation of different sequences with a single call of .
To simplify the discussion, we assume without loss of generality that and for all . Let and analogously the corresponding baselines . We further override our previous notation and denote by a new tensorvalued function such as that . Then with Proposition 1, we have that
Time Complexity Analysis. With our twolevel batching scheme, Algorithm LABEL:alg:batchingscheme only requires invocations of , a significant reduction from the invocations required by a naive implementation that directly calculates ’s, where is the average sequence length. Since modern computation hardware (such as GPUs) enables calling with a batch of inputs being almost as fast as calling it with a single input, our algorithm can achieve up to three ordersofmagnitude speedup over a naive implementation on datasets with relatively large and . (See Section 4.2.3 for empirical evaluations.)
3.4 Choice of Attribution Methods
In our experiments, we choose the attribution method to be the Integrated Gradients, which is defined as follow:
(13) 
where is the Hadamard product.
Nevertheless, CAUSE does not depend on a specific attribution method but a set of properties that we have stated upfront;
this means that any other attribution methods that satisfy these properties (e.g., DeepLIFT) should be applicable to CAUSE.
Also note that batchability (P2) is only used in the intersequence batching for speeding up the computation; thus, if efficiency is less of a concern, or the computation of attributions for different inputs can be accelerated in alternative ways,
4 Experiments
In this section, we present the experiments that are designed to evaluate CAUSE and answer the following three questions:

GoodnessofFit: How good is CAUSE at fitting multitype event sequences?

Causality Discovery: How accurate is CAUSE at discovering Granger causality between event types?

Scalability: How scalable is CAUSE?
The experimental results on five datasets show that CAUSE (a) outperforms stateoftheart methods in both fitting and discovering Granger causality from event sequences of diverse event interdependency, (b) can identify Granger causality on realworld datasets that agrees with human intuition, and (c) can compute the Granger causality statistic three ordersofmagnitude faster due to our optimization.
4.1 Experimental Setup
Datasets. We designed three synthetic datasets to reflect various types of event interactions and temporal effects.

Excitation: This dataset was generated by a multivariate Hawkes process, whose CIFs are defined in (1). The exponential decay kernels were used, and a weighted groundtruth causality matrix was constructed with the norms of the kernel functions .

Inhibition: This dataset was generated by a multivariate selfcorrecting process (isham1979self), whose CIFs are of the form: , where and . A weighted groundtruth causality matrix was formed with the pairwise weights .

Synergy: Generated by a proximal graphical event model (PGEM) (Bhattacharjya2018), this dataset contains synergistic effects between a pair of event types to a third event type. A binary groundtruth causality matrix was constructed from the dependency graph of the PGEM.
We also included two realworld datasets used in existing literature.

IPTV (Luo2015): Each sequence records the history of TV watching behavior of a user, and the event types are the TV program categories. This dataset, however, does not contain groundtruth causality between TV program categories.

MemeTracker (MT):
^{3} Each sequence represents how a phrase or quote appeared on various online websites over time during the period of August 2008 to April 2009, and the event types are the domains of the top websites. Like previous studies (achab2017uncovering; Xiao2019), a weighted groundtruth causality matrix was approximated by whether one site contains any URLs linking to another site.
The parameter settings for synthetic datasets, the preprocessing steps for realworld datasets, and the dataset statistics are detailed in Appendix C.1.
Methods for Comparsison. We compared our method to the following baselines:

HExp: Hawkes process with fixed exponential kernels.

HSG and NHPC: Hawkes process with sum of Gaussian kernels (xu2016learning) and nonparametric Hawkes process cumulant matching (achab2017uncovering). These two methods represent the stateoftheart parametric and nonparametric methods for learning Granger causality for Hawkes process, respectively.

RPPN: Recurrent point process network (Xiao2019), to the best of our knowledge, the only NPP that can provide summary statistics for Granger causality, which is enabled by its use of an attention mechanism.
The implementation details and hyperparameter configurations for CAUSE and various baselines are provided in Appendix C.2
Evaluation Metrics. The holdout negative loglikelihood (NLL) was used for evaluating the goodnessoffit of each method on various datasets, and the Kendall’s coefficient and the area under the ROC curve (AUC) were used for evaluating the estimated Granger causality matrix against the ground truth. Nonbinary groundtruth causality matrices were binarized at zero in the evaluation of AUCs. We performed fivefold crossvalidation and report the average results.
4.2 Detailed Results
Goodnessoffit
We start by examining the goodnessoffit of various methods on various datasets, since if a method fails to fit the data, it is unlikely to detect the true Granger causality between event types. As shown in Figure 1, CAUSE attains smaller NLLs than all baselines on all datasets, suggesting that CAUSE consistently has a better fit than all baselines. Notably, on all three synthetic datasets, the NLLs of CAUSE nearly match those computed by the groundtruth models. These results confirm the flexibility of CAUSE in learning the various types of event interactions and temporal effects.
Causality Discovery
We now examined the performance of CAUSE on Granger causality discovery, both quantitatively and qualitatively.
Quantitative Analsyis.
Excitation  Inhibition  Synergy  MT  

Method  AUC  Kendall’s  AUC  Kendall’s  AUC  Kendall’s  AUC  Kendall’s 
HExp  0.8580.004  0.4530.005  0.5460.002  0.1020.002  0.8720.058  0.2510.039  0.4040.009  0.0610.005 
HSG  0.9970.001  0.6350.002  0.4900.002  0.0130.002  0.8760.007  0.2540.039  0.5390.008  0.0240.005 
NPHC  0.7820.007  0.3370.010  0.4000.054  0.1380.067  0.7410.129  0.1630.087  N/A  N/A 
RPPN  0.5950.010  0.1360.012  0.4480.003  0.0660.002  0.8910.043  0.2640.029  0.4920.004  0.0050.002 
CAUSE  0.9200.012  0.5330.013  0.9210.021  0.5320.021  0.9910.004  0.3310.003  0.6230.012  0.0750.007 
Table 1 shows values of AUC and Kendall’s of various methods on the four datasets that have groundtruth causality. The results in the table support the following conclusions.
First, CAUSE performs the best overall and is most robust to various types of event interactions. It not only significantly outperforms all baselines on three of the four datasets (i.e., Inhibition, Synergy, and MT), but also achieves a closesecond on Excitation, in which events were generated by a Hawkes process, and CAUSE is supposed to have a disadvantage relative to Hawkes processbased baselines.
Second, once the underlying data generation process violates the assumptions of Hawkes process and exhibits complex event interactions other than excitation, Hawkes processbased methods tend to perform poorly, as seen from Inhibition and Synergy.
Finally, despite both being NPPbased methods, RPPN performs significantly worse than CAUSE on all datasets. We suspect that this underperformance is caused by two issues in RPPN’s construction of the Granger causality statistics with the attention weights. First, RPPN restricts all attention weights to be positive, thus cannot distinguish between excitative and inhibitive effects. Second, attention mechanism may not correctly attribute the model’s prediction to its inputs, as shown in several recent studies (jain2019attention; Serrano2019).
Qualitative Analysis. Figure 2 shows the heat map for the Granger causality matrix of IPTV dataset estimated by CAUSE. Almost all diagonal entries have large positive values, indicating that users, on average, exhibit strong tendencies to watch the TV programs of the same category. Several positive associations between different TV program categories are also observed, such as from military, laws, finance, and education to news, and from kids and music to drama. These results agree with common sense and are consistent with the findings of an existing study with HSG (xu2016learning). Our method also suggests several meaningful negative associations, including ads to drama and education to entertainment; such negative associations, however, can never be detected by models that only consider the excitations between events, such as HSG.
Appendix C.4 provides a detailed analysis of the estimated Granger causality matrix for MT dataset.
Scalability
Finally, we investigate the scalability of CAUSE in computing the Granger causality statistic by Algorithm LABEL:alg:batchingscheme. Figure 3 shows how much speedup Algorithm LABEL:alg:batchingscheme achieves over a naive implementation with different sequence lengths and batch sizes. The results demonstrate that with batch size and average sequence length both being relatively large (i.e., greater or equal to and , respectively), our algorithm can achieve over three ordersofmagnitude speedup relative to a native implementation. Furthermore, the speedup scales almost linearly with sequence length and batch size when they do not exceed and , respectively, which is consistent with our analysis in Section 3.3. Beyond this regime, only a sublinear relationship between the speedup and batch size or sequence length is observed, which is because the GPU we tested on was reaching its maximum utilization.
5 Conclusion
We have presented CAUSE, a novel framework for learning Granger causality between event types from multitype event sequences. At the core of CAUSE are two steps: first, it trains a flexible NPP model to capture the complex event interdependency, then it computes a novel Granger causality statistic by inspecting the trained model with an axiomatic attribution method. A twolevel batching algorithm is derived to compute the statistic efficiently. We evaluate CAUSE on both synthetic and realworld datasets abundant with diverse event interactions and show the effectiveness of CAUSE on identifying the Granger causality between event types.
References
Appendix A Additional Related Work
Event Sequence Modeling With the increasing availability of multitype event sequences, there has been considerable interest in modeling such data for both prediction and inference. The majority of prior research in this direction has been based on the theory of point processes (Daley2003), a particular class of stochastic processes that characterize the distribution of random points on the real line. Notably, Hawkes process (hawkes1971point; hawkes1971spectra), a special class of point process, has been widely investigated, partly due to its ability to capture mutual excitation among events and its mathematical tractability. However, Hawkes process assumes that past events can only independently and additively influence the occurrence of future events, and that influence can only be excitative; these inherent limitations have restricted its modeling flexibility and render it unable to capture complex event interaction in realworld data.
As such, other more flexible models have been proposed, including the piecewiseconstant conditional intensity model (PCIM) (Gunawardana2011) and its variants (Weiss2013; Bhattacharjya2018), and more recently a class of models loosely referred to as neural point processes (NPPs) (du2016recurrent; xiao2017modeling; mei2017neuralhawkes; Xiao2019). These models, particularly NPPs, generally enjoy better predictive performance than parametric point processes, since they use more expressive machine learning models (e.g., decision trees, random forests, or recurrent neural networks) to sequentially compute the conditional intensity until next event is generated. A significant weakness of these models, however, is that they are generally uninterpretable and thus unable to provide summary statistics for determining the Granger causality among event types.
Granger Causality Discovery In his seminal paper, Granger1969 first proposed the concept of Granger causality for time series data. Many approaches have been proposed for uncovering Granger causality for multivariate time series, including the HiemstraJones test (Hiemstra1994) and its improved variant (Diks2006), LassoGranger method (Arnold2007), and approaches based on informationtheoretic measures (HlavackovaSchindler2007). However, as these methods are designed for the synchronous multivariate time series, they are not directly applicable to asynchronous multitype event sequence data, since otherwise one has to discretize the continuous observation window.
Didelez2008 first established the Granger causality for event types in event sequences under the framework of marked point processes. Later, Eichler2017 shows that Granger causality for Hawkes processes is entirely encoded in the excitation kernel functions (also called impact function). To our best knowledge, existing research for Granger causality discovery from event sequences appears to be limited to the case of Hawkes process (Eichler2017; xu2016learning; achab2017uncovering), possibly because of this direct link between the process parameterization and Granger causality.
Prediction Attribution for BlackBox Models Prediction attribution, the task of assigning to each input feature a score for representing the feature’s contribution to model prediction, has been attracting considerable interest in the field due to its ability to provide insight into predictions made by blackbox models such as neural networks. While various approaches have been proposed, there are two prominent groups of approaches: perturbationbased and gradientbased approaches. Perturbation approaches (Zeiler2014) typically comprise, first, removing, masking, or altering a feature, and then measuring the attribution score of that feature by the change of the modelâs output. While perturbationbased methods are simple, intuitive, and applicable to almost all blackbox models, the quality of the resultant scores is often sensitive to how the perturbation is performed. Moreover, as these methods scale linearly with the number of input features, they become computationally unaffordable for highdimensional inputs.
In contrast, backpropagationbased methods construct the attribution scores based on the estimation of local gradients of the model around the input instance with backpropagation. The ordinary gradients, however, could suffer from a “saturation” problem for neural networks with activation functions that contain constantvalued regions (e.g., rectifier linear unit (ReLUs)); that is, the gradient coming into a ReLU during the backward pass is zeroâd out if the input to the ReLU during the forward pass is in a constant region. One valid solution to this issue is to replace gradients with discrete gradients and use a modified form of backpropagation to compose discrete gradients into attributions, such as layerwise relevance propagation (LRP) (Bach2015) and DeepLIFT (shrikumar2017learning). Another solution, proposed by Integrated Gradient (IG) (Sundararajan2017), is to use the line integral of the gradients along the path from the input to a chosen baseline. Sundararajan2017 show that IG satisfies many desirable properties, as detailed in Proposition 1.
It is worth mentioning that much existing work often uses the intermediate results, produced by certain intelligible neural network architecture, as the attribution scores for an input. A most notable example of such an idea is the use of attention weights induced by some attention mechanism as the importance of the input (Bahdanau2015; Xu2015). Recently, however, there are growing concerns on the validity of attention weights being used as the explanation of neural networks (jain2019attention; Serrano2019). In particular, jain2019attention show that across a variety of NLP tasks, the learned attention weights are frequently uncorrelated with feature importance produced by gradientbased prediction attribution methods, and random permutation of attention weights can nonetheless yield equivalent predictions.
Appendix B Additional Technical Details
b.1 Proof of Proposition 1
Proof.
That both IG and DeepLIFT satisfy A1–A4 has been established in (Sundararajan2017). P1 is straightforward from the definion of either method. Thus, we only prove that both methods satisfy batchability (P2) with .
To prove that IG satisfies batchability, we first rewrite the as follows:
where the second step is due to that summation and gradients are swapable, and the third step is because the gradients of different terms are separable. Thus, we have
(14) 
which establishes the formula.
The proof of DeepLIFT satisfying batchability can be established in a similar way as IG. The key part, shown in the Proposition 2 of (Ancona2017), is that the attribution scores produced by DeepLIFT for a neuralnetworklike function , an input , and a baseline , i.e., can be viewed as the Hadamard product between and a modified gradient of at all its internal nonlinear layers. Since the last layer of is a simple linear addition of all ’s, the modified gradient of for input is the same as the one of for . Thus, we have
(15) 
∎
b.2 Proof of Proposition 2
We first briefly review Shapley values. Suppose there is a team of players working together to earn a certain amount of value. The value that every coalition achieves is , where is a value function. Shapley values, proposed by shapley1953value, provide a wellmotivated way to decide how the total earning should be distributed among such players. Specifically, the Shapley value for each player is defined as
(16) 
For any target function , input , and baseline , we define a value function for any , where is the spliced data point between and , defined in (3). Then the Shapley values can be viewed as an attribution method that provides the attribute scores for any , , and .
Now we prove that this attribution method based on Shapley valeus satisfies all four axioms (A1–A4) and the fidelitytocontrol (P1), as stated in Proposition 2.
Proof.
First, it is clear from the definition of Shapley values in (16) that satisfies linearity (A1) and implementation variance (A4). Since shapley1953value shows that for any value function , the Shapley values satisfies that
substituting our definition of the value function into the above equation yields
which establishes the completeness (A2). For any and , we have
Note that and only potentially differ on the th dimension. If does not depend on the th dimension of its input or (which implies ), then and thereby . Thus, satisfies null player (A3) and fidelitytocontrol (P1).
∎
b.3 Dyadic Gaussian Basis
Inspired by the dyadic interval bases used by Bao2017, we choose the basis functions to be the densities for a Gaussian family , which we term dyadic Gaussian basis. The means of dyadic Gaussian basis are given by
(17) 
and the standard deviations by for . This design of basis functions reflects a reasonable inductive bias that the CIFs should vary more smoothly as the time increases. As shown in Figure 4 for an example of and , the first a few bases, due to their small means and variances, capture the shortterm effects, whereas the last several characterize the mid/longterm effects.
b.4 Proof of Proposition 3
Proof.
We omit the index in this proof for brevity. First, we rewrite as
where in the step step, we replace with . Since , i.e., , does not depend on the events before the th event, with null player (A3), we have for any , which further implies that
With linearity (A1), we have
which establishes the formula.
∎
Appendix C Additional Experimental Details
c.1 The settings for synthetic and realworld datasets.
We describe below the setup and preprocessing details for the five datasets that we consider in this paper. The statistics of these datasets are summarized in Table 2.

Excitation. This dataset was generated by a multivariate Hawkes process, whose CIFs are of the form:
We set , , , , and . To generate a sparse excitation weight matrix , we first selected all its diagonal entries and random offdiagonal entries, then generated the values for these entries from , and finally scaled the matrix to have a spectral radius of .

Inhibition. This dataset was generated by a multivariate selfcorrecting point process, whose CIFs take the form:
We chose , , , and . To generated a sparse weight matrix , we first selected all its diagonal entries and random offdiagonal entries and further generated the values for these entries from .

Synergy. This dataset was generated by a proximal graphical event model (PGEM) (Bhattacharjya2018). PGEM assumes that the CIF of an event type depends only on whether or not its parent event types (specified by a dependency graph) have occurred in the most recent history. We designed a local dependency structure that consists of five event types labeled as A–E. Among these event types, type E is the outcome and can be excited by the occurrence of type A, B, or C; type A and B, only when both occurred in the most recent history, would incur a large synergistic effect on type E; type C has an isolated excitative effect on type E and does not interacts with other event types; and finally, type D does not have any excitative effect and is introduced to complicate the learning task. The dependency graph, together with the corresponding time windows and intensity tables, illustrated in Figure 5. To add more complexity to this dataset, we further replicated this local structure for another copy, leading to a total of event types. We generated event sequences with a maximum time span of .

IPTV. We obtained the dataset from
^{4} . We further normalized the timestamps into the days and splitted long event sequences so that the length of each sequence is smaller or equal to . 
MT. We downloaded the raw MemeTracker phrase data from
^{5} . We filtered the phrase data that occurred from 20080801 to 20080930 and from the top100 website domains. We further normalized the timestamps into hours and filtered out those event sequences (i.e., phrase cascades) whose lengths are not in between and .
Dataset  # of events  Ground truth  

Excitation  1,000  10  250,447  Weighted 
Inhibition  1,000  10  250,442  Weighted 
Synergy  1,000  10  178,338  Binary 
IPTV  1,869  16  966,338  N/A 
MT  382,014  100  3,419,399  Weighted 
c.2 Implementation Details and Hyperparameter Configurations for Various Methods
For CAUSE, the and the were implemented by a singlelayer GRU and a twolayer fully connected network with skip connections, respectively. The dimension of event type embeddings was fixed to , and the number of hidden units for GRU was set to be for synthetic datasets and for realworld datasets. The number of basis functions and the maximum mean were chosen by a rule of thumb such that and are of the same scale as the th and the th percentiles of the interevent elapsed times, respectively. The optimization was conducted by Adam with an initial learning rate of . A holdout validation set consisting of of the sequences of the training set was used for model selection; the model snapshot that attains the smallest validation loss was chosen. As events sequence lengths vary greatly on two realworld datasets, in constructing minibatches for both training and inference, we adopted the bucketing technique to reduce the amount of wasted computation caused by padding. Finally, the line integral of IG, defined in (13), was approximated by steps for MT and steps for other datasets; a smaller number of steps, although may result in certain lose of accuracy, allows for a larger batch size and thus shorter execution time for attribution.
For the Hawkes processbased baselines—HExp, HSG, and NHPC—their implementation was provided by the package tick (bacry2017tick). The most relevant hyperparameters for each method were tuned by crossvalidation.
As there is no publicly available codes for RPPN, we implemented it with our best effort. Its overall settings for architecture and optimization is similar to the ones for CAUSE.
c.3 Platform and Runtime
All experiments were conducted on a server with a 16core CPU, 512G memory, and two Quadro P5000 GPUs. On the largest dataset, MT, the total runtime for CAUSE was less than 3 hours, including both training and computing the Granger causality statistic.
c.4 Qualitative Analysis on MT
Since there are too many event types in MT, instead of a heat map, we visualize the causality matrix as a graph and show in Figure (a)a and Figure (b)b the toptwo communities of that graph, where the directed edges denote the estimated Granger causality between pairs of domains.
Appendix D A Primer on Measure and Probability Theory
In this section, we review some of the basic definitions of in measure theory, which may help the understanding of the definition of Granger causality for multivariate point processes. Most of the content in this section were adapted based on primarily based on the Chapter 1 of (durrett2019probability) and the Appendix 3 of (Daley2003),
Let be a set of “outcomes” and a nonempty collection of subsets of . The set is algebra of , if it is closed under complement and countable unions; that is,

if , then , and

if is a countable sequence of sets, then .
With these two conditions, it’s easy to see that algebra is also closed under arbitrary (possibly uncountable) intersections. From this, it follows that given a nonempty set and a collection of of subsets of , there is a smallest algebra containing ; we denote such smallest algebra by . One particular algebra is of particular interst—Borel algebra; that is, the smallest algebra containing all open sets in , denoted by . Specifically, let be the empty set plus all sets of the form