Deep Attention Spatio-Temporal Point Processes
We present a novel attention-based sequential model for mutually dependent spatio-temporal discrete event data, which is a versatile framework for capturing the non-homogeneous influence of events. We go beyond the assumption that the influence of a historical event (causing an upward or downward jump in the intensity function) fades monotonically over time, which is a key assumption made by many widely used point process models, including those based on Recurrent Neural Networks (RNNs). We borrow the idea of the attention model and build it on a probabilistic score function, which leads to a flexible and highly interpretable representation of the intensity function. We demonstrate the superior performance of our approach compared to the state of the art on both synthetic and real data.
Spatio-temporal event data are ubiquitous in modern applications, ranging from traffic incidents recorded by the police and user behaviors in social-media applications to earthquake catalog data. Such data consist of a sequence of events that indicate when and where each event occurred, as well as any additional information about that event. Of particular interest is modeling the triggering effect: when an event occurs, subsequent events become more likely or less likely to happen. Point processes, whose distributions are completely specified by the conditional intensity function, are a popular model for such a triggering effect. The task of modeling the triggering effect thus becomes how to incorporate the “influence” of historical events into the conditional intensity function.
The most commonly used modeling approach is to assume the influence function decays monotonically over time and introduce parametric models for the influence function. For instance, this approach is used in the popular Recurrent Neural Networks (RNNs) based methods [3, 13, 11, 19, 23, 22, 28], which have achieved various successes in modeling complex temporal dependence: e.g.,  assumes that the influence of an event decreases or increases exponentially over time.
However, in various scenarios, the influence of past events may not decay monotonically over time. For example, most earthquakes cause aftershocks for no more than a decade afterward. However, earthquakes that occurred hundreds of years ago can sometimes still contribute to recent seismic activity: it has been shown that recent seismic events in New Madrid, MO, are aftershocks of four earthquakes of magnitude 7.5 in 1811 (shown in Figure 1 (a)).
In this paper, we present a novel Deep Attention Point Processes (DAPP) model to capture the complex non-homogeneous influence of historical events on the future. We go beyond the assumption that the influence of a historical event fades over time, and leverage the attention mechanism to develop a flexible framework that “focuses” on past events with a high importance score with respect to the current event. We introduce a probabilistic score function between past events and the current event, represented using a deep neural network, which captures the joint distribution of features in the embedding space (see the example in Figure 2, where the score learned from traffic data does not decay monotonically over time). Our DAPP model also extends the conventional dot-product score  used in other attention models, and our model is highly interpretable.
The main contributions of our paper are as follows: (1) to the best of our knowledge, this is the first attempt to use attention-based models to capture non-homogeneous influence in point processes and offer a more flexible framework for event data modeling; (2) our probabilistic score function is highly interpretable, and it generalizes the dot-product score commonly used in existing attention models; (3) to achieve constant memory in the face of streaming data, we introduce an online algorithm to perform attention efficiently for our DAPP model, where only the most informative events in the past are retained for computation.
Related work. Existing works for statistical point processes modeling, such as [7, 6, 24, 27], often make strong assumptions and specify a parametric form of the intensity functions. Such methods enjoy good interpretability and are efficient to perform. However, parametric models are not expressive enough to capture the events’ dynamics in some applications.
Recent interest has focused on improving the expressive power of point process models. There have been attempts at RNN-based point process models [3, 13, 23, 28], which use RNNs to memorize the influence of historical events. However, the conditional intensity is still assumed to take a specific functional form.
There are also other works [11, 22] that use RNNs to model the events’ dependencies without specifying the conditional intensity function explicitly. These works use the RNN only as a generative model, so the conditional intensity function is not available, and they focus on studying different learning strategies since maximum likelihood estimation is not applicable. Another recent work  has aimed at looking for a more general way to model point processes, where no parametric form is assumed. It uses a neural network to parameterize the cumulative hazard function, from which the conditional intensity can be derived by taking the derivative. This approach is highly flexible and easy to compute since no numerical integration is involved. However, the model is specified only through the neural network, which lacks interpretability. In addition, this model only works for temporal events.
There has also been some work that models stochastic processes using the attention mechanism [26, 9].  uses self-attention to model a class of neural latent variable models, called Neural Processes , which is not designed for sequential data specifically. A recent work  also uses attention to model the historical information in point processes. However, an important distinction between their approach and ours is that theirs still uses a parametric form and assumes exponential decay in the conditional intensity function, which may fail to capture distant events even when they are important; we make no such assumption and can capture important events as long as their “importance score” is high. Moreover,  focuses on temporal point processes while we also consider spatio-temporal point processes; they employ the conventional dot-product score function to measure the similarity of two events, while we introduce a more flexible score function based on neural networks, which is learned from data.
In this section, we revisit the basic definitions of marked spatio-temporal point processes and the attention mechanism.
2.1 Spatio-temporal point processes
Marked spatio-temporal point processes (MSTPPs)  consist of an ordered sequence of events localized in time, location, and mark spaces. Let represent a sequence of points sampled from a MSTPP. We denote as the number of the points generated in the time horizon . Each point is a marked spatio-temporal tuple , where is the time of occurrence of the -th event, is its location, and is its mark.
The distribution of events in MSTPPs is characterized via a conditional intensity function , which is the probability of observing an event in the marked spatio-temporal space given the events’ history , i.e.,
where is the counting measure of events with mark over the set and is the Lebesgue measure of the ball with radius (see Appendix A for the derivation). For the notational simplicity, we denote the conditional intensity function as and the conditional probability density function of the event as . Then the conditional probability density function of the -th event can be obtained as follows:
The log-likelihood of observing a sequence with events denoted as can be obtained by (see Appendix B for the derivation):
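As a rough illustration of the log-likelihood above, the following sketch evaluates it numerically for a purely temporal process (locations and marks are omitted for brevity). It uses the standard form: the sum of log-intensities at the observed events minus the integral of the intensity over the time horizon. The function name `log_likelihood`, the callable `intensity(t, history)`, and the midpoint-rule grid size are assumptions made for this example and are not taken from the paper.

```python
import math


def log_likelihood(event_times, intensity, horizon, n_grid=2000):
    """Log-likelihood of a temporal point process observed on [0, horizon].

    `intensity(t, history)` is a hypothetical callable returning the
    conditional intensity at time t given the list of earlier event times.
    """
    # First term: sum of log-intensities at the observed event times.
    log_term = 0.0
    for i, t in enumerate(event_times):
        log_term += math.log(intensity(t, event_times[:i]))

    # Second term: integral of the intensity over [0, horizon],
    # approximated with the midpoint rule on a uniform grid.
    dt = horizon / n_grid
    integral = 0.0
    for k in range(n_grid):
        t = (k + 0.5) * dt
        history = [s for s in event_times if s < t]
        integral += intensity(t, history) * dt

    return log_term - integral


# Usage: a homogeneous Poisson process with rate 2 on [0, 3].
ll = log_likelihood([0.5, 1.0, 2.5], lambda t, history: 2.0, horizon=3.0)
```

For a constant intensity the result reduces to the number of events times the log-rate minus the rate times the horizon, which gives a simple sanity check for the numerical integration.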
The attention mechanism [12, 1, 20] has been widely used in recent deep learning studies [16, 2, 25, 21]. Such mechanisms make it possible to focus on a subset of the input, which reveals significant patterns in the data.
The basic dot-product single attention  can be described as mapping a query and pairs of key-value to an output (which is also referred to as the score):
where is the dimension of query and key embeddings, is the dimension of value embeddings. The function is the inner product between the query and the key , which measures their “similarity”. The functions and are linear transformations for the value and the key , where are weight matrices that project key and value vectors to the corresponding embedding spaces, whose dimensions are and , respectively. The attention output for a query is computed as a weighted sum of value embeddings, where the weight assigned to each value embedding is computed by the score between the query and the corresponding key.
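The description above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming that the scores are normalized with a softmax (as in the standard attention formulation) and that the query already lives in the key embedding space; the names `matvec`, `softmax`, `dot_product_attention`, `W_k`, and `W_v` are chosen for this example only.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys, values, W_k, W_v):
    """Single-head dot-product attention for one query."""
    # Project keys and values into their embedding spaces.
    key_emb = [matvec(W_k, k) for k in keys]
    val_emb = [matvec(W_v, v) for v in values]
    # Score = inner product between the query and each key embedding.
    scores = [sum(q * k for q, k in zip(query, ke)) for ke in key_emb]
    weights = softmax(scores)
    # Output = weighted sum of the value embeddings.
    dim = len(val_emb[0])
    output = [sum(w * ve[d] for w, ve in zip(weights, val_emb)) for d in range(dim)]
    return output, weights


# Usage with identity projections: the query matches the first key best.
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = dot_product_attention(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], identity, identity
)
```

The returned weights sum to one, so the output is a convex combination of the value embeddings, with more weight on values whose keys are more similar to the query.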
The single attention model can also be extended to multi-head attention by concatenating multiple independent attention outputs, which allows the model to jointly attend to information from different representation subspaces at different positions of the sequence . A multi-head attention with heads is formally defined as follows:
where is the concatenation of a sequence of vectors. Each attention function associates with linear mapping functions and a similarity function .
3 Proposed Method
In this section, we present a novel attention-based point process model with a probabilistic score function, which is able to capture complex marked spatio-temporal dependencies.
3.1 Attention in point processes
The idea of the Deep Attention Point Processes (DAPP) is to model the nonlinear dependencies of the current event from past events using the attention mechanism. Specifically, we model the conditional intensity function of MSTPPs using the attention output. We also consider multi-heads, which offers multiple “representation subspace” for events in the sequence. The exact calculation of the conditional intensity is carried out as follows.
Let represent the data tuple of the current event and represent the data tuple of the past -th event. As shown in Figure 3, for the -th attention head, we first map past events to the value embedding space and the key embedding space . Then we score the current event against its past event using their key embeddings and ( is the dimension of the key embedding space), denoted as . For the event , the score determines how much attention to place on the past event as we encode the history information, which will be discussed in Section 3.2. The normalized score for the event and is obtained by employing the softmax function over the score, which is defined as:
Then we are able to obtain the -th attention head ( is the dimension of value embedding) for the event via multiplying each value embedding by the score and adding them up, which is formally defined as
where the value embedding and is a weight matrix, the is the data dimension. Here, the key embedding for the current event is analogous to the query, the key and value embedding for the past event are analogous to the key and the value in the attention mechanism, respectively. The multi-head attention is the concatenation of single attention heads:
Consider a non-linear transformation of the multi-head attention as the historical information before event , the conditional intensity function can be specified as:
where are the weight matrix and the bias term. The function is a smooth approximation of the ReLU function, which ensures that the intensity is strictly positive at all times when an event could possibly occur and avoids an infinitely negative log-likelihood. The is the base intensity, which can be estimated from the data.
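To make the role of the smooth ReLU approximation concrete, here is a minimal sketch that assumes the approximation is the softplus function and that the intensity is a softplus of an affine map of the history embedding plus the base intensity. The exact transformation used in the model may differ; the names `softplus`, `conditional_intensity`, `h`, `w`, `b`, and `base_intensity` are placeholders for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)), a smooth approximation of ReLU."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def conditional_intensity(h, w, b, base_intensity):
    """Intensity from a history embedding h (assumed form, see lead-in).

    h: history embedding (list of floats), e.g. a multi-head attention output
    w: weight vector, b: bias, base_intensity: positive base rate
    """
    pre_activation = sum(wi * hi for wi, hi in zip(w, h)) + b
    # softplus is non-negative, so the result is at least the base intensity.
    return softplus(pre_activation) + base_intensity


# Usage: a strongly negative pre-activation still gives a positive intensity.
lam = conditional_intensity([-5.0, -5.0], [1.0, 1.0], 0.0, base_intensity=0.1)
```

Because the intensity never falls below the base rate, the logarithm of the intensity in the log-likelihood stays finite at every observed event.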
3.2 Probabilistic score function
The score function directly quantifies how likely one event is triggered by the other in a sequence. The dot-product score function has been widely used in most attention models. Typically, two events and are first projected onto another space as and , respectively. Then the score is obtained by computing their inner product , which measures their similarity in the Euclidean embedding space, as shown in Figure 4 (a). However, for some real applications, the correlation of two events may not depend on their Euclidean distance explicitly. Take earthquake catalog data as an example. The correlation between seismic events is related to the geologic structure of faults and is usually heterogeneous with respect to distance. For instance, most aftershocks occur either along the fault plane or along other faults within the volume affected by the strain associated with the mainshock.
To capture such complex dependencies between events, we model the score in the -th attention head as the conditional probability of features in the concatenated embedding , where is the key embedding of and . As shown in Figure 4 (b), we leverage the expressive power of deep neural networks to capture the co-occurrence of these features, which takes the concatenated embedding as input and yields the conditional probability of these features. For notational simplicity, we denote feature variables in the embedding as , respectively. The above conditional probability can be written as:
where is a multi-layer neural network parameterized by a set of weights and biases . The is the normalization term that ensures
3.3 Online attention for streaming data
For streaming data, the attention calculation may be computationally intractable since past events would explode in number as time goes on. Here, we propose an adaptive online attention algorithm to address this issue, where only a fixed number of “important” historical events with high average scores will be remembered for the attention calculation in each attention head. The procedure for collecting “important” events in each attention head is demonstrated as follows.
First, when the -th event occurs, for a past event in -th attention, we denote the set of its score against the events as . Then the average score of the event can be computed by , where denotes the number of elements in set . Hence, a recursive definition of the set of active events in the -th attention head up until the occurrence of the event is written as:
The exact event selection is carried out by Algorithm 1.
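Since Algorithm 1 is not reproduced here, the following is only a sketch of the selection idea described above for a single attention head: keep a running average score for each retained past event and evict the event with the lowest average once the buffer exceeds its capacity. The class name `ActiveEventSet`, the `score_fn(new_event, past_event)` callable, and the rule that a newly arrived event is never evicted before it has been scored at least once are assumptions made for this example.

```python
class ActiveEventSet:
    """Keeps at most `capacity` past events, ranked by average score."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.events = []      # retained events, in arrival order
        self.score_sum = []   # sum of scores each retained event has received
        self.score_cnt = []   # number of scores each retained event has received

    def average(self, idx):
        """Average score of a retained event; unscored events rank highest."""
        if self.score_cnt[idx] == 0:
            return float("inf")
        return self.score_sum[idx] / self.score_cnt[idx]

    def update(self, new_event, score_fn):
        # 1. Score the new event against every retained past event.
        for idx, past in enumerate(self.events):
            self.score_sum[idx] += score_fn(new_event, past)
            self.score_cnt[idx] += 1
        # 2. The new event joins the active set (it has no scores yet).
        self.events.append(new_event)
        self.score_sum.append(0.0)
        self.score_cnt.append(0)
        # 3. Evict the lowest-average events until the capacity is respected.
        while len(self.events) > self.capacity:
            worst = min(range(len(self.events)), key=self.average)
            for lst in (self.events, self.score_sum, self.score_cnt):
                del lst[worst]


# Usage with a toy score in which each past event's score equals its own value.
buffer = ActiveEventSet(capacity=2)
for event in [1.0, 5.0, 3.0, 0.5]:
    buffer.update(event, lambda new, past: past)
```

With a fixed capacity, both the memory footprint and the per-event cost of the attention computation stay constant regardless of how long the stream runs.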
3.4 Learning of model
The log-likelihood function of our proposed model can be obtained by substituting (5) into (2) defined in Section 2.1. The model is jointly parameterized by a set of parameters , which can be learned by maximizing the log-likelihood using stochastic gradient descent.
To perform the event prediction given a sequence of events with length of , we first substitute the conditional intensity (5) into the conditional probability (1) defined in Section 2.1, then the next event can be estimated by calculating the expectation of the conditional probability:
In general, the integration above cannot be obtained analytically. Therefore, we use common numerical integration techniques here to compute the expectation.
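As an illustration of such a numerical scheme, the sketch below estimates the expected time of the next event for a purely temporal process. It assumes the standard relation between the conditional density and the intensity (the density at time t equals the intensity at t times the exponential of minus the integrated intensity since the last event) and truncates the integral at a finite `t_max`. The function name `expected_next_time` and its parameters are hypothetical.

```python
import math


def expected_next_time(intensity, t_last, t_max, n_grid=20000):
    """Midpoint-rule estimate of the expected next event time.

    `intensity(t)` is a hypothetical callable giving the conditional
    intensity after the last event at t_last. Returns (expectation, mass),
    where mass is the probability captured inside [t_last, t_max].
    """
    dt = (t_max - t_last) / n_grid
    cumulative = 0.0   # integral of the intensity from t_last to the current point
    expectation = 0.0
    mass = 0.0
    for k in range(n_grid):
        t = t_last + (k + 0.5) * dt
        lam = intensity(t)
        cumulative += 0.5 * lam * dt           # advance to the cell midpoint
        density = lam * math.exp(-cumulative)  # conditional density at t
        cumulative += 0.5 * lam * dt           # advance to the cell end
        expectation += t * density * dt
        mass += density * dt
    return expectation, mass


# Usage: constant intensity 2 after t_last = 1; the exact answer is 1 + 1/2.
mean_time, captured_mass = expected_next_time(lambda t: 2.0, t_last=1.0, t_max=21.0)
```

The second return value indicates how much probability mass lies inside the truncated range; if it is noticeably below one, `t_max` should be increased.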
In this section, we conduct experiments on four synthetic data sets and four large-scale real-world data sets. We evaluate our deep attention point process with / without the online attention (DAPP / ODAPP) and other baseline methods by comparing their log-likelihood and visually inspecting their conditional intensity function in both temporal and spatial scenarios. There are four baseline methods that we are considering in the following experiments:
Recurrent Marked Temporal Point Process (RMTPP):  assumes the following form for the conditional intensity function in point processes, denoted as where the -th hidden state in the RNN is used to represent the history influence up to the nearest happened event , and represents the current influence. The are trainable parameters.
Neural Hawkes Process (NHP):  specifies the conditional intensity function in point processes using a continuous-time LSTM, denoted as where the hidden state of the LSTM up to time represents the history influence, and is a softplus function, which ensures a positive output for any input.
Self-Attentive Hawkes Process (SAHP):  adopts self-attention mechanism to model the historical information in the conditional intensity function, which is specified as where are computed via three non-linear mappings: , , . The are trainable parameters.
Hawkes Process (HP):  As a sanity check, the conditional intensity function of Hawkes process is given by where parameters can be estimated via maximizing likelihood.
The experiment configurations are as follows: each data set sampled from a specific point process model is divided into 80% training and 20% testing data. In the training phase, the model parameters are estimated using the training data. For optimization, we employ the Adam optimizer with a learning rate while the batch size is 64. The objective is to minimize the negative log-likelihood function derived in (2). For the testing phase, both the estimated conditional intensity and the log-likelihood are evaluated. Moreover, for ODAPP, only 50% of the events are retained for training, i.e., , where is the maximum length of sequences in the data set.
4.1 Synthetic data
In the following experiments with synthetic data, we confirmed that our deep attention point process model is capable of capturing the temporal pattern of synthetic data, which are generated from some conventional generative processes. To evaluate the performance of each method, we first measure the average maximum likelihood of the models on each synthetic data set. Since we know the true latent intensities of these generating processes, we also plot the conditional intensity function over time given one particular event series, and visually inspect whether the trained models predict these intensities accurately.
The synthetic data are obtained by the following four generative processes: (1) Hawkes process: the conditional intensity function is given by , where , , and ; (2) self-correction point process: the conditional intensity function is given by , where , ; (3) non-homogeneous Poisson : The intensity function is given by where is the sample size, the is the PDF of standard normal distribution, and is uniform distribution between and ; (4) non-homogeneous Poisson : The intensity function is a composition of two normal functions, where , where , . Each synthetic data set contains 5,000 sequences with an average length of 30, where each data point in the sequence only contains the occurrence time of the event.
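For readers who want to reproduce a data set of the first kind, the sketch below simulates a temporal Hawkes process with Ogata's thinning algorithm. It assumes an exponential triggering kernel, so the intensity at time t is `mu` plus `alpha * beta * exp(-beta * (t - s))` summed over earlier events s; the parameter values in the usage line are placeholders, not the values used in the experiments.

```python
import math
import random


def hawkes_intensity(t, history, mu, alpha, beta):
    """Hawkes intensity with an exponential kernel (assumed form)."""
    return mu + alpha * sum(beta * math.exp(-beta * (t - s)) for s in history if s < t)


def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Simulate event times on [0, horizon) with Ogata's thinning algorithm."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    events = []
    t = 0.0
    while True:
        # The kernel only decays between events, so the intensity right
        # after time t is an upper bound until the next accepted event.
        lam_bar = mu + alpha * sum(beta * math.exp(-beta * (t - s)) for s in events)
        t += rng.expovariate(lam_bar)  # candidate time from the bounding process
        if t >= horizon:
            return events
        # Accept the candidate with probability intensity(t) / lam_bar.
        if rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)


# Usage: alpha < 1 keeps the process stationary under this parameterization.
sequence = simulate_hawkes(mu=1.0, alpha=0.5, beta=2.0, horizon=50.0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance makes each simulated sequence reproducible, which is convenient when comparing models on the same synthetic data.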
Figure 5 summarizes the log-likelihood value of each model versus the training epochs, where each epoch includes 125 batches, and each batch randomly takes 40 sequences as training data. A higher log-likelihood value indicates better model performance. As we can see from Figure 5 and Table 1, our DAPP outperforms the other four baseline methods on all four synthetic data sets with the largest average maximum log-likelihood value. Besides, our ODAPP also shows competitive performance with only 50% of the events used.
Figure 6 shows the estimated intensities using different methods in contrast to the true latent intensities indicated by the grey lines. We compare the predictive performance of the proposed model fitted to the four types of generative processes. It is clear that our DAPP captures the true conditional intensity function better than the other four baseline methods on all four synthetic data sets.
4.2 Real data
We evaluate the performance on real-world data sets from a diverse range of domains, including a spatio-temporal data set and three other temporal data sets:
Traffic Congestions: We collect the data of traffic congestions from the Georgia Department of Transportation (GDOT)  over 178 days from 2017 to 2018, including 15,663 congestion events recorded by 86 different observation sites. Each event consists of time, location, and congestion level. We partition the data into 178 sequences by day, and each sequence has an average length of 88.
Electronic Medical Records: Medical Information Mart for Intensive Care III (MIMIC-III)  contains de-identified clinical visit time records from 2001 to 2012 for more than 40,000 patients. We select 2,246 patients with at least three visits. The visit history of each patient is considered as an event series, and each clinical visit is considered as an event.
Financial Transactions: We collected data from NYSE on the high-frequency transactions for a stock. It contains 0.7 million transaction records, each of which records the time (in milliseconds) and the possible action (sell or buy). We partition the raw data by day into 5,756 sequences with an average length of 48.
Memes: MemeTracker  tracks the meme diffusion over public media, which contains more than 172 million news articles or blog posts. The memes are sentences, such as ideas, proverbs, and the time is recorded when it spreads to specific websites. We randomly sample 22,003 sequences of memes with an average length of 24.
We compare and report the average log-likelihood of different models over training epochs on the testing data of each data set in Figure 7. As we can see from Figure 7 and Table 1, our DAPP outperforms the other alternatives with the largest average maximum log-likelihood value when the model converges.
We also evaluate the conditional intensity of our DAPP for the traffic data set, where spatial information is included. For better visualization of the conditional intensity over space, we select 14 representative observation sites on two major highways (I-75 and I-85) in Atlanta, as shown on the left of Figure 8, and visualize their conditional intensity on May 8th, 2018. We first visualize the conditional intensity of the 14 sites as a heatmap in the upper right of Figure 8, where each row represents an observation site, each column represents a specific time frame, and the color depth of each entry indicates the level of intensity. There is an apparent temporal pattern: the conditional intensity of each site peaks in both the morning (around 7:00) and evening (around 16:00) rush hours. We also categorize the observation sites into three groups based on their locations and plot their conditional intensities over time in the bottom right of Figure 8. The observation sites in the same subplot show similar temporal patterns, since these sites successively share the same traffic flow. Moreover, the result also exhibits the “phantom traffic jam” phenomenon. This kind of situation usually begins when a part of the traffic flow slows down even slightly, which causes the flow behind that part to slow even more, and the slowing action spreads backward through the lane of traffic like a wave, getting worse the farther it spreads. For example, since the sites L1S, L2S, and LR1S are distributed along the southbound direction of I-75, the peak of the conditional intensity of each site drifts to the right, appearing about half an hour later than that of its adjacent site to the south. A similar phenomenon can also be found among the sites L1N, L2N, and LR1N. More results from other days and their interpretations are shown in Appendix C.
4.3 Score function interpretation
To interpret the result of our score function, we visualize the scores of current event against its historical events in Figure 2, where 15 events in a row in a sequence are considered. Specifically, the entry at the -th row and the -th column of the lower triangular matrix represents scores of the event against its past event , i.e., .
Figure 2 confirms that our DAPP can capture the complex dependence between events accurately, and that the probabilistic score function is interpretable. Specifically, Figure 2 (a) shows the scores of events generated from a Hawkes process. The pattern of scores for each event against its past clearly resembles the exponential decay in the kernel function of Hawkes processes. We also conduct a similar experiment on the traffic data set, as shown in Figure 2 (b), where spatial information is considered as well. The figure shows that the first three events have more impact on their subsequent events than the others, i.e., the first three scores in the last row are remarkably higher than the other scores. By investigating the data, we find that the first three events are observed by the same site on the highway, while the others are observed by sites located elsewhere. This provides some insight into the importance of the first three events and the location of the observation site.
In conclusion, we propose a novel attention-based mechanism for learning general spatio-temporal processes. As demonstrated by our experiments, our method achieves the best performance in maximizing the likelihood function of a point process compared with previous approaches. Besides, by fitting data from various kinds of point process models, we show that our model exceeds the others in terms of robustness and flexibility. Furthermore, based on the structural information of dynamic networks, our model can be generalized in such a way that the prediction of the current event of a particular type might depend more on some specific types of events by exploring the structure of the score matrices. This gives us a new method for implementing causal inference in networks.
Appendix A Deriving Conditional Intensity of MSTPPs
The conditional intensity function is defined by:
The last equality holds because, up to time , the filtration contains the history of events , so can equivalently be written as .
Then can be interpreted heuristically as follows: consider a small enough interval around denoted as , and a ball with radius , then
Denote where , the above equation can be written as
This shows that the definition presented in Section 2.1 is equivalent.
Appendix B Deriving Log-Likelihood of MSTPPs
The likelihood function is defined as:
Note from the definition of conditional intensity function, we get:
Integrating both sides from to (since depends on history events, its support is ), where is the last event before , integrating over all , and summing over all marks, we get:
Using basic calculus, we find:
Plugging the above formula into the definition of the likelihood function, we have:
and the log-likelihood function of the marked spatio-temporal point process can then be written as:
Appendix C Additional Traffic Results
As shown in Figure 9, we present two additional examples using the traffic data set for two different days: Figure 9 (a) shows the conditional intensities on Saturday, April 21st, 2018, which has a different traffic congestion pattern from the one shown in Figure 8. The overall congestion intensities are lower than on normal weekdays, and the morning rush hour is delayed by around two hours; Figure 9 (b) shows the conditional intensities on Tuesday, April 24th, 2018. On this day, Atlanta broke a 135-year-old rainfall record when it got 4.16 inches of rain . The previous record, set in 1883, was 2.4 inches. As we can see from the figure, the heavy rain and subsequent flooding in the city led to an unusual level of traffic congestion. In contrast to the results shown in Figure 8, the traffic congestion level remains relatively high throughout the entire day.
- (2014) Neural machine translation by jointly learning to align and translate.
- (2016-11) Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 551–561.
- (2016) Recurrent marked temporal point processes: embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1555–1564.
- (2018) Neural processes.
- Traffic analysis and data application (tada). Note: http://www.dot.ga.gov/DS/Data
- (2010) Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, New York, NY, USA, pp. 1019–1028.
- (1971) Spectra of some self-exciting and mutually exciting point processes. Biometrika 58 (1), pp. 83–90.
- (2016) MIMIC-iii, a freely accessible critical care database. Scientific Data 3 (1), pp. 160035.
- (2019) Attentive neural processes. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
- MemeTracker. Note: http://www.memetracker.org/data.html
- (2018) Learning temporal point processes via reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS ’18, Red Hook, NY, USA, pp. 10804–10814.
- (2015-09) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421.
- (2017) The neural hawkes process: a neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems 30, pp. 6754–6764.
- (2018) Atlanta breaks 135-year-old rainfall record – and more is on the way. Note: https://www.ajc.com/news/local/atlanta-breaks-135-year-old-rainfall-record-and-more-the-way/ToXxI0475c7evyvMp2FrMP/
- (2019) Fully neural network based model for general temporal point processes. In Advances in Neural Information Processing Systems 32, pp. 2120–2129.
- (2016-11) A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2249–2255.
- (2017) A review of self-exciting spatio-temporal point processes and their applications. Note: Statistical Science 33 (2018), no. 3, pp. 299-318
- (2009) Long aftershock sequences within continents and implications for earthquake hazard assessment. Nature 462 (7269), pp. 87–89.
- (2018) Deep reinforcement learning of marked temporal point processes. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 3168–3178.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
- (2018) Graph attention networks. In International Conference on Learning Representations.
- (2017) Wasserstein learning of deep generative point process models. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, Red Hook, NY, USA, pp. 3250–3259.
- (2017) Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI ’17, pp. 1597–1603.
- (2019) Multivariate spatiotemporal hawkes processes and network reconstruction. SIAM Journal on Mathematics of Data Science 1 (2), pp. 356–382.
- (2018) Self-attention generative adversarial networks.
- (2019) Self-attentive hawkes processes.
- (2019) Spatial-temporal-textual point processes with applications in crime linkage detection.
- (2019) Adversarial anomaly detection for marked spatio-temporal streaming data.