Detecting changes in dynamic events over networks
Abstract
Large volumes of networked streaming event data are becoming increasingly available in a wide variety of applications, such as social network analysis, Internet traffic monitoring, and healthcare analytics. Streaming event data are discrete observations occurring in continuous time, and the precise time interval between two events carries a great deal of information about the dynamics of the underlying systems. How can we promptly detect changes in these dynamic systems using such streaming event data? In this paper, we propose a novel changepoint detection framework for multidimensional event data over networks. We cast the problem as a sequential hypothesis test, and derive the likelihood ratios for point processes, which are computed efficiently via an EM-like algorithm that is parameter-free and can be computed in a distributed fashion. We derive a highly accurate theoretical characterization of the false-alarm rate, and show that the procedure can achieve weak signal detection by aggregating local statistics over time and networks. Finally, we demonstrate the good performance of our algorithm on numerical examples and real-world datasets from Twitter and Memetracker.
Keywords: Changepoint Detection for Event Data, Hawkes Process, Online Detection Algorithm, False-Alarm Control
I. Introduction
Networks have become a convenient tool for people to efficiently disseminate, exchange, and search for information. Recent attacks on very popular web sites such as Yahoo and eBay [1], leading to a disruption of services to users, have triggered an increasing interest in network anomaly detection. On the positive side, surges of hot topics and breaking news can provide business opportunities. Therefore, early detection of changes, such as anomalies, epidemic outbreaks, hot topics, or new trends among streams of data from networked entities, is a very important task and has been attracting significant interest [1, 2, 3].
All of the above-mentioned changes can be more concretely formulated as changes in the distribution of time intervals between events, combined with alterations of the interaction structure across components in the network. However, changepoint detection based on event data occurring over a network topology is nontrivial. Apart from the possible temporal dependency of the event data and the complex cross-dimensional dependence among components in the network, event data from networked entities are usually not synchronized in time. Dynamic in nature, many of the collected data are discrete events observed irregularly in continuous time [4, 5]. The precise time interval between two events is random and carries a great deal of information about the dynamics of the underlying systems. These characteristics make such event data fundamentally different from independently and identically distributed (i.i.d.) data, and from time-series data where time and space are treated as indices rather than random variables (see Figure 1 for further illustration of the distinctive nature of event data vs. i.i.d. and time-series data). Clearly, the i.i.d. assumption cannot capture temporal dependency between data points, while time-series models require us to discretize the time axis and aggregate the observed events into bins (such as the approach in [6] for neural spike train change detection). If this approach is taken, it is not clear how one should choose the size of the bin, nor how to best deal with the case when there is no event within a bin.
Besides the distinctive temporal and spatial aspects, there are three additional challenges in using event data over networks: (i) how to detect weak changes; (ii) how to update the statistics efficiently online; and (iii) how to provide a theoretical characterization of the false-alarm rate of the statistics. For the first challenge, many existing approaches use random or ad hoc aggregations which may not pool data efficiently and may lose statistical power to detect weak signals. Occurrences of changepoints (e.g., epidemic outbreaks, hot topics, etc.) over networks usually evince a certain clustering behavior over dimensions and tend to synchronize in time. Smart aggregation over dimensions and the time horizon would amplify the strength of the signals and detect the change more quickly [7]. For the second challenge, many existing changepoint detection methods based on likelihood ratio statistics do not take computational complexity into account, cannot be computed in a distributed fashion, and, hence, are not scalable to large networks. Temporal events can arrive at social platforms in very high volume and velocity. For instance, on average, around 500 million tweets are posted on Twitter every day [8]. There is a great need for efficient algorithms for updating the detection statistics online. For the third challenge, it is usually very hard to control false alarms for changepoint detection statistics over a large network. When applied to real network data, traditional detection approaches usually have high false-alarm rates [1]. This leads to a huge waste of resources, since every time a changepoint is declared, subsequent diagnoses are needed. Lacking an accurate theoretical characterization of false alarms, existing approaches usually have to resort to expensive Monte Carlo simulations to determine the false-alarm rates, which is prohibitive for large networks.
Our contributions. In this paper, we present a novel online changepoint detection framework tailored to multidimensional intertwined event data streams over networks (or conceptual networks) that tackles the above challenges. We formulate the problem by leveraging the mathematical framework of sequential hypothesis testing and point process modeling, where before the change the event stream follows one point process, and after the change the event stream becomes a different point process. Our goal is to detect such changes as quickly as possible after they occur. We derive generalized likelihood ratio statistics, and present an efficient EM-like algorithm to compute the statistic online with streaming data. The EM-like algorithm is parameter-free and can be implemented in a distributed fashion; hence, it is suitable for large networks.
Specifically, our contributions include the following:
(i) We present a new sequential hypothesis test and likelihood ratio approach for detecting changes in event data streams over networks. We either use the Poisson process as the null distribution to detect the emergence of temporal dependence, or use the Hawkes process as the null distribution to detect a possible alteration of the dependency structure. For the (inhomogeneous) Poisson process, time intervals between events are assumed to be independent and exponentially distributed. For the Hawkes process, the occurrence intensity of events depends on the events that have occurred, which implies that the time intervals between events are correlated. Therefore, the Hawkes process can be thought of as a special autoregressive process in time, and the multivariate Hawkes process also provides a flexible model to capture cross-dimension dependency in addition to temporal dependency. Our model explicitly captures the information diffusion (and dependencies) over both networks and time, and allows us to aggregate information for weak signal detection. Our proposed detection framework is quite general and can be easily adapted to other point processes.
In contrast, existing work on changepoint detection for point processes has focused on a single stream rather than the multidimensional case over networks. These works include detecting changes in the intensity of a Poisson process [9, 10, 11] and in the coefficients of a continuous diffusion process [12]; detecting changes using self-exciting Hawkes processes, including trend detection in social networks [13]; and detection for Poisson processes using a score statistic [14].
(ii) We present an efficient expectation-maximization (EM)-like algorithm for updating the likelihood-ratio detection statistic online. The algorithm can be implemented in a distributed fashion due to its structure: only neighboring nodes need to exchange information in the E-step and M-step.
(iii) We also present an accurate theoretical approximation to the false-alarm rate (formally, the average run length, or ARL) of the detection algorithm, via the recently developed change-of-measure approach for handling highly correlated statistics. Our theoretical approximation can be used to set the threshold in the algorithm accurately.
(iv) Finally, we demonstrate the performance gain of our algorithm over two baseline algorithms (which ignore the temporal correlation and the correlation between nodes), using synthetic experiments and real-world data. These two baseline algorithms represent the current approaches for processing event stream data. We also show that our algorithm is very sensitive to true changes, and that the theoretical false-alarm rates are very accurate compared to the experimental results.
Related work. Recently, there has been a surge of interest in using multidimensional point processes for modeling dynamic event data over networks. However, most of these works focus on modeling and inference of the point processes over networks. Related works include modeling and learning bursty dynamics [5]; shaping social activity by incentivization [15]; learning information diffusion networks [4]; inferring causality [16]; learning mutually exciting processes for viral diffusion [17]; learning triggering kernels for multidimensional Hawkes processes [18]; networks where each dimension is a Poisson process [19]; learning latent network structure for general counting processes [20]; tracking parameters of dynamic point process networks [21]; and estimating point process models for the co-evolution of network structure and information diffusion [22], just to name a few. These existing works provide a wealth of tools through which we can, to some extent, keep track of the network dynamics if the model parameters can be sequentially updated. However, given only the values of the up-to-date model parameters, especially in high-dimensional networks, it is still not clear how to perform change detection based on these models in a principled fashion.
Classical statistical sequential analysis (see, e.g., [23, 24]), where one monitors i.i.d. univariate and low-dimensional multivariate observations from a single data stream, is a well-developed area. Outstanding contributions include Shewhart's control chart [25], the minimax approach of Page's CUSUM procedure [26, 27], the Bayesian approach of the Shiryaev-Roberts procedure [28, 29], and window-limited procedures [30]. However, there is limited research on monitoring large-scale data streams over a network, let alone event streams over networks. Detection of changepoints in point processes has so far mostly focused on simple Poisson process models without considering temporal dependency, and most of the detection statistics are computed in a discrete-time fashion; that is, one needs to aggregate the observed events into bins and then apply traditional detection approaches to the time series of count data. Examples include [31, 32, 2].
The notation is standard. The remaining sections are organized as follows. Section II presents the point process model and derives the likelihood functions. Section III presents our sequential likelihood ratio procedure. Section IV presents the EM-like algorithm. Section V presents our theoretical approximation to the false-alarm rate. Section VI contains the numerical examples. Section VII presents our results on real data. Finally, Section VIII summarizes the paper. All proofs are delegated to the Appendix.
II. Model and Formulation
Consider a sequence of events over a network with $d$ nodes, represented as a double sequence

$$\{(t_i, u_i)\}_{i \ge 1}, \qquad t_i \in \mathbb{R}^+, \ u_i \in \{1, \ldots, d\}, \tag{1}$$

where $t_i$ denotes the real-valued time when the $i$-th event happens, and $u_i$ indicates the index of the node where the event happens. We use temporal point processes [33] to model the discrete event streams, since they provide a convenient tool for directly modeling the time intervals between events: they avoid the need to pick a time window to aggregate events and allow temporal events to be modeled in a fine-grained fashion.
II-A Temporal point processes
A temporal point process is a random process whose realization consists of a list of discrete events localized in time, $\{t_i\}$, with $t_i \in \mathbb{R}^+$ and $t_i < t_{i+1}$. We start by considering one-dimensional point processes. Let the list of times of events up to but not including time $t$ be the history $\mathcal{H}_t = \{t_1, \ldots, t_{N(t^-)}\}$.
Let $N(t)$ represent the total number of events up to time $t$. Then the counting measure can be defined as

$$dN(t) = \sum_{t_i \in \mathcal{H}_t} \delta(t - t_i)\, dt, \tag{2}$$

where $\delta(\cdot)$ is the Dirac delta function.
To define the likelihood ratio for point processes, we first introduce the notion of the conditional intensity function [34]. The conditional intensity function is a convenient and intuitive way of specifying how the present depends on the past in a temporal point process. Let $F(t \mid \mathcal{H}_t)$ be the conditional probability that the next event happens before time $t$ given the history of previous events, and let $f(t \mid \mathcal{H}_t)$ be the corresponding conditional density function. The conditional intensity function (or the hazard function) [34] is defined by

$$\lambda(t) = \frac{f(t \mid \mathcal{H}_t)}{1 - F(t \mid \mathcal{H}_t)}, \tag{3}$$

and it can be interpreted as the probability that an event occurs in an infinitesimal interval $[t, t + dt)$:

$$\lambda(t)\, dt = \mathbb{P}\{\text{event in } [t, t + dt) \mid \mathcal{H}_t\}. \tag{4}$$
This general model includes Poisson process and Hawkes process as special cases.
(i) For (inhomogeneous) Poisson processes, each event is stochastically independent of all other events in the process, and the time intervals between consecutive events are independent of each other and exponentially distributed. As a result, the conditional intensity function is independent of the past and is simply deterministic: $\lambda(t) = \mu(t)$.
(ii) For one-dimensional Hawkes processes, the intensity function is history dependent and models mutual excitation between events:

$$\lambda(t) = \mu + \alpha \int_0^t \varphi(t - s)\, dN(s) = \mu + \alpha \sum_{t_i < t} \varphi(t - t_i), \tag{5}$$

where $\mu > 0$ is the base intensity (deterministic), $\alpha \in [0, 1)$ (due to the stationarity requirement) is the influence parameter, and $\varphi(t)$ is a normalized kernel function, $\int_0^\infty \varphi(t)\, dt = 1$. Together, they characterize how the history influences the current intensity. Fixing the kernel function, a higher value of $\alpha$ means a stronger temporal dependency between events. A commonly used kernel is the exponential kernel $\varphi(t) = \beta e^{-\beta t}$, $t \ge 0$, which we will use throughout the paper.
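As a concrete illustration (not part of the original derivation), the conditional intensity in (5) with an exponential kernel can be evaluated directly from the list of past event times; the function name and parameter names below are our own:

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Evaluate lambda(t) = mu + alpha * sum over past events t_i < t
    of the exponential kernel beta * exp(-beta * (t - t_i))."""
    return mu + alpha * sum(
        beta * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )
```

With no past events the intensity equals the base rate `mu`; each past event adds an exponentially decaying contribution controlled by `alpha` and `beta`.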
(iii) The multidimensional Hawkes process is defined similarly, with each dimension being a one-dimensional counting process. It can be used to model a sequence of events over a network such as (1). We may convert a multidimensional process into a double sequence, using the first coordinate to represent the time of the event, and the second coordinate to represent the index of the corresponding node.
Define a multivariate counting process $(N_1(t), \ldots, N_d(t))$, $t \ge 0$, with each component $N_j(t)$ recording the number of events at the $j$-th component (node) of the network during $[0, t]$. The intensity function of the $i$-th node is

$$\lambda_i(t) = \mu_i + \sum_{j=1}^{d} \alpha_{ij} \int_0^t \varphi(t - s)\, dN_j(s), \tag{6}$$

where $\alpha_{ij} \ge 0$ represents the strength of influence of the $j$-th node on the $i$-th node by affecting its intensity process $\lambda_i(t)$. If $\alpha_{ij} = 0$, then node $j$ is not influencing node $i$. Written in matrix form, the intensity can be expressed as

$$\boldsymbol{\lambda}(t) = \boldsymbol{\mu} + \int_0^t \varphi(t - s)\, A\, d\boldsymbol{N}(s), \tag{7}$$

where $\boldsymbol{\lambda}(t) = (\lambda_1(t), \ldots, \lambda_d(t))^\top$, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)^\top$, $\boldsymbol{N}(t) = (N_1(t), \ldots, N_d(t))^\top$, and $A = (\alpha_{ij}) \in \mathbb{R}^{d \times d}$ is the influence matrix, which is our main quantity of interest when detecting a change. The diagonal entries of $A$ characterize self-excitation and the off-diagonal entries capture mutual excitation among nodes in the network. The influence matrix can be asymmetric, since the influence between a pair of nodes need not be reciprocal.
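To make the multivariate intensity (6) concrete, here is a small sketch (our own illustration, with hypothetical names) that evaluates all $d$ node intensities given the event history and an influence matrix `A`, again with the exponential kernel. Following the convention above, `A[i][j]` is the influence of node `j` on node `i`:

```python
import math

def multivariate_intensity(t, events, mu, A, beta):
    """Evaluate lambda_i(t) for every node i.

    events: list of (t_k, u_k) pairs (event time, node index)
    mu:     list of base intensities, one per node
    A:      influence matrix; A[i][j] = influence of node j on node i
    """
    lam = list(mu)
    for tk, uk in events:
        if tk < t:
            decay = beta * math.exp(-beta * (t - tk))
            for i in range(len(mu)):
                lam[i] += A[i][uk] * decay
    return lam
```

Note that only the columns of `A` corresponding to nodes with past events contribute, which is what makes sparse influence matrices cheap to handle.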
[Figure: the two change scenarios. (a) Poisson to Hawkes; (b) Hawkes to Hawkes.]
II-B Likelihood function
In the following, we will explicitly denote the dependence of the likelihood function on the parameters in each setting. The following three cases are useful for our subsequent derivations. Let $f$ denote the probability density function. For the one-dimensional setting, suppose we are given a sequence of events (event times) $t_1 < t_2 < \cdots < t_n$ before time $t$. Using the conditional probability formula, we obtain

$$\log f(t_1, \ldots, t_n) = \sum_{i=1}^{n} \log \lambda(t_i) - \int_0^t \lambda(s)\, ds. \tag{8}$$

The last equation follows from the following argument. From the definition of the conditional density function, we have

$$\lambda(t) = \frac{f(t \mid \mathcal{H}_t)}{1 - F(t \mid \mathcal{H}_t)} = -\frac{d}{dt} \log\big(1 - F(t \mid \mathcal{H}_t)\big).$$

Hence, $1 - F(t \mid \mathcal{H}_t) = \exp\big(-\int_{t_n}^{t} \lambda(s)\, ds\big)$, where $F(t_n \mid \mathcal{H}_{t_n}) = 0$, since the next event cannot happen at time $t_n$. Therefore,

$$f(t \mid \mathcal{H}_t) = \lambda(t) \exp\Big(-\int_{t_n}^{t} \lambda(s)\, ds\Big).$$
The likelihood function for the multidimensional Hawkes process can be derived similarly, by redefining $f$, $F$, and $\lambda$ according to the intensity functions of the multidimensional processes.
Based on the above principle, we can derive the following likelihood functions.
II-B1 Homogeneous Poisson process
For a homogeneous Poisson process with constant intensity $\lambda(t) = \mu$, the log-likelihood function for a list of events in the time interval $[0, T]$ can be written as

$$\ell = N(T) \log \mu - \mu T. \tag{9}$$
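As a quick sanity check of (9), the log-likelihood of a homogeneous Poisson sample is straightforward to compute; the helper below (our own naming) is also the null-model term used in the likelihood ratios later:

```python
import math

def poisson_loglik(event_times, mu, T):
    """Log-likelihood N(T) * log(mu) - mu * T of a homogeneous
    Poisson process with rate mu observed on [0, T]."""
    return len(event_times) * math.log(mu) - mu * T
```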
II-B2 One-dimensional Hawkes process
For a one-dimensional Hawkes process with constant baseline intensity $\mu$ and exponential kernel, we may obtain its log-likelihood function based on the above calculation. By substituting the conditional intensity function (5) into (8), the log-likelihood function for events $\{t_i\}$ in the time interval $[0, T]$ is given by

$$\ell = \sum_{i: t_i \le T} \log\Big(\mu + \alpha \sum_{t_j < t_i} \beta e^{-\beta(t_i - t_j)}\Big) - \mu T - \alpha \sum_{i: t_i \le T} \big(1 - e^{-\beta(T - t_i)}\big). \tag{10}$$

To obtain the above expression, we have used the following two simple results for exponential kernels, due to the property of the counting measure defined in (2):

$$\int_0^t \varphi(t - s)\, dN(s) = \sum_{t_i < t} \beta e^{-\beta(t - t_i)}, \tag{11}$$

and

$$\int_0^T \sum_{t_i < t} \beta e^{-\beta(t - t_i)}\, dt = \sum_{t_i \le T} \big(1 - e^{-\beta(T - t_i)}\big). \tag{12}$$
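The log-likelihood (10) can be computed directly from its definition, as in the sketch below (our own naming; the inner sum makes this O(n^2), and an O(n) recursion exists for the exponential kernel but is omitted here for clarity):

```python
import math

def hawkes_loglik(ts, mu, alpha, beta, T):
    """Log-likelihood (10): sum_i log(lambda(t_i)) - mu*T
    - alpha * sum_i (1 - exp(-beta * (T - t_i))).
    ts must be sorted ascending and contained in [0, T]."""
    ll = -mu * T
    for i, ti in enumerate(ts):
        lam = mu + alpha * sum(beta * math.exp(-beta * (ti - tj)) for tj in ts[:i])
        ll += math.log(lam) - alpha * (1.0 - math.exp(-beta * (T - ti)))
    return ll
```

Setting `alpha = 0` recovers the homogeneous Poisson log-likelihood (9), which is a convenient correctness check.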
II-B3 Multidimensional Hawkes process
For the multidimensional point process, we consider an event stream such as (1). Assume the base intensities are constants, $\mu_i > 0$. Using similar calculations as above, we obtain the log-likelihood function for events in the time interval $[0, T]$ as

$$\ell = \sum_{i: t_i \le T} \log\Big(\mu_{u_i} + \sum_{t_j < t_i} \alpha_{u_i u_j} \beta e^{-\beta(t_i - t_j)}\Big) - T \sum_{m=1}^{d} \mu_m - \sum_{i: t_i \le T} \sum_{m=1}^{d} \alpha_{m u_i} \big(1 - e^{-\beta(T - t_i)}\big). \tag{13}$$
III. Sequential changepoint detection
We are interested in detecting two types of changes sequentially from event streams, which capture two general scenarios in real applications (Fig. 2 illustrates these two scenarios in the one-dimensional setting): (i) The sequence before the change is a Poisson process and after the change is a Hawkes process. This is useful for applications where we are interested in detecting the emergence of self- or mutual-excitation between nodes. (ii) The sequence before the change is a Hawkes process, and after the change the magnitude of the influence matrix increases. This can be a more realistic scenario, since nodes in a network often influence each other initially. This is useful for applications where a triggering event changes the behavior or structure of the network; for instance, detecting the emergence of a community in a network [35].
In the following, we cast the changepoint detection problems as sequential hypothesis tests [36], and derive a generalized likelihood ratio (GLR) statistic for each case. Suppose there may exist an unknown changepoint $\kappa$ such that after that time, the distribution of the point process changes.
III-A Change from Poisson to Hawkes
First, we are interested in detecting the events over the network changing from $d$ independent Poisson processes to an intertwined multivariate Hawkes process. This models the effect that the change alters the spatial dependency structure over the network. Below, we first consider the one-dimensional setting, and then generalize to the multidimensional case.
III-A1 One-dimensional case
The data consist of a sequence of events occurring at times $t_1, t_2, \ldots$. Under the hypothesis of no change (i.e., $\kappa = \infty$), the event times form a one-dimensional Poisson process with intensity $\mu$. Under the alternative hypothesis (i.e., $\kappa < \infty$), there exists a changepoint $\kappa$: the sequence is a Poisson process with intensity $\mu$ initially, and changes to a one-dimensional Hawkes process with parameter $\alpha$ after the change. Formally, the hypothesis test can be stated as

$$H_0:\ \lambda(t) = \mu,\ \forall t \qquad \text{vs.} \qquad H_1:\ \lambda(t) = \begin{cases} \mu, & t < \kappa, \\ \mu + \alpha \sum_{\kappa \le t_i < t} \beta e^{-\beta(t - t_i)}, & t \ge \kappa. \end{cases} \tag{14}$$

Assume the intensity $\mu$ can be estimated from reference data and is given a priori. We treat the post-change influence parameter $\alpha$ as an unknown parameter, since it represents an anomaly.
Using the likelihood functions derived in Section II-B, equations (9) and (10), for a hypothetical changepoint location $\kappa$, the log-likelihood ratio as a function of $t$, $\kappa$, and $\alpha$ is given by

$$\ell_t(\kappa, \alpha) = \sum_{t_i \in (\kappa, t]} \log\Big(1 + \frac{\alpha}{\mu} \sum_{\kappa \le t_j < t_i} \beta e^{-\beta(t_i - t_j)}\Big) - \alpha \sum_{t_i \in (\kappa, t]} \big(1 - e^{-\beta(t - t_i)}\big). \tag{15}$$
Note that the log-likelihood ratio only depends on the events in the interval $(\kappa, t]$. We maximize the statistic with respect to the unknown parameters $\kappa$ and $\alpha$ to obtain the log-GLR statistic. Finally, the sequential changepoint detection procedure is a stopping rule (related to the non-Bayesian minimax type of detection rule, see [37]):

$$T = \inf\Big\{t : \max_{\kappa < t}\, \sup_{\alpha}\, \ell_t(\kappa, \alpha) > b\Big\}, \tag{16}$$
where $b > 0$ is a prescribed threshold, whose choice will be discussed later. Even though there does not exist a closed-form expression for the estimator of $\alpha$, we can estimate $\alpha$ via an EM-like algorithm, which will be discussed in Section IV-B.
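Putting the pieces together, the windowed detection statistic can be sketched as follows. This is our own illustration, not the paper's implementation: for simplicity it maximizes over the influence parameter on a finite grid rather than running the EM-like algorithm, and all names are hypothetical:

```python
import math

def hawkes_loglik(ts, mu, alpha, beta, t0, t1):
    """Log-likelihood of sorted events ts (all in (t0, t1]) under a Hawkes
    model restarted at t0; alpha = 0 reduces to a Poisson(mu) model."""
    ll = -mu * (t1 - t0)
    for i, ti in enumerate(ts):
        lam = mu + alpha * sum(beta * math.exp(-beta * (ti - tj)) for tj in ts[:i])
        ll += math.log(lam) - alpha * (1.0 - math.exp(-beta * (t1 - ti)))
    return ll

def glr_statistic(ts, mu, beta, t0, t1, alpha_grid):
    """sup over alpha of (Hawkes - Poisson) log-likelihood on window (t0, t1]."""
    ll_null = hawkes_loglik(ts, mu, 0.0, beta, t0, t1)
    return max(hawkes_loglik(ts, mu, a, beta, t0, t1) for a in alpha_grid) - ll_null
```

An alarm is raised whenever the statistic exceeds the threshold; because the grid contains `alpha = 0`, the statistic is always nonnegative.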
Remark 1 (Offline detection).
We can adapt the procedure for offline changepoint detection by considering the fixed-sample hypothesis test. For instance, in the one-dimensional setting, given a sequence of $n$ events observed up to time $T$, we may detect the existence of a change when the detection statistic $\max_{\kappa < T} \sup_{\alpha} \ell_T(\kappa, \alpha)$ exceeds a threshold. The changepoint location can be estimated as the $\kappa$ that attains the maximum. However, the algorithmic considerations for online and offline detection are very different, as discussed in Section IV.
III-A2 Multidimensional case
For the multidimensional case, the event stream data can be represented as a double sequence defined in (1). We may construct a similar hypothesis test as above. Under the hypothesis of no change, the event times form a multidimensional Poisson process with a vector intensity function $\boldsymbol{\mu}$. Under the alternative hypothesis, there exists a changepoint $\kappa$: the sequence is a multidimensional Poisson process initially, and changes to a multidimensional Hawkes process with influence matrix $A$ afterwards. We omit the formal statement of the hypothesis test, as it is similar to (14).
Again, using the likelihood functions derived in Section II-B, we obtain the likelihood ratio. The log-likelihood ratio for data up to time $t$, given a hypothetical changepoint location $\kappa$ and parameter $A$, is given by

$$\ell_t(\kappa, A) = \sum_{t_i \in (\kappa, t]} \log\Big(1 + \frac{1}{\mu_{u_i}} \sum_{\kappa \le t_j < t_i} \alpha_{u_i u_j} \beta e^{-\beta(t_i - t_j)}\Big) - \sum_{t_i \in (\kappa, t]} \sum_{m=1}^{d} \alpha_{m u_i} \big(1 - e^{-\beta(t - t_i)}\big). \tag{17}$$
The sequential changepoint detection procedure is a stopping rule:

$$T = \inf\Big\{t : \max_{\kappa < t}\, \sup_{A}\, \ell_t(\kappa, A) > b\Big\}, \tag{18}$$
where $b > 0$ is a predetermined threshold. The multidimensional maximization can be computed efficiently via the EM-like algorithm described in Section IV-B.
Remark 2 (Topology of network).
The topology of the network is embedded in the sparsity pattern of the influence matrix $A$, which is given a priori. The dependency between different nodes in the network and the temporal dependence over events can be captured by updating (or tracking) the influence matrix with the event stream. This can be achieved via an EM-like algorithm, which results from solving a sequential optimization problem with warm starts (i.e., we always initialize the parameters using the optimal solutions of the last step).
III-B Change from Hawkes to Hawkes
Next, consider the scenario where the process prior to the change is a Hawkes process, and the change happens in the influence parameter $\alpha$ or the influence matrix $A$.
III-B1 One-dimensional case
Under the hypothesis of no change, the event stream is a one-dimensional Hawkes process with influence parameter $\alpha_0$. Under the alternative hypothesis, there exists a changepoint $\kappa$: the sequence is a Hawkes process with influence parameter $\alpha_0$ before the change, and after the change, the influence parameter becomes $\alpha_1 \neq \alpha_0$. Assume the parameter $\alpha_0$ prior to the change is known.
III-B2 Multidimensional case
For the multidimensional setting, we assume the change alters the influence parameters of the multidimensional Hawkes process over the network. This captures the effect that, after the change, the influence between nodes becomes different. Assume that under the hypothesis of no change, the event stream is a multidimensional Hawkes process with influence matrix $A_0$. Alternatively, there exists a changepoint $\kappa$: the sequence is a multidimensional Hawkes process with influence matrix $A_0$ before the change, and after the change, the influence matrix becomes $A_1$. Assume the influence matrix $A_0$ prior to the change is known.
IV. Algorithm for computing the likelihood online
In the online setting, we obtain new data continuously. Hence, in order to perform online detection, we need to update the likelihood efficiently to incorporate the new data. To reduce the computational cost, the update of the likelihood function should be computed recursively and have low cost per step. To reduce the memory requirement, the algorithm should only store the minimum amount of data necessary for detection rather than the complete history. These requirements make online detection drastically different from offline detection, since in the offline setting we can afford more computational complexity.
IV-A Sliding window procedure
The basic idea of the online detection procedure is illustrated in Fig. 3. We adopt a sliding window approach to reduce the computational complexity as well as the memory requirement. When evaluating the likelihood function, instead of maximizing over the possible changepoint location $\kappa$, we pick a window size $w$ and set $\kappa$ to the fixed value $\kappa = t - w$. This is equivalent to constantly testing whether a changepoint occurred $w$ time units before. By fixing the window size, we reduce the computational complexity, since we eliminate the maximization over the changepoint location. This also reduces the memory requirement, as we only need to store the events that fall into the sliding window. The drawback is that some statistical detection power is lost, since we do not use the most likely changepoint location, and this may increase the detection delay. When implementing the algorithm, we choose $w$ to achieve a good balance between these two aspects. We have to choose $w$ large enough so that enough events are stored for us to make consistent inference. In practice, a proper window length depends on the nature of the data: if the data are noisy, a longer time window is usually needed to obtain a better parameter estimate and reduce false alarms.
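The sliding-window buffer itself is simple to maintain. A minimal sketch (our own naming), which keeps only the events within the last `width` time units as each new event arrives:

```python
from collections import deque

class SlidingWindow:
    """Keep only event times within the trailing window of the given width."""

    def __init__(self, width):
        self.width = width
        self.events = deque()

    def push(self, t):
        """Append a new event time and evict events older than t - width."""
        self.events.append(t)
        while self.events and self.events[0] < t - self.width:
            self.events.popleft()
```

Because events arrive in time order, eviction from the left of the deque is amortized O(1) per event, so the buffer adds negligible overhead to the statistic update.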
IV-B Parameter-free EM-like algorithm
We consider the one-dimensional point process to illustrate the derivation of the EM-like algorithm. It can be shown that the likelihood function (15) is a concave function of the parameter $\alpha$. One could use gradient descent to optimize this objective, but such an algorithm typically involves additional tuning parameters such as the learning rate. Although there does not exist a closed-form estimator for the influence parameter $\alpha$ or the influence matrix $A$, we develop an efficient EM-like algorithm to update the likelihood, exploiting the structure of the likelihood function [38]. The overall algorithm is summarized in Algorithm 1.
First, we obtain a concave lower bound of the likelihood function using Jensen's inequality. Consider all events falling into the sliding window at time $t$. Introduce auxiliary variables $p_{ij}$ for all pairs of events $i$ and $j$ within the window such that $t_j < t_i$, together with $p_{ii}$. The variables are subject to the constraint

$$p_{ii} + \sum_{j: t_j < t_i} p_{ij} = 1, \qquad p_{ij} \ge 0. \tag{21}$$
The variable $p_{ij}$ can be interpreted as the probability that the $j$-th event triggered the $i$-th event in the sequence, and $p_{ii}$ as the probability that the $i$-th event comes from the background intensity. It can be shown that the likelihood function defined in (10) can be lower-bounded as

$$\ell(\alpha) \ge \sum_{i} \Big[ p_{ii} \log \frac{\mu}{p_{ii}} + \sum_{j: t_j < t_i} p_{ij} \log \frac{\alpha \beta e^{-\beta(t_i - t_j)}}{p_{ij}} \Big] - \mu w - \alpha \sum_{i} \big(1 - e^{-\beta(t - t_i)}\big).$$

Note that the lower bound is valid for every choice of $\{p_{ij}\}$ that satisfies (21).
To make the lower bound tight and ensure improvement in each iteration, we maximize it with respect to $\{p_{ij}\}$ and obtain (22) (assuming we have $\alpha^{(k)}$ from the previous iteration or initialization). Once we have the tight lower bound, we take the gradient of the lower bound with respect to $\alpha$. Updating from the $k$-th iteration to the $(k+1)$-th iteration, we obtain (23):

$$p_{ij}^{(k)} = \frac{\alpha^{(k)} \beta e^{-\beta(t_i - t_j)}}{\mu + \alpha^{(k)} \sum_{l: t_l < t_i} \beta e^{-\beta(t_i - t_l)}}, \qquad p_{ii}^{(k)} = \frac{\mu}{\mu + \alpha^{(k)} \sum_{l: t_l < t_i} \beta e^{-\beta(t_i - t_l)}}, \tag{22}$$

$$\alpha^{(k+1)} = \frac{\sum_{i} \sum_{j: t_j < t_i} p_{ij}^{(k)}}{\sum_{i} \big(1 - e^{-\beta(t - t_i)}\big)}, \tag{23}$$
where the superscript $(k)$ denotes the iteration number. The algorithm iterates these two steps until convergence and obtains the estimate $\widehat{\alpha}$. In practice, we find that only 3 or 4 iterations are needed to converge when using a warm start.
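A compact sketch of the two steps for the one-dimensional case, with our own naming (`mu` and `beta` are assumed known, as in the text; the E-step only needs the aggregated triggering probabilities, so the individual p_ij need not be stored):

```python
import math

def em_alpha(ts, mu, beta, T, alpha0=0.5, iters=4):
    """EM-like estimate of the influence parameter alpha from sorted
    event times ts observed on [0, T], with mu and beta known."""
    alpha = alpha0
    for _ in range(iters):
        # E-step: expected number of triggered events, sum of p_ij over j < i
        triggered = 0.0
        for i, ti in enumerate(ts):
            excitation = [alpha * beta * math.exp(-beta * (ti - tj)) for tj in ts[:i]]
            triggered += sum(excitation) / (mu + sum(excitation))
        # M-step: closed-form update for alpha
        denom = sum(1.0 - math.exp(-beta * (T - ti)) for ti in ts)
        if denom > 0:
            alpha = triggered / denom
    return alpha
```

With a single isolated event nothing can be triggered, so the estimate collapses to zero; with tightly clustered events the triggered mass, and hence the estimate, grows.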
Similarly, an online estimate of the influence matrix $A$ for the multidimensional case can be obtained by iterating two analogous steps: an E-step that computes the triggering probabilities $p_{ij}$ using the multidimensional intensity, and an M-step that updates each entry $\alpha_{mn}$ by aggregating the $p_{ij}$ over the relevant pairs of events.
Remark 3 (Computational complexity).
The key computation is the pairwise inter-event times $t_i - t_j$ for pairs of events $i$, $j$. The cost is related to the window size (since we have adopted a sliding window approach), the size of the network, and the number of EM steps. However, note that in the EM-like algorithm, we only need to compute the inter-event times for nodes that are connected by an edge, since the summation is weighted by $\alpha_{u_i u_j}$ and a term only counts if $\alpha_{u_i u_j}$ is nonzero. Hence, the updates only involve neighboring nodes, and the complexity is proportional to the number of edges in the network. Since most social networks are sparse, this significantly lowers the complexity. We may reduce the number of EM iterations per update by leveraging a warm start for initializing the parameter values: for two adjacent sliding windows, the corresponding optimal parameter values should typically be very close.
Remark 4 (Distributed implementation).
Our EM-like algorithm in the network setting can be implemented in a distributed fashion; this is embedded in the form of the algorithm already. Hence, the algorithm can be used to process large networks. In the E-step, when updating $p_{ij}$, we need to evaluate a sum in the denominator, and this is the only place where different nodes need to exchange information, i.e., the event times that happened at each node. Since we only need to sum over events for which the corresponding $\alpha_{u_i u_j}$ is nonzero, each node only needs to consider the events that happened at its neighboring nodes. Similarly, in the M-step, only neighboring nodes need to exchange their values of $p_{ij}$ and event times to update the influence parameter values.
V. Theoretical threshold
A key step in implementing the detection algorithm is to set the threshold. The choice of threshold involves a trade-off between two standard performance metrics for sequential changepoint detection: the false-alarm rate and how quickly we can detect the change. Formally, these two performance metrics are: (i) the expected stopping time when there is no changepoint, also called the average run length (ARL); and (ii) the expected detection delay when there exists a changepoint.
Typically, a higher threshold results in a larger ARL (hence a smaller false-alarm rate) but a larger detection delay. A usual practice is to set the ARL (or, equivalently, the false-alarm rate) to a predetermined value and find the corresponding threshold $b$. The predetermined ARL depends on how frequently we can tolerate false detections (e.g., once a month or once a year). Usually, the threshold is estimated via direct Monte Carlo simulation, by relating the threshold to the ARL under the assumption that the data follow the null distribution. However, Monte Carlo simulation is not only computationally expensive; in some practical problems, repeated experiments would be prohibitive. Therefore, it is important to find a cheaper way to accurately estimate the threshold.
We develop an analytical expression that relates the threshold to the ARL. That is, given a prescribed ARL, we can solve for the corresponding threshold analytically. We first characterize the property of the likelihood ratio statistic in the following lemma, which states that the mean and variance of the log-likelihood ratio both scale roughly linearly with the post-change time duration. This property of the likelihood ratio statistic is key to developing our main result.
Lemma 1 (Mean and variance of log-likelihood ratios).
When the number of post-change samples is large, the mean and variance of the log-likelihood ratio $\ell_t$, for both the single-dimensional and the multidimensional cases, converge to a simple linear form. Under the null hypothesis, $\mathbb{E}^\infty[\ell_t] \approx \mu_0 t$ and $\mathrm{Var}^\infty[\ell_t] \approx \sigma_0^2 t$. Under the alternative hypothesis, $\mathbb{E}^0[\ell_t] \approx \mu_1 t$ and $\mathrm{Var}^0[\ell_t] \approx \sigma_1^2 t$. Here, $\mu_0$, $\sigma_0^2$, $\mu_1$, and $\sigma_1^2$ are defined in Table I for the various settings we consider.
Our main theoretical result is the following general theorem, which applies to all the hypothesis tests we consider. Denote the probability and the expectation under the hypothesis of no change by $\mathbb{P}^\infty$ and $\mathbb{E}^\infty$, respectively.
Theorem 1 (ARL under the null distribution).
When $b \to \infty$ and $b/w \to c$ for some constant $c$, the average run length (ARL) of the stopping time $T$ defined in (16) for the one-dimensional case is given by

(24)

For the multidimensional case, the same expression holds for $\mathbb{E}^\infty[T]$, except that the integral over $\alpha$ is replaced by an integral with respect to all nonzero entries of the matrix $A$. The specific expressions for $\mu_0$, $\sigma_0^2$, $\mu_1$, and $\sigma_1^2$ for the various settings are summarized in Table I, and the special function $\nu(\cdot)$ appearing above is given by

$$\nu(x) = \frac{(2/x)\big[\Phi(x/2) - 0.5\big]}{(x/2)\,\Phi(x/2) + \phi(x/2)}. \tag{25}$$

Here, $\Phi(\cdot)$ and $\phi(\cdot)$ are the cumulative distribution function (CDF) and the probability density function (PDF) of the standard normal distribution, respectively.
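The special function above is the standard overshoot-correction function from sequential analysis. Assuming the form in (25), it can be evaluated numerically as follows (our own naming, using only the standard-normal CDF and PDF):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nu(x):
    """Overshoot correction nu(x), cf. (25); nu(x) -> 1 as x -> 0."""
    half = x / 2.0
    return (2.0 / x) * (Phi(half) - 0.5) / (half * Phi(half) + phi(half))
```

With such a closed form, solving the ARL equation for the threshold reduces to a one-dimensional root-finding problem rather than a Monte Carlo simulation.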
Remark 5 (Evaluating integral).
The multidimensional integral can be evaluated using the Monte Carlo method [39]; we use this approach for our numerical examples as well.
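A generic sketch of the Monte Carlo strategy suggested above: average the integrand at uniform samples over the integration box and scale by the box volume. The paper's specific integrand is not reproduced here; the function, box, and sample size below are illustrative.

```python
import numpy as np

def mc_integrate(f, lower, upper, n=20000, seed=0):
    """Plain Monte Carlo estimate of the integral of f over the box
    [lower, upper]: (box volume) * (average of f at uniform samples).
    Generic sketch of the strategy cited for Theorem 1's integral [39]."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    x = rng.uniform(lower, upper, size=(n, lower.size))
    volume = float(np.prod(upper - lower))
    return volume * float(np.mean([f(xi) for xi in x]))
```

The estimator's error decays as $O(n^{-1/2})$ regardless of the dimension, which is what makes it practical for the multidimensional case.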
Remark 6 (Interpretation).
The parameters $I_0$, $\sigma_0^2$, $I_1$, and $\sigma_1^2$ have the following interpretation:
$$I_0 = -\,\mathbb{E}^\infty[\ell_t]/t,\quad \sigma_0^2 = \mathrm{Var}^\infty[\ell_t]/t,\quad I_1 = \mathbb{E}[\ell_t]/t,\quad \sigma_1^2 = \mathrm{Var}[\ell_t]/t \quad \text{(for large } t\text{)}, \qquad (26)$$
where $\ell_t$ denotes the log-likelihood ratio accumulated over a post-change duration $t$; that is, they are the mean and the variance of the log-likelihood ratio per unit time under the null and the alternative distributions, respectively. Moreover, $I_0$ and $I_1$ can be interpreted roughly as the Kullback-Leibler information per unit time for each hypothesis test we consider.
The proof of Theorem 1 combines recently developed change-of-measure techniques for sequential analysis with properties of the likelihood ratios for point processes, a mean-field approximation for point processes, and the delta method [40].
[Table I: expressions for $I_0$, $\sigma_0^2$, $I_1$, and $\sigma_1^2$ in each of the settings considered. In the table, $\odot$ denotes the Hadamard product.]
VI Numerical examples
In this section, we present numerical experiments using synthetic data. We focus on comparing the expected detection delay (EDD) of our algorithm with that of two baseline methods, and we demonstrate the accuracy of the analytical threshold.
VI-A Comparison of EDD
VI-A1 Two baseline algorithms
We compare our method to two baseline algorithms:
(i) Baseline 1 corresponds to the commonly used "data binning" approach for processing discrete event data, as in [6]. This approach ignores both the temporal correlation and the correlation between nodes. We convert the event data into counts by discretizing time into a uniform grid and counting the number of events in each interval. Such count data can be modeled via a Poisson distribution, from which we derive a likelihood ratio statistic to detect a change. Suppose $\{x_k\}$ is the sequence of counts, where $k$ indexes the discrete time step, and each count follows a Poisson distribution. Under the null hypothesis, the intensity is $\lambda_0$; alternatively, there may exist a changepoint such that the intensity is $\lambda_0$ before the change and $\lambda_1$ after the change. The log-likelihood ratio statistic can then be written down in closed form.
We detect a change whenever the statistic exceeds a predetermined threshold. If every dimension follows an independent Poisson process, the log-likelihood ratio for the multidimensional case is simply the sum of the per-dimension log-likelihood ratios, and a change is again declared whenever this sum exceeds the threshold.
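A minimal sketch of Baseline 1's statistic under the notation above, assuming a known post-change rate $\lambda_1$ (a simple-vs-simple test rather than an estimated alternative; `lam0`, `lam1`, and the bin width `delta` are the assumed symbols):

```python
import numpy as np

def poisson_binned_llr(counts, lam0, lam1, delta=1.0):
    """Log-likelihood ratio of post-change rate lam1 vs null rate lam0
    for binned Poisson counts (bin width delta), summed over the bins:
    sum_k [ x_k * log(lam1/lam0) - (lam1 - lam0) * delta ]."""
    counts = np.asarray(counts, dtype=float)
    return float(np.sum(counts * np.log(lam1 / lam0) - (lam1 - lam0) * delta))

def scan_changepoint(counts, lam0, lam1, delta=1.0):
    """Maximize the LLR over candidate changepoints kappa; a change is
    declared when the maximum exceeds a threshold. For multiple
    independent Poisson dimensions, the per-dimension statistics add."""
    tail_llrs = [poisson_binned_llr(counts[k:], lam0, lam1, delta)
                 for k in range(len(counts))]
    k_hat = int(np.argmax(tail_llrs))
    return k_hat, tail_llrs[k_hat]
```

Note that this fixed-rate variant never takes the logarithm of a count, so empty bins cause no numerical trouble here; the instability discussed later for Baseline 1 arises when the post-change rate is estimated from the binned counts.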
(ii) Baseline 2 computes the one-dimensional changepoint detection statistic at each node separately, as in (15) and (19), and then combines the statistics by summation into a global statistic used for detection. This approach ignores the correlation between nodes; it can be viewed as a centralized approach to changepoint detection and is related to multi-chart changepoint detection [37].
VI-A2 Setup of synthetic experiments
We consider the following scenarios and compare the EDD of our method to the two baseline methods. EDD is defined as the average time (delay) it takes to detect the change, and can be understood as the power of the test statistic in the sequential setting. For a fair comparison, the thresholds of all three methods are calibrated via direct Monte Carlo so that the ARL under the null model equals the same fixed number of time units. The sliding window is set to a fixed length, and the exponential kernel is used. The scenarios are described below: Case 1 and Case 2 are illustrated in Fig. 2, and the network topologies for Cases 3 to 7 are shown in Fig. 4.
Case 1. The events first follow a one-dimensional Poisson process and then shift to a Hawkes process with a nonzero influence parameter. This scenario describes the emergence of temporal dependency in the event data.
Case 2. The process shifts from a one-dimensional Hawkes process to another Hawkes process with a larger influence parameter. This scenario represents a change in the temporal dependency of the event data.
Case 3. Consider a star network with one parent and nine children, commonly used to model how information broadcasts over a network. Before the changepoint, each node has a given base intensity and self-excitation, and the mutual-excitation from the parent to each child is set to a common value (with the first node representing the parent). After the changepoint, all the self- and mutual-excitation parameters increase to 0.5.
Case 4. The network topology is the same as in Case 3, but we consider a more challenging scenario. Before the change, the parameters are the same as in Case 3. After the change, the self-excitation of the children decreases while the influence from the parent to the children increases. In this case, the frequency of events at each node is almost the same before and after the changepoint, but the influence structure embedded in the network has changed.
Case 5. Consider a chain network of ten nodes, commonly used to model information propagation over a network. Before the change, each node has a given base intensity and self-excitation, and successive nodes along the chain have a given mutual-excitation. After the changepoint, all the self- and mutual-excitation parameters increase to 0.5.
Case 6. Consider a sparse network with an arbitrary topology and one hundred nodes. Each node has a given base intensity and self-excitation. We randomly select twenty directed edges over the network and assign them a common mutual-excitation value. After the changepoint, all the self- and mutual-excitation parameters increase to 0.5.
Case 7. The sparse network topology and the pre-change parameters are the same as in Case 6. The only difference is that after the changepoint, only half of the self- and mutual-excitation parameters increase to 0.5.
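Synthetic event streams like those in the cases above can be generated with Ogata's thinning algorithm, one standard way to simulate Hawkes processes. Below is a minimal one-dimensional sketch with the exponential kernel; the parameter names are illustrative, and the experiments use the multivariate analogue.

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """lambda(t) = mu + alpha * sum_{t_i < t} beta * exp(-beta * (t - t_i))."""
    if not events:
        return mu
    dt = t - np.asarray(events)
    return mu + alpha * beta * float(np.sum(np.exp(-beta * dt)))

def simulate_hawkes(mu, alpha, beta=1.0, T=100.0, seed=0):
    """Ogata's thinning: between events the intensity only decays, so the
    intensity just after the current time upper-bounds it until the next
    accepted event. Requires alpha < 1 for stability."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        lam_bar = hawkes_intensity(t, events, mu, alpha, beta)
        t += rng.exponential(1.0 / lam_bar)       # candidate arrival
        if t >= T:
            break
        # accept candidate with probability lambda(t) / lam_bar
        if rng.uniform() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return np.asarray(events)
```

Setting `alpha = 0` recovers a homogeneous Poisson process, so Case 1's pre- and post-change regimes correspond to `alpha = 0` and `alpha > 0` with all other parameters fixed.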
VI-A3 EDD results and discussions
For the above scenarios, we compare the EDD of our method and the two baseline algorithms. The results are shown in Table II. Our method compares favorably with the two baselines: in the first five cases, it has a significant performance gain. In particular, for the challenging Case 4, only our method succeeds in detecting the change in the spatial structure. For Case 6 and Case 7, our method has performance similar to Baseline 2. A possible reason is that in these cases the network topology is a sparse graph, so the nodes are only "loosely" correlated; hence, the advantage of combining over the graph is not significant.
Moreover, we observe that the Baseline 1 algorithm is not stable. In certain cases (Case 6 and Case 7), it completely fails to detect the change. An explanation is that the number of events falling into a given time bin can be extremely small or zero, which causes numerical issues when calculating the likelihood function (since it involves the logarithm of the number of events). Our proposed log-likelihood ratio, on the other hand, is event-triggered and avoids such numerical issues.
TABLE II: EDD comparison.

Case    | Baseline 1 | Baseline 2 | Our Method
Case 1  | 22.1       | *          | 4.8
Case 2  | 19.6       | *          | 18.8
Case 3  | 8.2        | 6.9        | 4.3
Case 4  | –          | –          | 19.8
Case 5  | 6.1        | 5.7        | 4.7
Case 6  | –          | 10.5       | 10.8
Case 7  | –          | 32.5       | 32.5

Note: '–' means the corresponding method fails to detect the change; '*' means that in the one-dimensional case Baseline 2 is identical to our method.
VI-B Sensitivity analysis
We also perform a sensitivity analysis comparing our method to the Baseline 1 algorithm via numerical simulation. The comparison is conducted under various values of the kernel decay parameter and various post-change signal strengths, which are controlled by the magnitudes of the changes in the influence parameters. For each setting, we create 500 sample sequences, half containing one true changepoint and half containing none. We then plot the area under the curve (AUC), i.e., the true positive rate against the false positive rate as the threshold varies, for comparison, as shown in Fig. 5.
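The AUC used here admits a direct rank-based (Mann-Whitney) computation, equivalent to sweeping the threshold, under the assumption that each sequence is summarized by a single score (e.g., the maximum of its detection statistic):

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC via the rank formula: the probability that a changepoint
    sequence's score exceeds a null sequence's score (ties count 1/2).
    Equivalent to integrating the ROC curve over all thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the comparisons in Fig. 5 should be read.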
VI-B1 Setup of synthetic experiments
Overall, we compare the approaches under various decay parameters and various magnitudes of change in the influence parameters.
One-dimensional setting. First, suppose the pre-change data follow a Poisson process with a given base intensity. For settings A.1–A.4, the post-change data follow a one-dimensional Hawkes process: A.1–A.3 vary the influence parameter, while A.4 uses a larger decay parameter. Comparing the AUC curves, our method is remarkably better at distinguishing true changes from false positives than the baseline method, and its superiority becomes more evident for larger decay parameters and larger shifts in the influence parameter. For weak changes, the baseline approach is only slightly better than random guessing, whereas our approach consistently performs well. Similar results hold when the pre-change data follow a Hawkes process: in B.1–B.3, the pre-change data follow a Hawkes process and the post-change influence parameter shifts upward by varying amounts, and we observe the same trend as the decay and influence parameters vary.
Network setting. We first consider two-dimensional examples and obtain the same results. In C.1–C.2, the pre-change data follow two-dimensional Poisson processes, and the post-change data follow two-dimensional Hawkes processes with different influence parameters. In D.1–D.3, we consider the star network with one parent and nine children: before the changepoint, each node has a given base intensity, self-excitation, and parent-to-child influence; after the changepoint, one of these parameters changes to 0.4 in D.1, and to 0.5 in D.2 and D.3, respectively.
[Fig. 5: AUC comparison between our method and Baseline 1 for settings A.1–A.4, B.1–B.3, C.1–C.2, and D.1–D.3.]
VI-C Accuracy of theoretical threshold
We evaluate the accuracy of our approximation in Theorem 1 by comparing the threshold obtained via Theorem 1 with the true threshold obtained by direct Monte Carlo. We consider various scenarios and parameter settings. We demonstrate the results in Fig. 6 and list the parameters below.
For Fig. 6(a)(b)(c), the null distribution is a one-dimensional Poisson process with a given intensity. We fix the post-change parameters a priori and vary the length of the sliding time window across the three panels. For Fig. 6(d), we use a different combination of intensity and window length. Comparing these four examples, we find that our approximate threshold is very accurate regardless of the intensity and window length.
For Fig. 6(e)(f), the null hypothesis is a one-dimensional Hawkes process with given base intensity and influence parameters, and we vary the sliding window length; the approximations are as accurate as before. For Fig. 6(g)(h), we consider a multidimensional case: the null distribution is a two-dimensional Poisson process with a given base intensity, and we vary the window length. The results demonstrate that our analytical threshold is also highly accurate in the multidimensional situation.
VII Real data
We evaluate our online detection algorithm on real data from Twitter and news websites. Evaluating our log-likelihood ratio statistic on real tweeting events, we observe that the statistic rises when there is a major real-world event that explains the change. By comparing the detected changepoints to the times of the true major events, we verify the accuracy and effectiveness of our proposed algorithm. In all our real-data experiments, the sliding window size is a fixed number of minutes and the kernel bandwidth is set to 1. The number of events per tested sequence ranges from 3000 to 15000 for each dataset.
VII-A Twitter Dataset
For the Twitter dataset, we focus on the star network topology. We create a dataset of famous users and randomly select 30 followers from among each user's tens of thousands of followers. We assume a star-shaped network from the celebrity to the followers, and collect all of their tweets and retweets in late January and early February 2016. Fig. 9(a) shows the statistic computed for the account associated with the TV series Mr. Robot. The statistic increases around late January 10 and early January 11, which corresponds to the show winning a 2016 Golden Globe Award (http://www.tvguide.com/news/goldenglobeawardswinners2016/). Fig. 9(b) shows the statistic computed from the events of the First Lady of the USA and 30 of her randomly selected followers. The statistic reveals a sudden increase on January 13. We find a related event: Michelle Obama stole the show during the president's final State of the Union address by wearing a marigold dress, which sold out even before the president finished the speech (http://www.cnn.com/2016/01/13/living/michelleobamadressmarigoldnarcisorodriguezfeat/). Fig. 9(c) relates to Suresh Raina, an Indian professional cricketer; we select a small social circle around him as the center of a star-shaped network. He led his team to win an important game on January 20 (http://www.espncricinfo.com/syedmushtaqalitrophy201516/content/story/963891.html), which corresponds to a sharp increase in the statistic. More results for this dataset can be found in Appendix E.
We further perform a sensitivity analysis using the Twitter data. We identify 116 important real-life events; typical examples include the release of a movie or album, the winning of an award, and the Pulse Nightclub shooting. Next, we identify the Twitter handles associated with the entities behind these events, randomly sample 50 followers of each account, and obtain a star-topology graph centered around each handle. We collect the tweets of all users in these networks over a window of time before and after each real-life event, and compute the statistic for each network. The AUC curves in Fig. 7 are obtained by varying the threshold: a threshold value is said to correctly identify the true changepoint if the statistic to the right of the changepoint exceeds the threshold. The results demonstrate the good performance of our algorithm against the two baseline algorithms.
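The evaluation rule described above can be sketched as follows; the data layout (one statistic trajectory per network, plus the index of the true changepoint) is an assumption for illustration:

```python
import numpy as np

def detects_change(stat, change_idx, threshold):
    """The rule described for the Twitter experiment: a threshold
    correctly identifies the true changepoint if the statistic exceeds
    it somewhere to the right of the changepoint."""
    return bool(np.max(np.asarray(stat)[change_idx:]) > threshold)

def roc_point(stats_pos, change_idxs, stats_neg, threshold):
    """One point on the AUC curve: TPR over sequences with a true
    changepoint, FPR over null sequences (a null sequence is a false
    positive if its statistic ever crosses the threshold)."""
    tpr = float(np.mean([detects_change(s, k, threshold)
                         for s, k in zip(stats_pos, change_idxs)]))
    fpr = float(np.mean([np.max(np.asarray(s)) > threshold
                         for s in stats_neg]))
    return tpr, fpr
```

Sweeping the threshold over the range of observed statistic values and collecting the resulting (FPR, TPR) pairs traces out the curves in Fig. 7.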
VII-B Memetracker Dataset
As a further illustration of our method, we experiment with the Memetracker dataset (http://www.memetracker.org/) to detect changes in news blogs. The dataset contains information flows captured by hyperlinks between different sites, with timestamps, over nine months. It tracks short units of text and short phrases, called memes, which act as signatures of topics and events and propagate and diffuse over the web in mainstream media and blogs [41]. The dataset has been used previously in Hawkes process models of social activity [42, 18].
We create three instances of changepoint detection scenarios from the Memetracker dataset using the following common procedure. First, we identify a keyword associated with a piece of news that occurred at a known time. Second, we identify the top websites with the most mentions of the selected keyword in a time window around the news break time. Third, we extract all articles with timestamps within this window that contain the keyword, treating each article as an event in the point process. Fourth, we construct directed edges between the websites based on the reported linking structure. These instances correspond to real-world news whose occurrence was unexpected or uncertain, and which can therefore cause abrupt behavior changes in the blogs. The details of these instances are shown in Table III.
TABLE III: Details of the Memetracker instances.

Real-world news          | Top sites | News date | Window start | Window end
Obama elected president  | 80        | 11/04/08  | 11/02/08     | 11/05/08
Ceasefire in Israel      | 60        | 01/17/09  | 01/13/09     | 01/17/09
Olympics in Beijing      | 100       | 08/05/08  | 08/02/08     | 08/05/08
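The first three steps of the instance-construction procedure can be sketched as follows; the record layout (site, timestamp, text) and the helper name are assumptions for illustration, not the paper's pipeline:

```python
from collections import Counter

def build_instance(articles, keyword, t_start, t_end, top_k=10):
    """Keep articles in [t_start, t_end] mentioning the keyword,
    restrict to the top_k sites by mention count, and return the
    per-site sorted event times (each matching article = one event)."""
    hits = [(site, ts) for site, ts, text in articles
            if t_start <= ts <= t_end and keyword in text.lower()]
    counts = Counter(site for site, _ in hits)
    top_sites = {site for site, _ in counts.most_common(top_k)}
    events = {}
    for site, ts in hits:
        if site in top_sites:
            events.setdefault(site, []).append(ts)
    return {site: sorted(ts_list) for site, ts_list in events.items()}
```

The fourth step, building directed edges from the reported linking structure, would then define which pairwise influence parameters are allowed to be nonzero in the network Hawkes model.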
The first piece of news corresponds to "Barack Obama was elected as the 44th president of the United States" (https://en.wikipedia.org/wiki/United_States_presidential_election,_2008). For this example, we also plot the largest connected component of the network, shown in Fig. 8; notably, this subset includes credible news agencies such as BBC, CNN, WSJ, the Huffington Post, and the Guardian. As shown in Fig. 10(a), our algorithm successfully pinpoints a change right at the time Obama was elected. The second piece of news corresponds to the ceasefire in the Israel–Palestine conflict in 2009. Our algorithm detects a sharp change in the data, aligned closely with the time right before the peak of the conflict and one day before Israel announced a unilateral ceasefire in the Gaza War (http://news.bbc.co.uk/2/hi/middle_east/7835794.stm). The third piece of news corresponds to the summer Olympic Games in Beijing. Fig. 10(c) shows the evolution of our statistic: the changepoint detected is 2–3 days before the opening ceremony, when all the news websites started to talk about the event (https://en.wikipedia.org/wiki/2008_Summer_Olympics).
VIII Summary
In this paper, we have studied a set of likelihood ratio statistics for detecting changes in a sequence of event data over networks. To the best of our knowledge, our work is the first to study changepoint detection for the network Hawkes process. We adopted the network Hawkes process to model self- and mutual-excitation between nodes in the network, cast the problem in the sequential changepoint detection framework, and derived the likelihood ratios under several models. We also presented an EM-like algorithm that can efficiently compute the likelihood ratio statistic online; the distributed nature of the algorithm enables implementation on large networks. Highly accurate theoretical approximations for the false-alarm rate, i.e., the average run length (ARL), of our algorithms are derived. We demonstrated the performance gain of our algorithms relative to two baselines, which represent the current main approaches to this problem, and tested the proposed method on both synthetic and real data.
References
 [1] N. Laptev, S. Amizadeh, and I. Flint, “Generic and scalable framework for automated timeseries anomaly detection,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1939–1947.
 [2] C. Lévy-Leduc and F. Roueff, “Detection and localization of change-points in high-dimensional network traffic data,” The Annals of Applied Statistics, pp. 637–662, 2009.
 [3] N. Christakis and J. Fowler, “Social network sensors for early detection of contagious outbreaks,” PloS one, vol. 5, no. 9, p. e12948, 2010.
 [4] M. Rodriguez, D. Balduzzi, and B. Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
 [5] S. Myers and J. Leskovec, “The bursty dynamics of the twitter information network,” in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 913–924.
 [6] R. Ratnam, J. Goense, and M. E. Nelson, “Changepoint detection in neuronal spike train activity,” Neurocomputing, 2003.
 [7] C. Milling, C. Caramanis, S. Mannor, and S. Shakkottai, “Distinguishing infections on different graph topologies,” IEEE Transactions on Information Theory, vol. 61, no. 6, pp. 3100–3120, 2015.
 [8] “Twitter statistics,” http://www.internetlivestats.com/twitterstatistics/.
 [9] J. Shen and N. Zhang, “Changepoint model on nonhomogeneous poisson processes with application in copy number profiling by nextgeneration dna sequencing,” The Annals of Applied Statistics, vol. 6, no. 2, pp. 476–496, 2012.
 [10] N. Zhang, B. Yakir, C. Xia, and D. Siegmund, “Scanning a poisson random field for local signals,” arXiv preprint arXiv:1406.3258, 2014.
 [11] T. Herberts and U. Jensen, “Optimal detection of a change point in a poisson process for different observation schemes,” Scandinavian Journal of Statistics, vol. 31, no. 3, pp. 347–366, 2004.
 [12] F. Stimberg, A. Ruttor, M. Opper, and G. Sanguinetti, “Inference in continuoustime changepoint models,” in Advances in Neural Information Processing Systems (NIPS), 2011.
 [13] J. Pinto, T. Chahed, and E. Altman, “Trend detection in social networks using hawkes processes,” in Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015, pp. 1441–1448.
 [14] V. Solo and A. Pasha, “A test for independence between a point process and an analogue signal,” Journal of Time Series Analysis, vol. 33, no. 5, pp. 824–840, 2012.
 [15] M. Farajtabar, N. Du, M. Rodriguez, I. Valera, H. Zha, and L. Song, “Shaping social activity by incentivizing users,” in Advances in Neural Information Processing Systems (NIPS), 2014.
 [16] H. Xu, M. Farajtabar, and H. Zha, “Learning granger causality for hawkes processes,” arXiv preprint arXiv:1602.04511, 2016.
 [17] S. Yang and H. Zha, “Mixture of mutually exciting processes for viral diffusion,” in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
 [18] K. Zhou, H. Zha, and L. Song, “Learning triggering kernels for multidimensional hawkes processes,” in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
 [19] S. Rajaram, T. Graepel, and R. Herbrich, “Poissonnetworks: A model for structured point processes,” in Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. Citeseer, 2005, pp. 277–284.
 [20] S. Linderman and R. Adams, “Discovering latent network structure in point process data,” arXiv preprint arXiv:1402.0914, 2014.
 [21] E. Hall and R. Willett, “Tracking dynamic point processes on networks,” arXiv preprint arXiv:1409.0031, 2014.
 [22] M. Farajtabar, Y. Wang, M. Rodriguez, S. Li, H. Zha, and L. Song, “Coevolve: A joint point process model for information diffusion and network coevolution,” in Advances in Neural Information Processing Systems (NIPS), 2015.
 [23] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application. Prentice Hall, 1993.
 [24] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection. Chapman and Hall/CRC, 2014.
 [25] W. A. Shewhart, Economic Control of Quality of Manufactured Product. 1931, reprinted by ASQC Quality Press.
 [26] E. S. Page, “Continuous inspection schemes,” Biometrika, vol. 41, no. 1/2, pp. 100–115, June 1954.
 [27] ——, “A test for a change in a parameter occurring at an unknown point,” Biometrika, vol. 42, no. 3/4, pp. 523–527, 1955.
 [28] A. N. Shiryaev, “On optimal methods in quickest detection problems,” Theory Prob. Appl., vol. 8, pp. 22–46, Jan. 1963.
 [29] S. W. Roberts, “A comparison of some control chart procedures,” Technometrics, no. 8, pp. 411–430, 1966.
 [30] T. L. Lai, “Sequential changepoint detection in quality control and dynamical systems,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 613–658, 1995.
 [31] A. Ihler, J. Hutchins, and P. Smyth, “Adaptive event detection with timevarying poisson processes,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 2006.
 [32] Y. Mei, S. Han, and K. Tsui, “Early detection of a change in poisson rate after accounting for population size effects,” Statistica Sinica, pp. 597–624, 2011.
 [33] D. J. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.
 [34] J. G. Rasmussen, “Temporal point processes: the conditional intensity function,” Lecture Notes, Jan. 2011.
 [35] N. Barbieri, F. Bonchi, and G. Manco, “Influencebased networkoblivious community detection,” in IEEE 13th Int. Conf. on Data Mining (ICDM), 2013.
 [36] D. Siegmund, Sequential Analysis: Test and Confidence Intervals. Springer, Aug. 1985.
 [37] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis Testing and Changepoint Detection. Chapman and Hall/CRC, 2014.
 [38] A. Simma and M. Jordan, “Modeling events with cascades of poisson processes,” in Association for Uncertainty in Artificial Intelligence (UAI), 2012.
 [39] G. Casella and C. Robert, Monte Carlo Statistical Methods. Springer, 2004.
 [40] G. Casella and R. L. Berger, Statistical inference. Cengage Learning, 2001.
 [41] J. Leskovec, L. Backstrom, and J. Kleinberg, “Memetracking and the dynamics of the news cycle,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 497–506.
 [42] M. Farajtabar, X. Ye, S. Harati, L. Song, and H. Zha, “Multistage campaigning in social networks,” arXiv preprint arXiv:1606.03816, 2016.
 [43] B. Yakir, Extremes in random fields: a theory and its applications. John Wiley & Sons, 2013.
 [44] D. Siegmund and E. S. Venkatraman, “Using the generalized likelihood ratio statistic for sequential detection of a changepoint,” Ann. Statist., vol. 23, no. 1, pp. 255 – 271, 1995.
 [45] D. O. Siegmund and B. Yakir, “Detecting the emergence of a signal in a noisy image,” Statistics and Its Interface, vol. 1, pp. 3–12, 2008.
 [46] A. Hawkes, “Spectra of some selfexciting and mutually exciting point processes,” Biometrika, vol. 58, no. 1, pp. 83–90, 1971.
 [47] E. Bacry and J. Muzy, “Second order statistics characterization of hawkes processes and nonparametric estimation,” arXiv preprint arXiv:1401.0903, 2014.
 [48] D. J. Daley and D. Vere-Jones, “Scoring probability forecasts for point processes: the entropy score and information gain,” Journal of Applied Probability, pp. 297–312, 2004.
 [49] D. Vere-Jones, “Probabilities and information gain for earthquake forecasting,” Selected Papers From Volume 30 of Vychislitel’naya Seysmologiya, pp. 104–114, 1998.
Appendix A Proof of Theorem 1
We show the one-dimensional case as an example; the following informal derivation justifies the theorem. Let $t$ denote the current time and let the window length be fixed. Recall our notation: $\mathbb{P}^\infty$ and $\mathbb{E}^\infty$ denote the probability measure and the expectation under the null hypothesis, and $\mathbb{P}$ and $\mathbb{E}$ denote the probability measure and the expectation under the alternative hypothesis. We also use additional notation for conditional expectation where needed.
First, to evaluate the ARL, we study the probability that the detection statistic exceeds the threshold before a given time. We will use the change-of-measure technique [43]. Under the null hypothesis, the boundary-crossing probability can be written as
(27) 
where the last equality follows from exchanging the order of summation and expectation. Using the change of measure, the last equation (27) can be written as
After rearranging terms and introducing additional notation, the equation above can be written as
(28) 
where
The final expression also relies on the following approximation. When the time interval changes slightly, the estimated parameter changes little under the null hypothesis, since it is estimated from the data stored in the window. Therefore, within a small neighborhood we may regard the estimate as a constant, which leads to the approximation: