# Fused Gromov-Wasserstein Alignment for Hawkes Processes

###### Abstract

We propose a novel fused Gromov-Wasserstein alignment method that jointly learns Hawkes processes defined on different event spaces and aligns their event types. Given two Hawkes processes, we use the fused Gromov-Wasserstein discrepancy to measure their dissimilarity, which considers both the Wasserstein discrepancy based on their base intensities and the Gromov-Wasserstein discrepancy based on their infectivity matrices. Accordingly, the learned optimal transport reflects the correspondence between the event types of these two Hawkes processes. The Hawkes processes and their optimal transport are learned jointly via maximum likelihood estimation with a fused Gromov-Wasserstein regularizer. Experimental results show that the proposed method works well on synthetic and real-world data.

## 1 Introduction

There is often a need to align real-world entities in different domains based on their sequential behavior in continuous time, e.g., linking accounts in different social networks based on the behaviors within each network. For each domain, the entities formulate an event space, and their sequential behavior can be represented as event sequences, in which each event is a tuple containing a timestamp and an event type (i.e., the entity involved in the event). When these event sequences are modeled as multi-dimensional point processes, the proposed problem can be reformulated as an alignment problem: learning two point processes and finding the correspondence between their event types.

Focusing on event sequences that are modeled as Hawkes processes, we propose a novel fused Gromov-Wasserstein alignment (FGWA) method. As illustrated in Figure 1(a), the event sequences in each domain are modeled as a Hawkes process parametrized via a base intensity vector and an infectivity matrix. The base intensity captures the intrinsic expected happening rate of each event type, while the infectivity matrix describes the self- and mutually-triggering patterns between different event types. The Wasserstein discrepancy between the two domains is formulated based on their base intensities, and their Gromov-Wasserstein discrepancy is formulated based on their infectivity matrices. We learn an optimal transport to minimize the fusion of these two discrepancies, i.e., the fused Gromov-Wasserstein discrepancy vayer2018fused . The learned optimal transport is then used to regularize the updates of the Hawkes processes. After several iterations, we jointly derive the two Hawkes processes and the optimal transport, which indicates the correspondence between their event types. As shown in Figure 1(b-e), compared with its competitors, our FGWA method learns the optimal transport matrix with the highest certainty — each row contains just one nonzero element.

## 2 Proposed Alignment Method

A temporal point process with $C$ event types can be represented as a counting process $N(t)=\{N_c(t)\}_{c=1}^{C}$, where each $N_c(t)$ is the number of type-$c$ events happening at or before time $t$. The event sequences of the point process are denoted $\mathcal{S}=\{s_m\}_{m=1}^{M}$, where $M$ is the number of sequences, $s_m=\{(t_i^m, c_i^m)\}_{i=1}^{I_m}$, and $I_m$ is the number of events in $s_m$, with $t_i^m\in[0, T_m]$ and $c_i^m\in\{1,\dots,C\}$ representing respectively the time stamp and the event type of the $i$-th event. Point processes are characterized by their intensity functions $\{\lambda_c(t)\}_{c=1}^{C}$, where $\lambda_c(t)=\mathbb{E}[\mathrm{d}N_c(t)\,|\,\mathcal{H}_t]/\mathrm{d}t$ represents the expected instantaneous happening rate of type-$c$ events given the history $\mathcal{H}_t=\{(t_i,c_i)\,|\,t_i<t\}$. As a special kind of point process, the Hawkes process hawkes1971spectra has a particular form of intensity luo2015multi ; zhou2013learning :

$$\lambda_c(t) = \mu_c + \sum_{t_i < t} \phi_{c c_i}(t - t_i). \qquad (1)$$

Here, $\mu_c$ is the base intensity, independent of history, capturing the intrinsic happening rate of type-$c$ events, and $\phi_{cc'}(t)$ is the impact function measuring the infectivity of the type-$c'$ event type on the type-$c$ event type over time. Generally, we can parameterize each impact function by a predefined decay function, e.g., $\phi_{cc'}(t)=a_{cc'}\kappa(t)$, where $\kappa(t)$ is an exponential function and $a_{cc'}\ge 0$ is a learnable coefficient. Therefore, we denote an event sequence $s$ yielding to a Hawkes process as $s\sim\mathrm{HP}(\boldsymbol{\mu},\boldsymbol{A})$, with base intensity $\boldsymbol{\mu}=[\mu_c]\in\mathbb{R}_{\ge 0}^{C}$ and infectivity matrix $\boldsymbol{A}=[a_{cc'}]\in\mathbb{R}_{\ge 0}^{C\times C}$. Given a set of event sequences $\mathcal{S}$, we can learn a Hawkes process via maximum likelihood estimation. The likelihood of $s_m$ is

$$L(s_m;\boldsymbol{\mu},\boldsymbol{A}) = \prod_{i=1}^{I_m} \lambda_{c_i^m}(t_i^m)\times\exp\Bigl(-\sum_{c=1}^{C}\int_0^{T_m}\lambda_c(s)\,\mathrm{d}s\Bigr). \qquad (2)$$
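As an illustrative sketch, the Hawkes log-likelihood above can be computed in closed form when the decay is exponential. The following minimal implementation is our own (the function name, the fixed decay rate `w`, and the unoptimized loops are assumptions, not the paper's code):

```python
import numpy as np

def hawkes_loglik(times, types, mu, A, w, T):
    """Log-likelihood of one event sequence under a Hawkes process with
    exponential decay kappa(t) = exp(-w t).  `mu` is the base-intensity
    vector; A[c, c'] is the infectivity of type-c' events on type-c events."""
    ll = 0.0
    for i, (t_i, c_i) in enumerate(zip(times, types)):
        # intensity of the observed type at the event time (Eq. 1)
        lam = mu[c_i] + sum(A[c_i, types[j]] * np.exp(-w * (t_i - times[j]))
                            for j in range(i))
        ll += np.log(lam)
    # compensator: integral of the total intensity over [0, T]
    comp = mu.sum() * T
    for t_j, c_j in zip(times, types):
        comp += A[:, c_j].sum() * (1.0 - np.exp(-w * (T - t_j))) / w
    return ll - comp
```

The compensator term uses the closed-form integral of the exponential kernel; for other kernels it would be replaced by numerical quadrature.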

The base intensity and the infectivity matrix provide, respectively, the feature of each event type and the relationships among different event types. These two kinds of information can be applied to measure the similarity between different event types in the framework of fused Gromov-Wasserstein discrepancy vayer2018fused . In particular, fused Gromov-Wasserstein discrepancy is a combination of the traditional Wasserstein discrepancy (WD) villani2008optimal and the Gromov-Wasserstein discrepancy (GWD) peyre2016gromov . Focusing on the alignment of Hawkes processes, the proposed fused Gromov-Wasserstein discrepancy can be used as a regularizer when learning the Hawkes process models. Suppose that we have two sets of event sequences corresponding to source and target Hawkes processes, i.e., $\mathcal{S}_s=\{s_m\}_{m=1}^{M_s}$ and $\mathcal{S}_t=\{s_n\}_{n=1}^{M_t}$, where $s_m\sim\mathrm{HP}(\boldsymbol{\mu}_s,\boldsymbol{A}_s)$ and $s_n\sim\mathrm{HP}(\boldsymbol{\mu}_t,\boldsymbol{A}_t)$, with $C_s$ and $C_t$ event types in the source and the target domain, respectively. We learn these two Hawkes processes and align their event types via maximum likelihood estimation with a fused Gromov-Wasserstein regularizer:

$$\min_{\boldsymbol{\mu}_s,\boldsymbol{A}_s,\boldsymbol{\mu}_t,\boldsymbol{A}_t\ge 0}\; -\sum_{s_m\in\mathcal{S}_s}\log L(s_m;\boldsymbol{\mu}_s,\boldsymbol{A}_s) \;-\sum_{s_n\in\mathcal{S}_t}\log L(s_n;\boldsymbol{\mu}_t,\boldsymbol{A}_t) \;+\;\gamma\, d_{\mathrm{fgw}}(\boldsymbol{\mu}_s,\boldsymbol{A}_s;\boldsymbol{\mu}_t,\boldsymbol{A}_t), \qquad (3)$$

where $\boldsymbol{p}_s\in\Delta^{C_s}$ and $\boldsymbol{p}_t\in\Delta^{C_t}$ represent the empirical distributions of the event types in the source and target domain, respectively. These are estimated via the histograms of the counts of events according to $\mathcal{S}_s$ and $\mathcal{S}_t$. The hyperparameter $\gamma>0$ controls the significance of the proposed fused Gromov-Wasserstein regularizer. $d_{\mathrm{fgw}}$ is the discretized version of fused Gromov-Wasserstein discrepancy based on the Hawkes process parameters:

$$d_{\mathrm{fgw}} = \min_{\boldsymbol{T}\in\Pi(\boldsymbol{p}_s,\boldsymbol{p}_t)}\;(1-\alpha)\,\langle \boldsymbol{D},\,\boldsymbol{T}\rangle \;+\;\alpha\sum_{i,i'=1}^{C_s}\sum_{j,j'=1}^{C_t} \ell\bigl(a^s_{ii'},\,a^t_{jj'}\bigr)\,T_{ij}\,T_{i'j'}, \qquad (4)$$

where $\ell(a,b)=(a-b)^2$ is a mean-square-error (MSE) loss, and $\langle\cdot,\cdot\rangle$ represents the matrix inner product. Accordingly, $\boldsymbol{A}_s=[a^s_{ii'}]$, $\boldsymbol{A}_t=[a^t_{jj'}]$, and $\boldsymbol{D}=[d_{ij}]\in\mathbb{R}^{C_s\times C_t}$, whose element $d_{ij}=\ell(\mu^s_i,\mu^t_j)$; and $\Pi(\boldsymbol{p}_s,\boldsymbol{p}_t)=\{\boldsymbol{T}\ge 0\,|\,\boldsymbol{T}\boldsymbol{1}_{C_t}=\boldsymbol{p}_s,\ \boldsymbol{T}^{\top}\boldsymbol{1}_{C_s}=\boldsymbol{p}_t\}$, where $\boldsymbol{1}_C$ represents a $C$-dimensional all-one vector. $\alpha\in[0,1]$ controls the balance between the Wasserstein term and the Gromov-Wasserstein term. The Wasserstein discrepancy compares the event types of the two Hawkes processes in an absolute way, while the Gromov-Wasserstein discrepancy compares them in a relational way. Taking both into account, the final optimal transport $\boldsymbol{T}$ represents a joint distribution of the event types in the two Hawkes processes. As shown in Figure 1(a), the pairs of event types with high probability indicate the correspondence between the event types. The learned optimal transport fills the gap between the source and the target Hawkes processes, so the two models can be learned jointly under its guidance.
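The fused objective above can be evaluated at a fixed coupling using the tensor-product decomposition of the squared-loss Gromov-Wasserstein term from peyre2016gromov . The sketch below is our own illustration (the function name and argument layout are assumptions):

```python
import numpy as np

def fgw_objective(T, mu_s, mu_t, A_s, A_t, alpha):
    """Value of the fused Gromov-Wasserstein objective at a coupling T.
    Wasserstein term: MSE cost between base intensities; GW term: relational
    MSE between infectivity matrices (squared-loss decomposition)."""
    D = (mu_s[:, None] - mu_t[None, :]) ** 2        # cost matrix on event types
    wd = (1.0 - alpha) * np.sum(D * T)
    p, q = T.sum(axis=1), T.sum(axis=0)             # marginals of T
    # L[i,j] = sum_{i',j'} (A_s[i,i'] - A_t[j,j'])**2 T[i',j'], expanded as
    # three terms so the cost is O(C_s^2 C_t + C_s C_t^2), not O(C_s^2 C_t^2)
    L = ((A_s ** 2) @ p)[:, None] + ((A_t ** 2) @ q)[None, :] \
        - 2.0 * A_s @ T @ A_t.T
    gw = alpha * np.sum(L * T)
    return wd + gw
```

When the two processes share identical parameters and $\boldsymbol{T}$ is the identity-like matching, both terms vanish, which is a convenient sanity check.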

## 3 Learning Algorithm

We solve (3) effectively based on an alternating optimization strategy. In each iteration, given the current Hawkes process models, we update the optimal transport between them, and then the Hawkes processes are updated based on the learned optimal transport.

Updating Hawkes processes In the $k$-th iteration, given the optimal transport learned in the previous iteration, i.e., $\boldsymbol{T}^{(k-1)}$, we update the Hawkes process models by

$$\min_{\boldsymbol{\mu}_s,\boldsymbol{A}_s,\boldsymbol{\mu}_t,\boldsymbol{A}_t\ge 0}\; -\sum_{s_m\in\mathcal{S}_s}\log L(s_m;\boldsymbol{\mu}_s,\boldsymbol{A}_s) -\sum_{s_n\in\mathcal{S}_t}\log L(s_n;\boldsymbol{\mu}_t,\boldsymbol{A}_t) +\gamma\Bigl[(1-\alpha)\langle \boldsymbol{D},\,\boldsymbol{T}^{(k-1)}\rangle + \alpha\sum_{i,i',j,j'} \ell\bigl(a^s_{ii'},a^t_{jj'}\bigr)\,T^{(k-1)}_{ij}T^{(k-1)}_{i'j'}\Bigr]. \qquad (5)$$

This problem can be solved effectively via stochastic gradient descent (SGD) mei2017neural . We randomly select a batch of events and their historical events, and calculate the gradients of the base intensities and the infectivity matrices with respect to the event types appearing in the batch. After the parameters are updated via gradient descent, they are projected onto the nonnegative orthant to satisfy the constraints in (3).
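A minimal sketch of this projected SGD step follows. It is our own simplification: the gradient covers only the log-intensity term of the negative log-likelihood (the compensator and regularizer gradients are handled analogously), and the exponential decay rate `w` is an assumed kernel parameter:

```python
import numpy as np

def batch_grad(times, types, mu, A, w, batch_idx):
    """Stochastic gradients of the -log(intensity) term over a batch of events.
    `times`/`types` are NumPy arrays for one sequence."""
    g_mu, g_A = np.zeros_like(mu), np.zeros_like(A)
    for i in batch_idx:
        c_i = types[i]
        decay = np.exp(-w * (times[i] - times[:i]))
        lam = mu[c_i] + np.sum(A[c_i, types[:i]] * decay)
        g_mu[c_i] -= 1.0 / lam                      # d(-log lam)/d mu_c
        for j in range(i):
            g_A[c_i, types[j]] -= decay[j] / lam    # d(-log lam)/d a_{c,c_j}
    return g_mu, g_A

def projected_sgd_step(mu, A, g_mu, g_A, lr=0.01):
    """Descend, then clip to the nonnegative orthant (constraints in (3))."""
    return np.maximum(mu - lr * g_mu, 0.0), np.maximum(A - lr * g_A, 0.0)
```

The projection is just an elementwise clip, which is the exact Euclidean projection onto the nonnegative constraint set.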

Updating optimal transport Given updated Hawkes processes, we further update the optimal transport by solving the following optimization problem:

$$\boldsymbol{T}^{(k)} = \arg\min_{\boldsymbol{T}\in\Pi(\boldsymbol{p}_s,\boldsymbol{p}_t)}\;(1-\alpha)\langle \boldsymbol{D}^{(k)},\,\boldsymbol{T}\rangle + \alpha\sum_{i,i',j,j'} \ell\bigl(a^s_{ii'},a^t_{jj'}\bigr)\,T_{ij}T_{i'j'}, \qquad (6)$$

where $\boldsymbol{D}^{(k)}$ and the infectivity terms are calculated based on the updated base intensities and infectivity matrices. Inspired by the work in peyre2016gromov ; xu2019gromov , we apply a proximal gradient method to solve (6) iteratively. Given the current optimal transport $\boldsymbol{T}^{(l)}$, we add a proximal term as the regularizer of (6):

$$\boldsymbol{T}^{(l+1)} = \arg\min_{\boldsymbol{T}\in\Pi(\boldsymbol{p}_s,\boldsymbol{p}_t)}\;(1-\alpha)\langle \boldsymbol{D}^{(k)},\,\boldsymbol{T}\rangle + \alpha\sum_{i,i',j,j'} \ell\bigl(a^s_{ii'},a^t_{jj'}\bigr)\,T_{ij}T^{(l)}_{i'j'} + \beta\,\mathrm{KL}\bigl(\boldsymbol{T}\,\|\,\boldsymbol{T}^{(l)}\bigr), \qquad (7)$$

where $\mathrm{KL}(\boldsymbol{T}\,\|\,\boldsymbol{T}^{(l)})=\sum_{ij}T_{ij}\log\frac{T_{ij}}{T^{(l)}_{ij}}-\sum_{ij}T_{ij}+\sum_{ij}T^{(l)}_{ij}$ is the Kullback-Leibler (KL) divergence. Applying the proximal gradient method, (7) is solved iteratively, and each inner iteration reduces to an entropic optimal transport problem that can be solved via Sinkhorn iterations xu2019gromov .
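One such proximal step can be sketched as follows, in the style of the proximal point method of xu2019gromov : with the KL proximal term, the update is an entropic OT problem whose Gibbs kernel is reweighted by the previous coupling. This is our own sketch (the function name, the linearized cost argument `D_fused`, and the default parameters are assumptions):

```python
import numpy as np

def proximal_ot_step(D_fused, T_prev, p_s, p_t, beta=0.1, n_sinkhorn=50):
    """One proximal step: min <D_fused, T> + beta * KL(T || T_prev) over
    couplings with marginals (p_s, p_t), solved by Sinkhorn iterations.
    D_fused is the linearized fused cost at T_prev."""
    K = np.exp(-D_fused / beta) * T_prev   # reweighted Gibbs kernel
    a = np.ones_like(p_s)
    for _ in range(n_sinkhorn):            # alternate marginal scalings
        b = p_t / (K.T @ a)
        a = p_s / (K @ b)
    return a[:, None] * K * b[None, :]
```

After convergence of the scalings, the returned matrix satisfies both marginal constraints up to numerical tolerance.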

When updating the Hawkes processes, sub-problem (5) is convex and can be solved with a high convergence rate. When updating the optimal transport, the proposed algorithm is a special case of successive upper-bound minimization (SUM) razaviyayn2013unified , whose global convergence is guaranteed. Applying SGD, we solve (5) with computational complexity $\mathcal{O}(BV)$, where $B$ is the size of the batch (i.e., the number of selected events) and $V$ is the length of each event's history. Because in general $B\ll\sum_m I_m$ and $V\ll\max_m I_m$, the updating of the Hawkes processes scales well. The complexity of updating the optimal transport is $\mathcal{O}(C_s^2 C_t + C_s C_t^2)$ per iteration. Both of these steps can be done in parallel on GPUs.

## 4 Experimental Results

To demonstrate the feasibility and the effectiveness of the proposed alignment method (FGWA), we consider both synthetic and real-world data. In the following experiments, we set $\alpha=0.5$, which balances the influence of the Wasserstein discrepancy and that of the Gromov-Wasserstein discrepancy. We compare our method with the following baselines: 1) aligning event types according to their empirical distributions $\boldsymbol{p}_s$ and $\boldsymbol{p}_t$ directly (Empirical); 2) aligning Hawkes processes purely based on the Wasserstein discrepancy, i.e., $\alpha=0$ (HP-WD); and 3) aligning Hawkes processes purely based on the Gromov-Wasserstein discrepancy, i.e., $\alpha=1$ (HP-GWD). Given the real correspondence $\boldsymbol{T}^{*}$ and the learned optimal transport $\boldsymbol{T}$, we evaluate the various methods with three measurements: i) top-$n$ alignment accuracy, where each row of $\boldsymbol{T}$ is converted to a binary vector whose nonzero elements correspond to the $n$ maximum values of the row, and we count the fraction of rows hitting the real correspondence; ii) cosine similarity $\mathrm{Sim}=\frac{\langle\boldsymbol{T},\,\boldsymbol{T}^{*}\rangle}{\|\boldsymbol{T}\|_F\|\boldsymbol{T}^{*}\|_F}$; iii) entropy $H(\boldsymbol{T})=-\sum_{ij}T_{ij}\log T_{ij}$. When the real correspondence is bijective, this last measurement reflects the uncertainty of the learned correspondence.
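The three measurements above are straightforward to compute; a minimal sketch (function names are our own) is:

```python
import numpy as np

def top_n_accuracy(T, T_true, n=1):
    """Fraction of rows whose n largest entries in T hit the true match."""
    hits = 0
    for i in range(T.shape[0]):
        top = np.argsort(T[i])[-n:]        # indices of the n largest entries
        hits += T_true[i, top].sum() > 0
    return hits / T.shape[0]

def cosine_similarity(T, T_true):
    """Cosine similarity between the two couplings, seen as flat vectors."""
    return np.sum(T * T_true) / (np.linalg.norm(T) * np.linalg.norm(T_true))

def transport_entropy(T, eps=1e-12):
    """Entropy of the coupling; lower values mean a more certain matching."""
    return -np.sum(T * np.log(T + eps))
```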

Table 1: Alignment results of the various methods; each cell lists Acc / Sim / $H$ (entropy is omitted for the real-world datasets).

| Data | Empirical | HP-WD | HP-GWD | FGWA |
|---|---|---|---|---|
| $C$=10 | 0.41 / 0.45 / 3.23 | 0.43 / 0.48 / 2.30 | 0.49 / 0.45 / 3.20 | 0.69 / 0.50 / 2.30 |
| $C$=50 | 0.12 / 0.08 / 7.63 | 0.19 / 0.12 / 4.75 | 0.18 / 0.09 / 7.62 | 0.22 / 0.12 / 4.60 |
| $C$=100 | 0.03 / 0.02 / 12.43 | 0.06 / 0.05 / 9.66 | 0.06 / 0.05 / 12.42 | 0.11 / 0.06 / 9.60 |
| MIMIC-III | 0.196 / 0.251 / – | 0.332 / 0.469 / – | 0.314 / 0.336 / – | 0.464 / 0.471 / – |
| MC3 | 0.081 / 0.061 / – | 0.177 / 0.099 / – | 0.129 / 0.102 / – | 0.253 / 0.106 / – |

Synthetic data The synthetic event sequences are generated as follows: For the source Hawkes process with $C_s$ event types, we generate the base intensity $\boldsymbol{\mu}_s$ and the infectivity matrix $\boldsymbol{A}_s$ randomly. Given a predefined correspondence matrix $\boldsymbol{T}^{*}$, the parameters of the target Hawkes process are $\boldsymbol{\mu}_t=(\boldsymbol{T}^{*})^{\top}\boldsymbol{\mu}_s$ and $\boldsymbol{A}_t=(\boldsymbol{T}^{*})^{\top}\boldsymbol{A}_s\boldsymbol{T}^{*}$. Accordingly, the source and the target event sequences are generated based on Ogata's thinning algorithm ogata1981lewis . We keep $C_s=C_t=C$ and set $C\in\{10, 50, 100\}$. For both the source and the target Hawkes process, we simulate event sequences with the decay function set as an exponential function, i.e., $\kappa(t)=\exp(-wt)$. We run multiple trials and report the average results. Because the real correspondence in each trial is a bijective function, we consider the top-$1$ alignment accuracy in this experiment. Table 1 shows the results of the various methods. The proposed FGWA method outperforms its competitors in most situations, which indicates that the fused Gromov-Wasserstein regularizer is indeed beneficial for our alignment task. The entropy of our optimal transport matrix is often smaller than those of the other methods, which means that the learned correspondence has high certainty. The optimal transport matrices shown in Figure 1(b-e) further demonstrate that our FGWA method achieves the highest certainty.
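Ogata's thinning algorithm ogata1981lewis can be sketched as follows for the exponential-kernel case. This is our own minimal implementation (the function name, seed handling, and decay rate `w` are assumptions); it relies on the fact that, with a nonnegative exponential kernel, the total intensity only decays between events, so its value at the current time is a valid upper bound:

```python
import numpy as np

def simulate_hawkes(mu, A, w, T_max, seed=0):
    """Ogata's thinning for a multivariate Hawkes process with
    kappa(t) = exp(-w t).  Returns event times and event types."""
    rng = np.random.default_rng(seed)
    C = len(mu)
    times, types = [], []
    t = 0.0
    while True:
        # upper bound: total intensity just after the last accepted event
        lam = mu + sum((A[:, c] * np.exp(-w * (t - s))
                        for s, c in zip(times, types)), np.zeros(C))
        lam_bar = lam.sum()
        t += rng.exponential(1.0 / lam_bar)     # candidate next event time
        if t >= T_max:
            break
        lam_t = mu + sum((A[:, c] * np.exp(-w * (t - s))
                          for s, c in zip(times, types)), np.zeros(C))
        if rng.random() < lam_t.sum() / lam_bar:   # thinning (accept/reject)
            types.append(rng.choice(C, p=lam_t / lam_t.sum()))
            times.append(t)
    return np.array(times), np.array(types, dtype=int)
```

For stability, the spectral radius of $\boldsymbol{A}/w$ should be below one; otherwise the simulated sequences can explode.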

Real-world data We further test the proposed method on two real-world datasets: the MIMIC-III dataset johnson2016mimic and the call network used in Mini-Challenge 3 (MC3) of the VAST Challenge 2018 (http://vacommunity.org/VAST+Challenge+2018+MC3). MIMIC-III records 18,756 patient admission sequences. Each admission is an event in a sequence, containing a pair of a diagnosis ICD code and a procedure ICD code. The dataset contains 56 diagnoses and 25 procedures. According to the co-occurrence of the diagnoses and the procedures in the observed admission sequences, we obtain the correspondence between them. The call network records the phone calls among a company's employees in the continuous-time domain, and contains 2,507 callers and 2,481 responders. The pairs of callers and responders appearing in the call network indicate the correspondence between them. For the MIMIC-III dataset, we consider the sequences of diagnoses and those of procedures, and model them via two Hawkes processes. Applying the various alignment methods, we estimate the correspondence between diagnoses and procedures. Similarly, for the MC3 dataset, we model the sequences of callers and those of responders via two Hawkes processes and estimate the correspondence between them. In both datasets, the correspondences are not bijective; therefore, we consider top-$n$ alignment accuracy, with different values of $n$ for the MIMIC-III and the MC3 datasets. Table 1 shows that our FGWA method outperforms the other methods on both datasets.

## 5 Conclusions and Future Work

We have proposed an alignment method for Hawkes processes based on fused Gromov-Wasserstein discrepancy, which achieves encouraging results on matching the event types of different Hawkes processes. The proposed method shows the potential of optimal transport techniques for the learning and alignment of temporal point processes. In the future, we plan to further improve the scalability of the proposed method for large-scale applications.

Acknowledgements This research was supported in part by DARPA, DOE, NIH, ONR and NSF.

## References

- (1) A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
- (2) A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035, 2016.
- (3) D. Luo, H. Xu, Y. Zhen, X. Ning, H. Zha, X. Yang, and W. Zhang. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In IJCAI, 2015.
- (4) H. Mei and J. M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In NIPS, 2017.
- (5) Y. Ogata. On Lewis’ simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
- (6) G. Peyré, M. Cuturi, and J. Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML, 2016.
- (7) M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
- (8) T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018.
- (9) C. Villani. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.
- (10) H. Xu, D. Luo, H. Zha, and L. Carin. Gromov-Wasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003, 2019.
- (11) K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, 2013.