Mean-field inference of Hawkes point processes

# Mean-field inference of Hawkes point processes

Emmanuel Bacry Centre de Mathématiques Appliquées, École Polytechnique and CNRS
UMR 7641, 91128 Palaiseau, France
Stéphane Gaïffas Centre de Mathématiques Appliquées, École Polytechnique and CNRS
UMR 7641, 91128 Palaiseau, France
Iacopo Mastromatteo Jean-François Muzy
###### Abstract

We propose a fast and efficient estimation method that is able to accurately recover the parameters of a -dimensional Hawkes point-process from a set of observations. We exploit a mean-field approximation that is valid when the fluctuations of the stochastic intensity are small. We show that this is notably the case in situations when interactions are sufficiently weak, when the dimension of the system is high or when the fluctuations are self-averaging due to the large number of past events they involve. In such a regime the estimation of a Hawkes process can be mapped on a least-squares problem for which we provide an analytic solution. Though this estimator is biased, we show that its precision can be comparable to the one of the Maximum Likelihood Estimator while its computation speed is shown to be improved considerably. We give a theoretical control on the accuracy of our new approach and illustrate its efficiency using synthetic datasets, in order to assess the statistical estimation error of the parameters.

## 1 Introduction

The use of point processes, in particular Hawkes processes [1, 2] is ubiquitous in many fields of applications. Such applications include, among others, geophysics [3], high frequency finance [4, 5, 6], neuroscience [7], predictive policing [8] and social networks dynamics [9, 10, 11, 12, 13, 14, 15]. A possible explanation for the success of this model is certainly its simplicity yet ability to account for several real-world phenomena, such as self-excitement, coexistence of exogeneous and endogeneous factors or power-law distributions [16].

Various estimation methods of the Hawkes process parameters have been proposed in both parametric and non-parametric situations. The most commonly used is to maximize the likelihood function that can be written explicitely [17] while alternative linear methods such as the contrast function minimization [7], spectral methods [18] or estimation through the resolution of a Wiener-Hopf system [19, 20] have been proposed.

In many of the above mentioned applications, especially in the field of network activities, viral propagation or community detection one has to handle systems of very large dimensions for which these estimation methods can be heavy to implement. Motivated by the goal to devise a fast and simple estimation procedure, we introduce an alternative approach that is inspired by recent results [21] justifying a mean-field approximation of a Hawkes process that supposes small fluctuations of the stochastic intensities with respect to their mean values. This allows us to solve the estimation problem by replacing the original objective (log-likelihood) by a quadratic problem, which is much easier to optimize (see Sec. 3 below). Note that this quadratic problem differs from the usual least-squares objective for counting processes [7] (see Sec. 4 below for a precise comparison).

We show that in a wide number of situations this new framework allows a much faster estimation without inducing losses in precision. Indeed, we show that its bias can be negligible and its accuracy as good as the maximum likelihood provided the level of endogeneity of the process is sufficiently weak or the interactions are suffciently “self-averaging” meaning that they involve a large number of events over past times or over components of the system (i.e., in the large dimensional setting). Besides theoretical arguments, we give numerical illustrations of the fact that this our approach leads to an improvement (that can be of several orders of magnitude) of the classical ones based on state-of-the-art convex solvers for the log-likelihood.

The organization of the paper is the following: in Sec. 2 we formally introduce the Hawkes process and define the main notations of the paper. The mean-field inference method is defined In Sec. 3. We show how this method is naturally obtained using a Baysian approach in a regime where the fluctuations of the stochastic intensity are very small. We conduct a theoretical analysis of the domain of validity of the method by comparing its accuracy to the Maximum Likelihood Estimation. This notably allows us to provide a quantitative measure of what weakly endogenous or “self-averaging” means for the interactions. In Sec. 4 we describe the implementation of the algorithm and compare it with other state-of-the-art algorithms. Sec. 5 shows the numerical results that we obtain with this method on synthetic data. Sec. 6 is devoted to concluding remarks and to the discussion of the possible extensions of our method, particularly as far as penalization issues are concerned. The more technical parts of the discussion are relegated to the appendices.

## 2 The Hawkes process

### 2.1 Definition

Let us consider a network of nodes, in which one observes a set of events encoded as a sequence , where labels the time of the event number and denotes its corresponding node. Then we can define a set of counting functions as , where indicates the Kronecker delta. These counting functions can be associated to a vector of stochastic intensities defined as

 λit=limdt→0P(Nit+dt−Nit=1∣∣Ft)dt (1)

where the filtration encodes the information available up to time . Then the process is called a Hawkes process if the stochastic intensities can be written as

 λit=μi+d∑j=1∫t0dNjt′Φij(t−t′), (2)

where is a component-wise positive, causal (i.e., whose support is in ), locally -integrable matrix kernel representing (after proper normalization) the probability for an event of type occurring at time to trigger an event of type at time , while is a vector of positive exogenous intensities (see Ref. [22] for a more rigorous definition). It is well known that a sufficient condition for the intensity processes to be stationnary (i.e., for the processes to have stationnary increments) is the so-called Stability Condition

 (SC):\leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak ||Φ||1<1 (3)

where stands for the spectral norm of the matrix made of the norms of each kernel .

In the following discussion, we will restrict our attention to stable Hawkes processes (in the sense of (SC)) for which the matrix can be written as

 Φij(t)=p∑q=1αijqgq(t) (4)

where we have introduced a set of known basis kernels satisfying

 ∫∞0dtgq(t)=1

.

### 2.2 Notations

For the sake of conciseness, it will be useful to preliminary introduce some notation.

• We will use the integer indexes in order to identify pairs . Consequently, summing over allows to run both over the components and the basis functions of the kernels. While the set of indices will be employed to label single nodes, the notation will be used to label a pair (node/basis kernel). In that respect we set

• ,

• and

• .

• For convenience we will also define the deterministic process , which will be associated with the kernel equal to a Dirac delta function shifted by an infinitesimal amount .

• According to previous notations, Eq. (2) can be compactly rewritten as

 λit=dp∑a=0θia∫t0dNat′ga(t−t′) (5)

where the parameters refer to the Hawkes parameters, namely111Note that, in order for the parameters to be dimensionally homogeneous, we need to assume a dimensionless , while is taken to be of dimension . Moreover, we implicitly assume to be dimensionless through a suitable choice of a unit (i.e, we take ).

 θi0=μi\leavevmode\nobreak\ \leavevmode\nobreak\ and \leavevmode\nobreak \leavevmode\nobreak θia=αia,\leavevmode\nobreak \leavevmode\nobreak ∀a>0.
• We will adopt the notation for scalars , for vectors and for square matrices. Correspondingly, we define the usual matrix-vector composition by and the matrix-matrix product as .

• If is a vector will refer to the norm of whereas if is a matrix, will refer to its spectral norm. Moreover if is a matrix of functions, will refer to the spectral norm of the matrix .

• Hereafter, we will have to handle collections of scalars, vectors or matrices indexed by , that we will write as (for scalars), (for vectors) and (for matrices).

According to these conventions, The goal of this paper is to present a mean-field framework for estimation of the vectors of parameters given a set of observations . In the following the set of all the parameters will be often referred to as .

### 2.3 The Likelihood function

The probability for a Hawkes process parametrized by to generate a trajectory in a time interval has been first computed in Ref. [17], and is given by

 P(Nt|θ)=e−∑di=1∫T0dtλitn∏m=1λumtm (6)

which implicitly depends on through the stochastic intensities . We define the negative log-likelihood

 L(Nt,θ)=−logP(Nt|θ). (7)

Note that the maximum-likelihood estimator (that we will compare to our mean-field estimator to throughout this paper)

 θMLE=argminθL(Nt,θ), (8)

is the most commonly used estimator employed in the literature [17].

## 3 Inference and mean-field approximation

### 3.1 A Bayesian approach

From a Bayesian standpoint, the probability for an observed sample to be generated by a Hawkes process parametrized by is given by the posterior distribution

 P(θ|Nt)=P(Nt|θ)P0(θ)P(Nt) (9)

where is a prior which we assume to be flat (see Sec. 6 below for other choices). Hence, we will be interested in averaging the inferred couplings over the posterior, so to compute their averages and their covariances

 Eθia = ∫dθP(θ|Nt)θia (10) C(θia,θjb) = ∫dθP(θ|Nt)θiaθib−EθiaEθjb,

where we are writing . This can be technically done by introducing the partition function defined as the Laplace transform of the posterior

 Z(s)=∫dθP(θ|Nt)e−T∑di=1si⊤θi (12)

where is a collection of Laplace-parameter vectors . Its computation allows one to obtain Eqs. (10) and (3.1) by differentiating the free-energy density , so that

 Eθia = ∂siaf(s)|s=0 (13) C(θia,θjb) = −1T∂sia∂sjbf(s)|s=0 (14)

We will be interested in finding an approximated expression for the free-energy density allowing to compute efficiently Eqs. (10) and (3.1).

Remark: Let us point out that, since we are interested in derivatives of the free energy, we can drop additive -independent terms in its computation. This will be done implicitely all along the computation.

Let us first use Eq. (6) so to write explicitly the partition function as

 Z(s)=∫dθe−T∑di=1(si+h)⊤θi+∑nm=1logλumtm (15)

where we have dropped multiplicative -independent terms, and introduced auxiliary vector equal to

 ha=1T∫T0dt∫t0dNat′ga(t−t′). (16)

Note that in the limit of large sample size and in the vicinity of , the integral is dominated by the maximum of the exponential corresponding to the maximum likelihood estimator (8) which is commonly employed in the literature relating to parametric inference of Hawkes processes [17].

Let us also point out that can be rewritten as:

 Z(s)=∫dθe∑di=1(−T(si+h)⊤θi+∫T0dNitlogλit)=d∏i=1∫dθie−T(si+h)⊤θi+∫T0dNitlogλit (17)

and therefore the partition function is a product of independent partition functions associated with each node . This means that the inference problem factorizes in single node problems associated with each vector .

### 3.2 Mean-field approximation

Our strategy in order to approximate Eq. (15) is to suppose the fluctuations of the to be small with respect to their empirical averages

 ¯Λ={¯Λi}1≤i≤d={NiTT}1≤i≤d. (18)

This allows us to approximate the partition function appearing in Eq. (15) by a quadratic expansion of the logarithmic functions around the empirical averages . Let us first suppose that the Hawkes process is stable in the sense of (SC) (see (3)). We have seen that it means that the processes are stationary. The mean-field expansion basically assumes that each process has small fluctuations around its empirical averaged value . More precisely, let us suppose that the following Mean-Field Hypothesis holds

 \bf(MFH) :\leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak \leavevmode\nobreak ri=|λit−¯Λi|¯Λi≪1, (19)

then

 logλit≃log¯Λi+λit−¯Λi¯Λi−(λit−¯Λi)22(¯Λi)2. (20)

In order to get (see (17)), we need to compute for all . Substituing (20) in this expression leads to three terms.

• The first term

 ∫T0dNitlog¯Λi

is independent of both and , thus it can be dropped.222Had we introduced as a variational parameter to estimate self-consistently, this term would have contributed to fixing it to its optimal value. This is non-necessary in this case, since we are implicitly forcing the empirical value of to be the mean-field parameter.

• The second term

 ∫T0dNitlogλit−¯Λi¯Λi

after dropping and -independent terms lead to

 Tdp∑a=0θiakia

where we have introduced a collection of vectors defined as:

 kia=1NiT∫t≠t′dNitdNat′ga(t−t′) (21)
• The third term (after straightforward computations, dropping again and -independent terms) leads to

 −T2dp∑a,b=0θiaθibJiab+Tdp∑a=0θiakia,

where we have introduced a collection of matrices defined as:

 Jiab=T(NiT)2∫t∫t′≠t∫t′′≠tdNitdNat′dNbt′′ga(t−t′)gb(t−t′′). (22)

It then results from (17) that, up to a constant factor,

 Z(s)≃∫dθe−T∑di=1((si+h−2ki)⊤θi+12θi⊤Jiθi). (23)

The integral appearing in Eq. (23) can be evaluated analytically as a simple Gaussian integral333As the region of integration in is the positive orthant , Eq. (24) should in principle include extra terms preventing the emergence of negative couplings. We prefer to extend the region of integration to the whole space , so to include potentially negative couplings. A saddle-point expansion of Eq. (23) reveals indeed that these differences are subleading in ., so that one can write the free-energy density as

 (24)

where denotes the inverse of matrix . Under this approximation, one obtains

 Eθia ≃ dp∑b=0Ciab(2kib−hb)≡θiaMF (25) C(θia,θjb) ≃ δijCiabT. (26)

These are the central equations of our paper: Eq. (25) expresses the definition of our mean-field estimator while Eq. (26) expresses the convergence rate to the expected value of the estimator, so that .

### 3.3 Validity conditions for the mean-field approximation

As discussed previously, the mean-field approximation is likely to be pertinent in the regime described by (MFH) (19), i.e., when the calculated under the inferred parameters have small fluctuations with respect to the empirical averages . However the computation we performed in the previous section (relying on a second order truncation of ) can be hardly used to obtain a control on the accuracy of this approximation or to define its domain of validity. These problems are addressed in Appendix D where we show that the error of the mean-field estimates with respect to the maximum-likehood estimates (8)

 δθi=θiMF−θiMLE (27)

is directly bounded by the ratio of the empirical variance of to its empirical mean. In particular, at leading order in the fluctuations and in the limit of large , one has

 ||δθi||≤¯V(λit)(¯Λi)2||Ci||||¯Λ|| (28)

where refers to the norm in case of vectors and spectral norm in case of matrices, and where the variance corresponds to the empirical one computed under the saddle-point parameters . We see that when the fluctuation ratio is very small, the results of the mean-field estimate can be very close to the maximum-likelihood estimate.

#### Analysis of the fluctuation ratio

The natural question that follows these considerations concerns the characterization of the situations when this fluctuation ratio is very small. In appendix A, we establish these conditions in the particular case of a perfectly homogeneous Hawkes process. We show that in the limit of small (Weak endogeneity case (i) below), or in the limit of large dimension or slow interactions (Self-averaging interactions case (ii) below) the fluctuation ratio is small. More precisely (see Eq. (58)), we find that it is directly controlled by the spectral norm , the system dimension and the interaction characteristic time-scale (defined in Appendix A):

 V(λit)(Λi)2∼1Λiτg(||Φ||21d(1−||Φ||1)2−1/η). (29)

Let us recall that in the exponential kernel case, one has (see Appendix A). Intuitively, this characterization indicates that the mean-field approximation is valid whenever one of these conditions is met:

• Weak endogeneity ( ): If the spectral norm (that controls the level of endogeneity of the Hawkes process) is much smaller than one, then the intensity vector is dominated by the exogenous component, and one can expect a very small fluctuation ratio.

• Self-averaging interactions (): Even if the system is not weakly endogenous (i.e. is not necessarily small), if the component-wise intensities are determined by a very large number of events with comparable contributions, a law of large numbers leads to small fluctuation ratios. This can notably occur in two different situations. First in large dimensional “quasi-homogenous” systems where a large number of events associated with each of the components contribute to the intensity. This regime corresponds to large in Eq. (29) and has been studied notably in Ref. [21]. The second situation is the case of “slow interactions” when a large number of past events equally impact the intensity. If the characteristic time scale of the kernel is large with respect to the typical inter-event distance, then one expects the past contribution to the instantaneous intensity to have small fluctuations. This regime corresponds to a large in Eq. (29).

The perfect homogeneous case is a very good toy model, it has the main ingredients that make the mean-field hypothesis valid. The result we so-obtained in the Appendix A reproduce much more complex situations. Indeed, in the following we considered a Hawkes process with a 2-block interaction matrix and exponential kernels fully specified in Sec. 5. Figs 1 and 2 show that the formula for the fluctuation ratio (see Eq. (29) or equivalently (58) for general kernel case and Eq. (63) for exponential kernels) lead to a perfect prediction of the simulated curve.

The right plot of Fig. 1 shows that the fluctuation ratio of the first block () behaves in very good agreement with Eq. (63). It scales like when the dimension varies. Fig. 2 shows the same ratio as a function of both the spectral norm and the dimension . Let us point out that, when varying the norm and/or the dimension , the average expectation is held fixed. One can notably see that, both (i) for a fixed value of the dimension, there exists a sufficiently small interaction such that the fluctuation ratio can be arbitrarily small and (ii) for a fixed value of the norm (even large), there exists a sufficiently large system dimension such that the fluctuation ratio can be arbitrarily small. The displayed contour plots are the ones predicted by our analytical considerations (63) proved in Appendix A (see Eqs (58) and (63)). The theoretical prediction appears to be in good agreement with simulated data.

Finally, let us remark that while above considerations are useful for assessing the validity of the mean-field approximation under some specific assumption for the model, it is possible to bound a priori the approximation error by exploiting the results of App. E, that allow one to estimate the mean-field error by supplying the empirical cumulants of the data, without requiring any assumption about the underlying model.

#### Estimation error and validity condition

Since it basically amounts in solving a linear system, it is clear that the mean-field estimator will perform much faster than any MLE based estimator (see next section for deatiled analysis on the complexity). The idea is to get, while being much faster, an estimation as precise as the MLE. This is true provided is smaller than the statistical error associated with the maximum likelihood estimator. One can obtain a self-consistent estimation of the regime in which the error is dominated by its statistical component by using Eq. (26)):

 ||Δθi||≃(tr(Ci)T)1/2,

that in principle is valid under the mean-field approximation. Figure 5 below shows that this assumption is correct under a large choice of parameter values. Then a simple condition for the mean-field method to perform as well as the MLE is

 ||δθi||<(tr(Ci)T)1/2.

Eq. (28) provides a sufficient condition for this to hold (after identifying empirical averages and statistical averages), namely

 V(λit)(Λi)2||Ci||||Λ||≲(tr(Ci)T)1/2.

Eq. (29) has been established in the perfectly homogeneous case for which . Moreover, we use very conservatively the inequality , which is likely to be largely underestimating the statistical error, so that in practice the performance of the MF algorithm is much better than what appears from Eq. (30) below (see Fig. 4). Indeed, plugging (29) and assuming the scaling in the last equation leads to the sufficient condition

 1√Λiτg(||Φ||21d1/2(1−||Φ||1)2−1/η)≲(1T)1/2.

Therefore

 T≲T⋆=Λiτ2g(d(1−||Φ||1)4−2/η||Φ||41). (30)

Although far from being optimal, this rough bound provides a qualitative idea of how the different values of , and can contribute in determining the performance of the mean-field approximation. We will see in Sec. 5, that in practive can be very large for moderate system dimension and values of the spectral norm. In other words, the situations where the performance of the mean-field method is comparable to those of the maximum likehood are very easy to meet, in particular for large system sizes. In Sec. 5 we will illustrate this point by showing that in a broad range of regimes of we will be able to exploit the condition , thus outperforming the maximum-likelihood algorithm through the superior numerical performance of the mean-field approximation.

## 4 Implementation and complexity

In this section we want to illustrate the numerical implementation of the mean-field algorithm, by comparing its complexity with the one of other state-of-the art algorithm. We find that the mean-field algorithm has a cost corresponding to iterations of a gradient ascent in the likelihood, both for the quasi-Newton maximisation of the likelihood (Sec. 4.2) and the expectation-maximisation algorithm (Sec. 4.3). This is typically much faster than the times required by these methods to achive convergence (Fig. 6). The complexity of the contrast function algorithm (Sec. 4.4) is in principle of the same order of the mean-field approximation, but is in practice much slower (see Fig. 6).

### 4.1 Mean-field (MF) algorithm

The simplest procedure to use in order to obtain the estimator Eq. (25) and its associated covariance matrix is summarized in Algorithm 1.

Note that the complexity of the preprocessing phase in which one computes the auxiliary functions and is dominated by the computation of the , and scales as . The inversion of the matrices can be performed in a time bounded by . In order to speed up the computation, one can use the following tricks:

• The pre-processing time can be reduced by fixing a threshold so to cutoff the integrals over the basis kernels when is sufficiently large. This reduces the pre-processing time to , where is an -dependent bound on the characteristic time required for the kernels to decay below the threshold (see Appendix B).

• If one is not interested in the covariances , the inversion of the matrix can be avoided, by replacing it with the solution of the linear systems . This substantially reduces the time required to find in step 2. and 3. of Algorithm 1, as the matrix inversion can be traded with a faster linear solver (e.g., Cholesky decomposition, or iterative algorithms such as BFGS [23]).

• The case of weak endogeneity described above allows one to perform an approximate inversion of the matrices , as shown in Appendix C, reducing the time required for the solution of the linear systems from to (corresponding to matrix-vector compositions). Note indeed that this is expected to yield inaccurate results when the fluctuations of the are of a larger order than , even though the elements of the matrix are individually small (see again Appendix C).

The time of computation is typically dominated by the preprocessing phase up to very large sizes. As an illustrative example, a parallel implementation in python + c on a four-core machine with a 3.40 Hz CPU the calculation of the auxiliary functions takes  s for , and events, while the matrix inversions can be performed in a time  s. Finally, we remark that the algorithm can be straightforwardly parallelized across components, meaning that a factor can be gained in both the preprocessing and the matrix inversion phase.

### 4.2 Maximum-Likelihood Estimation (MLE)

This class of mehods, following Ref. [17], builds upon the direct maximization of the likelihood function . This problem is concave, so efficient solvers like BFGS are able to quickly find a solution. Indeed, the main drawback for this approach is the computational cost of each of the iterations of the algorithm, as the complexity of each evaluation of the gradient function scales with the number of points in the sample . More precisely, this strategy requires to minimize the function

 −logL(θ|Nt)=H(θ)=d∑i=1h⊤θ−n∑m=1logdp∑a=0θumaGma. (31)

where

 Gma=∫T0dtga(tm−t) (32)

Even though one can reduce the cost of each iteration by precomputing the coefficients on the right-hand-side of Eq. (31) (complexity is )), one is still left with a sum over all the terms on the right hand side. This can have a huge impact on the computational limits of this strategy (see Fig. 6), which quickly become prohibitive in . In fact, the cost per evaluation of both the likelihood function and its gradient scales as . Then, if number of gradient evaluations is much larger than , then mean-field methods are faster. This is typically the case, as seen in Fig. 6.

### 4.3 Expectation-Maximization (EM)

In the case of the model we focus on, EM is actually as complex as MLE, as it requires calculating a gradient of the likelihood function at each iteration [24]. According to the type of MLE algorithm adopted, the relative speed of EM with respect to MLE can change. Taking BFGS as an example, the EM algorithm suffers actually from a slower convergence, as it is unable to exploit second-order information, as done instead by Newton and quasi-Newton methods do.

### 4.4 Contrast Function (CF)

The closest method to the one that we have presented is a generalization of the constrast function approach proposed in [7]. In this framework, the should minimize a loss function defined as

 C(θ)=12∑i∫T0dt(dNitdt−λit)2, (33)

resulting in

 C(θ)=T(d∑i=1(−¯Λiki⊤θi+12θi⊤J′θi)), (34)

where the collection is defined as in Eq. (21), while the matrix is given by

 J′ab=1T∫T0dtdNat′dNbt′′ga(t−t′)gb(t−t′′). (35)

This approach also maps the inference problem on a linear system, whose solution is given by

 θi=¯ΛiC′ki (36)

where . This solution is numerically close to Eq. (25). The computation of the has a complexity of , but due to the impossibility of exploiting the trick illustrated in App. B for the computation of it needs to be implemented in a different manner (unless one doesn’t introduce some artificial binning in the data). In practice (see Fig. 6), we find MF to be significantly faster than CF. Nevertheless, CF seems to be a method particularly suitable in order to investigate the mean-field regime of the Hawkes process, although its precise relation with MF still needs to be rigorously established.

## 5 Numerical examples

We report in this section the results obtained by calibrating a Hawkes process from synthetic data generated by a known model, and by comparing the performance of our algorithm with the one of known methods.

#### Single exponential basis kernel

First, we have considered as a benchmark the case in which , so that . For , we have assumed the basis kernel to have the exponential structure

 g(t)=βe−βt\mathds1t>0. (37)

The topology of the matrix has been chosen from the following ensemble:

• A block structure in which each nodes belong to one out of several clusters. In particular we studied the case in which is zero (if and do not belong to the same cluster) or (if they do belong to the same cluster, with size given by );

Regardless of the structure, in both cases one has , so that we used the parameter in order to interpolate between the non-interacting case and the critical one . We have chosen for the purpose of these numerical experiments the values .

As an illustrative example, we first plot in Fig. 3 the results of our algorithm in the case of a three block structure for the matrix. The results are for this choice of value essentially the same as the ones obtained by maximum likelihood estimation, with a considerable reduction in the computational time.

In order to systematically assess the performance of the algorithm under various circumstances, we have varied uniformly in the interval , and addressed the problem of studying the scaling of the results in (we used ) and in (we considered ). Unless specifed otherwise throughout the text, we have assumed block structures with two clusters, so that . We measured the quality of the reconstruction by using

• The negative log-likelihood of the sample under the inferred couplings

• The relative error on the non-zero couplings

 δα2rel=∑i,j|αijtrue≠0⎛⎝αijinfαijtrue−1⎞⎠2. (38)
• The absolute error on all the couplings

 δα2abs=∑i,j(αijinf−αijtrue)2. (39)

The results that we have found for the relative error in the two-block case are summarized in Fig. 4, where we have shown them for and . What one finds is that, as expected, the error of the maximum likelihood estimator decreases as by increasing the sample size . Indeed, the same plot also shows that the MF estimator is able to match the results of the maximum likelihood estimator up to a maximum value , beyond which the statistical error no longer dominates for the mean-field estimator, and the performance of MF deteriorates. At that point, the approximation error becomes comparable with the statistical one, inducing a plateau in the curve of the error as a function of . What is interesting is that – even for moderate values of  – the value of is quite large, indicating that the MF approximation is extremely effective in a broad range of regimes. By increasing the value of , even though the relative error decreases due to the increase of the signal , the quality of the MF approximation decreases, and the value of becomes smaller. Indeed, for a fixed value of and , it is always possible to find a sufficiently large so that , and the MF approximation is effective.

This is confirmed in Fig. 5, where we have represented the absolute error obtained by MLE against the one estimated by using Eq. 26.

On the other hand, the computational time required in order to run the MF algorithm is considerably smaller, as summarized in Fig. 6. In such figure we have represented the computational time required in order to run the different algorithm up to a target negative log-likelihood , on a machine with the same specifications as above, and shown that by taking into account both the pre-processing phase and the solution of the linear system via matrix inversion, the MF algorithm achieves within a single iteration a value of the target function which is very close to the the asymptotic one got from a BFGS minimization.

#### Multiple exponential basis kernels

We have also considered the more general case in which the is given by a sum of kernels of different nature. Even though the implementation of an algorithm with multiple basis kernels is technically more challenging, there are no qualitative differences with respect to what we have stated above. That is why we have chosen to present an illustrative example showing how the MF algorithm performs in the multiple basis case. In particular, we have considered an example in which , in which the matrix has a uniform two-block structure as in the previous case, while for the second matrix only the complementary blocks are uniformly filled (see Fig. 7). For both and , the non-zero entries have value , being the size of the block. Finally, we chose and , so that for one recovers a completely homogeneous interaction structure. Fig. 7 shows an illustrative plot of the results of the reconstruction in the multiple basis kernel scenario.

#### The role of β

We have also studied the effect of the misspecification of by studying the effect of an input kernel different from the used to generate the data. We did it in the case of , with a block matrix as described in the paragraph above relating to the single exponential basis kernel scenario. Also in this case we have chosen , with denoting the true and unique decay constant for the exponential kernels, whereas will denote the decay constant chosen to perform the inference. Our results are summarized in Fig. 8, where we illustrate them in a specific case. We find that the likelihood of the model under the inferred couplings is not strongly affected by the misspecification of , as the inferred intensity turns out to be very weakly dependent on . What actually happens is that by changing is is possible to tune a bias changing in favor of .

This behavior is easy to understand in the framework of our MF approximation. In fact, App. C shows that the value of the inferred couplings is proportional to the fluctuations of the auxiliary function . According to Eq. (98), these fluctuations are obtained by integrating the empirical correlation function against against the kernel in order to obtain . Hence, when is large is peaked close to the origin, where the correlations are larger, resulting in a greater value for . Conversely, for large such integral assumes a small value, because the correlations are small at the timescales in which has large mass. Hence, small value of leads to a bias in favor of , while large values favor . Independently of , the combination of and leading to the inferred value of is almost unchanged.

Even though these conclusions depend on the choice of an exponential kernel for , in more general cases it is possible to predict the direction of the bias by checking the timescales in which the mass of the correlation function lies agains the timescales relative to . A maximum overlap favors , while a small one will increase .

## 6 Conclusion and prospects

We have presented an approximated method for the calibration of a Hawkes process from empirical data. This method allows one to obtain in close form the inferred parameters for such a process and its implementation is considerably faster than the algorithms customarily employed in this field (MLE+BFGS, EM, CF). In particular, we map the problem of the estimation of a Hawkes process on a least-square regression, for which extremely efficients techniques are available [25, 23]. Although this method is asymptotically biased, we have shown numerically and analytically that the bias is negligible in situations when the interactions of the system are sufficiently weak or self-averaging, which is thought to be the case in many applications to high-dimensional inference.

We believe that our method is not limited to the framework that we have presented, but can be extended to more general settings like marked or non-linear Hawkes processes. Another prospect of particular interest is the case in which a non trivial prior is added to the Bayesian formula Eq. (9), which allows one to embed regularization in our framework. Popular priors that are relevant in the field of optimization include for example the Gaussian prior

 PL20(θ)∝exp(∑iaκia(θia)2) (40)

and the Laplace prior

 PL10(θ)∝exp(∑iaκia|θia|). (41)

The inference problem analogous to the evaluation of the partition function (12) associated with these priors is obviously more complex than the one with a flat prior that we have presented. Nevertheless, our mean-field approach allows even in those cases a mapping on simpler problems, for which efficient solution methods exists [23]. In particular,

• The case of a Gaussian prior can be mapped to the one of the minimization of the quadratic form

 HL2(θ)=d∑i=1(12θi⊤Jiθi−(2ki−h)⊤θi)+∑iaκia(θia)2 (42)
• The case of a Laplace prior can be mapped to the LASSO problem [26]

 HL1(θ)=d∑i=1(12θi⊤Jiθi−(2ki−h)⊤θi)+∑iaκia|θia| (43)

Other types of regularizers might be considered by applying this same scheme.

We believe the approach we propose in this paper to be of particular interest for the problem of high-dimensional inference as, by construction, it is particularly effective in the regime of large (especially relevant for big-data analysis) where traditional numerical methods can be computationally expensive. Last but not least, the mean-field method can provide an easy and efficient way to set a good starting point for standard iterative algorithms that maximize the likelihood objective.

## Acknowledgements

This research benefited from the support of the “Chair Markets in Transition”, under the aegis of “Louis Bachelier Finance and Sustainable Growth” laboratory, a joint initiative of École Polytechnique, Université d’Évry Val d’Essonne and Fédération Bancaire Française.

In order to get