# Neural Jump Stochastic Differential Equations

###### Abstract

Many time series can be effectively modeled with a combination of continuous flows along with random jumps sparked by discrete events. However, we usually do not have the equation of motion describing the flows, or how they are affected by jumps. To this end, we introduce Neural Jump Stochastic Differential Equations that provide a data-driven approach to learn continuous and discrete dynamic behavior, i.e., hybrid systems that both flow and jump. Our approach extends the framework of Neural Ordinary Differential Equations with a stochastic process term that models discrete events. We then model temporal point processes with a piecewise-continuous latent trajectory, where stochastic events cause an abrupt change in the latent variables. We demonstrate the predictive capabilities of our model on a range of synthetic and real-world marked point process datasets, including classical point processes such as Hawkes processes, medical records, awards on Stack Overflow, and earthquake monitoring.


Junteng Jia (Cornell University, jj585@cornell.edu) and Austin R. Benson (Cornell University, arb@cs.cornell.edu)

Preprint. Under review.

## 1 Introduction

In a wide variety of real-world problems, the system of interest evolves continuously over time, but may also be interrupted by stochastic events Glover and Lygeros (2004); Hespanha (2004); Li et al. (2017). For instance, the political view of a group of Twitter users may gradually change with time and determine how likely a certain topic is to be tweeted, while a tweet could in turn confirm the beliefs or spark debate in the user group. How can we simultaneously model these continuous and discrete dynamics?

One approach is with hybrid systems, which are dynamical systems characterized by piecewise continuous trajectories with a finite number of discontinuities introduced by discrete events Branicky (2005). Hybrid systems have long been used to describe physical scenarios Van Der Schaft and Schumacher (2000), where the equation of motion is often given by an ordinary differential equation. A simple example is table tennis—the ball follows physical laws of motion and changes trajectory abruptly when bouncing off paddles. However, for problems arising in social and information sciences, we usually know little about the time evolution mechanism. And in general, we also have little insight about how the stochastic events are generated.

Here, we present Neural Jump Stochastic Differential Equations (JSDEs) for learning the continuous and discrete dynamics of a hybrid system in a data-driven manner. In particular, we use a latent vector $\mathbf{z}(t)$ to encode the state of a system. The latent vector $\mathbf{z}(t)$ flows continuously over time until an event happens at random, which introduces an abrupt jump and changes its trajectory. The continuous flow is described by Neural Ordinary Differential Equations (Neural ODEs), while the event probability density and the size of the jump are parameterized with neural networks as functions of $\mathbf{z}(t)$.

The Neural ODEs framework models continuous transformation of a latent vector as an ODE flow and parameterizes the flow dynamics with a neural network Chen et al. (2018). The approach is a continuous analogue of residual networks, with infinite depth and infinitesimal step sizes, which brings many desirable properties. Remarkably, the derivative of the loss function can be computed via the adjoint method, which integrates the adjoint equation backwards in time with constant memory regardless of the network depth. However, the downside of these continuous models is that they cannot incorporate discrete events (or inputs) that abruptly change the latent vector. To address this limitation, we extend the Neural ODEs framework with discontinuities for modeling hybrid systems. In particular, we show how the discontinuities caused by discrete events should be handled in the adjoint method. More specifically, at the time of a discontinuity, not only does the latent vector describing the state of the system change abruptly; as a consequence, the adjoint vector representing the loss function derivatives also jumps. Furthermore, our Neural JSDE model can serve as a stochastic process for simulating event generation. The latent vector $\mathbf{z}(t)$ determines the probability density of an event arrival, which in turn leads to a discontinuity in $\mathbf{z}(t)$ at the time of the event.

A major advantage of Neural JSDEs is that they can be used to model a variety of marked point processes, where events can be accompanied by either a discrete value (say, a class label) or a vector of real-valued features (e.g., spatial locations); thus, our framework is broadly applicable for time series analysis. We test our Neural JSDE model in a variety of scenarios. First, we find that our model can learn the intensity function of a number of classical point processes, including self-correcting processes and Hawkes processes (which are already used broadly in modeling, e.g., social systems Blundell et al. (2012); Li and Zha (2013); Stomakhin et al. (2011)). Next, we show that Neural JSDEs can achieve state-of-the-art performance in predicting discrete-typed event labels, using datasets of awards on Stack Overflow and medical records. Finally, we demonstrate the capabilities of Neural JSDEs for modeling point processes where events have real-valued feature vectors, using both synthetic data and earthquake data, where the events have spatial locations as features.

## 2 Background, Motivation, and Challenges

In this section, we review classical temporal point process models and the Neural ODE framework of Chen et al. (2018). Compared to a discrete time step model like an RNN, the continuous-time formulation of Neural ODEs makes them more suitable for describing events with real-valued timestamps. However, Neural ODEs enforce continuous dynamics and therefore cannot model sudden event effects.

### 2.1 Classical Temporal Point Process Models

A temporal point process is a stochastic generative model whose output is a sequence of discrete events $\mathcal{H} = \{\tau_j\}$. An event sequence can be represented as a counting function $N(t)$ recording the number of events before time $t$, which can be formally stated as follows:

$$N(t) = \sum_{\tau_j \in \mathcal{H}} H(t - \tau_j), \qquad H(t) = \begin{cases} 0 & t \le 0 \\ 1 & \text{otherwise,} \end{cases} \tag{1}$$

where $H$ is the Heaviside step function. Oftentimes, we are interested in a temporal point process whose historical events influence future behavior Daley and Vere-Jones (2003). Such processes are best described by a conditional intensity function $\lambda(t)$. Let $\mathcal{H}_t$ denote the subset of events up to but not including $t$. Then $\lambda(t)$ defines the probability density of observing an event conditioned on the event history:

$$\mathbb{P}\{\text{event in } [t, t + dt) \mid \mathcal{H}_t\} = \lambda(t)\, dt. \tag{2}$$

Using this form, we now describe some of the most well-studied point process models, which we will use in our experiments.

Poisson processes. The conditional intensity $\lambda(t)$ is a function independent of the event history $\mathcal{H}_t$. The simplest case is a homogeneous Poisson process where the intensity function is a constant $\lambda_0$:

$$\lambda(t) = \lambda_0. \tag{3}$$

Hawkes processes. These processes assume that events are self-exciting. In other words, an event leads to an increase in the conditional intensity function, whose effect decays over time:

$$\lambda(t) = \lambda_0 + \alpha \sum_{\tau_j < t} \kappa(t - \tau_j), \tag{4}$$

where $\lambda_0$ is the baseline intensity, $\alpha > 0$, and $\kappa$ is a kernel function. We consider two widely used kernels: (1) the exponential kernel $\kappa_1$, which is often used for its computational efficiency Laub et al. (2015); and (2) the power-law kernel $\kappa_2$, which is used for modeling in seismology Ogata (1999) and social media Rizoiu et al. (2016):

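$$\kappa_1(t) = e^{-\beta t}, \qquad \kappa_2(t) = \begin{cases} 0 & t < \sigma \\ \dfrac{\beta}{\sigma} \left( \dfrac{t}{\sigma} \right)^{-\beta - 1} & \text{otherwise.} \end{cases} \tag{5}$$

The variant of the power-law kernel we use here has a delaying effect.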

Self-correcting processes. A self-correcting process assumes the conditional intensity grows exponentially with time and an event suppresses future events. This model has been used for modeling earthquakes after aftershocks have been removed Ogata and Vere-Jones (1984):

$$\lambda(t) = e^{\mu t - \beta N(t)}. \tag{6}$$
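As a concrete illustration of the classical models above, the following minimal sketch evaluates the counting function and two conditional intensities in plain Python. It assumes the standard forms $\lambda(t) = \lambda_0 + \alpha \sum_{\tau_j < t} e^{-\beta (t - \tau_j)}$ for the exponential-kernel Hawkes process and $\lambda(t) = e^{\mu t - \beta N(t)}$ for the self-correcting process; the function names and the parameter values in the usage line are illustrative and not taken from the paper.

```python
import math

def counting(t, events):
    """N(t): number of events strictly before time t (left-continuous)."""
    return sum(1 for tau in events if tau < t)

def hawkes_exp_intensity(t, events, lam0, alpha, beta):
    """Hawkes intensity with exponential kernel exp(-beta * (t - tau))."""
    excitation = sum(math.exp(-beta * (t - tau)) for tau in events if tau < t)
    return lam0 + alpha * excitation

def self_correcting_intensity(t, events, mu, beta):
    """Self-correcting intensity exp(mu * t - beta * N(t))."""
    return math.exp(mu * t - beta * counting(t, events))

# Illustrative event times and parameters (assumptions, not the paper's values).
events = [1.0, 2.0]
lam_before = hawkes_exp_intensity(0.5, events, lam0=0.2, alpha=0.8, beta=1.0)
lam_after = hawkes_exp_intensity(1.5, events, lam0=0.2, alpha=0.8, beta=1.0)
```

Here `lam_before` equals the baseline because no event precedes $t = 0.5$, while `lam_after` is raised by the event at $\tau = 1$.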

Marked Temporal Point Processes. Oftentimes, we care not only about when an event happens, but also what the event is; having such labels makes the point process marked. In these cases, we use a vector embedding $\mathbf{k}$ to denote event type, and $\mathcal{H} = \{(\tau_j, \mathbf{k}_j)\}$ for an event sequence, where each tuple denotes an event with embedding $\mathbf{k}_j$ happening at timestamp $\tau_j$. This setup is applicable to events with discrete types as well as events with real-valued features. For discrete-typed events, we use a one-hot encoding $\mathbf{k}_j \in \{0, 1\}^m$, where $m$ is the number of discrete event types. Otherwise, the $\mathbf{k}_j$ are real-valued feature vectors.

### 2.2 Neural ODEs

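A Neural ODE defines a continuous-time transformation of variables Chen et al. (2018). Starting from an initial state $\mathbf{z}(t_0)$, the transformed state at any time $t_i$ is given by integrating an ODE forward in time:

$$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t; \theta), \qquad \mathbf{z}(t_i) = \mathbf{z}(t_0) + \int_{t_0}^{t_i} \frac{d\mathbf{z}(t)}{dt}\, dt. \tag{7}$$

Here, $f$ is a neural network parameterized by $\theta$ that defines the ODE dynamics.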

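Assuming the loss function $\mathcal{L}$ depends directly on the latent variable values at a sequence of checkpoints $\{t_i\}_{i=0}^{N}$ (i.e., $\mathcal{L} = \mathcal{L}(\{\mathbf{z}(t_i)\}; \theta)$), Chen et al. proposed to use the adjoint method to compute the derivatives of the loss function with respect to the initial state $\mathbf{z}(t_0)$, model parameters $\theta$, and the initial time $t_0$ as follows. First, we define the initial condition of the adjoint variables:

$$\mathbf{a}(t_N) = \frac{\partial \mathcal{L}}{\partial \mathbf{z}(t_N)}, \qquad \mathbf{a}_\theta(t_N) = 0, \qquad a_t(t_N) = \frac{\partial \mathcal{L}}{\partial t_N} = \mathbf{a}(t_N)\, f(\mathbf{z}(t_N), t_N; \theta). \tag{8}$$

Then, the loss function derivatives $d\mathcal{L}/d\mathbf{z}(t_0) = \mathbf{a}(t_0)$, $d\mathcal{L}/d\theta = \mathbf{a}_\theta(t_0)$, and $d\mathcal{L}/dt_0 = a_t(t_0)$ can be computed by integrating the following ordinary differential equations backward in time:

$$\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t) \frac{\partial f(\mathbf{z}(t), t; \theta)}{\partial \mathbf{z}(t)}, \qquad \frac{d\mathbf{a}_\theta(t)}{dt} = -\mathbf{a}(t) \frac{\partial f(\mathbf{z}(t), t; \theta)}{\partial \theta}, \qquad \frac{da_t(t)}{dt} = -\mathbf{a}(t) \frac{\partial f(\mathbf{z}(t), t; \theta)}{\partial t}. \tag{9}$$

Although solving Eq. 9 requires the value of $\mathbf{z}(t)$ along its entire trajectory Chen et al. (2018), $\mathbf{z}(t)$ can be recomputed backwards in time together with the adjoint variables, starting with its final value $\mathbf{z}(t_N)$, and therefore induces no memory overhead.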
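To make the adjoint computation concrete, here is a minimal sketch for a scalar toy problem. It assumes the linear dynamics $dz/dt = \theta z$ and the loss $\mathcal{L} = z(T)$, both chosen only because the gradients have closed forms ($d\mathcal{L}/dz_0 = e^{\theta T}$ and $d\mathcal{L}/d\theta = T z_0 e^{\theta T}$). The adjoint equations used are $da/dt = -a\, \partial f/\partial z$ and $da_\theta/dt = -a\, \partial f/\partial \theta$, with $a(T) = 1$ and $a_\theta(T) = 0$. The helper names and the fixed-step Runge-Kutta integrator are assumptions for illustration, not the authors' implementation.

```python
import math

def rk4_step(deriv, state, t, h):
    """One classical Runge-Kutta step for a tuple-valued state (h may be negative)."""
    def shift(s, k, scale):
        return tuple(si + scale * ki for si, ki in zip(s, k))
    k1 = deriv(state, t)
    k2 = deriv(shift(state, k1, h / 2), t + h / 2)
    k3 = deriv(shift(state, k2, h / 2), t + h / 2)
    k4 = deriv(shift(state, k3, h), t + h)
    return tuple(s + h / 6 * (p + 2 * q + 2 * r + u)
                 for s, p, q, r, u in zip(state, k1, k2, k3, k4))

def adjoint_gradients(z0, theta, T, steps=1000):
    """Return z(T), dL/dz0 and dL/dtheta for L = z(T) with dz/dt = theta * z."""
    h = T / steps
    # Forward pass: integrate z from 0 to T, keeping only the final value.
    state, t = (z0,), 0.0
    for _ in range(steps):
        state = rk4_step(lambda s, _t: (theta * s[0],), state, t, h)
        t += h
    z_final = state[0]

    # Backward pass: integrate (z, a, a_theta) jointly from T back to 0,
    # so z is recomputed instead of being stored along the trajectory.
    def backward(s, _t):
        z, a, _a_theta = s
        return (theta * z, -a * theta, -a * z)

    state = (z_final, 1.0, 0.0)  # a(T) = dL/dz(T) = 1, a_theta(T) = 0
    for _ in range(steps):
        state = rk4_step(backward, state, t, -h)
        t -= h
    _, grad_z0, grad_theta = state
    return z_final, grad_z0, grad_theta

z_T, grad_z0, grad_theta = adjoint_gradients(z0=2.0, theta=0.5, T=1.0)
```

The backward loop only carries three numbers at a time, which mirrors the constant-memory property described above.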

### 2.3 When can Neural ODEs Model Temporal Point Processes?

The continuous Neural ODE formulation makes it a good candidate for modeling events with real-valued timestamps. In fact, Chen et al. applied their model for learning the intensity of Poisson processes, which notably do not depend on event history. However, in many real-world applications, the event (e.g., financial transactions or tweets) often provides feedback to the system and influences the future dynamics Hardiman et al. (2013); Kobayashi and Lambiotte (2016).

There are two possible ways to encode the event history and model event effects. The first approach is to parametrize $f$ with an explicit dependence on time: events that happen before time $t$ change the function $f$ and consequently influence the trajectory $\mathbf{z}(t)$ after time $t$. Unfortunately, even the mild assumption requiring $f$ to be finite would imply the event effects “kick in” continuously, and therefore cannot model events that create immediate shocks to a system (e.g., effects of Federal Reserve interest rate changes on the stock market). For this reason, areas such as financial mathematics have long advocated for discontinuous time series models Cox and Ross (1976); Merton (1976). The second alternative is to encode the event effects as abrupt jumps of the latent vector $\mathbf{z}(t)$. However, the original Neural ODE framework needs a Lipschitz continuous trajectory, and therefore cannot model temporal point processes that depend on event history (such as a Hawkes process).

In the next section, we show how to incorporate jumps into the Neural ODE framework for modeling event effects, while maintaining the simplicity of the adjoint method for training.

## 3 Neural Jump Stochastic Differential Equations

In our setup, we are given a sequence of events $\mathcal{H} = \{(\tau_j, \mathbf{k}_j)\}$, and we are interested in both simulating and predicting the likelihood of future event sequences.

### 3.1 Latent Dynamics and Stochastic Events

At a high level, our model represents the latent state of the system with a vector $\mathbf{z}(t) \in \mathbb{R}^n$. The latent state continuously evolves with a deterministic trajectory until interrupted by a stochastic event. Within any time interval $[t, t + dt)$, an event happens with the following probability:

$$\mathbb{P}\{\text{event happens in } [t, t + dt) \mid \mathcal{H}_t\} = \lambda(t)\, dt, \tag{10}$$

where $\lambda(t) = \lambda(\mathbf{z}(t))$ is the total conditional intensity for events of all types. The embedding of an event happening at time $t$ is sampled from $\mathbf{k}(t) \sim p(\mathbf{k} \mid \mathbf{z}(t))$. Here, both $\lambda(\mathbf{z}(t))$ and $p(\mathbf{k} \mid \mathbf{z}(t))$ are parameterized with neural networks and learned from data. In cases where events have discrete types, $p(\mathbf{k} \mid \mathbf{z}(t))$ is supported on the finite set of one-hot encodings and the neural network directly outputs the intensity for every event type. On the other hand, for events with real-valued features, we parameterize $p(\mathbf{k} \mid \mathbf{z}(t))$ with a Gaussian mixture model, whose parameters $\eta$ depend on $\mathbf{z}(t)$. The mapping from $\mathbf{z}(t)$ to $\eta$ is learned with another neural network.
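For discrete event types, the relation between per-type intensities, the total intensity, and the type distribution can be sketched as follows. The sketch assumes the usual marked point process convention that the total intensity is the sum of the per-type intensities and that the probability of a type is its share of the total; the function names are hypothetical, and the input list stands in for the output of the intensity network.

```python
def total_intensity_and_type_probs(type_intensities):
    """Split per-type intensities into a total intensity and a type distribution."""
    total = sum(type_intensities)
    probs = [lam / total for lam in type_intensities]
    return total, probs

def most_likely_type(type_intensities):
    """Index of the event type with the largest intensity (and hence probability)."""
    return max(range(len(type_intensities)), key=lambda k: type_intensities[k])

# Stand-in for network outputs at some latent state (illustrative values).
total, probs = total_intensity_and_type_probs([0.5, 1.0, 0.5])
```

With these values the total intensity is 2.0 and the second type is the most likely one.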

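Next, let $N(t)$ be the number of events up to time $t$. The latent state dynamics of our Neural JSDE model is described by the following equation:

$$d\mathbf{z}(t) = f(\mathbf{z}(t), t; \theta)\, dt + w(\mathbf{z}(t), \mathbf{k}(t), t; \theta)\, dN(t), \tag{11}$$

where $f$ and $w$ are two neural networks that control the flow and jump, respectively. Following our definition for the counting function (Eq. 1), all time dependent variables are left continuous in $t$, i.e., $\lim_{\epsilon \to 0^+} \mathbf{z}(t - \epsilon) = \mathbf{z}(t)$. Section 3.3 describes the neural network architectures for $f$, $w$, $\lambda$, and $p$.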

Now that we have fully defined the latent dynamics and stochastic event handling, we can simulate the hybrid system by integrating Eq. 11 forward in time with an adaptive step size ODE solver. The complete algorithm for simulating the hybrid system with stochastic events is in Section A.1. However, in this paper, we focus on prediction instead of simulation.
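The forward simulation can be sketched with a scalar latent state and a fixed-step Euler scheme. This is a simplification under stated assumptions: the paper uses an adaptive step size solver and learned networks, whereas the flow `toy_flow`, jump `toy_jump`, and intensity `toy_intensity` below are hypothetical stand-ins. In each step of length `dt`, an event fires with probability $\lambda\, dt$ based on the pre-jump (left-limit) state, the jump is applied, and then the state flows.

```python
import math
import random

def toy_flow(z):
    """Hypothetical flow f: exponential decay toward zero."""
    return -z

def toy_jump(z):
    """Hypothetical jump w: each event raises the state by one."""
    return 1.0

def toy_intensity(z):
    """Hypothetical intensity: softplus keeps it positive."""
    return math.log1p(math.exp(z))

def simulate(T, dt=1e-3, seed=0):
    """Simulate event times and the latent path on [0, T) with Euler steps."""
    rng = random.Random(seed)
    z, events, path = 0.0, [], []
    n_steps = int(round(T / dt))
    for i in range(n_steps):
        t = i * dt
        # An event happens in [t, t + dt) with probability lambda(z(t)) * dt.
        if rng.random() < toy_intensity(z) * dt:
            events.append(t)
            z += toy_jump(z)      # abrupt jump of the latent state
        z += toy_flow(z) * dt     # continuous flow between events
        path.append(z)
    return events, path

events, path = simulate(T=5.0)
```

Because the intensity depends on the latent state, every jump raises the chance of further events until the flow decays the state again, which gives self-exciting behavior in this toy setting.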

### 3.2 Learning the Hybrid System

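For a given set of model parameters, we compute the log probability density for a sequence of events $\mathcal{H} = \{(\tau_j, \mathbf{k}_j)\}$ and define the loss function as

$$\mathcal{L} = -\log \mathbb{P}(\mathcal{H}) = -\sum_j \log \lambda(\mathbf{z}(\tau_j)) - \sum_j \log p(\mathbf{k}_j \mid \mathbf{z}(\tau_j)) + \int_{t_0}^{t_N} \lambda(\mathbf{z}(t))\, dt. \tag{12}$$

In practice, the integral in Eq. 12 is computed by a weighted sum of intensities $\lambda(\mathbf{z}(t_i))$ on checkpoints $\{t_i\}$. Therefore, computing the loss function requires integrating Eq. 11 forward from $t_0$ to $t_N$ and recording the latent vectors along the trajectory.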
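The weighted-sum approximation of the loss can be sketched as below. The sketch covers only the timing part of the likelihood (the event-mark term is omitted), uses trapezoid weights on evenly spaced checkpoints as one possible choice of weights, and takes the intensity as a plain function of time, which is a stand-in for $\lambda(\mathbf{z}(t))$ evaluated along a recorded trajectory. The function name and arguments are hypothetical.

```python
import math

def neg_log_likelihood(event_times, intensity, t0, tN, n_checkpoints=101):
    """-sum_j log lambda(tau_j) + integral of lambda over [t0, tN] (trapezoid rule)."""
    log_term = sum(math.log(intensity(tau)) for tau in event_times)
    h = (tN - t0) / (n_checkpoints - 1)
    checkpoints = [t0 + i * h for i in range(n_checkpoints)]
    # Trapezoid weights: h in the interior, h / 2 at both ends.
    weights = [h] * n_checkpoints
    weights[0] = weights[-1] = h / 2
    integral = sum(w * intensity(t) for w, t in zip(weights, checkpoints))
    return -log_term + integral

# Constant intensity 2 with three events on [0, 5]: the exact value is 10 - 3 log 2.
nll = neg_log_likelihood([1.0, 2.0, 3.0], lambda t: 2.0, t0=0.0, tN=5.0)
```

For a constant intensity the trapezoid rule is exact, which makes this case a convenient sanity check.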

The functional derivatives are evaluated with the adjoint method (Eq. 9). However, we encounter jumps $\Delta\mathbf{z}(\tau_j)$ in the latent vector when integrating the adjoint equations backwards in time (Fig. 1). These jumps also introduce discontinuities to the adjoint vectors at $\tau_j$.

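Denote the right limit of any time dependent variable $x(t)$ by $x(t^+) = \lim_{\epsilon \to 0^+} x(t + \epsilon)$. Then, at any timestamp $\tau_j$ when an event happens, the left and right limits of the adjoint vectors $\mathbf{a}$, $\mathbf{a}_\theta$, and $a_t$ exhibit the following relationships (see Section A.2 for the derivation):

$$\begin{aligned} \mathbf{a}(\tau_j) &= \mathbf{a}(\tau_j^+) + \mathbf{a}(\tau_j^+) \frac{\partial w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta)}{\partial \mathbf{z}(\tau_j)}, \\ \mathbf{a}_\theta(\tau_j) &= \mathbf{a}_\theta(\tau_j^+) + \mathbf{a}(\tau_j^+) \frac{\partial w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta)}{\partial \theta}, \\ a_t(\tau_j) &= a_t(\tau_j^+) + \mathbf{a}(\tau_j^+) \frac{\partial w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta)}{\partial \tau_j}. \end{aligned} \tag{13}$$

In order to compute the loss function derivatives $d\mathcal{L}/d\mathbf{z}(t_0) = \mathbf{a}(t_0)$, $d\mathcal{L}/d\theta = \mathbf{a}_\theta(t_0)$, and $d\mathcal{L}/dt_0 = a_t(t_0)$, we integrate the adjoint vectors backwards in time following Eq. 9. However, at every $\tau_j$ when an event happens, the adjoint vectors are discontinuous and need to be lifted from their right limits to their left limits. One caveat is that computing the Jacobian in Eq. 13 requires the value of $\mathbf{z}(t)$ at the left limit, which needs to be recorded during forward integration. The complete algorithm for integrating forward and backward is described in Section A.3.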
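The lifting of the adjoint across a jump can be checked on a scalar toy problem with closed-form flows. The sketch assumes the flow $dz/dt = \theta z$, a single event at time $\tau$ with the hypothetical jump function $w(z) = \gamma z$, and the loss $\mathcal{L} = z(T)$. Under the jump relation $a(\tau) = a(\tau^+) + a(\tau^+)\, \partial w / \partial z$, and the analogous relation for the jump parameter $\gamma$, the adjoint results should match finite differences of the loss. All names are illustrative.

```python
import math

def loss(z0, theta, gamma, tau, T):
    """L = z(T): flow to tau, jump by gamma * z(tau), then flow to T."""
    z_tau = z0 * math.exp(theta * tau)        # left limit z(tau)
    z_plus = z_tau + gamma * z_tau            # right limit z(tau+)
    return z_plus * math.exp(theta * (T - tau))

def adjoint_with_jump(z0, theta, gamma, tau, T):
    """Return dL/dz0 and dL/dgamma using the adjoint lifted across the jump."""
    z_tau = z0 * math.exp(theta * tau)        # left limit, recorded in the forward pass
    # Backward from T to tau+: da/dt = -a * theta with a(T) = 1.
    a_right = math.exp(theta * (T - tau))
    # Lift across the jump: dw/dz = gamma and dw/dgamma = z(tau).
    a_left = a_right + a_right * gamma
    a_gamma = a_right * z_tau
    # Backward from tau to 0 with the same adjoint ODE.
    a0 = a_left * math.exp(theta * tau)
    return a0, a_gamma

z0, theta, gamma, tau, T = 1.5, 0.3, 0.4, 0.7, 2.0
grad_z0, grad_gamma = adjoint_with_jump(z0, theta, gamma, tau, T)

# Central finite differences for comparison.
eps = 1e-6
fd_z0 = (loss(z0 + eps, theta, gamma, tau, T) - loss(z0 - eps, theta, gamma, tau, T)) / (2 * eps)
fd_gamma = (loss(z0, theta, gamma + eps, tau, T) - loss(z0, theta, gamma - eps, tau, T)) / (2 * eps)
```

Note that the lift needs `z_tau`, the state just before the jump, which is why the left-limit state has to be stored during the forward pass.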

### 3.3 Network Architectures

Figure 2 shows the network architectures that parameterize our model. In order to better simulate the time series, the latent state $\mathbf{z}(t) \in \mathbb{R}^n$ is further split into two vectors: $\mathbf{c}(t) \in \mathbb{R}^{n_1}$ encodes the internal state, and $\mathbf{h}(t) \in \mathbb{R}^{n_2}$ encodes the memory of events up to time $t$, where $n = n_1 + n_2$.

Dynamics function $f(\mathbf{z})$. We parameterize the internal state dynamics $\partial \mathbf{c}(t) / \partial t$ by a multi-layer perceptron (MLP). Furthermore, we require $\partial \mathbf{c}(t) / \partial t$ to be orthogonal to $\mathbf{c}(t)$. This constrains the internal state dynamics to a sphere and guarantees the stability of the ODE solution. On the other hand, the event memory $\mathbf{h}(t)$ decays over time, with a decay rate parameterized by another MLP, whose output passes through a softplus activation to guarantee that the decay rate is positive.
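One simple way to obtain dynamics orthogonal to the internal state is to subtract from a raw proposal its component along the state; since $d\|\mathbf{c}\|^2/dt = 2\, \mathbf{c} \cdot d\mathbf{c}/dt$, an orthogonal velocity keeps the norm constant. The sketch below illustrates this projection together with a softplus-gated decay for the event memory. It is an assumption about how such constraints could be enforced, not a description of the authors' code; the raw vectors stand in for MLP outputs and all names are hypothetical.

```python
import math

def orthogonal_dynamics(c, raw):
    """Project raw onto the subspace orthogonal to c, so that c . dc/dt = 0."""
    cc = sum(ci * ci for ci in c)
    rc = sum(ri * ci for ri, ci in zip(raw, c))
    return [ri - (rc / cc) * ci for ri, ci in zip(raw, c)]

def softplus(x):
    """Smooth positive function log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def memory_decay(h, raw_rate):
    """dh/dt = -softplus(raw_rate) * h, elementwise, so the decay rate is positive."""
    return [-softplus(r) * hi for r, hi in zip(raw_rate, h)]

c = [3.0, 4.0]
dc_dt = orthogonal_dynamics(c, raw=[1.0, 2.0])
dh_dt = memory_decay(h=[1.0, 2.0], raw_rate=[0.0, -1.0])
```

In this example `dc_dt` has zero inner product with `c`, and every entry of `dh_dt` has the opposite sign of the corresponding memory entry.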

Jump function $w(\mathbf{z}(t))$. An event introduces a jump $\Delta\mathbf{h}(t)$ to the event memory $\mathbf{h}(t)$. The jump is parameterized by an MLP that takes the event embedding $\mathbf{k}(t)$ and internal state $\mathbf{c}(t)$ as input. Our architecture also assumes the event does not directly interrupt the internal state (i.e., $\Delta\mathbf{c}(t) = 0$).

Intensity $\lambda(\mathbf{z}(t))$ and probability $p(\mathbf{k} \mid \mathbf{z}(t))$. We use an MLP to compute both the total intensity and the probability distribution over the event embedding. For events that are discrete (where $\mathbf{k}$ is a one-hot encoding), the MLP directly outputs the intensity of each event type. For events with real-valued features, the probability density distribution is represented by a mixture of Gaussians, and the MLP outputs the weight, mean, and variance of each Gaussian.

## 4 Experimental Results

Next, we use our model to study a variety of synthetic and real-world time series of events that occur at real-valued timestamps.

### 4.1 Modeling Conditional Intensity — Synthetic Data from Classical Point Process Models

We first demonstrate our model’s flexibility to capture the influence of event history on the conditional intensity in a variety of point process models. To show the robustness of our model, we consider the following generative processes (using code from https://github.com/dunan/MultiVariatePointProcess). We only focus on modeling the conditional intensity in this part, so all events are assigned the same type.

- (i) Poisson Process: the conditional intensity is given by , where ;
- (ii) Hawkes Process (Exponential Kernel): the conditional intensity is given by Eq. 4 with the exponential kernel in Eq. 5, where ;
- (iii) Hawkes Process (Power-Law Kernel): the conditional intensity is given by Eq. 4 with the power-law kernel in Eq. 5, where ; and
- (iv) Self-Correcting Process: the conditional intensity is given by Eq. 6, where .

For each generative process, we create a dataset by simulating event sequences within the time interval and use for training, for validation and for testing. We fit our Neural JSDE model to each dataset using the training procedure described above. In addition, we fit the parameters of each of the four point processes to each dataset using maximum likelihood estimation; these serve as baselines for our model. Furthermore, we also compare the performance of our model with an RNN (the RNN uses a -dimensional latent vector, while our model uses a -dimensional latent vector). The RNN models the latent state on evenly spaced timestamps across the entire time window, and each event is rounded to the closest timestamp.

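The average conditional intensity varies among different generative models. For a meaningful comparison, we measure accuracy with the mean absolute percentage error:

$$\frac{1}{t_1 - t_0} \int_{t_0}^{t_1} \left| \frac{\lambda^{*}(t) - \lambda(t)}{\lambda(t)} \right| dt \times 100\%, \tag{14}$$

where $\lambda^{*}(t)$ is the conditional intensity predicted by a fitted model, $\lambda(t)$ is the ground truth intensity, and $[t_0, t_1]$ is the time window of a sequence.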
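On a grid of evaluation times, the mean absolute percentage error reduces to an average of relative errors. The sketch below assumes evenly spaced evaluation times, so that the time average becomes a plain mean; the function name and the sample values are illustrative only.

```python
def mean_absolute_percentage_error(predicted, truth):
    """Average of |predicted - truth| / truth over the grid, in percent."""
    relative_errors = [abs(p - g) / g for p, g in zip(predicted, truth)]
    return 100.0 * sum(relative_errors) / len(relative_errors)

# Two grid points with a 10% error each (illustrative intensities).
error = mean_absolute_percentage_error([1.1, 0.9], [1.0, 1.0])
```

Dividing by the true intensity is what makes errors comparable across processes whose average intensities differ.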

Table 1 shows the prediction accuracy of the conditional intensity for our model and the baselines.

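Table 1: Mean absolute percentage error of the predicted conditional intensity. Rows are the fitted models and columns are the generative processes of the datasets.

| Model | Poisson | Hawkes (E) | Hawkes (PL) | Self-Correcting |
| --- | --- | --- | --- | --- |
| Poisson | 3.6 | 188.2 | 95.6 | 29.1 |
| Hawkes (E) | 3.6 | 3.5 | 155.4 | 29.1 |
| Hawkes (PL) | 4.2 | 128.5 | 9.8 | 29.1 |
| Self-Correcting | 101.9 | 101.0 | 87.1 | 1.6 |
| RNN | 3.2 | 22.0 | 20.1 | 24.3 |
| Neural JSDE | 1.3 | 5.9 | 17.1 | 9.3 |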

In all cases, our Neural JSDE model is a better fit for the data than the RNN and the other point process models (except for the ground truth model, which shows what we can expect to achieve with perfect information). Remarkably, our model is able to capture the delaying effect in the power-law kernel (Fig. 3D) through a complex interplay between the internal state and event memory: although an event immediately introduces a jump to the event memory $\mathbf{h}(t)$, the intensity function peaks when the internal state $\mathbf{c}(t)$ is the largest, which lags behind $\mathbf{h}(t)$.

### 4.2 Discrete Event Type Prediction on Social and Medical Datasets

Table 2: Event type classification error rate.

| Error Rate | Du et al. (2016) | Mei and Eisner (2017) | NJSDE |
| --- | --- | --- | --- |
| Stack Overflow | 54.1 | 53.7 | 52.7 |
| MIMIC2 | 18.8 | 16.8 | 19.8 |

Next, we evaluate our model on a discrete-type event prediction task with two real-world datasets. The Stack Overflow dataset contains the awards history of users in an online question-answering website Du et al. (2016). Each sequence is a collection of badges a user received over a period of years, and there are different badge types in total. The medical records (MIMIC2) dataset contains the clinical visit history of de-identified patients in an Intensive Care Unit Du et al. (2016). Each sequence consists of the visiting events of a patient over a period of years, where the event type is the reason for the visit ( reasons in total). Using 5-fold cross validation, we predict the event type of every held-out event by choosing the event embedding with the largest probability given the past event history. We compare the event type classification error rate of our model against two neural point process models, based on an RNN Du et al. (2016) and an LSTM Mei and Eisner (2017), that directly model the next event from the history (as opposed to simulating the latent trajectory). Our model achieves performance similar to these state-of-the-art models (Table 2) and also allows us to model events with real-valued features, as we study next.

### 4.3 Real-Valued Event Feature Prediction — Synthetic and Earthquake data

Finally, we use our model to predict events with real-valued features. To this end, we first test our model on synthetic event sequences whose event times are generated by a Hawkes process with exponential kernel, but the feature of each event records the time interval since the previous event. We train our model in a way similar to Section 4.1 and achieve a mean absolute error of . In contrast, the baseline of simply predicting the “average feature” would result in a mean absolute error of . Fig. 4 shows one event sequence and predicted event features.

We also provide an illustrative example of real-world data with real-valued features. We use our model to predict the time and locations of earthquakes above level 4.0 in 2007–2018 using historical data from 1970–2006 (data from https://www.kaggle.com/danielpe/earthquakes). In this case, an event’s features are the longitude and latitude of an earthquake. We run our model using a 20-dimensional latent vector, and the event feature’s probability density distribution is given by a mixture of 5 Gaussians. Figure 5 shows the contours of the conditional intensity of the learned Neural JSDE model.

## 5 Related Work

Temporal point processes are an elegant abstraction for time series analysis. The self-exciting nature of the Hawkes process has made it a key model within machine learning and information science Zhou et al. (2013); Farajtabar et al. (2015); Valera and Gomez-Rodriguez (2015); Li and Zha (2014); Xu et al. (2017); Xu and Zha (2017); Farajtabar et al. (2014); Zarezade et al. (2017). However, classical point process models (including the Hawkes process) make strong assumptions about how the event history influences future dynamics. To get around this, RNNs and LSTMs have been adapted to directly model events as time steps within the model Du et al. (2016); Mei and Eisner (2017). However, these models do not consider latent space dynamics in the absence of events as we do, and such dynamics may reflect time-varying internal evolution that inherently exists in the system. Xiao et al. also proposed a combined approach to event history and internal state evolution by simultaneously using two RNNs: one that takes the event sequence as input, and one that models evenly spaced time intervals Xiao et al. (2017b). In contrast, our model provides a unified approach that addresses both aspects, connects to physical models expressed as ordinary differential equations, and can be efficiently trained with the adjoint method using only constant memory. Another approach uses GANs to circumvent modeling the intensity function Xiao et al. (2017a); however, it cannot provide insight into the dynamics of the system.

More broadly, learning differential equations from data has been successful for physics-based problems Raissi and Karniadakis (2018); Raissi et al. (2018); Khoo et al. (2017); Long et al. (2017); Fan et al. (2018). In terms of stochastic differential equations, Ryder et al. proposed an RNN-based variational approximation for latent states Ryder et al. (2018); however, these methods are not amenable to the jump processes we consider here.

## 6 Discussion

We have developed Neural Jump Stochastic Differential Equations, a general framework for modeling temporal event sequences. Our model learns both the latent continuous dynamics of the system and the abrupt effects of events from data. The model maintains the simplicity and memory efficiency of Neural ODEs and uses a similar adjoint method for learning; in our case, we additionally model jumps in the trajectory with a neural network, and handle the effects of this discontinuity in the learning method. We demonstrate the state-of-the-art performance of our model on a variety of prediction tasks, including intensity functions of classical temporal point processes, discrete event types in Stack Overflow awards and electronic medical records, and spatiotemporal prediction of earthquakes.

## References

- Blundell et al. (2012) Blundell, C., Beck, J., and Heller, K. A. (2012). Modelling reciprocating relationships with hawkes processes. In Advances in Neural Information Processing Systems, pages 2600–2608.
- Branicky (2005) Branicky, M. S. (2005). Introduction to Hybrid Systems. Birkhäuser Boston, Boston, MA.
- Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc.
- Corner et al. (2018) Corner, S., Sandu, C., and Sandu, A. (2018). Adjoint Sensitivity Analysis of Hybrid Multibody Dynamical Systems. arXiv.
- Cox and Ross (1976) Cox, J. C. and Ross, S. A. (1976). The valuation of options for alternative stochastic processes. Journal of financial economics, 3(1-2):145–166.
- Daley and Vere-Jones (2003) Daley, D. J. and Vere-Jones, D. (2003). An introduction to the theory of point processes. Probability and its Applications (New York). Springer-Verlag, New York, second edition. Elementary theory and methods.
- Du et al. (2016) Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. (2016). Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1555–1564, New York, NY, USA. ACM.
- Fan et al. (2018) Fan, Y., Lin, L., Ying, L., and Zepeda-Núnez, L. (2018). A multiscale neural network based on hierarchical matrices. arXiv preprint arXiv:1807.01883.
- Farajtabar et al. (2014) Farajtabar, M., Du, N., Rodriguez, M. G., Valera, I., Zha, H., and Song, L. (2014). Shaping social activity by incentivizing users. In Advances in neural information processing systems, pages 2474–2482.
- Farajtabar et al. (2015) Farajtabar, M., Wang, Y., Rodriguez, M. G., Li, S., Zha, H., and Song, L. (2015). Coevolve: A joint point process model for information diffusion and network co-evolution. In Advances in Neural Information Processing Systems, pages 1954–1962.
- Glover and Lygeros (2004) Glover, W. and Lygeros, J. (2004). A stochastic hybrid model for air traffic control simulation. In Alur, R. and Pappas, G. J., editors, Hybrid Systems: Computation and Control, pages 372–386, Berlin, Heidelberg. Springer Berlin Heidelberg.
- Hardiman et al. (2013) Hardiman, S. J., Bercot, N., and Bouchaud, J.-P. (2013). Critical reflexivity in financial markets: a hawkes process analysis. The European Physical Journal B, 86(10):442.
- Hespanha (2004) Hespanha, J. P. (2004). Stochastic hybrid systems: Application to communication networks. In Alur, R. and Pappas, G. J., editors, Hybrid Systems: Computation and Control, pages 387–401, Berlin, Heidelberg. Springer Berlin Heidelberg.
- Khoo et al. (2017) Khoo, Y., Lu, J., and Ying, L. (2017). Solving parametric pde problems with artificial neural networks. arXiv preprint arXiv:1707.03351.
- Kobayashi and Lambiotte (2016) Kobayashi, R. and Lambiotte, R. (2016). Tideh: Time-dependent hawkes process for predicting retweet dynamics. In Tenth International AAAI Conference on Web and Social Media.
- Laub et al. (2015) Laub, P. J., Taimre, T., and Pollett, P. K. (2015). Hawkes Processes. arXiv e-prints, page arXiv:1507.02822.
- Li and Zha (2013) Li, L. and Zha, H. (2013). Dyadic event attribution in social networks with mixtures of hawkes processes. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1667–1672. ACM.
- Li and Zha (2014) Li, L. and Zha, H. (2014). Learning parametric models for social infectivity in multi-dimensional hawkes processes. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
- Li et al. (2017) Li, X., Omotere, O., Qian, L., and Dougherty, E. R. (2017). Review of stochastic hybrid systems with applications in biological systems modeling and analysis. EURASIP Journal on Bioinformatics and Systems Biology, 2017(1):8.
- Long et al. (2017) Long, Z., Lu, Y., Ma, X., and Dong, B. (2017). PDE-net: Learning pdes from data. arXiv preprint arXiv:1710.09668.
- Mei and Eisner (2017) Mei, H. and Eisner, J. (2017). The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, Long Beach.
- Merton (1976) Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. Journal of financial economics, 3(1-2):125–144.
- Ogata (1999) Ogata, Y. (1999). Seismicity analysis through point-process modeling: A review. pure and applied geophysics, 155(2):471–507.
- Ogata and Vere-Jones (1984) Ogata, Y. and Vere-Jones, D. (1984). Inference for earthquake models: A self-correcting model. Stochastic Processes and their Applications, 17(2):337 – 347.
- Raissi and Karniadakis (2018) Raissi, M. and Karniadakis, G. E. (2018). Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141.
- Raissi et al. (2018) Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2018). Multistep neural networks for data-driven discovery of nonlinear dynamical systems. arXiv preprint arXiv:1801.01236.
- Rizoiu et al. (2016) Rizoiu, M.-A., Xie, L., Sanner, S., Cebrian, M., Yu, H., and Van Hentenryck, P. (2016). Expecting to be HIP: Hawkes Intensity Processes for Social Media Popularity. arXiv e-prints, page arXiv:1602.06033.
- Ryder et al. (2018) Ryder, T., Golightly, A., McGough, A. S., and Prangle, D. (2018). Black-box variational inference for stochastic differential equations. arXiv preprint arXiv:1802.03335.
- Stomakhin et al. (2011) Stomakhin, A., Short, M. B., and Bertozzi, A. L. (2011). Reconstruction of missing data in social networks based on temporal patterns of interactions. Inverse Problems, 27(11):115013.
- Valera and Gomez-Rodriguez (2015) Valera, I. and Gomez-Rodriguez, M. (2015). Modeling adoption and usage of competing products. In 2015 IEEE International Conference on Data Mining, pages 409–418. IEEE.
- Van Der Schaft and Schumacher (2000) Van Der Schaft, A. J. and Schumacher, J. M. (2000). An introduction to hybrid dynamical systems, volume 251. Springer London.
- Xiao et al. (2017a) Xiao, S., Farajtabar, M., Ye, X., Yan, J., Song, L., and Zha, H. (2017a). Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, pages 3247–3257.
- Xiao et al. (2017b) Xiao, S., Yan, J., Chu, S. M., Yang, X., and Zha, H. (2017b). Modeling the intensity function of point process via recurrent neural networks. In AAAI.
- Xu et al. (2017) Xu, H., Luo, D., and Zha, H. (2017). Learning hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning, pages 3831–3840.
- Xu and Zha (2017) Xu, H. and Zha, H. (2017). A dirichlet mixture model of hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems, pages 1354–1363.
- Zarezade et al. (2017) Zarezade, A., Upadhyay, U., Rabiee, H. R., and Gomez-Rodriguez, M. (2017). Redqueen: An online algorithm for smart broadcasting in social networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 51–60. ACM.
- Zhou et al. (2013) Zhou, K., Zha, H., and Song, L. (2013). Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning, pages 1301–1309.

## Appendix A

### A.1 Algorithm for Simulating Hybrid System with Stochastic Events

Note that when an event happens within the step size $\Delta t$ proposed by the ODE solver, $\Delta t$ needs to shrink so that $t + \Delta t$ is no larger than the event time $\tau_j$.

### A.2 Adjoint Sensitivity Analysis at Discontinuities

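When the $j$-th event happens at timestamp $\tau_j$, the left and right limits of the latent variables are related by

$$\mathbf{z}(\tau_j^+) = \mathbf{z}(\tau_j) + w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta), \tag{15}$$

where all the time dependent variables are left continuous in time. According to Remark 2 from Corner et al. (2018), the left and right limits of the adjoint sensitivity variables at a discontinuity satisfy

$$\mathbf{a}(\tau_j) = \mathbf{a}(\tau_j^+) \frac{\partial \mathbf{z}(\tau_j^+)}{\partial \mathbf{z}(\tau_j)}. \tag{16}$$

Substituting Eq. 15 in Eq. 16 gives

$$\mathbf{a}(\tau_j) = \mathbf{a}(\tau_j^+) + \mathbf{a}(\tau_j^+) \frac{\partial w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta)}{\partial \mathbf{z}(\tau_j)}. \tag{17}$$

Moreover, Eq. 16 can be generalized to obtain the jumps of $\mathbf{a}_\theta$ and $a_t$ at the discontinuities. In the work of Chen et al. (2018), the authors define the augmented latent variables and their dynamics as

$$\mathbf{z}_{\mathrm{aug}}(t) = \begin{bmatrix} \mathbf{z}(t) \\ \theta \\ t \end{bmatrix}, \qquad \frac{d\mathbf{z}_{\mathrm{aug}}(t)}{dt} = f_{\mathrm{aug}}(\mathbf{z}_{\mathrm{aug}}(t)) = \begin{bmatrix} f(\mathbf{z}(t), t; \theta) \\ 0 \\ 1 \end{bmatrix}, \qquad \mathbf{a}_{\mathrm{aug}}(t) = \begin{bmatrix} \mathbf{a}(t) & \mathbf{a}_\theta(t) & a_t(t) \end{bmatrix}. \tag{18}$$

Following the same convention, we define the augmented jump function at $\tau_j$ as

$$w_{\mathrm{aug}}(\mathbf{z}_{\mathrm{aug}}(\tau_j)) = \begin{bmatrix} w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta) \\ 0 \\ 0 \end{bmatrix}. \tag{19}$$

We can verify that the left and right limits of the augmented latent variables satisfy

$$\mathbf{z}_{\mathrm{aug}}(\tau_j^+) = \mathbf{z}_{\mathrm{aug}}(\tau_j) + w_{\mathrm{aug}}(\mathbf{z}_{\mathrm{aug}}(\tau_j)). \tag{20}$$

The augmented dynamics is only a special case of the general Neural ODE framework, and the jump of the adjoint variables can be calculated as

$$\mathbf{a}_{\mathrm{aug}}(\tau_j) = \mathbf{a}_{\mathrm{aug}}(\tau_j^+) \frac{\partial \mathbf{z}_{\mathrm{aug}}(\tau_j^+)}{\partial \mathbf{z}_{\mathrm{aug}}(\tau_j)} = \mathbf{a}_{\mathrm{aug}}(\tau_j^+) + \mathbf{a}(\tau_j^+) \begin{bmatrix} \dfrac{\partial w}{\partial \mathbf{z}(\tau_j)} & \dfrac{\partial w}{\partial \theta} & \dfrac{\partial w}{\partial \tau_j} \end{bmatrix}, \tag{21}$$

where $w$ is shorthand for $w(\mathbf{z}(\tau_j), \mathbf{k}_j, \tau_j; \theta)$, which is equivalent to Eq. 13.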