Temporal Normalizing Flows
Abstract
Analyzing and interpreting time-dependent stochastic data requires accurate and robust density estimation. In this paper we extend the concept of normalizing flows to so-called “temporal Normalizing Flows” (tNFs) to estimate time-dependent distributions, leveraging the full spatiotemporal information present in the dataset. Our approach is unsupervised, does not require an a priori characteristic scale and can accurately estimate multiscale distributions of vastly different length scales. We illustrate tNFs on sparse datasets of Brownian and chemotactic walkers, showing that the inclusion of temporal information enhances density estimation. Finally, we speculate how tNFs can be applied to fit and discover the continuous PDE underlying a stochastic process.
Code and examples at github.com/PhIMaL/temporal_normalizing_flows
Introduction
Density estimation from sparse time series data is ubiquitous in the interpretation of probabilistic or stochastic phenomena in quantitative science, e.g. in econometrics [23], variational inference [20] and the biological sciences [15]. In this latter application, single particle tracking (SPT) has become the method of choice to investigate the dynamics, structure and interaction of many molecules in a cellular context, allowing the observation of single molecule trafficking on the nanoscale throughout the cell. The obtained trajectories are typically interpreted as random walks and analyzed in terms of their mean squared displacement (MSD). This analysis provides insight into the underlying transport processes and has revealed their non-ergodicity and anomalous diffusive properties [15]. The main difficulty in SPT is linking the particles between frames to create a trajectory: particles cross, thus exchanging their identity, or stop fluorescing completely [15]. Alternatively, there exists a rich mathematical literature studying trajectories in terms of walker densities [12]. This perspective provides a way to extract transport properties from experimental data without the need to link the particles between frames. Key to this approach is accurately inferring the evolving particle density, particularly when data is sparse.
The classical approach to density estimation is binning. It provides an accurate density estimate when the sample size is large, but becomes sensitive to the location of the bins when data becomes sparse. This method is also subject to a bias-variance trade-off: small-scale features are not captured when using oversized bins, whereas undersized bins lead to a very noisy estimate. Alternatively, one can use continuous methods such as the Kernel Density Estimate (KDE). In KDE, a kernel is placed on each particle and the mean over all kernels then gives an estimate of the density. The resulting estimate is highly sensitive to the width of the kernel and, although several automated estimators exist [21, 23], choosing the right width is a non-trivial task. While these techniques are firmly established, inferring the distribution of a random variable evolving through time remains challenging. Explicitly including the temporal axis suppresses natural variations in the estimate by exploiting temporal correlations; samples taken closely together in time are likely to be only slightly different. To our knowledge, KDE and binning cannot include temporal dynamics under the constraint of particle conservation. In this paper we propose a novel technique based on normalizing flows, which is capable of handling these constraints.
Normalizing Flows (NFs) learn an arbitrarily complex probability distribution by applying a series of transformations to a known distribution in a latent space. NFs originated in the field of machine learning and were initially applied to infer posterior distributions in the context of variational inference [20]. They have been successfully applied as generative models (e.g. to generate novel faces [11]) and many papers showcase their capability as density estimators for 2D, time-independent toy problems [10, 4, 6]. NFs have several advantages as a density estimation technique: they are unsupervised, do not require an a priori length scale such as the bin or kernel width, and can naturally accommodate several such length scales across a dataset, a notoriously hard problem. Here we extend NFs to include temporal dynamics and hence name our approach temporal Normalizing Flows (tNFs).
The rest of the paper is organized as follows. In Section 2, we introduce and implement tNFs. Section 3 presents the application of tNFs to a multiscale toy problem and to datasets of Brownian and chemotactic particles. In Section 4 we present some further perspectives on this approach, in particular its potential as a physics-informed density estimator and its ability to perform accurate density estimation on finite domains.
Methods
Normalizing Flows
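Consider a set of samples $\{x_i\}$ taken from an unknown distribution $p(x)$. Given some model $\hat{p}_\theta(x)$, we can estimate $p(x)$ by minimizing the negative log-likelihood of the model on the data,
$$\mathcal{L} = -\sum_i \log \hat{p}_\theta(x_i). \tag{1}$$
To obtain an accurate estimate of $p(x)$, the model needs to be sufficiently flexible. Normalizing flows [5, 16, 20] allow the construction of an arbitrary model by applying an invertible transformation to a known probability density. Consider a random variable $z$ distributed according to the pdf $\pi(z)$. Given an invertible transformation $z = f(x)$, $x$ is then distributed according to $p(x)$, which is given by
$$p(x) = \pi(z)\left|\frac{\partial z}{\partial x}\right|. \tag{2}$$
Typically, $x$ is referred to as the real space and $z$ as the latent space, and $\pi(z)$ is usually a Gaussian. Normalizing flows learn the real-to-latent space mapping $f$, and consequently the density $p(x)$, by minimizing the negative log-likelihood,
$$\mathcal{L} = -\sum_i \left[\log \pi\big(f(x_i)\big) + \log\left|\frac{\partial f}{\partial x}\right|_{x = x_i}\right]. \tag{3}$$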
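As a minimal illustration of the negative log-likelihood in eq. 3, the sketch below uses the simplest possible invertible map, an affine transformation with a standard normal latent density. The function names, the affine form of the mapping and all parameter values are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def standard_normal_logpdf(z):
    # Log-density of the latent Gaussian (assumed standard normal here).
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def affine_flow_nll(x, mu, log_sigma):
    # Real-to-latent map z = (x - mu) / sigma, so log|dz/dx| = -log_sigma.
    z = (x - mu) / np.exp(log_sigma)
    log_jacobian = -log_sigma
    # Negative log-likelihood: latent log-density plus log-Jacobian, summed over samples.
    return -np.sum(standard_normal_logpdf(z) + log_jacobian)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=2000)  # hypothetical data

nll_true = affine_flow_nll(x, mu=2.0, log_sigma=np.log(0.5))  # matches the data
nll_wrong = affine_flow_nll(x, mu=0.0, log_sigma=0.0)         # identity map
print(nll_true < nll_wrong)
```

A flow whose parameters match the data-generating distribution attains a lower negative log-likelihood than the identity map, which is what training exploits. A practical flow replaces the affine map with a far more flexible invertible network.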
temporal Normalizing Flows
The normalizing flows presented in the previous section cannot account for temporal dynamics. Nonetheless, our starting point for deriving the temporal NF is the N-dimensional equivalent of eq. 2,
$$p(\mathbf{x}) = \pi(\mathbf{z})\left|\det J\right|, \tag{4}$$
where $J = \partial \mathbf{z}/\partial \mathbf{x}$ is the Jacobian of $f$. As we cannot write a conservation relation for the temporal axis, i.e. $\int p(x, t)\,\mathrm{d}t \neq 1$, we cannot include it as an additional dimension in eq. 4, which explains why NFs cannot account for temporal dynamics. However, assume for now that such a construction is possible. The determinant of the Jacobian for a 1D temporally varying distribution can then be written as
$$\det J = \det\begin{pmatrix} \partial z/\partial x & \partial z/\partial t \\ \partial \tau/\partial x & \partial \tau/\partial t \end{pmatrix}, \tag{5}$$
where $z$ is the latent spatial coordinate and $\tau$ the latent temporal coordinate. Note that both depend on $x$ and $t$, i.e. $z = z(x, t)$ and $\tau = \tau(x, t)$. As a transformation of $t$ is not allowed, the latent time $\tau$ must be equal to the real time $t$, so that $\partial\tau/\partial x = 0$ and $\partial\tau/\partial t = 1$, and the determinant of the Jacobian becomes
$$\det J = \frac{\partial z}{\partial x}. \tag{6}$$
The 1D temporal Normalizing Flow can then be written as
$$p(x, t) = \pi(z, t)\left|\frac{\partial z(x, t)}{\partial x}\right|. \tag{7}$$
We show a graphical interpretation of this in figure 2. While the temporal axis is not stretched or compressed, all frames are coupled through the mapping $z(x, t)$. Using a single mapping for the whole dataset prevents overfitting and suppresses natural variations in the estimate, as we will show in the results.
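A short check, written in notation assumed here ($z(x,t)$ for the mapping, $\pi$ for the latent density), shows why eq. 7 respects particle conservation in every frame. Provided that $z(\cdot, t)$ is monotonically increasing and maps the real line onto the support of $\pi$, a change of variables at fixed $t$ gives:

```latex
\int_{-\infty}^{\infty} p(x, t)\,\mathrm{d}x
  = \int_{-\infty}^{\infty} \pi\big(z(x, t), t\big)\,
    \frac{\partial z(x, t)}{\partial x}\,\mathrm{d}x
  = \int_{-\infty}^{\infty} \pi(z, t)\,\mathrm{d}z
  = 1 \quad \text{for every } t .
```

The density is therefore normalized per frame rather than over the joint $(x, t)$ domain, which is the constraint that a naive extra dimension in eq. 4 would violate.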
Implementation
The prime challenge of implementing NFs and tNFs in practice is finding a flexible yet invertible transformation. NFs are generally applied as generative models on high-dimensional data, requiring a computationally efficient method to evaluate the determinant of the Jacobian (see e.g. FFJORD [9], autoregressive flows [17] or Glow [11]). Spatiotemporal density estimation involves at most four dimensions, so that calculating the determinant of the Jacobian is not a computational constraint. This allows us to propose a relatively simple implementation.
Wehenkel et al. [22] recently introduced a method for the construction of monotonic neural networks, independent of the network's specific architecture. Building on the observation that a function is monotonic if its derivative is positive, they propose to constrain a neural network to positive outputs only and numerically integrate over the output to obtain a monotonic function. We slightly modify their approach and use an unconstrained feedforward neural network $g_\theta(x, t)$ to model the log Jacobian, $\log(\partial z/\partial x)$, so that $\partial z/\partial x = e^{g_\theta} > 0$, which naturally leads to a monotonic and hence invertible mapping $z(x, t)$. This leads to the following implementation for the tNF,
$$z(x, t) = \int_{x_0}^{x} e^{g_\theta(x', t)}\,\mathrm{d}x' + c(t). \tag{8}$$
Here $x_0$ is the lower limit of the integration and $c(t)$ is a time-dependent offset function. Both $g_\theta$ and $c$ are modeled by unconstrained neural networks with a tanh activation function (one contains 3 hidden layers of 30 neurons and the other contains 1 hidden layer of 100 neurons). In the remainder of this work we choose a time-independent Gaussian as latent distribution, $\pi(z, t) = \pi(z)$. We perform the integration in eq. 8 over a regular grid rather than integrating over the particles' positions. This approach scales with the size of the grid, rather than with the number of particles, works well when data is sparse and scales to higher dimensions.
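The construction described above can be sketched in a few lines of NumPy: an unconstrained network outputs the log Jacobian on a regular grid, its exponential is integrated cumulatively, and a time-dependent offset is added. This is an untrained illustration, not the authors' code. The assignment of layer sizes to the two networks, the random initialisation, the trapezoidal rule, the choice of the grid's lower edge as integration limit and the standard normal latent density are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # Random (untrained) weights and zero biases for a fully connected network.
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, inputs):
    # tanh hidden layers, linear (unconstrained) output layer.
    h = inputs
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

# Assumed assignment: 3x30 for the log-Jacobian network, 1x100 for the offset.
log_jac_net = init_mlp([2, 30, 30, 30, 1])
offset_net = init_mlp([1, 100, 1])

def tnf_mapping(x_grid, t):
    # Evaluate the log Jacobian on the grid at fixed time t.
    xt = np.stack([x_grid, np.full_like(x_grid, t)], axis=1)
    log_jac = mlp(log_jac_net, xt)[:, 0]
    jac = np.exp(log_jac)  # strictly positive, so z is monotonic in x
    dx = x_grid[1] - x_grid[0]
    # Cumulative trapezoidal integration starting at the lower grid edge.
    increments = 0.5 * (jac[1:] + jac[:-1]) * dx
    integral = np.concatenate([[0.0], np.cumsum(increments)])
    offset = mlp(offset_net, np.array([[t]]))[0, 0]
    return integral + offset, log_jac

def log_density(x_grid, t):
    # log p(x, t) = log pi(z) + log Jacobian, with a standard normal latent pi.
    z, log_jac = tnf_mapping(x_grid, t)
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi) + log_jac

x_grid = np.linspace(-3.0, 3.0, 201)
z, log_jac = tnf_mapping(x_grid, t=0.5)
log_p = log_density(x_grid, t=0.5)
print(np.all(np.diff(z) > 0))
```

Because the integrand is an exponential, monotonicity holds for any network weights, so no constraint on the architecture is needed. In training, the log-density at each particle position would be obtained by interpolating from the grid.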
Results
We now demonstrate tNFs on three datasets:

A multiscale toy problem to show tNFs can accommodate different length scales in a single distribution;

A dataset of Brownian motion to show how tNFs enhance density estimation for sparse datasets;

A dataset of chemotactic walkers to show that tNFs can correctly estimate a multimodal, non-Gaussian density.
Multiscale density estimation
A key problem in density estimation is inferring an accurate distribution when vastly different length scales are present within a single dataset. Classical approaches such as binning and KDE require a single characteristic length scale, prohibiting an accurate estimate of a multiscale distribution. We now show that normalizing flows, and by extension tNFs, are capable of accurately inferring such a distribution.
We build an artificial distribution consisting of three normal distributions with standard deviations of and (thus spanning three orders of magnitude) and respective weights and . Figure 3 shows the inferred distribution from 5000 samples for the NF and the KDE with Scott’s rule determining the length scale. Observe that, as expected, the KDE is unable to accommodate the different scales and that, due to the different weighting of each peak, the widest peak dominates the length scale estimation. In contrast, the NF provides an accurate density estimate for all length scales present in the problem, independent of their weights.
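To illustrate why a single bandwidth struggles with such a distribution, the sketch below draws 5000 samples from a three-component Gaussian mixture and computes the 1D bandwidth in the form $h = \hat{\sigma}\, n^{-1/5}$, which is how Scott's rule is applied in `scipy.stats.gaussian_kde`. The component centers, widths and weights are hypothetical placeholders and are not the values used for figure 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical mixture: widths spanning three orders of magnitude.
centers = np.array([-4.0, 0.0, 2.0])
sigmas = np.array([1.0, 0.1, 0.01])
weights = np.array([0.6, 0.3, 0.1])

component = rng.choice(3, size=n, p=weights)
samples = rng.normal(centers[component], sigmas[component])

# Scott's rule in 1D: one global bandwidth set by the overall sample spread.
bandwidth = samples.std(ddof=1) * n ** (-1.0 / 5.0)

def mixture_pdf(x):
    # True density of the hypothetical mixture at a single point x.
    norm = sigmas * np.sqrt(2.0 * np.pi)
    return np.sum(weights * np.exp(-0.5 * ((x - centers) / sigmas) ** 2) / norm)

def gaussian_kde_at(x, data, h):
    # Gaussian KDE evaluated at a single point x: mean over all kernels.
    u = (x - data) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * np.sqrt(2.0 * np.pi))

true_at_peak = mixture_pdf(centers[2])
kde_at_peak = gaussian_kde_at(centers[2], samples, bandwidth)
print(bandwidth, true_at_peak, kde_at_peak)
```

With these placeholder values the bandwidth is set by the overall spread of the samples and is much larger than the narrowest component, so the KDE strongly underestimates the height of the sharpest peak.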
Brownian motion
Brownian motion is the most basic and ubiquitous random walk and thus an ideal test case to assess the performance of tNFs, comparing them to time-independent NFs and classical binning. We generate a single trajectory for a Brownian random walker by the recursive relation $x_{n+1} = x_n + \sqrt{2 D \Delta t}\,\xi_n$, where $\xi_n$ is drawn from a standard normal distribution. Here $n$ is the step number with $x_0$ the initial position, $D$ the diffusion coefficient and $\Delta t$ the time step. In the limit of an infinite number of walkers, the walker density $\rho(x, t)$ is described by the diffusion equation, $\partial_t \rho = D\,\partial_x^2 \rho$.
Our dataset consists of walkers with , with snapshots being taken every for frames. The initial positions were sampled from a Gaussian centered at with width ; in this case, the diffusion equation can be solved exactly and the solution is a Gaussian that spreads in time. We show the estimated density at and in figure 4(a) and (b) for the tNF, the time-independent NF and binning. The tNF provides a significantly better density estimate than the time-independent NF, as illustrated by the difference in error: for the tNF and for the NF, averaged over frames 15 and 85.
Normalizing flows are based on neural networks and hence prone to overfitting. We analyze the effect of overfitting in Appendix I and show that NFs overfit more strongly than tNFs and perform worse in terms of the error. We mainly attribute this improvement to the temporal correlations in the dataset, which suppress the natural frame-to-frame variations in the density estimate. Nonetheless, tNFs are not immune to overfitting and we speculate that performance could be enhanced by applying techniques such as early stopping.
For the diffusion equation the true mapping can be trivially derived. We compare it to the learned mapping in figure 4(c). It shows perfect agreement at , but deviates from the true curve for at . As can be seen in figure 4(a), no samples were present in this domain, which explains the deviation. Nonetheless, this implies that the network does not generalize well outside the sampling domain. We speculate that techniques such as batch normalization or a different architecture for the network (a recurrent network, for example) might further improve performance.
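A dataset of this kind can be generated with a few lines of NumPy. The sketch below assumes the standard discretisation of Brownian motion, in which each step adds a Gaussian increment of variance $2 D \Delta t$, and checks the simulated spread against the exact spreading-Gaussian solution. All parameter values are hypothetical and chosen only so that the check is statistically tight.

```python
import numpy as np

def simulate_brownian(n_walkers, n_frames, D, dt, x0_mean, x0_std, rng):
    # positions[n] holds the positions of all walkers in frame n.
    positions = np.empty((n_frames, n_walkers))
    positions[0] = rng.normal(x0_mean, x0_std, size=n_walkers)
    step_std = np.sqrt(2.0 * D * dt)
    for n in range(n_frames - 1):
        positions[n + 1] = positions[n] + step_std * rng.normal(size=n_walkers)
    return positions

def analytic_density(x, t, D, x0_mean, x0_std):
    # Exact solution of the diffusion equation for a Gaussian initial condition.
    var = x0_std**2 + 2.0 * D * t
    return np.exp(-(x - x0_mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(2)
D, dt, n_frames = 0.5, 0.05, 100  # hypothetical values
traj = simulate_brownian(20000, n_frames, D, dt, x0_mean=0.0, x0_std=0.5, rng=rng)

t_final = (n_frames - 1) * dt
empirical_var = traj[-1].var()
expected_var = 0.5**2 + 2.0 * D * t_final
print(empirical_var, expected_var)
```

The array `traj` has one row per frame, which is the form a density estimator operating on snapshots would consume. No linking of particles between frames is needed for the density-based analysis.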
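For reference, the true mapping can be written out under the assumption of a standard normal latent density. The symbols $\mu_0$ and $\sigma_0$ for the center and width of the initial Gaussian are notation introduced here. The exact solution of the diffusion equation and the mapping that reproduces it are then:

```latex
\rho(x, t) = \frac{1}{\sqrt{2\pi\sigma_t^2}}
             \exp\!\left(-\frac{(x - \mu_0)^2}{2\sigma_t^2}\right),
\qquad \sigma_t^2 = \sigma_0^2 + 2 D t,
\\[6pt]
z(x, t) = \frac{x - \mu_0}{\sigma_t},
\qquad \frac{\partial z}{\partial x} = \frac{1}{\sigma_t},
\qquad \pi(z)\,\frac{\partial z}{\partial x}
  = \frac{1}{\sqrt{2\pi}\,\sigma_t}\, e^{-z^2/2} = \rho(x, t).
```

The mapping is linear in $x$ at every time, with a slope that decreases as the distribution spreads. It is unique up to the reflection $z \to -z$, which is excluded when the mapping is constrained to be increasing.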
Chemotaxis
The Brownian motion presented in the previous section was a linear problem with a unimodal, Gaussian solution. We now apply tNFs to so-called chemotactic walkers, a nonlinear problem with a multimodal solution. Bacteria and other microorganisms sense gradients of chemicals throughout their environment and use this to guide their motion towards a food source. This effect is known as chemotaxis and is typically modelled by a random walker with a superimposed drift, $x_{n+1} = x_n + \chi\,\partial_x c\,\Delta t + \sqrt{2 D_\rho \Delta t}\,\xi_n$, where $c$ is the chemical density, $\chi$ is the chemotactic sensitivity, which controls the interaction between the chemical and the bacteria, $\Delta t$ is the time step and $\xi_n$ is a standard normal random number. In the infinite walker limit, the walker density $\rho$ and chemical density $c$ are given by the Keller-Segel model: $\partial_t \rho = D_\rho\,\partial_x^2 \rho - \chi\,\partial_x(\rho\,\partial_x c)$ and $\partial_t c = D_c\,\partial_x^2 c - k c$. Here $D_\rho$ and $D_c$ are the diffusion coefficients of the bacteria and the chemical respectively, and a decay set by $k$ has been added to the chemical density.
Our dataset consisted of walkers with and we sampled the initial position from a Gaussian centred at . The food source was modelled by a Gaussian with diffusion coefficient , centred at ; the walkers will thus drift towards the food source over time. Figure 5 shows a comparison of the time-independent NF, tNF and the binning method. In figure 5(a) and (b) we find that the tNF leads to a significantly more accurate density estimation, as illustrated by the difference in error ( for the NF versus for the tNF, averaged over and ). The tNF captures the multimodal distribution at excellently, without overfitting, in contrast to the time-independent NF. The mapping, as shown in figure 5(c), is nonlinear, in contrast to the mapping obtained for the Brownian motion.
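The drift-plus-noise walker described above can be sketched as follows. As a deliberate simplification, the chemical field is taken to be a static Gaussian, so its diffusion and decay are ignored. The function names and all parameter values are hypothetical and do not correspond to the dataset of figure 5.

```python
import numpy as np

def chemical_gradient(x, center, width):
    # Spatial derivative of a static Gaussian chemical field c(x).
    c = np.exp(-(x - center) ** 2 / (2.0 * width**2))
    return -(x - center) / width**2 * c

def simulate_chemotaxis(x_init, n_steps, D, chi, dt, center, width, rng):
    # Random walk with a superimposed drift along the chemical gradient.
    x = x_init.copy()
    for _ in range(n_steps):
        drift = chi * chemical_gradient(x, center, width) * dt
        noise = np.sqrt(2.0 * D * dt) * rng.normal(size=x.shape)
        x = x + drift + noise
    return x

rng = np.random.default_rng(3)
x_init = rng.normal(-1.0, 0.3, size=5000)  # walkers start left of the food source
x_final = simulate_chemotaxis(x_init, n_steps=400, D=0.05, chi=2.0, dt=0.01,
                              center=1.0, width=1.0, rng=rng)
print(x_init.mean(), x_final.mean())
```

With these placeholder values the walkers accumulate around the maximum of the chemical field. A faithful reproduction of the Keller-Segel setting would additionally evolve the chemical density in time.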
Perspective
Boundary conditions
Density estimation near boundaries is often problematic [1, 14], as they introduce discontinuities in the profile. Applying KDE in such situations leads to nonzero probabilities past the boundary. We show here that NFs are less prone to these artifacts. In figure 6 we compare binning, KDE and tNF for 1000 random walkers between two reflective boundaries at . We show the corresponding Jacobian and latent density in figure 6b. At the boundaries, the latent density approaches zero, which must be compensated by the Jacobian to obtain the nonzero density of the true profile. How well the network is able to do this determines the quality of the estimate at the boundary and might lead to artifacts. To improve the density estimate near the boundary, we propose to use a latent distribution with finite support, e.g., the Epanechnikov kernel [7]. However, this introduces a discontinuity in the cost function, leading to training issues.
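The training issue mentioned above can be made concrete with a small sketch of a finite-support latent density. It uses the 1D Epanechnikov density $\tfrac{3}{4}(1 - z^2)$ on $|z| \le 1$. The function names and sample values are illustrative assumptions.

```python
import numpy as np

def epanechnikov_pdf(z):
    # Finite-support latent density: 3/4 (1 - z^2) for |z| <= 1, zero elsewhere.
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z**2), 0.0)

def epanechnikov_logpdf(z):
    # The log-density is -inf outside the support.
    with np.errstate(divide="ignore"):
        return np.log(epanechnikov_pdf(z))

# The density integrates to one over its support (simple Riemann sum).
z_grid = np.linspace(-1.0, 1.0, 2001)
integral = np.sum(epanechnikov_pdf(z_grid)) * (z_grid[1] - z_grid[0])

# A single latent point mapped outside the support makes the loss infinite.
z_samples = np.array([-0.5, 0.2, 1.3])
nll = -np.sum(epanechnikov_logpdf(z_samples))
print(integral, nll)
```

Any particle that the current mapping places outside $[-1, 1]$ contributes an infinite term to the negative log-likelihood, and no gradient pulls it back inside. This is one way in which the discontinuity in the cost function shows up during training.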
Physics Informed Normalizing Flows
Physics Informed Neural Networks (PINNs) [19] have emerged as a powerful yet simple method to include physical constraints in neural networks. They have been applied to (i) solve PDEs [13], (ii) infer parameters of a known equation [18] and (iii) perform model discovery [2]. Here, we propose Physics Informed Normalizing Flows (PINFs) to directly fit continuous models to single particle data. In contrast to PINNs, PINFs do not require an estimate of the density before fitting and explicitly conserve energy, mass or probability densities. By including the fitting in the cost function, PINFs form an end-to-end differentiable model to fit continuous models to discrete data. We construct it by adding the continuous model, written here as $\partial_t p = \mathcal{F}(p, \partial_x p, \partial_x^2 p, \dots)$, to the log-likelihood, analogously to a PINN,
$$\mathcal{L} = -\sum_i \log p(x_i, t_i) + \lambda\,\Big\langle \big(\partial_t p - \mathcal{F}(p, \partial_x p, \partial_x^2 p, \dots)\big)^2 \Big\rangle. \tag{9}$$
Here, $\lambda$ is a constant that sets the relative strength of the fitting term and $\langle\cdot\rangle$ denotes the mean over the points at which the model is evaluated. The two terms in eq. 9 are of different origin (a likelihood term versus an MSE term) and hence are typically of different orders of magnitude. Consequently, training is more complex than for PINNs, but preliminary testing on random walkers confirmed that PINFs are indeed capable of inferring the parameters of the PDE directly from the positional data. Further research is however required to improve the performance of these PINFs.
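The structure of the combined cost in eq. 9 can be sketched for the diffusion equation as follows. The example is a rough illustration under stated assumptions and not the authors' training code. The density is an analytic Gaussian rather than a trained flow, the PDE residual is evaluated with finite differences on a regular grid, the fitting term is a mean squared residual, and the function names and values of the diffusion coefficient and weighting constant are hypothetical.

```python
import numpy as np

def gaussian_density(x, t, D, sigma0):
    # Spreading Gaussian, standing in for the density a trained flow would give.
    var = sigma0**2 + 2.0 * D * t
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def pde_residual(density, x, t, D):
    # Residual of the diffusion equation, dp/dt - D d2p/dx2, by finite differences.
    dx, dt = x[1] - x[0], t[1] - t[0]
    dp_dt = np.gradient(density, dt, axis=0)
    d2p_dx2 = np.gradient(np.gradient(density, dx, axis=1), dx, axis=1)
    return dp_dt - D * d2p_dx2

def pinf_loss(x_s, t_s, density_fn, x, t, D, lam):
    # Likelihood term on the samples plus weighted MSE of the PDE residual.
    nll = -np.sum(np.log(density_fn(x_s, t_s)))
    grid_density = density_fn(x[None, :], t[:, None])
    mse = np.mean(pde_residual(grid_density, x, t, D) ** 2)
    return nll + lam * mse

D_true, sigma0 = 0.5, 0.5  # hypothetical values
density_fn = lambda x, t: gaussian_density(x, t, D_true, sigma0)

rng = np.random.default_rng(4)
t_s = rng.uniform(0.5, 1.5, size=500)
x_s = rng.normal(0.0, np.sqrt(sigma0**2 + 2.0 * D_true * t_s))

x_grid = np.linspace(-6.0, 6.0, 241)
t_grid = np.linspace(0.5, 1.5, 101)
loss_true = pinf_loss(x_s, t_s, density_fn, x_grid, t_grid, D=D_true, lam=100.0)
loss_wrong = pinf_loss(x_s, t_s, density_fn, x_grid, t_grid, D=1.0, lam=100.0)
print(loss_true < loss_wrong)
```

For a fixed density, the cost is lower when the PDE coefficient matches the process that generated the data, which is the signal a joint optimisation over the flow and the PDE parameters would use. In an end-to-end setting the derivatives would come from automatic differentiation rather than finite differences.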
Discussion
In this paper we have introduced temporal Normalizing Flows (tNFs), an extension of normalizing flows to estimate a time-varying probability density. We demonstrate that tNFs can naturally accommodate different length scales in a problem and outperform binning and time-independent normalizing flows, even when the density is non-Gaussian and multimodal. tNFs use the full time series data to perform density estimation, rather than inferring the density one frame at a time. This exploits the temporal correlations in the data, which improves the performance of the neural network used to model the mapping. The use of an unconstrained monotonic neural network opens up the possibility of applying techniques such as batching and batch normalization, or even completely different architectures, e.g. RNNs.
We provide two perspectives, building on this work: (i) density estimation on a finite domain and (ii) discovering and fitting a PDE to the data. (i) Boundaries typically lead to discontinuous density profiles. In this situation, tNFs can provide a more accurate estimate of the true profile, compared to e.g. KDE. While the discontinuous density profile cannot be strictly modeled using a Gaussian latent distribution, we speculate that using a distribution with finite support could capture such discontinuities. (ii) Typically a continuous PDE can be derived for a time-dependent distribution. tNFs can be used to fit the corresponding PDE directly to positional data by simultaneously making an estimate of the density and fitting a PDE to the data. Beyond inferring parameters, we speculate that PINFs can also be used as PDE solvers (similar to [13]) where energy or mass conservation is required. Our initial results with these physics informed normalizing flows are encouraging, but much work remains to be done, especially in optimizing the training scheme.
Our work fits in the wider context of temporal reasoning in machine learning. When applying generative modeling to a time series of images, for example, the temporal axis must also be treated differently. Approaches based on modeling the latent time as a Gaussian process [3] or as a Linear Gaussian State Space Model [8] have recently been brought forward. We propose that temporal normalizing flows could be used for similar time-dependent applications.
Access to the temporal dynamics of a process has both theoretical and practical benefits. Analysis and modeling of experimental data is often limited to equilibrium processes [16], restricting the potential of the data at hand. Being able to study the temporal dynamics of such systems in terms of the underlying probability distribution or PDE opens up many opportunities in out-of-equilibrium science. We thus believe that tNFs can greatly aid the study of out-of-equilibrium processes.
Appendix I
In this appendix we study the effect of overfitting on the density estimation by comparing the error with the log-likelihood as a function of the training epoch in figure 7(a) and (b). Here we performed a density estimate for 500 Brownian walkers with parameters identical to those selected in the main text. We delineate the minimum error with a black dashed line; note that this occurs after roughly 7000 epochs and that the error with respect to the analytical solution increases upon training further. The negative log-likelihood keeps decreasing however, corresponding to overfitting the solution. We found empirically that the minimum error occurs roughly at the elbow of the cost function, which for all cases considered is roughly at epochs so all the NFs and tNFs have been trained for 10000 epochs.
We show that tNFs are less prone to overfitting than NFs by comparing the log-likelihood and error for a single representative frame in figure 8 for both approaches. The log-likelihood of the time-independent NF keeps decreasing, leading to overfitting and an increased error. On the other hand, the tNF likelihood saturates and no longer decreases significantly after 10000 epochs, and neither does the error.
References
[1] (2010-10) Kernel density estimation via diffusion. The Annals of Statistics 38 (5), pp. 2916–2957. arXiv: 1011.2602. ISSN 0090-5364.
[2] (2019-04) DeepMoD: Deep learning for Model Discovery in noisy data. arXiv:1904.09406 [physics, q-bio, stat].
[3] Gaussian Process Prior Variational Autoencoders. pp. 12.
[4] (2017-09) Continuous-Time Flows for Efficient Inference and Density Estimation. arXiv:1709.01179 [stat].
[5] (2018-06) Neural Ordinary Differential Equations. arXiv:1806.07366 [cs, stat].
[6] (2016-05) Density estimation using Real NVP. arXiv:1605.08803 [cs, stat].
[7] (1969-01) Non-Parametric Estimation of a Multivariate Probability Density. Theory of Probability & Its Applications 14 (1), pp. 153–158. ISSN 0040-585X, 1095-7219.
[8] A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning. pp. 10.
[9] (2018-10) FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. arXiv:1810.01367 [cs, stat].
[10] (2018-04) Neural Autoregressive Flows. arXiv:1804.00779 [cs, stat].
[11] (2018-07) Glow: Generative Flow with Invertible 1x1 Convolutions. arXiv:1807.03039 [cs, stat].
[12] (2011-08) First Steps in Random Walks: From Tools to Applications. Oxford University Press. ISBN 9780199234868.
[13] (2019-07) DeepXDE: A deep learning library for solving differential equations. arXiv:1907.04502 [physics, stat].
[14] Nonparametric Kernel Density Estimation Near the Boundary. pp. 36.
[15] (2015-12) A review of progress in single particle tracking: from methods to biophysical insights. Reports on Progress in Physics 78 (12), pp. 124601. ISSN 0034-4885, 1361-6633.
[16] (2019-09) Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365 (6457), pp. eaaw1147. ISSN 0036-8075, 1095-9203.
[17] (2017-05) Masked Autoregressive Flow for Density Estimation. arXiv:1705.07057 [cs, stat].
[18] (2017-04) Inferring solutions of differential equations using noisy multi-fidelity data. Journal of Computational Physics 335, pp. 736–746. arXiv: 1607.04805. ISSN 0021-9991.
[19] (2017-11) Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. arXiv:1711.10561 [cs, math, stat].
[20] (2015-05) Variational Inference with Normalizing Flows. arXiv:1505.05770 [cs, stat].
[21] Bandwidth Selection in Kernel Density Estimation: A Review. pp. 33.
[22] (2019-08) Unconstrained Monotonic Neural Networks. arXiv:1908.05164 [cs, stat].
[23] (2012-12) A Review of Kernel Density Estimation with Applications to Econometrics. arXiv:1212.2812 [stat].