Inferring Interaction Rules from Observations of Evolutive Systems I: The Variational Approach
Abstract
In this paper we are concerned with the learnability of nonlocal interaction kernels for first order systems modeling certain social interactions, from observations of realizations of their dynamics. This paper is the first of a series on learnability of nonlocal interaction kernels and presents a variational approach to the problem. In particular, we assume here that the kernel to be learned is bounded and locally Lipschitz continuous and that the initial conditions of the systems are drawn identically and independently at random according to a given initial probability distribution. Then the minimization over a rather arbitrary sequence of (finite dimensional) subspaces of a least square functional measuring the discrepancy from observed trajectories produces uniform approximations to the kernel on compact sets. The convergence result is obtained by combining meanfield limits, transport methods, and a convergence argument. A crucial condition for the learnability is a certain coercivity property of the least square functional, majoring an norm discrepancy to the kernel with respect to a probability measure, depending on the given initial probability distribution by suitable push forwards and transport maps. We illustrate the convergence result by means of several numerical experiments.
Keywords: nonlocal interaction kernel learning, first order nonlocal interaction equations, meanfield equations, convergence
Contents

1 Introduction
 1.1 General abstract framework
 1.2 Example of gradient flow of nonlocally interacting particles
 1.3 Parametric energies and their identifications
 1.4 The optimal control approach and its drawbacks
 1.5 A variational approach towards learning parameter functions in nonlocal energies
 1.6 Numerical implementation of the variational approach
 2 Preliminaries
 3 The learning problem for the kernel function
 4 convergence of to
 5 Numerical experiments
 6 Appendix
 References
1 Introduction
What are the instinctive individual reactions which make a group of animals forming coordinated movements, for instance a flock of migrating birds or a school of fish? Which biological interactions between cells produce the formation of complex structures, like tissues and organs? What are the mechanisms which induce certain significant changes in a large amount of players in the financial market? In this paper we are concerned with the “mathematization” of the problem of learning or inferring interaction rules from observations of evolutions. The framework we consider is the one of evolutions driven by gradient descents. The study of gradient flow evolutions to minimize certain energetic landscapes has been the subject of intensive research in the past years [AGS]. Some of the most recent models are aiming at describing timedependent phenomena also in biology or even in social dynamics, borrowing a leaf from more established and classical models in physics. For instance, starting with the seminal papers of Vicsek et. al. [VCBCS95] and CuckerSmale [CucSma07], there has been a flood of models describing consensus or opinion formation, modeling the exchange of information as longrange social interactions (forces) between active agents (particles). However, for the analysis, but even more crucially for the reliable and realistic numerical simulation of such phenomena, one presupposes a complete understanding and determination of the governing energies. Unfortunately, except for physical situations where the calibration of the model can be done by measuring the governing forces rather precisely, for some relevant macroscopical models in physics and most of the models in biology and social sciences the governing energies are far from being precisely determined. In fact, very often in these studies the governing energies are just predetermined to be able to reproduce, at least approximately or qualitatively, some of the macroscopical effects of the observed dynamics, such as the formation of certain patterns, but there has been relatively little effort in the applied mathematics literature towards matching data from reallife cases.
In this paper we aim at bridging in the specific setting of first order models, the welldeveloped theory of dynamical systems and meanfield equations with classical approaches of approximation theory, nonlinear time series analysis, and machine learning. We provide a mathematical framework for the reliable identification of the governing forces from data obtained by direct observations of corresponding timedependent evolutions. This is a new kind of inverse problem, beyond more traditionally considered ones, as the forward map is a strongly nonlinear evolution, highly dependent on the probability measure generating the initial conditions. As we aim at a precise quantitative analysis, and to be very concrete, we will attack the learning of the governing laws of evolution for specific models in social dynamics governed by nonlocal interactions. The models considered in the scope of this paper are deterministic, however we intend in follow up work to extend our results towards stochastic dynamical systems.
1.1 General abstract framework
Many timedependent phenomena in physics, biology, and social sciences can be modeled by a function , where represents the space of states of the physical, biological or social system, which evolves from an initial configuration towards a more convenient state or a new equilibrium. The space can be a conveniently chosen Banach space or just a metric space; let be the metric on . This implicitly assumes that evolves driven by a minimization process of a potential energy . In this preliminary introduction we consciously avoid specific assumptions on , as we wish to keep a rather general view. We restrict the presentation to particular cases below.
Inspired by physics, for which conservative forces are the derivatives of the potential energies, one can describe the evolution as satisfying a gradient flow inclusion of the type
(1) 
where is some notion of differential of with respect to , which might already take into consideration additional constraints which are binding the states to certain sets.
1.2 Example of gradient flow of nonlocally interacting particles
Let us introduce an example of the general framework described above. It is actually the main focus of this paper. Assume that and that
where is a suitable nonlinear interaction kernel function, which, for simplicity we assume to be smooth (see below more precise conditions), and is the Euclidean norm in . Then, the formal unconstrained gradient flow (1) associated to this energy is written coordinatewise as
(2) 
Under suitable assumptions of local Lipschitz continuity and boundedness of the interaction function
(3) 
this evolution is wellposed for any given and it is expected to converge for to configurations of the points whose mutual distances are close to local minimizers of the function , representing steady states of the evolution as well as critical points of .
It is also wellknown, see [AGS] and Proposition 2.2 below, that for a meanfield approximation holds: if the initial conditions are i.i.d. according to a compactly supported probability measure for , the empirical measure weakly converges for to the probability measurevalued trajectory satisfying the equation
(4) 
in weak sense, where , for . In fact the differential equation (4) corresponds again to a gradient flow of the “energy”
on the metric space endowed with the socalled Wasserstein distance. Continuity equations of the type (4) with nonlocal interaction kernels are currently the subject of intensive research towards the modeling of the biological and social behavior of microorganisms, animals, humans, etc. We refer to the articles [13CarrilloChoiHaurayMFL, cafotove10] for recent overviews on this subject. Despite the tremendous theoretical success of such research direction in terms of mathematical results on wellposedness and asymptotic behavior of solutions, as we shall stress below in more detail, one of the issues which is so far scarcely addressed in the study of models of the type (2) or (4) is their actual applicability. Most of the results are addressing a purely qualitative analysis given certain smoothness and asymptotic properties of the kernels or at the origin or at infinity, in terms of wellposedness or in terms of asymptotic behavior of the solution for . Certainly such results are of great importance, as such interaction functions, if ever they can really describe social dynamics, are likely to differ significantly from wellknown models from physics and it is reasonable and legitimate to consider a large variety of classes of such functions. However, a solid mathematical framework which establishes the conditions of “learnability” of the interaction kernels from observations of the dynamics is currently not available and it will be the main subject of this paper.
1.3 Parametric energies and their identifications
Let us now return to consider again an abstract energy and let us assume that it is dependent on a parameter function , as indicated in the superscript. As in the example mentioned above, may be defining a nonlocal interaction kernel as in (3). The parameter function not only determines the abstract energy, but also the corresponding evolutions driven according to (1), for fixed initial conditions . (Here we assume that the class of is such that the evolutions exist and they are essentially wellposed; we explicitly stress again the dependency on with a superscript , which below we may remove as soon as such dependency is clear from the context.) The fundamental question to be here addressed is: can we recover with high accuracy given some observations of the realized evolutions? This question is prone to several specifications, for instance, we may want to assume that the initial conditions are generated according to a certain probability distribution or they are chosen deterministically ad hoc to determine at best , that the observations are complete or incomplete, etc. As one quickly realizes, this is a very broad field to explore with many possible developments. Surprisingly, there are no results in this direction at this level of generality, and relatively little is done in the specific directions we mentioned in the example above. We refer, for instance, to [parisi08, parisi081, parisi082, Hildenbrandt01112010, mann11, heoemascszwa11] and references therein, for groundbreaking statistical studies on the inference of social rules in collective behavior.
1.4 The optimal control approach and its drawbacks
Let us introduce an approach, which perhaps would be naturally considered at a first instance, and focus for a moment on the gradient flow model (1). Given a certain gradient flow evolution depending on the unknown parameter function , one might decide to design the recovery of as an optimal control problem [brpi07]: for instance, we may seek a parameter function which minimizes
(5) 
being the solution of gradient flow (1) for , i.e.,
(6) 
and is a suitable regularization functional, which restricts the possible minimizers of (5) to a specific class. The first fundamental problem one immediately encounters with this formulation is the strongly nonlinear dependency of on , which results in a strong nonconvexity of the functional (5). This also implies that a direct minimization of (5) would risk to lead to suboptimal solutions, and even the computation of a first order optimality condition in terms of Pontryagin’s minimum principle would not characterize uniquely the minimal solutions. Besides these fundamental hurdles, the numerical implementation of either strategy (direct optimization or solution of the first order optimality conditions) is expected to be computationally unfeasible to reasonable degree of accuracy as soon as the number of particles is significantly large (the wellknown term curse of dimensionality conied by Richard E. Bellman for optimal control problems).
1.5 A variational approach towards learning parameter functions in nonlocal energies
Let us now consider again the more specific framework of the example in Section 1.2. We restrict our attention to interaction kernels belonging to the following set of admissible kernels
In particular every is weakly differentiable, and its local Lipschitz constant is finite for every compact set . Our goal is to learn the unknown interaction function from the observation of the dynamics of the empirical measure , defined by , where are driven by the interaction kernel according to the equations
(7) 
Instead of the nonconvex optimal control problem above, we propose an alternative, direct approach which is both computationally very efficient and guarantees accurate approximations under reasonable assumptions. In particular, we consider as an estimator of the kernel a minimizer of the following discrete error functional
(8) 
among all competitor functions . Actually, the minimization of has a close connection to the optimal control problem, as it also promotes the minimization of the discrepancy in (5) (here we remind that in this setting is the Euclidean space ):
Proposition 1.1.
If then there exist a constant depending on , and such that
(9) 
for all , where and are the solutions of (7) for the interaction kernels and respectively. (Here , for .)
Therefore, if makes small, the trajectories of system (7) with interaction kernel instead of are as well a good approximation of the trajectories at finite time.
The proof of this statement follows by Jensen’s inequality and an application of Gronwall’s lemma, as reported in detail in Section 3.
For simplicity of notations, we may choose to ignore below the dependence on of the trajectories, and write when such a dependence is clear from the context. Additionally, whenever we consider the limit , we may denote the dependency of the trajectory on the number of particles by setting .
Contrary to the optimal control approach, the functional is convex and can be easily computed from witnessed trajectories and . We may even consider discretetime approximations of the time derivatives (e.g., by finite differences) and we shall assume that the data of the problem is the full set of observations for , for a prescribed finite time horizon . Furthermore, being a simple quadratic functional, its minimizers can be efficiently numerically approximated on a finite element space: given a finite dimensional space , we let
(10) 
The fundamental mathematical question addressed in this paper is

For which choice of the approximating spaces (we assume here that is a countable family of approximating subspaces of ) does for and and in which topology should the convergence hold?
We show now how we address this issue in detail by a variational approach, seeking a limit functional, for which techniques of convergence [MR1201152], whose general aim is establishing the convergence of minimizers for a sequence of equicoercive functionals to minimizers of a target functional, may provide a clear characterization of the limits for the sequence of minimizers . Recalling again that , for , we rewrite the functional (8) as follows:
(11) 
for . This formulation of the functional makes it easy to recognize that the candidate for a limit is then
(12) 
where is a weak solution to the meanfield equation (2), as soon as the initial conditions are identically and independently distributed according to a compactly supported probability measure .
Although all of this is very natural, several issues need to be addressed at this point. The first one is to establish the space where a result of convergence may hold and the identification of can take place. As the trajectories do not explore the whole space in finite time, we expect that such a space may not be independent of the initial probability measure , as we clarify immediately. By Jensen inequality we have
(13)  
(14) 
where is the pushforward of by the Euclidean distance map defined by . In other words, is defined for every Borel set as . The mapping is lower semicontinuous for every open set , and it is upper semicontinuous for any compact set (see Lemma 3.1). We may therefore define a timeaveraged probability measure on the Borel algebra of by averaging over : for any open set we define
(15) 
and extend this set function to a probability measure on all Borel sets. Finally we define
(16) 
for any Borel set , to take into account the polynomial weight as appearing in (14). Then one can reformulate (14) in a very compact form as follows
(17) 
Notice that is defined through which depends on the initial probability measure .
To establish coercivity of the learning problem it is essential to assume that there exists such that also the following additional lower bound holds
(18) 
for all relevant . This crucial assumption eventually determines also the natural space for the solutions,
which therefore depends on the choice of the initial conditions . In particular the constant might not be nondegenerate for all the choices of
and one has to pick the initial distribution so that (18) can hold for .
In Section 3.2 we show that for some specific choices of and rather general choices of one can construct probability measurevalued trajectories which allow to validate
(18).
In order to ensure compactness of the sequence of minimizers of , we shall need to restrict the sets of possible solutions to classes of the type
where is some predetermined constant and is a suitable compact set.
We now introduce the key property that a family of approximation spaces must possess in order to ensure that the minimizers of the functionals over converge to minimizers of .
Definition 1.2.
Let and interval in be given. We say that a family of closed subsets , has the uniform approximation property in if for all there exists a sequence converging uniformly to on and such that for every .
We are ready to state the main result of the paper:
Theorem 1.3.
Assume , fix and let be an interval in with as in Proposition 2.2. Set
For every , let be i.i. distributed and define as in (11) for the solution of the equation (4) with initial datum
For , let be a sequence of subsets with the uniform approximation property as in Definition 1.2 and consider
Then the sequence has a subsequence converging uniformly on to some continuous function such that
.
If we additionally assume the coercivity condition (18), then in . Moreover, in this latter case, if there exist rates , constants , and a sequence of elements such that
(19) 
and
(20) 
then there exists a constant such that
(21) 
for all . In particular, in this case, it is the entire sequence (and not only subsequences) to converge to in .
We remark that the used in our results is useful when has positive density on large intervals of . Notice that the main result, under the validity of the coercivity condition, not only ensures the identification of on the support of , but it also provides a prescribed rate of convergence. For functions in and for finite element spaces of continuous piecewise linear functions constructed on regular meshes of size a simple sequence realizing (19) with and is the piecewise linear approximation to which interpolates on the mesh nodes. For the approximation estimate (20) there are plenty of results concerning such rates and we refer to [descsc13] and references therein. Roughly speaking, for the empirical measure obtained by sampling times independently from , the bound (20) holds with high probability for a certain for of order (more precisely see [descsc13, Theorem 1]), which is a manifestation of the aforementioned curse of dimensionality. While it is in general relatively easy to increase as the smoothness increases, and doing so independently of , since is a function of one variable only, obtaining is in general not possible unless has very special properties, see [fohavy11, Section 4.4 and Section 4.5].
1.6 Numerical implementation of the variational approach
The strength of the result from the variational approach followed in Section 1.5 is the total arbitrariness of the sequence except for the assumed uniform approximation property and that the result holds  deterministically  with respect to the uniform convergence, which is quite strong. However, the condition that the spaces are to be picked as subsets of requires the prior knowledge of . Hence, the finite dimensional optimization (10) is not anymore a simple unconstrained least squares (as implicitly claimed in the paragraph before (10)), but a problem constrained by a uniform bound on both the solution and its gradient. Nevertheless, as we clarify in Section 5, for fixed and choosing made of piecewise linear continuous functions, imposing the uniform bounds in the least square problem does not constitute a severe difficulty. Also the tuning of the parameter turns out to be rather simple. In fact, for fixed the minimizers have the property that the map
is monotonically decreasing as a function of the constraint parameter and it becomes constant for , for empirically not depending on . We claim that this special value is indeed the “right” parameter for the bound. For such a choice, we show also numerically that, as expected, if we let grow, the minimizers approximates better and better the unknown potential .
Despite the fact that both the tuning of and the constrained minimization over requiring bounds are not severe issues, it would be way more efficient to perform a unconstrained least squares over . In our followup paper [bofohamaXX] we extend the approach developed by Binev et al. in [MR2249856, MR2327596] towards universal algorithms for learning regression functions from independent samples drawn according to an unknown probability distribution. This extension presents several challenges including the lack of independence of the samples collected in our framework and the nonlocality of the scalar products of the corresponding least squares. This has a price to pay, i.e., that the spaces need to be carefully chosen and the result of convergence holds only with high probability. For the development of the latter results, we need to address several variational and measure theoretical properties of the model which are considered in details in this first paper as reported below.
2 Preliminaries
2.1 Optimal transport and Wasserstein distances
The space is the set of probability measures on , while the space is the subset of whose elements have finite th moment, i.e., . We denote by the subset of which consists of all probability measures with compact support. For any and a Borel function , we denote by the pushforward of through , defined by
In particular, if one considers the projection operators and defined on the product space , for every we call first (resp., second) marginal of the probability measure (respectively, ). Given and , we denote with the family of couplings between and , i.e. the subset of all probability measures in with first marginal and second marginal .
On the set we shall consider the following distance, called the Wasserstein or MongeKantorovichRubinstein distance,
(22) 
If , we have the following equivalent characterization of the Wasserstein distance:
(23) 
where stands for the Lipschitz constant of on . We denote by the set of optimal couplings for which the minimum is attained, i.e.,
It is wellknown that is nonempty for every , hence the infimum in (22) is actually a minimum. For more details, see e.g. [AGS, villani].
For any and , the notation stands for the convolution of and :
This function is continuous and finitevalued whenever is continuous and sublinear, i.e., there exists a constant such that for all .
2.2 The meanfield limit equation and existence of solutions
As already stated in the introduction, our learning approach is based on the following underlying finite time horizon initial value problem: given and , consider a probability measurevalued trajectory satisfying
(24) 
We consequently give our notion of solution for (24).
Definition 2.1.
The equation (24) is closely related to the family of ODEs, indexed by ,
(25) 
which may be rewritten as
(26) 
for , by means of the empirical measure defined as
(27) 
As already explained in the introduction, we shall restrict our attention to interaction kernels belonging to the following set of admissible kernels
The wellposedness of (26) is rather standard under the assumption . The wellposedness of system (24) and several crucial properties enjoyed by its solutions may also be proved as soon as . We refer the reader to [AGS] for results on existence and uniqueness of solutions for (24), and to [13CarrilloChoiHaurayMFL] for generalizations in case of interaction kernels not necessarily belonging to the class . Nevertheless, in the following we recall the main results, whose proofs are collected in the Appendices in order to keep this work selfcontained and to allow explicit reference to constants.
Proposition 2.2.
Let be given. Let be a sequence of empirical measures of the form
satisfying . For every , denote with the curve given by (27) where is the unique solution of system (25).
Then, there exists depending only on , and such that the sequence converges, up to extraction of subsequences, in equipped with the Wasserstein metric to a solution of (24) with initial datum satisfying
2.3 The transport map and uniqueness of meanfield solutions
Another way for building a solution of equation (24) is by means of the socalled transport map, i.e., the function describing the evolution in time of the initial measure . The transport map can be constructed by considering the following singleagent version of system (26),
(28) 
where is a mapping from to and . Here is a continuous map with respect to the Wasserstein distance satisfying and , for a given .
According to Proposition 6.10, if is any solution of (24), we can consider the family of flow maps , indexed by and the mapping , defined by
where is the unique solution of (28) with initial datum . The by now wellknown result [CanCarRos10, Theorem 3.10] shows that the solution of (24) with initial value is also the unique fixedpoint of the pushfoward map
(29) 
A relevant, basic property of the transport map is proved in the following
Proposition 2.3.
is a locally biLipschitz map, i.e. it is a locally Lipschitz map, with locally Lipschitz inverse.
Proof.
Let be sufficiently large such that . The choice in Proposition 6.3, Lemma 6.5, and Lemma 6.6 imply the following stability estimate
(30) 
i.e., is locally Lipschitz.
In view of the uniqueness of the solutions to the ODE (28), it is also clear that, for any , the inverse of is given by the transport map associated to the backwardintime ODE
However, this problem in turn can be cast into the form of an usual IVP simply by considering the reverse trajectory . Then solves
The corresponding stability estimate for this problem then yields that the inverse of exists and is locally Lipschitz (with the same local Lipschitz constant as ). ∎
It is also known, see, e.g., [CanCarRos10], that one has also uniqueness and continuous dependence on the initial data for (24) (we report a proof of these properties in Appendix 6.3 for completeness):
Theorem 2.4.
Fix and let and be two equicompactly supported solutions of (24), for and respectively. Let be such that for every
(31) 
Then, there exist a positive constant depending only on , , and such that
(32) 
for every . In particular, equicompactly supported solutions of (24) are uniquely determined by the initial datum.
3 The learning problem for the kernel function
As already explained in the introduction, our goal is to learn from observation of the dynamics of corresponding to system (25) with as interaction kernel, as initial datum, and as finite time horizon.
We pick among those functions in which would give rise to a dynamics close to : roughly speaking we choose as a minimizer of the following discrete error functional
(33) 
Let us remind that, by Proposition 1.1, this optimization guarantees also that any minimizer produces very good trajectory approximations to the “true” ones at least at finite time .