Analysis of a micro-macro acceleration method with minimum relative entropy moment matching
Abstract.
We analyse convergence of a micro-macro acceleration method for the Monte Carlo simulation of stochastic differential equations with timescale separation between the (fast) evolution of individual trajectories and the (slow) evolution of the macroscopic function of interest. We consider a class of methods, presented in [DebSamZie2017], that performs short bursts of path simulations, combined with the extrapolation of a few macroscopic state variables forward in time. After extrapolation, a new microscopic state is then constructed, consistent with the extrapolated variable and minimising the perturbation caused by the extrapolation. In the present paper, we study a specific method in which this perturbation is minimised in a relative entropy sense. We discuss why relative entropy is a useful metric, both from a theoretical and practical point of view, and rigorously study local errors and numerical stability of the resulting method as a function of the extrapolation time step and the number of macroscopic state variables. Using these results, we discuss convergence to the full microscopic dynamics, in the limit when the extrapolation time step tends to zero and the number of macroscopic state variables tends to infinity.
Key words and phrases:
micro-macro simulations, entropy optimisation, stiff stochastic differential equations, Kullback-Leibler divergence, weak convergence
2010 Mathematics Subject Classification:
Primary, 65C30, 60H35, 94A17; Secondary, 62E17, 65J22
1. Introduction
The considerations and results presented in this manuscript originate from the need to efficiently simulate the following expectations
(1.1) 
for times , where is a given diffusion process and is a function of interest. In the present work, we focus on issues concerning the temporal discretisation of the underlying evolution of the random variable , with time step , for a large final time . The full simulation also requires the consistent approximation of the expectations in (1.1); this is usually achieved by Monte Carlo methods [Caflisch1998, Higham2001].
From the computational perspective, we are interested in stiff systems, with a separation between the (fast) timescale on which the individual trajectories of need to be simulated and the (slow) timescale on which the expectations (1.1) evolve. This feature leads to a stability constraint on the time discretisation methods that forces us to take very small steps , compared to the desired time horizon for (1.1). The discrepancy between the minuscule steps we must take and the large times we want to reach quickly makes the cost of Monte Carlo simulation prohibitive. This problem led to the development of various general multiscale algorithmic approaches, such as heterogeneous multiscale [EEng2003, EEnqLiRenVan2007] or equation-free [KevGeaHymKevRunThe2003, KevSam2009] methods, which try to overcome the scale separation, or even use it to one's advantage.
As part of this study, we analyse the accuracy of a micro-macro acceleration method for efficiently simulating the observables (1.1). The algorithm exploits the timescale separation by operating with two time steps: a microscopic one , suited for the underlying stochastic process, and a macroscopic one , which we believe to be natural for the evolution of the expectations. To describe the coarse (macroscopic) behaviour of the process, we reduce the diffusion to a finite number of macroscopic state variables, given as
(1.2) 
for some appropriately chosen functions (cf. [KevGeaHymKevRunThe2003]). These variables store partial (statistical) information about the distributions of the stochastic process. Due to scale separation, we can expect the variables to evolve under the influence of a vector field with natural time scale . Although we do not know this vector field in general, we can (and will) approximate it by directly estimating the time derivatives of every , to move the simulation forward in time by . One time step of the micro-macro acceleration method consists of: (i) microscopic simulation of for a small batch of time steps of size ; (ii) restriction, i.e., extraction of an estimate of the macroscopic time derivative, based on the simulation in the first stage; (iii) forward-in-time extrapolation of the macroscopic state; and (iv) matching of the last microscopic state from (i) with the extrapolated macroscopic state. We provide a more detailed description in Section 2.
The most challenging stage is the matching. It amounts to an inference procedure that picks a distribution with the prescribed (extrapolated) macroscopic state – a particular point in the dimensional space of macroscopic state variables. This is an ill-posed problem: there may be no solution, or the solution may not be unique. Both cases may depend sensitively on the prescribed macroscopic state that one wants to match.
Our strategy is to use a prior distribution , which comes from the last available microscopic state in the current step, and alter it so that it becomes consistent with the extrapolated macroscopic state. Specifically, if are the extrapolated macroscopic states, we obtain the matched distribution from the prior as the solution to the following optimisation problem
(1.3) 
where
and we minimise over all probability distributions absolutely continuous with respect to . The objective function in (1.3) is the relative entropy of with respect to , also known as the Kullback-Leibler or information divergence in the information theory literature [KulLei1951, Kullback1978].
The analysis and intuition behind problem (1.3) rely on a geometric interpretation that views matching as a projection operator in the space of distributions, endowed with the topology generated by the relative entropy [Csiszar1975, PavFer2013]. This is not a metric topology [Harremoes2007]. Nevertheless, due to Pinsker's inequality, by which the relative entropy dominates the square of the total variation norm, and various analogies with Euclidean geometry, can be regarded as a "squared distance" between two probability distributions. In particular, whenever is a solution to (1.3) and satisfies the constraints, a so-called Pythagorean identity holds:
(1.4) 
We can intuitively understand the foregoing property as follows: the matching is an "orthogonal projection" of onto the submanifold of probability densities that satisfy the constraints generated by the moments of .
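On a finite state space, both the matching and identity (1.4) can be checked numerically. The following Python sketch (a toy setting of our own; the helper names `kl` and `match_mean` are illustrative, not from the paper) tilts a prior exponentially until a prescribed mean is attained:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) of discrete distributions on a common
    finite support (p is assumed absolutely continuous w.r.t. q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def match_mean(prior, xs, target, iters=100):
    """Minimum relative entropy matching of a single mean constraint:
    exponentially tilt `prior` on the points `xs` until the mean equals
    `target`, via 1-d Newton on the Lagrange multiplier."""
    lam = 0.0
    for _ in range(iters):
        w = prior * np.exp(lam * xs)
        w /= w.sum()                  # current tilted distribution
        mean = w @ xs
        var = w @ xs**2 - mean**2     # derivative of the mean w.r.t. lam
        lam += (target - mean) / var
    return w
```

For any other distribution with the same mean, one can then verify that the relative entropy to the prior splits exactly as in (1.4): the divergence to the matched law plus the divergence of the matched law to the prior, up to the tolerance of the Newton iteration.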
Before moving on to the technical content of the paper, we conclude this introduction with two important points. First, Section 1.1 discusses the reasons behind the choice of relative entropy as the quantity to be minimised in (1.3). Second, Section 1.2 briefly sketches the main contributions of this work and the outline of the paper.
1.1. On the usefulness of relative entropy matching
No rigorous justification exists for why relative entropy is the proper choice for the matching procedure. The first description of micro-macro acceleration with matching in [DebSamZie2017] contained multiple examples of metrics that could be used in the optimisation procedure (1.3). Nevertheless, we identify below three reasons that motivate the choice of relative entropy: the first from a "physical" point of view, the second from a "numerical" point of view, and the third from a "theoretical" point of view (related to error control and adaptivity).
The physical point of view. The choice of relative entropy, specified in (1.3), is closely related to the maximum entropy principle [Jaynes1957, Jaynes1957a], which dictates that one should look for a distribution, consistent with available data, that maximises the entropy , see also [Jaynes1982]. This convention has been extensively used for constructing closures of moment systems to derive constitutive equations for kinetic equations [IlgKarOtt2002, HauLevTit2008, SamLelLeg2011, MalBruDub2015]. Moreover, in the context of data assimilation, procedure (1.3) serves as the risk-neutral approach for calibrating asset-pricing models [AveFriHolSam1997, Avellaneda1998] and as an optimal approximation of spectral densities [GeoLin2003, FerRamTic2011].
The numerical point of view. Relative entropy is also convenient numerically. The computational procedure for solving (1.3) is based on a dual formulation, see also [DebSamZie2017], which looks for the vector of Lagrange multipliers that solves
(1.5) 
where
is the partition function. As long as we can compute or estimate the integrals, (1.5) constitutes a finite-dimensional system of nonlinear equations, which can be solved by a Newton procedure. Moreover, the density of the distribution satisfying (1.3) reads
(1.6) 
There are two advantages to this representation of . First, because the exponential function is positive, is always equivalent to the prior distribution , that is, their supports are the same. Second, the exponential function serves as the likelihood ratio for importance sampling of [AsmGly2007, Ch. V.1]. Therefore, we can estimate the observables (1.1) with respect to by considering a number of replicas , distributed according to the prior , and computing weighted averages with weights . For more details on the numerical implementation, we refer to [DebSamZie2017].
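As a minimal sketch of this dual procedure on an equally weighted ensemble of prior replicas (all names are illustrative, and we assume the macroscopic states are plain moments of the samples), one can run a Newton iteration whose Jacobian is the covariance of the restriction functions under the current tilted weights:

```python
import numpy as np

def match_entropy(samples, R, target, max_newton=50, tol=1e-10):
    """Solve the dual problem (1.5) on an ensemble of prior replicas.

    samples : (N, d) replicas drawn from the prior distribution.
    R       : callable mapping the ensemble to an (N, m) array of
              macroscopic state functions (illustrative signature).
    target  : (m,) extrapolated macroscopic states to match.

    Returns the Lagrange multipliers and the normalised importance
    weights w_i proportional to exp(lam . R(x_i))."""
    Rx = R(samples)                              # (N, m)
    lam = np.zeros(Rx.shape[1])
    for _ in range(max_newton):
        w = np.exp(Rx @ lam)
        w /= w.sum()                             # matched weights
        mean = w @ Rx                            # moments under the tilt
        resid = mean - target
        if np.linalg.norm(resid) < tol:
            break
        # Newton Jacobian: covariance of R under the tilted weights
        cov = (Rx * w[:, None]).T @ Rx - np.outer(mean, mean)
        lam -= np.linalg.solve(cov, resid)
    return lam, w
```

The returned weights play the role of the likelihood ratio evaluated on the replicas, so weighted averages of any observable estimate expectations under the matched distribution.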
Error control and adaptivity. The properties of relative entropy also provide a convenient a posteriori error analysis that allows extending the set of macroscopic state variables with new ones that reduce the relative entropy in a greedy way. To illustrate this idea, assume that the macroscopic states are moments of an unknown target probability distribution , that is, for . Moreover, let be the matching of a prior distribution with , which we have already computed. We want an indication of the gain we can expect by adding a new macroscopic state variable, corresponding to a function , to the matching procedure (1.3).
Denote by the matching of the same prior with the extended system , where in accordance with our assumption. By construction, the set of constraints in (1.3) generated by is a subset of those generated by . Therefore, by the transitivity property of relative entropy matching [Csiszar1975, Thm. 2.3], we can alternatively obtain by matching with . This has two consequences. First, since we have already computed , which has the correct first macroscopic states, we can use it instead of in (1.5) to cheaply obtain the Lagrange multipliers for . Second, applying the Pythagorean identity (1.4) to , with in place of and as a prior, produces
from which we get
(1.7) 
The left-hand side of equality (1.7) indicates how much accuracy one can expect to gain by adding to the system of macroscopic state variables. The right-hand side reads
Note that does not depend on the target density and can be evaluated numerically as soon as we estimate the Lagrange multipliers by solving (1.5). Therefore, equality (1.7) makes it possible to develop an adaptive procedure that selects new macroscopic state variables which maximally reduce the relative entropy at the current time step of the micro-macro acceleration method.
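This gain formula can be illustrated numerically on a finite state space (a toy sketch with our own helper names, not the paper's implementation): matching a uniform prior first with the mean of a target law, and then with its mean and second moment, the decrease in relative entropy to the target equals the relative entropy between the two matched laws, which is computable without knowing the target.

```python
import numpy as np

def kl(p, q):
    """D(p || q) for discrete distributions on a common finite support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tilt(prior, Rx, target, iters=200):
    """Match `prior` to the moments `target` of the state functions whose
    values on the support are the columns of Rx (Newton iteration on the
    Lagrange multipliers; illustrative helper)."""
    lam = np.zeros(Rx.shape[1])
    for _ in range(iters):
        w = prior * np.exp(Rx @ lam)
        w /= w.sum()
        mean = w @ Rx
        cov = (Rx * w[:, None]).T @ Rx - np.outer(mean, mean)
        lam += np.linalg.solve(cov, target - mean)
    return w
```

In the test below, the gain is nonnegative, as guaranteed by the Pythagorean identity, and agrees with the divergence between the two matchings.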
1.2. Main contributions and outline
The above arguments give ample motivation to study micro-macro acceleration methods with relative entropy matching. The micro-macro acceleration method was introduced in [DebSamZie2017] using a more general, axiomatic definition of the matching operator, and a convergence result was presented there based on generic properties of all underlying components of the method. However, the assumptions in [DebSamZie2017] do not apply to the matching given by (1.3), and only numerical results indicating convergence were presented, for a non-trivial test case originating from the micro-macro simulation of dilute polymers. In this respect, the current paper expands the body of work initiated in [DebSamZie2017].
This paper investigates the numerical properties of the micro-macro acceleration method with relative entropy matching: (i) numerical stability, to establish bounds on the propagation of local errors; and (ii) the local errors produced by matching with a finite number of macroscopic state variables. We achieve this goal by demonstrating how the properties of minimum relative entropy regularisation can be combined with the features of the underlying evolution of , to provide a rigorous analysis of the micro-macro acceleration method. To establish convergence of the method to the underlying microscopic dynamics, we then combine the above results and consider the limit in which the extrapolation time step tends to zero and the number of macroscopic state variables tends to infinity.
The remainder of this manuscript is organised as follows. Section 2 gives a detailed account of the micro-macro acceleration method, keeping the exposition general enough so that it applies in a broader context than the one we study later. In Section 3, we start with the basic notions and assumptions on the underlying diffusion process . In Section 4, we rigorously define the matching operator corresponding to (1.3) and study its properties, such as dependence on the prior distribution. We introduce the remaining constructions and gather all assumptions needed to complete the proof of convergence in Section 5. Section 6 is devoted to the investigation of the relation between the evolution of the diffusion and the relative entropy. Finally, the last two sections present the convergence proof, which relies on two main ingredients: the numerical stability of the method (Section 7), which reduces the global errors to local ones, and the consistency of the local errors (Section 8), which implies convergence.
2. Micro-macro acceleration method
The micro-macro acceleration method aims to be faster than a full microscopic simulation, while converging to it when the extrapolation time step vanishes and the number of extrapolated macroscopic state variables goes to infinity. The underlying assumption for the method to be efficient is that the macroscopic state variables can be simulated on a much slower time scale than the microscopic dynamics, thus allowing a large extrapolation time step compared with the time step for the microscopic simulation.
The main building blocks of the method can be grouped into two categories: propagators, which move the simulation forward in time on the micro or macro time scale; and transition operators, which connect the two levels of description. The microscopic states are given by the random variables , and the macroscopic states are described by vectors in the Euclidean space , with the number of macroscopic state variables used to preserve information about distributions. We first detail the transition operators (Section 2.1), after which we discuss the propagators (Section 2.2). All components are then collected into a description of the micro-macro acceleration method in Section 2.3.
2.1. Transition operators
To transition from microscopic to macroscopic states, we consider the restriction operator . It is determined by the vector of functions , and for a random variable , we define
(2.1) 
This formula is consistent with (1.2), as when is the diffusion generating the observables in (1.1). Note also that the vector depends only on the law of a random variable , which we denote . Therefore, for the analysis it will turn out to be more convenient to consider as acting on the family of probability measures, see (4.1).
[On notation] As we mention in Section 1, we use the Euclidean space to store the statistical (coarse) information of the underlying distributions. To visually highlight the elements of and valued functions, we henceforward apply bold fonts for their symbols. We also use to denote the Euclidean norm in .
To proceed from the macroscopic state to the microscopic distributions, we face the inverse problem
(2.2) 
This is an ill-posed problem: there may be no solution, or the solution may not be unique, and both cases may depend sensitively on . Usually, when (2.2) has a solution, it is underdetermined in the sense that infinitely many consistent (laws of) random variables exist. As announced in the Introduction, we regularise (2.2) by considering a prior random variable , which is naturally available in the micro-macro acceleration method, and define the matching operator as
(2.3) 
To make sense of , we first consider the probability measure that solves (1.3), and next choose any random variable so that . There is no generic way to pick but, as long as we are concerned with the expectations and measure the weak error, the particular choice of is not important. {rem}[Matching ensembles] In practice, when performing Monte Carlo simulation, we always start with an ensemble of replicas sampled from . The formula (1.6) then provides a convenient way to sample with the weighted replicas , where the weights are and the Lagrange multipliers satisfy (1.5). For more on the practical implementation of the matching operator with finite ensembles, we refer to [DebSamZie2017].
2.2. Propagators
The first propagator, operating on the micro time scale, is a one-step time discretisation of the SDE
(2.4) 
which generates the diffusion process . It performs a full microscopic simulation on a time interval of length . The computational cost of the simulation is usually high, but the time we devote to it is very short, compared to the time scale on which the averages (1.1) evolve. In practice, we divide into steps of length , thus obtaining a time mesh , and use a stochastic numerical method for SDE (2.4). For example, we can employ an Euler-Maruyama step to propagate a given initial random variable as
(2.5) 
for .
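A minimal sketch of such a microscopic burst for an ensemble of replicas (the function name and the Ornstein-Uhlenbeck test drift used below are our own illustrations, not the paper's code):

```python
import numpy as np

def euler_maruyama(x, drift, sigma, dt, n_steps, rng):
    """Propagate an ensemble of replicas with the Euler-Maruyama scheme
    X_{k+1} = X_k + a(X_k) dt + b(X_k) sqrt(dt) xi_k,  xi_k ~ N(0, I),
    returning all intermediate states of the burst."""
    x = np.asarray(x, dtype=float)
    states = [x]
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x + drift(x) * dt + sigma(x) * np.sqrt(dt) * xi
        states.append(x)
    return states
```

For the Ornstein-Uhlenbeck drift −x with deterministic initial condition 1, the ensemble mean at time 1 is close to exp(−1), up to a discretisation bias of order dt and Monte Carlo error.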
The second propagator is extrapolation, which moves only the macroscopic variables forward in time over the macroscopic time step . In this manuscript, we consider first-order extrapolation of the macroscopic variables, called coarse forward Euler integration [GeaKevThe2002]. Assuming we have at our disposal two macroscopic variables separated by , which we obtain by averaging the microscopic states, the extrapolation proceeds as follows:
(2.6) 
Higher-order versions of (2.6), which require macroscopic states at additional time instances, can be constructed in several ways: using polynomial extrapolation [GeaKevThe2002]; implementing Adams-Bashforth or Runge-Kutta methods [RicGeaKev2004, LeeGea2007, LafLejSam2016]; or trading accuracy for stability by designing a multistep state extrapolation method [VanRoo2008].
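The first-order rule (2.6) amounts to a linear extrapolation from two recorded macroscopic states; a sketch (illustrative names, with `dt` the spacing of the recorded states and `Dt` the macroscopic step):

```python
import numpy as np

def coarse_forward_euler(m_prev, m_last, dt, Dt):
    """Extrapolate the macroscopic state over the (large) step Dt using
    the slope estimated from two states recorded dt apart:
    m(t + Dt) ~ m_last + (Dt - dt) * (m_last - m_prev) / dt."""
    m_prev = np.asarray(m_prev, dtype=float)
    m_last = np.asarray(m_last, dtype=float)
    return m_last + (Dt - dt) * (m_last - m_prev) / dt
```

The rule is exact for macroscopic states that evolve linearly in time, consistent with first-order accuracy.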
2.3. Micro-macro acceleration method
We now have all the ingredients to describe the complete method in Algorithm 2.3. We introduce two indices, and , to emphasise that there are two time steps involved: the microscopic time step , to evolve the full microscopic dynamics over ; and the macroscopic time step , to perform extrapolation of the macroscopic state variables up to the final time . {alg} Given a microscopic state at time , a number of macroscopic state variables, a macroscopic step size , a microscopic step size , and a number of microscopic steps, with , compute the microscopic state at time via a four-step procedure:

Simulate the microscopic system over with time steps of size using a microscopic discretization scheme, such as (2.5), to obtain a sequence of microscopic states
with .

Record the dimensional macroscopic states for .

Extrapolate the macroscopic states over a step of size , for instance using (2.6), to a new macroscopic state at time .

Match the microscopic state at time with the extrapolated macroscopic state
to obtain a new microscopic state at time .
By successive application of Algorithm 2.3, we obtain, after performing steps, the random variable that "approximates" the final value of the diffusion process. Because we are interested in estimating the averages given by (1.1), we measure the quality of by the weak error
We find sufficient conditions under which this error goes to zero as the time steps and go to zero and the number of macroscopic states used for extrapolation goes to infinity. The precise statement of the result we prove is the content of Section 5.
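Combining the four stages, one time step of the method can be sketched as follows (a toy Python illustration on a particle ensemble; the helper names, the Ornstein-Uhlenbeck test equation in the test, and the final multinomial resampling that returns an unweighted ensemble are our own choices, not the paper's implementation):

```python
import numpy as np

def micro_macro_step(x, drift, sigma, R, dt, K, Dt, rng):
    """One four-stage step: simulate, restrict, extrapolate, match.

    x : (N,) ensemble of replicas at the current time.
    R : callable mapping the ensemble to an (N, m) array of
        macroscopic state functions (illustrative signature)."""
    # (i) microscopic burst: K Euler-Maruyama steps of size dt
    states = [x]
    for _ in range(K):
        xi = rng.standard_normal(x.shape)
        x = x + drift(x) * dt + sigma(x) * np.sqrt(dt) * xi
        states.append(x)
    # (ii) restriction at times t and t + K*dt
    m0 = R(states[0]).mean(axis=0)
    m1 = R(states[-1]).mean(axis=0)
    # (iii) coarse forward Euler extrapolation over Dt
    target = m1 + (Dt - K * dt) * (m1 - m0) / (K * dt)
    # (iv) matching: reweight the last microscopic state so that its
    # moments equal the extrapolated ones (Newton on the multipliers)
    Rx = R(states[-1])
    lam = np.zeros(Rx.shape[1])
    for _ in range(50):
        w = np.exp(Rx @ lam)
        w /= w.sum()
        mean = w @ Rx
        if np.linalg.norm(mean - target) < 1e-10:
            break
        cov = (Rx * w[:, None]).T @ Rx - np.outer(mean, mean)
        lam -= np.linalg.solve(cov, mean - target)
    # return an unweighted ensemble by resampling the matched weights
    idx = rng.choice(len(w), size=len(w), p=w)
    return states[-1][idx], target
```

After the step, the moments of the returned ensemble agree with the extrapolated macroscopic state, up to resampling noise.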
3. Mathematical setting
Throughout the manuscript, we consider diffusion processes that live on a configuration space denoted by . To avoid technical complications that are unnecessary in view of the goals of the paper, we make the following standing assumption on the configuration space: {ass} The configuration space is either the Euclidean space , or the torus , with dimension . This assumption avoids, for instance, the issue of proper boundary conditions for diffusion processes on bounded subsets of . Nevertheless, Assumption 3 still covers two common settings for diffusions:

The whole space acts as an example of a non-compact configuration space;

The torus acts as a physically relevant compact case, resulting from periodic boundary conditions.
It will turn out that the proofs and derivations for a non-compact configuration space require additional assumptions, compared with the compact setting. We point out these assumptions where relevant.
[Basic notations] The Lebesgue measure on is denoted by , and for any two points , stands for the distance between them. On , this is the usual metric generated by the Euclidean norm ; on this distance is defined as , where, to make our notation more consistent, we do not distinguish between a representative and its equivalence class. If , is the transpose and, consequently, and are the scalar product and tensor product of two vectors and . Throughout the paper, a smooth function means a function, and we use , , for the partial derivative, gradient and Hessian, respectively. For vectorvalued functions, we write to denote the strong derivative (Jacobian matrix).
3.1. Spaces of measures and spaces of functions
In what follows, we denote by the set of all probability measures on defined on the field of Borel subsets of . The symbols and stand for the expectation and variance (covariance) with respect to . We also consider the Banach space of all bounded and signed Borel measures, of which is a convex subset. The norm on is the total variation (TV), and for it reads
see, e.g., [Bogachev2007]. For , this norm induces the total variation distance , which amounts to the norm of the difference between the densities
whenever are absolutely continuous (denoted ) with respect to a common measure , and , are the corresponding densities (Radon-Nikodym derivatives). We also write whenever the measures are singular (their supports are disjoint) and when they are equivalent (have the same sets of measure zero).
Besides the spaces in which probability measures live, we also need to characterise the space of functions we want to consider as macroscopic state variables. We denote by the space of bounded, Borel measurable functions on equipped with the sup-norm . The symbol stands for the pairing between a function and a signed measure , and stands for the measure having density with respect to . Note that, if , we have . We will also use two subspaces of : , of all continuous functions "vanishing at infinity" (that is, for all there is a compact such that for every ); and , of all bounded continuous functions on the configuration space . Recall that, if is compact, , and both consist of all continuous functions on . When we need higher regularity, we consider the Banach space , of all times differentiable functions with bounded derivatives, with norm , where is a multi-index, and in particular, see Section 5, its subspace of functions with vanishing derivatives. For a vector function with values in the space of macroscopic variables , such that for all , we denote (see Remark 2.1)
Finally, we need to describe in what sense we consider convergence of sequences of probability measures. In this paper, we are mainly concerned with the weak convergence of probability measures on . A sequence of probability measures on converges weakly to if holds for every . The usefulness of the weak topology on , induced by this convergence, stems from its metrizability (by the Prohorov metric) and the convenient characterisation of compactness [Dudley2002]: a weakly closed family of measures is weakly compact in if and only if it is (uniformly) tight, i.e., for every there is a compact subset such that for all . In particular, if is itself compact, is compact in the weak topology. In the non-compact case, a sufficient condition results from uniform control of the absolute first moment: {lem} Let be a family of probability measures for which there is a constant such that for all . Then is tight.
Proof.
Fix and consider a closed ball , where is large enough so that . From the Markov inequality we get
Throughout this manuscript, we work mainly on , but we introduce to utilise its elements as the "directions" for derivatives of mappings on . We say that a direction is admissible for , if there is an such that . (Note that this immediately implies that, for any admissible direction , we have .) {dfn} Let and . The mapping has a (one-sided) directional derivative in the direction , admissible for , if the limit
exists. We extend this definition, in an obvious way, when acts into a Banach space, like or . When depends on other variables, we use the symbol with an appropriate subscript on . We summarise a few useful properties of directional derivatives below. {lem} Let both and be continuous and have directional derivatives at in the direction , with . Then

; (chain rule)

; (product rule)

, (mean value inequality)
for some .
Moreover, if is linear and bounded, for any the directional derivative exists in every direction , and .
3.2. Diffusions and related concepts
In this Section, we state our working hypotheses and the necessary results from the theory of diffusion processes. We assume that the process satisfies on the configuration space the stochastic differential equation (SDE)
(3.1) 
where is an dimensional Wiener process, an initial value, and the functions , are given drift and diffusion fluxes. For , the th column of the matrix-valued function is denoted by . We also fix a time interval , with , on which we want to approximate the particular observable of (3.1) and use the notation whenever we consider the process up to time only.
We assume that the coefficients and are time-homogeneous, but the extension to the time-dependent case is straightforward. We impose two conditions on the coefficients: bounded differentiability, to guarantee the existence and smoothness of the laws of , and uniform ellipticity, which is the simplest assumption to ensure a "sufficient spreading" of the randomness: {ass} The functions and are smooth with all derivatives bounded, and there exists such that
for all and . We refer to [Baudoin2014, Stroock2008a] for all the results we present in the remainder of this section, which we include to make the manuscript self-contained. In the following, we will denote by generic constants that can depend on , , , and the bounds on the derivatives of and . Note that we use the same constants for all the presented estimates. This is legitimate, since we can always increase one or both of them to relax the bounds. In later sections, during computations, we also allow the value of both to change (increase) from line to line.
Assumption 3.2 guarantees that the process is the unique solution to SDE (3.1) for all , and that it admits a smooth transition probability density – the likelihood of finding at when starting from at time . Moreover, satisfies Aronson's estimates: there exists such that for all and
(3.2) 
As we detail in Appendix A, under an additional assumption on the initial law, the bounds in (3.2) result in Gaussian lower and upper estimates for the densities of the process , uniformly in (see Lemma A). These, in turn, provide us with a good control of the relative entropy between laws at different times, which we need for the analysis in Sections 6.1 and 8.1.
In the backward variable , the transition densities generate the diffusion semigroup given by
(3.3) 
for every Borel function with polynomial growth. In particular, for each , the mapping is a continuous linear contraction with respect to the norm, and . The semigroup leaves invariant and is strongly continuous when restricted to this subspace (that is, for every ). The (infinitesimal) generator of is defined by
(3.4) 
with the limit taken in the sup-norm, and the domain being the set of for which the limit exists. The space , of all twice differentiable functions with vanishing derivatives, is a core for , on which acts as the second order differential operator
(3.5) 
In the forward variable , the transition densities provide the fundamental solution to Kolmogorov's forward equation
(3.6) 
where is the adjoint of , with a subset of , the dual of . Accordingly, the laws of the process are propagated forward in time by the adjoint semigroup , defined via relation
(3.7) 
for all and , see also [Bobrowski2005, §8.1.15]. The family can be extended to a conservative semigroup on that leaves positive measures invariant.
3.3. Euler scheme
For the analysis of the microscopic step, we approximate , on a small time horizon , by the Euler scheme (2.5) on a time mesh with time steps . The approximate solution is a time-homogeneous Markov chain with step transition probability kernels [YuaMao2004], where , which, owing to Assumption 3.2, admit densities, denoted by , for any [LemMen2010].
Using these kernels, we can define the transition operator and its adjoint
(3.8) 
where is the characteristic function of a set and . For every probability measure , the two operators satisfy relation (3.7). Both and are, for each fixed , linear in and respectively.
In parallel with (3.2), we also have the following Gaussian estimates for the transition densities [LemMen2010]: there exists such that for all , and
(3.9) 
The generic constants are uniform with respect to the discretisation parameter . In later sections, we employ the following sharp estimate on the difference between the transition densities of the process and of the scheme (2.5), see [GobLab2008, Thm. 2.3]. {thm} If Assumption 3.2 holds, then for every , there are constants such that
(3.10) 
for every . We use this result in Section 8.1 to control the error in TV distance between the densities and the weak error between expectations, see also Appendix A.
4. Minimum relative entropy moment matching
In this Section, we will study the properties of relative entropy (see equation (1.3)) and the minimum relative entropy matching operator, which we denote by (see equation (2.3)). We provide a precise definition and characterisation of in Section 4.3, together with an investigation of the continuity and the differentiability in each coordinate in Section 4.4. In particular, we treat directional derivatives with respect to the prior measure, which constitute a crucial element in the study of the numerical stability of the micro-macro acceleration method in Section 7. Before that, we introduce the elements that we will use to obtain a convenient description of the matching procedure. In Section 4.1, we elaborate on the restriction operator and the moment space, to extract the macroscopic variables (1.2) and control the feasibility of the statistical constraints. In Section 4.2, we discuss exponential families, which turn out to be convenient for representing the density obtained through the matching.
4.1. Restriction operator and moment space
Fix and a vector of functions . To accelerate the simulation of SDE (3.1), we will use the statistical information contained in the vectors , where is the law of the solution at some time instance. We formalise this by introducing the restriction operator , generated by , which reads
(4.1) 
The restriction operator is continuous in the weak topology on , and it is linear when extended, in an obvious way, to . When , the law of a random variable , formula (4.1) is consistent with the restriction (2.1) that was introduced in the algorithmic context of Section 2.
The moment space corresponding to $D$ and $\mathbf{R}$ is a convex subset of $\mathbb{R}^{L}$ defined as
(4.2) $\mathbb{M}(D, \mathbf{R}) \doteq \big\{ \mathcal{R}\mu :\ \mu \in \mathcal{P}(D) \big\}.$
Whenever the configuration space $D$ and the vector of restriction functions $\mathbf{R}$ are fixed, we write $\mathbb{M} = \mathbb{M}(D, \mathbf{R})$. This set will serve to check the feasibility of the constraints for optimisation in (1.3). Obviously, when the vector of macroscopic states does not belong to $\mathbb{M}$, we cannot reconstruct a probability measure having these moments. However, even if $\mathbf{m} \in \mathbb{M}$, the entropy problem (1.3) need not have a solution (see [Junk2000]). The results presented in this Section and Section 4.3 will demonstrate that $\mathbf{m} \in \operatorname{int} \mathbb{M}$ is a sufficient condition for the existence of the minimiser to (1.3), provided the system $\mathbf{R}$ and the prior distribution satisfy the following strengthening of algebraic independence [Lewis1995]:
We say that functions $R_1, \dots, R_L$ are linearly independent modulo a measure $\mu$ if they are linearly independent on every subset of $D$ with positive $\mu$-measure, or, equivalently, if
$\mu\big( \{ x \in D :\ \lambda_1 R_1(x) + \dots + \lambda_L R_L(x) = 0 \} \big) = 0$
for all $\lambda = (\lambda_1, \dots, \lambda_L) \neq 0$. In particular, when $D$ is compact, any linearly independent set of real-analytic functions on $D$ will be linearly independent modulo the Lebesgue measure. Note also that this property persists whenever we switch to any measure that is absolutely continuous with respect to $\mu$.
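For square-integrable moment functions, this property can be probed numerically: linear independence modulo $\mu$ implies that no nontrivial combination vanishes $\mu$-almost everywhere, so the Gram matrix $G_{ij} = \int R_i R_j \,\mathrm{d}\mu$ must be nonsingular; a (near-)singular Gram matrix therefore certifies dependence. A sketch with midpoint quadrature for the uniform measure on $[0,1]$ (all function names are ours):

```python
def gram_matrix(funcs, n=2000):
    """Gram matrix G_ij = integral of f_i * f_j over [0,1], midpoint quadrature."""
    xs = [(k + 0.5) / n for k in range(n)]
    return [[sum(f(x) * g(x) for x in xs) / n for g in funcs] for f in funcs]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting (pure Python)."""
    A = [row[:] for row in M]
    m, d = len(A), 1.0
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-300:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for k in range(c, m):
                A[r][k] -= f * A[c][k]
    return d

independent = [lambda x: 1.0, lambda x: x, lambda x: x * x]  # nonsingular Gram matrix
dependent = [lambda x: x, lambda x: 2.0 * x]                 # singular Gram matrix
```

The monomials $1, x, x^2$ give a strictly positive determinant, while the proportional pair $x, 2x$ gives a vanishing one.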
With Definition 4.1 at our disposal, we acquire the following property of the interior of the moment space $\mathbb{M}$, which will turn out to be essential for the definition of the matching operator in Section 4.3: {thm}[[BorLew1991b, Thm. 2.9]] Assume that $R_1, \dots, R_L$ are linearly independent modulo a fully supported (the support of a measure $\mu$ is the smallest closed set with full $\mu$-measure; $\mu$ has full support if $\operatorname{supp} \mu = D$) measure $\mu \in \mathcal{P}(D)$. For every $\mathbf{m} \in \operatorname{int} \mathbb{M}$ there exists a probability measure $\pi$ such that $\pi \ll \mu$ and $\mathcal{R}\pi = \mathbf{m}$. Note that, in the hypothesis of Theorem 4.1, we require the system of restriction functions to be independent of the constant function as well. This is the natural situation in our framework. As we are working with probability measures, the constant statistics do not bring new information, and any linear dependence of components of $\mathbf{R}$ on constants makes the vector of expectations reducible.
In practice, one would consider a fixed dominating measure on $D$, e.g. the Lebesgue measure, and choose $R_1, \dots, R_L$ linearly independent modulo this measure. See also the final paragraph of Section 4.3. Then, the conclusion of Theorem 4.1 holds for all fully supported measures $\mu$.
We finish this Section with a general description of the moment space: {lem}[[Lewis1995, Thm. 2.1]] If $\mu$ has full support, and $R_1, \dots, R_L$ are linearly independent modulo $\mu$, the following relations hold:

;

.
4.2. Exponential families
For a vector $\lambda \in \mathbb{R}^{L}$ and a measure $\mu \in \mathcal{P}(D)$, with a fixed vector of moment functions $\mathbf{R}$, define
$Z(\lambda, \mu) \doteq \int_D e^{\lambda \cdot \mathbf{R}} \,\mathrm{d}\mu, \qquad A(\lambda, \mu) \doteq \ln Z(\lambda, \mu).$
We call $Z$ the partition function and $A$ the log-partition function. For fixed $\mu$, the log-partition function determines a family of probability distributions that reads
$\mathcal{E}(\lambda, \mu) \doteq e^{\lambda \cdot \mathbf{R} - A(\lambda, \mu)}\, \mu.$
The function $\lambda \mapsto \mathcal{E}(\lambda, \mu)$ is called the exponential family with respect to $\mu$ [AmaNag2000, LehRom2005]. {lem}
(i) For each $\mu \in \mathcal{P}(D)$, the function $\lambda \mapsto A(\lambda, \mu)$ is convex and smooth on $\mathbb{R}^{L}$, with
$\nabla_{\lambda} A(\lambda, \mu) = \mathbb{E}^{\mathcal{E}(\lambda, \mu)} \mathbf{R}, \qquad \nabla^{2}_{\lambda} A(\lambda, \mu) = \operatorname{Cov}^{\mathcal{E}(\lambda, \mu)} \mathbf{R};$
(ii) For every $\lambda \in \mathbb{R}^{L}$, the function $\mu \mapsto A(\lambda, \mu)$ is concave and weakly continuous on $\mathcal{P}(D)$;
(iii) The mapping $\mu \mapsto \mathcal{E}(\lambda, \mu)$ is continuous on $\mathcal{P}(D)$ equipped with the weak topology.
Proof.
Item (i) follows from differentiation under the integral sign, valid due to the Lebesgue dominated convergence theorem. The first two derivatives of the partition function read
$\nabla_{\lambda} Z(\lambda, \mu) = \int_D \mathbf{R}\, e^{\lambda \cdot \mathbf{R}} \,\mathrm{d}\mu, \qquad \nabla^{2}_{\lambda} Z(\lambda, \mu) = \int_D \mathbf{R} \mathbf{R}^{T} e^{\lambda \cdot \mathbf{R}} \,\mathrm{d}\mu,$
from which the formulas for the gradient and the Hessian of $A$ follow. The details can be found, for example, in [LehRom2005, Sec. 2.7].
The proof of claim (ii) is straightforward.
The conclusion of item (iii) follows from the estimate
Thus if $\mu_n \to \mu$ weakly, we have $Z(\lambda, \mu_n) \to Z(\lambda, \mu)$, and the same holds for the log-partition function $A$. The sequential continuity implies continuity in $\mu$, due to the metrizability of the weak topology on $\mathcal{P}(D)$ [Dudley2002, Thm. 11.3.3]. ∎
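The identity behind item (i), that the gradient of the log-partition function equals the moment vector $\mathbb{E}^{\mathcal{E}(\lambda,\mu)}\mathbf{R}$ under the exponential tilt, is easy to confirm on a discrete prior by finite differences. In the sketch below the uniform grid prior on $[-1,1]$, the choice $\mathbf{R} = (x, x^2)$, and all function names are our own illustration.

```python
import math

XS = [i / 500.0 - 1.0 for i in range(1001)]  # uniform discrete prior on [-1, 1]

def log_partition(lam):
    """A(lam) = log Z(lam) for moment functions R = (x, x^2)."""
    return math.log(sum(math.exp(lam[0] * x + lam[1] * x * x) for x in XS) / len(XS))

def tilted_moments(lam):
    """Moments of R under the exponentially tilted prior E(lam, mu)."""
    w = [math.exp(lam[0] * x + lam[1] * x * x) for x in XS]
    Z = sum(w)
    return [sum(wi * x for wi, x in zip(w, XS)) / Z,
            sum(wi * x * x for wi, x in zip(w, XS)) / Z]

def grad_fd(lam, h=1e-5):
    """Central finite-difference gradient of the log-partition function."""
    g = []
    for k in range(2):
        lp, lm = list(lam), list(lam)
        lp[k] += h
        lm[k] -= h
        g.append((log_partition(lp) - log_partition(lm)) / (2.0 * h))
    return g
```

At any fixed $\lambda$, the finite-difference gradient and the tilted moments agree to quadrature precision.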
Note that the measures $\mathcal{E}(\lambda, \mu)$ and $\mu$ are equivalent; the Radon–Nikodym derivative of $\mathcal{E}(\lambda, \mu)$ with respect to $\mu$ is $e^{\lambda \cdot \mathbf{R} - A(\lambda, \mu)}$, with norm bounded by $e^{2 |\lambda|\, \|\mathbf{R}\|_{\infty}}$. According to Lemma 4.2, this density is differentiable in $\lambda$ and, by the chain rule, we have a simple estimate on this derivative, which we will need in Section 8.1: {lem} For any fixed $\lambda \in \mathbb{R}^{L}$ and $\mu \in \mathcal{P}(D)$,
One useful feature that the assumption of linear independence modulo $\mu$ (Definition 4.1) guarantees is the invertibility of the Hessian matrix of the log-partition function. {lem} If the functions $R_1, \dots, R_L$ are linearly independent modulo $\mu$, the Hessian $\nabla^{2}_{\lambda} A(\lambda, \mu)$ is positive definite.
Proof.
We can assume (up to changing $\mu$ to $\mathcal{E}(\lambda, \mu)$) that $\lambda = 0$, so the Hessian is
$\nabla^{2}_{\lambda} A(0, \mu) = \operatorname{Cov}^{\mu} \mathbf{R} = \mathbb{E}^{\mu}\big[ (\mathbf{R} - \mathbb{E}^{\mu}\mathbf{R}) (\mathbf{R} - \mathbb{E}^{\mu}\mathbf{R})^{T} \big].$
Take a vector $v \in \mathbb{R}^{L} \setminus \{0\}$. The variance–covariance matrix is always positive semi-definite, so we already know that $v^{T} (\operatorname{Cov}^{\mu} \mathbf{R})\, v \geq 0$. Suppose now that this form is equal to zero. By the linearity of expectation, this reads as
$\mathbb{E}^{\mu}\big[ \big( v \cdot \mathbf{R} - \mathbb{E}^{\mu}(v \cdot \mathbf{R}) \big)^{2} \big] = 0.$
Since the exponential distribution $\mathcal{E}(\lambda, \mu)$ is a probability measure equivalent to $\mu$, this equality can hold only if $v \cdot \mathbf{R}$ is constant $\mu$-almost everywhere, and we get a contradiction with the linear independence modulo $\mu$. ∎
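The positive definiteness of the Hessian is also what makes the matching computationally tractable: the Lagrange multipliers of the entropy problem solve the smooth, strictly convex system $\nabla_{\lambda} A(\lambda, \mu) = \mathbf{m}$, which is amenable to Newton's method. A sketch for $\mathbf{R} = (x, x^2)$ on a discrete uniform prior; the grid, the target moments, and all names are illustrative only, not taken from the paper.

```python
import math

GRID = [i / 500.0 - 1.0 for i in range(1001)]  # uniform discrete prior on [-1, 1]

def moments_and_hessian(lam):
    """Gradient (tilted moments of R) and Hessian (tilted covariance of R) of A."""
    w = [math.exp(lam[0] * x + lam[1] * x * x) for x in GRID]
    Z = sum(w)
    m1 = sum(wi * x for wi, x in zip(w, GRID)) / Z
    m2 = sum(wi * x ** 2 for wi, x in zip(w, GRID)) / Z
    m3 = sum(wi * x ** 3 for wi, x in zip(w, GRID)) / Z
    m4 = sum(wi * x ** 4 for wi, x in zip(w, GRID)) / Z
    grad = [m1, m2]
    hess = [[m2 - m1 * m1, m3 - m1 * m2],
            [m3 - m1 * m2, m4 - m2 * m2]]
    return grad, hess

def match(target, iters=50):
    """Newton iteration for the multipliers solving grad A(lam) = target."""
    lam = [0.0, 0.0]
    for _ in range(iters):
        g, H = moments_and_hessian(lam)
        r = [target[0] - g[0], target[1] - g[1]]
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]  # invertible: Hessian is SPD
        lam[0] += (H[1][1] * r[0] - H[0][1] * r[1]) / det
        lam[1] += (-H[1][0] * r[0] + H[0][0] * r[1]) / det
    return lam
```

For a target in the interior of the moment space, e.g. first and second moments $(0.1, 0.4)$, the iteration reproduces the prescribed moments under the tilted prior.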
Finally, we find the directional derivatives of the log-partition function with respect to the underlying measure. {lem} For each fixed $\lambda \in \mathbb{R}^{L}$ the function $\mu \mapsto A(\lambda, \mu)$ has the directional derivative $\delta_{\mu} A(\lambda, \mu; \eta)$, see Definition 3.1, in every admissible direction $\eta$ for $\mu$, with
(4.3) $\delta_{\mu} A(\lambda, \mu; \eta) = \frac{Z(\lambda, \eta)}{Z(\lambda, \mu)} = \int_D e^{\lambda \cdot \mathbf{R} - A(\lambda, \mu)} \,\mathrm{d}\eta.$
Proof.
On one hand, since the functional $\nu \mapsto Z(\lambda, \nu)$ extends linearly to finite signed measures, we have
On the other hand, we compute
The limits exist according to the concavity of