An Optimal Transport Formulation of the Ensemble Kalman Filter

An Optimal Transport Formulation of the Ensemble Kalman Filter

Amirhossein Taghvaei, Prashant G. Mehta A. Taghvaei is with the Department of Mechanical and Aerospace Engineering at University of California Irvine. The research reported in this paper was performed while he was a graduate student at the University of Illinois at Urbana-Champaign G. Mehta is with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign (UIUC) support from the NSF CMMI grants 1462773 and 1761622 is gratefully acknowledged.The conference version of this paper appears in [1] and [2].

Controlled interacting particle systems such as the ensemble Kalman filter (EnKF) and the feedback particle filter (FPF) are numerical algorithms to approximate the solution of the nonlinear filtering problem in continuous time. The distinguishing feature of these algorithms is that the Bayesian update step is implemented using a feedback control law. It has been noted in the literature that the control law is not unique. This is the main problem addressed in this paper. To obtain a unique control law, the filtering problem is formulated here as an optimal transportation problem. An explicit formula for the (mean-field type) optimal control law is derived in the linear Gaussian setting. Comparisons are made with the control laws for different types of EnKF algorithms described in the literature. Via empirical approximation of the mean-field control law, a finite- controlled interacting particle algorithm is obtained. For this algorithm, the equations for empirical mean and covariance are derived and shown to be identical to the Kalman filter. This allows strong conclusions on convergence and error properties based on the classical filter stability theory for the Kalman filter. It is shown that, under certain technical conditions, the mean squared error (m.s.e.) converges to zero even with a finite number of particles. A detailed propagation of chaos analysis is carried out for the finite- algorithm. The analysis is used to prove weak convergence of the empirical distribution as . For a certain simplified filtering problem, analytical comparison of the m.s.e. with the importance sampling-based algorithms is described. The analysis helps explain the favorable scaling properties of the control-based algorithms reported in several numerical studies in recent literature.

I Introduction

The subject of this paper concerns Monte-Carlo methods for simulating a nonlinear filter (conditional distribution) in continuous-time settings. The mathematical abstraction of any filtering problem involves two processes: a hidden Markov process and the observation process . The numerical problem is to compute the posterior distribution where is the filtration generated by the observations. A standard solution approach is the particle filter which relies on importance sampling to implement the effect of conditioning [3, 4]. In numerical implementations, this often leads to the particle degeneracy issue (or weight collapse) whereby only a few particles have large weights. To combat this issue, various types of resampling schemes have been proposed in the literature [5, 6].

In the past decade, an alternate class of algorithms has attracted growing attention. These algorithms can be regarded as a controlled interacting particle system where the central idea is to implement the effect of conditioning using feedback control. Mathematically, this involves construction of controlled stochastic process, denoted by . In continuous-time settings, the model for the is a stochastic differential equation (sde):


where the [additional terms] are pre-specified (these terms may be zero). The control problem is to design mean-field type control law and such that the conditional distribution of (given ) is equal to the posterior distribution of . If this property holds, the filter is said to be exact. In a numerical implementation, the mean-field terms in the control law are approximated empirically by simulating copies of (1). The resulting system is a controlled interacting particle system with a finite number of interacting particles. The particles have uniform importance weights by construction. Therefore, the particle degeneracy issue does not arise. Resampling is no longer necessary and steps such as rules for reproduction, death or birth of particles are altogether avoided.

The focus of this paper is on (i) formal methods for design of control laws ( and ) for (1); (ii) algorithms for empirical approximation of the control laws using particles; and (iii) error analysis of the finite- interacting particle models as . The main problem highlighted and addressed in this paper is the issue of uniqueness: one can interpret the controlled system (1) as transporting the initial distribution at time (prior) to the conditional distribution at time t (posterior). Clearly, there are infinitely many maps that transport one distribution into another. This suggests that there are infinitely many choices of control laws that all lead to exact filters. This is not surprising: The exactness condition specifies only the marginal distribution of the stochastic process at times , which is not enough to uniquely identify a stochastic process, e.g., the joint distributions at two time instants are not specified.

Although these issues are relevant more generally, the scope of this paper is limited to the linear Gaussian problem. A motivation comes from the widespread use of the ensemble Kalman filter (EnKF) algorithm in applications. It is noted that the mean-field limit of the EnKF algorithm is exact only in linear Gaussian settings. The issue of non-uniqueness is manifested in the different types of EnKF algorithms reported in literature. Some of these EnKF types are discussed as part of the literature survey (in Sec. I-B) and in the main body of the paper (in Sec. II-B).

I-a Contributions of this paper

The following contributions are made in this paper:

  1. Non-uniqueness issue: For the linear Gaussian problem, an error process is introduced to help explain the non-uniqueness issue in the selection of the control law in (1). The error process helps clarify the relationship between the different types of control laws leading to the different types of EnKF algorithms that have appeared over the years in the literature.

  2. Optimal transport FPF: To select a unique control law, an optimization problem is proposed in the form of a time-stepping procedure. The optimality concept is motivated by the optimal transportation theory [7, 8]. The solution of the time-stepping procedure yields a unique optimal control law. The resulting filter is referred to as the optimal transport FPF. The procedure is suitably adapted to handle the case, important in Monte-Carlo implementations with finitely many particles, where the covariance is singular. In this case, the optimal (deterministic) transport maps are replaced by optimal (stochastic) couplings. The general form of the optimal FPF includes stochastic terms which are zero when the covariance is non-singular.

  3. Error analysis: A detailed error analysis is carried out for the deterministic form of the optimal FPF for the finite but large limit. For the purposes of error analysis, it is assumed that the linear system is controllable and observable, and the initial empirical covariance matrix is non-singular. The main results are as follows:

    • Empirical mean and covariance of particles is shown to converge almost surely to exact mean and covariance as even for finite (Prop. 2-(i));

    • Mean-squared error is shown to be bounded by where the constant has polynomial dependence on the problem dimension (Prop. 2-(ii));

    • A propagation of chaos analysis is carried out to show that empirical distribution of the particles converges in a weak sense to the exact filter posterior distribution (Cor. 1).

  4. Comparison to importance sampling: For a certain simplified filtering problem, a comparison of the m.s.e. between the importance sampling and control-based filters is described. The main result is to show that using an important sampling approach, the number of particles must grow exponentially with the dimension . In contrast, with a control-based approach, scales at most as order in order to maintain the same error (Prop. 4). The conclusions are also verified numerically (Fig. 1).

This paper extends and completes the preliminary results reported in our prior conference papers [1, 2]. The optimal transport formulation of the FPF and the time-stepping procedure was originally introduced in [1]. However, its extension to the singular covariance matrix case, in Sec. III-D, is original and has not appeared before. Preliminary error analysis of the deterministic form of optimal FPF appeared in our prior work [2]. The current paper extends the results in [2] in two key aspects: (i) The error bounds for the convergence of the empirical mean and covariance, in Prop. 2-(ii), reveal the scaling with the problem dimension (see Remark 3-(ii)), whereas previous results did not; (ii) The propagation of chaos analysis is carried out for the vector case (Cor. 1), whereas previous result was only valid for the scalar case. These improvements became possible with a proof approach that is entirely different than the one used in [2]. Finally, the analytical comparison with the importance sampling particle filter, in Sec. V, is new.

I-B Literature survey

Two examples of the controlled interacting particle systems are the classical ensemble Kalman filter (EnKF) [9, 10, 11, 12] and the more recently developed feedback particle filter (FPF) [13, 14]. The EnKF algorithm is the workhorse in applications (such as weather prediction) where the state dimension is very high; cf., [12, 15]. The high-dimension of the state-space provides a significant computational challenge even in linear Gaussian settings. For such problems, an EnKF implementation may require less computational resources (memory and FLOPS) than a Kalman filter [15, 16]. This is because the particle-based algorithm avoids the need to store and propagate the error covariance matrix (whose size scales as ).

An expository review of the continuous-time filters including the progression from the Kalman filter (1960s) to the ensemble Kalman filter (1990s) to the feedback particle filter (2010s) appears in [17]. In continuous-time settings, the first interacting particle representation of the nonlinear filter appears in the work of Crisan and Xiong [18]. Also in continuous-time settings, Reich and collaborators have derived deterministic forms of the EnKF [19, 12]. In discrete-time settings, Daum and collaborators have pioneered the development of closely related particle flow algorithms [20, 21].

The technical approach of this paper has its roots in the optimal transportation theory. These methods have been widely applied for uncertainty propagation and Bayesian inference: The ensemble transform particle filter is based upon computing an optimal coupling by solving a linear optimization problem [22]; Polynomial approximations of the Rosenblatt transport maps for Bayesian inference appears in [23, 24]; Solution of such problems using the Gibbs flow is the subject of [25]. The time stepping procedure of this paper is inspired by the J-K-O construction in [26]. Its extension to the filtering problem appears in [27, 28, 29, 30].

Closely related to error analysis of this paper is the recent literature on stability and convergence of the EnKF algorithm. For the discrete-time EnKF algorithm, these results appear in [31, 32, 33, 34, 35]. The analysis for continuous-time EnKF is more recent. For continuous-time EnKF with perturbed observation, under additional assumptions (stable and fully observable), it has been shown that the empirical distribution of the ensemble converges to the mean-field distribution uniformly for all time with the rate  [36]. This result has been extended to the nonlinear setting for the case with Langevin type dynamics with a strongly convex potential and full linear observation [37]. The stability assumption is recently relaxed in [38]. Under certain conditions, convergence and long term stability results appear in [39].

In independent numerical evaluations and comparisons, it has been observed that EnKF and FPF exhibit smaller simulation variance and better scaling properties – with respect to the problem dimension – when compared to the traditional methods [40, 41, 42, 43, 44]. The error analysis (in Sec. IV) together with the analytical bounds on comparison with the importance sampling (in Sec. V) provide the first such rigorous justification for the performance improvement reported in literature. The analysis of this paper is likely to spur wider adoption of the control-based algorithms for the purposes of sampling and simulation.

I-C Paper outline

Sec. II includes the preliminaries along with a discussion of the non-uniqueness issue. Its resolution is provided in Sec. III where the optimal FPF is derived. The error analysis appears in Sec. IV and comparison with importance sampling particle filter is given in Sec. V. The proofs appear in the Appendix.

Notation: For a vector , denotes the Euclidean norm. For a square matrix , denotes the Frobenius norm, is the spectral norm, is the matrix-trace, denotes the null-space, denotes the range space, and denotes the spectrum. For a symmetric matrix , and denote the maximum and minimum eigenvalues of respectively. The partial order of positive definite matrices is denoted by such that means is positive definite. is a Gaussian probability distribution with mean and covariance .

Ii The Non-Uniqueness Issue

Ii-a Preliminaries

The linear Gaussian filtering problem is described by the linear stochastic differential equations (sde-s):


where and are the state and observation at time , are mutually independent standard Wiener processes taking values in and , respectively, and , , are matrices of appropriate dimension. The initial condition is assumed to have a Gaussian distribution . The filtering problem is to compute the posterior distribution,


where .

Kalman-Bucy filter: In this linear Gaussian case, the posterior distribution is Gaussian , whose mean and variance evolve according to the Kalman-Bucy filter [45]:


where is the Kalman gain and .

Feedback particle filter: The stochastic linear FPF [14, Eq. (26)] (and also the square-root form of the EnKBF [12, Eq (3.3)]) is described by the Mckean-Vlasov sde:


where is the Kalman gain, is a standard Wiener process, , are the mean-field terms, and . We use


to denote the conditional distribution of mean-field process .

The FPF control law is exact. The exactness result appears in the following theorem which is a special case of the [14, Thm. 1] that describes the exactness result for the general nonlinear non-Gaussian case. A proof is included in Appendix A. The proof is useful for studying the non-uniqueness issue described in Sec. II-B.

Theorem 1

(Exactness of linear FPF) Consider the linear Gaussian filtering problem (2a)-(2b) and the linear FPF (5). If then

The notation nomenclature is tabulated in Table I.

Variable Notation Equation
State of the hidden process Eq. (2a)
State of the particle in finite- sys. Eq. (8), (21)
State of the mean-field model Eq. (5), (19)
Kalman filter mean and covariance Eq. (4a)-(4b)
Empirical mean and covariance Eq. (9)
Mean-field mean and covariance Eq. (5)-(19)
Conditional distribution of Eq. (3)
Conditional distribution of Eq. (6)
Empirical distribution of particles Eq. (10)
TABLE I: Nomenclature.

Ii-B The non-uniqueness issue

In the proof of Thm. 1 (given in Appendix A), it is shown that (i) the conditional mean process evolves according to (4a); and (ii) the conditional variance process evolves according to (4b).

Define an error process for . The equation for is obtained by subtracting the equation for the mean, (38) in Appendix A, from (5):

This is a linear system and therefore, the variance of , easily seen to be given by , evolves according to the Lyapunov equation

which is identical to (4b).

These arguments suggest the following general procedure to construct an exact process: Express as a sum of two terms:

where evolves according to the Kalman-Bucy equation (4a) and the evolution of is defined by the sde:

where and are independent Brownian motions, and , , and are solutions to the matrix equation


By construction, the equation for the variance is given by the Riccati equation (4b).

In general, there are infinitely many solutions for (7). Below, we describe three solutions that lead to three established form of EnKF and linear FPF:

  1. EnKF with perturbed observation [19, Eq. (27)]:

  2. Stochastic linear FPF [14, Eq. (26)] or square-root form of the EnKF [12, Eq (3.3)] :

  3. Deterministic linear FPF [2]:

Given a particular solution , one can construct a family of solutions , where is any skew-symmetric matrix. For the linear Gaussian problem, the non-uniqueness issue has been discussed in literature: At least two forms of EnKF, the perturbed observation form [19] and the square-root form [12], are well known. A homotopy of exact deterministic and stochastic filters is given in [46]. An explanation for the non-uniqueness in terms of the Gauge transformation appears in [47].

Ii-C Finite- implementation

In a numerical implementation, one simulates stochastic processes (particles) , where is the state of the -particle at time . The evolution of is obtained upon empirically approximating the mean-field terms. The finite- filter for the linear FPF (5) is an interacting particle system:


where ; are independent copies of ; for ; and the empirical approximations of the two mean-field terms are as follows:


We use the notation


to denote the empirical distribution of the particles. Here, denotes the Dirac delta distribution at .

Iii Optimal Transport FPF

The problem is to uniquely identify an exact stochastic process . The solution is based on an optimality concept motivated by the optimal transportation theory [7, 8]. The background on this theory is briefly reviewed next.

Iii-a Background on optimal transportation

Let and be two probability measures on with finite second moments. The (Monge) optimal transportation problem with quadratic cost is to minimize


over all measurable maps such that


If it exists, the minimizer is called the optimal transport map between and . The optimal cost is referred to as -Wasserstein distance between and [8].

Theorem 2

(Optimal map between Gaussians [48, Prop. 7]) Consider the optimization problem (11)-(12). Suppose and are Gaussian distributions, and , with . Then the optimal transport map between and is given by


where .

Iii-B The time-stepping optimization procedure

To uniquely identify an exact stochastic process , the following time stepping optimization procedure is proposed:

  1. Divide the time interval into equal time steps with the time instants .

  2. Initialize a discrete time random process according to the initial distribution (prior) of ,

  3. For each time step , evolve the process according to


    where the map is the optimal transport map between the probability measures and .

  4. Take the limit as to obtain the continuous-time process and the sde:


The procedure leads to the control laws and that depend upon . Since is unknown, one simply replaces it with – the two are meant to be identical by construction. The resulting sde (15) is referred to as the optimal transport FPF or simply the optimal FPF. A definition is needed to state the main result.

Definition 1

For a given positive-definite matrix , define as the (unique such) symmetric solution to the matrix equation:

Remark 1

The symmetric solution to the matrix equation (16) is explicitly given by:

The solution can also be expressed as:


where is the (unique such) skew-symmetric matrix that solves the matrix equation


The main result is as follows. Its proof appears in the Appendix B.

Proposition 1

Consider the linear Gaussian filtering problem (2a)-(2b). Assume the initial covariance . Then the optimal transport FPF is given by


where , , , , and

The filter is exact: That is, the conditional distribution of is Gaussian with and .

Using the form of the solution (17) for , the optimal transport sde (19) is expressed as


Compared to the original (linear Gaussian) FPF (5), the optimal transport FPF (20) has two differences:

  1. The stochastic term is replaced with the deterministic term . Given a Gaussian prior, the two terms yield the same posterior. However, in a finite- implementation, the difference becomes significant. The stochastic term serves to introduce an additional error of order .

  2. The sde (20) has an extra term involving the skew-symmetric matrix . The extra term does not effect the posterior distribution. This term is viewed as a correction term that serves to make the dynamics symmetric and hence optimal in the optimal transportation sense. It is noted that for the scalar () case, the skew-symmetric term is zero. Therefore, in the scalar case, the update formula in the original FPF (5) is optimal. In the vector case, it is optimal iff .

Iii-C Finite- implementation in non-singular covariance case

The finite- implementation of the optimal transport sde (19) is given by the following system of equations:


for , where ; ; and empirical approximations of mean and variance are defined in (9).

The matrix is not well-defined if is a singular matrix. This is a problem because in a finite- implementation, it is possible that even though . In particular, when , the empirical covariance matrix is of rank at most and hence singular. Note that this problem only arises for the optimal and deterministic forms of the FPF. In particular, the stochastic FPF (8) does not suffer from this issue. It can be simulated for any choice of . In Sec. III-D, we extend the optimal transportation formulation to handle also the singular forms of the covariance matrix.

Iii-D The singular covariance case

The derivation of the optimal FPF (19) crucially relies on the assumption that which in turn implies that, in the time-stepping procedure, for . In the proof of Prop. 1, the assumption is used to derive the optimal transport map . In general, when the covariance of Gaussian random variables or is singular, the optimal transport map may not exist.

In the singular case, a relaxed form of the optimal transportation problem, first introduced by Kantorovich, is used to search for optimal (stochastic) couplings instead of (deterministic) transport maps [8]. The following example helps illuminate the issue:

Example 1

Consider Gaussian random variable and with distributions, and , respectively. Suppose

where is small. If , the optimal transportation map exists, and is given by

If then there is no transport map that satisfies the constraints of the optimal transportation problem.

The Kantorovich relaxation of the optimal transportation problem (11) is the following optimization problem:


where is a joint distribution on with marginals fixed to and .

Although a deterministic map does not exist for the problem, a (stochastic) coupling that solves the Kantorovich problem (22) exists and is given by

where is independent of .

In Appendix C, the Kantorovich relaxation is used to motivate an optimization problem whose solution yields the following extension of the optimal FPF:


with where is the projection matrix into the kernel of the matrix , and is any symmetric solution of the matrix equation

Remark 2

When is singular, the solution to the matrix equation (24) is not unique. It is shown in Appendix C that the solution is of the following general form:


where is the pseudo inverse111The pseudo inverse of matrix is the unique matrix that satisfies , , is symmetric, and is symmetric [49]. of , is projection matrix onto the range of , and is a skew-symmetric matrix that solves a certain matrix equation (46) and is an arbitrary symmetric matrix.

Using the formula (25), the optimal FPF (23) is expressed as follows:


The formula (26) allows one to clearly see the relationship between the deterministic and stochastic forms of the optimal FPF. In particular, when the covariance matrix is non-singular, , and . This results in the deterministic form of the optimal transport FPF (20). When the covariance matrix is singular, then the effect of the linear term in (20) is now simulated with the two terms in (26). This is conceptually similar to the Example 1, where the deterministic optimal transport map is replaced with a stochastic coupling. The [additional terms] in (26) do not affect the distribution.

Iii-E Finite- implementation in the singular covariance case

The finite- implementation of the optimal transport sde (23) is given by the following system of equations:


where ; are independent copies of , , is a symmetric matrix solution to the matrix equation

and and are empirical mean and covariance defined in (9). Note that the stochastic term is zero when , which is true, e.g., when .

Iv Error analysis

This section is concerned with error bounds in the large but finite regime. Given that is large, we restrict ourselves to the non-singular case. For the purposes of the error analysis, the following assumptions are made:

Assumption (I): The system is detectable and is stabilizable.

Assumption (II): Assume and the initial empirical covariance matrix almost surely.

The main result for the finite- deterministic FPF (21) is as follows with the proof given in Appendix D.

Proposition 2

Consider the Kalman filter (4a)-(4b) initialized with the prior and the finite- deterministic form of the optimal FPF (21) initialized with for . Under Assumption (I) and (II), the following characterizes the convergence and error properties of the empirical mean and covariance obtained from the finite- filter relative to the mean and covariance obtained from the Kalman filter:

  • Convergence: For any finite , as :

    for all where is a fixed positive constant (see (48) in Appendix D).

  • Mean-squared error: For any , as :


    for all . The constant depends on , and spectral norms , and , where is the solution to the algebraic Riccati equation (see Lemma 2).

Remark 3

Two remarks are in order:

  1. Asymptotically (as ) the empirical mean and variance of the finite- filter becomes exact. This is because of the stability of the Kalman filter whereby the filter forgets the initial condition. In fact, for any (not necessarily i.i.d Gaussian) , that satisfy the assumption , the result holds.

  2. (Scaling with dimension) If the parameters of the linear Gaussian filtering problem (2a)-(2b) scale with the dimension in a way that the spectral norms , , , and do not change, then the constant in the error bounds (28a)-(28b) do not change. The only term that scales with the dimension is . For example, with , . Therefore, the bound on the error typically scales as in problem dimension.

Iv-a Propagation of chaos

In this section, we study the convergence of the empirical distribution of the particles for the finite- system (21) to the exact posterior distribution . Derivation of error estimates involve construction of independent copies of the exact process (19). Consistent with the convention to denote mean-field variables with a bar, the stochastic processes are denoted as where denotes the state of the particle at time . The particle evolves according to the mean-field equation (19) as


where is the Kalman gain and the initial condition , the right-hand side being the initial condition of the particle in the finite- FPF (21). The mean-field process is thus coupled to through the initial condition.

In order to carry out the error analysis, the following event is defined for an arbitrary choice of a fixed matrix :


The following proposition characterizes the error between and , when the event is true (the estimate is key to the propagation of chaos analysis). The proof appears in the Appendix E.

Proposition 3

Consider the stochastic processes and whose evolution is defined according to the optimal transport FPF (21) and its mean-field model (29), respectively. The initial condition for Then, under Assumptions (I) and (II):


The estimate (31) is used to prove the following important result that the empirical distribution of the particles in the linear FPF converges weakly to the true posterior distribution. Its proof appears in the Appendix E.

Corollary 1

Consider the linear filtering problem (2a)-(2b) and the finite- deterministic FPF (21). The initial condition for . Under Assumptions (I) and (II), for any bounded and Lipschitz function ,

V Comparison to Importance Sampling

For the purposes of the comparison of the optimal FPF with the importance sampling-based particle filter, we restrict to the static filtering example with a fully observed observation model:


for , where . The posterior distribution at time , denoted as , is a Gaussian with and .

Let be i.i.d samples of . The importance sampling-based particle filter yields an empirical approximation of the posterior distribution as follows:


In contrast, given the initial samples , the FPF approximates the posterior by implementing a feedback control law as follows:


where the empirical mean and covariance are approximated as (9).

The m.s.e in estimating the conditional expectation of a given function is defined as follows:

where the subscript is either the PF or the FPF.

For , a numerically computed plot of the level-sets of m.s.e, as a function of and , is depicted in Figure 1-(a)-(b). The expectation is approximated by averaging over independent simulations. It is observed that, in order to have the same error, the importance sampling-based approach requires the number of samples