Convergence of the Square Root Ensemble Kalman Filterin the Large Ensemble Limit

# Convergence of the Square Root Ensemble Kalman Filter in the Large Ensemble Limit

Evan Kwiatkowski111Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA. Partially supported by the U.S. National Science Foundation under the grant DMS-1216481.    Jan Mandel111Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA. Partially supported by the U.S. National Science Foundation under the grant DMS-1216481. 222Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague, Czech Republic. Partially supported by the Czech Science Foundation under the grant 13-34856S.
###### Abstract

Ensemble filters implement sequential Bayesian estimation by representing the probability distribution by an ensemble mean and covariance. Unbiased square root ensemble filters use deterministic algorithms to produce an analysis (posterior) ensemble with a prescribed mean and covariance, consistent with the Kalman update. This includes several filters used in practice, such as the Ensemble Transform Kalman Filter (ETKF), the Ensemble Adjustment Kalman Filter (EAKF), and a filter by Whitaker and Hamill. We show that at every time index, as the number of ensemble members increases to infinity, the mean and covariance of an unbiased ensemble square root filter converge to those of the Kalman filter, in the case of a linear model and an initial distribution of which all moments exist. The convergence is in all , , and the convergence rate does not depend on the model or data dimensions. The result holds in infinitely dimensional Hilbert spaces as well.

D

ata assimilation, laws of large numbers, Hilbert space, ensemble Kalman filter {AMS} 60F25, 65C05, 65C35

## 1 Introduction

Data assimilation uses Bayesian estimation to incorporate observations into the model of a physical system. The model produces the forecast estimate, and the incorporation of the observation data produces the analysis estimate. The resulting analysis is used to initialize the next run of the model, producing the next forecast, which subsequently is used in the next analysis, and the process thus continues. Data assimilation is widely used, e.g., in geosciences [17].

The Kalman filter represents probability distributions by the mean and covariance. It is an efficient method when the probability distributions are close to Gaussian. However, in applications, the dimension of the state is large, and it is not feasible to even store the covariance of the system state exactly. Ensemble Kalman filters are variants of Kalman filters in which the state probability distribution is represented by an ensemble of states, and the state mean and covariance are estimated from the ensemble [12]. The dynamics of the model, which could be nonlinear in this case, are applied to each ensemble member to produce the forecast estimate. The simplest estimate of covariance from the ensemble is the sample covariance, which, however, suffers from sampling errors for small ensembles. For this reason, localization techniques, such as tapering [14] and covariance inflation [3, 4], need to be used for small ensembles [12, Ch. 15]. Simulation studies have shown that the ensemble filters with relatively small ensembles and with localization and covariance inflation are able to efficiently handle nonlinear dynamics and high dimension.

The major differences between ensemble Kalman filters are in the way the analysis ensemble is produced from the forecast and the data. The analysis ensemble can be formed in a stochastic or deterministic manner. The purpose of this paper is to examine ensemble Kalman filters that use a deterministic method to produce an analysis ensemble with exactly the desired statistics. Such filters are called unbiased square root filters, because the ensemble mean equals the prescribed mean, and construction of the analysis ensemble to match the prescribed ensemble covariance leads to taking the square root of a matrix. This includes several filters used in practice, such as the Ensemble Transform Kalman Filter (ETKF) [6, 15, 33], the Ensemble Adjustment Kalman Filter (EAKF) [2], and a filter by Whitaker and Hamill [34]. Criteria necessary for an ensemble square root filter to be unbiased are discussed in [23, 30, 32]. The base square root ensemble Kalman filter (cf., Algorithm 3 below) is often modified to support localization and covariance inflation for small ensembles and nonlinear problems. Since we are interested in large ensemble asymptotics and the linear case, localization and covariance inflation are not studied in this paper.

An important question for understanding ensemble filters is a law of large numbers as the size of the ensemble grows to infinity, even if practical implementations are necessarily limited to small ensembles. In [22, 24], it was proved independently for the version of the ensemble Kalman filter with randomized data from [8], that the ensemble mean and the covariance matrix converge to those of the Kalman filter, as the number of ensemble members grows to infinity. Both analyses obtain convergence in all , , but the convergence results are not independent of the space dimensions. The proof in [24] relies on the fact that ensemble members are exchangeable, uses uniform integrability, and does not provide any convergence rates, while [22] uses stochastic inequalities for the random matrices and vectors to obtain convergence with the usual Monte Carlo rate , but relies on entry-by-entry arguments.

Here, we show that at every time index, as the number of ensemble members increases to infinity, the mean and covariance of an unbiased ensemble square root filter converge to those of the Kalman filter, in the case of a linear model and an initial distribution of which all moments exist. The convergence is in all , with the usual rate , the constant does not depend on the dimension, and the result holds in the infinitely dimensional case as well. The constants in the estimate are constructive and depend only on the model and the data, namely the norms of the model operators and of the inverse of the data covariance, and of the vectors given. The analysis builds on some of the tools developed in [22], and extends them to obtain bounds on the operators involved in the formulation of square root ensemble filters, independent of the state space and data space dimensions, including the infinitely dimensional case. The square root ensemble filters are deterministic, which avoids technical complications associated with data perturbation in infinite dimension. Convergence of ensemble filters with data perturbation, independent of the dimension, will be studied elsewhere.

The main idea of the analysis is simple: by the law of large numbers, the ensemble mean and covariance of the initial ensemble converge to those of the background distribution. Every analysis step is a continuous mapping of the mean and the covariance, and the convergence in the large ensemble limit follows. The analysis quantifies this argument.

Since the principal observation is that the mean and the covariance are transformed in exactly the same way as in the Kalman filter, the continuity estimates in this paper can also be interpreted as a type of stability of the Kalman filter for arbitrary but fixed time with respect to perturbations of the initial distribution. The continuity of the Kalman filter formulas is foundational, and, as such, has not received much attention. The pointwise estimates in the present paper are somewhat stronger, and they imply local Lipschitz continuity with polynomial growth of the constant. The estimates can be interpreted as the stability of each Kalman filter time step separately with respect to random perturbations of the initial mean vector and covariance operator. This type of stability of the Kalman filter seems not to have been studied before. However, there has been a keen interest in long-time stability, particularly the effect of a finite error in the initial distribution diminishing over time, both theoretically for the Kalman filter (e.g., [21, 28, 37]) and empirically for the ensemble Kalman filter in meteorology (e.g., [36]). For a class of abstract dynamical systems with the whole state observed, the ensemble Kalman filter with a fixed ensemble size was proved to be well-posed and not blowing up faster than exponentially, and to stay within a bounded distance from the truth if sufficiently large covariance inflation is used [18].

The paper is organized as follows: in Section 2 we introduce notation and review select background concepts. Section 3 contains statements of the Kalman filter and the unbiased square root filter, and shows some basic properties, which are needed later. In Section 4, we show the continuity properties of the transformation of the statistics from one time step to the next. Section 5 contains a derivation of the laws of large numbers, needed for the convergence of the statistics of the initial ensemble. Section 6 presents the main result.

## 2 Notation and preliminaries

We will work with random elements with values in a Hilbert space . Readers interested in finite dimension only can think of random vectors in . All notations and proofs are presented in a way that applies in , as well as in a more general Hilbert space.

### 2.1 Finite dimensional case

Vectors are columns, and the inner product is where denotes transpose, and is the vector norm of . Throughout this paper, we will use single bars for deterministic norms and double bars for stochastic norms. We will use the notation for the space of all matrices. For a matrix , denotes transpose, and stands for the spectral norm. We will also need the Hilbert-Schmidt norm (more commonly called Frobenius norm) of matrices, induced by the corresponding inner product of two matrices,

 |A|HS=(n∑i,j=1a2ij)1/2=⟨A,A⟩1/2HS,⟨A,B⟩HS=n∑i,j=1aijbij.

The Hilbert-Schmidt norm dominates the spectral norm,

 |A|≤|A|HS, (1)

for any matrix . The notation means that is symmetric and positive semidefinite, means symmetric positive definite. For , means that random vector has the normal distribution on , with mean and covariance matrix . For vectors , their tensor product is the matrix,

 u⊗v=uv∗∈[V].

It is evident that

 |u⊗v|HS=|u||v|, (2)

because

 |u⊗v|2HS=n∑i=1n∑j=1(uivj)2=n∑i=1u2in∑j=1v2j=|u|2|v|2.

### 2.2 Hilbert space case

Readers interested in finite dimension only should skip this section. In the general case, is a separable Hilbert space equipped with inner product and the norm . The space of all bounded operators from Hilbert space to Hilbert space is , and . For a bounded linear operator , denotes the adjoint operator, and is the operator norm. The Hilbert-Schmidt norm of a linear operator on is defined by

where is any complete orthonormal sequence; the values do not depend on the choice of . An operator on is called a Hilbert-Schmidt operator if , and is the space of all Hilbert-Schmidt operators on . The Hilbert-Schmidt norm again dominates the spectral norm, so (1) holds, and . The importance of for us lies in the fact that is a Hilbert space, while is not.

The notation for operator means that is symmetric, , and positive semidefinite, for all . The notation means here that is symmetric and bounded below, that is, for all and some In particular, if , then . For , means that and are symmetric, and It is well known from spectral theory that implies . Let denote the spectral radius of , and if is symmetric, then . For a symmetric operator , there exists a unique symmetric such that , and if , then also and .

An operator is trace class if , where is the trace of , defined by

 TrA=∞∑i=1⟨Aei,ei⟩.

If is trace class, then is Hilbert-Schmidt, because .

For vectors , their tensor product is now a mapping defined by

 u⊗v:w∈V↦u⟨v,w⟩,

and the proof of (2) becomes

 |x⊗y|2HS =∞∑i=1|(x⊗y)ei|2=∞∑i=1|x⟨y,ei⟩|2=|x|2∞∑i=1|⟨y,ei⟩|2=|x|2|y|2,

from Bessel’s equality.

The mean of a random element is defined by

The mean of the tensor product of two random elements is defined by

The covariance of a random element (defined below in (3)) exists when the second moment , and the proposed covariance is a trace class operator. On the other hand, if is trace class, the normal distribution can be defined as a probability measure on , consistently with the finite dimensional case. Just as in the finite dimensional case, if , then has all moments , . See [10, 19, 20] for further details.

### 2.3 Common definitions and properties

The rest of the background material is the same regardless if or if is a general Hilbert space. To unify the nomenclature, matrices are identified with the corresponding operators of matrix-vector multiplication. Covariance of a random vector is defined by

 Cov(X) =E((X−E(X))⊗(Y−E(Y))) =E(X⊗Y)−E(X)⊗E(Y), (3)

if it exists. For , the space of all random elements with finite moment is denoted by , and it is equipped with the norm If and , then and . If are independent, then and . The following lemma will be used repeatedly in obtaining estimates. It is characteristic of our approach to use higher order norms to bound lower order norms.

{lemma}

[ Cauchy-Schwarz inequality]If and , then

 ∥|U||V|∥p≤∥U∥2p∥V∥2p.
{proof}

By the Cauchy-Schwartz inequality in applied to the random variables and , which are in ,

 ∥|U||V|∥pp =E(|U|p|V|p)≤(E|U|2p)1/2(E|V|2p)1/2 =((E|U|2p)1/2p(E|V|2p)1/2p)p=∥X∥p2p∥Y∥p2p.

Taking the -th root of both sides yields the desired inequality.

An ensemble consists of random elements , . The ensemble mean is denoted by or , and is defined by

 ¯¯¯¯¯XN=EN(XN)=1NN∑i=1Xi. (4)

The ensemble covariance is denoted by or , and is defined by

 QN=CN(XN) =1NN∑i=1(Xi−¯¯¯¯¯XN)⊗(Xi−¯¯¯¯¯XN) =EN(XN⊗XN)−EN(XN)⊗EN(XN), (5)

where

 XN⊗XN=X1⊗X1+…+XN⊗XN.

We use instead of the more common , which would give an unbiased estimate, because it allows writing the sample covariance with the sample mean in (5) without the additional multiplicative factors . Note that convergence as is not affected.

## 3 Definitions and basic properties of the algorithms

In the Kalman filter, the probability distribution of the state of the system is described by its mean and covariance. We first consider the analysis step, which uses Bayes’ theorem to incorporate an observation into the forecast state and covariance to produce the analysis state and covariance. The system state is an -valued random vector. We denote by the forecast state mean, the forecast state covariance, and and the analysis state mean and covariance, respectively. The observation data vector is , where , with the linear observation operator, and , , is the observation error covariance.

Given the forecast mean and covariance, the Kalman Filter analysis mean and covariance are

 ¯¯¯¯¯Xa =¯¯¯¯¯Xf+K(d−H¯¯¯¯¯Xf) (6) Qa =Qf−QfH∗(HQfH∗+R)−1HQf=Qf−KHQf, (7)

where

 K=QfH∗(HQfH∗+R)−1 (8)

is the Kalman gain matrix. See, e.g., [1, 16, 31].

In the general case, the state space and the data space above become separable Hilbert spaces, one or both of which may be infinitely dimensional, and matrices become bounded linear operators. The Kalman filter formulas (6)–(8) remain the same. The assumption guarantees that the inverse in is well-defined. In particular, when the data space is infinitely dimensional, the definition of a positive definite operator in Section 2.2 implies that this inverse is bounded. In that case, cannot be the covariance of a probability distribution in the classical sense, because cannot be of trace class in infinite dimension. However, all statistical estimates will be in the state space, not the data space, so this is not a problem.

For future reference, we introduce the operators

 K(Q) =QH∗(HQH∗+R)−1, (9) B(X,Q) =X+K(Q)(d−HX), (10) A(Q) =Q−QH∗(HQH∗+R)−1HQ (11) =Q−K(Q)HQ, (12)

which evaluate the Kalman gain, analysis mean, and the analysis covariance, respectively, in the Kalman filter equations (6)–(8). We are now ready to state the complete Kalman filter for reference. The superscript signifies quantities at time step .

###### Algorithm \thetheorem (Kalman filter)

Suppose that the model at each time is linear, , and the initial mean and the background covariance of the state are given. At time , the analysis mean and covariance from the previous time are advanced by the model,

 ¯¯¯¯¯X(k),f =M(k)¯¯¯¯¯X(k−1),a+b(k), (13) Q(k),f =M(k)Q(k−1),aM(k)∗. (14)

The analysis step incorporates the observation , where has mean zero and covariance , and it gives the analysis mean and covariance

 ¯¯¯¯¯X(k),a =B(¯¯¯¯¯X(k),f,Q(k),f), (15) Q(k),a =A(Q(k),f), (16)

where and are defined by (10) and (12) respectively, with and , at time .

In many applications, the state dimension of the system is large and computing or even storing the exact covariance of the system state is computationally impractical. Ensemble Kalman filters address this concern. Ensemble Kalman filters use a collection of state vectors, called an ensemble, to represent the distribution of the system state. This ensemble will be denoted by , comprised of random elements , . The ensemble mean and ensemble covariance, defined by (4) and (5), are denoted by and respectively, while the Kalman Filter mean and covariance are denoted without subscripts, as and .

There are several ways to produce an analysis ensemble corresponding to the Kalman filter algorithm. Unbiased ensemble square root filters produce an analysis ensemble such that (15) and (16) are satisfied for the ensemble mean and covariance. This was not always the case in early variants of ensemble square root filters [30]. Because our results assume a linear model, we formulate the algorithm in the linear case to cut down on additional notation.

###### Algorithm \thetheorem (Unbiased square root ensemble filter)

Generate the initial ensemble at time by sampling from a distribution with mean and covariance . At time , the analysis ensemble members are advanced by the model

resulting in the forecast ensemble with ensemble mean and covariance

 ¯¯¯¯¯X(k),fN=EN(X(k),fN),Q(k),fN=CN(X(k),fN). (18)

The analysis step creates an ensemble which incorporates the observation that has mean zero and covariance . The analysis ensemble is constructed (in a manner determined by the specific method) to have the ensemble mean and covariance given by

 EN(¯¯¯¯¯X(k),aN) =B(¯¯¯¯¯X(k),fN,Q(k),fN)=¯¯¯¯¯X(k),fN+K(Q(k),fN)(d(k)−H(¯¯¯¯¯X(k),fN)), (19) CN(¯¯¯¯¯X(k),aN) =A(Q(k),fN)=Q(k),fN−K(Q(k),fN)H(k)Q(k),fN, (20)

where and are defined by (10) and (12) with , , and , at time .

The Ensemble Adjustment Kalman Filter (EAKF, [2]), the filter by Whitaker and Hamill [34], and the Ensemble Transform Kalman Filter (ETKF, [6]) and its variants, the Local Ensemble Transform Kalman Filter (LETKF, [15]) and its revision [33], satisfy properties (19) and (20), cf., [23, 32], and therefore are of this form.

The background covariance does not need to be stored explicitly. Rather, it is used in a factored form [7, 13],

 B=LLT. (21)

and only multiplication of times a vector is needed,

 X(0)i=X(0)+LYi,Yi∼N(0,I).

For example, a sample covariance of the form (21) can be created from historical data. To better represent the dominant part of the background covariance, can consist of approximate eigenvectors for the largest eigenvalues, obtained by a variant of the power method [17]. Another popular choice is , where is a transformation, such as FFT or wavelet transform, which requires no storage of the matrix entries at all, and is a sparse matrix [11, 27, 29]. The covariance of the Fourier transform of a stationary random field is diagonal, so even an approximation with diagonal matrix is useful, and computationally very inexpensive.

A key observation of our analysis is that, in the linear case, the transformations of the ensemble mean and covariance in unbiased square root ensemble filters are exactly the same as in the Kalman filter, and they can be described without reference to the ensemble at all, as shown in the next lemma. All that is needed for the convergence of the unbiased square root ensemble filter is the convergence of the initial ensemble mean and covariance, and the continuity of the transformations, in a suitable statistical sense.

{lemma}

The transformations of the mean and covariance in step of Algorithms 3 and 3 are the same,

Kalman Filter
 Unbiased Square Root Ensemble Kalman Filter
{proof}

From (18) and the definition (5) of ensemble covariance, it follows that

 Q(k),fN=M(k)Q(k−1),aNM(k)∗. (22)

The rest of the transformations in Algorithms 3 and 3 are already the same.

## 4 Continuity of the analysis step

Fundamental to our analysis are continuity estimates for the operators and , which bring the forecast statistics to the analysis statistics. Our general strategy is to first derive pointwise estimates which apply to every realization of the random elements separately, then integrate them to get the corresponding estimates. We will prove the following estimates for general covariances and , with the state covariance and sample covariance of the filters in mind. Likewise, estimates will be made with general elements and , with the state mean and sample mean of the filters in mind.

### 4.1 Pointwise bounds

The first estimate is the continuity of the Kalman gain matrix (or operator) (8) as a function of the forecast covariance, shown in the next lemma and its corollary. See also [22, Proposition 22.2].

{lemma}

If and then

{proof}

Since and , and are defined in (9). For , , we have the identity,

 (I+A)−1−(I+B)−1=(I+A)−1(B−A)(I+B)−1, (23)

which is verified by multiplication by on the left and on the right, and

 ∣∣(I+A)−1−(I+B)−1∣∣≤|B−A|, (24)

which follows from (23) using the inequalities , , because . Now write

 (HQH∗+R)−1=R−1/2(R−1/2HQH∗R−1/2+I)−1R−1/2, (25)

using the symmetric square root of . By (24) with and and (25), we have that

 ∣∣(HQH∗+R)−1−(HPH∗+R)−1∣∣ ≤|Q−P||H|2|R−1|2. (26)

Since , we have

 |(HQH∗+R)−1|≤|R−1|. (27)

Using (26), (27), and the definition of the operator from (9), we have

 |K(Q)−K(P)|= |QH∗(HQH∗+R)−1−PH∗(HPH∗+R)−1| = |QH∗(HQH∗+R)−1−PH∗(HQH∗+R)−1 +PH∗(HQH∗+R)−1−PH∗(HPH∗+R)−1| ≤

Swapping the roles and yields

 |K(Q)−K(P)|≤|Q−P||H||R−1|(1+|Q||H|2∣∣R−1∣∣),

which completes the proof.

A pointwise bound on the Kalman gain follows.

{corollary}

If and , then

 |K(Q)|≤|Q||H||R−1|. (28)
{proof}

Use Lemma 4.1 with .

{corollary}

If and , then

 |B(X,Q)|≤|X|+|Q||H||R−1|(|d−HX|). (29)

Proof. By the definition of operator in (10), the pointwise bound on the Kalman gain (28), and the triangle inequality,

 |B(X,Q)|=|X+K(Q)(d−HX)|≤|X|+|Q||H||R−1|(|d−HX|).\endproof

The pointwise continuity of operator also follows from Lemma 4.1. To reduce the notation we introduce the constant

 c=|H|2|R−1|. (30)
{lemma}

If and then

 |A(Q)−A(P)|≤|Q−P|(1+c|Q|+c|P|+c2|P||Q|).

Proof. Since and , it follows that , , , and are defined. From the definition of operator (12), Lemma 4.1, and Corollary 4.1,

 |A(Q)−A(P)| =|(Q−K(Q)HQ)−(P−K(P)HP)| =|Q−P+K(P)HP−K(Q)HQ| ≤|Q−P|(1+c|Q|+c|P|+c2|P||Q||).\endproof
###### Remark \thetheorem

The choice of in in the proof of Lemma 4.1 was made to preserve symmetry. Swapping the roles of and in the proof gives a sharper, but a more complicated bound

 |A(Q)−A(P)| ≤|Q−P|(1+c|Q|+c|P|+c2⋅min{|P|2,|P||Q|,|Q|2}).

Instead of setting above, we can get a well-known sharper pointwise bound on directly from (11). The proof is written in a way suitable for our generalization.

{lemma}

If and , then

 0≤A(Q)≤Q, (31)

and

 |A(Q)|≤|Q|. (32)
{proof}

By the definition of operator in (11),

 A(Q)=Q−QH∗(HQH∗+R)−1HQ≤Q,

because . It remains to show that . Note that for any , since ,

 A∗A+R≥A∗A,

and, consequently,

 (A∗A+R)−1/2A∗A(A∗A+R)−1/2≤I.

Since for , is the same as the spectral radius , and, for any and , , we have

 A(A∗A+R)−1A∗≤I. (33)

Using (33) with gives

 Q1/2H∗(HQH∗+R)−1HQ1/2≤I,

and, consequently

 QH∗(HQH∗+R)−1HQ≤Q,

which gives

Since and are symmetric, (31) implies (32). \endproof

The pointwise continuity of operator follows from Lemma 4.1 as well.

{lemma}

If , and , then,

 |B(X,Q)−B(Y,P)| ≤|X−Y|(1+c|Q|)+|Q−P||H||R−1|(1+c|P|)(|d−HY|). (34)
{proof}

Estimating the difference, and using the pointwise bound on the Kalman gain (28) and the pointwise continuity of the Kalman gain from Lemma 4.1,

 |B(X,Q) −B(Y,P)| =|(X+K(Q)(d−HX))−(Y+K(P)(d−HY))| =|X−Y+K(Q)(HY−HX)+(K(Q)−K(P))(d−HY)| ≤|X−Y|+c|Q||X−Y|+|Q−P||H||R−1|(1+c|P|)(|d−HY|)

which is (34). \endproof

### 4.2 Lp bounds

Using the pointwise estimate on the continuity of operator , we can now estimate its continuity in spaces of random vectors. We will only need the result with one of the arguments random and the other one constant (i.e., non random), which simplifies its statement and proof. This is because the application of these estimates will be the ensemble sample covariance, which is random, and the state covariance, which is constant. In a similar way we will estimate the continuity of operator in .

{lemma}

Let be a random operator such that almost surely (a.s.), let be constant, and let . Then, for all ,

 ∥A(Q)−A(P)∥p≤(1+c|P|)∥Q−P∥p+(c+c2|P|)∥Q∥2p∥Q−P∥2p. (35)

Proof. From Lemma 4.1, the triangle inequality, Lemma 2.3, and recognizing that is constant,

 ∥A(Q)−A(P)∥p ≤∥|Q−P|(1+c|Q|+c|P|+c2|P||Q|)∥p ≤∥Q−P∥p+c∥Q∥2p∥Q−P∥2p +c|P|∥Q−P∥p+c2|P|∥Q∥2p∥Q−P∥2p =(1+c|P|)∥Q−P∥p+(c+c2|P|)∥Q∥2p∥Q−P∥2p.\endproof

Instead of setting above, we get a better bound on directly.

{lemma}

Let be a random operator such that a.s., and . Then, for all ,

 ∥A(Q)∥p≤∥Q∥p. (36)
{proof}

From (32), it follows that \endproof

Using the pointwise estimate on the continuity of the operator , we estimate its continuity in spaces of random vectors. Again, we keep the arguments of one of the terms constant, which is all we will need, resulting in a simplification.

{lemma}

Let and be random operators, and and be constant. Let a.s., , and . Then, for all ,

 ∥B(X,Q)−B(Y,P)∥p ≤∥X−Y∥p+c∥Q∥2p∥X−Y∥2p +∥Q−P∥p|H||R−1|(1+c|P|)(|d−HY|).
{proof}

Applying the norm to the pointwise bound (34), using the triangle inequality, recognizing that and are constant, and applying the Cauchy-Schwarz inequality (Lemma 2.3), we get

 ∥B(X,Q)−B(Y,P)∥p =∥|X−Y|(1+c|Q|)+|Q−P||H||R−1|(1+c|P|)|d−HY|∥p ≤∥X−Y∥p+c∥Q∥2p∥X−Y∥2p +∥Q−P∥p|H||R−1|(1+c|P|)(|d−HY|).\endproof

## 5 Lp laws of large numbers

Similarly as in [22, 24], we will work with convergence in all , . To prove convergence of the initial ensemble mean and covariance to the mean and covariance of the background distribution, we need the corresponding laws of large numbers.

### 5.1 Lp convergence of the sample mean

The law of large numbers for i.i.d. is classical:

 ∥EN(XN)−E(X1)∥2≤1√N∥X1−E(X1)∥2≤2√N∥X1∥2. (37)

The proof relies on the expansion (assuming without loss of generality),

 ∥EN(XN)∥22 =E(⟨1NN∑i=1Xi,1NN∑i=1Xi⟩)=1N2N∑i=1N∑j=1E(⟨Xi,Xj⟩) (38) =1N2N∑i=1E(⟨Xi,Xi⟩)=1NE(⟨X1,X1⟩)=1N∥X1∥22,

which yields . To obtain laws of large numbers for , the equality (38) needs to be replaced by the following.

{lemma}

(Marcinkiewicz-Zygmund inequality) If and , and , , then

 E(∣∣ ∣∣N∑i=1Yi∣∣ ∣∣p)≤BpE(N∑i=1|Yi|2)p/2, (39)

where depends on only.

{proof}

For the finite dimensional case, see [26] or [9, p. 367]. In the infinitely dimensional case, note that a separable Hilbert space is a Banach space of Rademacher type 2 [5, p. 159], which implies (39) for any [35, Proposition 2.1]. All infinitely dimensional separable Hilbert spaces are isometric, so the same works for any of them. \endproof

The Marcinkiewicz-Zygmund inequality begets a variant of the weak law of large numbers in norms, similarly as in [9, Corollary 2, page 368]. Note that the Marcinkiewicz-Zygmund inequality does not hold in general Banach spaces, and in fact it characterizes Banach spaces of Rademacher type 2 [35, Proposition 2.1], so it is important that or is a separable Hilbert space, as assumed throughout.

{theorem}

Let and be i.i.d. Then,

 ∥EN(XN)−E(X1)∥p≤Cp√N∥X1−E(X1)∥p≤2Cp√N∥X1∥p, (40)

where depends on only.

{proof}

If , the statement becomes (37). Let , and consider the case . By Hölder’s inequality with the conjugate exponents and ,

 N∑i=1|Xi|2 =N∑i=11(|Xi|2)≤(N∑i=1(1p/p−2))(p−2)/p(N∑i=1(|Xi|2)p/2)2/p =N(p−2)/p(N∑i=1|Xi|p)2