
# A Rigorous Theory of Conditional Mean Embeddings

## Abstract

Conditional mean embeddings (CMEs) have proven to be a powerful tool in many machine learning applications. They allow the efficient conditioning of probability distributions within the corresponding reproducing kernel Hilbert spaces (RKHSs) by providing a linear-algebraic relation for the kernel mean embeddings of the respective probability distributions. Both centred and uncentred covariance operators have been used to define CMEs in the existing literature. In this paper, we develop a mathematically rigorous theory for both variants, discuss the merits and problems of each, and significantly weaken the conditions for applicability of CMEs. In the course of this, we demonstrate a beautiful connection to Gaussian conditioning in Hilbert spaces.

Keywords: conditional mean embedding, Gaussian measure, reproducing kernel Hilbert space.

2010 Mathematics Subject Classification: 46E22, 62J02, 28C20.


ZIB: Zuse Institute Berlin, Takustraße 7, 14195 Berlin, Germany. Zalando: Zalando SE, 11501 Berlin, Germany. FUB: Freie Universität Berlin, Arnimallee 6, 14195 Berlin, Germany.

## 1 Introduction

Reproducing kernel Hilbert spaces (RKHSs) have long been popular tools in machine learning because of the powerful property — often called the “kernel trick” — that many problems posed in terms of the base set $\mathcal{X}$ of the RKHS $\mathcal{H}$ (e.g. classification into two or more classes) become linear-algebraic problems in $\mathcal{H}$ under the embedding of $\mathcal{X}$ into $\mathcal{H}$ induced by the reproducing kernel $k$. This insight has been used to define the kernel mean embedding (KME; Berlinet and Thomas-Agnan, 2004; Smola et al., 2007) of an $\mathcal{X}$-valued random variable $X$ as the $\mathcal{H}$-valued mean of the embedded random variable $\varphi(X)$, and also the conditional mean embedding (CME; Fukumizu et al., 2004; Song et al., 2009), which seeks to perform conditioning of the original random variable through application of the Gaussian conditioning formula (also known as the Kálmán update) to the embedded non-Gaussian random variable $\varphi(X)$. This article aims to provide rigorous mathematical foundations for this attractive but apparently naïve approach to conditional probability, and hence to Bayesian inference.

To be more precise, let us fix two RKHSs $\mathcal{H}$ and $\mathcal{G}$ over $\mathcal{X}$ and $\mathcal{Y}$ respectively, with reproducing kernels $k$ and $\ell$ and canonical feature maps $\varphi$ and $\psi$. Let $X$ and $Y$ be random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$ respectively, and let $\mu_X$, $\mu_Y$, and $\mu_{Y|X=x}$ denote the kernel mean embeddings (KMEs) of the distributions of $X$, of $Y$, and of $Y$ given $X = x$, given by

$$\mu_X := \mathbb{E}[\varphi(X)] \in \mathcal{H}, \qquad \mu_Y := \mathbb{E}[\psi(Y)] \in \mathcal{G}, \qquad \mu_{Y|X=x} := \mathbb{E}[\psi(Y) \mid X = x] \in \mathcal{G}.$$

The conditional mean embedding (CME) offers a way to perform conditioning of probability distributions in the corresponding feature spaces $\mathcal{H}$ and $\mathcal{G}$, where it becomes a linear-algebraic transformation (Figure 1.1). Under the assumptions that $\mathbb{E}[g(Y) \mid X = \cdot\,]$ is an element of $\mathcal{H}$ for every $g \in \mathcal{G}$ and that $C_X$ is invertible, the well-known formula for the CME is given by

$$\mu_{Y|X=x} = C_{YX} C_X^{-1} \varphi(x), \qquad x \in \mathcal{X}. \qquad (1.1)$$

Here, $C_{YX}$ and $C_X$ denote the kernel cross-covariance and covariance operators defined in (2.3). Note that there are in fact two theories of CMEs, one working with centred covariance operators (Fukumizu et al., 2004; Song et al., 2009) and the other with uncentred ones (Fukumizu et al., 2013). We will discuss both theories in detail, but let us focus for a moment on the centred case for which the above formula was originally derived (Song et al., 2009, Theorem 4).

In the trivial case where $X$ and $Y$ are independent, the CME should yield $\mu_{Y|X=x} = \mu_Y$. However, independence implies that $C_{YX} = 0$, and so (1.1) yields $\mu_{Y|X=x} = 0$, regardless of $\mu_Y$. In order to understand what has gone wrong it is helpful to consider in turn the two cases in which the constant function $\mathbb{1}$ is, or is not, an element of $\mathcal{H}$.

• If $\mathbb{1} \in \mathcal{H}$, then $C_X$ cannot be injective, since $C_X \mathbb{1} = 0$, and (1.1) is not applicable.

• If $\mathbb{1} \notin \mathcal{H}$ and $X$ and $Y$ are independent, then the assumption $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}$ cannot be fulfilled, since the conditional expectation is then the constant function $\mathbb{E}[g(Y)]$ (except for those special elements $g \in \mathcal{G}$ for which $\mathbb{E}[g(Y)] = 0$).

In summary, (1.1) is never applicable for independent random variables except in certain degenerate cases. Note that this problem does not occur in the case of uncentred operators, where ${}^{u}C_X$ is typically injective.

Therefore, this paper aims to provide a rigorous theory of CMEs that addresses not only the above-mentioned pathology but also substantially generalises the assumptions under which CME can be performed. We will treat both centred and uncentred (cross-)covariance operators, with particular emphasis on the centred case, and will also exhibit a connection to Gaussian conditioning in general Hilbert spaces.

1. The standard assumption (Assumption A) for CME is rather restrictive. For example, it does not hold in the case of independent random variables and Gaussian kernels; see Counterexample B.6. We show in Section 4 that this assumption can be significantly weakened in the case of centred kernel (cross-)covariance operators as defined in (2.3): only $\mathbb{E}[g(Y) \mid X = \cdot\,]$ shifted by some constant function needs to lie in $\mathcal{H}$ (Assumption B). In this setting, the correct expression of the CME formula is

$$\mu_{Y|X=x} = \mu_Y + (C_X^{\dagger} C_{XY})^{*} (\varphi(x) - \mu_X) \qquad \text{for } P_X\text{-a.e. } x \in \mathcal{X}, \qquad (1.2)$$

where $A^{*}$ denotes the adjoint and $A^{\dagger}$ the Moore–Penrose pseudo-inverse of a linear operator $A$. As a first sanity check, note that this formula indeed yields $\mu_{Y|X=x} = \mu_Y$ when $X$ and $Y$ are independent. Similarly, as shown in Section 5, for uncentred kernel (cross-)covariance operators as defined in (2.5), the more general CME formula is

$$\mu_{Y|X=x} = ({}^{u}C_X^{\dagger}\, {}^{u}C_{XY})^{*}\, \varphi(x) \qquad \text{for } P_X\text{-a.e. } x \in \mathcal{X}. \qquad (1.3)$$
2. Furthermore, the assumption $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}$ is hard to check in most applications. To the best of our knowledge, the only verifiable condition that supposedly implies this assumption is given by Fukumizu et al. (2004, Proposition 4). However, this implication turns out to be incorrect; see Counterexamples B.6 and B.5. We will present weaker assumptions (Assumptions A∗ and B∗) for the applicability of CMEs which hold whenever the kernel $k$ is characteristic. Characteristic kernels are well studied (see e.g. Sriperumbudur et al. (2010)) and therefore provide a verifiable condition as desired.

3. The applicability of (1.1) requires the additional assumptions that $C_X$ is injective and that $\varphi(x)$ lies in the range of $C_X$, which is also hard to verify in practice. We show that both assumptions can be avoided completely by replacing the inverse $C_X^{-1}$ in (1.1) by the pseudo-inverse $C_X^{\dagger}$ in (1.2) and (1.3), and $(C_X^{\dagger} C_{XY})^{*}$ turns out to be a globally-defined and bounded operator under rather weak assumptions (Assumption C).

4. The experienced reader will also observe that, modulo the replacement of $C_X^{-1}$ by $C_X^{\dagger}$, (1.2) is identical to the familiar Sherman–Morrison–Woodbury / Schur complement formula for conditional Gaussian distributions, a connection on which we will elaborate in detail in Section 7. We call particular attention to the fact that the random variable $\varphi(X)$, which has no reason to be normally distributed, behaves very much like a Gaussian random variable in terms of its conditional mean.
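For readers who want the finite-dimensional statement alongside (1.2): for a jointly Gaussian pair of vectors, the conditional mean is given by the Schur-complement formula

```latex
% Conditional mean of a jointly Gaussian pair (U, V):
\begin{pmatrix} U \\ V \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_U \\ \mu_V \end{pmatrix},
  \begin{pmatrix} \Sigma_U & \Sigma_{UV} \\ \Sigma_{UV}^{\top} & \Sigma_V \end{pmatrix}
\right)
\;\Longrightarrow\;
\mathbb{E}[V \mid U = u] \;=\; \mu_V + \Sigma_{UV}^{\top} \Sigma_U^{-1} (u - \mu_U).
```

Substituting $(\mu_X, \mu_Y, C_X, C_{XY})$ for $(\mu_U, \mu_V, \Sigma_U, \Sigma_{UV})$ and the pseudo-inverse for the inverse yields exactly the right-hand side of (1.2).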

###### Remark 1.1.

Note that we stated (1.2) and (1.3) only for $P_X$-a.e. $x \in \mathcal{X}$. This is the best that one can generally hope for, since the regular conditional probability $P_{Y|X=x}$ is uniquely determined only for $P_X$-a.e. $x$ (Kallenberg, 2006, Theorem 5.3). The work on CMEs so far completely ignores the fact that conditioning (especially on events of the form $\{X = x\}$, which typically have probability zero) is not trivial, requires certain assumptions and, in general, yields results only for $P_X$-a.e. $x$.

The rest of the paper is structured as follows. Section 2 establishes the notation and problem setting, and motivates some of the assumptions that are made. Section 3 discusses several critical assumptions for the applicability of the theory of CMEs and the relations among them. Section 4 proceeds to build a rigorous theory of CMEs using centred covariance operators, with the main results being Theorems 4.4 and 4.3, whereas Section 5 does the same for uncentred covariance operators, with the main result being Theorem 5.3. Section 6 reviews the established theory for the conditioning of Gaussian measures on Hilbert spaces, and this is then used in Section 7 to rigorously connect the theory of CMEs to the conditioning of Gaussian measures, with the main result being Theorem 7.1. We give some closing remarks in Section 8. Appendix A contains various auxiliary technical results and Appendix B gives counterexamples to some CME-related results of Fukumizu et al. (2004) and Alpay (2001).
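Although the focus of this paper is theoretical, it may help to recall how CMEs are computed in applications. The following sketch (our own illustration; the helper names, data, and parameter choices are not from this paper) implements the standard regularised empirical estimator of the uncentred CME in the spirit of Song et al. (2009): the weight vector $w(x^{*}) = (K + n\lambda I)^{-1} k_{x^{*}}$ represents the estimated embedding $\hat\mu_{Y|X=x^{*}} = \sum_i w_i\, \psi(y_i)$.

```python
import numpy as np

def rbf(A, B, gamma=10.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * (a - b)^2) for 1-D inputs."""
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)    # samples of X
y = np.sin(np.pi * x)                   # Y depends smoothly on X

K = rbf(x, x)
n, lam = len(x), 1e-3                   # lam is a regularisation parameter (tuning choice)

def cme_weights(x_star):
    """Weights w with  mu_hat_{Y|X=x*} = sum_i w_i * psi(y_i)  (regularised empirical CME)."""
    k_star = rbf(x, np.atleast_1d(x_star))
    return np.linalg.solve(K + n * lam * np.eye(n), k_star).ravel()

# One common downstream use: a point estimate of E[Y | X = x*] from the same weights.
w = cme_weights(0.5)
print(w @ y)   # close to sin(pi * 0.5) = 1, up to regularisation and smoothing error
```

The regularisation $n\lambda I$ plays the role of the pseudo-inversion discussed below: it makes the linear solve well posed even though the kernel matrix is rank-deficient in the limit.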

## 2 Setup and Notation

Throughout this paper, when considering Hilbert-space valued random variables $U$ and $V$ defined over a probability space $(\Omega, \Sigma, \mathbb{P})$, the expected value is meant in the sense of a Bochner integral (Diestel, 1984), as are the uncentred and centred covariance operators

$${}^{u}\operatorname{Cov}[U, V] := \mathbb{E}[U \otimes V], \qquad \operatorname{Cov}[U, V] := \mathbb{E}\big[(U - \mathbb{E}[U]) \otimes (V - \mathbb{E}[V])\big],$$

where, for $u \in \mathcal{H}_1$ and $v \in \mathcal{H}_2$, the outer product $u \otimes v$ is the rank-one linear operator $w \mapsto u \langle v, w \rangle_{\mathcal{H}_2}$. Naturally, we write ${}^{u}\operatorname{Cov}[U]$ and $\operatorname{Cov}[U]$ for ${}^{u}\operatorname{Cov}[U, U]$ and $\operatorname{Cov}[U, U]$ respectively, and all of the above reduces to the usual definitions in the scalar-valued case.
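In the finite-dimensional case these definitions reduce to familiar matrix formulas. The following sketch (purely illustrative, with sample averages standing in for the Bochner integrals) checks the shift identity $\operatorname{Cov}[U, V] = {}^{u}\operatorname{Cov}[U, V] - \mathbb{E}[U] \otimes \mathbb{E}[V]$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(1000, 3))              # 1000 samples of a 3-dimensional U
V = U * np.array([1.0, 2.0, 3.0]) + 5.0     # correlated V with nonzero mean

# Uncentred covariance: sample average of the outer products u_k (x) v_k.
ucov = np.einsum('ki,kj->ij', U, V) / len(U)

# Centred covariance via the identity Cov[U,V] = uCov[U,V] - E[U] (x) E[V].
cov = ucov - np.outer(U.mean(axis=0), V.mean(axis=0))
```

Here $u \otimes v$ is realised as the matrix $u v^{\top}$, matching the convention $w \mapsto u \langle v, w \rangle$ above.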

Our treatment of conditional mean embeddings will operate under the following assumptions and notation:

###### Assumption 2.1.
1. $(\Omega, \Sigma, \mathbb{P})$ is a probability space, $\mathcal{X}$ is a measurable space, and $\mathcal{Y}$ is a Borel space.

2. $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are symmetric and positive definite kernels, such that $k(x, \cdot)$ and $\ell(y, \cdot)$ are measurable functions for each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

3. $\mathcal{H}$ and $\mathcal{G}$ are the corresponding RKHSs, which we assume to be separable. Indeed, according to Owhadi and Scovel (2017), if the base sets $\mathcal{X}$ and $\mathcal{Y}$ are separable absolute Borel spaces or analytic subsets of Polish spaces, then separability of $\mathcal{H}$ and $\mathcal{G}$ follows from the measurability of their respective kernels and feature maps.

4. $\varphi(x) := k(x, \cdot)$ and $\psi(y) := \ell(y, \cdot)$ are the corresponding canonical feature maps. Note that they satisfy the “reproducing properties” $h(x) = \langle h, \varphi(x) \rangle_{\mathcal{H}}$ for $h \in \mathcal{H}$ and $g(y) = \langle g, \psi(y) \rangle_{\mathcal{G}}$ for $g \in \mathcal{G}$, and that they are measurable by Steinwart and Christmann (2008, Lemma 4.25).

5. $X$ and $Y$ are random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$ with distributions $P_X$ and $P_Y$ and joint distribution $P_{(X,Y)}$. Assumption 2.1(1) and Kallenberg (2006, Theorem 5.3) ensure the existence of a $P_X$-a.e.-unique regular version of the conditional probability distribution $P_{Y|X=x}$. We assume that

$$\mathbb{E}\big[\|\varphi(X)\|_{\mathcal{H}}^{2} + \|\psi(Y)\|_{\mathcal{G}}^{2}\big] < \infty \qquad \text{and} \qquad \mathbb{E}\big[\|\psi(Y)\|_{\mathcal{G}}^{2} \mid X = x\big] < \infty \quad \text{for } P_X\text{-a.e. } x \in \mathcal{X}, \qquad (2.1)$$

which guarantees that $\mathcal{H} \subseteq L^2(P_X)$ and $\mathcal{G} \subseteq L^2(P_Y)$, since, by the reproducing property and the Cauchy–Schwarz inequality, for all $h \in \mathcal{H}$,

$$\|h\|_{L^2(P_X)}^{2} \le \int_{\mathcal{X}} \|h\|_{\mathcal{H}}^{2}\, \|\varphi(x)\|_{\mathcal{H}}^{2} \, \mathrm{d}P_X(x) = \mathbb{E}\big[\|\varphi(X)\|_{\mathcal{H}}^{2}\big]\, \|h\|_{\mathcal{H}}^{2} \qquad (2.2)$$

and similarly for $\mathcal{G}$, $P_Y$, and $g \in \mathcal{G}$. It follows from (2.2) that the inclusions $\mathcal{H} \hookrightarrow L^2(P_X)$ and $\mathcal{G} \hookrightarrow L^2(P_Y)$ are bounded linear operators.

6. Observe that $(f_1, f_2) \mapsto \operatorname{Cov}[f_1(X), f_2(X)]$ defines a symmetric and positive-semidefinite bilinear form on $L^2(P_X)$. Since it is invariant under $P_X$-a.s. constant shifts of $f_1$ and $f_2$, and since $\operatorname{Cov}[f(X), f(X)] = 0$ if and only if $f$ is $P_X$-a.s. constant, we can make it positive definite (and thereby an inner product) by considering the quotient space $L^2_C(P_X) := L^2(P_X) / C$, where

$$C := \{ f \in L^2(P_X) \mid f \text{ is constant } P_X\text{-a.s.}\}, \qquad \langle [f_1], [f_2] \rangle_{L^2_C} := \operatorname{Cov}[f_1(X), f_2(X)].$$

Similarly, we define $L^2_C(P_Y)$, and identify $[\mathcal{H}] := \{[h] \mid h \in \mathcal{H}\}$ with a subset of $L^2_C(P_X)$, and likewise $[\mathcal{G}]$ with a subset of $L^2_C(P_Y)$.

7. Since $\varphi$ and $\psi$ are measurable, $(\varphi(X), \psi(Y))$ is a well-defined $(\mathcal{H} \times \mathcal{G})$-valued random variable; (2.1) ensures that it has finite second moment and its mean and covariance have the following block structure:

$$\mathbb{E}\begin{pmatrix} \varphi(X) \\ \psi(Y) \end{pmatrix} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad \operatorname{Cov}\begin{pmatrix} \varphi(X) \\ \psi(Y) \end{pmatrix} = \begin{pmatrix} C_X & C_{XY} \\ C_{YX} & C_Y \end{pmatrix}, \qquad (2.3)$$

where the components

$$\begin{aligned} \mu_X &:= \mathbb{E}[\varphi(X)], & C_X &:= \operatorname{Cov}[\varphi(X)], & C_{XY} &:= \operatorname{Cov}[\varphi(X), \psi(Y)], \\ \mu_Y &:= \mathbb{E}[\psi(Y)], & C_Y &:= \operatorname{Cov}[\psi(Y)], & C_{YX} &:= \operatorname{Cov}[\psi(Y), \varphi(X)] \end{aligned}$$

are called the kernel mean embeddings (KMEs) and kernel (cross-)covariance operators, respectively. Note that $C_{YX} = C_{XY}^{*}$ and that the reproducing properties translate to the KMEs and covariance operators in the following way:

$$\langle h, \mu_X \rangle_{\mathcal{H}} = \mathbb{E}[h(X)], \qquad \langle h, C_X h' \rangle_{\mathcal{H}} = \operatorname{Cov}[h(X), h'(X)], \qquad \langle h, C_{XY}\, g \rangle_{\mathcal{H}} = \operatorname{Cov}[h(X), g(Y)]$$

and so on, for arbitrary $h, h' \in \mathcal{H}$ and $g \in \mathcal{G}$. We are further interested in the conditional kernel mean embedding and the conditional kernel covariance operator given by

$$\mu_{Y|X=x} = \mathbb{E}[\psi(Y) \mid X = x], \qquad C_{Y|X=x} = \operatorname{Cov}[\psi(Y) \mid X = x]. \qquad (2.4)$$

Similarly, $(\varphi(X), \psi(Y))$ has the uncentred kernel covariance structure

$${}^{u}\operatorname{Cov}\begin{pmatrix} \varphi(X) \\ \psi(Y) \end{pmatrix} = \begin{pmatrix} {}^{u}C_X & {}^{u}C_{XY} \\ {}^{u}C_{YX} & {}^{u}C_Y \end{pmatrix}, \qquad (2.5)$$

where ${}^{u}C_X := {}^{u}\operatorname{Cov}[\varphi(X)]$ etc. Note that, for $h, h' \in \mathcal{H}$, $\langle h, {}^{u}C_X h' \rangle_{\mathcal{H}} = \mathbb{E}[h(X)\, h'(X)]$, and similarly for functions of $Y$.

8. For $g \in \mathcal{G}$ we let $f_g(x) := \mathbb{E}[g(Y) \mid X = x]$. These functions will be of particular importance since, for $y \in \mathcal{Y}$ and $g = \psi(y)$, we obtain $f_{\psi(y)}(x) = \mu_{Y|X=x}(y)$, our main object of interest. Note that $f_g \in L^2(P_X)$ since, by (2.1), (2.2), and the law of total expectation,

$$\|f_g\|_{L^2(P_X)}^{2} = \mathbb{E}[f_g(X)^2] = \mathbb{E}\big[\mathbb{E}[g(Y) \mid X]^2\big] \le \mathbb{E}\big[\mathbb{E}[g(Y)^2 \mid X]\big] = \mathbb{E}[g(Y)^2] = \|g\|_{L^2(P_Y)}^{2} < \infty,$$

and that, again by the law of total expectation,

$$\mathbb{E}[f_g(X)] = \mathbb{E}[g(Y)], \qquad \mathbb{E}[f_{\psi(y)}(X)] = \mu_Y(y). \qquad (2.6)$$
9. For any linear operator $A \colon \mathcal{H} \to \mathcal{G}$ between Hilbert spaces, $A^{\dagger}$ denotes its Moore–Penrose pseudo-inverse, i.e. the unique linear extension of $(A|_{\ker(A)^{\perp}})^{-1}$ to a (possibly unbounded) linear operator defined on $\operatorname{ran}(A) \oplus \operatorname{ran}(A)^{\perp}$ and such that $\ker(A^{\dagger}) = \operatorname{ran}(A)^{\perp}$.
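For a compact, self-adjoint, positive operator such as $C_X$, this definition takes a concrete spectral form, which may help fix ideas:

```latex
% Spectral form of the Moore--Penrose pseudo-inverse:
% if C_X = \sum_i \lambda_i \, e_i \otimes e_i with orthonormal eigenvectors e_i, then
C_X^{\dagger} h
  \;=\; \sum_{i \colon \lambda_i > 0} \lambda_i^{-1} \, \langle e_i, h \rangle_{\mathcal{H}} \, e_i ,
\qquad h \in \operatorname{ran}(C_X) \oplus \operatorname{ran}(C_X)^{\perp} .
```

In particular, $C_X^{\dagger}$ is unbounded whenever $C_X$ has infinitely many nonzero eigenvalues, since then $\lambda_i \to 0$.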

###### Remark 2.2.

Measurability of $k$ and $\ell$ together with the separability of $\mathcal{H}$ and $\mathcal{G}$ guarantees the measurability of $\varphi$ and $\psi$ (Steinwart and Christmann, 2008, Lemma 4.25). Separability of $\mathcal{H}$ and $\mathcal{G}$ is also needed for Gaussian conditioning (see Owhadi and Scovel (2018) and Section 6), for the existence of a countable orthonormal basis of $\mathcal{H}$, and to ensure that weak (Pettis) and strong (Bochner) measurability of Hilbert-valued random variables coincide.

## 3 The Crucial Assumptions for CMEs

This section discusses the several versions of the key assumption under which we are going to prove corresponding versions of the CME formula.

###### Assumption A.

For all $g \in \mathcal{G}$ we have $f_g \in \mathcal{H}$.

###### Assumption B.

For all $g \in \mathcal{G}$ there exist a function $h_g \in \mathcal{H}$ and a constant $c_g \in \mathbb{R}$ such that $f_g = h_g + c_g$ holds $P_X$-a.e. in $\mathcal{X}$.

###### Assumption C.

For all $g \in \mathcal{G}$ there exists a function $h_g \in \mathcal{H}$ such that

$$\operatorname{Cov}[h_g(X) - f_g(X),\, h(X)] = 0 \qquad \text{for all } h \in \mathcal{H}.$$

In this case we denote $c_g := \mathbb{E}[f_g(X) - h_g(X)]$ (in conformity with Assumption B).

###### Assumption uC.

For all $g \in \mathcal{G}$ there exists a function $h_g \in \mathcal{H}$ such that

$${}^{u}\operatorname{Cov}[h_g(X) - f_g(X),\, h(X)] = \langle h_g - f_g, h \rangle_{L^2(P_X)} = 0 \qquad \text{for all } h \in \mathcal{H}.$$
###### Remark 3.1.

Note that A ⟹ B ⟹ C, that A ⟹ uC, and that C ⟹ B if $[\mathcal{H}]$ is dense in $L^2_C(P_X)$. In terms of the spaces $L^2_C(P_X)$ and $[\mathcal{H}]$, Assumptions A–uC can be reformulated as follows:

• $f_g \in \mathcal{H}$ for Assumption A.

• $[f_g] \in [\mathcal{H}]$ for Assumption B.

• The orthogonal projection of $[f_g]$ onto $\overline{[\mathcal{H}]}$ lies in $[\mathcal{H}]$ for Assumption C.

• The orthogonal projection of $f_g$ onto $\overline{\mathcal{H}}$ lies in $\mathcal{H}$ for Assumption uC.

In contrast to Assumption A, Assumptions B and C do not require the unfavourable property $\mathbb{1} \in \mathcal{H}$ for independent random variables $X$ and $Y$. Instead, this case reduces to the trivial choice $h_g = 0$. At the same time, the proofs of the key properties of CMEs are not affected by replacing Assumption A with Assumption B as long as we work with centred operators (see Theorems 4.3 and 4.1 below). Therefore, it is surprising that this modification has not been considered earlier, even though the issues with independent random variables have been observed before (Fukumizu et al., 2013). One reason might be that, instead of centred (cross-)covariance operators, researchers started using uncentred ones, for which such a modification is not feasible.
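To see why independence becomes trivial under Assumption B, note that the conditional expectation is then constant:

```latex
f_g(x) \;=\; \mathbb{E}[g(Y) \mid X = x] \;=\; \mathbb{E}[g(Y)]
\qquad \text{for } P_X\text{-a.e. } x \in \mathcal{X} ,
```

so Assumption B holds with $h_g = 0$ and $c_g = \mathbb{E}[g(Y)]$, and, since then $C_{XY} = 0$, formula (1.2) correctly returns $\mu_{Y|X=x} = \mu_Y$.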

Assumption C, on the other hand, is not strong enough for proving the main formula for CMEs (the last statement of Theorem 4.3). Clearly, this cannot be expected: if the RKHS $\mathcal{H}$ is not rich enough, e.g. $\mathcal{H} = \{0\}$ or $\mathcal{H} = \operatorname{span}\{\mathbb{1}\}$, and $\mathcal{G}$ is reasonably large, then no map from $\mathcal{H}$ to $\mathcal{G}$ can cover sufficiently many kernel mean embeddings, in particular the embeddings of the conditional probabilities $P_{Y|X=x}$ for the various $x \in \mathcal{X}$ (while Assumption C is trivially fulfilled for $\mathcal{H} = \{0\}$ or $\mathcal{H} = \operatorname{span}\{\mathbb{1}\}$). The weakness of Assumption C lies in the fact that it only requires the vanishing of the orthogonal projection of $[f_g - h_g]$ onto $[\mathcal{H}]$. Only if $\mathcal{H}$ is rich enough (e.g. if $[\mathcal{H}]$ is dense in $L^2_C(P_X)$) can this condition have useful implications.

While it is nice to have a weaker form of Assumption A, Assumptions A, B and C remain hard to check in practice. Fukumizu et al. (2004, Proposition 4) provide a condition that is sufficient for Assumption A and often easier to check, but unfortunately it is incorrect; see Counterexamples B.6 and B.5 in Appendix B. Since characteristic kernels are well studied in the literature, Lemma A.3 gives hope for a verifiable condition for the applicability of CMEs: it states that $[\mathcal{H}]$ is dense in $L^2_C(P_X)$ whenever the kernel $k$ is characteristic. So, if the denseness of $[\mathcal{H}]$ in $L^2_C(P_X)$ were sufficient for performing CMEs, then the condition that $k$ be characteristic would be sufficient as well, thus providing a favourable criterion for the applicability of formula (1.2). Unfortunately, neither condition implies Assumption B. Therefore, we will consider the following slightly weaker versions of Assumptions A and B, under which conditional mean embeddings can be performed if one allows for certain finite-rank approximations of the (cross-)covariance operators:

###### Assumption A∗.

For each $g \in \mathcal{G}$ we have $f_g \in \overline{\mathcal{H}}$, the closure of $\mathcal{H}$ in $L^2(P_X)$.

###### Assumption B∗.

For each $g \in \mathcal{G}$ we have $[f_g] \in \overline{[\mathcal{H}]}$, the closure of $[\mathcal{H}]$ in $L^2_C(P_X)$.

Note that Assumptions C and uC have no weaker versions, since they would become trivial if $\mathcal{H}$ were replaced by $\overline{\mathcal{H}}$ and $[\mathcal{H}]$ by $\overline{[\mathcal{H}]}$, respectively. In summary, we consider the hierarchy of assumptions illustrated in Figure 3.1. The main contributions of this paper are rigorous proofs of three versions of the CME formula under various assumptions.

Whether an analogue of Theorem 4.4 for the uncentred case can be proven under Assumption A∗ remains an open problem.

## 4 Theory for Centred Operators

In this section we will formulate and prove two versions of the CME formula (1.2) — the original one under Assumption B and a weaker version involving finite-rank approximations of the (cross-)covariance operators under Assumption B∗. The following theorem demonstrates the importance of Assumption C (which follows from Assumption B). It implies that the range of $C_{XY}$ is contained in that of $C_X$, making the operator $C_X^{\dagger} C_{XY}$ well defined. By Theorem A.1 it is even a bounded operator, which is a non-trivial result requiring the application of the closed graph theorem.

Similar considerations cannot be performed, in general, under Assumption B∗ alone: it can no longer be expected that $\operatorname{ran}(C_{XY}) \subseteq \operatorname{ran}(C_X)$, which is why we have to introduce the above-mentioned finite-rank approximations in order to guarantee that $\operatorname{ran}(C^{(n)}_{XY}) \subseteq \operatorname{ran}(C^{(n)}_X)$.

In summary, Assumption B allows for the simple CME formula (1.2) by Theorem 4.1, while under Assumption B∗ we have to make a detour using certain approximations. Note that this distinction is very similar to the theory of Gaussian conditioning in Hilbert spaces introduced by Owhadi and Scovel (2018) and recapped in Section 6 below, a connection that will be elaborated upon in detail in Section 7.

###### Theorem 4.1.

Under Assumption 2.1, the following statements are equivalent:

1. Assumption C holds.

2. For each $g \in \mathcal{G}$ there exists $h_g \in \mathcal{H}$ such that $C_X h_g = C_{XY}\, g$.

3. $\operatorname{ran}(C_{XY}) \subseteq \operatorname{ran}(C_X)$.

###### Proof.

Note that 3 is just a reformulation of 2, so we only have to prove 1 ⟺ 2. Let $g \in \mathcal{G}$ and $h_g \in \mathcal{H}$. By Lemma A.5, $\operatorname{Cov}[h(X), f_g(X)] = \langle h, C_{XY}\, g \rangle_{\mathcal{H}}$ for all $h \in \mathcal{H}$, and so

$$\operatorname{Cov}[h(X), h_g(X)] = \operatorname{Cov}[h(X), f_g(X)] \;\; \forall h \in \mathcal{H} \iff \langle h, C_X h_g \rangle_{\mathcal{H}} = \langle h, C_{XY}\, g \rangle_{\mathcal{H}} \;\; \forall h \in \mathcal{H} \iff C_X h_g = C_{XY}\, g,$$

which completes the proof. ∎

Note that Assumption C implies that $[h_g]$ is the orthogonal projection of $[f_g]$ onto $[\mathcal{H}]$ with respect to $\langle \cdot, \cdot \rangle_{L^2_C}$ (see the reformulation of Assumption C in Remark 3.1). Therefore, there might be some ambiguity in the choice of $h_g$ if $\mathcal{H}$ contains constant functions. However, there is a particular choice of $h_g$ that always works:

###### Proposition 4.2.

Under Assumption 2.1, if Assumption B or Assumption C holds, then $h_g$ may be chosen as

$$h_g = C_X^{\dagger} C_{XY}\, g. \qquad (4.1)$$

More precisely, if Assumption C holds, then $\operatorname{Cov}[h(X), (C_X^{\dagger} C_{XY}\, g)(X)] = \operatorname{Cov}[h(X), f_g(X)]$ for all $h \in \mathcal{H}$ and $g \in \mathcal{G}$; and if Assumption B holds, then for all $g \in \mathcal{G}$ there exists a constant $c_g \in \mathbb{R}$ such that $f_g = C_X^{\dagger} C_{XY}\, g + c_g$ holds $P_X$-almost everywhere.

###### Proof.

By Theorem 4.1, (4.1) is well defined. Under Assumption C, for all $h \in \mathcal{H}$ and $g \in \mathcal{G}$,

$$\operatorname{Cov}\big[h(X), (C_X^{\dagger} C_{XY}\, g)(X)\big] = \langle h, C_X C_X^{\dagger} C_{XY}\, g \rangle_{\mathcal{H}} = \langle h, C_{XY}\, g \rangle_{\mathcal{H}} = \operatorname{Cov}[h(X), f_g(X)]$$

by Lemma A.5.

Under Assumption B, for all $g \in \mathcal{G}$, there exist a function $h_g \in \mathcal{H}$ and a constant $c_g \in \mathbb{R}$ such that, $P_X$-a.e. in $\mathcal{X}$, $f_g = h_g + c_g$. Theorem 4.1 implies that $C_X h_g = C_{XY}\, g$, and so Lemma A.4 implies that $h_g - C_X^{\dagger} C_{XY}\, g$ is constant $P_X$-a.e. Therefore $f_g - C_X^{\dagger} C_{XY}\, g$ is constant $P_X$-a.e. ∎

We now give our first main result, the rigorous statement of the CME formula for centred (cross-)covariance operators. In fact, we give two results: a “weak” result (4.2) in which the CME, as a function of $x$, holds only when tested against elements of $\mathcal{H}$ in the $L^2(P_X)$ inner product, and a “strong” result (4.3), an equality holding for $P_X$-a.e. $x$.

###### Theorem 4.3 (Centred CME).

Under Assumptions 2.1 and C, $(C_X^{\dagger} C_{XY})^{*}$ is a bounded operator and, for all $h \in \mathcal{H}$ and $y \in \mathcal{Y}$,

$$\langle h, \mu_{Y|X=\cdot}(y) \rangle_{L^2(P_X)} = \big\langle h, \big(\mu_Y + (C_X^{\dagger} C_{XY})^{*}(\varphi(\cdot) - \mu_X)\big)(y) \big\rangle_{L^2(P_X)}. \qquad (4.2)$$

If, in addition, one of the following conditions holds:

1. the kernel $k$ is characteristic, or

2. $[\mathcal{H}]$ is dense in $L^2_C(P_X)$, or

3. Assumption B holds, or

4. for each $y \in \mathcal{Y}$, $f_{\psi(y)} - h_{\psi(y)}$ is $P_X$-a.s. constant,

then, for $P_X$-a.e. $x \in \mathcal{X}$,

$$\mu_{Y|X=x} = \mu_Y + (C_X^{\dagger} C_{XY})^{*}(\varphi(x) - \mu_X). \qquad (4.3)$$
###### Proof.

Theorems A.1 and 4.1 imply that the operator $(C_X^{\dagger} C_{XY})^{*}$ is well defined and bounded and that, for each $g \in \mathcal{G}$, we may choose the function $h_g$ in Assumptions B and C to be $h_g = C_X^{\dagger} C_{XY}\, g$ (by Proposition 4.2). Now (2.6), Lemma A.6, and the definition of $c_{\psi(y)}$ (see Assumption C) yield that, for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$,

$$h_{\psi(y)}(x) + c_{\psi(y)} = \big(\mu_Y + (C_X^{\dagger} C_{XY})^{*}(\varphi(x) - \mu_X)\big)(y). \qquad (4.4)$$

This yields (4.2) for each $h \in \mathcal{H}$ via

$$\begin{aligned} \big\langle h, \big(\mu_{Y|X=\cdot} - \mu_Y - (C_X^{\dagger} C_{XY})^{*}(\varphi(\cdot) - \mu_X)\big)(y) \big\rangle_{L^2(P_X)} &= \langle h, f_{\psi(y)} - h_{\psi(y)} - c_{\psi(y)} \rangle_{L^2(P_X)} \\ &= \underbrace{\operatorname{Cov}\big[h(X), (f_{\psi(y)} - h_{\psi(y)})(X)\big]}_{=\,0} + \mathbb{E}[h(X)]\, \underbrace{\big(\mathbb{E}[(f_{\psi(y)} - h_{\psi(y)})(X)] - c_{\psi(y)}\big)}_{=\,0} = 0. \end{aligned}$$

If 1 or 2 holds (note that, by Lemma A.3, 1 ⟹ 2), then (4.3) follows directly. If 3 or 4 holds, then (4.3) can be obtained from

$$\mu_{Y|X=x}(y) = \mathbb{E}[\ell(y, Y) \mid X = x] = f_{\psi(y)}(x) \stackrel{(\ast)}{=} h_{\psi(y)}(x) + c_{\psi(y)} = \big(\mu_Y + (C_X^{\dagger} C_{XY})^{*}(\varphi(x) - \mu_X)\big)(y),$$

where the last equality follows from (4.4). ∎

Note that step $(\ast)$ in the proof of Theorem 4.3 genuinely requires that $f_{\psi(y)} = h_{\psi(y)} + c_{\psi(y)}$ holds $P_X$-a.e. (which follows from Assumption B), and Assumption C alone does not suffice. Again we see that $\mathcal{H}$ needs to be rich enough. The reason that we get (4.2) in terms of the inner product of $L^2(P_X)$, and not its weaker version in $L^2_C(P_X)$, is that we took care of the shifting constant $c_{\psi(y)}$.

Motivated by the theory of Gaussian conditioning in Hilbert spaces (Owhadi and Scovel, 2018) presented in Section 6, and Theorem 6.2 in particular, we hope to generalise CMEs to the case where $\operatorname{ran}(C_{XY}) \subseteq \operatorname{ran}(C_X)$ (i.e., by Theorem 4.1, Assumption C) does not necessarily hold. As mentioned above, this will require us to work with certain finite-rank approximations of the operators $C_X$ and $C_{XY}$. We are still going to need some assumption that guarantees that $\mathcal{H}$ is rich enough to be able to perform the conditioning process in the RKHSs. For this purpose Assumption B will be replaced by its weaker version B∗.

###### Theorem 4.4 (Centred CME under finite-rank approximation).

Let Assumption 2.1 hold. Further, let $(e_j)_{j \in \mathbb{N}}$ be a complete orthonormal system of $\mathcal{H}$ that is an eigenbasis of $C_X$, let $n \in \mathbb{N}$, let $\mathcal{H}_n := \operatorname{span}\{e_1, \dots, e_n\}$, let $P_n$ be the orthogonal projection onto $\mathcal{H}_n$, and let

$$C^{(n)}_X := P_n C_X P_n, \qquad C^{(n)}_{XY} := P_n C_{XY}.$$

Then $\operatorname{ran}(C^{(n)}_{XY}) \subseteq \operatorname{ran}(C^{(n)}_X)$, and therefore $C^{(n)\dagger}_X C^{(n)}_{XY}$ is well defined, for each $n \in \mathbb{N}$. For each $h \in \mathcal{H}$ and $y \in \mathcal{Y}$,

$$\langle h, \mu_{Y|X=\cdot}(y) \rangle_{L^2(P_X)} = \lim_{n \to \infty} \big\langle h, \big(\mu_Y + (C^{(n)\dagger}_X C^{(n)}_{XY})^{*}(\varphi(\cdot) - \mu_X)\big)(y) \big\rangle_{L^2(P_X)}. \qquad (4.5)$$

If, in addition, one of the following conditions holds:

1. the kernel $k$ is characteristic, or

2. $[\mathcal{H}]$ is dense in $L^2_C(P_X)$,

then, for $P_X$-a.e. $x \in \mathcal{X}$,

$$\mu_{Y|X=x} = \mu_Y + \lim_{n \to \infty} (C^{(n)\dagger}_X C^{(n)}_{XY})^{*}(\varphi(x) - \mu_X). \qquad (4.6)$$
###### Proof.

Note that, since $C_X$ is a trace-class operator, so is $C^{(n)}_X$. Furthermore, by Baker (1973, Theorem 1), $C_{XY} = C_X^{1/2} W C_Y^{1/2}$ for some bounded operator $W$. Since $C^{(n)}_X$ has finite rank, this implies that $\operatorname{ran}(C^{(n)}_{XY}) \subseteq \operatorname{ran}(C^{(n)}_X)$. Similarly to the proof of Theorem 4.3, we define $h^{(n)}_g := C^{(n)\dagger}_X C^{(n)}_{XY}\, g$ for $g \in \mathcal{G}$ and obtain by (2.6) and Lemma A.6, for $x \in \mathcal{X}$, $y \in \mathcal{Y}$ and $n \in \mathbb{N}$, that

$$h^{(n)}_{\psi(y)}(x) + c^{(n)}_{\psi(y)} = \big(\mu_Y + (C^{(n)\dagger}_X C^{(n)}_{XY})^{*}(\varphi(x) - \mu_X)\big)(y). \qquad (4.7)$$

Identity (4.5) can be obtained similarly to (4.2), except that we additionally need to establish the required convergence for all $h \in \mathcal{H}$, as proved in Lemma A.8(1).

In order to prove (4.6) we first note that $h^{(n)}_g$ is the $C_X$-orthogonal projection of $h_g$ onto $\mathcal{H}_n$ for all $n \in \mathbb{N}$ by Lemma A.8(2). Now let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Since 1 or 2 holds by assumption (note that, by Lemma A.3, 1 ⟹ 2