A Measure-Theoretic Approach to Kernel Conditional Mean Embeddings
We present a new operator-free, measure-theoretic definition of the conditional mean embedding as a random variable taking values in a reproducing kernel Hilbert space. While the kernel mean embedding of marginal distributions has been defined rigorously, the existing operator-based approach to the conditional version lacks a rigorous definition, and depends on strong assumptions that hinder its analysis. Our definition does not impose any of the assumptions that the operator-based counterpart requires. We derive a natural regression interpretation to obtain empirical estimates, and provide a thorough analysis of its properties, including universal consistency. As natural by-products, we obtain the conditional analogues of the Maximum Mean Discrepancy and the Hilbert-Schmidt Independence Criterion, and demonstrate their behaviour via simulations.
The idea of embedding probability distributions into a reproducing kernel Hilbert space (RKHS), a space associated to a positive definite kernel, has received a lot of attention in the past decades (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007), and has found a wealth of successful applications, such as independence testing (Gretton et al., 2008), two-sample testing (Gretton et al., 2012), learning on distributions (Muandet et al., 2012; Lopez-Paz et al., 2015; Szabó et al., 2016), goodness-of-fit testing (Chwialkowski et al., 2016; Liu et al., 2016) and probabilistic programming (Schölkopf et al., 2015; Simon-Gabriel et al., 2016), among others – see review by Muandet et al. (2017). It extends the idea of kernelising linear methods by embedding data points into high- (and often infinite-)dimensional RKHSs, which has been applied, for example, in ridge regression, spectral clustering, support vector machines and principal component analysis among others (Schölkopf and Smola, 2002; Hofmann et al., 2008; Christmann and Steinwart, 2008).
Conditional distributions can also be embedded into RKHSs in a similar manner (Song et al., 2013; Muandet et al., 2017). Compared to marginal distributions, conditional distributions can represent more complicated relations between several random variables, and therefore conditional mean embeddings (CMEs) have the potential to unlock the whole arsenal of kernel mean embeddings to a much wider setting. Indeed, conditional mean embeddings have been applied successfully to dynamical systems (Song et al., 2009), inference on graphical models via belief propagation (Song et al., 2010), probabilistic inference via kernel sum and product rules (Song et al., 2013), reinforcement learning (Grünewälder et al., 2012; Nishiyama et al., 2012), kernelising the Bayes rule and applying it to nonparametric state-space models (Fukumizu et al., 2013) and causal inference (Mitrovic et al., 2018) to name a few.
Despite such progress, the prevalent definition of the conditional mean embedding, based on composing cross-covariance operators (Song et al., 2009), relies on stringent assumptions, which are often violated and hinder its analysis. Klebanov et al. (2019) recently attempted to clarify and weaken some of these assumptions, but strong and hard-to-verify conditions still persist. Grünewälder et al. (2012) provided a regression interpretation, but only the existence of the CME is shown there, without an explicit expression. The main contribution of this paper is to provide a theoretically rigorous, operator-free definition of the CME that requires drastically weaker assumptions and comes with an explicit expression. We believe this will enable a more principled analysis of its theoretical properties, and open doors to new application areas. We derive the empirical estimate based on vector-valued RKHS regression, and provide an in-depth analysis of its properties, including a universal consistency result with an explicit convergence rate. In particular, we relax the finite-dimensionality assumption of Grünewälder et al. (2012) to allow for infinite-dimensional RKHSs.
As natural by-products, we obtain quantities that are extensions of the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC) to the conditional setting, which we call the Maximum Conditional Mean Discrepancy (MCMD) and the Hilbert-Schmidt Conditional Independence Criterion (HSCIC). We demonstrate their properties through simulation experiments.
All proofs can be found in Appendix C.
We take $(\Omega, \mathcal{F}, P)$ as the underlying probability space. Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be separable measurable spaces, and let $X\colon\Omega\to\mathcal{X}$, $Y\colon\Omega\to\mathcal{Y}$ and $Z\colon\Omega\to\mathcal{Z}$ be random variables with distributions $P_X$, $P_Y$ and $P_Z$. We will use $X$ as the conditioning variable throughout.
2.1 Positive Definite Kernels and RKHS Embeddings
Let $\mathcal{H}_\mathcal{X}$ be a vector space of real-valued functions on $\mathcal{X}$, endowed with the structure of a Hilbert space via an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_\mathcal{X}}$. A symmetric function $k_\mathcal{X}\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a reproducing kernel of $\mathcal{H}_\mathcal{X}$ if and only if: 1. $k_\mathcal{X}(x,\cdot)\in\mathcal{H}_\mathcal{X}$ for all $x\in\mathcal{X}$; 2. $\langle f, k_\mathcal{X}(x,\cdot)\rangle_{\mathcal{H}_\mathcal{X}} = f(x)$ for all $x\in\mathcal{X}$ and $f\in\mathcal{H}_\mathcal{X}$. A Hilbert space of real-valued functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS) (Berlinet and Thomas-Agnan, 2004). A symmetric function $k_\mathcal{X}$ is a positive-definite function if every Gram matrix $\big(k_\mathcal{X}(x_i,x_j)\big)_{i,j=1}^n$ is positive semi-definite. The Moore-Aronszajn Theorem (Aronszajn, 1950) shows that the set of positive-definite functions and the set of reproducing kernels on $\mathcal{X}$ are in fact identical.
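As a concrete numerical check (an illustration of ours, not taken from the paper), positive definiteness can be probed on any finite point set: the Gram matrix of a reproducing kernel must have no negative eigenvalues. A minimal sketch with the Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))  # 20 arbitrary points in R^3

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])

# Positive definiteness: all eigenvalues of the Gram matrix are non-negative
# (up to numerical round-off)
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10
```

Of course, passing this check on finitely many points does not prove positive definiteness; it merely illustrates what the definition requires.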
Assuming $\mathbb{E}[\sqrt{k_\mathcal{X}(X,X)}]<\infty$, we define the kernel mean embedding of the distribution $P_X$ as $\mu_{P_X} := \mathbb{E}[k_\mathcal{X}(X,\cdot)] = \int_\mathcal{X} k_\mathcal{X}(x,\cdot)\,\mathrm{d}P_X(x)$. Note that the integrand is an element of a Hilbert space (and therefore a Banach space), so this integral is not a Lebesgue integral, but a Bochner integral (Dinculeanu, 2000, p.15, Definition 35). The square-root integrability assumption ensures that $k_\mathcal{X}(X,\cdot)$ is indeed Bochner-integrable. We will generalise the following lemma to the conditional case later.
Lemma 2.1 (Smola et al. (2007)).
For each $f\in\mathcal{H}_\mathcal{X}$, $\mathbb{E}[f(X)] = \langle f, \mu_{P_X}\rangle_{\mathcal{H}_\mathcal{X}}$.
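Lemma 2.1 already suggests the standard empirical estimate $\hat\mu_{P_X} = \frac{1}{n}\sum_i k_\mathcal{X}(x_i,\cdot)$. The following sketch (ours, for illustration; the Gaussian kernel and the $N(0,1)$ sample are arbitrary choices) checks this identity numerically against a closed form:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel k(a, b), vectorised over a."""
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
sample = rng.normal(size=2000)      # i.i.d. draws X_i ~ N(0, 1)

# Empirical mean embedding mu_hat = (1/n) sum_i k(X_i, .),
# represented by its evaluation map t -> mu_hat(t).
def mu_hat(t):
    return np.mean(rbf(sample, t))

# Lemma 2.1: E[f(X)] = <f, mu_P>.  Taking f = k(t, .), the reproducing
# property gives <k(t, .), mu_P> = mu_P(t) = E[k(X, t)], which for
# X ~ N(0, 1) and sigma = 1 has the closed form exp(-t^2 / 4) / sqrt(2).
t = 0.3
closed_form = np.exp(-t ** 2 / 4) / np.sqrt(2)
assert abs(mu_hat(t) - closed_form) < 0.05
```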
Next, suppose $\mathcal{H}_\mathcal{Y}$ is an RKHS of functions on $\mathcal{Y}$ with kernel $k_\mathcal{Y}$, and consider the tensor product RKHS $\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}$ (see Weidmann (1980, pp.47-48) for a definition of tensor product Hilbert spaces).
Theorem 2.2 (Berlinet and Thomas-Agnan (2004, p.31, Theorem 13)).
The tensor product $\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}$ is generated by the functions $e\otimes f$, with $e\in\mathcal{H}_\mathcal{X}$ and $f\in\mathcal{H}_\mathcal{Y}$, defined by $(e\otimes f)(x,y) := e(x)f(y)$. Moreover, $\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}$ is an RKHS of functions on $\mathcal{X}\times\mathcal{Y}$ with kernel $k\big((x,y),(x',y')\big) = k_\mathcal{X}(x,x')\,k_\mathcal{Y}(y,y')$.
Now let us impose a slightly stronger integrability condition: $\mathbb{E}[\sqrt{k_\mathcal{X}(X,X)\,k_\mathcal{Y}(Y,Y)}] < \infty$. This ensures that $k_\mathcal{X}(X,\cdot)\otimes k_\mathcal{Y}(Y,\cdot)$ is Bochner $P$-integrable, and so $\mu_{P_{XY}} := \mathbb{E}[k_\mathcal{X}(X,\cdot)\otimes k_\mathcal{Y}(Y,\cdot)] \in \mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}$. The next lemma is analogous to Lemma 2.1:
Lemma 2.3 (Fukumizu et al. (2004, Theorem 1)).
For each pair $e\in\mathcal{H}_\mathcal{X}$ and $f\in\mathcal{H}_\mathcal{Y}$, $\mathbb{E}[e(X)f(Y)] = \langle e\otimes f, \mu_{P_{XY}}\rangle_{\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}}$.
As a consequence, for any pair $e\in\mathcal{H}_\mathcal{X}$ and $f\in\mathcal{H}_\mathcal{Y}$, we have: $\mathrm{Cov}\big(e(X), f(Y)\big) = \langle e\otimes f,\ \mu_{P_{XY}} - \mu_{P_X}\otimes\mu_{P_Y}\rangle_{\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}}$.
There exists an isometric isomorphism $S\colon\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}\to \mathrm{HS}(\mathcal{H}_\mathcal{Y},\mathcal{H}_\mathcal{X})$, where $\mathrm{HS}(\mathcal{H}_\mathcal{Y},\mathcal{H}_\mathcal{X})$ is the space of Hilbert-Schmidt operators from $\mathcal{H}_\mathcal{Y}$ to $\mathcal{H}_\mathcal{X}$. The cross-covariance operator is defined as $C_{XY} := S(\mu_{P_{XY}} - \mu_{P_X}\otimes\mu_{P_Y})$ (Fukumizu et al., 2004). It is straightforward to show that $\mathrm{Cov}(e(X), f(Y)) = \langle e, C_{XY} f\rangle_{\mathcal{H}_\mathcal{X}}$.
The notion of characteristic kernels is essential: it tells us that the associated RKHS is rich enough to distinguish different distributions from their embeddings.
Definition 2.4 (Fukumizu et al. (2008)).
A positive definite kernel $k$ is characteristic to a set $\mathcal{P}$ of probability measures defined on $\mathcal{X}$ if the map $P\mapsto\mu_P$ is injective on $\mathcal{P}$.
Sriperumbudur et al. (2010) discuss various characterisations of characteristic kernels and show that the well-known Gaussian and Laplacian kernels are characteristic. We then have a metric on $\mathcal{P}$ via $\mathrm{MMD}(P,Q) := \|\mu_P - \mu_Q\|_{\mathcal{H}}$ for $P,Q\in\mathcal{P}$, which goes by the name of maximum mean discrepancy (MMD) (Gretton et al., 2007).
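A plug-in estimate of the squared MMD follows directly from expanding $\|\mu_P - \mu_Q\|^2_{\mathcal{H}}$ via the kernel trick. A minimal sketch (ours, not from the paper; the biased V-statistic form is used for brevity):

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    """Gram matrix of the Gaussian kernel between two 1-d samples."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD:
    mean(K_XX) - 2 mean(K_XY) + mean(K_YY)."""
    return (rbf_gram(X, X, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=400), rng.normal(size=400))        # same law
diff = mmd2(rng.normal(size=400), rng.normal(loc=2.0, size=400))  # shifted law
assert 0 <= same < diff  # differing distributions give a larger MMD
```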
The Hilbert-Schmidt Independence Criterion (HSIC) is defined as the Hilbert-Schmidt norm of $C_{XY}$, or equivalently, $\|\mu_{P_{XY}} - \mu_{P_X}\otimes\mu_{P_Y}\|_{\mathcal{H}_\mathcal{X}\otimes\mathcal{H}_\mathcal{Y}}$ (Gretton et al., 2005), i.e. the MMD between $P_{XY}$ and $P_X\otimes P_Y$. If the product kernel $k_\mathcal{X} k_\mathcal{Y}$ is characteristic, then $\mathrm{HSIC} = 0$ if and only if $X$ and $Y$ are independent.
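Empirically, the (biased) HSIC can be computed from centred Gram matrices as $\frac{1}{n^2}\operatorname{tr}(KHLH)$ (Gretton et al., 2005). An illustrative sketch of ours, with arbitrary Gaussian-kernel bandwidths:

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels on both variables:
    (1/n^2) tr(K H L H), where H is the centering matrix."""
    n = len(X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(Y[:, None] - Y[None, :]) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=300)
indep = hsic(X, rng.normal(size=300))            # Y independent of X
dep = hsic(X, X + 0.1 * rng.normal(size=300))    # Y strongly dependent on X
assert dep > indep
```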
2.2 Conditional Expectations and Conditional Distributions
In this subsection, we briefly review the concept of conditioning in formal, measure-theoretic probability theory, in the context of Banach space-valued random variables. We consider a sub-$\sigma$-algebra $\mathcal{G}$ of $\mathcal{F}$ and a Banach space $B$.
Definition 2.5 (Conditional Expectation, Dinculeanu (2000, p.45, Definition 38)).
Suppose $V$ is a Bochner $P$-integrable, $B$-valued random variable. Then the conditional expectation of $V$ given $\mathcal{G}$ is any $\mathcal{G}$-measurable, Bochner $P$-integrable, $B$-valued random variable $W$ such that $\int_G W\,\mathrm{d}P = \int_G V\,\mathrm{d}P$ for all $G\in\mathcal{G}$. Any $W$ satisfying this condition is said to be a version of $\mathbb{E}[V\,|\,\mathcal{G}]$. We write $\mathbb{E}[V\,|\,X]$ to mean $\mathbb{E}[V\,|\,\sigma(X)]$, where $\sigma(X)$ is the sub-$\sigma$-algebra of $\mathcal{F}$ generated by the random variable $X$.
Definition 2.6 (Çınlar (2011, p.149)).
For each $A\in\mathcal{F}$, the conditional probability of $A$ given $\mathcal{G}$ is $P(A\,|\,\mathcal{G}) := \mathbb{E}[\mathbb{1}_A\,|\,\mathcal{G}]$.
Note that, in the unconditional case, the expectation is defined as the integral with respect to the measure, but in the conditional case, the expectation is defined first, and the conditional measure is then defined as the conditional expectation of indicator functions.
For this definition of conditional probability to be useful, we require an additional property, called a "regular version". We first define the notion of a transition probability kernel.
Definition 2.7 (Çınlar (2011, p.37,40)).
Let $(E,\mathcal{E})$ and $(F,\mathcal{F}')$ be measurable spaces. A mapping $Q\colon E\times\mathcal{F}'\to[0,\infty]$ is a transition kernel from $(E,\mathcal{E})$ to $(F,\mathcal{F}')$ if (i) for each $A\in\mathcal{F}'$, the map $e\mapsto Q(e,A)$ is $\mathcal{E}$-measurable; (ii) for each $e\in E$, $Q(e,\cdot)$ is a measure on $(F,\mathcal{F}')$. If $Q(e,F) = 1$ for all $e\in E$, then $Q$ is said to be a transition probability kernel.
Definition 2.8 (Çınlar (2011, p.150, Definition 2.4)).
For each $A\in\mathcal{F}$, let $P(A\,|\,\mathcal{G})$ be a version of $\mathbb{E}[\mathbb{1}_A\,|\,\mathcal{G}]$. Then $P(\cdot\,|\,\mathcal{G})$ is said to be a regular version of the conditional probability measure if $(\omega, A)\mapsto P(A\,|\,\mathcal{G})(\omega)$ is a transition probability kernel from $(\Omega,\mathcal{G})$ to $(\Omega,\mathcal{F})$.
The following theorem, proved in Appendix C, is the reason why a regular version is important. It means that, roughly speaking, the conditional expectation is indeed obtained by integration with respect to the conditional measure.
Theorem 2.9 (Adapted from Çınlar (2011, p.150, Proposition 2.5)).
Suppose that $P(\cdot\,|\,\mathcal{G})$ admits a regular version $Q$. Then, for every Bochner $P$-integrable, $B$-valued random variable $V$, the map $\omega\mapsto\int_\Omega V(\omega')\,Q(\omega,\mathrm{d}\omega')$ is a version of $\mathbb{E}[V\,|\,\mathcal{G}]$.
Next, we define the conditional distribution.
Definition 2.10 (Çınlar (2011, p.151)).
Let $U$ be a random variable taking values in a measurable space $(E,\mathcal{E})$. Then the conditional distribution of $U$ given $\mathcal{G}$ is any transition probability kernel $P_{U|\mathcal{G}}$ from $(\Omega,\mathcal{G})$ to $(E,\mathcal{E})$ such that, for all $A\in\mathcal{E}$, $P_{U|\mathcal{G}}(\cdot, A)$ is a version of $\mathbb{E}[\mathbb{1}_A(U)\,|\,\mathcal{G}]$.
If $P(\cdot\,|\,\mathcal{G})$ has a regular version $Q$, then letting $P_{U|\mathcal{G}}(\omega, A) := Q\big(\omega, U^{-1}(A)\big)$ for $A\in\mathcal{E}$ defines a version of the conditional distribution of $U$ given $\mathcal{G}$. Unfortunately, a regular version of a conditional probability measure does not always exist. Nor is it guaranteed that any version of the conditional distribution exists. The following definition and theorem tell us that, fortunately, these versions exist more often than not.
Definition 2.11 (Çınlar (2011, p.11)).
A measurable space $(E,\mathcal{E})$ is standard if it is isomorphic to $(F,\mathcal{B}_F)$, where $F$ is some Borel subset of $\mathbb{R}$ and $\mathcal{B}_F$ is the Borel $\sigma$-algebra of $F$.
Theorem 2.12 (Çınlar (2011, p.151, Theorem 2.10)).
If $(E,\mathcal{E})$ is a standard measurable space, then there exists a version of the conditional distribution of $U$ given $\mathcal{G}$. In particular, if $(\Omega,\mathcal{F})$ is a standard measurable space, then the conditional measure $P(\cdot\,|\,\mathcal{G})$ has a regular version.
Examples of standard measurable spaces include Borel subsets of $\mathbb{R}^n$ with their Borel $\sigma$-algebras, complete separable metric spaces with their Borel $\sigma$-algebras, Polish spaces with their Borel $\sigma$-algebras, separable Banach spaces with their Borel $\sigma$-algebras, and separable Hilbert spaces with their Borel $\sigma$-algebras.
2.3 Vector-Valued RKHS Regression
In this subsection, we introduce the theory of vector-valued RKHS regression, based on operator-valued kernels. Let $\mathcal{H}$ be a Hilbert space, which will serve as the output space of the regression.
Definition 2.13 (Carmeli et al. (2006, Definition 1)).
An $\mathcal{H}$-valued RKHS on $\mathcal{X}$ is a Hilbert space $\mathcal{H}_K$ such that 1. the elements of $\mathcal{H}_K$ are functions $F\colon\mathcal{X}\to\mathcal{H}$; 2. for each $x\in\mathcal{X}$, there exists $C_x > 0$ such that $\|F(x)\|_{\mathcal{H}} \le C_x\,\|F\|_{\mathcal{H}_K}$ for all $F\in\mathcal{H}_K$.
For the next definition, we let $L(\mathcal{H})$ denote the Banach space of bounded linear operators from $\mathcal{H}$ into itself.
Definition 2.14 (Carmeli et al. (2006, Definition 2)).
An $L(\mathcal{H})$-kernel of positive type on $\mathcal{X}$ is a map $K\colon\mathcal{X}\times\mathcal{X}\to L(\mathcal{H})$ such that, for all $N\in\mathbb{N}$, $x_1,\dots,x_N\in\mathcal{X}$ and $h_1,\dots,h_N\in\mathcal{H}$, $\sum_{i,j=1}^{N}\langle h_i, K(x_i,x_j)\,h_j\rangle_{\mathcal{H}} \ge 0$.
Analogously to the scalar case, it can be shown that any $\mathcal{H}$-valued RKHS $\mathcal{H}_K$ possesses a reproducing kernel, which is an $L(\mathcal{H})$-kernel of positive type $K$ satisfying, for any $x\in\mathcal{X}$, $h\in\mathcal{H}$ and $F\in\mathcal{H}_K$, $K(\cdot,x)h\in\mathcal{H}_K$ and $\langle F, K(\cdot,x)h\rangle_{\mathcal{H}_K} = \langle F(x), h\rangle_{\mathcal{H}}$. There is also an analogue of the Moore-Aronszajn Theorem:
Theorem 2.15 (Carmeli et al. (2006, Proposition 1)).
Given an $L(\mathcal{H})$-kernel of positive type $K$, there is a unique $\mathcal{H}$-valued RKHS on $\mathcal{X}$ with reproducing kernel $K$.
Now suppose we want to perform regression with input space $\mathcal{X}$ and output space $\mathcal{H}$, given data pairs $(x_1,h_1),\dots,(x_n,h_n)\in\mathcal{X}\times\mathcal{H}$, by minimising the following regularised loss functional: $\sum_{i=1}^n \|h_i - F(x_i)\|^2_{\mathcal{H}} + \lambda\,\|F\|^2_{\mathcal{H}_K} \qquad (4)$, where $\lambda > 0$ is a regularisation parameter and $F\in\mathcal{H}_K$. There is a corresponding representer theorem:
Theorem 2.16 (Representer theorem; see, e.g., Micchelli and Pontil (2005)).
The minimiser of the above loss functional over $F\in\mathcal{H}_K$ exists, is unique, and is of the form $\hat{F} = \sum_{j=1}^n K(\cdot,x_j)\,c_j$, where the coefficients $c_1,\dots,c_n\in\mathcal{H}$ are the unique solutions of the linear equations $\sum_{j=1}^n \big(K(x_i,x_j) + \lambda\,\delta_{ij}\big)c_j = h_i$, $i = 1,\dots,n$.
3 Conditional Mean Embedding
We are now ready to introduce a formal definition of the conditional mean embedding of $Y$ given $X$.
Definition 3.1 (Conditional Mean Embedding (CME)).
We define the conditional mean embedding of $Y$ given $X$ as $\mu_{P_{Y|X}} := \mathbb{E}[k_\mathcal{Y}(Y,\cdot)\,|\,X] = \mathbb{E}[k_\mathcal{Y}(Y,\cdot)\,|\,\sigma(X)]$.
This is a direct extension of the marginal kernel mean embedding $\mu_{P_Y} = \mathbb{E}[k_\mathcal{Y}(Y,\cdot)]$, but instead of being a fixed element in $\mathcal{H}_\mathcal{Y}$, $\mu_{P_{Y|X}}$ is a $\sigma(X)$-measurable random variable taking values in $\mathcal{H}_\mathcal{Y}$ (see Definition 2.5). Also, for $f\in\mathcal{H}_\mathcal{Y}$, $\langle f, \mu_{P_{Y|X}}\rangle_{\mathcal{H}_\mathcal{Y}}$ is a real-valued $\sigma(X)$-measurable random variable. The following lemma is analogous to Lemma 2.1.
Lemma 3.2. Suppose that $P(\cdot\,|\,\sigma(X))$ admits a regular version. Then for any $f\in\mathcal{H}_\mathcal{Y}$, $\langle f, \mu_{P_{Y|X}}\rangle_{\mathcal{H}_\mathcal{Y}} = \mathbb{E}[f(Y)\,|\,X]$ almost surely.
Next, we define $\mu_{P_{YZ|X}} := \mathbb{E}[k_\mathcal{Y}(Y,\cdot)\otimes k_\mathcal{Z}(Z,\cdot)\,|\,X]$, a $\sigma(X)$-measurable random variable taking values in $\mathcal{H}_\mathcal{Y}\otimes\mathcal{H}_\mathcal{Z}$, where $k_\mathcal{Z}$ is a kernel on $\mathcal{Z}$ with RKHS $\mathcal{H}_\mathcal{Z}$. The following lemma is an analogue of Lemma 2.3.
Lemma 3.3. Suppose that $P(\cdot\,|\,\sigma(X))$ admits a regular version. Then for each pair $g\in\mathcal{H}_\mathcal{Y}$ and $h\in\mathcal{H}_\mathcal{Z}$, $\langle g\otimes h, \mu_{P_{YZ|X}}\rangle_{\mathcal{H}_\mathcal{Y}\otimes\mathcal{H}_\mathcal{Z}} = \mathbb{E}[g(Y)h(Z)\,|\,X]$ almost surely. Hence, we define the conditional cross-covariance operator as $C_{YZ|X} := S\big(\mu_{P_{YZ|X}} - \mu_{P_{Y|X}}\otimes\mu_{P_{Z|X}}\big)$ (see Section 2.1 for the definition of the isometric isomorphism $S$).
3.1 Comparison with Existing Definitions
As previously mentioned, the idea of CMEs and conditional cross-covariance operators is not a novel one, yet our development of the theory and definitions above differ significantly from the existing works. In this subsection, we review the previous approaches and compare them to ours.
The prevalent definition of CMEs in the literature is the one given in the following definition. We first need to endow the conditioning space $\mathcal{X}$ with a scalar kernel, say $k_\mathcal{X}$, with corresponding RKHS $\mathcal{H}_\mathcal{X}$.
Definition 3.4 (Song et al. (2009, Definition 3)).
The conditional mean embedding of the conditional distribution $P_{Y|X}$ is the operator $C_{Y|X}\colon\mathcal{H}_\mathcal{X}\to\mathcal{H}_\mathcal{Y}$ defined by $C_{Y|X} := C_{YX}\,C_{XX}^{-1}$, where $C_{YX}$ and $C_{XX}$ are unconditional (cross-)covariance operators as defined in Section 2.1.
As noted by Song et al. (2009), the motivation for this comes from Theorem 2 in the appendix of Fukumizu et al. (2004), which states that if $\mathbb{E}[g(Y)\,|\,X=\cdot]\in\mathcal{H}_\mathcal{X}$, then for any $g\in\mathcal{H}_\mathcal{Y}$, $C_{XX}\,\mathbb{E}[g(Y)\,|\,X=\cdot] = C_{XY}\,g$. This relation can be used to prove the following theorem.
Theorem 3.5 (Song et al. (2009, Theorem 4)).
Take any $x\in\mathcal{X}$. Then assuming $\mathbb{E}[g(Y)\,|\,X=\cdot]\in\mathcal{H}_\mathcal{X}$ for all $g\in\mathcal{H}_\mathcal{Y}$, $\mu_{Y|x} := C_{Y|X}\,k_\mathcal{X}(x,\cdot)$ satisfies: 1. $\mu_{Y|x} = \mathbb{E}[k_\mathcal{Y}(Y,\cdot)\,|\,X=x]$; 2. $\mathbb{E}[g(Y)\,|\,X=x] = \langle g, \mu_{Y|x}\rangle_{\mathcal{H}_\mathcal{Y}}$ for all $g\in\mathcal{H}_\mathcal{Y}$.
Now we highlight the key differences between this approach and ours. Firstly, this approach requires endowing the conditioning space $\mathcal{X}$ with a kernel $k_\mathcal{X}$, and subsequently defines the CME as an operator from $\mathcal{H}_\mathcal{X}$ to $\mathcal{H}_\mathcal{Y}$. By contrast, our definition does not involve any kernel or function on $\mathcal{X}$, and defines the CME as a Bochner conditional expectation given $X$, i.e. a $\sigma(X)$-measurable, $\mathcal{H}_\mathcal{Y}$-valued random variable. It seems more logical not to have to endow the conditioning space with a kernel, at least before the estimation stage. Secondly, the operator-based approach assumes that $\mathbb{E}[g(Y)\,|\,X=\cdot]$, as a function on $\mathcal{X}$, lives in $\mathcal{H}_\mathcal{X}$. This is a severe restriction; it is stated in Song et al. (2009) that this assumption, while true for finite domains with characteristic kernels, is not necessarily true for continuous domains, and Fukumizu et al. (2013) gives a simple counterexample using the Gaussian kernel. Lastly, it also assumes that the inverse $C_{XX}^{-1}$ exists, which is another severe restriction. Fukumizu et al. (2013) mentions that this assumption is too strong in many situations involving popular kernels, and gives a counterexample using the Gaussian kernel. The most common remedy is to resort to the regularised version $C_{YX}(C_{XX} + \lambda I)^{-1}$ and treat it as an approximation of $C_{YX}C_{XX}^{-1}$. These assumptions have been clarified and slightly weakened in Klebanov et al. (2019), but strong and hard-to-verify conditions persist.
In contrast, our definitions extend the notions of kernel mean embedding, expectation operator and cross-covariance operator to the conditional setting simply by using the formal definition of conditional expectations (Definition 2.5), and only rely on the mild assumption that the conditional probability measure admits a regular version.
Grünewälder et al. (2012) gave a regression interpretation, by showing the existence, for each $x\in\mathcal{X}$, of an element $\mu(x)\in\mathcal{H}_\mathcal{Y}$ that satisfies $\mathbb{E}[f(Y)\,|\,X=x] = \langle f, \mu(x)\rangle_{\mathcal{H}_\mathcal{Y}}$ for all $f\in\mathcal{H}_\mathcal{Y}$. However, the main drawback here is that there is no explicit expression for $\mu(x)$, limiting its analysis and use. In contrast, our definition has the explicit expression $\mu_{P_{Y|X}} = \mathbb{E}[k_\mathcal{Y}(Y,\cdot)\,|\,X]$, with which it is easy to explore potential applications.
In Fukumizu et al. (2004), the conditional cross-covariance operator is defined, but in a significantly different way. It is defined as $C_{YZ} - C_{YX}C_{XX}^{-1}C_{XZ}$, where $C_{XX}^{-1}$ is the right inverse of $C_{XX}$ on $\mathrm{Ran}(C_{XX})$. This has the property that, for all $g\in\mathcal{H}_\mathcal{Y}$ and $h\in\mathcal{H}_\mathcal{Z}$, $\langle g, (C_{YZ} - C_{YX}C_{XX}^{-1}C_{XZ})\,h\rangle_{\mathcal{H}_\mathcal{Y}} = \mathbb{E}\big[\mathrm{Cov}\big(g(Y), h(Z)\,|\,X\big)\big]$. Note that this is different to our relation stated after Lemma 3.3; here, the conditional covariance is integrated out over $X$. In fact, this difference is explicitly noted by Song et al. (2009).
3.2 A Discrepancy Measure between Conditional Distributions
In this subsection, we propose a conditional analogue of the maximum mean discrepancy (MMD), and explore the role of characteristic kernels in the conditional case. Let $Y'$ be an additional random variable taking values in $\mathcal{Y}$, satisfying $\mathbb{E}[\sqrt{k_\mathcal{Y}(Y',Y')}] < \infty$.
Definition 3.6 (Maximum Conditional Mean Discrepancy (MCMD)).
We define the maximum conditional mean discrepancy (MCMD) between $P_{Y|X}$ and $P_{Y'|X}$ to be $\mathrm{MCMD} := \|\mu_{P_{Y|X}} - \mu_{P_{Y'|X}}\|_{\mathcal{H}_\mathcal{Y}}$.
We note that the MCMD is not a fixed value, but a real-valued, $\sigma(X)$-measurable random variable.
Recall that the MMD can equivalently be written as $\mathrm{MMD}(P,Q) = \sup_{\|f\|_{\mathcal{H}_\mathcal{Y}}\le 1}\big(\mathbb{E}_{Y\sim P}[f(Y)] - \mathbb{E}_{Y'\sim Q}[f(Y')]\big)$ (Gretton et al., 2012). The analogous (a.s.) equality for the MCMD is: $\mathrm{MCMD} = \sup_{\|f\|_{\mathcal{H}_\mathcal{Y}}\le 1}\big(\mathbb{E}[f(Y)\,|\,X] - \mathbb{E}[f(Y')\,|\,X]\big)$, where we used Lemma 3.2. We define the conditional witness function as the $\mathcal{H}_\mathcal{Y}$-valued random variable $\mu_{P_{Y|X}} - \mu_{P_{Y'|X}}$.
Casting aside measure-theoretic issues arising from conditioning on an event of measure 0, we can informally think of the realisation of the MCMD at each $\omega\in\Omega$ with $X(\omega) = x$ as "the MMD between $P_{Y|X=x}$ and $P_{Y'|X=x}$", and of the conditional witness function as "the witness function between $P_{Y|X=x}$ and $P_{Y'|X=x}$". The following theorem says that, with characteristic kernels, the MCMD can indeed act as a discrepancy measure between conditional distributions.
Theorem 3.7. Suppose $k_\mathcal{Y}$ is a characteristic kernel, and assume that $P(\cdot\,|\,\sigma(X))$ admits a regular version. Then $\mathrm{MCMD} = 0$ almost surely if and only if, almost surely, $P_{Y|X}(A) = P_{Y'|X}(A)$ for all $A\in\mathcal{B}_\mathcal{Y}$.
The MCMD is reminiscent of the conditional maximum mean discrepancy of Ren et al. (2016), defined as the Hilbert-Schmidt norm of the operator $C_{Y|X} - C_{Y'|X}$ (see Definition 3.4). However, due to the strong assumptions previously discussed, $C_{Y|X}$ and $C_{Y'|X}$ often do not even exist, and/or do not have the desired properties of Theorem 3.5, so even at the population level, this quantity is often not an exact measure of discrepancy between conditional distributions. On the other hand, Theorem 3.7 with the MCMD is an exact mathematical statement at the population level that is valid between any pair of conditional distributions.
The discussion of characteristic kernels in the conditional setting, and the precise meaning of an "injective embedding" of conditional distributions, has largely been absent in the existing literature. We suspect that this is because the operator-based definition is somewhat cumbersome to work with, and it is not immediately clear how to express such statements. The new, mathematically elegant definition of the CME remedies this through Theorem 3.7. We conjecture that characteristic kernels will play a crucial role in many future applications of the CME.
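Anticipating the empirical CME of Section 4, the MCMD can be estimated by plugging in regularised CME estimates for two samples and expanding the squared RKHS norm via the kernel trick. The following is an illustrative sketch of ours, not code from the paper; the hyperparameters `lam` and `sigma` and the sinusoidal data are arbitrary choices:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

def mcmd2(x_grid, X1, Y1, X2, Y2, lam=0.1, sigma=1.0):
    """Plug-in estimate of the squared MCMD at each point of x_grid,
    built from the regularised empirical CMEs of (X1, Y1) and (X2, Y2)."""
    n1, n2 = len(X1), len(X2)
    # CME weight vectors alpha_j(x) = (L_j + n_j lam I)^{-1} l_j(x)
    W1 = np.linalg.solve(rbf(X1, X1, sigma) + n1 * lam * np.eye(n1),
                         rbf(X1, x_grid, sigma))   # shape (n1, m)
    W2 = np.linalg.solve(rbf(X2, X2, sigma) + n2 * lam * np.eye(n2),
                         rbf(X2, x_grid, sigma))   # shape (n2, m)
    K11, K12, K22 = rbf(Y1, Y1, sigma), rbf(Y1, Y2, sigma), rbf(Y2, Y2, sigma)
    # ||mu_1(x) - mu_2(x)||^2 expanded via the kernel trick, per grid point
    return (np.einsum('im,ij,jm->m', W1, K11, W1)
            - 2 * np.einsum('im,ij,jm->m', W1, K12, W2)
            + np.einsum('im,ij,jm->m', W2, K22, W2))

rng = np.random.default_rng(0)
X1 = rng.uniform(-2, 2, 300); Y1 = np.sin(X1) + 0.1 * rng.normal(size=300)
X2 = rng.uniform(-2, 2, 300); Y2 = np.sin(X2) + 0.1 * rng.normal(size=300)
X3 = rng.uniform(-2, 2, 300); Y3 = -np.sin(X3) + 0.1 * rng.normal(size=300)
grid = np.linspace(-1.5, 1.5, 20)
same = mcmd2(grid, X1, Y1, X2, Y2).mean()  # identical conditionals
diff = mcmd2(grid, X1, Y1, X3, Y3).mean()  # differing conditionals
assert diff > same
```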
3.3 A Criterion of Conditional Independence
In this subsection, we introduce a novel criterion of conditional independence, via a direct analogy with the HSIC.
Definition 3.8 (Hilbert-Schmidt Conditional Independence Criterion (HSCIC)).
We define the Hilbert-Schmidt Conditional Independence Criterion between $Y$ and $Z$ given $X$ to be $\mathrm{HSCIC} := \|\mu_{P_{YZ|X}} - \mu_{P_{Y|X}}\otimes\mu_{P_{Z|X}}\|_{\mathcal{H}_\mathcal{Y}\otimes\mathcal{H}_\mathcal{Z}}$.
Note that the HSCIC is an instance of the MCMD in the tensor product space $\mathcal{H}_\mathcal{Y}\otimes\mathcal{H}_\mathcal{Z}$, and is a (real-valued) $\sigma(X)$-measurable random variable. Again, casting aside measure-theoretic issues arising from conditioning on an event of probability 0, we can conceptually think of the realisation of the HSCIC at each $\omega\in\Omega$ with $X(\omega) = x$ as "the HSIC between $Y\,|\,X=x$ and $Z\,|\,X=x$". Since the HSCIC is an instance of the MCMD, the following theorem follows immediately from Theorem 3.7.
Theorem 3.9. Suppose the product kernel $k_\mathcal{Y} k_\mathcal{Z}$ is a characteristic kernel, and assume that $P(\cdot\,|\,\sigma(X))$ admits a regular version. Then $\mathrm{HSCIC} = 0$ almost surely if and only if $Y$ and $Z$ are conditionally independent given $X$.
Concurrent and independent work by Sheng and Sriperumbudur (2019) proposes a similar criterion with the same nomenclature (HSCIC). However, they omit the discussion of CMEs entirely, and define the HSCIC as the usual HSIC between the conditioned variables, without considerations for conditioning on an event of measure 0. Their focus is more on investigating connections to distance-based measures (Wang et al., 2015; Sejdinovic et al., 2013). Fukumizu et al. (2008) propose a criterion defined as the squared Hilbert-Schmidt norm of the normalised conditional cross-covariance operator. As discussed, these operator-based definitions rely on a number of strong assumptions that will often mean that the operator does not exist, or does not satisfy the conditions for it to be used as an exact criterion even at the population level. On the other hand, the HSCIC defined as in Definition 3.8 is an exact mathematical criterion of conditional independence at the population level. Note also that the operator-based quantity is a single-value criterion, whereas the HSCIC is a random variable.
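Using the empirical CME of Section 4, the squared HSCIC at a point $x$ can be estimated with a single weight vector $w = (L + n\lambda I)^{-1}\mathbf{l}(x)$ and element-wise products of Gram matrices. An illustrative sketch of ours (the hyperparameters `lam` and `sigma` and the data-generating process are arbitrary):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

def hscic2(x, X, Y, Z, lam=0.1, sigma=1.0):
    """Plug-in estimate of the squared HSCIC at a single point x."""
    n = len(X)
    w = np.linalg.solve(rbf(X, X, sigma) + n * lam * np.eye(n),
                        rbf(X, np.array([x]), sigma)).ravel()
    KY, KZ = rbf(Y, Y, sigma), rbf(Z, Z, sigma)
    joint = w @ (KY * KZ) @ w              # squared norm of joint embedding
    cross = w @ ((KY @ w) * (KZ @ w))      # inner product with product embedding
    prod = (w @ KY @ w) * (w @ KZ @ w)     # squared norm of product embedding
    # ||mu_{YZ|X=x} - mu_{Y|X=x} (tensor) mu_{Z|X=x}||^2
    return joint - 2 * cross + prod

rng = np.random.default_rng(0)
X = rng.normal(size=400)
noise_y, noise_z = rng.normal(size=400), rng.normal(size=400)
Y = X + 0.3 * noise_y
Z_ci = X + 0.3 * noise_z    # conditionally independent of Y given X
Z_dep = X + 0.3 * noise_y   # shares Y's noise: dependent given X
assert hscic2(0.0, X, Y, Z_dep) > hscic2(0.0, X, Y, Z_ci)
```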
4 Empirical Estimates
In this section, we discuss how we can obtain empirical estimates of $\mu_{P_{Y|X}}$.
Theorem 4.1. Let $U$ be a random variable taking values in a measurable space $(E,\mathcal{E})$, with underlying probability space $(\Omega,\mathcal{F},P)$, and let $T$ be a Hausdorff topological space, with its Borel $\sigma$-algebra $\mathcal{B}_T$. A mapping $V\colon\Omega\to T$ is measurable with respect to $\sigma(U)$ and $\mathcal{B}_T$ if and only if $V = f\circ U$ for some deterministic function $f\colon E\to T$, measurable with respect to $\mathcal{E}$ and $\mathcal{B}_T$.
Letting $U = X$ and $T = \mathcal{H}_\mathcal{Y}$, the upshot of Theorem 4.1 is that we can write $\mu_{P_{Y|X}} = F^*\circ X$, where $F^*\colon\mathcal{X}\to\mathcal{H}_\mathcal{Y}$ is some deterministic, measurable function. Figure 1 depicts this relation. Hence, the problem of estimating $\mu_{P_{Y|X}}$ boils down to estimating the function $F^*$, and this is exactly the setting for vector-valued regression discussed in Section 2.3, with input space $\mathcal{X}$ and output space $\mathcal{H}_\mathcal{Y}$. In contrast to Grünewälder et al. (2012), where regression is motivated by applying the Riesz representation theorem conditioned on each value of $X$, we derive the CME as an explicit function of $X$, which we argue is a more principled way to motivate regression. Moreover, for continuous $X$, the event $\{X = x\}$ has measure 0 for each $x\in\mathcal{X}$, so it is not measure-theoretically rigorous to apply the Riesz representation theorem conditioned on $X = x$.
The natural optimisation problem is to minimise the loss $\mathcal{E}(F) := \mathbb{E}_X\big[\|F^*(X) - F(X)\|^2_{\mathcal{H}_\mathcal{Y}}\big] \qquad (6)$ among all $F\in\mathcal{H}_K$, where $\mathcal{H}_K$ is a vector-valued RKHS of functions $F\colon\mathcal{X}\to\mathcal{H}_\mathcal{Y}$ endowed with the operator-valued kernel $K(x,x') = l(x,x')\,\mathrm{Id}_{\mathcal{H}_\mathcal{Y}}$, where $l$ is a scalar kernel on $\mathcal{X}$.
We cannot minimise $\mathcal{E}$ directly, since we do not observe samples from $\mu_{P_{Y|X}}$, but only the pairs $(x_i, y_i)$ from $P_{XY}$. We bound it by a surrogate loss that has a sample-based version: $\mathcal{E}(F) \le \tilde{\mathcal{E}}(F) := \mathbb{E}_{(X,Y)}\big[\|k_\mathcal{Y}(Y,\cdot) - F(X)\|^2_{\mathcal{H}_\mathcal{Y}}\big]$, where we used the generalised conditional Jensen's inequality (see Appendix A, or Perlman (1974)). Section 4.1 discusses the meaning of this surrogate loss. We replace the surrogate population loss with a regularised empirical loss based on samples $\{(x_i,y_i)\}_{i=1}^n$ from the joint distribution $P_{XY}$: $\hat{\mathcal{E}}_n(F) := \frac{1}{n}\sum_{i=1}^n \|k_\mathcal{Y}(y_i,\cdot) - F(x_i)\|^2_{\mathcal{H}_\mathcal{Y}} + \lambda\,\|F\|^2_{\mathcal{H}_K},$
where $\lambda > 0$ is a regularisation parameter. We see that this loss functional has exactly the same form as in (4). Therefore, by Theorem 2.16, the minimiser of $\hat{\mathcal{E}}_n$ has the form $\hat{F}(\cdot) = \sum_{j=1}^n l(\cdot, x_j)\,c_j$, where we wrote $\mathbf{x} := (x_1,\dots,x_n)$ and $\mathbf{y} := (y_1,\dots,y_n)$. By Theorem 2.16, the coefficients $c_1,\dots,c_n\in\mathcal{H}_\mathcal{Y}$ are the unique solutions of the linear equations $\sum_{j=1}^n \big(l(x_i,x_j) + n\lambda\,\delta_{ij}\big)c_j = k_\mathcal{Y}(y_i,\cdot)$, where $L := (l(x_i,x_j))_{i,j=1}^n$, $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise, and $I_n$ is the $n\times n$ identity matrix. Hence, the coefficients are $\mathbf{c} = W\mathbf{k}_\mathbf{y}$, where we wrote $W := (L + n\lambda I_n)^{-1}$ and $\mathbf{k}_\mathbf{y} := (k_\mathcal{Y}(y_1,\cdot),\dots,k_\mathcal{Y}(y_n,\cdot))^\top$. Finally, substituting this into the expression for $\hat{F}$, we have $\hat\mu_{P_{Y|X=x}} := \hat{F}(x) = \mathbf{k}_\mathbf{y}^\top W\,\mathbf{l}(x)$, with $\mathbf{l}(x) := (l(x_1,x),\dots,l(x_n,x))^\top$.
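The closed-form estimate above can be implemented in a few lines. The following sketch (ours; the sinusoidal data, the Gaussian kernels and the regularisation `lam` are arbitrary illustrative choices) evaluates the estimated embedding $\hat\mu_{P_{Y|X=x}}$ on a grid and checks that it peaks near the true conditional mean:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, lam = 500, 0.01
X = rng.uniform(-3, 3, n)
Y = np.sin(X) + 0.1 * rng.normal(size=n)

def cme(x, y_grid):
    """Evaluate the empirical CME mu_hat_{Y|X=x} on a grid of y values:
    mu_hat(x)(y) = sum_i alpha_i(x) k_Y(y_i, y), with
    alpha(x) = (L + n lam I)^{-1} l(x)."""
    alpha = np.linalg.solve(rbf(X, X) + n * lam * np.eye(n),
                            rbf(X, np.array([x]))).ravel()
    return alpha @ rbf(Y, y_grid)

y_grid = np.linspace(-1.5, 1.5, 301)
vals = cme(1.0, y_grid)
# The embedding of P_{Y|X=1} should peak near the conditional mean sin(1)
assert abs(y_grid[np.argmax(vals)] - np.sin(1.0)) < 0.2
```

Note that the weight vector $\alpha(x)$ acts like a kernel smoother around $x$, so the estimate inherits the usual bias-variance trade-off governed by the bandwidth of $l$ and by $\lambda$.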
4.1 Surrogate Loss, Universality and Consistency
There is no doubt that $\mathcal{E}$ in (6) is a more natural loss functional than the surrogate loss $\tilde{\mathcal{E}}$. In this subsection, we investigate the meaning and consequences of using this surrogate loss, as well as the implications of using a universal kernel and the consistency properties of our algorithm.
Denote by $L^2(\mathcal{X}, P_X; \mathcal{H}_\mathcal{Y})$ the Banach space of (equivalence classes of) measurable functions $F\colon\mathcal{X}\to\mathcal{H}_\mathcal{Y}$ such that $\|F(\cdot)\|^2_{\mathcal{H}_\mathcal{Y}}$ is $P_X$-integrable, with norm $\|F\|^2_{L^2} := \int_\mathcal{X}\|F(x)\|^2_{\mathcal{H}_\mathcal{Y}}\,\mathrm{d}P_X(x)$. We note that the true function $F^*$ belongs to $L^2(\mathcal{X}, P_X; \mathcal{H}_\mathcal{Y})$, because Theorem 4.1 tells us that $F^*$ is indeed measurable, and the conditional Jensen's inequality gives $\int_\mathcal{X}\|F^*(x)\|^2_{\mathcal{H}_\mathcal{Y}}\,\mathrm{d}P_X(x) \le \mathbb{E}[k_\mathcal{Y}(Y,Y)]$, which is finite whenever $\mathbb{E}[k_\mathcal{Y}(Y,Y)] < \infty$.
Theorem 4.2. The function $F^*$ minimises both $\mathcal{E}$ and $\tilde{\mathcal{E}}$ in $L^2(\mathcal{X}, P_X; \mathcal{H}_\mathcal{Y})$. Moreover, it is almost surely equal to any other minimiser of these loss functionals.
Note the difference between the statement of Theorem 4.2 and that of Grünewälder et al. (2012, Theorem 3.1), who only consider minimisation of the loss functionals over the vector-valued RKHS $\mathcal{H}_K$, whereas we consider the larger space $L^2(\mathcal{X}, P_X; \mathcal{H}_\mathcal{Y})$ in which to minimise our loss functionals.
Next, we discuss the concepts of universal kernels and universal consistency.
Definition 4.3 (Carmeli et al. (2008, Definition 2)).
An operator-valued reproducing kernel $K$ with associated RKHS $\mathcal{H}_K$ is $C_0$ if $\mathcal{H}_K$ is a subspace of $C_0(\mathcal{X};\mathcal{H})$, the space of continuous functions $\mathcal{X}\to\mathcal{H}$ vanishing at infinity. The kernel is $C_0$-universal if $\mathcal{H}_K$ is dense in $L^2(\mathcal{X},\mu;\mathcal{H})$ for any measure $\mu$.
Recall that we are using the kernel $K(x,x') = l(x,x')\,\mathrm{Id}_{\mathcal{H}_\mathcal{Y}}$. Carmeli et al. (2008, Example 14) show that $K$ is $C_0$-universal if $l$ is a universal scalar kernel, which in turn is guaranteed if $l$ is Gaussian or Laplacian, for example (Steinwart, 2001).
The consistency result of Grünewälder et al. (2012), which attains the optimal rate based on Caponnetto and De Vito (2006), imposes strong assumptions on the kernel $l$, as well as a finite-dimensionality assumption on the output space $\mathcal{H}_\mathcal{Y}$. These are violated for many commonly used kernels, such as the Gaussian kernel, and so we do not use this result in our paper (see Appendix B for more details). Fukumizu (2015) also shows consistency, at a slightly worse rate, under weaker assumptions. We prove the following universal consistency result, which relies on even weaker assumptions while achieving a better rate.
Theorem 4.4. Suppose $l$ and $k_\mathcal{Y}$ are bounded kernels, i.e. $l(x,x') \le B_l$ for all $x,x'\in\mathcal{X}$, for some $B_l > 0$, and $k_\mathcal{Y}(y,y') \le B_k$ for all $y,y'\in\mathcal{Y}$, for some $B_k > 0$. Then our learning algorithm that produces $\hat{F}$ is universally consistent (in the surrogate loss $\tilde{\mathcal{E}}$), i.e. for any joint distribution $P_{XY}$ and constants $\epsilon > 0$ and $\delta > 0$, $P\Big(\tilde{\mathcal{E}}(\hat{F}) - \inf_{F\in L^2(\mathcal{X},P_X;\mathcal{H}_\mathcal{Y})}\tilde{\mathcal{E}}(F) > \epsilon\Big) < \delta$ for large enough $n$. An explicit rate of convergence is given in the proof in Appendix C.
The boundedness assumption is satisfied by many commonly used kernels, such as the Gaussian and Laplacian kernels, and hence is not a restrictive condition. The key observation is that the target values are all of the form $k_\mathcal{Y}(y_i,\cdot)$ for $y_i\in\mathcal{Y}$, so the target set is bounded in $\mathcal{H}_\mathcal{Y}$ if $k_\mathcal{Y}$ is bounded (see Appendix B and the proof in Appendix C for details).