Counterfactual Mean Embeddings
Abstract
Counterfactual inference has become a ubiquitous tool in online advertisement, recommendation systems, medical diagnosis, and finance. An accurate modelling of outcome distributions associated with different interventions, known as counterfactual distributions, is crucial for the success of these applications. In this work, we propose to model counterfactual distributions using a novel Hilbert space representation called counterfactual mean embedding (CME). The CME embeds the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel, which allows us to perform causal inference over the entire landscape of the counterfactual distribution. Based on this representation, we propose a distributional treatment effect (DTE) which can quantify the causal effect over entire outcome distributions. Our approach is nonparametric as the CME can be estimated consistently from observational data without requiring any parametric assumption about the underlying distributions. We also establish a rate of convergence of the proposed estimator which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Furthermore, our framework also allows for more complex outcomes such as images, sequences, and graphs. Lastly, our experimental results on synthetic data and off-policy evaluation tasks demonstrate the advantages of the proposed estimator.
Authors: Muandet, Kanagawa, Saengkyongam, and Marukatat
Keywords: Counterfactual Inference, Kernel Mean Embedding, Potential Outcome Framework, Reproducing Kernel Hilbert Space, Causality.
1 Introduction
To make a rational decision, a decision maker must be able to anticipate the effects of a decision on the outcomes of interest before committing to that decision. For instance, before building a certain facility in a city, e.g., a dam, policy makers and citizens must seek to understand its environmental effects. In medicine, a doctor has some prior knowledge about the effects a certain drug will have on a patient's health before actually prescribing it. In business, a company needs to understand the effects of a certain advertising strategy on its revenue and reputation. One approach to addressing these questions is counterfactual inference.
In counterfactual inference, there are three main ingredients: covariate $X$, treatment $T$, and outcome $Y$. Given certain realizations $(x_1, t_1, y_1), \dots, (x_n, t_n, y_n)$ of these variables, an analyst wishes to know the effects of $T$ on $Y$, provided $X$; thus, $T$ corresponds to a decision to be made. For example, in medicine $X$ may represent a patient's characteristics (e.g., age, weight, medical record, etc.), $T$ whether a certain drug is prescribed, and $Y$ whether the patient recovers; each index $i$ in the data represents the identity of the $i$-th patient, and $(x_i, t_i, y_i)$ denotes his/her features, received treatment and resulting state of recovery.
This problem is called counterfactual, since for each index $i$, the analyst can only observe the outcome $y_i$ of one treatment $t_i$. The outcome that would have been observed for an alternative treatment is not observable. In the bandit community, this problem is referred to as bandit feedback (Dudík et al., 2011). For example, it is impossible to observe the outcome of drug A when the patient had actually received drug B. This is known as the fundamental problem of causal inference (Holland, 1986). One way to partially address this issue is a randomized experiment (Fisher, 1935), which corresponds to obtaining $(x_i, t_i, y_i)_{i=1}^n$ as independently and identically distributed (i.i.d.) with the population random variables $(X, T, Y)$. Although considered a gold standard, randomization can be too expensive, time-consuming, or unethical in practice. In most cases, an analysis of treatment effects can only be done on the basis of observational data that may not be i.i.d. (i.e., the treatment assignment $T$ and the outcome $Y$ may depend on $X$ and some hidden confounders $Z$); this setting is commonly known as observational studies (Rosenbaum, 2002; Rubin, 2005).
A fundamental framework for observational studies is the potential outcome framework (Neyman, 1923; Rubin, 1974), which guarantees that one can estimate treatment effects using observational data $(x_i, t_i, y_i)_{i=1}^n$, provided that the data satisfy a certain assumption; see Section 3.1. It has been studied extensively in statistics, and has a wide range of applications in medicine, social science, and economics; see, e.g., Imbens and Rubin (2015). Moreover, important applications of machine learning such as online advertisement and recommendation systems can be reformulated under this framework (Bottou et al., 2013; Liang et al., 2016; Schnabel et al., 2016). We argue, however, that the following challenges remain:
Average treatment effects. Existing works focus on estimating the average treatment effect (ATE), which is the difference between the means of the outcome distributions; see Section 3.1 for details. However, the ATE does not reveal changes in higher-order moments, even when they exist. For instance, if a treatment of interest affects only the variance of the outcome distribution, then an analysis of average treatment effects cannot capture such an effect. Suppose the treatment is whether to provide a certain drug, and the outcome is the blood pressure of a patient. Analyzing only the average treatment effect may then lead to an incorrect conclusion if the drug increases/decreases the blood pressure of a patient whose blood pressure was already high/low. This highlights the importance of analyzing the outcome distribution as a whole.
Parametric models. Many existing approaches make parametric assumptions about the relationships between covariates $X$, treatments $T$, and outcomes $Y$. However, if the imposed parametric assumption is incorrect, i.e., the model is misspecified, then the conclusion about treatment effects can be wrong or misleading.
Overparameterized models. Deep learning has become the first choice in many applied fields due to its excellent empirical performance, and thus has also been applied to counterfactual inference, e.g., Johansson et al. (2016); Hartford et al. (2017). Unfortunately, such approaches based on deep learning lack theoretical guarantees, because arguably deep learning itself lacks an established theory as a learning method (at least until now). This is problematic when consequential decisions are based on the analysis of treatment effects (e.g., political decisions and medical treatments).
Multivariate/structured outputs. Existing works often deal with outcomes that are binary or real-valued. However, depending on the application, outcome variables may be multivariate (possibly high-dimensional) or structured, such as images and graphs. For example, in medical data analysis, outcomes may be fMRI data taken from a subject after providing a certain treatment. Thus, it is not straightforward to apply existing approaches.
In this work, we propose a novel framework for counterfactual inference to address the above challenges, which we term counterfactual mean embedding (CME). The CME is based on kernel mean embedding (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007; Muandet et al., 2017), a framework for representing probability distributions as elements in a reproducing kernel Hilbert space (RKHS), so that each element representing a distribution maintains all of its information (cf. Sections 2.2 and 2.3). We define an element representing a counterfactual distribution, for which we propose a nonparametric estimator. Notable features of the proposed approach are summarized as follows:

The proposed estimator can be computed based only on linear algebraic operations involving kernel matrices. Being a kernel method, it can be applied not only to standard domains (such as the Euclidean space), but also to more complex and structured covariates and/or outcomes such as images, sequences, and graphs, by using off-the-shelf kernels designed for such data (Gärtner, 2003); this widens possible applications of counterfactual inference in general (cf. Section 3.3).

The proposed estimator can be used for computing a distance between the counterfactual and controlled distributions, thereby providing a way of quantifying the effect of a treatment on the distribution of outcomes; we define this distance as the maximum mean discrepancy (MMD) (Borgwardt et al., 2006; Gretton et al., 2012) between the counterfactual and controlled distributions. It also provides a way to sample points from a counterfactual distribution based on kernel herding (Chen et al., 2010), a kernel-based deterministic sampling method (cf. Section 3.4).

The proposed estimator is nonparametric, and has theoretical guarantees. Specifically, we prove the consistency of the proposed estimator under a very mild condition (cf. Theorem 4.1), and derive its convergence rates under certain regularity assumptions involving kernels and underlying distributions (cf. Theorem 4.2). Both results hold without any parametric assumption.
The rest of the paper is organized as follows. After summarizing related work in Section 1.1, we review in Section 2 the potential outcome framework as well as kernel mean embedding of distributions. Section 3 introduces counterfactual learning and then provides a generalization of Hilbert space embedding to counterfactual distributions. In this section, we also present the theoretical results and the application of CME and the associated distributional treatment effect (DTE). We subsequently provide the detailed convergence analysis in Section 4, followed by examples of the important applications in Section 5. Finally, we demonstrate the effectiveness of the proposed estimator on simulated data as well as real-world policy evaluation tasks in Section 6.
1.1 Related Work
We summarize below related works on counterfactual inference.
Individual treatment effect (ITE) estimation. Estimating the ITE is one of the most fundamental tasks in counterfactual inference (Rubin, 1974; Shalit et al., 2017). This task is hindered by the fact that one cannot observe both outcomes at the same time for each subject. Moreover, the data is usually biased by a nonrandomized treatment assignment. Modern approaches attempt to resolve these problems by resorting to state-of-the-art ML algorithms. For example, Hill (2011) develops a nonparametric method for estimating the ITE based on Bayesian additive regression tree (BART). Athey and Imbens (2016) and Wager and Athey (2018) adapt tree-based methods to treatment effect estimation. Shalit et al. (2016) and Johansson et al. (2016) formulate the problem as a domain adaptation problem and propose to balance the covariates using representation learning. Hartford et al. (2017) develop a two-step regression method based on deep neural networks for instrumental variable regression. Adversarial training of neural networks for causal inference has also been considered, e.g., in Yoon et al. (2018). Unlike these works, we focus on the distributional treatment effect, which involves the entire outcome distributions. This scenario often arises in real-world applications in econometrics; see, e.g., Rothe (2010); Chernozhukov et al. (2013).
Policy learning from observational data. Learning an optimal policy by interacting directly with an environment is an important task, but can sometimes be prohibitive due to safety constraints. As a result, several works have attempted to leverage historical data collected using a logging policy in off-policy learning (Atan et al., 2018; Kandasamy et al., 2017). Most methods rely on importance sampling (Langford et al., 2008; Bottou et al., 2013; Swaminathan and Joachims, 2015). When the logging policy is unknown, Strehl et al. (2010) provides error bounds when it is learned from the data. Dudík et al. (2011) uses a doubly robust estimator to reduce the variance of off-policy evaluation. Swaminathan and Joachims (2015) presents a framework for policy learning called counterfactual risk minimization (CRM) based on empirical variance regularization. In this work, we also demonstrate the application of our estimator in policy evaluation.
Causal inference with kernel mean embeddings. Hilbert space embedding of distributions has been applied extensively in causal inference. In causal discovery, Fukumizu et al. (2008); Zhang et al. (2011); Doran et al. (2014) develop powerful kernel-based tests of conditional independence which allow for the recovery of the causal graphs up to the Markov equivalence class (Pearl, 2000; Spirtes et al., 2000). Based on functional causal models, Schölkopf et al. (2015) proposes to identify the causal direction between $X$ and $Y$ using kernel mean embeddings of the associated distributions. Lopez-Paz et al. (2015) adopts a data-driven approach for cause-effect inference by treating it as a classification problem on the embeddings of the joint distributions. A notion of an asymmetry between the decompositions $P(X,Y) = P(Y|X)P(X)$ and $P(X,Y) = P(X|Y)P(Y)$ has been used in Chen et al. (2014) and Mitrovic et al. (2018) to identify the causal direction between $X$ and $Y$. In the potential outcome framework, the maximum mean discrepancy (MMD) (Borgwardt et al., 2006; Gretton et al., 2012) has become a popular technique used in covariate balancing between treatment and control groups (Shalit et al., 2016; Johansson et al., 2016). Our work, in contrast, focuses on characterizing the representation of the counterfactual distribution of outcomes using the kernel mean embedding and provides nonparametric inference tools.
2 Preliminaries
The counterfactual mean embedding relies on the potential outcome framework as well as the concepts of kernels, reproducing kernel Hilbert spaces (RKHSs), and kernel mean embedding of distributions. We review these concepts in this section.
2.1 Kernels and Reproducing Kernel Hilbert Spaces (RKHSs)
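We first review kernels and RKHSs, details of which can be found in, e.g., Schölkopf and Smola (2002), Berlinet and Thomas-Agnan (2004), and Smola et al. (2007).
Let $\mathcal{X}$ be a nonempty set. Let $\mathcal{H}$ be a Hilbert space consisting of functions on $\mathcal{X}$, with $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and $\| \cdot \|_{\mathcal{H}}$ being its inner product and norm, respectively. The Hilbert space $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) if there exists a symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, called the reproducing kernel of $\mathcal{H}$, satisfying the following properties:

For all $x \in \mathcal{X}$, we have $k(\cdot, x) \in \mathcal{H}$. Here $k(\cdot, x)$ is the function of the first argument with $x$ being fixed, such that $x' \mapsto k(x', x)$.

For all $f \in \mathcal{H}$ and $x \in \mathcal{X}$, we have $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. This is called the reproducing property of $k$ (or of $\mathcal{H}$).
It is known that the linear span of the functions $k(\cdot, x)$, denoted by $\mathrm{span}\{ k(\cdot, x) : x \in \mathcal{X} \}$, is dense in $\mathcal{H}$, i.e.,
\[ \mathcal{H} = \overline{\mathrm{span}\{ k(\cdot, x) : x \in \mathcal{X} \}}, \]
where the closure on the right-hand side is taken with respect to the norm of $\mathcal{H}$. In other words, any $f \in \mathcal{H}$ can be written as $f = \sum_{i=1}^{\infty} \alpha_i k(\cdot, x_i)$ for some $(\alpha_i)_{i=1}^{\infty} \subset \mathbb{R}$ and $(x_i)_{i=1}^{\infty} \subset \mathcal{X}$ such that $\| f \|_{\mathcal{H}}^2 = \sum_{i,j=1}^{\infty} \alpha_i \alpha_j k(x_i, x_j) < \infty$.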
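Any RKHS $\mathcal{H}$ is uniquely associated with its reproducing kernel $k$, which is positive definite: a symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called positive definite if for all $n \in \mathbb{N}$, $c_1, \dots, c_n \in \mathbb{R}$, and all $x_1, \dots, x_n \in \mathcal{X}$, we have $\sum_{i=1}^n \sum_{j=1}^n c_i c_j k(x_i, x_j) \geq 0$. On the other hand, for any positive definite kernel $k$, there exists an RKHS $\mathcal{H}$ for which $k$ is the reproducing kernel (Aronszajn, 1950). Therefore, by defining a positive definite kernel, one always implicitly defines its RKHS.
As the definition of positive definiteness indicates, kernels can be defined on any nonempty set $\mathcal{X}$. Therefore, they have been defined not only for the real vector space $\mathbb{R}^d$, but also for nonstandard domains such as those of images and graphs. Popular kernels on $\mathbb{R}^d$ include linear kernels $k(x, x') = x^\top x'$, polynomial kernels $k(x, x') = (x^\top x' + c)^m$ with $c \geq 0$ and $m \in \mathbb{N}$, Gaussian kernels $k(x, x') = \exp(-\| x - x' \|^2 / (2\sigma^2))$ with $\sigma > 0$, and Laplace (or more generally Matérn) kernels $k(x, x') = \exp(-\| x - x' \| / \sigma)$. More examples of positive definite kernels can be found, e.g., in Genton (2002) and Hofmann et al. (2008).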
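As a small numerical illustration of positive definiteness, the following sketch builds the Gram matrix of a Gaussian kernel on random points and inspects its smallest eigenvalue. The function name `gaussian_kernel`, the bandwidth `sigma = 1.0`, and the synthetic data are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2)).

    A has shape (n, d) and B has shape (m, d); the result has shape (n, m).
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # hypothetical sample of 50 points in R^3
K = gaussian_kernel(X, X)

# Positive definiteness of the kernel implies that c^T K c >= 0 for all c,
# i.e. all eigenvalues of the symmetric matrix K are nonnegative
# (up to numerical round-off).
min_eig = np.linalg.eigvalsh(K).min()
```

The same check applies to any candidate kernel: a clearly negative eigenvalue of a Gram matrix shows that the function is not positive definite.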
2.2 Kernel Mean Embedding of Distributions
In this work, we use kernels and RKHSs to represent, compare, and estimate probability distributions. This is enabled by the approach known as kernel mean embedding of distributions (Berlinet and Thomas-Agnan, 2004; Smola et al., 2007; Muandet et al., 2017), which we review here. In what follows, we assume that $\mathcal{X}$ is a measurable space with some sigma algebra $\Sigma$.
Definition (Kernel mean embedding (KME)).
Let $\mathscr{P}$ be the set of all probability measures on a measurable space $(\mathcal{X}, \Sigma)$, and $k$ be a measurable positive definite kernel with associated RKHS $\mathcal{H}$, such that $\sup_{x \in \mathcal{X}} k(x, x) < \infty$.
Then, the kernel mean embedding (KME) of $P \in \mathscr{P}$ is defined as a Bochner integral
\[ \mu_P := \int k(\cdot, x) \, dP(x) \in \mathcal{H}. \tag{1} \]
The element $\mu_P$ may be alternatively called the kernel mean of $P$. For a random variable $X \sim P$, the kernel mean may also be written as $\mu_X = \mathbb{E}[k(\cdot, X)]$.
The kernel mean $\mu_P$ serves as a representation of $P$ in the RKHS $\mathcal{H}$. This is justified if $\mathcal{H}$ is characteristic (Fukumizu et al., 2004): the RKHS $\mathcal{H}$ (and the associated kernel $k$) is defined to be characteristic if the mapping $P \mapsto \mu_P$ in (1) is injective. In other words, $\mathcal{H}$ is characteristic if for any $P, Q \in \mathscr{P}$, we have $\mu_P = \mu_Q$ if and only if $P = Q$. That is, $\mu_P$ is uniquely associated with $P$, and thus becomes a unique representation of $P$ in $\mathcal{H}$, maintaining all information about $P$. Examples of characteristic kernels on $\mathbb{R}^d$ include Gaussian, Matérn and Laplace kernels (Sriperumbudur et al., 2010). On the other hand, linear and polynomial kernels are not characteristic, since their RKHSs are finite dimensional and only provide unique representations of distributions up to certain moments.
The kernel mean embedding (1) is the key ingredient of a well-known metric on probability measures called maximum mean discrepancy (MMD) (Borgwardt et al., 2006; Gretton et al., 2012). For two distributions $P, Q \in \mathscr{P}$, their MMD is given as the RKHS distance between the corresponding kernel means $\mu_P$, $\mu_Q$:
\[ \mathrm{MMD}[\mathcal{H}, P, Q] := \| \mu_P - \mu_Q \|_{\mathcal{H}} = \sup_{\| f \|_{\mathcal{H}} \leq 1} \left| \int f(x) \, dP(x) - \int f(x) \, dQ(x) \right|, \tag{2} \]
where the second identity follows from the reproducing property and $\mathcal{H}$ being a vector space (Gretton et al., 2012, Lemma 4). The right expression is the maximum discrepancy between the means of functions from the unit ball of the RKHS $\mathcal{H}$, and is the original definition of MMD. Being defined via the RKHS distance, MMD is a pseudometric on $\mathscr{P}$. Moreover, if $\mathcal{H}$ is characteristic, $\mathrm{MMD}[\mathcal{H}, P, Q] = 0$ holds if and only if $P = Q$, and thus MMD becomes a proper metric on probability measures. See Sriperumbudur et al. (2010); Simon-Gabriel and Schölkopf (2018) for details and relationships to other popular metrics on probability measures.
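Given an i.i.d. sample $x_1, \dots, x_n$ from $P$, the kernel mean $\mu_P$ can be estimated simply by an empirical average
\[ \hat{\mu}_P := \frac{1}{n} \sum_{i=1}^n k(\cdot, x_i). \tag{3} \]
The consistency of (3), that is $\| \hat{\mu}_P - \mu_P \|_{\mathcal{H}} = O_p(n^{-1/2})$ as $n \to \infty$, has been established in Song (2008, Theorem 27) and also in Gretton et al. (2012); Lopez-Paz et al. (2015); Tolstikhin et al. (2017). Importantly, this holds without any parametric assumption about the underlying distribution $P$.
Given another i.i.d. sample $y_1, \dots, y_m$ from $Q$, and defining $\hat{\mu}_Q := \frac{1}{m} \sum_{j=1}^m k(\cdot, y_j)$ as an estimate of the kernel mean $\mu_Q$, the (squared) MMD (2) can be estimated as
\[ \widehat{\mathrm{MMD}}^2[\mathcal{H}, P, Q] := \| \hat{\mu}_P - \hat{\mu}_Q \|_{\mathcal{H}}^2 = \frac{1}{n^2} \sum_{i,j=1}^{n} k(x_i, x_j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j) + \frac{1}{m^2} \sum_{i,j=1}^{m} k(y_i, y_j), \]
where the right expression follows from the reproducing property (Gretton et al., 2012, Eq. 5). Applying the triangle inequality, it follows that $| \widehat{\mathrm{MMD}}[\mathcal{H}, P, Q] - \mathrm{MMD}[\mathcal{H}, P, Q] | = O_p(n^{-1/2} + m^{-1/2})$ as $n, m \to \infty$, implying the consistency of the above estimator of MMD with a parametric convergence rate. This estimator only requires evaluations of the kernel, and therefore is easy to implement in practice. We note that the above MMD estimator is biased, while being consistent; an unbiased estimator is also available for MMD (Gretton et al., 2012, Eq. 3).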
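The biased squared-MMD estimate described above only needs three averages of kernel evaluations. A minimal sketch, assuming a Gaussian kernel with bandwidth 1 and synthetic one-dimensional samples (the helper names `gaussian_kernel` and `mmd2_biased` are hypothetical):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate ||mu_hat_P - mu_hat_Q||^2 of the squared MMD."""
    k_xx = gaussian_kernel(X, X, sigma).mean()   # (1/n^2) sum_ij k(x_i, x_j)
    k_yy = gaussian_kernel(Y, Y, sigma).mean()   # (1/m^2) sum_ij k(y_i, y_j)
    k_xy = gaussian_kernel(X, Y, sigma).mean()   # (1/(nm)) sum_ij k(x_i, y_j)
    return k_xx + k_yy - 2.0 * k_xy

rng = np.random.default_rng(0)
sample_p1 = rng.normal(0.0, 1.0, size=(500, 1))  # from P = N(0, 1)
sample_p2 = rng.normal(0.0, 1.0, size=(500, 1))  # a second sample from P
sample_q = rng.normal(1.0, 1.0, size=(500, 1))   # from Q = N(1, 1)

mmd2_same = mmd2_biased(sample_p1, sample_p2)    # small, but not exactly zero
mmd2_diff = mmd2_biased(sample_p1, sample_q)     # clearly positive
```

Because the estimate is a squared RKHS norm, it is nonnegative even when both samples come from the same distribution; this is the bias mentioned above.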
2.3 Kernel Mean Embedding of Conditional Distributions
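Finally, the notion of KME can be extended to conditional distributions (Song et al., 2009; Grünewälder et al., 2012; Song et al., 2013; Fukumizu et al., 2013). To describe this, let $(X, Y)$ be a random variable taking values in the product space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are measurable spaces. Define a measurable kernel $k$ on $\mathcal{X}$ and let $\mathcal{H}$ be the associated RKHS. Similarly, define a measurable kernel $\ell$ on $\mathcal{Y}$ and let $\mathcal{F}$ be the associated RKHS. Let $P_{XY}$ be the joint distribution of $(X, Y)$, and $P_{Y|X}$ be the conditional distribution of $Y$ given $X$.
The KME of the conditional distribution $P_{Y|X}$ is then defined as the conditional expectation of $\ell(\cdot, Y)$ with respect to $P_{Y|X}$:
\[ \mu_{Y|X=x} := \mathbb{E}[\ell(\cdot, Y) \mid X = x] = \int \ell(\cdot, y) \, dP_{Y|X}(y \mid x). \tag{4} \]
Again, if $\ell$ is characteristic, this kernel mean maintains all information about $P_{Y|X=x}$, thus being qualified as its representation. Note that $\mu_{Y|X=x}$ is defined for each $x \in \mathcal{X}$.
Given an i.i.d. sample $(x_1, y_1), \dots, (x_n, y_n)$ from the joint distribution $P_{XY}$, the conditional mean embedding (4) can be estimated as
\[ \hat{\mu}_{Y|X=x} := \sum_{i=1}^n \beta_i(x) \, \ell(\cdot, y_i), \tag{5} \]
where
\[ (\beta_1(x), \dots, \beta_n(x))^\top := (K + n \varepsilon I_n)^{-1} k_x, \qquad k_x := (k(x, x_1), \dots, k(x, x_n))^\top \in \mathbb{R}^n. \]
Here, $K \in \mathbb{R}^{n \times n}$ is the kernel matrix such that $K_{ij} = k(x_i, x_j)$, $I_n$ is the identity matrix, and $\varepsilon > 0$ is a regularization constant. As pointed out by Grünewälder et al. (2012), this estimator can be interpreted as that of function-valued kernel ridge regression, where the task is to estimate the mapping $x \mapsto \mu_{Y|X=x}$ from training data $(x_i, \ell(\cdot, y_i))_{i=1}^n$. In fact, the weights in (5) are identical to those of kernel ridge regression (or Gaussian process regression). As such, the regularization constant $\varepsilon$ should decay to $0$ at an appropriate speed as $n \to \infty$, in order to ensure a good convergence rate of the estimator (5); see, e.g., Caponnetto and Vito (2007).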
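The weights of the conditional mean embedding estimator can be computed with a single linear solve. The sketch below is illustrative only: it assumes a Gaussian kernel on the covariates, a hand-picked regularization constant `eps`, and synthetic data, and the helper name `cme_weights` is hypothetical. It then uses the weights to approximate a conditional expectation through the weighted sum of $f(y_i)$ with $f(y) = y$, which coincides with the kernel ridge regression prediction.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def cme_weights(X_train, x_query, eps=1e-3, sigma=1.0):
    """Weights beta(x) = (K + n*eps*I)^{-1} k_x for each query point x.

    Returns an array of shape (n, q), one column per query point.
    """
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)      # (n, n)
    k_x = gaussian_kernel(X_train, x_query, sigma)    # (n, q)
    return np.linalg.solve(K + n * eps * np.eye(n), k_x)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)      # Y | X = x  ~  sin(x) + noise

x0 = np.array([[1.0]])
beta = cme_weights(X, x0)                             # shape (300, 1)

# Weighted sum of f(y_i) with f(y) = y: an estimate of E[Y | X = 1].
cond_mean = float(beta[:, 0] @ Y)
```

In this toy setting `cond_mean` should lie close to $\sin(1) \approx 0.84$; the exact value depends on the sample, the bandwidth, and the regularization constant.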
3 Counterfactual Mean Embeddings
In this section, we formulate our problem of estimating distributional treatment effects and describe our approach. In Section 3.1, we review the potential outcome framework and, based on it, we define distributional treatment effects. The key concepts here are counterfactual distributions on outcomes. In Section 3.2, we describe our approach, counterfactual mean embeddings, as the kernel mean embeddings of counterfactual distributions. We then define their empirical estimators in Section 3.3. Finally, we introduce the kernel treatment effect (KTE) as a way to evaluate the distributional treatment effect in Section 3.4.
3.1 Potential Outcome Framework and Distributional Causal Effects
We pose our problem based on the potential outcome framework, also known as the Neyman-Rubin causal model, which is a classic and widely used approach to estimating causal effects of treatments from observational data (Neyman, 1923; Rubin, 1974, 2005).
We consider a hypothetical subject (e.g., a patient) in a population. Let $X \in \mathcal{X}$ be a covariate random variable representing the subject's features (e.g., age, weight, blood pressure, etc.), where $\mathcal{X}$ is a measurable space. Let $T \in \{0, 1\}$ be a treatment random variable whose effects on the subject we are interested in. In this work we focus on binary treatments for simplicity, but extension to multiple treatments is straightforward. For instance, $T = 1$ may represent an active treatment of giving the patient a certain drug, and $T = 0$ a control treatment of not giving any drug.
Let $Y_0^*, Y_1^* \in \mathcal{Y}$ be random variables representing potential outcomes, where $\mathcal{Y}$ is a measurable space. That is, $Y_0^*$ represents the outcome of interest after the subject is exposed to the treatment $T = 0$, and $Y_1^*$ the outcome after the subject is exposed to the treatment $T = 1$. For instance, $Y_1^*$ may be the blood pressure of the patient measured after the patient had the drug, and $Y_0^*$ be that after having nothing. The problem here, known as the fundamental problem of causal inference, is that one can only observe either $Y_0^*$ or $Y_1^*$, but not both. For instance, if one gave the drug to the patient and measured the resulting blood pressure, it is no longer possible to measure the blood pressure of the same patient without the drug. Thus, the observed outcome $Y \in \mathcal{Y}$ can be defined as
\[ Y := \mathbb{1}(T = 0) \, Y_0^* + \mathbb{1}(T = 1) \, Y_1^*, \]
where $\mathbb{1}(T = t) = 1$ if $T = t$, and zero otherwise. Note that in observational studies, treatment assignments may not be completely random, i.e., $T$ may depend on $X$, $Y_0^*$ and $Y_1^*$.
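To make the observed-outcome construction concrete, the following simulation draws both potential outcomes for every subject, assigns treatment with a probability that depends on the covariate, and then reveals only the outcome of the assigned treatment. The data-generating process (logistic assignment, Gaussian outcomes) is an assumption chosen for illustration and does not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Covariate and both potential outcomes (never jointly observed in practice).
X = rng.normal(size=n)
Y0_star = X + rng.normal(size=n)          # outcome under T = 0
Y1_star = X + 2.0 * rng.normal(size=n)    # outcome under T = 1: same mean, larger spread

# Non-random assignment: subjects with larger X are treated more often.
propensity = 1.0 / (1.0 + np.exp(-X))
T = rng.binomial(1, propensity)

# Observed outcome: Y = 1(T = 0) * Y0_star + 1(T = 1) * Y1_star.
Y = (T == 0) * Y0_star + (T == 1) * Y1_star

# The analyst only sees (X, T, Y). Comparing group means mixes the
# treatment effect with the difference in covariates between groups.
naive_difference = Y[T == 1].mean() - Y[T == 0].mean()
oracle_ate = (Y1_star - Y0_star).mean()   # needs both potential outcomes
```

Here the treatment has no effect on the mean, so `oracle_ate` is close to zero, while `naive_difference` is clearly positive because treated subjects tend to have larger covariate values.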
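Assume that there are $n$ subjects, and that each subject $i$ is associated with random variables $(X_i, T_i, Y_{0i}^*, Y_{1i}^*)$ that are distributed as $(X, T, Y_0^*, Y_1^*)$ independently of the other subjects,
\[ (X_i, T_i, Y_{0i}^*, Y_{1i}^*) \overset{\mathrm{i.i.d.}}{\sim} (X, T, Y_0^*, Y_1^*), \qquad i = 1, \dots, n. \tag{6} \]
Note that for each subject $i$, only one of $Y_{0i}^*$ or $Y_{1i}^*$ can be observed. Thus, observational data given to the analyst are
\[ (X_i, T_i, Y_i)_{i=1}^n, \qquad Y_i := \mathbb{1}(T_i = 0) \, Y_{0i}^* + \mathbb{1}(T_i = 1) \, Y_{1i}^*, \tag{7} \]
which are i.i.d. with $(X, T, Y)$. We write $n_0$ for the number of subjects receiving treatment $T = 0$, and $n_1$ for that of treatment $T = 1$.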
We consider three kinds of distributional causal effect, as described below. For ease of understanding, we also present the corresponding expressions based on the sample (6). Nevertheless, these sample expressions are also counterfactual quantities due to the fundamental problem of causal inference.
Distributional Average Treatment Effect (DATE)
Let $P_{Y_0^*}$ and $P_{Y_1^*}$ be the distributions of $Y_0^*$ and $Y_1^*$, respectively. Then we define the distributional average treatment effect (DATE) as the difference between these two distributions:
\[ \mathrm{DATE} := P_{Y_1^*} - P_{Y_0^*}. \tag{8} \]
The corresponding sample expression is given by
\[ \frac{1}{n} \sum_{i=1}^n \delta_{Y_{1i}^*} - \frac{1}{n} \sum_{i=1}^n \delta_{Y_{0i}^*}, \]
where $\delta_y$ is the Dirac distribution at $y$. As mentioned, this sample expression cannot be obtained from observational data (7), since for each subject $i$ we only have either $Y_{0i}^*$ or $Y_{1i}^*$.
The DATE (8) can capture the treatment effects on the potential outcomes that may not be identified only by the average treatment effect (ATE) (Imbens, 2004), the difference between the expectations of $Y_1^*$ and $Y_0^*$:
\[ \mathrm{ATE} := \mathbb{E}[Y_1^*] - \mathbb{E}[Y_0^*], \tag{9} \]
or its corresponding sample version
\[ \frac{1}{n} \sum_{i=1}^n \left( Y_{1i}^* - Y_{0i}^* \right). \]
For instance, even when the ATE is $0$, the higher-order moments of $Y_0^*$ and $Y_1^*$, such as their variances, may differ. The DATE can capture such a difference, while the ATE cannot.
Distributional Effects Caused by Treatment Assignment Mechanism
This is defined by the difference between the conditional distribution of $Y_0^*$ given $T = 0$ and that of $Y_0^*$ given $T = 1$:
\[ P_{Y_0^* | T=0} - P_{Y_0^* | T=1}, \tag{10} \]
where $P_{Y_0^* | T=t}$ is the conditional distribution of $Y_0^*$ given $T = t$. Note that a similar definition can be given for $Y_1^*$. The corresponding sample expression is given by
\[ \frac{1}{n_0} \sum_{i : T_i = 0} \delta_{Y_{0i}^*} - \frac{1}{n_1} \sum_{i : T_i = 1} \delta_{Y_{0i}^*}. \]
Note that the second term is counterfactual in the sense that the potential outcome $Y_{0i}^*$ of a subject $i$ with $T_i = 1$ is not observable.
The above distributional differences capture the effects caused by the difference in the characteristics of subjects exposed to different treatments, rather than the effects caused by the treatment itself. For instance, assume that a drug was assigned to subjects whose blood pressures were already high ($T = 1$) and not assigned to subjects with low blood pressures ($T = 0$). Then the counterfactual blood pressures of the subjects with $T = 1$, which would have been observed if they had not taken the drug, would be higher than those with $T = 0$.
Distributional Effects Caused by Treatment
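This is defined as the difference between two conditional distributions as
\[ P_{Y_1^* | T=t} - P_{Y_0^* | T=t}, \qquad t \in \{0, 1\}. \tag{11} \]
For $t = 1$, this can be understood as the distributional treatment effect for the treated, and the corresponding sample expression is given by
\[ \frac{1}{n_1} \sum_{i : T_i = 1} \delta_{Y_{1i}^*} - \frac{1}{n_1} \sum_{i : T_i = 1} \delta_{Y_{0i}^*}, \]
where the second term is counterfactual. The details of the conditional treatment effect (11) can be found for example in Chernozhukov et al. (2013, p.2214).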
3.2 Counterfactual Distributions
To deal with the distributional treatment effects discussed in the previous subsection, we need to introduce the notion of counterfactual distributions (Chernozhukov et al., 2013). We first summarize the notation defined above and introduce new notation, following Chernozhukov et al. (2013, Appendix C).
Definition. Let $Y_0^*$ and $Y_1^*$ be random variables taking values in $\mathcal{Y}$, and $T$ and $X$ be random variables taking values in $\{0, 1\}$ and $\mathcal{X}$, respectively. The random variables $Y$, $Y_t$ and $X_t$ ($t \in \{0, 1\}$) are defined as
\[ Y := \mathbb{1}(T = 0) \, Y_0^* + \mathbb{1}(T = 1) \, Y_1^*, \qquad Y_t := Y \mid T = t, \qquad X_t := X \mid T = t. \]
In Definition 3.2, $Y$ is the observed outcome variable. $Y_t$ is then $Y$ given that the treatment is $T = t$ ($t \in \{0, 1\}$). By definition of $Y$, this implies that $Y_t = Y_t^* \mid T = t$, that is, $Y_t$ is the potential outcome $Y_t^*$ conditional on $T = t$. Note that, since $Y_t^*$ and $T$ may be dependent, $Y_t$ may be different from $Y_t^*$ as a random variable. The variable $X_t$ is the covariate variable $X$ conditional on $T = t$. The pair of variables $(X_t, Y_t)$ can thus be seen as observed random variables conditional on the treatment $T = t$.
The following is a key assumption, which is needed in general for counterfactual inference with observational data.
Assumption 1

(A1) Conditional exogeneity: $Y_t^* \perp\!\!\!\perp T \mid X$ almost surely for $t \in \{0, 1\}$.

(A2) Support condition: $\mathcal{X}_1 \subseteq \mathcal{X}_0$, where $\mathcal{X}_t$ is the support of $X_t$ for $t \in \{0, 1\}$.
The conditional exogeneity (A1), also known as unconfoundedness or ignorability, is a common assumption in observational studies to guarantee the identifiability of causal effects from observational data (Rosenbaum and Rubin, 1983; Imbens, 2004; Rubin, 2005). It requires that there is no hidden confounder, say $Z$, that affects both the treatment $T$ and the potential outcomes $Y_0^*, Y_1^*$. In other words, the covariates $X$ include all important characteristics regarding the potential outcomes. The support condition (A2) is needed to make the counterfactual distribution (introduced in (12) below) well-defined, and is also made in Chernozhukov et al. (2013, Eq. 2.3). It is analogous to the overlap assumption required for propensity score methods (e.g. Imbens, 2004, Assumption 2.2).
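We now define counterfactual distributions. Let $P_{X_0 Y_0}$ and $P_{X_1 Y_1}$ be the probability distributions of $(X_0, Y_0)$ and $(X_1, Y_1)$, respectively. Denote by $P_{Y\langle 0|0 \rangle}$ and $P_{Y\langle 1|1 \rangle}$ the corresponding marginal distributions of outcomes defined by
\[ P_{Y\langle 0|0 \rangle}(y) := \int P_{Y_0|X_0}(y \mid x) \, dP_{X_0}(x), \qquad P_{Y\langle 1|1 \rangle}(y) := \int P_{Y_1|X_1}(y \mid x) \, dP_{X_1}(x), \]
where $P_{Y_0|X_0}$ is the conditional distribution of $Y_0$ given $X_0$, and $P_{Y_1|X_1}$ is that of $Y_1$ given $X_1$. Following Chernozhukov et al. (2013), counterfactual distributions are then defined as
\[ P_{Y\langle 0|1 \rangle}(y) := \int P_{Y_0|X_0}(y \mid x) \, dP_{X_1}(x), \tag{12} \]
\[ P_{Y\langle 1|0 \rangle}(y) := \int P_{Y_1|X_1}(y \mid x) \, dP_{X_0}(x), \tag{13} \]
which are well-defined as long as the support condition in Assumption 1 is satisfied.
The distributions introduced above are defined in terms of the observed random variables $(X_0, Y_0)$ and $(X_1, Y_1)$. We now see how these distributions are related to the distributions on potential outcomes that appear in distributional causal effects (10) and (11). First, as summarized in the following lemma, $P_{Y\langle 0|0 \rangle}$ and $P_{Y\langle 1|1 \rangle}$ are nothing but $P_{Y_0^*|T=0}$ and $P_{Y_1^*|T=1}$, respectively. For completeness, we include the proof in Appendix B.1.
Lemma. We have $P_{Y\langle 0|0 \rangle} = P_{Y_0^*|T=0}$ and $P_{Y\langle 1|1 \rangle} = P_{Y_1^*|T=1}$.
On the other hand, the counterfactual distributions $P_{Y\langle 0|1 \rangle}$ and $P_{Y\langle 1|0 \rangle}$ are respectively equal to the distributions $P_{Y_0^*|T=1}$ and $P_{Y_1^*|T=0}$ appearing in (10) and (11), provided that Assumption 1 holds (Chernozhukov et al., 2013, Lemma 2.1); we provide a proof for completeness in Appendix B.2.
Lemma (Causal interpretation). Suppose that Assumption 1 is satisfied. Then we have $P_{Y\langle 0|1 \rangle} = P_{Y_0^*|T=1}$ and $P_{Y\langle 1|0 \rangle} = P_{Y_1^*|T=0}$.
Lemma 3.2 shows that the distributions $P_{Y_0^*|T=1}$ and $P_{Y_1^*|T=0}$, which play the key role in analyzing distributional causal effects (10) and (11), can be obtained by estimating the corresponding counterfactual distributions $P_{Y\langle 0|1 \rangle}$ and $P_{Y\langle 1|0 \rangle}$ defined in terms of the observed random variables $(X_0, Y_0)$ and $(X_1, Y_1)$. The key in this regard is the conditional exogeneity in Assumption 1.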
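The role of conditional exogeneity in the causal interpretation can be seen from a short chain of identities. The following is only a sketch of the argument, not the proof given in Appendix B.2; here $P_{Y\langle 0|1 \rangle}$ denotes the counterfactual distribution in (12), $Y = \mathbb{1}(T=0) Y_0^* + \mathbb{1}(T=1) Y_1^*$ is the observed outcome, and $P_{A|X,T}(\cdot \mid x, t)$ denotes the conditional distribution of a variable $A$ given $X = x$ and $T = t$ (notation introduced for this sketch).

```latex
\begin{align*}
P_{Y\langle 0|1 \rangle}(y)
  &= \int P_{Y_0|X_0}(y \mid x) \, dP_{X_1}(x)
     && \text{definition (12)} \\
  &= \int P_{Y|X,T}(y \mid x, 0) \, dP_{X|T=1}(x)
     && \text{definitions of } Y_0,\ X_0,\ X_1 \\
  &= \int P_{Y_0^*|X,T}(y \mid x, 0) \, dP_{X|T=1}(x)
     && Y = Y_0^* \text{ on the event } T = 0 \\
  &= \int P_{Y_0^*|X,T}(y \mid x, 1) \, dP_{X|T=1}(x)
     && \text{conditional exogeneity} \\
  &= P_{Y_0^*|T=1}(y)
     && \text{law of total probability.}
\end{align*}
```

The support condition is what makes the first integral meaningful: the conditional distribution $P_{Y_0|X_0}(\cdot \mid x)$ has to be defined at (almost) every $x$ that $P_{X_1}$ can produce.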
3.3 Kernel Mean Embeddings for Counterfactual Distributions
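We now define counterfactual mean embeddings. Let $\ell$ be a positive definite kernel on $\mathcal{Y}$, and assume that the support condition in Assumption 1 is satisfied. We then refer to the kernel mean embeddings of the counterfactual distributions (12) and (13)
\[ \mu_{Y\langle 0|1 \rangle} := \int \ell(\cdot, y) \, dP_{Y\langle 0|1 \rangle}(y), \tag{14} \]
\[ \mu_{Y\langle 1|0 \rangle} := \int \ell(\cdot, y) \, dP_{Y\langle 1|0 \rangle}(y), \tag{15} \]
as counterfactual mean embeddings (CME). Lemma 3.2 implies that, under Assumption 1, these CMEs are respectively identical to the kernel mean embeddings of $P_{Y_0^*|T=1}$ and $P_{Y_1^*|T=0}$ defined as
\[ \mu_{Y_0^*|T=1} := \int \ell(\cdot, y) \, dP_{Y_0^*|T=1}(y), \qquad \mu_{Y_1^*|T=0} := \int \ell(\cdot, y) \, dP_{Y_1^*|T=0}(y). \]
Therefore, by defining an empirical estimator of the CME (14), one can hope to estimate the treatment effect as given in (10), which will be done below.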
Estimating Counterfactual Mean Embeddings
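We introduce our estimator of the CME $\mu_{Y\langle 0|1 \rangle}$ defined in (14). In practice, it is not possible to obtain a sample from $P_{Y\langle 0|1 \rangle}$, and therefore the counterfactual mean embedding cannot be estimated directly. We therefore propose an estimator that instead uses samples from $P_{X_0 Y_0}$ and $P_{X_1}$ to estimate $\mu_{Y\langle 0|1 \rangle}$. To this end, first note that $\mu_{Y\langle 0|1 \rangle}$ in (14) can be written in terms of the conditional mean embedding (4) of $P_{Y_0|X_0}$:
\[ \mu_{Y\langle 0|1 \rangle} = \int \! \int \ell(\cdot, y) \, dP_{Y_0|X_0}(y \mid x) \, dP_{X_1}(x) = \int \mu_{Y_0|X_0=x} \, dP_{X_1}(x), \]
where $\mu_{Y_0|X_0=x} := \int \ell(\cdot, y) \, dP_{Y_0|X_0}(y \mid x)$. This formulation suggests that $\mu_{Y\langle 0|1 \rangle}$ can be estimated by i) constructing an estimator of the conditional mean embedding $\mu_{Y_0|X_0=x}$ and then ii) taking its average over $P_{X_1}$. This is how our estimator is derived below.
Suppose that we are given independent samples $(x_1, y_1), \dots, (x_n, y_n)$ from $P_{X_0 Y_0}$ and $x'_1, \dots, x'_m$ from $P_{X_1}$. For $x \in \mathcal{X}$, let $\hat{\mu}_{Y_0|X_0=x}$ denote the estimate (5) of the conditional mean embedding $\mu_{Y_0|X_0=x}$ based on $(x_i, y_i)_{i=1}^n$. Then an empirical estimator of $\mu_{Y\langle 0|1 \rangle}$ is defined and expressed as
\[ \hat{\mu}_{Y\langle 0|1 \rangle} := \frac{1}{m} \sum_{j=1}^m \hat{\mu}_{Y_0|X_0=x'_j} = \sum_{i=1}^n \beta_i \, \ell(\cdot, y_i), \qquad (\beta_1, \dots, \beta_n)^\top := (K + n \varepsilon I)^{-1} \tilde{K} \mathbf{1}_m, \tag{16} \]
where $\varepsilon > 0$ is a regularization constant, $\mathbf{1}_m := (1/m, \dots, 1/m)^\top \in \mathbb{R}^m$, $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$, and $\tilde{K} \in \mathbb{R}^{n \times m}$ with $\tilde{K}_{ij} = k(x_i, x'_j)$.
The proposed estimator (16) is nonparametric, and can be implemented without knowledge about parametric forms of the conditional $P_{Y_0|X_0}$ and marginal $P_{X_1}$. Thus, the estimator is useful when such knowledge is not available. In Section 4, we theoretically analyze the asymptotic behavior of the estimator, proving its consistency and deriving convergence rates. In doing so, we elucidate conditions required for the consistency of the proposed estimator.
The computational complexity of our estimator (16) is $O(n^3)$ because of the matrix inversion, which may be expensive when the sample size is huge. To reduce the complexity, one can adopt existing approximation methods such as the Nyström method and random Fourier features (Williams and Seeger, 2001; Rahimi and Recht, 2008).
We note that the form of the estimator is identical to the kernel sum rule (Song et al., 2013, Section 4.1), a mean embedding approach to computing forward probabilities in Bayesian inference. The way we use the estimator is different from this previous approach, however. We use our estimator to estimate the counterfactual distribution $P_{Y\langle 0|1 \rangle}$ and distributional causal effects (10), and this requires Assumption 1 to hold for data (or for the population random variables), as shown in Lemma 3.2.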
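The two-step recipe above (conditional mean embedding on the control sample, then averaging over treated covariates) reduces to one linear solve. The sketch below is a minimal illustration under stated assumptions: a Gaussian kernel on covariates, a fixed regularization constant `eps`, synthetic data, and hypothetical helper names. It evaluates the estimated embedding only through the weighted sum of the control outcomes, which approximates the mean of the counterfactual distribution.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def counterfactual_weights(X0, X1, eps=1e-3, sigma=1.0):
    """Weights beta = (K + n*eps*I)^{-1} K_tilde 1_m, with 1_m = (1/m, ..., 1/m)."""
    n, m = X0.shape[0], X1.shape[0]
    K = gaussian_kernel(X0, X0, sigma)           # (n, n), control covariates
    K_tilde = gaussian_kernel(X0, X1, sigma)     # (n, m), control vs. treated
    ones_m = np.full(m, 1.0 / m)
    return np.linalg.solve(K + n * eps * np.eye(n), K_tilde @ ones_m)

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(400, 1))         # covariates of control subjects
Y0 = 2.0 * X0[:, 0] + 0.1 * rng.normal(size=400) # their observed outcomes
X1 = rng.normal(0.5, 0.5, size=(300, 1))         # covariates of treated subjects

beta = counterfactual_weights(X0, X1)

# Mean of the counterfactual distribution: the control outcome model
# averaged over treated covariates, here E[2 * X1] = 1.
counterfactual_mean = float(beta @ Y0)
control_mean = float(Y0.mean())                  # close to 0 in this setup
```

The gap between `counterfactual_mean` and `control_mean` in this toy example comes purely from the covariate shift between the two groups, since the same outcome model is used in both.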
3.4 Kernel Treatment Effects
We quantify distributional treatment effects by using the RKHS distance between the mean embeddings of potential outcome distributions under consideration. We call this approach Kernel Treatment Effects (KTE). We show below how KTEs can be defined for the different distributional treatment effects discussed in Section 3.1.
KTE for Distributional Average Treatment Effects
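As before, let $\ell$ be a kernel on the output space $\mathcal{Y}$ and $\mathcal{F}$ be its RKHS. For the distributional average treatment effect (8) discussed in Section 3.1.1, the corresponding KTE is defined as
$$\mathrm{KTE}(Y_0^*, Y_1^*, \mathcal{F}) := \left\| \mu_{Y_0^*} - \mu_{Y_1^*} \right\|_{\mathcal{F}}, \tag{17}$$
where $\mu_{Y_0^*}$ and $\mu_{Y_1^*}$ are the kernel mean embeddings of the distributions of potential outcomes $Y_0^*$ and $Y_1^*$, respectively:
$$\mu_{Y_0^*} := \int \ell(\cdot, y) \, dP_{Y_0^*}(y), \qquad \mu_{Y_1^*} := \int \ell(\cdot, y) \, dP_{Y_1^*}(y). \tag{18}$$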
The KTE (17) may be regarded as a generalization of the ATE (9) in the sense that, if $\ell$ is the linear kernel on $\mathcal{Y} \subseteq \mathbb{R}$, then the KTE only distinguishes the means of the two outcome distributions. By using a different kernel $\ell$, the KTE may capture the differences between higher-order statistics of the outcome distributions $P_{Y_0^*}$ and $P_{Y_1^*}$. For instance, if $\ell$ is a polynomial kernel of degree $p$ with $p \in \mathbb{N}$, then the KTE (17) is equal to $0$ if and only if $P_{Y_0^*}$ and $P_{Y_1^*}$ have the same moments up to degree $p$ (see, e.g., Muandet et al. 2017, Chapter 3).
If $\ell$ is a characteristic kernel, such as Gaussian and Matérn kernels, then the KTE (17) is equal to $0$ if and only if the two distributions $P_{Y_0^*}$ and $P_{Y_1^*}$ are the same. In this case, the KTE takes a positive value if and only if there is a difference between $P_{Y_0^*}$ and $P_{Y_1^*}$. This means that the KTE detects any difference between the potential outcome distributions, thereby quantifying the average distributional treatment effect.
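To make the role of the kernel concrete, the sketch below compares the RKHS distance between two empirical mean embeddings under a linear kernel and under a Gaussian kernel. It assumes a randomized setting with uniformly weighted samples, and the two outcome samples (equal means, different variances) are synthetic; the helper names `rkhs_distance`, `linear`, and `gaussian` are hypothetical.

```python
import numpy as np

def rkhs_distance(y_a, y_b, kernel):
    """Plug-in RKHS distance between two uniformly weighted mean embeddings."""
    k_aa = kernel(y_a, y_a).mean()
    k_bb = kernel(y_b, y_b).mean()
    k_ab = kernel(y_a, y_b).mean()
    return float(np.sqrt(max(k_aa + k_bb - 2.0 * k_ab, 0.0)))

def linear(a, b):
    """Linear kernel l(y, y') = y * y' on scalar outcomes."""
    return np.outer(a, b)

def gaussian(a, b, bandwidth=1.0):
    """Gaussian kernel on scalar outcomes (a characteristic kernel)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
y_control = rng.normal(0.0, 1.0, size=2000)  # same mean ...
y_treated = rng.normal(0.0, 3.0, size=2000)  # ... but a larger variance

# The linear kernel only sees the difference in sample means.
kte_linear = rkhs_distance(y_control, y_treated, linear)
# The Gaussian kernel also reacts to the change in spread.
kte_gauss = rkhs_distance(y_control, y_treated, gaussian)
```

In this example `kte_linear` coincides with the absolute difference of the two sample means, whereas `kte_gauss` stays clearly away from zero because the distributions differ in variance.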
The question is how to estimate the KTE (17) from data. As in (7), let $(x_1, t_1, y_1), \dots, (x_n, t_n, y_n)$ be observational data, which are i.i.d. with the random variables $(X, T, Y)$. Recall that $Y$ is the observed outcome and thus given by $Y = (1 - T) Y_0^* + T Y_1^*$, where $Y_0^*$ and $Y_1^*$ are the potential outcomes. In observational studies, it is common to use the propensity score $e(x) := P(T = 1 \mid X = x)$, the conditional probability of the treatment being made given that the covariates are $x$, to define an unbiased estimator of the average treatment effect (Rosenbaum and Rubin, 1983). We show here that the same strategy of inverse propensity weighting (Imbens, 2004, Section IIIC) can be straightforwardly used to define unbiased estimators $\hat{\mu}_{Y_0^*}$ and $\hat{\mu}_{Y_1^*}$ of the mean embeddings $\mu_{Y_0^*}$ and $\mu_{Y_1^*}$ of potential outcome distributions $P_{Y_0^*}$ and $P_{Y_1^*}$, respectively, thus providing a way of estimating the KTE. That is, assuming that the propensity $e(x)$ is available, we define
(19) 
where and are the populations of treated and control groups, respectively.
In the special case of a randomized experiment where $X$ and $T$ are independent and thus the propensity is constant, $e(x) = P(T = 1)$ for all $x$, the above estimators reduce to the standard empirical estimators of mean embeddings, with uniform weights over the control and treated samples. Note that these uniformly weighted empirical estimators are biased if the experiment is not randomized, i.e., in observational studies. This is because, for instance, the sample contributing to $\hat{\mu}_{Y_1^*}$ follows the conditional distribution $P_{Y_1^* \mid T = 1}$, which is different from the unconditional $P_{Y_1^*}$. Thus, we need the inverse propensity weighting to obtain unbiased estimators in the case of observational studies.
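The effect of the weighting can be illustrated with a small simulation. The sketch below assumes a Horvitz-Thompson form of inverse propensity weighting (weights $t_i / (n\, e(x_i))$ for the treated and $(1 - t_i) / (n\,(1 - e(x_i)))$ for the controls), a known propensity bounded in $[0.2, 0.8]$, and synthetic data with no treatment effect; these choices and the helper names are assumptions made for the example, not a transcription of (19).

```python
import numpy as np

def gaussian(a, b, bandwidth=1.0):
    """Gaussian kernel on scalar outcomes."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

def rkhs_distance(y, w_a, w_b, kernel):
    """|| sum_i (w_a[i] - w_b[i]) * l(., y_i) || computed as sqrt(d^T L d)."""
    d = w_a - w_b
    return float(np.sqrt(max(d @ kernel(y, y) @ d, 0.0)))

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)                            # covariates
e = 0.2 + 0.6 / (1.0 + np.exp(-2.0 * x))          # assumed known propensity
t = (rng.uniform(size=n) < e).astype(float)       # confounded treatment
y = x + 0.5 * rng.normal(size=n)                  # Y0* = Y1*: no treatment effect

# Inverse propensity weights (assumed Horvitz-Thompson normalization by n).
w1_ipw = t / (n * e)
w0_ipw = (1.0 - t) / (n * (1.0 - e))
# Uniform weights within each group, ignoring the propensity.
w1_naive = t / t.sum()
w0_naive = (1.0 - t) / (1.0 - t).sum()

kte_ipw = rkhs_distance(y, w1_ipw, w0_ipw, gaussian)
kte_naive = rkhs_distance(y, w1_naive, w0_naive, gaussian)
ate_ipw = float((w1_ipw - w0_ipw) @ y)      # linear-kernel counterpart
ate_naive = float((w1_naive - w0_naive) @ y)
```

Because treated units tend to have larger covariates here, the uniformly weighted quantities report a spurious effect, while the propensity-weighted ones stay close to zero up to sampling noise.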
The following result shows that the estimators (19) are indeed unbiased estimators of the corresponding mean embeddings and of potential outcome distributions. The proof is presented in Appendix B.3.
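Theorem 3.4.1. Suppose that $0 < e(x) < 1$ for all $x \in \mathcal{X}$ and that the conditional exogeneity in Assumption 1 is satisfied. Let $(x_1, t_1, y_1), \dots, (x_n, t_n, y_n)$ be i.i.d. with $(X, T, Y)$, and let $\hat{\mu}_{Y_0^*}$ and $\hat{\mu}_{Y_1^*}$ be the estimators (19) of the mean embeddings $\mu_{Y_0^*}$ and $\mu_{Y_1^*}$ of the potential outcome distributions $P_{Y_0^*}$ and $P_{Y_1^*}$ in (18). Then we have
$$\mathbb{E}\left[\hat{\mu}_{Y_0^*}\right] = \mu_{Y_0^*}, \qquad \mathbb{E}\left[\hat{\mu}_{Y_1^*}\right] = \mu_{Y_1^*}.$$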
Theorem 3.4.1 shows that the estimators (19) are unbiased, but does not say anything about their convergence rates as the sample size goes to infinity. The following result provides this; it essentially shows that the estimators (19) converge to the mean embeddings $\mu_{Y_0^*}$ and $\mu_{Y_1^*}$ at the same rates as the standard kernel mean estimators, which are minimax optimal (Tolstikhin et al., 2017). The key assumption here is that the propensity is uniformly lower- and upper-bounded away from $0$ and $1$, respectively. The proof is presented in Appendix B.4.