Counterfactual Mean Embedding: A Kernel Method for Nonparametric Causal Inference
Abstract
This paper introduces a novel Hilbert space representation of a counterfactual distribution—called counterfactual mean embedding (CME)—with applications in nonparametric causal inference. Counterfactual prediction has become an ubiquitous tool in machine learning applications, such as online advertisement, recommendation systems, and medical diagnosis, whose performance relies on certain interventions. To infer the outcomes of such interventions, we propose to embed the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel. Under appropriate assumptions, the CME allows us to perform causal inference over the entire landscape of the counterfactual distribution. The CME can be estimated consistently from observational data without requiring any parametric assumption about the underlying distributions. We also derive a rate of convergence which depends on the smoothness of the conditional mean and the RadonNikodym derivative of the underlying marginal distributions. Our framework can deal with not only realvalued outcome, but potentially also more complex and structured outcomes such as images, sequences, and graphs. Lastly, our experimental results on offpolicy evaluation tasks demonstrate the advantages of the proposed estimator.
1 Introduction
To infer causal relation, it is natural to state the problem in terms of counterfactual question, e.g., would the patient has recovered had the medical treatment been different? This school of thought is influenced predominantly by the potential outcome framework Rubin05:PO (). It has been studied extensively in classical statistics and has a wide range of applications in social science, econometrics, and epidemiology. Moreover, important applications of machine learning such as online advertisement and recommendation system can be reformulated under this framework Bottou13:Counterfactual (); Liang:16CausalRecom (); Schnabel16:RecomTreat (). Although a randomized experiment—which is considered a gold standard in causal inference—can in principle be employed for these applications, it can be too expensive, timeconsuming, or unethical to implement in practice. Hence, this work will focus on what are known as observational studies Rosenbaum02:OS (); Rubin05:PO ().
In observational studies, we are interested in inferring causal relation between a treatment and an outcome , which have been recorded along with a covariate . In an online advertisement, for example, the covariate usually encodes information about the users and the associated queries. The treatment may be an ad placement and the outcome is determined by whether or not the user clicks the advertisement. Throughout, we denote the space of treatments by , the space of covariates by , and the space of possible outcomes by . For and , denotes the potential outcome for under the treatment . Likewise, we denote the counterfactual outcome for under the treatment after the treatment is already applied by . That is, is defined after we already observe the value of . We refer to the distribution of as the counterfactual distribution. Then, inferring causal relation between and would have been as straightforward as calculating the difference had we known the value of both and . However, as implied in the preceding statement, it is virtually impossible to observe both of them simultaneously due to the fundamental problem of causal inference. Instead, we must resort to a logged dataset where .
From machine learning perspective, attempts have recently been made to formulate the problem above as a supervised learning Johansson16:CounterfactualRep (); Shalit16:CounterfactualBound (); Swaminathan15:CRM (). For example, one possibility is to fit a predictor directly to the dataset and then use it to estimate Hill11:Nonparam (); Johansson16:CounterfactualRep (). Unfortunately, the learnt predictor is almost always biased because only one potential outcome is observed for a given covariate . Moreover, the treatment assignment mechanism is not always under our control and in general could depend on a hidden confounder. Hence, learning causal relation from logged data differs fundamentally from standard supervised learning. In Johansson16:CounterfactualRep (); Shalit16:CounterfactualBound (), the authors propose to reduce this bias by also learning a joint representation of covariates in treatment and control groups. Other wellknown techniques such as inverse propensity score (IPS) weighting Langford08:ES (); Bottou13:Counterfactual (), doubly robust estimator Dudik11:DoublyRobust (), and deep learning Johansson16:CounterfactualRep (); Hartford17a:DeepIV () have also been applied successfully.
In this work, we propose a novel representation of counterfactual distributions of called counterfactual mean embedding (CME). The CME relies on kernel mean embedding Berlinet04:RKHS (); Smola07Hilbert (); Muandet17:KME () which maps probability distributions into a reproducing kernel Hilbert space (RKHS). Based on this representation, causal inference can be performed over the entire landscape of counterfactual distribution using the kernel arsenal. We show that this representation can be estimated consistently from observational data. Furthermore, we establish a convergence rate of the estimator which depends on the smoothness regarding the conditional distribution and the RadonNikodym derivative of the underlying marginal distributions. Interestingly, we found that our estimator is guaranteed to converge reasonably fast, if either the RadonNikodym derivative or the conditional mean is smooth. This property resembles that of doubly robust estimator Cassel76:DR (); Dudik11:DoublyRobust (). Our framework is nonparametric as it requires no parametric assumption about the underlying distributions. Since the CME depends only on Gram matrices evaluated on the data, it can potentially be applied to more complex and structured outcomes such as images, sequences, and graphs. Lastly, we demonstrate the effectiveness of the proposed estimator on simulated data as well as realworld policy evaluation tasks.
The rest of the paper is organized as follows. Section 2 introduces kernel methods and Hilbert space embedding of distributions, which form the backbone of this work. Section 3 introduces counterfactual learning and then provides a generalization of Hilbert space embedding to counterfactual distributions. In this section, we also present the theoretical results and the application of CME in offpolicy evaluation. Finally, experimental results on both simulated and real data are provided in Section 4.
2 Hilbert space embedding of distributions
Let be a nonempty set and be a reproducing kernel Hilbert space (RKHS) of functions . The RKHS is uniquely determined by a symmetric, positive definite kernel function and possesses two important properties aronszajn50reproducing (): (i) for any , the function is an element of , (ii) the inner product in satisfies the reproducing property, i.e., for all and , . Let denote the set of probability measures on a measurable space . Then, the kernel mean embedding (KME) of can be defined as Berlinet04:RKHS (); Smola07Hilbert (); Muandet17:KME ()
(1) 
It is welldefined if is measurable and bounded, i.e., . By reproducing property of , for any . Given an i.i.d. sample from , the empirical estimate of (1) is given by . The consistency of has been established in (Song08:Thesis, , Theorem 27) and also in Gretton12:KTT (); LopezPaz15:Towards (); TolstikhinSM17:Minimax ().
By virtue of (1), a wellknown discrepancy measure called maximum mean discrepancy (MMD) between two distributions and can be defined as . The MMD has been applied extensively in twosample testing Borgwardt06:MMD (); Gretton12:KTT (). The RKHS (and the associated kernel ) is said to be characteristic if if and only if . In which case, the map (1) is injective which implies that captures all necessary information about . Examples of characteristic kernels include Gaussian kernel and Laplacian kernels Fukumizu04:DRS (); Sriperumbudur10:Metrics ().
The KME can be extended to conditional distributions via the notion of covariance operators in RKHS Song10:KCOND (); Song2013 (). Let be a random variable taking value on , where is another measurable space, and and be RKHSs with measurable kernels on and , respectively. Assume that and . The (uncentered) covariance operator , which encodes the information of the joint distribution on , is defined as for . Alternatively, can be expressed in terms of a tensor product . If , we write the covariance operator as . See Appendix C.1 and, e.g., Fukumizu04:DRS (); Zwald04:KPCA () for furthers details on covariance operators.
The conditional mean embedding of for some is defined as where is an element in and is an operator from to Song10:KCOND (); Song2013 (). By the reproducing property of , for all . The embeddings and can respectively be expressed in terms of the covariance operators as and under certain assumptions Song10:KCOND (); Song2013 (); Muandet17:KME ().^{1}^{1}1This holds under the assumption that for all Song10:KCOND (); Song2013 (); Fukumizu13:KBR (). Note that we nevertheless do not require this condition for our theoretical analysis; see Sec. 3.2 In what follows, we treat and as RKHS representations of and . See, e.g., Song2013 (); Muandet17:KME () and references therein for applications of conditional mean embedding.
3 Counterfactual mean embedding
Throughout, we will assume w.l.o.g. that . Then, the causal effect in potential outcome framework is usually characterized by an individual treatment effect (ITE), given by , for a covariate . Since we can never observe both and at the same time, one often resort to the average treatment effect (ATE) defined by instead. Unlike the ITE, the ATE can be estimated empirically as where and are the posttreatment outcomes. Other summary statistics such as ratio and quantile of the distribution have also been investigated Heckman85:Intervention ().
To make causal claims, we require the following assumptions, which are common in observational studies.
Assumption 1.
{enumerate*}Stable unit treatment value assumption (SUTVA): the outcome of subject is independent of the outcomes of other individuals and their received treatments.
Conditional exogeneity/unconfoundedness/ignorability: . In other words, given the covariates the outcome is independent of the treatment assignment. This assumption implies that the distributions of and agree Rosenbaum83:Propensity (); Heckman85:Intervention (); Imbens2004:Exogeneity ().
The common support assumption: .
Under these assumptions, we can claim that if has no causal effect on . Nevertheless, in many contexts, e.g., applied econometrics, the outcome distribution may change in ways that cannot be revealed by an examination of the averages. For example, wage distributions tend to be skewed to the right in which case mean effects would deem inappropriate Machado05:Quantile (). This motivates us to consider the estimator of the entire outcome distribution.
3.1 Hilbert space representation of counterfactual distribution
We first introduce the notion of counterfactual distribution and then define corresponding embedding in the RKHS. As a working example, let us consider the following question taken from Chernozhukov13:Counterfactual (): “what would women wages have prevailed if they face the men wage schedule, or vice versa?”. This type of questions cannot be addressed by randomized experiments, due to an ethical or practical issue.
Let and be probability distributions defined on and , and assume they respectively represent observed distributions of wages and features for men. Similarly, let and be the corresponding observed distributions for women. Then, the counterfactual distribution of wages that would have prevailed for women had they faced the men’s wage schedule is defined by Chernozhukov13:Counterfactual () as
(2) 
Assumption (A3) ensures that the above integral is well defined. Note that does not arise as a distribution from any observable distribution and hence is not observable in practice. Through (2), we are able to consider counterfactual scenarios, where changes may occur in the covariate distributions, or in the conditional distribution of the outcome given covariates. In this paper, we focus on the effect of changes in the covariate distributions (i.e., the change from to ) to the outcomes. This scenario is also related to domain adaptation problems in machine learning BenDavid10:TLD (); Zhang13:Shift ().
Definition 1 (Counterfactual mean embedding).
Assume that (A3) holds. Then an RKHS embedding of the counterfactual distribution (2) is defined by
(3) 
The counterfactual distribution corresponding to the individual treatment effect (ITE) can be obtained by restricting to the Dirac measure for some . In what follows, we define as the true interventional distribution associated with the treatment. Then, the following lemma assigns a causal interpretation to the CME, which follows from (Chernozhukov13:Counterfactual, , Lemma 2.1) and Definition 1.
Lemma 1 (Causal interpretation).
Suppose that (A2) holds, i.e., almost surely for , and that (A3) holds. Then where is the embedding of .
Lemma 1 equips with an arsenal to perform causal inference, i.e., it can be viewed as a representation of the actual interventional distribution associated with the specified treatment. If we further assume that is characteristic, then captures all necessary information about .
In practice, it is not possible to obtain a sample from , and therefore cannot be estimated directly, which differs from the case of the standard mean embedding . We therefore propose the following estimator (4), which instead uses samples from and to estimate . The estimator is essentially (or superficially) the kernel sum rule, so the proof of the following proposition can be found in Song2013 (); Fukumizu13:KBR (). Note that there is a conceptual difference between our estimator and the standard kernel sum rule: Our estimator is considered as the kernel sum rule equipped with a causal interpretation, provided that the assumptions in Lemma 1 hold.
Proposition 1.
Given samples from and from , let and be empirical covariance operators and let be the empirical kernel mean. Then an empirical estimator of is defined and expressed as
(4) 
where is a regularization constant, , with , with , and . Note that we can write where .
Last but not least, it is generally challenging to perform model selection (in our case, the choice of , and ) in counterfactual prediction. To the best of our knowledge, no systematic method has been used in this setting. To this end, we modify the classical cross validation procedure and we use it for model selection. Details are omitted to conserve space (see Appendix B).
3.2 Theoretical analysis: consistency and convergence rates
In the sequel we will use the following notation. Let be the Hilbert space of squareintegrable functions,^{2}^{2}2More precisely, each element in is a equivalent class of functions; see Appendix C.1. and be the product measure of and on the product space . For convergence analysis, we make the following assumption.
Assumption 2.
{enumerate*}[label=()]
and are measurable spaces, and are measurable kernels on and , and and are probability measures on .
and satisfy .
is absolutely continuous w.r.t. with the RadonNikodym derivative satisfying .
A function defined by , where is an independent copy of , satisfies .
Let , and samples and in Prop. 4 satisfy , , and as .
In Assumption 2, the condition 2 is a minimum assumption, and the condition 2 is satisfied for instance if is bounded. The condition 2 requires the support of be included in that of , and thus enforces the common support assumption (A3) in Assumption 1. In 2 the function encodes the information of the conditional distribution of given , and the assumption is satisfied for instance if the kernel is bounded. The condition 2 requires that samples are consistent, and is satisfied when are i.i.d. with , and are i.i.d. with , for instance. However, the condition 2 does not require that samples be independent, and can be satisfied even when and are given as timeseries data, for example, if they satisfy an appropriate stationarity condition. We assume for simplicity of presentation.
Theorem 1 below established the consistency of the CME estimator (4), where we also require that the RKHS be dense in , which is satisfied by commonly used kernels such as Gaussian and Matérn kernels^{3}^{3}3For Matérn kernels, this holds since the resulting RKHSs are normequivalent to Sobolev spaces, which contain all functions in the RKHSs of Gaussian kernels. on . The proof can be found in Appendix C.2.
Theorem 1 (Consistency).
Assume that is dense in , and that Assumption 2 is satisfied. Then if decays to zero sufficiently slowly as , we have in probability as .
Theorem 1 does not require any parametric assumption on and . Besides, it can be considered as a version of (Fukumizu13:KBR, , Theorem 8) that proves the consistency of the kernel sum rule. Unlike ours, however, Fukumizu13:KBR () assumes that the function belongs to the tensorproduct RKHS ; this is a strong condition that may not be satisfied in practice. On the other hand, for we only require that , which is satisfied if is bounded, as mentioned.
To derive convergence rates, we need to define the following concepts, whose details can be found in Appendix C.1. Define an integral operator by . Under the condition 2 in Assumption 2, can be written as for any with convergence in , where and is an orthonormal system in . Then for a constant , the th power of is defined as for . Define further an integral operator by for , and let be the th power of for . Denote by the range of an operator . Theorem 2 below establishes convergence rates of our estimator; the proof can be found in Appendix C.3.
Theorem 2 (Convergence rates).
In the condition , the constant can be considered as quantifying the smoothness of the function : As increases, gets smoother.^{4}^{4}4In fact, it is known that is normequivalent to a certain interpolation space between and for (SteSco12, , Thm. 4.6); As tends to (resp. 1/2), tends to (resp. ); the situation is that is smoother than least smooth functions in . A similar argument also applies to the interpretation of . Since is the RadonNikodym derivative of w.r.t. , being large (or being smooth) implies that the two distributions and are similar, and vice versa. Similarly, the constant quantifies the smoothness of the function .
Based on the above interpretation of the assumptions, let us interpret the rate of Theorem 2. Assume that is very close to , meaning that may be nonsmooth. Even in this case, if is large, that is if is smooth, we can still guarantee a certain rate of convergence. A similar argument holds for the case when is close to : If is large, then our estimator converges at a reasonable rate. In other words, Theorem 2 states that our estimator is guaranteed to converge reasonably fast, if either the RadonNikodym derivative or the function is smooth. This property resembles that of doubly robust estimators Cassel76:DR (); Dudik11:DoublyRobust (). Therefore, even in the situation where the change from to is large, we may still expect a good performance with our estimator, given the relationship between and is smooth. This will also be experimentally demonstrated.
3.3 Application: Offpolicy evaluation
We demonstrate the effectiveness of our estimator in offpolicy evaluation task. Let be a space of user (or context) features, be a space of treatments, and be a stochastic policy which selects a treatment given a user . Given a target policy , offpolicy evaluation generally aims to provide an unbiased estimate of its performance. It is assumed that we have access to logged data obtained from an initial policy where represent the user features, are the selected treatments (e.g., recommendations) given , and are the rewards, where is the conditional distribution of rewards given and . The policy specifies how the set of recommendations, known as slates, are constructed given the user or context information. Finally, the reward can simply be the number of clicks on the recommendation.
To adopt our framework, we assume that once the user feature and the treatment are specified, the reward distribution will be unchanged, i.e., , regardless of which policy produced them. (Here is the reward distribution when the policy is .) Intuitively, we expect only the recommendations, but not the user behavior/response, to change as a result of a policy change. The covariate distribution may differ depending on the situation. Based on this assumption, the reward distribution under the target policy can be obtained as . Since we have access to a sample from and a sample from , the embedding can be estimated directly using (4). That is, with the notation of Proposition 1, we let , , , , , , and so on. By virtue of Lemma 1, we can interpret as the embedding of the actual counterfactual distribution. The resulting algorithm, which is very simple, is summarized in Algorithm 1.
4 Experiments
We compare our estimator to the following benchmark estimators in the offpolicy evaluation task, using both simulated and realworld data. Below denotes logged data.
Direct Method (DM).
The Direct method fits a regression model for rewards based on . The estimate is given by where denotes the recommendation probabilities under the target policy. Note that is typically biased toward the distribution of . We used a 3layer feedforward neural network as , for which the input feature vector is given by concatenating the vector of user and all items in the recommendation .
Weighted Inverse Propensity Score (wIPS).
The wIPS estimator obtains an unbiased estimate of the target reward by reweighting each observation in the logged dataset by the ratio of propensity scores under the target and null policies Horvitz52:Sampling (). The wIPS estimator is defined by , where are the propensity weights.
Doubly Robust (DR).
The DR estimator combines the two aforementioned estimators by exploiting both the regression model and the propensity scores Cassel76:DR (); Dudik11:DoublyRobust (). The estimator is given by .
Slate Estimator.
The Slate estimator assumes that the reward value is linear w.r.t. a given recommendation Swaminathan17:Slate (). It is defined as , where ( and are the numbers of slots for recommendation and available items, respectively) is the indicator vector whose th element is if the recommendation contains the item in the slot , is the MoorePenrose pseudoinverse of , and . Because of the linearity assumption, the slate estimator has lower variance than IPS estimator; however, the estimator would suffer from bias if the assumption does not hold true.
For the CME, we used a kernel defined as where and . To be able to compare to other estimators, we used . The regularization parameter was selected by the cross validation procedure in Appendix B, while we determined and by the median heuristic, i.e., and .
4.1 Simulated data
As explained in §3.3, when a user visits a website, the system provides a recommendation as an ordered list of items out of available items to that user. Each item is represented by a feature vector for . Hence, a recommendation is an order list , where is the dimensionality. Likewise, each user has a preference (or feature) vector for where denotes the total number of users. The reward is 1 if the user clicks any of the recommended items and 0 otherwise. Specifically, for each pair, let be the probability of a click, where is the mean vector of feature vectors for , and is a Gaussian white noise. The reward of the recommendation is defined as .
For each user a policy generates the list of recommended items by sampling without replacement with a multinomial distribution: The probability of item being selected is , where is the user preference vector of user . For the target policy , we set for where with . For the null policy , we set where .
After generating the item feature vectors , the datasets and were generated from and , respectively. (Note that we only use for evaluation.) We set , , , , and . We performed 5fold CV over parameter grids, i.e., the number of hidden units for the Direct and DR estimators, and the regularization parameter for our CME. We repeated experiments 30 times to obtain the mean squared error (MSE) for each estimator.
Figure 1 depicts the experimental results (note that vertical axis is in log scale). In brief, we found that {enumerate*}[label=()]
the performance of all estimators degrade as the difference between and increases (i.e., as tends to ), but the CME is least susceptible to this difference,
the Slate estimator does not perform well in this setting because its assumptions do not hold,
all estimators deteriorate as the context dimension increases, but the effect appears to be more pronounced for the Direct, DR, and CME estimators than for the IPS and Slate estimators as they do not rely directly on the context variables,
the opposite effect is observed if we increase the number of available items , as illustrated in Figure 1(c), and
the CME estimator achieves better performance than other estimators in most experiments. Supplementary results of this experiment can also be found in Appendix A.
4.2 Real data
For the realworld dataset, we use the data from the Microsoft Learning to Rank Challenge dataset (MSLRWEB30K) Qin13:MSLR () and treat them as an offpolicy evaluation problem; the setup is similar to Swaminathan17:Slate (). The data contains set of queries and corresponding URLs. Each query and URL pair is represented by a vector along with the relevant judgment . For our reward function, we used the expected reciprocal rank (ERR) Chapelle09:ERR (), which is defined by , where with . For the null and target policies and , the vector is split into URL feature and body feature , which are then used to train two regression models to fit : For the Lasso is used with and denoted by , and for the regression tree is used with and denoted by .
The logged data is then generated as follows. We first sample query uniformly from the dataset, and obtain top candidate URLs based on the relevant scores predicted by . The null policy then recommends URLs out of these candidates, according to the PlackettLuce model parameterized by , where is the rank of the relevant score predicted by the model and is an exploration rate. For , the target policy employs the same setting as the null policy , except that the predicted relevant scores are obtained from the model. In this experiment, we set for the null policy, and consider both deterministic and stochastic target policies. For the stochastic policy, we set , while the deterministic policy selects the topK URLs directly from the predicted relevant scores.
In this experiment, we used for the Direct method the regression tree, instead of a neural network. In addition, we included the OnPolicy method as a baseline, which estimates rewards directly from the target policies (and thus, this baseline should always perform the best). To accelerate the computation of the CME, we used the Nyström approximation method Williams01:Nystrom ().
Figure 2 depicts the results. In short, our CME dominates other estimators in most of experiment conditions. (Note also here that vertical axis is in log scale, so the margins are significantly large.) The wIPS clearly suffers from high variance, especially in the deterministic target policy, and DR also suffers from the same issue, except in the top left condition. This would be because, in the deterministic policy setup, the propensity score adjustment requires an exact match between logged and target treatments, but this almost never happens when the treatment space is large. The Slate, Direct and CME are relatively robust across different conditions. The Direct method and CME perform particularly well when sample size is small, regardless of the treatment space, while the Slate estimator requires larger samples, especially in the large treatment space.
5 Discussion
Our estimator of counterfactual distribution exhibits appealing theoretical properties, and also serves as a practical tool for causal inference. Ultimately, we hope that our work can be useful not only for researchers in disciplines such as social science, epidemiology, and econometrics that rely on the potential outcome framework, but also for the driving machine learning community to solve this challenging problem, because several open questions still remain, e.g., the use of highorder moments of counterfactual distribution, and how to handle a hidden confounder and an instrumental variable.
Acknowledgments
We thank Ricardo Silva, Joris Mooij, Adith Swaminathan, Evan Robin, David LopezPaz, Wittawat Jitkrittum, Kenji Fukumizu, and Bernhard Schölkopf for fruitful discussions. Krikamol Muandet would like to acknowledge fundings from the Thailand Research Fund Grant No. MRG6080206 and additional funding from the Faculty of Science, Mahidol University. Motonobu Kanagawa has been partially supported by the European Research Council (StG project PANAMA).
References
 [1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
 [2] C. R. Baker. Joint measures and crosscovariance operators. Transactions of the American Mathematical Society, 186:pp. 273–289, 1973.
 [3] S. BenDavid, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1–2):151–175, 2010.
 [4] A. Berlinet and C. ThomasAgnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
 [5] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
 [6] L. Bottou, J. Peters, J. Q. nonero Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
 [7] C. M. Cassel, C. E. Särndal, and J. H. Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.
 [8] O. Chapelle, D. Metlzer, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 621–630, New York, NY, USA, 2009. ACM.
 [9] V. Chernozhukov, I. FernándezVal, and B. Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.
 [10] M. Dudík, J. Langford, and L. Li. Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.
 [11] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
 [12] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes’ rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.
 [13] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel twosample test. Journal of Machine Learning Research, 13:723–773, 2012.
 [14] J. Hartford, G. Lewis, K. LeytonBrown, and M. Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1414–1423. PMLR, 2017.
 [15] J. J. Heckman and R. Robb. Alternative methods for evaluating the impact of interventions. Journal of Econometrics, 30(1):239–267, 1985.
 [16] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
 [17] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
 [18] G. W. Imbens. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics, 86(1):4–29, 2004.
 [19] F. D. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16, pages 3020–3029. JMLR.org, 2016.
 [20] J. Langford, A. Strehl, and J. Wortman. Exploration scavenging. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 528–535, New York, NY, USA, 2008. ACM.
 [21] D. Liang, L. Charlin, and D. M. Blei. Causal inference for recommendation. 2016.
 [22] D. LopezPaz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of causeeffect inference. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 1452–1461. JMLR, 2015.
 [23] J. A. F. Machado and J. Mata. Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445–465, 2005.
 [24] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(12):1–141, 2017.
 [25] T. Qin and T. Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
 [26] P. R. Rosenbaum. Observational studies. Springer Series in Statistics. SpringerVerlag, New York, 2nd edition, 2002.
 [27] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 [28] D. Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322–331, 2005.
 [29] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1670–1679, New York, New York, USA, 2016. PMLR.
 [30] U. Shalit, F. Johansson, and D. Sontag. Bounding and minimizing counterfactual error. arXiv:1606.03976 Preprint, 2016.
 [31] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT), pages 13–31. SpringerVerlag, 2007.
 [32] L. Song. Learning via Hilbert Space Embedding of Distributions. PhD thesis, The University of Sydney, 2008.
 [33] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
 [34] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning (ICML), June 2009.
 [35] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 99:1517–1561, 2010.
 [36] I. Steinwart and C. Scovel. Mercer’s theorem on general domains: on the interaction between measures, kernels, and RKHS. Constructive Approximation, 35(363417), 2012.
 [37] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 814–823. JMLR.org, 2015.
 [38] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudik, J. Langford, D. Jose, and I. Zitouni. Offpolicy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, pages 3632–3642. Curran Associates, Inc., 2017.
 [39] I. Tolstikhin, B. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18:86:1–86:47, 2017.
 [40] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
 [41] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 819–827, Atlanta, Georgia, USA, 2013. PMLR.
 [42] L. Zwald, O. Bousquet, and G. Blanchard. Statistical properties of kernel principal component analysis. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), volume 3120 of Lecture Notes in Computer Science, pages 594–608. Springer, 2004.
Appendix
Appendix A Experimental results
In this section, we provide additional results from extensive experimental studies presented in §4.
a.1 Simulated data
In this section, we investigate the behavior of different estimators as we vary the number of users , the number of recommended items , and the number of observations . Figure 3 depicts the results.
Appendix B Cross validation procedure for counterfactual prediction
One of the key challenges in counterfactual prediction is to perform model selection. Unlike standard cross validation, performing cross validation directly on the logged data results in the wrong choice of parameters. That is, the estimate of the performance measure will always be biased. This is obviously due to the fundamental problem of causal inference. To this end, given a dataset and a parameter grid , we propose the following general procedure for parameter selection.

Split into folds: for and .

For each parameter :

For each fold :

Calculate using propensity scores or covariate matching.

Reweight the validation reward (bias correction).

Use the remaining logged data and validation data to compute the estimated reward and corresponding error .


Calculate the mean CV error
