Counterfactual Cross-Validation:
Effective Causal Model Selection from Observational Data
Abstract
What is the most effective way to select the best causal model among potential candidates? In this paper, we propose a method to effectively select the best individual-level treatment effect (ITE) predictors from a set of candidates using only an observational validation set. In model selection or hyperparameter tuning, we are interested in choosing the best model or the best hyperparameter values from potential candidates. Thus, we focus on accurately preserving the rank order of the ITE prediction performance of candidate causal models. The proposed evaluation metric is theoretically proved to preserve the true ranking of the model performance in expectation and to minimize the upper bound of the finite sample uncertainty in model selection. Consistent with the theoretical result, empirical experiments demonstrate that our proposed method is more likely to select the best model and the best set of hyperparameters in both model selection and hyperparameter tuning.
1 Introduction
Predicting the Individual-level Treatment Effects (ITE) of certain actions is essential for optimizing metrics of interest in various domains. In digital marketing, for instance, incrementality is becoming increasingly important as a performance metric [7]. For this purpose, the users who are shown ads for a given product should be chosen based on the ITE, to avoid showing ads to users who would buy the product anyway. Healthcare is another important application of ITE prediction [3]. For precision medicine, we need to know which treatments will be more beneficial or harmful for a given patient. The fundamental problem of causal inference is that we never observe both the treated and untreated outcomes of the same unit at the same time, and it is the central problem in ITE prediction [14]. Consequently, we cannot observe a causal effect itself, nor use causal effects as labels in a prediction model. Most previous papers on ITE focus on prediction methods using observational data [30, 29, 18, 26, 8, 2]. In model evaluation and selection, the fundamental problem of causal inference poses an additional critical challenge. Because labels are not observed directly, we are unable to calculate loss metrics such as the Mean Squared Error (MSE). Therefore, data-driven validation procedures such as cross-validation are not directly applicable to model selection and hyperparameter tuning. In other words, we do not know which model and which hyperparameter values should be used when applying ITE prediction to real-world problems.
There are only a few studies that tackle this problem. [12] proposed using the inverse probability weighting (IPW) outcome as the pseudo label for the true ITE in the calculation of a loss metric such as MSE. [24] used the loss function of R-learner [20], which outperforms T- and S-learner in both theoretical and empirical results [20], as the loss metric. [4] used influence functions to obtain a more efficient estimator for the loss. These studies are mainly interested in estimating the loss more accurately and efficiently.
In this paper, we focus on the problem of choosing the best ITE predictors among a set of candidates using only an observational validation set. Typically, we are interested in choosing the best model or hyperparameters from potential candidates in model selection and hyperparameter tuning. In such a situation, we only need to know the rank order of the loss values of those candidates. Thus, we construct a plug-in oracle that accurately ranks the performance using a given validation dataset. We then propose Counterfactual Cross-Validation, which uses this plug-in oracle. Our proposed metric is theoretically proven to preserve the ranking of candidate predictors and to minimize the upper bound of the finite sample uncertainty in model selection. To show the practical significance of our evaluation metric, we conducted two experiments. In those experiments, our proposed method found the best-performing model among the candidates more accurately than the other benchmark methods.
The rest of our paper is organized as follows. In Section 2, we review related works. Section 3 describes the problem setting and notation. Section 4 presents our approach for model evaluation in ITE prediction. We explain the procedure of our experiments and show the results in Section 5. We conclude in Section 6 with a summary of our contributions and future research directions.
2 Related Work
ITE prediction has been extensively studied by combining causal inference and machine learning techniques, aiming for the best possible personalization of interventions. State-of-the-art approaches are constructed by utilizing adversarial generative models, Gaussian processes, and latent variable models [30, 2, 18, 3]. Among the diverse methods that predict ITE from observational data, the approach most related to this work is the one based on representation learning [6]. All the methods based on representation learning attempt to map the original feature vectors into a latent representation space in which selection bias is eliminated. Balancing Neural Network [17] is the most basic method and uses discrepancy distance [19], a domain discrepancy measure in unsupervised domain adaptation, as the regularization term. CounterFactual Regression [26] minimizes the upper bound of the true loss by utilizing an Integral Probability Metric (IPM) [27]. In addition to these, methods that obtain a latent representation by preserving a local similarity [29] or by applying adversarial learning [8] have been proposed.
The prediction methods stated above have provided promising results on standard benchmark datasets; however, in previous studies, the evaluation of such ITE predictors has been conducted by using synthetic datasets or simple heuristic metrics such as policy risk [30, 26, 29]. These evaluations do not guarantee which models would actually be best on a given real-world dataset [4, 25]. Therefore, to bridge the gap between causal inference and applications, developing a reliable evaluation metric is critical.
There are only a few studies directly tackling the evaluation problem of ITE prediction models. [24] conducted an extensive survey of several heuristic metrics and provided experimental comparisons. In particular, they introduced inverse probability weighting (IPW) validation, which utilizes an unbiased estimator for the true ITE as the oracle, and $\tau$-risk, which is based on the loss function of R-learner as proposed in [20]. In addition, they showed that these metrics empirically outperformed another naive metric, $\mu$-risk, where predictive risk is estimated separately for treated and control outcomes using factual samples only. On the other hand, [4] improved those heuristic plug-in metrics by introducing a meta-estimation technique using influence functions (IF) in a theoretically principled way. Our proposed metric can be further improved by the IF-based estimation method.
All the existing metrics so far aim to estimate the true metric of interest (e.g., MSE for the true ITE) accurately. However, accurate model selection and hyperparameter tuning require an accurate ranking of model performance, and the above metrics do not guarantee that this ranking is preserved. Therefore, in contrast to previous works, we investigate a way to construct a metric that accurately preserves the ranking of candidate ITE predictors.
3 Problem Setting and Notation
In this section, we introduce some notation and formulate the evaluation of ITE prediction models.
3.1 Notation
We denote $X \in \mathcal{X} \subseteq \mathbb{R}^d$ as the $d$-dimensional feature vector and $T \in \{0, 1\}$ as a binary treatment assignment indicator. When an individual receives treatment, then $T = 1$, otherwise, $T = 0$. Here, we follow the potential outcome framework [21, 23, 15] and assume that there exist two potential outcomes denoted as $(Y^{(0)}, Y^{(1)})$ for each individual. $Y^{(0)}$ is a potential outcome associated with $T = 0$, and $Y^{(1)}$ is associated with $T = 1$. Note that each individual receives only one treatment and reveals the outcome value for the received treatment. We use $p(X, T, Y^{(0)}, Y^{(1)})$, or simply $p$, to denote the joint probability distribution of these random variables. In addition, the treated and control feature distributions conditioned on treatment assignment are defined as $p^{T=1}(x) = p(x \mid T = 1)$ and $p^{T=0}(x) = p(x \mid T = 0)$, respectively.
We formally define the Individual-level Treatment Effect for an individual with a feature vector $x$ as:
$$\tau(x) = \mathbb{E}[Y^{(1)} - Y^{(0)} \mid X = x] \tag{1}$$
In addition, we use some notation to represent parameters of $p$. First, the conditional expectations of potential outcomes are:
$$m_1(x) = \mathbb{E}[Y^{(1)} \mid X = x], \quad m_0(x) = \mathbb{E}[Y^{(0)} \mid X = x] \tag{2}$$
Next, we define the conditional probability of treatment assignment as:
$$e(x) = \mathbb{P}(T = 1 \mid X = x) \tag{3}$$
This parameter is called the propensity score in causal inference and is widely used to estimate treatment effects from observational data [21, 22, 15].
Throughout this paper, we make the following standard assumptions in causal inference:
Assumption 1.
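(Unconfoundedness) Potential outcomes $(Y^{(0)}, Y^{(1)})$ are independent of the treatment assignment indicator $T$ conditioned on feature vector $X$, i.e.,
$$(Y^{(0)}, Y^{(1)}) \perp T \mid X \tag{4}$$
Assumption 2.
(Overlap) For any point in feature space $\mathcal{X}$, the true propensity score is strictly between 0 and 1, i.e.,
$$0 < e(x) < 1, \quad \forall x \in \mathcal{X} \tag{5}$$
Assumption 3.
(Consistency) Observed outcome $Y^{obs}$ is represented using potential outcomes and the treatment assignment indicator as follows:
$$Y^{obs} = T Y^{(1)} + (1 - T) Y^{(0)} \tag{6}$$
Under these assumptions, the ITE is identifiable from observational data (i.e., $\tau(x) = \mathbb{E}[Y^{obs} \mid X = x, T = 1] - \mathbb{E}[Y^{obs} \mid X = x, T = 0]$) [26].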
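The identification claim can be unpacked into one step per assumption. The derivation below is the standard argument, written in notation assumed here ($Y^{(t)}$ for potential outcomes and $Y^{obs}$ for the observed outcome); it is a sketch and not a quotation from [26].

```latex
\begin{align*}
\tau(x)
&= \mathbb{E}[Y^{(1)} \mid X = x] - \mathbb{E}[Y^{(0)} \mid X = x] \\
&= \mathbb{E}[Y^{(1)} \mid X = x, T = 1] - \mathbb{E}[Y^{(0)} \mid X = x, T = 0]
   && \text{(unconfoundedness)} \\
&= \mathbb{E}[Y^{obs} \mid X = x, T = 1] - \mathbb{E}[Y^{obs} \mid X = x, T = 0]
   && \text{(consistency)}
\end{align*}
```

Overlap ensures that both conditioning events, $T = 1$ and $T = 0$, have positive probability at every $x$, so both conditional expectations in the last line are well defined.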
Furthermore, we define some critical notation following [26].
Definition 1.
(Representation Function) $\Phi: \mathcal{X} \to \mathcal{R}$ is a representation function and $\mathcal{R}$ is called the representation space. We assume that $\Phi$ is a twice differentiable, one-to-one function. Moreover, $p_{\Phi}^{T=1}(r)$ and $p_{\Phi}^{T=0}(r)$ are feature distributions for the treated and controlled induced over the representation space.
Definition 2.
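(Factual and Counterfactual Loss Functions) Let $h: \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis, $w$ be a weighting function, and $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+}$ be a loss function. In addition, the expected loss for the unit and treatment pair $(x, t)$ is denoted as:
$$\ell_{h, \Phi}(x, t) = \int_{\mathcal{Y}} L(Y^{(t)}, h(\Phi(x), t)) \, p(Y^{(t)} \mid x) \, dY^{(t)}$$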
Then, the expected factual and counterfactual losses of a combination of a hypothesis and a representation function are defined as:
Further, the expected factual and counterfactual losses on the treated () and on the controlled () are represented as:
By the definition of the conditional probability, the following equations hold for factual and counterfactual losses:
where .
We also define a class of metrics between probability distributions [27].
Definition 3.
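(Integral Probability Metric) Let $p$ and $q$ be two probability density functions defined over a space $\mathcal{S}$, and let $G$ be a family of functions $g: \mathcal{S} \to \mathbb{R}$. The IPM between the two density functions $p$ and $q$ is defined as:
$$\mathrm{IPM}_{G}(p, q) = \sup_{g \in G} \left| \int_{\mathcal{S}} g(s) \, (p(s) - q(s)) \, ds \right|$$
Function families $G$ can be the family of bounded continuous functions, the family of 1-Lipschitz functions, and the unit ball of functions in a universal reproducing kernel Hilbert space.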
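To make the definition concrete, the sketch below estimates an IPM from two samples for a deliberately simple function family: linear functions $g(s) = \langle w, s \rangle$ with $\|w\|_2 \le 1$. For this family the supremum has a closed form, the Euclidean distance between the two means. The family, the function name `linear_ipm`, and the toy Gaussian data are assumptions made for illustration; this is not one of the families listed above and not the estimator used in the paper.

```python
import math
import random


def linear_ipm(sample_p, sample_q):
    """Empirical IPM over unit-norm linear functions.

    For G = {s -> <w, s> : ||w||_2 <= 1}, the supremum in the IPM definition
    equals the Euclidean distance between the two sample means.
    """
    dim = len(sample_p[0])
    mean_p = [sum(s[j] for s in sample_p) / len(sample_p) for j in range(dim)]
    mean_q = [sum(s[j] for s in sample_q) / len(sample_q) for j in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_p, mean_q)))


rng = random.Random(0)
# Toy "treated" group: first feature shifted by 1 relative to the controls.
treated = [[rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]
control = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]
# A second sample drawn from the same distribution as the controls.
same = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]

ipm_shifted = linear_ipm(treated, control)  # close to the true mean shift of 1
ipm_same = linear_ipm(same, control)        # close to 0
```

A richer family (for example 1-Lipschitz functions, which gives the Wasserstein distance) detects differences beyond the mean, at a higher estimation cost.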
Definition 4.
(Weighted Variance of Potential Outcomes) The weighted expected variance of a potential outcome with respect to a conditional feature distribution is defined as:
3.2 Evaluating ITE prediction models
In previous studies [12, 24, 4], the evaluation of an ITE predictor $\hat{\tau}$ has been formulated as accurately estimating the following metric from an observational validation dataset $\mathcal{V} = \{(X_i, T_i, Y_i^{obs})\}_{i=1}^{n}$:
$$\mathcal{R}_{true}(\hat{\tau}) = \mathbb{E}_{X}\left[ (\tau(X) - \hat{\tau}(X))^2 \right] \tag{7}$$
Here, $\mathcal{R}_{true}(\hat{\tau})$ is the true performance metric of an ITE predictor $\hat{\tau}$. (Several papers have called $\mathcal{R}_{true}$ the expected Precision in Estimation of Heterogeneous Effect (PEHE).)
This approach is intuitive and ideal. However, the realizations of the true ITE are never observable, and thus, accurate performance estimation is difficult. Moreover, estimating the true metric values is not always necessary to conduct valid model selection or hyperparameter tuning of causal models. It may be possible to construct a better metric under an objective specific to selection and tuning. Thus, we take a different approach from previous works and aim to construct a performance estimator $\hat{\mathcal{R}}$ satisfying the following condition:
$$\mathcal{R}_{true}(\hat{\tau}) \le \mathcal{R}_{true}(\hat{\tau}') \Rightarrow \hat{\mathcal{R}}(\hat{\tau}) \le \hat{\mathcal{R}}(\hat{\tau}'), \quad \forall \hat{\tau}, \hat{\tau}' \in \mathcal{M} \tag{8}$$
where $\mathcal{M}$ is a set of candidate ITE predictors.
An estimator satisfying Eq. (8) ranks the candidate predictors in the same order as the true metric values, so we can select the best model in $\mathcal{M}$ using the estimator. The main focus of this paper is to propose a theoretically grounded way to construct a performance estimator that satisfies the condition in Eq. (8) as closely as possible.
4 Method
To achieve our goal, we consider the following feasible estimator of the performance metric:
$$\hat{\mathcal{R}}(\hat{\tau}) = \frac{1}{n} \sum_{i=1}^{n} \left( \tilde{\tau}(X_i, T_i, Y_i^{obs}) - \hat{\tau}(X_i) \right)^2 \tag{9}$$
where $\tilde{\tau}(\cdot)$ is called the oracle and is constructed from validation set $\mathcal{V}$. We consider the estimator for the true risk as represented in Eq. (9) because it can be applied to estimating the performance of any predictor that directly predicts an ITE, such as R-learner [20], Domain Adaptation Learner, or Doubly Robust Learner [11].
Under our formulation, we aim to answer the following question: What is the best plug-in oracle to rank the performance of given candidate ITE predictors from an observational validation dataset $\mathcal{V}$?
In the following subsections, we theoretically analyze the performance estimator as represented in the form of Eq. (9) and propose an oracle that gives accurate ranking of candidate ITE prediction models by the true performance metric.
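As a minimal sketch of how an estimator of the form in Eq. (9) is used for selection, the snippet below computes the mean squared difference between oracle values and each candidate's ITE predictions on a validation set, then sorts the candidates by that value. The function names and the toy numbers are hypothetical; how the oracle values are obtained is left open here.

```python
def estimated_risk(oracle_values, predictions):
    """Mean squared difference between oracle values and ITE predictions."""
    n = len(oracle_values)
    return sum((o - p) ** 2 for o, p in zip(oracle_values, predictions)) / n


def rank_candidates(oracle_values, candidate_predictions):
    """Return candidate names ordered from lowest to highest estimated risk."""
    risks = {
        name: estimated_risk(oracle_values, preds)
        for name, preds in candidate_predictions.items()
    }
    return sorted(risks, key=risks.get), risks


# Toy validation set with four units (all values are made up).
oracle = [1.0, 2.0, 0.5, 1.5]
candidates = {
    "model_a": [1.1, 1.9, 0.4, 1.6],
    "model_b": [0.0, 0.0, 0.0, 0.0],
    "model_c": [1.5, 2.5, 1.0, 2.0],
}
ranking, risks = rank_candidates(oracle, candidates)
best_model = ranking[0]
```

Only the ordering in `ranking` matters for model selection; the absolute risk values may be offset from the true metric without changing which candidate is chosen.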
4.1 Theoretical Analysis of the Performance Estimator
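First, the following proposition states that an oracle that is unbiased for the ITE gives the performance estimator a desirable property.
Proposition 1.
If an oracle $\tilde{\tau}$ is an unbiased estimator for the true ITE:
$$\mathbb{E}[\tilde{\tau}(X, T, Y^{obs}) \mid X] = \tau(X)$$
then, the expectation of performance estimator $\hat{\mathcal{R}}$ is decomposed into the true performance metric and the MSE of the given oracle:
$$\mathbb{E}[\hat{\mathcal{R}}(\hat{\tau})] = \mathcal{R}_{true}(\hat{\tau}) + \mathbb{E}\left[ (\tau(X) - \tilde{\tau}(X, T, Y^{obs}))^2 \right] \tag{10}$$
The first term of RHS of Eq. (10) is the true performance metric, and the second term is independent of the given predictor. Therefore, the expectations of the performance estimators preserve the difference between the true metric values as follows:
$$\mathbb{E}[\hat{\mathcal{R}}(\hat{\tau})] - \mathbb{E}[\hat{\mathcal{R}}(\hat{\tau}')] = \mathcal{R}_{true}(\hat{\tau}) - \mathcal{R}_{true}(\hat{\tau}')$$
where $\hat{\tau}, \hat{\tau}' \in \mathcal{M}$ are arbitrary candidate predictors. This property is desirable because the predictor that has the smallest expected value of $\hat{\mathcal{R}}$ among candidate predictors also has the smallest value of $\mathcal{R}_{true}$ among them; one can expect to select the best predictor among a set of candidates.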
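The decomposition in Eq. (10) follows from expanding the square around the true ITE. The short derivation below is a sketch in assumed notation, writing $\tilde{\tau}$ for the oracle value $\tilde{\tau}(X, T, Y^{obs})$ and treating the candidate $\hat{\tau}$ as fixed (trained on data separate from the validation set).

```latex
\begin{align*}
\mathbb{E}\big[(\tilde{\tau} - \hat{\tau}(X))^2\big]
&= \mathbb{E}\big[(\tau(X) - \hat{\tau}(X))^2\big]
 + \mathbb{E}\big[(\tilde{\tau} - \tau(X))^2\big]
 + 2\,\mathbb{E}\big[(\tau(X) - \hat{\tau}(X))(\tilde{\tau} - \tau(X))\big], \\
\mathbb{E}\big[(\tau(X) - \hat{\tau}(X))(\tilde{\tau} - \tau(X))\big]
&= \mathbb{E}_{X}\big[(\tau(X) - \hat{\tau}(X))\,
   \mathbb{E}[\tilde{\tau} - \tau(X) \mid X]\big] = 0 .
\end{align*}
```

The cross term vanishes because the oracle is unbiased conditionally on $X$, which leaves the true metric plus a term that does not involve $\hat{\tau}$.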
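However, the expectation of the performance estimator is incalculable because we can only use a finite sample validation dataset. This motivates us to consider the finite sample uncertainty of the performance estimator. Here, the empirical version of the performance estimator can be decomposed as:
$$\hat{\mathcal{R}}(\hat{\tau}) = \frac{1}{n} \sum_{i=1}^{n} (\tau(X_i) - \hat{\tau}(X_i))^2 + \underbrace{\frac{2}{n} \sum_{i=1}^{n} (\tau(X_i) - \hat{\tau}(X_i)) (\tilde{\tau}(X_i, T_i, Y_i^{obs}) - \tau(X_i))}_{\mathcal{W}} + \frac{1}{n} \sum_{i=1}^{n} (\tau(X_i) - \tilde{\tau}(X_i, T_i, Y_i^{obs}))^2 \tag{11}$$
In Eq. (11), the second term of RHS ($\mathcal{W}$) is critical to the uncertainty and is controllable by the oracle. Thus, we consider the oracle minimizing the variance of $\mathcal{W}$ to construct the performance estimator. The following theorem states the upper bound of the variance of $\mathcal{W}$.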
Theorem 1.
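Assume that $|\tau(x) - \hat{\tau}(x)|$ is upper bounded by a positive constant $C$ for a given predictor $\hat{\tau}$. Additionally, assume that the oracle is unbiased for the ITE and that the output of the oracle for an instance is independent of that of other instances. Then, we have the upper bound of the variance of $\mathcal{W}$ as follows:
$$\mathbb{V}(\mathcal{W}) \le \frac{4 C^2}{n} \, \mathbb{E}_{X}\left[ \mathbb{V}(\tilde{\tau}(X, T, Y^{obs}) \mid X) \right] \tag{12}$$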
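A bound of this kind can be obtained by a short calculation. The sketch below writes $\mathcal{W}$ for the cross term of Eq. (11) and $\tilde{\tau}_i$ for the oracle value of instance $i$, and assumes i.i.d. validation instances with the uniform bound $|\tau(x) - \hat{\tau}(x)| \le C$; the constant name and the exact form of the boundedness assumption are assumptions of this sketch.

```latex
\begin{align*}
\mathbb{V}(\mathcal{W})
&= \frac{4}{n^2} \sum_{i=1}^{n}
   \mathbb{V}\big( (\tau(X_i) - \hat{\tau}(X_i)) (\tilde{\tau}_i - \tau(X_i)) \big)
   && \text{(independence across instances)} \\
&= \frac{4}{n}\,
   \mathbb{E}\big[ (\tau(X) - \hat{\tau}(X))^2 (\tilde{\tau} - \tau(X))^2 \big]
   && \text{(zero mean by unbiasedness)} \\
&= \frac{4}{n}\,
   \mathbb{E}_{X}\big[ (\tau(X) - \hat{\tau}(X))^2 \,
   \mathbb{V}(\tilde{\tau} \mid X) \big]
 \le \frac{4 C^2}{n}\, \mathbb{E}_{X}\big[ \mathbb{V}(\tilde{\tau} \mid X) \big].
\end{align*}
```

The only factor that depends on how the oracle is built is the expected conditional variance, which is why the construction of the oracle targets that quantity.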
In Eq. (12), the expectation of the conditional variance of the oracle is the only term that can be controlled through the construction of the oracle. Thus, an oracle satisfying the following condition is desirable to construct the performance estimator $\hat{\mathcal{R}}$:
$$\min_{\tilde{\tau}} \; \mathbb{E}_{X}\left[ \mathbb{V}(\tilde{\tau}(X, T, Y^{obs}) \mid X) \right] \tag{13}$$
$$\text{s.t.} \quad \mathbb{E}[\tilde{\tau}(X, T, Y^{obs}) \mid X] = \tau(X) \tag{14}$$
A performance estimator using an oracle that satisfies the conditions above is expected to preserve the difference of the true performance metric and to minimize the upper bound of the finite sample uncertainty term in Eq. (11). In the next subsection, we investigate and propose a method to derive an oracle achieving our goal.
4.2 Proposed Oracle
In this subsection, we present our class of counterfactual cross-validation (CFCV) performance estimators. In particular, we propose a method to construct an oracle that leads to an effective estimator based on the theoretical analysis in the previous subsection. The main idea of CFCV is to obtain an unbiased oracle with minimal variance, so as to better satisfy the conditions in Eq. (13) and Eq. (14).
Here we define the proposed oracle inspired by the doubly robust (DR) estimator used to estimate average causal effects of treatments or performance of bandit policies [5, 9, 16, 10].
Definition 5.
Let $f$ be a hypothesis predicting potential outcomes, defined as $f(x, t) = h(\Phi(x), t)$. Then, the oracle for a given data point $(X, T, Y^{obs})$ is defined as follows:
$$\tilde{\tau}_{DR}(X, T, Y^{obs}) = \frac{T}{e(X)} (Y^{obs} - f(X, 1)) - \frac{1 - T}{1 - e(X)} (Y^{obs} - f(X, 0)) + (f(X, 1) - f(X, 0))$$
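The doubly robust oracle of Definition 5 is a one-line computation per unit. The sketch below implements it and then runs a small Monte Carlo check at a single feature point, using the true propensity score together with a deliberately poor hypothesis. The function name `dr_oracle` and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def dr_oracle(t, y_obs, e, f1, f0):
    """Doubly robust oracle for a single unit.

    t: treatment indicator (0 or 1); y_obs: observed outcome;
    e: propensity score P(T=1 | X); f1, f0: predictions f(X, 1) and f(X, 0).
    """
    ipw_correction = t / e * (y_obs - f1) - (1 - t) / (1 - e) * (y_obs - f0)
    return ipw_correction + (f1 - f0)


# Deterministic examples: one treated and one control unit.
treated_value = dr_oracle(t=1, y_obs=3.0, e=0.5, f1=2.0, f0=1.0)
control_value = dr_oracle(t=0, y_obs=1.5, e=0.5, f1=2.0, f0=1.0)

# Monte Carlo check at one feature point: true ITE is m1 - m0 = 1.5.
rng = random.Random(0)
e, m1, m0 = 0.3, 2.0, 0.5
f1_bad, f0_bad = 0.0, 3.0  # intentionally wrong outcome predictions
n_draws = 200_000
total = 0.0
for _ in range(n_draws):
    t = 1 if rng.random() < e else 0
    y_obs = (m1 if t == 1 else m0) + rng.gauss(0.0, 1.0)
    total += dr_oracle(t, y_obs, e, f1_bad, f0_bad)
mc_mean = total / n_draws  # stays close to 1.5 despite the wrong hypothesis
```

With the true propensity score, the average stays near the true ITE even when `f1_bad` and `f0_bad` are far from the conditional means; the quality of the hypothesis affects the spread of the oracle values rather than their mean.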
We rely on the class of DR estimators for constructing the oracle because it theoretically and empirically achieves a better bias-variance trade-off in a variety of fields than other unbiased estimators, including IPW estimators [9, 16, 10, 28].
First, the oracle in the form of Definition 5 is proved to be unbiased for the true ITE, thus satisfying Eq. (14).
Proposition 2.
Given true propensity scores, the doubly robust oracle is unbiased for the true ITE:
$$\mathbb{E}[\tilde{\tau}_{DR}(X, T, Y^{obs}) \mid X] = \tau(X) \tag{15}$$
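The unbiasedness claim can be verified term by term. The sketch below uses assumed notation, with $m_t(X) = \mathbb{E}[Y^{(t)} \mid X]$ for the conditional means of the potential outcomes and $e(X)$ for the true propensity score, and relies on consistency and unconfoundedness.

```latex
\begin{align*}
\mathbb{E}\Big[\tfrac{T}{e(X)} (Y^{obs} - f(X, 1)) \,\Big|\, X\Big]
&= \frac{\mathbb{E}[T \mid X]}{e(X)}\,
   \big(\mathbb{E}[Y^{(1)} \mid X] - f(X, 1)\big)
 = m_1(X) - f(X, 1), \\
\mathbb{E}\Big[\tfrac{1 - T}{1 - e(X)} (Y^{obs} - f(X, 0)) \,\Big|\, X\Big]
&= m_0(X) - f(X, 0), \\
\mathbb{E}[\tilde{\tau}_{DR} \mid X]
&= \big(m_1(X) - f(X, 1)\big) - \big(m_0(X) - f(X, 0)\big)
 + \big(f(X, 1) - f(X, 0)\big)
 = \tau(X).
\end{align*}
```

The hypothesis $f$ cancels out of the conditional mean, so unbiasedness holds for any choice of $f$ as long as the propensity score is correct.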
Next, to consider the condition in Eq. (13), we state the expectation of the conditional variance of the DR oracle.
Proposition 3.
Given true propensity scores, the expectation of the conditional variance of the doubly robust oracle is represented as:
(16) 
where
To find an oracle satisfying the variance condition in Eq. (13), we aim to train a hypothesis minimizing the variance derived in Eq. (16). In the variance, and are independent of , and are thus uncontrollable. On the other hand, the third term of RHS of Eq. (16) () is dependent on both and . However, either or is always counterfactual, and thus, the direct minimization of is infeasible.
Therefore, to find the appropriate hypothesis that minimizes from an observational validation set, we derive the upper bound of depending only on observable variables.
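The structure described above, two terms that do not involve the hypothesis and one that does, can be seen from a direct calculation of the conditional variance of the oracle in Definition 5. The sketch below uses assumed notation: $\sigma_t^2(X) = \mathbb{V}(Y^{(t)} \mid X)$ and $m_t(X) = \mathbb{E}[Y^{(t)} \mid X]$. It is a direct computation under the true propensity score, and its grouping of terms may differ from the form used in Eq. (16).

```latex
\begin{align*}
\mathbb{V}(\tilde{\tau}_{DR} \mid X)
&= \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)}
 + \left( \sqrt{\frac{1 - e(X)}{e(X)}}\, \big(m_1(X) - f(X, 1)\big)
        + \sqrt{\frac{e(X)}{1 - e(X)}}\, \big(m_0(X) - f(X, 0)\big) \right)^{2}
\end{align*}
```

The first two terms depend only on the outcome noise and the propensity score. The squared term is zero when $f(X, t) = m_t(X)$, but evaluating it requires the prediction error for both treatment arms at the same $X$, one of which is always counterfactual.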
Theorem 2.
Let be a family of functions and assume that, for any given , there exists a positive constant such that the per-unit expected loss functions obey where is the inverse image of . Then, the following inequality holds:
(17) 
where .
Eq. (17) in Theorem 2 consists of factual losses and an IPM on the representation space and thus can be estimated from finite samples. From the theoretical implications above, the loss function to derive a hypothesis and a representation function is:
(18) 
where is a trade-off hyperparameter.
The derived oracle for our CFCV is unbiased for the true metric and minimizes the upper bound of the variance in Eq. (17). A summary of the resulting CFCV is given in Algorithm 1.
5 Experiments
In this section, we compare our proposed performance estimator with the other baselines using a standard semi-synthetic dataset.
5.1 Basic Experimental Setups
5.1.1 Dataset
We used the Infant Health and Development Program (IHDP) dataset provided by [13]. IHDP is an interventional program aimed at improving the health of premature infants [13, 2]. This is a standard semi-synthetic dataset containing 747 children with 25 features and has been widely used to evaluate ITE prediction models [26, 30, 2]. A detailed description of this dataset can be found in Section 5.1 of [26]. We used the simulated outcome implemented in the EconML package (https://github.com/microsoft/EconML/blob/master/econml/data/dgps.py).
5.1.2 Baselines
We compared our CFCV with the following baseline metrics.

Plug-in validation: This uses the values of potential outcomes predicted by an arbitrary machine learning algorithm as the oracle of the performance estimator in Eq. (9):
$$\tilde{\tau}_{plug}(X) = \hat{f}(X, 1) - \hat{f}(X, 0)$$
where $\hat{f}(X, 1)$ and $\hat{f}(X, 0)$ are predictions for potential outcomes. We used Counterfactual Regression [26] to construct the oracle and to ensure a fair comparison.
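For comparison with the proposed oracle, the two simplest baseline oracles can be sketched in a few lines. The IPW form below is the standard transformed outcome, which is assumed here to match the IPW validation baseline; the plug-in form is the difference of predicted potential outcomes. The function names and example values are hypothetical.

```python
def ipw_oracle(t, y_obs, e):
    """Standard IPW transformed outcome: unbiased for the ITE when e is the
    true propensity score, but typically high-variance when e is near 0 or 1."""
    return t * y_obs / e - (1 - t) * y_obs / (1 - e)


def plugin_oracle(f1, f0):
    """Plug-in oracle: difference of predicted potential outcomes.
    Low variance, but it inherits any bias of the outcome model."""
    return f1 - f0


ipw_treated = ipw_oracle(t=1, y_obs=3.0, e=0.5)
ipw_control = ipw_oracle(t=0, y_obs=1.5, e=0.25)
plug = plugin_oracle(f1=2.0, f0=1.0)
```

The doubly robust oracle combines the two ideas: it starts from the plug-in value and adds an inverse-propensity-weighted correction of the factual residual.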
5.2 Comparison on Model Selection Performance
We first compared the model selection performance of our CFCV with the other baselines.
5.2.1 Experimental Procedure
We followed the experimental procedure in [24]. We first trained the candidate predictors on the training set and then used the trained predictors to make predictions on both the validation and test sets. Then, we ranked those predictors using each metric on the observational validation set. Finally, we compared the estimated performance on the validation set with the true performance on the test set. We repeated this procedure over 30 realizations with 35/35/30 train/validation/test splits.
5.2.2 Candidate Models
We constructed a set of candidate predictors by combining five machine learning algorithms (Decision Tree, Random Forest, Gradient Boosting Tree, Ridge Regressor, and Support Vector Regressor with RBF kernel) and five meta-learners (S-Learner, X-Learner, T-Learner, Domain Adaptation Learner, and Doubly Robust Learner) as implemented in the EconML package (https://econml.azurewebsites.net/spec/estimation/metalearners.html). Thus, we had a set of 25 ITE predictors to select among (i.e., $|\mathcal{M}| = 25$).
5.2.3 Results
Table 1 reports the average and worst-case performance over 30 realizations. We evaluated the worst-case model selection performance of each metric because, in real-world causal inference problems, we never know the ground truth performance of any predictor, and stable model selection performance is essential. Rank Correlation is the Spearman rank correlation between the ranking by the true performance and the ranking by the estimated metric values. Relative RMSE is the true performance of the model selected by each metric relative to the best one in $\mathcal{M}$. We used Relative RMSE defined as below because potential outcomes of the IHDP dataset have different scales among realizations.
Table 1 shows that the proposed CFCV achieves effective model selection performance on average. Moreover, CFCV significantly outperformed the baselines with respect to worst-case performance, which empirically suggests the stability of the proposed metric.
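| Metric | Rank Correlation, Avg (± SE) | Rank Correlation, Worst-Case | Relative RMSE, Avg (± SE) | Relative RMSE, Worst-Case |
| IPW | | | | 7.779 |
| $\tau$-risk | | | | 8.884 |
| Plug-in | | | | 1.841 |
| CFCV | | | | |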
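The two summary statistics in Table 1 can be computed from the true and estimated risks of the candidates in one realization. The sketch below implements the Spearman rank correlation with average ranks for ties, and a relative RMSE that is assumed here to be the ratio between the true RMSE of the selected model and that of the best candidate; that ratio form, the function names, and the toy values are assumptions for illustration.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def relative_rmse(true_risks, estimated_risks):
    """True RMSE of the model picked by the metric over the best true RMSE."""
    selected = min(range(len(estimated_risks)), key=lambda i: estimated_risks[i])
    return math.sqrt(true_risks[selected]) / math.sqrt(min(true_risks))


# Toy example with four candidates (squared-error risks, values made up).
true_risks = [1.0, 4.0, 9.0, 16.0]
estimated_risks = [2.0, 1.5, 10.0, 20.0]
rho = spearman(true_risks, estimated_risks)
rel = relative_rmse(true_risks, estimated_risks)
```

In this toy case the metric swaps the two best candidates, so the rank correlation drops below 1 and the selected model has twice the RMSE of the best one.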
5.3 Comparison on Hyperparameter Tuning Performance
Next, we compared the hyperparameter tuning performance of our CFCV with the other baseline metrics.
5.3.1 Tuned Model
We tuned the hyperparameters of the combination of Gradient Boosting Regressor and Domain Adaptation Learner as implemented in scikit-learn and EconML, respectively. Domain Adaptation Learner consists of three base learners: treated_model, controls_model, and overall_model. Thus, we aimed to find the best three sets of hyperparameters of Gradient Boosting Regressor to construct Domain Adaptation Learner.
5.3.2 Experimental Procedure
We used the Optuna software [1] with a TPE sampler to tune the ITE predictor and set each metric as the objective function of Optuna. For each metric, we evaluated 100 points in the hyperparameter search space. The hyperparameter tuning performance of each metric was evaluated by the true performance of the tuned model on the test set. We conducted the experimental procedure over the same 30 realizations and the same 35/35/30 train/validation/test splits as in the model selection experiment.
5.3.3 Results
Figure 1 provides the results of the hyperparameter tuning experiment. We report the averaged performance of each metric relative to the performance of the IPW metric because potential outcomes of the IHDP dataset have different scales among realizations.
The results suggest that our metric improved performance by over 17.9% compared with IPW and by 6.2% compared with Plug-in. Thus, our metric allows one to conduct valid hyperparameter tuning of causal inference models.
6 Conclusion
In this paper, we explored the evaluation problem of ITE prediction models. In contrast to previous studies, we proposed a new approach called counterfactual cross-validation that preserves the ranking of the true performance with high confidence using only an observational validation set. The proposed evaluation metric was theoretically proved to preserve the true ranking of the model performance in expectation and to minimize the upper bound of the finite sample uncertainty in model evaluation. Empirical evaluation using the IHDP dataset demonstrated the effective and stable model selection and hyperparameter tuning performance of the proposed metric.
Important future research directions are a theoretical analysis of how to choose the hyperparameters of the potential outcome models used in the proposed metric, and the treatment of situations with hidden confounders.
References
[1] Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 2623–2631. New York, NY, USA: ACM.
[2] Alaa, A., and Schaar, M. 2018. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In International Conference on Machine Learning, 129–138.
[3] Alaa, A. M., and van der Schaar, M. 2017. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, 3424–3432.
[4] Alaa, A., and Van Der Schaar, M. 2019. Validating causal inference models via influence functions. In International Conference on Machine Learning, 191–201.
[5] Bang, H., and Robins, J. M. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973.
[6] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.
[7] Diemert, E.; Betlei, A.; Renaudin, C.; and Amini, M.-R. 2018. A large scale benchmark for uplift modeling. In Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom.
[8] Du, X.; Sun, L.; Duivesteijn, W.; Nikolaev, A.; and Pechenizkiy, M. 2019. Adversarial balancing-based representation learning for causal effect inference with observational data. arXiv preprint arXiv:1904.13335.
[9] Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 1097–1104. Omnipress.
[10] Farajtabar, M.; Chow, Y.; and Ghavamzadeh, M. 2018. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 1446–1455.
[11] Foster, D. J., and Syrgkanis, V. 2019. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036.
[12] Gutierrez, P., and Gérardy, J.-Y. 2017. Causal inference and uplift modelling: A review of the literature. In International Conference on Predictive Applications and APIs, 1–13.
[13] Hill, J. L. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20(1):217–240.
[14] Holland, P. W. 1986. Statistics and causal inference. Journal of the American Statistical Association 81(396):945–960.
[15] Imbens, G. W., and Rubin, D. B. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
[16] Jiang, N., and Li, L. 2016. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 652–661.
[17] Johansson, F.; Shalit, U.; and Sontag, D. 2016. Learning representations for counterfactual inference. In International Conference on Machine Learning, 3020–3029.
[18] Louizos, C.; Shalit, U.; Mooij, J. M.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, 6446–6456.
[19] Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009.
[20] Nie, X., and Wager, S. 2017. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.
[21] Rosenbaum, P. R., and Rubin, D. B. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.
[22] Rubin, D. B. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5):688.
[23] Rubin, D. B. 2005. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100(469):322–331.
[24] Schuler, A.; Baiocchi, M.; Tibshirani, R.; and Shah, N. 2018. A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146.
[25] Setoguchi, S.; Schneeweiss, S.; Brookhart, M. A.; Glynn, R. J.; and Cook, E. F. 2008. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety 17(6):546–555.
[26] Shalit, U.; Johansson, F. D.; and Sontag, D. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 3076–3085. JMLR.org.
[27] Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Schölkopf, B.; Lanckriet, G. R.; et al. 2012. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics 6:1550–1599.
[28] Vlassis, N.; Bibaut, A.; Dimakopoulou, M.; and Jebara, T. 2019. On the design of estimators for bandit off-policy evaluation. In International Conference on Machine Learning, 6468–6476.
[29] Yao, L.; Li, S.; Li, Y.; Huai, M.; Gao, J.; and Zhang, A. 2018. Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, 2633–2643.
[30] Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International Conference on Learning Representations.