1 INTRODUCTION

## Abstract

Causal inference, or counterfactual prediction, is central to decision making in healthcare, policy and social sciences. To de-bias causal estimators with high-dimensional data in observational studies, recent advances suggest the importance of combining machine learning models for both the propensity score and the outcome function. We propose a novel scalable method to learn double-robust representations for counterfactual predictions, leading to consistent causal estimation if the model for either the propensity score or the outcome, but not necessarily both, is correctly specified. Specifically, we use the entropy balancing method to learn the weights that minimize the Jensen-Shannon divergence of the representation between the treated and control groups, based on which we make robust and efficient counterfactual predictions for both individual and average treatment effects. We provide theoretical justifications for the proposed method. The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.

## 1 Introduction

Causal inference is central to decision-making in healthcare, policy, online advertising and social sciences. The main hurdle to causal inference is confounding, , factors that affect both the outcome and the treatment assignment (VanderWeele and Shpitser, 2013). For example, a beneficial medical treatment may be more likely assigned to patients with worse health conditions; then directly comparing the clinical outcomes of the treated and control groups, without adjusting for the difference in the baseline characteristics, would severely bias the causal comparisons and mistakenly conclude the treatment is harmful. Therefore, a key in de-biasing causal estimators is to balance the confounding covariates or features.

This paper focuses on using observational data to estimate treatment effects, defined as the contrasts between the counterfactual outcomes of the same study units under different treatment conditions (Neyman et al., 1990; Rubin, 1974). In observational studies, researchers do not have direct knowledge on how the treatment is assigned, and substantial imbalance in covariates between different treatment groups is prevalent. A classic approach for balancing covariates is to assign an importance weight to each unit so that the covariates are balanced after reweighting (Hirano et al., 2003; Hainmueller, 2012; Imai and Ratkovic, 2014; Li et al., 2018; Kallus, 2018). The weights usually involve the propensity score (Rosenbaum and Rubin, 1983) – a summary of the treatment assignment mechanism. Another stream of conventional causal methods directly model the outcome surface as a function of the covariates under treated and control condition to impute the missing counterfactual outcomes (Rubin, 1979; Imbens et al., 2005; Hill, 2011).

Advances in machine learning bring new tools to causal reasoning. A popular direction employs the framework of representation learning and impose balance in the representation space (Johansson et al., 2016; Shalit et al., 2017; Zhang et al., 2020). These methods usually separate the tasks of propensity score estimation and outcome modeling. However, recent theoretical evidence reveals that good performance in predicting either the propensity score or the observed outcome alone does not necessarily translate into good performance in estimating the causal effects (Belloni et al., 2014). In particular, (Chernozhukov et al., 2018) pointed out it is necessary to combine machine learning models for the propensity score and the outcome function to achieve consistency in estimating the average treatment effect (ATE). A closely related concept is double-robustness (Scharfstein et al., 1999; Lunceford and Davidian, 2004; Kang and Schafer, 2007), in which an estimator is consistent if either the propensity score model or the outcome model, but not necessarily both, is correctly specified. A similar concept also appears in the field of reinforcement learning for policy evaluation (Dudík et al., 2011; Jiang and Li, 2016; Kallus and Uehara, 2019). Double-robust estimators are desirable because they give analysts two chances to “get it right” and guard against model misspecification.

In this work, we bridge the classical and modern views of covariate balancing in causal inference in a unified framework. We propose a novel and scalable method to learn double-robust representations for counterfactual predictions with observational data, allowing for robust learning of the representations and balancing weights simultaneously. Though the proposed method is motivated by ATE estimation, it also achieves comparable performance with state-of-the-art on individual treatment effects (ITE) estimation. Specifically, we made the following contributions: (i) We propose to regularize the representations with the entropy of an optimal weight for each unit, obtained via an entropy balancing procedure. (ii) We show that minimizing the entropy of balancing weights corresponds to a regularization on Jensen-Shannon divergence of the low-dimensional representation distributions between the treated and control groups, and more importantly, leads to a double-robust estimator of the ATE. (iii) We show that the entropy of balancing weights can bound the generalization error and therefore reduce ITE prediction error.

## 2 Background

### 2.1 Setup and Assumptions

Assume we have a sample of units, with in treatment group and in control group. Each unit () has a binary treatment indicator ( for control and for treated), features or covariates . Each unit has a pair of potential outcomes corresponding to treatment and control, respectively, and causal effects are contrasts of the potential outcomes. We define individual treatment effect (ITE), also known as conditional average treatment effect (CATE) for context as: , and the average treatment effect (ATE) as: . The ITE quantifies the effect of the treatment for the unit(s) with a specific feature value, whereas ATE quantities the average effect over a target population. When the treatment effects are heterogeneous, the discrepancy between ITE for some context and ATE can be large. Despite the increasing attention on ITE in recent years, average estimands such as ATE remain the most important and commonly reported causal parameters in a wide range of disciplines. Our method is targeted at estimating ATE, but we will also examine its performance in estimating ITE.

For each unit, only the potential outcome corresponding to the observed treatment condition is observed, , and the other is counterfactual. Therefore, additional assumptions are necessary for estimating the causal effects. Throughout the discussion, we maintain two standard assumptions:

###### Assumption 2 (Overlap).

.

Under Assumption 1 and 2, treatment effects can be identified from the observed data. In observational studies, there is often significant imbalance in the covariates distributions between the treated and control groups, and thus directly comparing the average outcome between the groups may be lead to biased causal estimates. Therefore, an important step to de-bias the causal estimators is to balance the covariates distributions between the groups, which usually involves the propensity score , a summary of the treatment assignment mechanism. Once good balance is obtained, one can also build an outcome regression model for to impute the counterfactual outcomes and estimate ATE and ITE via the vanilla estimator and .

### 2.2 Related Work

Double robustness  A double-robust (DR) estimator combines the propensity score and outcome model; a common example for ATE (Robins et al., 1994; Lunceford and Davidian, 2004) is:

 ^τ\textupDR\tiny{ATE}=N∑i=1^w\textupIPWi(2Ti−1){Yi−^fTi(Xi)}+ (1) 1NN∑i=1{^f1(Xi)−^f0(Xi)},

where is the inverse probability weights (IPW). DR estimator has two appealing benefits: (i) it is DR in the sense that it remains consistent if either propensity score model or outcome model is correctly specified, not necessarily both; (ii) it reaches the semiparametric efficiency bound of if both models are correctly specified (Hahn, 1998; Chernozhukov et al., 2018). However, the finite-sample variance for can be quite large when the IPW have extreme values, which is likely to happen with severe confoundings. Several variants of the DR estimator have been proposed to avoid extreme importance weights, such as clipping or truncation (Bottou et al., 2013; Wang et al., 2017; Su et al., 2019). We propose a new weighting scheme, combined with the representation learning, to calculate the weights with less extreme values and maintain the double robustness.

Representation learning with balance regularization For causal inference with high-dimensional or complex observational data, an important consideration is dimension reduction. Specifically, we may wish to find a representations of the original space, and build the model based on the representations instead of directly on the features , . To this end, Johansson et al. (2016) and Shalit et al. (2017) proposed to combine predictive power and covariate balance to learn the representations, via minimizing the following type of loss function in the Counterfactual Regression (CFR) framework:

 argminf,Φ {∑i=1L(fTi(Φ(Xi)),Yi)+ (2) κ⋅D({Φ(Xi)}Ti=0,{Φ(Xi)}Ti=1)},

where the first term measures the predictive power the representation , the second term measures the distance between the representation distribution in treated and control groups, and is a hyperparameter controlling the importance of distance. This type of loss function targets learning representations that are predictive of the outcome and well balanced between the groups. Choice of the distance measure in (2) is crucial for the operating characteristics of the method; popular choices include the Integral Probability Measure (IPM) such as the Wasserstein (WASS) distance (Villani, 2008; Cuturi and Doucet, 2014) or Maximum Mean Discrepancy (MMD)(Gretton et al., 2009).

Concerning related modifications of (2), in Zhang et al. (2020), the authors argue that balancing representations in (2) may over-penalize the model when domain overlap is satisfied and propose to use the counterfactual variance as a measure for imbalance, which can also address measure the “local” similarity in distribution. In Hassanpour and Greiner (2019) the authors reweight regression terms with inverse probability weights (IPW) estimated from the representations. In Johansson et al. (2018), the authors tackle the distributional shift problem, for which they alternately optimize a weighting function and outcome models for prediction jointly to reduce the generalization error.

The optimization problem (2) only involves the outcome model , misspecification of which would likely introduce biased causal estimates. In contrast, the class of causal estimators of DR estimators like (1) combine the propensity score model with the outcome model to add robustness against model misspecifications. A number of DR causal estimators for high-dimensional data have been proposed (Belloni et al., 2014; Farrell, 2015; Antonelli et al., 2018), but none has incorporated representation learning. Below we propose the first DR representation learning method for counterfactual prediction. The key is the entropy balancing procedure, which we briefly review below.

Entropy balancing   To mitigate the extreme weights problem of IPW in (1), one stream of weighting methods learn the weights by minimizing the variation of weights subject to a set of balancing constraints, bypassing estimating the propensity score. Among these, entropy balancing (EB) (Hainmueller, 2012) has received much interest in social science (Ferwerda, 2014; Marcus, 2013). EB was originally designed for estimating the average treatment effect on the treated (ATT), but is straightforward to adapt to other estimands. Specifically, the EB weights for ATE, are obtained via the following programming problem:

 w\tiny{EB}=arg~{}maxw{−N∑i=1wilogwi,}, (3) s.t. {\textup(i)∑Ti=0wiXji=∑Ti=1wiXji,∀j∈[1:p],\textup(ii)∑Ti=0wi=∑Ti=1wi=1,wi>0.

Covariate balancing is enforced by the the first constraint (i), also known as the moment constraint, that the weighted average for each covariate of respective treatment groups are equal. Generalizations to higher moments are straight forward although less considered in practice. The second constraint simply ensures the weights are normalized. This objective is an instantiation of the maximal-entropy learning principle (Jaynes, 1957b, a), a concept derived from statistical physics that stipulates the most plausible state of a constrained physical system is the one maximizes its entropy. Intuitively, EB weights penalizes the extreme weights while keeps balancing condition satisfied.

Though the construction of EB does not explicitly impose models for either or , Zhao and Percival (2017) showed that EB implicitly fits a linear logistic regression model for the propensity score and a linear regression model for the outcome simultaneously, where the predictors are the covariates pr representations being balanced. Entropy balancing is DR in the sense that if only of the two models are correctly specified, the EB weighting estimator is consistent for the ATE. Note that the original EB procedure does not provide ITE estimation, which is explored in this work.

## 3 Double Robust Representation Learning

### 3.1 Proposal: Unifying Covariate Balance and Representation Learning

Based on the discussion above, we propose a novel method to learn DR representations for counterfactual predictions. Our development is motivated by an insightful heuristic: the entropy of balancing weight is a proxy measure of the covariate imbalance between the treatment groups. To understand the logic behind this intuition, recall the more dis-similar two distributions are, the more likely extreme weights are required to satisfy the matching criteria, and consequently resulting a bigger entropy for the balancing weight. See Figure 1 also for a graphical illustration of this. In Section 3.3, we will formalize this intuition based on information-theoretic arguments.

We adjust the constrained EB programming problem from (3) to (4), achieving the balance among the representations/transformed features. As we shall see later, this distance metric, entropy of balancing weights, leads to desirable theoretical properties in both ATE and ITE estimation.

 w\tiny{EB}=arg~{}maxw{−N∑i=1wilogwi,}, (4) s.t. {\textup(i)∑Ti=0wiΦ(Xji)=∑Ti=1wiΦ(Xji),\textup(ii)∑Ti=0wi=∑Ti=1wi=1,wi>0.

Specifically, we propose to learn a low-dimensional representation of the feature space, , through minimizing the following loss function:

 argminf,Φ\textuppredictionlossonobservedoutcomes{∑i(Yi−ft=Ti(Φ(Xi)))2+κ∑i=1w\tiny{EB}i(Φ)logw\tiny{EB}i(Φ)}\textupdistancemetric,balanceregularization, (5)

where we replace the distance metrics in (2) with the entropy of , function of the representation as implied in the notation, which is the solution to (4). At first sight, solving the system defined by (4) and (5) is challenging, because the gradient can not be back-propagated through the nested optimization (4). Another appealing property of EB is computational efficiency. We can solve the dual problem of (4):

 minλ {log⎛⎝∑Ti=0exp(⟨λ0,Φi⟩)⎞⎠+ (6) log⎛⎝∑Ti=1exp(⟨λ1,Φi⟩)⎞⎠−⟨λ0+λ1,¯Φ⟩},

where are the Lagrangian multipliers, is the unnormalized mean and denote the inner product. Note that (6) is a convex problem wrt , and therefore can be efficiently solved using standard convex optimization packages when the sample size is small. Via appealing to the Karush–Kuhn–Tucker (KKT) conditions, the optimal EB weights can be given in the following Softmax form

 w\tiny{EB}i(Φ)=exp(ηi)∑Tk=Tiexp(ηk), (7) ηi≜−(2Ti−1)⟨λ\tiny{EB}Ti,Φi⟩,

where is the solution to the dual problem (6). Equation (63) shows how to explicitly express the entropy weights as a function of the representation , thereby enabling efficient end-to-end training of the representation. Compared to the CFR framework, we have replaced the IPM matching term with the entropy term . When applied to the ATE estimation, the commensurate learned entropy balancing weights guarantees the to be DR. For ITE estimation, , as a regularization term in (5), can bound the ITE prediction error.

A few remarks are in order. For reasons that will be clear in Section 3.3, we will restrict to the family of linear functions, to ensure the nice theoretical properties of DRRL. Note that is not a restrictive assumption, as many schemes seek representations that can linearize the operations. For instance, outputs of a deep neural nets are typically given by a linear mapping of the penultimate layers. Many modern learning theories, such as reproducing kernel Hilbert space (RKHS), are formulated under inner product spaces (i.e., generalized linear operations).

After obtaining the representation , the outcome function , and the EB weights , we have the following estimators of and ,

 ^τ\tiny{EB}\tiny{ATE} = N∑i=1^w\tiny{EB}i(2Ti−1){Yi−^fTi(^Φ(Xi))} (8) +1NN∑i=1{^f1(^Φ(Xi))−^f0(^Φ(Xi))}, ^τ\tiny{EB}(x) = ^f1(^Φ(x))−^f0(^Φ(x)). (9)

In practice, we can parameterize the representations by as and the outcome function by as to learn the instead.

### 3.2 Practical Implementation

We now propose an algorithm – referred as Double Robust Representations Learning (DRRL) – to implement the proposed method when we parameterize the representations by neural networks. DRRL simultaneously learn the representations , the EB weights and the outcome function . The network consists of a representation layer performing non-linear transformation of the original feature space, an entropy-balancing layer solving the dual programming problem in (6) and a final layer learning the outcome function. We visualize the DRRL architecture in Figure 2.

We train the model by iteratively solving the programming problem in (4) given the representations and minimizing the loss function in (5) given the optimized weights . As we have successfully expressed EB weights, and consequently the entropy term, directly through the learned representation in (63), it enables efficient gradient-based learned schemes, such as stochastic gradient descent, for the training of DRRL using modern differential programming platforms (e.g., tensorflow, pytorch). As an additional remark, we note although the Lagrangian multiplier is computed from the representation , its gradient with respect to is zero based on the Envelop theorem (Carter, 2001). This impliess we can safely treat as if it is a constant in our training objective.

Adaptation to ATT estimand  So far we have focused on DR representations for ATE; the proposed method can be easily modified to other estimands. For example, for the average treatment effect on the treated (ATT), we can modify the EB constraint to and change the objective function to in (4). For ATT, we only need to reweight the control group to match the distribution of the treated group, which remains the same. Thus we only impose balancing constraints on the weighted average of representations of the control units; the objective function only applies to the weights of the control units. In the SM, we also provide theoretical proofs for the double-robustness property of the ATT estimator.

Scalable generalization  A bottleneck in scaling up our algorithm to large data is solving optimization problem (6) in the entropy balancing stage. Below we develop a scalable updating scheme with the idea of Fenchel mini-max learning in Tao et al. (2019). Specifically, let be a proper convex, lower-semicontinuous function; then its convex conjugate function is defined as , where denotes the domain of the function (Hiriart-Urruty and Lemaréchal, 2012); is also known as the Fenchel conjugate of , which is again convex and lower-semicontinuous. The Fenchel conjugate pair are dual to each other, in the sense that , , . As a concrete example, gives such a pair, which we exploit for our problem. Based on the Fenchel conjugacy, we can derive the mini-max training rule for the entropy-balancing objective in (6), for :

 minλt{maxut{ut−exp(ut)∑Ti=texp(⟨λt,Φi⟩)}−⟨λt,Φi⟩}. (10)

### 3.3 Theoretical Properties

In this section we establish the nice theoretical properties of the proposed DRRL framework. Limited by space, detailed technical derivations on Theorem 1, 5 and 6 are deferred to the SM.

Our first theorem shows that, the entropy of the EB weights as defined in (5) asymptotically converges to a scaled -Jensen-Shannon divergence (JSD) of the representation distribution between the treatment groups.

###### Theorem 1 (EB entropy as JSD).

The Shannon entropy of the EB weights defined in (4) converges in probability to the following -Jensen-Shannon divergence between the marginal representation distributions of the respective treatment groups:

 limn→∞ H\tiny{EB}n(Φ)≜∑iw\tiny{EB}i(Φ)log(w\tiny{EB}i(Φ)) (11) \lx@stackrelp⟶ c′{KL(p1Φ(x)||pΦ(x))+KL% (p0Φ(x)||pΦ(x))}+c′′ = c′\textupJSDα(p1Φ,p0Φ)+c′′

where are non-zero constants, is representation distribution in group (), is the marginal density of the representations, is the proportion of treated units and is the Kullback–Leibler (KL) divergence.

An important insight from Theorem 1 is that entropy of EB weights is an endogenous measure of representation imbalance, validating the insight in Sec 3.1 theoretically. This theorem bridges the classical weighting strategies with the modern representation learning perspectives for causal inference, that representation learning and propensity score modeling are inherently connected and does not need to be modeled separately.

###### Theorem 2 (Double Robustness).

Under the Assumption 1 and 2, the entropy balancing estimator is consistent for if either the true outcome models or the true propensity score model is linear in representation .

Theorem 5 establishes the DR property of the EB estimator . Note that the double robustness property will not be compromised if we add regularization term in (5). Double robust setups require modeling both the outcome function and propensity score; in our formulation, the former is explicitly specified in the first component in (5), while the latter is implicitly specified via the EB constraints in (4). By M-estimation theory (Stefanski and Boos, 2002), we can show that in (6) converges to the maximum likelihood estimate of a logistic propensity score model, which is equivalent to the solution of the following optimization problem,

 minλ N∑i=1log(1+exp(−(2Ti−1)m∑j=1λjΦj(Xi))). (12)

Jointly these two components constructs the double robustness property of estimator . The linearity restriction on is essential for double robustness, and may appear to be tight, but because the representations can be complex functions such as multi-layer neural networks (as in our implementation), both the outcome and propensity score models are flexible.

The third theorem shows that the objective function in (5) is an upper bound of the loss for the ITE. Before proceeding to the third theorem, we define a few estimation loss functions: Let be the loss function on predicting the outcome, denote the expected loss for a specific covariates-treatment pair given outcome function and representation,

 lf,Φ(x,t)=∫yL(Y(t),ft(Φx))P(Y(t)|x)dY(t). (13)

Suppose the covariates follow and we denote the distributions in treated and control group with . For a given and , the expected factual loss over the distributions in the treated and control groups are,

 εt\textupF(f,Φ) = ∫Xlf,ϕ(x,t)pt(x)dx,t=0,1, (14)

For the ITE estimation, we define the expected Precision in Estimation of Heterogeneous Effect (PEHE) (Hill, 2011),

 ε\textupPEHE(f,Φ)=∫X(f1(Φ(x))−f0(Φ(x))−τ(x))2p(x)dx. (15)

Assessing from the observational data is infeasible, as the countefactual labels are absent, but we can calculate the factual loss . The next theorem illustrates we can bound with and the -JS divergence of between the treatment and control groups.

###### Theorem 3.

Suppose is a compact space and is a continuous and invertible function. For a given , the expected loss for estimating the ITE, , is bounded by the sum of the prediction loss on the factual distribution and the -JS divergence of the distribution of between the treatment and control groups, up to some constants:

 ε\tiny{PEHE}(f,Φ)≤2⋅(ε0\textupF (f,Φ)+ε1\textupF(f,Φ)+ (16) CΦ,α⋅\textupJSDα(p1Φ,p0Φ)−2σ2Y),

where is a constant depending on the representation and , and is the expected conditional variance of .

The third theorem shows that the objective function in (5) is an upper bound to the loss for the ITE estimation, which cannot be estimated based on the observed data. This theorem justifies the use of entropy as the distance metric in bounding the ITE prediction error.

## 4 Experiments

We evaluate the proposed DRRL on the fully synthetic or semi-synthetic benchmark datasets. The experiment validates the use of DRRL and reveals several crucial properties of the representation learning for counterfactual prediction, such as the trade-off between balance and prediction power. The experimental details can be found in SM and the code is available from https://github.com/zengshx777/Double-Rouble-Representation-Learning/.

### 4.1 Experimental Setups

Hyperparameter tuning, architecture  As we only know one of the potential outcomes for each unit, we cannot perform hyperparameter selection on the validation data to minimize the loss. We tackle this problem in the same manner as Shalit et al. (2017). Specifically, we use the one-nearest-neighbor matching method (Abadie and Imbens, 2006) to estimate the ITE for each unit, which serves as the ground truth to approximate the prediction loss. We use fully-connected multi-layer perceptrons (MLP) with ReLU activations as the flexible learner. The hyperparameters to be selected in the algorithm include the architecture of the network (number of representation layer, number of nodes in layer), the importance of imbalance measure , batch size in each learning step. We provide detailed hyperparameter selection steps in SM.

Datasets  To explore the performance of the proposed method extensively, we select the following three datasets: (i) IHDP (Hill, 2011; Shalit et al., 2017): a semi-synthetic benchmark dataset with known ground-truth. The train/validation/test splits is 63/27/10 for 1000 realizations;(ii) JOBS (LaLonde, 1986): a real-world benchmark dataset with a randomized study and an observational study. The outcome for the Jobs dataset is binary, so we add a sigmoid function after the final layer to produce a probability prediction and use the cross-entropy loss in (5); (iii) high-dimensional dataset, HDD: a fully-synthetic dataset with high-dimensional covariates and varying levels of confoundings. We defer its generating mechanism to Sec 4.4.

Evaluation metrics  To measure the performance of different counterfactual predictions algorithms, we consider the following evaluation metrics for both average causal estimands (including ATE and ATT) and ITE: (i) the absolute bias for ATE or ATT predictions ; (ii) the prediction loss for ITE, ; (iii) policy risk, quantifies the effectiveness of a policy depending on the outcome function , . It measures the risk of the policy , which assigns treatment if and remains as control otherwise.

Baselines  We compare DRRL with the following state-of-the-art methods: ordinary least squares (OLS) with interactions, k-nearest neighbor (k-NN), Bayesian Additive Regression Trees (BART) (Hill, 2011), Causal Random Forests (Causal RF) (Wager and Athey, 2018), Counterfactual Regression with Wasserstein distance (CFR-WASS) or Maximum Mean Discrepancy (CFR-MMD) and their variant without balance regularization, the Treatment-Agnostic Representation Network(TARNet) (Shalit et al., 2017). We also evaluate the models that separate the weighting and representation learning procedure. Specifically, we replace the distance metrics in (5) with other metrics like MMD or WASS, and perform entropy balancing on the learned representations (EB-MMD or EB-WASS).

### 4.2 Learned Balanced Representations

We first examine how DRRL extracts balanced representations to support counterfactual predictions. In Figure 3, we select one imbalanced case from IHDP dataset and perform t-SNE (t-Distributed Stochastic Neighbor Embedding) (Maaten and Hinton, 2008) to visualize the distribution of the original feature space and the representations learned from DRRL algorithm when . While the original covariates are imbalanced, the learned representations or the transformed features have more similarity in distributions across two arms. Especially, a larger value leads the algorithm to emphasize on the balance of representations and gives rise to a nearly identical representations across two groups. However, an overly large may deteriorate the performance, because the balance is improved at the cost of predictive power.

To see how the importance of balance constraint affects the prediction performance, we plot the and in IHDP dataset against the hyperparameter (on log scale) in Figure 4, for CFR-WASS, CFR-MMD and DRRL, which involve tuning in the algorithms. We obtain the lowest or at the moderate level of balance for the representations. This pattern makes sense as the perfect balance might compromise the prediction power of representations, while the poor balance cannot adjust for the confoundings sufficiently. Also, the DRRL is less sensitive to the choice compared with CFR-WASS and CFR-MMD, with as the prediction loss has a smaller variation for different .

### 4.3 Performance on Semi-synthetic or Real-world Dataset

ATE estimation  We can see a significant gain in ATE estimation of DRRL over most state-of-the-art algorithms in the IHDP data, as in Table 1; this is expected, as DRRL is designed to improve the inference of average estimands. The advantage remains even if we shift to binary outcome and the ATT estimand in the JOBS data, as in Table 1. Moreover, compared with EB-MMD or EB-WASS which separates out the weights learning and representation learning, the proposed DRRL also achieve a lower bias in estimating ATE. This demonstrates the benefits of learning the weights and representation jointly instead of separating them out.

ITE estimation  The DRRL has a better performance compared with the state-of-the-art methods like CFR-MMD on the IHDP dataset for ITE prediction. For the binary outcome in the JOBS data, the DRRL gives a better over most methods except for the Causal RF when setting threshold . In Figure 5, we plot the policy risk as a function of the inclusion rate , through varying the threshold value . The straight dashed line is the random policy assigning treatment with probability , serving as a baseline for the performance. The vertical line shows the when . The DRRL are persistently gives a lower as we vary the inclusion rate of the policy

### 4.4 High-dimensional Performance and Double Robustness

We generate HDD datasets from the following model:

 Xi∼N(0,σ2[(1−ρ)Ip+ρ1p1Tp]) ||β0||0=||βτ||0=||γ||0=p∗,\textupsupp(β0)=\textupsupp(βτ) P(Ti=1)=\textupsigmoid(Xiγ) Yi(t)=Xiβ0+TXiβτ+εi,εi∼N(0,σ2e),t=0,1,

where are the parameters for outcome and treatment assignment model. We consider sparse cases where the number of nonzero elements in is much smaller than the total feature size . The support for is the same, for simplicity.

Three scenarios are considered, by varying the overlapping support of and : (i) scenario A (high confounding), the set of the variables determining the outcome and treatment assignment are identical, ; (ii) scenario B (moderate confounding), these two sets have 50% overlapping, ; scenario C (low confounding), these two sets do not overlap, . We set and generate the data of size each time, with 54/21/25 train/validation/test splits. We report the and in Table 11. The DRRL obtains the lowest error in estimating ATE, except for the Causal RF and BART, and achieve comparable performance in predicting ITE in all three scenarios.

This experiment also demonstrates the superiority of double robustness. The advantage of DRRL increases as the overlap between the predictors in the outcome function and those in the propensity score diminishes (from Scenario A to C), especially for ATE estimation. This illustrates the benefit of double robustness: when the representation learning fails to capture the predictive features of the outcomes, entropy balancing offers a second chance of correction via sample reweighting.

## 5 Conclusions

We propose a novel framework to learn double-robust representations for counterfactual prediction with the high-dimensional data. By incorporating an entropy balancing stage in the representation learning process and quantifying the balance of the representations between groups with the entropy of the resulting weights, we provide robust and efficient causal estimates. Important directions for future research include exploring other balancing weights methods (Deville and Särndal, 1992; Kallus, 2019; Zubizarreta, 2015) and generalizing into the learning problem with panel data (Abadie et al., 2010), sequential treatments (Robins et al., 2000), and survival outcomes (Cox, 2018).

\aistatstitle

Supplementary Materials for “Double Robust Representation Learning for Counterfactual Prediction”

## 6 Theorem Proofs

In this section, we present the detailed proofs for the theoretical properties in Sec 3.3 in the main text,

###### Lemma 1.

The optimal value of dual variable converges to maximum likelihood estimator in (12) in probability.

Proof: The following proposition is based on a given representation, therefore we treat as a fixed function. With Karush-Kuhn-Tucker (KKT) conditions, we derive the first order optimiality condition of (11):

 n∑i=1(1−Ti)e∑mj=1λmΦj(Xi)(Φj(Xj)−¯Φj)=0 (17) n∑i=1Tie−∑mj=1λmΦj(Xi)(Φj(Xj)−¯Φj)=0,

for . We rewrite the above conditions as estimating equations, let . Then (17) is the same as:

 n∑iaj(Xi,Ti,r,λ)=0,j=1,2,⋯,m (18) n∑ibj(Xi,Ti,r,λ)=0,j=1,2,⋯,m.

We can verify that and is the solution to the population version of (18). First, set and taking the conditional expectation of given is:

 E(aj(X,T,λ,r)|X)= (19) (1−e(X)) e∑mj=1λjΦj(X)(Φj(X)−E(Φj(X)), E(bj(X,T,λ,r)|X)= e(X) e∑mj=1−λjΦj(X)(Φj(X)−E(Φj(X)).

Suppose we are fitting the propensity score model with the log likelihood in (13), let be the MLE solution in (13) and plug into the , we have:

 E(aj(X,T,λ,r)|X)=e∑mj=1λjΦj(x)e∑mj=1λ∗jΦj(X)(Φj(X)−E(Φj(X)), (20) E(bj(X,T,λ,r)|X)=e∑mj=1−λjΦj(x)e∑mj=1−λ∗jΦj(x)(Φj(X)−E(Φj(X)).

The only way to make the follow quantify to be zero is to set . So far we have verified is the solution to the population version of , whose sample version is the KKT condition. Therefore, according to the M-estimation theory, we show that to , which is the MLE solution for (13).

###### Theorem 4.

The Shannon entropy of the EB weights defined in (4) converges in probability to the following -Jensen-Shannon divergence between the marginal representation distributions of the respective treatment groups:

 limn→∞H\tiny{EB}n(Φ)≜∑iw\tiny{EB}i(Φ)log(w\tiny{EB}i(Φ)) \lx@stackrelp⟶ (21) c′{KL(p1Φ(x)||pΦ(x))+ KL(p0Φ(x)||pΦ(x))}+c′′=c′\textupJSDα(p1Φ,p0Φ)+c′′

where are non-zero constants, is representation distribution in group (), is the marginal density of the representations, is the proportion of treated units and is the Kullback–Leibler (KL) divergence.

Proof: According to Lemma 1, we have . Therefore, with , for and (up to a normalized constant) for . Also, we have,

 1e(xi)=1p(Ti=1|Φ(Xi))=p(Φ(Xi))p(Φ(Xi)|Ti=1)p(Ti=1), (22) 11−e(xi)=1p(Ti=0|Φ(Xi))=p(Φ(Xi))p(Φ(Xi)|Ti=0)p(Ti=0).

Also, we can derive,

 N1w\tiny{EB}i→1/e(xi)∑Ti=11e(xi)/N1=1e(xi)/E(1/e(xi)|Ti=1). (23)

Notice that,

 E(1/e(xi)|Ti=1) =∫X1e(x)p(x|T=1)dx (24) =∫Xp(Φ(Xi))p(T=1)dx=p(Ti=1).

Therefore, we have,

 N1w\tiny{EB}i→p(Φ(Xi))p(Φ(Xi)|T