Optimal Estimation of Generalized Average Treatment Effects using Kernel Optimal Matching

# Optimal Estimation of Generalized Average Treatment Effects using Kernel Optimal Matching

## Abstract

In causal inference, a variety of causal effect estimands have been studied, including the sample, uncensored, target, conditional, optimal subpopulation, and optimal weighted average treatment effects. Ad-hoc methods have been developed for each estimand based on inverse probability weighting (IPW) and on outcome regression modeling, but these may be sensitive to model misspecification, practical violations of positivity, or both. The contribution of this paper is twofold. First, we formulate the generalized average treatment effect (GATE) to unify these causal estimands as well as their IPW estimates. Second, we develop a method based on Kernel Optimal Matching (KOM) to optimally estimate GATE and to find the GATE most easily estimable by KOM, which we term the Kernel Optimal Weighted Average Treatment Effect. KOM provides uniform control on the conditional mean squared error of a weighted estimator over a class of models while simultaneously controlling for precision. We study its theoretical properties and evaluate its comparative performance in a simulation study. We illustrate the use of KOM for GATE estimation in two case studies: comparing spine surgical interventions and studying the effect of peer support on people living with HIV.

Keywords: causal inference, optimization, covariate balance, average treatment effect, misspecification, positivity

## 1 Introduction

One of the primary goals of causal inference is to estimate the average causal effect of a treatment or intervention on an outcome under study. A common causal estimand of interest is the Sample Average Treatment Effect (SATE), which is the average effect of a treatment on an outcome among all individuals in the sample. Often, however, we may be interested in other averages. For example, Stuart (2010); Buchanan et al. (2018) consider the Target Average Treatment Effect (TATE) on a population or sample distinct from the study sample and propose the use of inverse probability of sampling weights. Similarly, if outcome data are only available for some units, Cain and Cole (2009); Robins and Finkelstein (2000) propose the use of inverse probability of censoring weights to generalize the results to the whole sample. Other estimands of interest focus on particular subgroups of the sample such as the Sample Average Treatment Effect on the Treated (SATT), the Conditional Average Treatment Effect (CATE) (Crump et al., 2008; Cai et al., 2010), and the Complete-Case SATE (CCSATE) (Seaman and White, 2013). In particular, Crump et al. (2009) propose the Optimal SATE (OSATE) and, as in Li et al. (2018), the Optimal Weighted Average Treatment Effect (OWATE) as the average treatment effect restricted by or weighted by overlap in covariate distributions in order to make the estimation easier.

Ad-hoc methods, such as those bases on Inverse Probability Weighting (IPW) (Horvitz and Thompson, 1952; Robins et al., 1994; Robins, 2000; Lunceford and Davidian, 2004) and outcome regression modeling, have been widely used to estimate these causal estimands. However, due to their sensitivity to model misspecification these methods may lead to biased estimates. In addition, IPW-based methods depend heavily on the positivity assumption, which practical violations of lead to extreme weights and high variance (Robins et al., 1995; Scharfstein et al., 1999; Robins et al., 2007; Kang and Schafer, 2007). In Section 7.1 in the Supplementary Material, we thoroughly discuss these issues, some of the related work to overcome them and alternative methodologies to estimate the aforementioned causal estimands.

In this paper, we start by presenting a general causal estimand, the Generalized Average Treatment Effect (GATE), which unifies all the causal estimands previously presented and motivates the formulation of new ones. We then present and apply Kernel Optimal Matching (KOM) (Kallus, 2016; Kallus et al., 2018) to optimally estimate GATE. KOM provides weights that simultaneously mitigates the possible effect of model misspecification and control for possible practical positivity violations (Kallus et al., 2018). We do that by minimizing the worst-case Conditional Mean Squared Error (CMSE) of the weighted estimator in estimating GATE over the space of weights. The proposed methodology has several attractive characteristics. First, KOM can be used to optimally estimate a variety of well-known causal estimands, as well as to find new ones such as the Kernel Optimal Weighted Average Treatment Effect (KOWATE). In Section 3.3 we show that various causal estimands can be easily estimated by simply modifying the optimization problem formulation we give for KOM, which is fed to an off-the-shelf solver. Second, by minimizing the worst-case CMSE of the weighted estimator, it leads to better accuracy, precision, and total error. We show this in our simulation study in Section 4. Third, by optimally balancing covariates, KOM mitigates the effect of possible model misspecification. In Section 4, we show that both absolute bias and root MSE (RMSE) of the weighted estimator that uses weights obtained by using KOM are consistently lower across levels of misspecification. Fourth, by penalizing the weights, KOM controls precision. We show this in Section 4. Fifth, the weights are obtained by using off-the-shelf solvers for convex-quadratic optimization. Finally, KOM is implemented in an open source R package available at https://github.com/michelesantacatterina/KOM.

In the next Section we introduce notation, specify assumptions and define GATE, the estimand of interest and its weighted estimator. We then introduce KOM for GATE, describe its theoretical properties and present some practical guidelines on its use (Section 3). In Section 4, we present the results of a simulation study aimed at comparing the performance of KOM with IPW, overlap weights, truncated weights and outcome regression modeling with respect to absolute bias and RMSE across levels of practical positivity violations and levels of misspecification. In Section 5, we apply KOM on the evaluation of the effect of spine surgical interventions on the Oswestry Disability Index (ODI) among patients with lumbar stenosis or lumbar spondylolisthesis, and on the evaluation of peer-support on CD4 cell count in two target populations of healthier patients, using real-world data. We conclude with some remarks in Section 6.

## 2 Generalized Average Treatment Effect

Suppose we have a simple random sample with replacement of size from a population. Under the potential outcome framework (Imbens and Rubin, 2015), for each unit , we let be the potential outcome of treatment . We let be the observed confounders. We consider three exclusive and exhaustive subsets of the units: (i) units treated with , for whom we observe ; (ii) units treated with , for whom we observe and (iii) untreated units, for whom we do not observe anything but confounders. We let , be the units in the study sample. We set , the indicator of being treated with , , the indicator of being in the study sample, and , the indicator of being outside the study. Let , for . We define the Generalized Average Treatment Effect (GATE) as the weighted average difference between the conditional expectation of the potential outcome of those treated and those untreated conditioned on . Formally, we define GATE as,

 τV=1nn∑i=1Vi(g1(Xi)−g0(Xi)), (2.1)

where is chosen to target the estimand of interest and may depend on (see Assumption 2.5 below). For instance, when , we target the study SATE, and when , we target the TATE. Moreover, by setting equal to the overlap weights (Li et al., 2018) and the truncated weights (Crump et al., 2009), we target the OWATE and OSATE, respectively. We provide examples of causal estimands in the first two columns of Table 1. To estimate GATE in eq. (2.1) we propose to use the following weighted estimator,

 ^τW=1n∑i∈TWiYi−∑i∈CWiYi=1n∑i∈SWi(−1)(Ti+1)Yi. (2.2)

For instance, the usual IPW estimator for SATE is given by plugging in , where , is the propensity score. In the next Lemma, we provide a general formulation of the IPW weights that make unbiased for GATE for any . To do so, we impose the assumption of consistency, non-interference, ignorable treatment assignment and ignorable sample assignment (Imbens and Rubin, 2015). Consistency, states that the observed outcome corresponds to the potential outcome of the treatment applied to that unit, and non-interference reflects the fact that units potential outcomes are not effected by how the treatment or intervention has been allocated. Consistency together with non-interference are also known as SUTVA (Imbens and Rubin, 2015). Ignorable treatment assignment, (also called unconfoundeness, no unmeasured confounding, or exchangeability), states that the potential outcome, , is independent to the treatment assignment mechanism given covariates. Similarly, ignorable sample assignment states that the potential outcome, , is independent to the sampling assignment mechanism, i.e.,  being part of the study sample, given covariates. We formalize these assumptions as follow,

###### Assumption 2.3 (Boundedness of ϕ(Xi)).

The propensity score is bounded away from 0,1.

###### Assumption 2.4 (Boundedness of ψ(Xi)).

The sampling probability is bounded away from 0.

Letting , we additionally assume,

###### Assumption 2.5 (Honest weights).

and are independent of all else given .

In the next Lemma we define the genalized IPW weights, , and show that , the weighted estimator in eq. (2.2) weighted by , is unbiased for GATE.

###### Lemma 2.1.

Define

 W\opIPWi=1ψ(Xi)\prnsTiϕ(Xi)+1−Ti(1−ϕ(Xi)) (ϕ(Xi)Vi(X1:n,(T−i,1),S1:n) +(1−ϕ(Xi))Vi(X1:n,(T−i,0),S1:n)),

where is equal to in components except the -th where it is equal to . Then under consistency, non-interference and assumptions 2.1-2.5,

 \mathbbmE[^τW\opIPW−τV|H1:n]=0.

This is a well-known results for SATE (Lunceford and Davidian, 2004) and TATE (Stuart et al., 2011; Buchanan et al., 2018). If we assume appropriate bounds on the norms of and the variances of , it is easy to additionally see that has diminishing variance and is therefore also consistent. We show examples of inverse probability weights, , for the IPW estimator of GATE in the last column of Table 1.

In the next Section, we introduce Kernel Optimal Matching for estimating GATE, which, instead of plugging estimated propensities into the weighted estimator, provides weights that minimizes the CMSE of for GATE. By doing so, the proposed methodology optimally minimizes the bias with respect to GATE while simultaneously controlling precision. We further consider simultaneously choosing to minimize the worst-case CMSE to obtain the Kernel Optimal Weighted Average Treatment Effect (KOWATE).

## 3 Kernel Optimal Matching for estimating GATE

In this Section, we present Kernel Optimal Matching for estimating GATE. We start by decomposing the CMSE of the weighted estimator, , in eq. (2.2). We show that this CMSE can be decomposed in terms of (a) the discrepancies between the conditional expectation of the potential outcome among the treated and the control, and (b) a variance term (Section 3.1). Since the CMSE depends on some unknown functions (conditional expectations), in Section 3.2, we guard against all possible realizations of the unknown functions by considering the worst-case CMSE of . In Section 3.3, we embed these in reproducing kernel Hilbert spaces (RKHS) and use quadratic programming to minimize the corresponding worst-case CMSE and find optimal weights.

### 3.1 Decomposing the CMSE of ^τW

We now decompose the CMSE of , the weighted estimator for GATE. Recall that, in Section 2, we defined . Further define and , for , and . We then define, for each function , the -moment discrepancy between the weighted -treated study sample and the -weighted total sample,

 Bt(W1:n,V1:n,f)=1nn∑i=1(Si\indicTi=tWi−Vi)f(Xi),

where is equal to 1 if and 0 otherwise. In the following theorem, we show that the CMSE of can be decomposed into the squared such discrepancies in the conditional expectations of the potential outcomes.

###### Theorem 3.1.

Under consistency, non-interference and assumptions 2.1-2.5,

 \mathbbmE[^τW−τV|H1:n] =B1(W1:n,V1:n,g1)−B0(W1:n,V1:n,g0) \mathbbmE[\prns^τW−τV2|H1:n] =(B1(W1:n,V1:n,g1)−B0(W1:n,V1:n,g0))2+1n2n∑i=1SiW2iσ2i. (3.1)

In the next Section, we show how to find weights that minimize eq. (3.1). The main challenge in this task is that the functions , on which this quantity depends, are unknown.

### 3.2 Worst-case CMSE

To overcome the issue that we do not know the -functions which the CMSE of depends, we will guard against any possible realizations of the unknown functions. Specifically, since the CMSE of scales linearly with and , we consider its magnitude with respect to that of and . We therefore need to define a magnitude. We choose the following,

 \fmagdg=√\fmagdg020+\fmagdg121,

where are some extended seminorm on functions from the space of confounders to the space of outcomes. We discuss a specific choice of such extended seminorms in Section 3.3. Given this magnitude, we can define the worst-case squared bias as follows:

 B(W1:n,V1:n) =supgB1(W1:n,V1:n,g1)−B0(W1:n,V1:n,g0)∥g∥ (3.2) =√Δ21(W1:n,V1:n)+Δ20(W1:n,V1:n), (3.3)

where

 Δt(W1:n,V1:n)=supgtBt(W1:n,V1:n,gt)∥gt∥t=sup∥gt∥t≤1Bt(W1:n,V1:n,gt),

is the worst-case discrepancy in the -moment between the weighted -treated group and the -weighted sample over all functions in the unit ball of . In particular, given a positive semidefinite (PSD) kernel , if we choose the corresponding RKHS (a Hilbert space of functions with continuous evaluations, which is associated with the reproducing kernel ) to specify the norm, we can show that the worst-case discrepancy can be expressed as a convex-quadratic function in .

###### Theorem 3.2.

Define the matrix as and note that it is positive semidefinite by definition. Then,

 Δt(W1:n,V1:n)2 =1n2\prnsW1:nTISItKtISItW1:n−2V1:nTKtISItW1:n+V1:nTKtV1:n,

where is the diagonal matrix with in its diagonal entry, and is the diagonal matrix with in its diagonal entry.

Based on Theorem 3.2, letting the RKHS given by the kernel specify the norm, both the worst-case bias and the worst-case CMSE of are convex-quadratic functions in . Specifically, we define the worst-case CMSE as

 C(W1:n,V1:n,λ) =sup∥g∥2≤1\mathbbmE[\prns^τW% −τV2∣H1:n] =Δ21(W1:n,V1:n)+Δ20(W1:n,V1:n)+λn2∥ISW1:n∥22, (3.4)

where, for simplicity, we use equal variance weights, . In the next Section we show how to minimize the worst-case CMSE of in estimating GATE, , by using off-the-shelf solvers for quadratic optimization.

### 3.3 Minimizing the worst-case CMSE

In the previous two Sections, we showed that the CMSE of in estimating GATE can be decomposed in squared bias plus its variance. We also showed that, since the bias depends on unknown conditional expectations, by guarding against any possible realizations of these unknown functions, embedded in an RKHS given by the kernel , the worst-case CMSE of can be expressed as a convex-quadratic function in . Here, we use quadratic programming to obtain the weights that minimizes the worst-case CMSE of . When interested in estimating, for example, SATE, and TATE, the set of weights is fixed, i.e., all are given, known scalars. We show the corresponding convex-quadratic optimization problem when the set of weights is fixed in the next Section. In addition, given the flexibility of the proposed methodology, we can also let be variable and let it be chosen by the solver in such a way that the worst-case CMSE of is minimized. We show this in Section 3.3.2.

#### Fixed V1:n

Let . When is fixed, we propose to use weights obtained by solving the following optimization problem

 minW1:n∈W\prnsΔ21(W1:n,V1:n)+Δ20(W1:n,V1:n)+λn2∥ISW1:n∥22 , (3.5)

where is interpreted as a penalization parameter that controls the trade-off between bias and variance. When equals zero, we obtain weights that yield minimal bias. When , we obtain uniform weights. (If we have estimates of heteroskedastic conditional variance, we can also easily use unit-specific weights.) We discuss how to tune this hyperparameter in Section 3.6. As shown in Theorem 3.2, using an RKHS norm, we can show that the optimization problem (3.5) reduces to the following linearly-constrained convex-quadratic optimization problem:

 minW1:n≥0,WT1:nISI1en=n,WT1:nISI0en=n1n2\prnsW1:nT\prns∑t∈0,1ISItKtItIS+λISW1:n−2V1:nT(K1ISI1+K0ISI0)W1:n. (3.6)

#### Variable V1:n

We can also let be variable. Instead of being given set values, we are given a feasible set . We assume that and that is a polytope (expressed by linear constraints). To simultaneously find the GATE, subject to , that is most easily estimable and the weights to estimate this GATE, we propose to solve the following optimization problem

 minV1:n∈VW1:n∈W\prnsΔ21(W1:n,V1:n)+Δ20(W1:n,V1:n)+λn2∥ISW1:n∥22 . (3.7)

When is a singleton, this optimization problem is the same as that in eq. (3.5). Again, we can show that the optimization problem (3.7) reduces to a linearly-constrained convex-quadratic optimization problem:

 minV1:n∈V,W1:n≥0,W1:nTISI1en=n,W1:nTI0en=n(W1:n,V1:n)T⎛⎝1n2∑t∈{0,1}[ISItKtItIS+λIS−ISItKt−KtItISKt]⎞⎠(W1:n,V1:n). (3.8)

The solution to the optimization problem (3.8) provides both weights that define a GATE of interest and the weights to estimate it. The weights are chosen in order to allow for minimal CMSE. That is, it focuses on the subpopulation where the average effect on which is easiest to estimate by KOM. We discuss this further in Section 3.4.

When we use , we term the resulting GATE estimand the Kernel Optimal Weighted Average Treatment Effect (KOWATE). We can also construct other causal estimands by choosing different . For instance, we may restrict to an unweighted subsample as in the OSATE of Crump et al. (2009) by choosing , where is a chosen subsample size. We refer to this as Kernel Optimal SATE (KOSATE). Table 2 summarizes these causal estimands. It is worth noticing that, other causal estimands can be easily constructed by plugging the set of overlap or truncated weights of Crump et al. (2009); Li et al. (2018) as fixed in the optimization problem. In the next Section we provide more insight on the set of weights chosen.

### 3.4 What population is KOWATE choosing?

In the previous Sections, we have seen that by changing the set of weights , we change the target causal estimand considered and consequently the target population under study. In addition, we have seen that this set of weights can be optimally obtained by letting be variable and be chosen by the optimization problem (3.7). The idea is to pick the subpopulation that is easiest to estimate by KOM. This subpopulation will emphasize areas with better overlap, where overlap is characterized in terms of worst-case moment discrepancies as defined by the kernels, rather than in terms of (unknown) propensity scores.

In this Section, we illustrate this in a simple simulated example described in Figure 1. Specifically, Figure 1 shows scatterplots between two confounders, one on the vertical axis and one on the horizontal axis, weighted by the weights , obtained when targeting SATE (first column of Figure 1, KOSATE (second top panel), KOWATE (third top panel), OSATE (second bottom panel) and OWATE (third bottom panel). The histograms on the top and right axes represent the distributions of the confounders across treated (dark-grey) and control (light-grey). The data was generated to exhibit practical positivity violations and we provide more details on the data generation in the simulation section (Section 4).

When targeting SATE, we consider a fixed that is equal to 1 for all units in the sample. On the other hand, all of KOSATE, OSATE, KOWATE, OWATE focus on the area of confounders with high overlap. In practice we find this translate to better performance, as seen in Table 3 in our case study. KOSATE and OSATE do this while restricting to either including or excluding samples, as can be seen by the two point sizes in Figure 1. KOWATE and OWATE consider a range of weights, as can be seen by the variable point sizes. Visually, the weights that define the GATE for KOSATE and OSATE are similar as they both focus on the area of overlap; the same for KOWATE and OWATE. The differences are that KOWATE and KOSATE guard against possible misspecification of propensity models and that they target the CMSE of the estimator itself, rather than the asymptotic variance, and therefore they account for the desired precision of the KOM estimate that will be applied. For example, should we desire more precision, both KOWATE and the KOM weights to estimate it would be more dispersed. We provide a deeper study of this in Section 8, where we consider the effects of misspecification as well as how the weights differ as well between the methods.

### 3.5 Consistency

In this Section we study the consistency of the proposed weighted estimator with respect to the true causal estimand GATE (for both fixed and variables).

###### Theorem 3.3.

For given, let be the set of optimal weights obtained by solving optimization problem (3.5). Suppose Assumptions 2.12.4 hold. Assume also, , , , and . Then, .

The above theorem shows that for any GATE estimand, under appropriate assumptions, the KOM estimate is root- consistent.

The assumption about the kernel can be automatically satisifed by using a bounded kernel, such as the Gaussian or Matern kernels. The assumption of requires no model misspecification. We can relax this assumption if we use a universal kernel, such as Gaussian, but the rate may deteriorate from to as we need to include an approximation term. For brevity, we omit the details.

To apply Theorem 3.3 to the case of variable , note that the solution of problem (3.7) is a function of and is therefore honest (satisfies Assumption 2.5) and that, given these , the solution to problem (3.5) is the exactly same as that in problem (3.7), as it can just be viewed as a nested minimization problem (once in and once in ). So to apply Theorem 3.3 to the case of variable weights, we need only guarantee that . We can either take that as an assumption, or we can enforce it in the construction of by including a bound. In practice, we find that this is not necessary.

### 3.6 Kernel choice, automatic selection of the its hyperparameters, and other practical guidelines

Solutions to the optimization problems (3.5) and (3.7) depend on the choice of the kernel and its hyperparameters. In this Section we provide some practical suggestions on how to chose them. Generally, we suggest the use of a polynomial Mahalanobis kernel:

 Kt(x,x′)=γt(1+θt(x−^μn)T^Σ−1n(x′−^μn))d, (3.9)

where is the sample mean, is the sample covariance, is the parameter that controls the degree of the polynomial, is a parameter that controls the importance of higher orders degrees, and controls the overall scale of the kernel. To avoid unit dependence and confounders with high variance dominating those with smaller ones, by proposing the aforementioned kernel, we are suggesting to normalize confounders to have mean 0 and variance 1. Based on the results of the next Section and previous simulation studies (Kallus et al., 2018), we suggest using . Alternatively, we may additionally replace with a matrix parameter to be tuned. Alternative choices for the kernel include Gaussian or Matern. These have the benefit of being universal approximators, but practically we find a polynomial kernel is sufficient.

To tune the kernel’s hyperparameters, and , we propose to use marginal likelihood, a well-known model selection criteria for Gaussian processes (Rasmussen and Nickisch, 2010). To do so, we specify two Gaussian process priors, and with covariances specified by the kernels and . We suppose that the the potential outcome was observed from with Gaussian noise of variance . We then maximize the marginal likelihood (marginalizing over the Gaussian process) of seeing the data with respect to the hyperparameters, , , and .

In addition, optimization problems (3.5) and (3.7) depend on the choice of the hyperparameter , which controls the trade-off between bias and variance. In order to target the total error (i.e., the CMSE) we set for all , yields to the worst-case CMSE. Through this paper, we refer to as optimal , where , and . Consequently, when using optimal , we let the CMSE of the weighted estimator be optimally minimized. When equals zero, we instead minimize the bias. As a general practical advice, we suggest using optimal obtained by using GPML as aforementioned.

Several software packages implementing marginal likliehood can be used to tune hyperparameters and a variety of solvers can be used to solve linearly-constrained convex-quadratic optimization problems. We suggest using the GaussianProcessRegressor package from scikit-learn (Pedregosa et al., 2011) for tuning the hyperparameters and Gurobi (Gurobi Optimization, 2014) for solving quadratic optimization problems. In practice, Gurobi sometimes fails due to the quadratic objective being numerically non-PSD (despite being PSD in theory). We have found this occurs sometimes when using a high-degree kernel and variable . We found that this is easily fixed without materially changing the results by adding to the quadratic objective matrix to inflate its spectrum slightly.

We suggest to use Wald confidence intervals together with the robust “sandwich” standard error estimator as previously suggested by other authors (Hernán et al., 2001; Robins, 2000; Freedman, 2006; Imai and Ratkovic, 2014). We provide more details on standard error estimation and coverage in Section 9.2 of the Supplementary Material.

## 4 Simulations

In this Section, we present the results of a simulation study aimed at comparing KOM with IPW, overlap weights, truncated IPW, and outcome regression modeling in estimating GATE with respect of absolute bias and root MSE, across levels of practical positivity violations and across levels of misspecification. In summary, KOM showed a consistently low absolute bias and root mean squared error (RMSE) across all of the considered scenarios.

### 4.1 Setup

We considered a sample size of . We computed the potential outcomes from the following models: , , where . We computed the observed outcome as , where , , , . The true GATE was computed as . We considered for all units in the sample. We consider several , namely, (a) KOWATE; (b) KOSATE where we set equal to the number of units chosen by OSATE, (c) SATE, (d) OWATE, (e) OSATE with . For KOWATE and KOSATE we consider the KOM weights given by problem (3.7). For SATE we consider several estimates: (a) KOM as in the optimization problem (3.5); (b) IPW; and (c) using outcome regression modeling. For OWATE and OSTAE, we use the estimated propensity to define and consider estimating it

we computed the set of overlap weights, and to estimate OSATE we computed the set of truncated weights setting . We used a product of polynomial degree 2 kernels for KOM. We modeled the propensity score, , by using a polynomial-degree-4 logistic regression and we used a polynomial-degree-4 regression model for outcome regression modeling. We evaluated the performance of the proposed method across levels of practical positivity violation and misspecification as described in Section 4.1.1 and Section 4.1.2 respectively. When the solver failed to solve the optimization problem (3.5) or (3.7) when using a product of polynomial degree 2 kernels, we rerun the same optimization problem by considering a KOM polynomial degree 3. If also KOM polynomial degree 3 failed then we considered KOM polynomial degree 2. We estimated the estimand of interest by plugging the set of obtained weights into the weighted estimator . We used scikit-learn (through the R package reticulate) to tune the hyperparametes and the R interface of Gurobi to obtain the set of KOM weights.

#### Estimating GATE across levels of practical positivity violation

To evaluate the performance of the proposed methodology across levels of practical positivity violation, we let vary between and . We considered 10 levels. When we refer to as weak practical positivity violation, when to moderate, and when to strong violation. In our simulation scenario the propensity score, , ranged between 0.37 and 0.63 under weak violation, between 0.07 and 0.92 under moderate violation and between 0.007 and 0.993 under strong violation (average of min/max over simulations under no misspecification).

#### Estimating GATE across levels of misspecification

We also evaluated the performance of the proposed methodology across levels of misspecification. To do so, we generated and and considered a convex combination between the correct variables and the misspecified variables , and . We considered 3 levels, which we refer to as correct specification (Overparametrized - because we use the polynomial models previously described in all scenarios), which we refer to as moderate misspecification, and which we refer to as strong misspecification.

### 4.2 Results

In this Section we discuss the results of our simulation study. In summary, KOM outperformed IPW, overlap, truncated weights and outcome regression modeling with respect of absolute bias and RMSE in estimating GATE across levels of practical positivity violation under both moderate and strong misspecification.

#### Results across levels of practical positivity violations and model misspecification

Kallus et al. (2018) presented KOM for SATE. The Authors showed that KOM outperformed IPW, truncated IPW, propensity score matching, regression adjustment, CBPS and SBW with respect to bias and MSE across most of the considered levels of practical positivity violation and considered scenarios. In addition, the authors showed that KOM for SATE outperformed the other methods especially under strong practical positivity violation. Figure 2 shows the absolute bias (left panels) and RMSE (right panels) of SATE estimated by using KOM (solid-black)(which we refer to as KOM-SATE), KOSATE by using KOM (solid-dark-grey), KOWATE estimated by using KOM (solid-light-grey), SATE estimated by using IPW (long-dashed-black), OSATE estimated by using truncated weights (long-dashed-dark-grey), OWATE estimated by using overlap weights (long-dashed-light-grey), and SATE estimated by using outcome regression modeling (dotted-black)(which we refer to as OM), with optimal (i.e., ). The top panels of Figure 2 show absolute bias and RMSE across levels of practical positivity violations under strong misspecification, while the middle and bottom panels under moderate misspecification and correct specification, respectively. In summary, KOM showed a consistently low absolute bias and RMSE across all the considered scenarios, outperforming the other methods especially under strong practical positivity violation and strong misspecification (top-right panel). KOM matched the performance of the other methods with respect of absolute bias and RMSE under weak levels of practical positivity violations and correct model misspecification. We obtained similar results when (Figure 7 in the Supplementary Material). IPW showed erroneous behaviours, leading to high absolute bias and RMSE under moderate to strong practical positivity violations and moderate to strong model misspecification. Outcome regression modeling resulted in even higher absolute bias and RMSE under moderate and strong misspecification across all levels of practical positivity violations. Results are outside the plot region of the top and middle panels of Figures 2 and 7. We provide additional simulations results in Section 9 in the Supplementary Material.

## 5 Illustrations

In this Section we present two empirical applications of proposed methodology. In the first, we apply KOM in the evaluation of two spine surgical interventions, laminectomy alone versus fusion-plus-laminectomy, on the Oswestry Disability Index (ODI), among patients with lumbar stenosis or lumbar spondylolisthesis. In the second, we apply KOM in the evaluation of peer support on CD4 cell count at 12 months after trial recruitment among patients affected by HIV, in two target populations where patients were healthier compared to those of the trial population.

### 5.1 The effect of fusion-plus-laminectomy on ODI

In this Section we apply KOM in the evaluation of two spine surgical interventions, laminectomy alone versus fusion-plus-laminectomy, on the Oswestry Disability Index (ODI), among patients with lumbar stenosis or lumbar spondylolisthesis. Briefly, lumbar stenosis is caused by the narrowing of the space around the spinal cord in the lumbar spine (Resnick et al., 2014). Lumbar spondylolisthesis is caused by the slippage of one vertebra on another. These pathologies lead to low back and leg pain, ultimately limiting the quality of life of those patients affected by them (Waterman et al., 2012). In case these pathologies are not anymore controlled by medications or physical therapy, surgical interventions may be needed. Typically, patients with lumbar stenosis are usually treated with laminectomy alone while those with lumbar spondylolistheses with fusion-plus-laminectomy (Resnick et al., 2014; Eck et al., 2014; Raad et al., 2018). In addition, laminectomy alone is done to patients with leg pain, while fusion-plus-laminectomy to patients with mechanical back pain (Resnick et al., 2014). This surgical practice leads to a practical positivity violation.

Differently from other medical areas where randomized controlled trials are the gold standard to evaluate interventions, the use of randomized controlled trials to evaluate surgical interventions is rare. This is due to practical and methodological issues (Carey, 1999). Lately, a number of large real-world observational datasets have been collecting information about surgical interventions and outcomes. However, these datasets are purely observational and confounding must be carefully taken into account. Furthermore, the assumption of correct model specification is hardly ever met. To overcome these challenges, in this Section we evaluate the effect of fusion-plus-laminectomy on ODI by estimating SATE, KOWATE, and KOSATE using KOM.

#### Study population

We restrict our study to patients who had the their first spine surgery intervention, i.e., primary surgery. Demographic and clinical information was collected at the time of the patient interview which happened before surgical intervention. The outcome under study, ODI, was collected at 3-month follow-up. The study subset was composed of 313 patients. Two-hundred forty-nine (79%) received laminectomy alone and 64 (21%) fusion-plus-laminectomy. We identified as potential confounders the following variables: biological sex (female vs. male), lumbar stenosis (yes vs. no), lumbar spondylolistheses (yes vs. no), back pain (score from 0 to 10), leg pain (score from 0 to 10), and activity at home (yes vs. no), activity outside home (yes vs. no). As previously described, spine surgical practice may lead to a practical violation of the positivity assumption. For example, in our subset, less then 1 of patients with low-to-moderate leg pain were treated with fusion-plus-laminectomy.

#### Models setup

We estimate SATE by solving optimization problem (3.5) with , and KOWATE and KOSATE by solving optimization problem (3.7) where we set and , respectively. We obtained by summing truncated weights obtained by using a logistic regression model and setting . Once the set of weights were obtained, we plugged them into a weighted ordinary least squares estimator. We used scikit-learn (through the R package reticulate) to tune the hyperparametes and the R interface of Gurobi to obtain the set of KOM weights. We computed robust (sandwich) standard errors in each case.We used the R packages lm for estimating SATE, KOWATE and KOSATE and sandwich to estimate robust standard errors.

#### Results

In this Section we present the results of our analysis. Previous randomized trials showed no statistically significant difference between laminectomy alone versus fusion-plus-laminectomy on ODI (Försth et al., 2016; Ghogawala et al., 2016). The proposed methodology consistently showed similar results to those of Försth et al. (2016); Ghogawala et al. (2016). Specifically, Table 3 shows point estimates and standard errors with respect to SATE, KOWATE, and KOSATE. While the unadjusted method, i.e., naive method regressing only the treatment on the outcome, shows a significant effect of fusion-plus-laminectomy on ODI, adjusted estimates from SATE, KOWATE and KOSATE show a non statistically significant effect of it. Standard errors are lower for KOWATE and KOSATE compared to SATE. Figure 3 shows the covariate balance with respect to SATE (top panel), KOSATE (middle panel) and KOWATE (lower panel). The black dots show the level of balance after weighting, while the light-grey dots show the unadjusted balance. KOWATE provides the lowest covariate balance compared with SATE and KOSATE. Finally, based on the results obtained by applying KOM, we conclude that fusion-plus-laminectomy has no statistically significant effect on ODI.

### 5.2 Evaluating the impact of peer support on CD4 cell count in target populations of healthier people

In the past four decades, the HIV epidemic has become from a small problem to a leading global burden with major social and economical consequences. People living with HIV (PLWH) have benefited from the services of peer support, such as peer-to-peer counseling, support groups and home-based adherence counseling. Although several studies have been shown the positive effect of peer support on quality of life, stigma and discrimination, few studies have evaluated its effect on treatment outcomes such as CD4 cell count at 12 month after peer support initiation. The CD4 cell count provides an indication of the health of the immune system of a PLWH. Normal ranges are between 500 and 1,500, and it decreases when a person is infected by HIV, leading to AIDS when below 200. In a recent randomized trial, Sönnerborg et al. (2016) showed that peer support, defined as home-based adherence counseling, did not have an effect on CD4 cell count at 12 months and other treatment outcomes, such as virological failure. The study was conducted in a resource-limited settings, in which the majority of the PLWH in the sample were severely immune-suppressed with low CD4 cell count at baseline. We are interested in answering the question: what is the impact of peer support in a target population in which PLWH have a higher CD4 cell count at baseline? To do so, in this Section, we apply KOM to evaluate the impact of peer support on CD4 cell count in the general population provided by the Multicenter AIDS Cohort Study (MACS) (MACS, 2019), a prospective observational study collecting demographic and economic information of the HIV-infection. Specifically, we considered two target populations, (1) those PLWH in the MACS dataset that received treatment after 2001 (which we refer to MACS-1), and (2) those that received treatment after 2010 (which we refer to MACS-2). These two populations reflect the fact that in the past years, PLWH get detected earlier, leading to more healthy PLWH at baseline.

#### Study and target populations

We restrict our analysis to a subset of 492 PLWH with complete information about CD4 cell count at baseline, age and CD4 cell count at 12 months. We considered CD4 cell count and age at baseline as possible confounders. PLWH in the trial, had an average CD4 of 120 with more than 80% diagnosed with AIDS at the time of trial initiation. PLWH in the target population MACS-1 had an average CD4 cell count at baseline of 450, while those in MACS-2 of 640, again suggesting that PLWH in the two target populations considered are healthier. The mean age in the trial was 32, while that in MACS-1 and MACS-2, was 38 and 36 respectively. The sample sizes of MACS-1 and MACS-2 were 860 and 243.

#### Model setup

To estimate TATE, we combined the data in the trial with those in the target population. We set the indicator of sample assignment, equal to 1 when unit -th belonged to the trial and 0 otherwise, and set . We solved optimization problem (3.6) to estimate TATE in the two target populations. Once we obtained the set of KOM weights, we plugged them into a weighted ordinary least squares estimator. We computed robust (sandwich) standard errors in each case. We used scikit-learn (through the R package reticulate) to tune the hyperparametes and the R interface of Gurobi to obtain the set of KOM weights, lm for estimating TATE and sandwich to estimate robust standard errors.

#### Results

Table 4 shows the result with respect to TATE across the two target populations MACS-1 and MACS-2. In summary, similarly to as the results in the trial, our results showed a non statistically significant effect of peer support in healtier populations. We therefore conclude that the results in Sönnerborg et al. (2016) may be generalized to populations where PLWH had a higher CD4 cell count at baseline.

## 6 Conclusion

In this paper we presented a general causal estimand, GATE, that unified previously proposed causal estimand, such as SATE, OWATE, OSATE and TATE among others and motivated the formulation of new ones. We also presented and applied KOM to optimally estimate GATE. KOM directly and optimally control both bias and variance which leads to a successful mitigation of possible model misspecifications while controlling precision. In addition, by easily modifying the optimization problem that is fed to an off-the-shelf solver, the proposed method effectively target different causal estimands of interest. Furthermore, by automatically learning the structure of the data, KOM allows to balance linear, nonlinear, additive, and non-additive covariate relationships. One future direction may be to extend KOM for GATE in the longitudinal setting with time-dependent confounders, thus extending the work of Kallus and Santacatterina (2018).

SUPPLEMENTARY MATERIAL

R-code:

R-code containing code to perform the simulations and the analyses of the case-studies described in the article.

Introduction: Related work; KOM for GATE: Figures 4 5, 6; Simulations: Figures 7,8, 9, 10, 11 and Table 4.

Proofs:

Proofs of Lemma 2.1, Theorem 3.1, 3.2 and 3.3.

## 7 Introduction - Additional results

We provide additional information about the literature on average treatment effects estimation.

### 7.1 Related work

Randomized controlled trials provide unbiased estimates of the SATE. However, their results may not be extended to different populations because trial participants are not representative of the target populations or subgroup of interest. For example, the Women’s Health Initiative (WHI) trial (Writing Group for the Women’s Health Initiative Investigators and others, 2002), contrarily to previous observational studies, found an harmful effect of hormone-therapy on stroke. These discrepancies have been in part attributed to possible differences in the distributions of age and weight between the WHI trial and the real-world practice. Specifically, women in the WHI trial were older than the typical age at which the hormone therapy is taken and were also more obese, leading to an increased risk of stroke Keiding and Louis (2016). Observational datasets, such as electronic medical registries, are more representative of the real-world clinical practice, and may provide more generalizable results compared to those of the trials. However, despite their potential, these datasets are observational and non-experimental, where the true causal effect is hidden by confounding factors. Thus, while trials provide unbiased estimates of the SATE in populations that are not generalizable as those in an observational dataset, the estimation of causal effects with observational data is hampered by confounding.

Various statistical methods have been proposed in an attempt to control for confounding in observational studies and to generalized trial results to target populations. Among others, methods based on IPW have been widely used. To control for confounding, IPW weights each subject under study by the inverse of the probability of being treated given covariates, i.e.  the propensity score (Rosenbaum and Rubin, 1983), thus mimicking a random treatment assignment as in a trial. In other words, IPW creates an hypothetical population in which covariates are balanced and confounding is consequently removed. IPW has also been used to generalized trial results to target populations of interest. For instance, based upon the work of Cole and Stuart (2010) and Stuart et al. (2011), Buchanan et al. (2018) proposed an IPW estimator that weights each trial participants by the inverse of the probability of trial participation conditional on covariates. Outcome regression modeling, where the outcome given covariates is modeled by using a linear or nonlinear regression model have also been used. Specifically, to control for confounding, for each treatment arm, an outcome model is postulated, and used to predict the outcome. If the chosen model is the true model that generated the outcome, then, it could be used to both control for confounding and evaluate causal effects in populations different than the one in which the model was trained.

Despite their wide applicability, these two methods are both highly sensitive to model misspecification. Specifically, IPW-based methods are sensitive to misspecification of the treatment assignment model, used to construct the weights, while outcome regression modeling to that of the outcome model. In addition, the use of IPW-based methods is jeopardized by their dependence on the positivity assumption (Rosenbaum and Rubin, 1983), which requires that the propensity scores are neither 0 nor 1. Although positivity may hold theoretically, it can be practically violated (Petersen et al., 2012), i.e., lack of overlap in the covariate distributions between treatment groups, yielding to propensity scores close to 0 or 1. Practical violations of the positivity assumption leads to extreme weights and erroneous inferences.

Methods have been proposed to overcome the issue of misspecification. Robins et al. (1994) proposed augmented IPW estimators, which combine IPW and outcome models in one doubly robust estimator. Here, doubly robust refers to the fact that an unbiased estimate of SATE can be obtained whenever either of the outcome model or the treatment assignment model is correctly specified, thus being robust to misspecification. However, as noted by Kang and Schafer (2007), these methods also suffer from practical violations of the positivity assumption and they are highly biased in case of misspecification of both treatment and outcome models. Imai and Ratkovic (2015) proposed to use the Covariate Balance Propensity Score (CBPS), which finds the logistic regression model that balances covariates.

Several methods have been proposed in the past decades to overcome the issue of extreme weights in IPW. Robins (2000) suggested the use of stabilized inverse probability weights, which are obtained by normalizing the weights by the marginal probability of treatment. Santacatterina and Bottai (2018) proposed to use shrinkage to better control the bias-variance trade-off. Zubizarreta (2015) proposed Stable Balancing Weights (SBW), which are the set of weights of minimal sample variance that satisfy a list of approximate moment matching conditions to a level of balance specified by the research. Cole and Hernán (2008); Xiao et al. (2013) suggested truncation, which consists of replacing outlying weights with less extreme ones. A number of approaches have been proposed with the goal of defining a study population in which the average treatment effect can be well estimated while being as inclusive as possible, i.e., defining a study population that has enough overlap between treatment groups. Crump et al. (2009) proposed to choose the study population that optimize the variance of the estimated average treatment effect on the study population. The Authors showed that this study population is composed by those units whose propensity scores lie in an interval . They suggest to set . As previously mentioned, the authors refer to this causal estimand as OSATE. Visconti and Zubizarreta (2018); Zubizarreta et al. (2014) proposed to use cardinality matching, in which integer programming is used to find the largest sample with bounded balance. Other methods have been also proposed (Traskin and Small, 2011; King et al., 2017, among others). Finally, Crump et al. (2009); Li et al. (2018) proposed to use overlap weights, where each unitâs weight is proportional to the probability of that unit being assigned to the opposite treatment group. As previously mentioned, the authors refer to this causal estimand as OWATE.

The literature of causal estimation is extensive and many other methods have been developed in the recent years (Hirshberg et al., 2019; Hirshberg and Wager, 2017; Hainmueller, 2012; Zhao and Percival, 2017; Zhao, 2019; Wong and Chan, 2017, among others)

Our main contributions to this literature are 1) propose a general causal estimand that unifies several causal estimands and motivates the formulation of new ones, 2) present and apply KOM, which by optimally controlling bias and variance, it mitigates possible model misspecification while controlling for precision.

## 8 Kernel Optimal Matching for estimating GATE - Additional results

In this Section, we provide additional results related to Section 3.4. Specifically, in Figure 4 we show the scatterplots between the two confounders weighted by the weights under correct specification as described in Section 4. Figures 5 and 6 show scatterplots weighted by and respectively, under strong misspecification. Weights are standardized to mean one for comparison.

## 9 Simulations - Additional results

In this Section, we provide additional simulations results. Specifically, we evaluate the performance of KOM with respect to root MSE 1) across levels of practical positivity violations and across levels of misspecification, when , and 2) when increasing sample size and across levels of the penalization parameter under correct specification.

### 9.1 Estimating GATE when increasing the sample size and the level of λ

In this section we describe the setup for the evaluation of the performance of KOM in estimating GATE when increasing the sample size and the level of the penalization parameter under no misspecification. We considered three practical positivity violations: weak, moderate and strong, defined as described in Section 4. We considered five different samples sizes, to , and thirty different values for , from to . When evaluating KOM across sample sizes, we set , i.e.  the optimal , while when evaluating KOM across levels of we set the sample size equal to . For each scenario, we computed the potential outcomes as described in Section 4.1 and we estimated SATE, KOSATE and KOWATE by solving optimization problems (3.5) and (3.7). We used a polynomial kernel degree 1 (a linear kernel), and plugged into the kernel the correctly specified covariates, and .

#### Results when increasing sample size and λ under no misspecification

Figure 8 shows the performance of KOM when estimating SATE (solid-black), KOSATE (solid-dark-grey) and KOWATE (solid-light-grey), with respect of RMSE when increasing the sample size with (left panels) and when increasing the penalization parameter with sample size set to be equal to 100 (right panels), across strong, moderate and weak practical positivity violation scenarios. Results are presented in the log-transformed scale.

In summary, RMSEs decreased when increasing the sample size for SATE, KOSATE and KOWATE with a similar rate across weak, moderate and strong practical positivity violation scenarios.

Lower values of the penalization parameter did not seem to affect the performance of KOM with respect of RMSE for KOWATE, KOSATE and SATE across all three practical positivity violation scenarios.

### 9.2 Some considerations about standard error estimation and coverage

When dealing with weighted estimators, several authors suggested using a robust “sandwich” variance estimator (Freedman, 2006). Furthermore, to compute confidence intervals of a weighted estimator, Wald confidence intervals can be used (Hernán et al., 2001; Robins, 2000; Freedman, 2006). In Section 2, Theorem 3.1, we showed that the conditional variance of the weighted estimator, is equal to the sum of squared weights multiply by the variance of the error, i.e.. In this Section we provide some practical considerations about standard error estimation and coverage of the 95% confidence interval.

Figures 9 and 10 show the results of a simulation study aimed at comparing the empirical standard error of the sampling distribution of estimated SATE, KOSATE and KOWATE when and respectively, with 1) the standard error obtained in Theorem 3.1 (conditional standard error), 2) the classic standard error obtained by using ordinary least squares under errors homoskedasticity, and 3) the robust “sandwich” standard error, across levels of practical positivity violations under strong misspecification (top panels), moderate misspecification (middle panels) and correct specification (bottom panels). We used the R package sandwich to estimate robust standard errors. Similarly, Figures 11 and 12 the coverage of the 95% Wald confidence interval across levels of practical positivity violations when estimating SATE, KOSATE and KOWATE.

In summary, under strong and moderate misspecification, the conditional standard error was smaller than the empirical standard error for SATE, KOSATE and KOWATE. The sandwich estimator was larger, across almost all scenarios of practical positivity violations for SATE, KOSATE and KOWATE. The naive standard error was slightly larger than the empirical standard error for KOSATE and KOWATE, while smaller for SATE. Under correct specification, the conditional standard error was very close to the empirical standard error for all the methods. We obtained similar results for .

Under strong and moderate misspecification the coverage of the 95% confidence interval close to nominal values when using the “sandwich” estimator under weak practical positivity violations. For moderate and strong practical positivity violations the coverage was low for all methods and for both optimal and . As shown in Kallus (2016), exact coverage can be obtained by using the conditional standard error when computed keeping fixed under no model misspecification. Finally, in practical settings, we suggest to use Wald confidence intervals together with the robust “sandwich” standard error estimator as previously suggested by other authors (Hernán et al., 2001; Robins, 2000; Freedman, 2006).

### 9.3 Computational time of KOM

To find a solution to the optimization problems (3.6), and (3.8), three steps are required: (1) tune the kernel hyperparameters; (2) compute the matrices and (3) solve the quadratic optimization problem. In this Section we provide a summary about the computational cost of finding a solution. We computed the computational time by using the R package rbenchmark on a AWS EC2 C5 instance, Intel Xeon Platinum 8000 series, 3.5 GHz, 32GB RAM and a Linux Ubuntu 16.04 operating system. Table 5 shows the mean computational time in seconds needed to tune the hyperparameters across treated (GPML 1) and control (GPML 0), constructing the matrices (Matrices) and solving the optimization problem with Gurobi (Gurobi) when estimating KOWATE, SATE, and KOSATE across the positivity and misspecified scenarios previously described. In summary, for KOSATE and KOWATE most of the computational time was needed to solve the optimization problem, especially for KOSATE.

On average, SATE and KOWATE with a product of polynomial degree 2 always found a solution across levels of practical positivity violations and across levels of misspecification. KOWATE with a product of polynomial degree 2 found a solution around 71% of the times, while the remaining by using a cubin polynomial degree 3. We argue that this is due to the different formulation of the optimization problem, which includes binary variable in the constraint.

## 10 Omitted proofs

### 10.1 Proof of Lemma 2.1

Note that, under assumption 2.5, to obtain a set of weights that makes unbiased for what we need is to be equal to for each , where is equal to except for the -th unit (we refer to as equal to except for the -th unit which is set to be equal to ). Note that

 \EbSi\indicTi=1Wi|X1:n,T−i,S1:n =ϕ(Xi)ψ(Xi)Wi(X1:n,(T−i,1),S1:n) \EbSi\indicTi=0Wi|X1:n,T−i,S1:n =(1−ϕ(Xi))ψ(Xi)Wi(