Confounding-Robust Policy Improvement
Abstract
We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the data-generating process. Unlike previous approaches, which assume unconfoundedness, i.e., that no unobserved confounders affect both treatment assignment and outcome, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confounding-robust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference “status quo” policy, over an uncertainty set around nominal propensity weights. We prove that if the uncertainty set is well-specified, robust policy learning can do no worse than the baseline, and only improves if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.
1 Introduction
The problem of learning personalized decision policies to study “what works and for whom” in areas such as medicine and e-commerce often endeavors to draw insights from observational data, since data from randomized experiments may be scarce and costly or unethical to acquire [14, 3, 29, 8, 15]. These and other approaches for drawing conclusions from observational data in the Neyman-Rubin potential outcomes framework generally appeal to methodologies such as inverse-propensity weighting, matching, and balancing, which compare outcomes across groups constructed such that assignment is almost as if at random [22]. These methods rely on the controversial assumption of unconfoundedness, which requires that the data are sufficiently informative of treatment assignment such that no unobserved confounders jointly affect treatment assignment and individual response [23]. This key assumption may be made to hold ex ante by directly controlling the treatment assignment policy, as is sometimes done in online advertising [4], but in other domains of key interest such as personalized medicine, where electronic medical records (EMRs) are increasingly being analyzed ex post, unconfoundedness may never truly hold.
Assuming unconfoundedness, also called ignorability, conditional exogeneity, or selection on observables, is controversial because it is fundamentally unverifiable: the counterfactual distribution is not identified from the data, which renders any insights from observational studies vulnerable to this fundamental critique [13]. The growing availability of richer observational data such as found in EMRs renders unconfoundedness more plausible, yet it still may never be fully satisfied in practice. Because unconfoundedness may fail to hold, existing policy learning methods that assume it can lead to personalized decision policies that seek to exploit individual-level effects that are not really there, may intervene where not necessary, and may in fact lead to net harm rather than net good. Such dangers constitute obvious impediments to the use of policy learning to enhance decision making in such sensitive applications as medicine, public policy, and civics.
To address this deficiency, in this paper we develop a framework for robust policy learning and improvement that can ensure that a personalized decision policy derived from observational data, which inevitably may have some unobserved confounding, does no worse than a current policy such as the standard of care and in fact does better if the data can indeed support it. We do so by recognizing and accounting for the potential confounding in the data and requiring that the learned policy improve upon a baseline no matter the direction of confounding. Thus, we calibrate personalized decision policies to address sensitivity to realistic violations of the unconfoundedness assumption. For the purposes of informing reliable and personalized decision-making that leverages modern machine learning, point identification of individual-level causal effects, which previous approaches rely on, may not be at all necessary for success, but accounting for the lack of identification is.
Functionally, our approach is to optimize a policy to achieve the best worst-case improvement relative to a baseline treatment assignment policy such as treat all or treat none, where the improvement is measured using a weighted average of outcomes and weights take values in an uncertainty set around the nominal inverse propensity weights (IPW). This generalizes the popular class of IPW-based approaches to policy learning, which optimize an unbiased estimator for policy value under unconfoundedness [17, 27, 26]. Unlike standard approaches, in our approach the choice of baseline is material and changes the resulting policy chosen by our method. This framing supports reliable decision-making in practice, as often a practitioner is seeking evidence of substantial improvement upon the standard of care or a default option, and/or the intervention under consideration introduces risk of toxicity or adverse effects and should not be applied without strong evidence.
Our contributions are as follows. We provide a framework for performing policy improvement that is robust in the face of unobserved confounding. Our framework allows for the specification of data-driven uncertainty sets, based on a sensitivity parameter describing a pointwise multiplicative bound, as well as a global uncertainty budget, which can be calibrated on the maximal discrepancy between the true propensities and nominal propensities. Leveraging the optimization structure of the robust subproblem, we provide a set of algorithms for performing policy optimization. We assess performance on a synthetic example as well as a large clinical trial.
2 Problem Statement and Preliminaries
We assume the observational data consists of tuples of random variables $\{(X_i, T_i, Y_i) : i = 1, \dots, n\}$, comprising covariates $X_i \in \mathcal{X}$, assigned treatment $T_i \in \{0, 1\}$, and real-valued outcomes $Y_i \in \mathbb{R}$. Using the Neyman-Rubin potential outcomes framework, we let $Y_i(0)$ and $Y_i(1)$ denote the potential outcomes of applying treatment $0$ and $1$, respectively. We assume that the observed outcome is the potential outcome for the observed treatment, $Y_i = Y_i(T_i)$, encapsulating non-interference and consistency, also known as SUTVA [24]. We also use the convention that the outcomes correspond to losses, so that lower outcomes are better.
We consider evaluating and learning a (randomized) treatment assignment policy mapping covariates to the probability of assigning treatment, $\pi : \mathcal{X} \to [0, 1]$. We focus on a policy class $\pi \in \mathcal{F}$ of restricted complexity. Examples include linear policies $\pi_\beta(x) = \mathbb{I}[\beta^\top x \ge 0]$, logistic policies $\pi_\beta(x) = \sigma(\beta^\top x)$ where $\sigma(z) = 1/(1 + e^{-z})$, or decision trees of bounded depth. We allow the candidate policy to be either deterministic or stochastic, and denote the realization of treatment assignment for some $X_i$ by a Bernoulli random variable $Z_i^\pi$ such that $\pi(X_i) = \Pr[Z_i^\pi = 1 \mid X_i]$.
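As a minimal illustration of the two parametric policy classes just mentioned, the sketch below evaluates a linear threshold rule and a logistic rule on a single covariate vector. The function names, the threshold at zero, and the plain-list representation of covariates are illustrative assumptions, not notation taken from the paper.

```python
import math


def linear_policy(beta, x):
    """Deterministic rule (assumed form): treat iff the linear score beta . x is >= 0."""
    score = sum(b * v for b, v in zip(beta, x))
    return 1.0 if score >= 0 else 0.0


def logistic_policy(beta, x):
    """Stochastic rule (assumed form): probability of treatment sigma(beta . x)."""
    score = sum(b * v for b, v in zip(beta, x))
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)
```

Both functions return the probability of assigning treatment, so a deterministic policy is simply the special case whose output is always 0 or 1.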
The goal of policy evaluation is to assess the policy value,
$$V(\pi) = \mathbb{E}[Y(Z^\pi)],$$
the population average outcome induced by the policy $\pi$. The problem of policy optimization seeks to find the best such policy over the parametrized function class $\mathcal{F}$. Both of these tasks are hindered by residual confounding, since then $V(\pi)$ cannot actually be identified from the data.
Motivated by the Rosenbaum sensitivity model [21] and without loss of generality, we assume that there is an additional but unobserved covariate $U$ such that unconfoundedness would hold if we were to control for both $X$ and $U$, that is, such that $\mathbb{E}[Y(t) \mid X, U, T] = \mathbb{E}[Y(t) \mid X, U]$ for $t \in \{0, 1\}$. Equivalently, we can treat the data as collected under an unknown logging policy that based its assignment on both $X$ and $U$ and that assigned $T = 1$ with probability $e(X, U) = \Pr(T = 1 \mid X, U)$. Here, $e(X_i, U_i)$ is precisely the true propensity score of unit $i$. Since we do not have access to $U$ in our data, we instead presume that we have access only to nominal propensities $\hat{e}(X) = \Pr(T = 1 \mid X)$, which do not account for the potential unobserved confounding. These are either part of the data or can be estimated directly from the data using a probabilistic classification model such as logistic regression. For compactness, we denote $\hat{e}_{T_i}(X_i) = T_i \hat{e}(X_i) + (1 - T_i)(1 - \hat{e}(X_i))$ and $e_{T_i}(X_i, U_i) = T_i e(X_i, U_i) + (1 - T_i)(1 - e(X_i, U_i))$.
2.1 Related Work
Our work builds upon the literatures on policy learning from observational data and on sensitivity analysis in causal inference.
Sensitivity analysis. Sensitivity analysis in causal inference tests the robustness of qualitative conclusions made from observational data to model specification or assumptions such as unconfoundedness. Some approaches for assessing unconfoundedness require auxiliary data or additional structural assumptions, which we do not assume here [13]. We focus on the implications of confounding for personalized treatment decisions. The Rosenbaum model for sensitivity analysis assesses the robustness of randomization inference to the presence of unobserved confounding by considering a uniform bound on the impact of confounding on the odds ratio of treatment assignment [21], motivated by a logistic specification. More generally, such a restriction corresponds to a model where the odds ratio of treatment for two units with the same covariates, which differs due to the units' different values of the unobserved confounder, is bounded by a sensitivity parameter $\Gamma$, while the confounder itself may be arbitrary. The value of $\Gamma$ can be calibrated against the discrepancies induced by omitting observed variables; determining $\Gamma$ can then be phrased in terms of whether one thinks one has omitted a variable that could have increased or decreased the probability of treatment by as much as, say, gender or age can in the observed data [12].
In the sampling literature, the weight-normalized estimator for the population mean is known as the Hajek estimator, and Aronow and Lee [1] derive sharp bounds on the estimator arising from a uniform bound on the sampling probabilities, giving a closed-form solution to the resulting fractional linear program. [31] considers bounds on the Hajek estimator, but imposes a parametric model on the treatment assignment probability. [18] considers tightening the bounds on the Hajek estimator by adding shape constraints, such as log-concavity, on the cumulative distribution of outcomes.
Policy learning from observational data under unconfoundedness. A variety of approaches for learning personalized intervention policies that maximize causal effect have been proposed, but all under the assumption of unconfoundedness. These fall under regression-based strategies [20], reweighting-based strategies [3, 14, 15, 27], or doubly robust combinations thereof [8, 29]. Regression-based strategies estimate the conditional average treatment effect (CATE), $\mathbb{E}[Y(1) - Y(0) \mid X]$, either directly or by differencing two regressions, and use it to score the policy. Without unconfoundedness, however, CATE is not identifiable from the data and these methods have no guarantees.
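Reweighting-based strategies use inverse-probability weighting (IPW) to change measure from the outcome distribution induced by a logging policy to that induced by the policy $\pi$. Specifically, these methods use the fact that, under unconfoundedness, $\hat{V}^{\mathrm{IPW}}(\pi)$ is unbiased for the policy value $V(\pi)$ [17], where $\hat{e}_{T_i}(X_i)$ denotes the nominal propensity of the treatment that unit $i$ actually received and
$$\hat{V}^{\mathrm{IPW}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(X_i) T_i + (1 - \pi(X_i))(1 - T_i)}{\hat{e}_{T_i}(X_i)} \, Y_i. \qquad (1)$$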
Optimizing this estimate can be phrased as a weighted classification problem [3]. Since dividing by propensities can lead to extreme weights and high-variance estimates, additional strategies such as clipping the probabilities away from 0 and normalizing by the sum of weights as a control variate are typically necessary for good performance [26, 30]. With or without these fixes, if there are unobserved confounders, none of these estimators is consistent for the policy value, and learned policies may introduce more harm than good.
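The two stabilizing fixes mentioned above, clipping and normalizing by the sum of weights, can be sketched as follows. This is a generic illustration of an IPW policy-value estimate under nominal propensities, not the authors' implementation; the function name `ipw_value`, the argument names, and the default clipping level are assumptions made for the example.

```python
def ipw_value(y, t, pi, e_hat, clip=0.01, normalize=True):
    """Estimate the policy value (average loss) by inverse-propensity weighting.

    y: observed losses; t: observed treatments in {0, 1};
    pi: policy's probability of treating each unit;
    e_hat: nominal propensities Pr(T = 1 | X) for each unit.
    """
    num = 0.0
    den = 0.0
    for yi, ti, pii, ei in zip(y, t, pi, e_hat):
        # Clip the nominal propensity away from 0 and 1 to avoid extreme weights.
        ei = min(max(ei, clip), 1.0 - clip)
        # Probability that the policy agrees with the logged treatment.
        match = pii if ti == 1 else 1.0 - pii
        # Nominal propensity of the treatment actually received.
        prop = ei if ti == 1 else 1.0 - ei
        w = match / prop
        num += w * yi
        den += w
    if normalize:
        if den == 0.0:
            raise ValueError("policy never agrees with the logged treatments")
        return num / den  # self-normalized (Hajek-style) estimate
    return num / len(y)  # plain IPW estimate
```

Neither variant corrects for unobserved confounding: both are only as good as the nominal propensities passed in.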
A separate literature in reinforcement learning considers the idea of safe policy improvement by minimizing the regret against a baseline policy, forming an uncertainty set around the presumed unknown transition probabilities between states as in [28], or forming a trust region for safe policy exploration via concentration inequalities on the importance-reweighted estimates of policy risk [19].
3 Robust policy evaluation and improvement
Our framework for confounding-robust policy improvement minimizes a bound on policy regret against a specified baseline policy $\pi_0$, $R_{\pi_0}(\pi) = V(\pi) - V(\pi_0)$, the difference between the policy values of the candidate and the baseline. Our bound is achieved by maximizing a reweighting-based regret estimate over an uncertainty set around the nominal propensities. This ensures that we cannot do any worse than $\pi_0$ and may do better, even if the data is confounded.
The baseline policy can be any fixed policy that we want to make sure not to do worse than, or deviate from unnecessarily. This is usually the current standard of care, established from prior evidence, and can be a policy that actually depends on the covariates. Generally, we think of this as the policy that always assigns control. Alternatively, if a reliable estimate of the average treatment effect is available, then the baseline can be the constant policy indicated by its sign. In an agnostic extreme, the baseline can be the complete randomization policy.
3.1 Confounding-robust policy learning by optimizing minimax regret
If we had oracle access to the true inverse propensities we could form the correct IPW estimate by replacing nominal with true propensities in eq. (1). We may go a step further and, recognizing that , use the empirical sum of true propensities as a control variate by normalizing our IPW estimate by them. This gives rise to the following Hajek estimators of and correspondingly
It follows by Slutsky’s theorem that these estimates remain consistent (if we know the true weights). Note that had we known the true weights, both the normalization and the choice of baseline would have amounted to constant shifts and scales of the regret estimate that would not have changed the choice of policy that minimizes it. This will not be true of our bound, where both the normalization and the choice of baseline will be material.
Since the oracle weights are unknown, we instead minimize the worst-case possible value of our regret estimate, by ranging over the space of possible values for the weights that are consistent with the observed data and our assumptions about the confounded data-generating process. Specifically, our model restricts the extent to which unobserved confounding may affect assignment probabilities. We first consider an uncertainty set motivated by the odds-ratio characterization in [21], which restricts how far the weights can vary pointwise from the nominal propensities. Given a bound $\Gamma \ge 1$, the odds-ratio restriction on the true propensity $e(X_i, U_i)$, relative to the nominal propensity $\hat{e}(X_i)$, is that it satisfy the following inequalities
$$\Gamma^{-1} \le \frac{(1 - \hat{e}(X_i)) \, e(X_i, U_i)}{\hat{e}(X_i) \, (1 - e(X_i, U_i))} \le \Gamma. \qquad (2)$$
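This restriction is motivated by (but more general than) considering a logistic model where $e(x, u) = \sigma(g(x) + \gamma u)$, $g$ is any function, $u \in [0, 1]$ is bounded without loss of generality, and $|\gamma| \le \log \Gamma$. Such a model would necessarily give rise to eq. (2). This restriction also immediately leads to an uncertainty set for the true inverse propensities of the observed treatment of each unit, $W_i^* = 1 / e_{T_i}(X_i, U_i)$, which we denote as follows
$$\mathcal{U}_n^{\Gamma} = \{ W \in \mathbb{R}^n : a_i^{\Gamma} \le W_i \le b_i^{\Gamma} \ \ \forall i = 1, \dots, n \}, \quad a_i^{\Gamma} = 1 + \Gamma^{-1} (1 / \hat{e}_{T_i}(X_i) - 1), \quad b_i^{\Gamma} = 1 + \Gamma \, (1 / \hat{e}_{T_i}(X_i) - 1).$$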
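To make the pointwise uncertainty set concrete, the following sketch computes lower and upper bounds on each unit's inverse propensity weight. It assumes the odds-ratio restriction takes the usual Rosenbaum-style form, in which the odds of treatment under the true and nominal propensities differ by at most a factor `gamma >= 1`; under that assumption, writing p for the nominal propensity of the treatment actually received, the true inverse weight lies between 1 + (1/p - 1)/gamma and 1 + gamma (1/p - 1). The function name `weight_bounds` and its arguments are hypothetical.

```python
def weight_bounds(e_hat, t, gamma):
    """Pointwise bounds on true inverse propensity weights under an odds-ratio bound.

    e_hat: nominal propensities Pr(T = 1 | X); t: observed treatments in {0, 1};
    gamma: assumed sensitivity parameter (gamma = 1 means no unobserved confounding).
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    lower, upper = [], []
    for ei, ti in zip(e_hat, t):
        # Nominal propensity of the treatment the unit actually received.
        p = ei if ti == 1 else 1.0 - ei
        # 1/p - 1 is the nominal odds *against* the observed treatment.
        excess = 1.0 / p - 1.0
        lower.append(1.0 + excess / gamma)
        upper.append(1.0 + excess * gamma)
    return lower, upper
```

With `gamma = 1` the interval collapses to the nominal inverse propensity weight, and it widens monotonically as `gamma` grows.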
The corresponding bound on empirical regret is , where for any we define
We then choose the policy in our class that minimizes this regret bound, i.e., , where
(3) 
In particular, for our regret estimate, weight normalization is crucial for only enforcing robustness against consequential realizations of confounding, namely those that affect the relative weighting of patient outcomes; otherwise, robustness against confounding would simply push weights to their highest possible bounds wherever the corresponding term is positive. If the baseline policy is in the policy class, it already achieves 0 regret; thus, minimizing regret necessitates learning regions of treatment assignment where evidence from observed outcomes suggests benefits in terms of decreased loss. Different baseline policies structurally change the solution to the adversarial subproblem by shifting the contribution of the loss term to emphasize improvement upon the baseline.
Budgeted uncertainty sets to address “local” confounding. Our approach can be pessimistic in ensuring robustness against worst-case realizations of unobserved confounding “globally” for each unit, whereas concerns about unobserved confounding may be restricted to subgroup risk factors or outliers. For the Rosenbaum model in hypothesis testing, this has been recognized by [9, 11], who address it by limiting the average of the unobserved propensities by an additional sensitivity parameter. Motivated by this, we next consider an alternative uncertainty set, where we fix a budget for how much the weights can diverge in total from the nominal inverse propensity weights. Specifically, letting $\hat{W}_i$ denote the nominal inverse propensity weight of unit $i$ (the inverse of the nominal propensity of its observed treatment), we construct the uncertainty set
When plugged into eq. (3), this provides an alternative policy choice criterion that is less conservative. We suggest calibrating the budget as a fraction of the total deviation allowed by the pointwise bounds. Specifically, . This is the approach we take in our empirical investigation.
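One possible reading of this calibration is sketched below. The exact formula is not recoverable from this copy of the text, so the rule used here is an assumption: the budget is taken to be a multiplier `rho` in [0, 1] times the sum, over units, of the largest deviation from the nominal weight that the pointwise bounds permit, and the total divergence is measured in L1 distance. The function names are hypothetical.

```python
def calibrate_budget(w_nominal, lower, upper, rho):
    """Budget = rho * total deviation allowed by the pointwise bounds (assumed rule)."""
    total = sum(max(w - a, b - w) for w, a, b in zip(w_nominal, lower, upper))
    return rho * total


def in_budgeted_set(w, w_nominal, lower, upper, budget, tol=1e-9):
    """Check the box constraints and the L1 budget around the nominal weights."""
    within_box = all(a - tol <= wi <= b + tol for wi, a, b in zip(w, lower, upper))
    l1_deviation = sum(abs(wi - w0) for wi, w0 in zip(w, w_nominal))
    return within_box and l1_deviation <= budget + tol
```

Under this reading, `rho = 1` never makes the budget binding, so the budgeted set coincides with the pointwise set, while `rho = 0` pins the weights to their nominal values.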
3.2 The Improvement Guarantee
We next prove that if we appropriately bounded the potential hidden confounding, then our worst-case empirical regret objective is asymptotically an upper bound on the true population regret. On the one hand, since our objective is necessarily nonpositive if the baseline belongs to the policy class, this says we do no worse. On the other hand, if our objective is negative, which we can check by just evaluating it, then we are assured some strict improvement. Our result is generic for both the unbudgeted and the budgeted uncertainty sets.
Our upper bound depends on the complexity of our policy class. Define its Rademacher complexity:
All the policy classes we consider have complexities that vanish as the sample size grows.
Theorem 1.
Suppose that and that for some and for some . Then for any such that , we have that with probability at least ,
(4) 
In particular, eq. (4) holds for the policy learned by eq. (3), which minimizes the right-hand side. So, if the objective is negative, we are (almost) assured of getting some improvement on the baseline. At the same time, so long as the baseline belongs to the policy class, the objective is necessarily nonpositive, so we are also (almost) assured of doing no worse than the baseline. All this holds without really being able to identify any effects, due to hidden confounding. Thus, Theorem 1 exactly captures the allure of our approach.
4 Optimizing Robust Policies
We next discuss how to solve the policy optimization problem in eq. (3). In the main text, we focus on parametrized policy classes, such as linear policies. In the appendix (Sec. D) we also consider tree-based policies, including forest ensembles. In particular, we provide a greedy recursive partitioning algorithm as well as a mixed-integer programming algorithm to find a globally optimal confounding-robust decision tree.
For the remainder we focus on parameterized policies. We first discuss how to solve the worstcase regret subproblem for a fixed policy, which we will then use to develop our algorithm.
4.1 Dual Formulation of Worst-Case Regret
The minimization in eq. (3) involves an inner supremum over the weights in the uncertainty set. Moreover, this supremum over weights does not on the face of it appear to be convex. We next proceed to characterize this supremum, formulate it as a linear program, and, by dualizing it, provide an efficient procedure for finding the pessimal weights.
For compactness and generality, we address the optimization problem parameterized by an arbitrary reward vector , where
(5) 
To recover the worst-case regret, we would simply set the reward vector to the per-unit regret terms. Since the unbudgeted uncertainty set involves only linear constraints on the weights, eq. (5) over this set is a linear fractional program. We can reformulate it as a linear program by applying the Charnes-Cooper transformation [6], requiring weights to sum to 1, and rescaling the pointwise bounds by a nonnegative scale factor. We obtain the following equivalent linear program in terms of the normalized weights:
(6) 
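For reference, the Charnes-Cooper step can be written out in generic notation. The symbols below ($r_i$ for the objective coefficients, $a_i \le W_i \le b_i$ with $a_i > 0$ for the pointwise bounds, $t$ for the scale factor, and $w$ for the normalized weights) are chosen for this illustration and may differ from the paper's own symbols.

```latex
% Generic Charnes-Cooper reformulation of a linear fractional program over a box.
% Notation is illustrative: r_i are objective coefficients, a_i <= W_i <= b_i with a_i > 0.
\max_{a \le W \le b} \; \frac{\sum_{i=1}^n r_i W_i}{\sum_{i=1}^n W_i}
\quad = \quad
\begin{aligned}[t]
\max_{w \in \mathbb{R}^n,\; t \ge 0} \quad & \sum_{i=1}^n r_i w_i \\
\text{s.t.} \quad & \sum_{i=1}^n w_i = 1, \\
& t\, a_i \le w_i \le t\, b_i, \quad i = 1, \dots, n,
\end{aligned}
\qquad \text{where } t = \frac{1}{\sum_{i} W_i}, \quad w_i = t\, W_i .
```

The substitution moves the denominator into the scale factor $t$, so the objective and all constraints become linear in $(w, t)$, and any optimal $(w, t)$ with $t > 0$ maps back to weights $W = w / t$ in the original box.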
The dual problem to eq. (6) has one dual variable for the weight normalization constraint and two vectors of dual variables for the lower-bound and upper-bound constraints on the weights, respectively, and is given by
(7) 
We use this to show that solving the adversarial subproblem requires only sorting the data and ternary search to optimize a unimodal function, generalizing the result of Aronow and Lee [1] to arbitrary pointwise bounds on the weights. Crucially, this algorithmically efficient solution allows for faster subproblem solutions when optimizing our regret bound over policies in a given policy class.
Theorem 2 (Normalized optimization solution).
Let denote the ordering such that . Then, , where and
(8) 
Moreover, is a discrete concave unimodal function.
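The threshold structure behind this result can be illustrated with a short sketch. For a box-constrained fractional objective, an optimal solution puts the upper bound on the units with the largest coefficients and the lower bound on the rest, so it suffices to sort and compare the candidate split points. The code below scans all split points linearly with running sums rather than using ternary search, which is a simplification of the procedure described in the text; the function name and argument names are hypothetical, and lower bounds are assumed strictly positive.

```python
def worst_case_ratio(r, lower, upper):
    """Maximize sum(r_i * W_i) / sum(W_i) subject to lower_i <= W_i <= upper_i.

    Assumes every lower bound is strictly positive so the denominator is positive.
    Returns the optimal value and one maximizing weight vector.
    """
    # Visit units from the largest coefficient to the smallest.
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    # Split point k = 0: every weight sits at its lower bound.
    num = sum(ri * a for ri, a in zip(r, lower))
    den = sum(lower)
    best_val, best_k = num / den, 0
    for k, i in enumerate(order, start=1):
        # Raise the k-th largest coefficient's weight to its upper bound.
        num += r[i] * (upper[i] - lower[i])
        den += upper[i] - lower[i]
        if num / den > best_val:
            best_val, best_k = num / den, k
    weights = list(lower)
    for i in order[:best_k]:
        weights[i] = upper[i]
    return best_val, weights
```

For example, with coefficients `[3, 1, -2]`, lower bounds `[1, 1, 1]`, and upper bounds `[3, 2, 4]`, only the first unit is raised to its upper bound. After the sort, the scan costs linear time; a ternary search over the unimodal sequence of split-point values would reduce the number of evaluated split points further.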
Next we consider the budgeted uncertainty set. Write an extended formulation for it using only linear constraints:
This immediately shows that the worst-case problem remains a fractional linear program. Indeed, a similar Charnes-Cooper transformation as used above, with the additional normalization, yields a non-fractional linear programming formulation:
The corresponding dual problem is:
As the problem remains a linear program, we can easily solve it using off-the-shelf solvers.
4.2 Optimizing Parametric Policies
We next consider a subgradient algorithm (Algorithm 1) that at each stage solves the worst-case regret subproblem efficiently for a fixed policy in a differentiable (or subdifferentiable) policy class parametrized by a parameter vector, such as logistic assignment policies. We collect the policy's assignment probabilities for the individual units into a policy assignment vector. We optimize the linearized value function:
From the viewpoint of parametric programming, the linearized program of eq. (6) is parametrized by the policy parameter insofar as it affects the objective weights on the inner decision variable; these objective weights are an affine transformation of the policy assignment vector. We presume that a subgradient of the policy with respect to its parameter is provided for each unit. By a theorem of [25] and the subgradient chain rule for affine transformations, a subgradient of the value function at the current parameter is obtained by evaluating the gradient of the linearized objective at the optimal inner weights. At each iteration, we compute these weights with an efficient subroutine Weights and perform a subgradient step on the parameter. Using this method, we can optimize policies over both the unbudgeted and the budgeted uncertainty sets. In Appendix B, we discuss the concrete form this takes for parametric logistic policies.
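The overall loop can be sketched for logistic policies and the unbudgeted (box) uncertainty set as follows. This is an illustrative sketch under several assumptions, not the authors' Algorithm 1: the regret estimate is taken to be the weight-normalized sum of `(pi(x_i) - pi0) * (2 * t_i - 1) * y_i`, the baseline `pi0` is a constant (all-control by default), the step size `lr / sqrt(s)` is an arbitrary choice, and the inner weights are found by a linear scan over sorted split points. All function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def worst_case_weights(r, lower, upper):
    """Weights in the box [lower, upper] maximizing sum(r * W) / sum(W)."""
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    num = sum(ri * a for ri, a in zip(r, lower))
    den = sum(lower)
    best, best_k = num / den, 0
    for k, i in enumerate(order, start=1):
        num += r[i] * (upper[i] - lower[i])
        den += upper[i] - lower[i]
        if num / den > best:
            best, best_k = num / den, k
    w = list(lower)
    for i in order[:best_k]:
        w[i] = upper[i]
    return best, w


def robust_logistic_policy(x, t, y, lower, upper, pi0=0.0, steps=200, lr=0.5):
    """Subgradient descent on the worst-case regret of pi(x) = sigmoid(theta . x)."""
    d = len(x[0])
    theta = [0.0] * d
    best_theta, best_val = list(theta), float("inf")
    for s in range(1, steps + 1):
        pi = [sigmoid(sum(th * v for th, v in zip(theta, xi))) for xi in x]
        # Per-unit regret terms against the constant baseline pi0.
        r = [(p - pi0) * (2 * ti - 1) * yi for p, ti, yi in zip(pi, t, y)]
        val, w = worst_case_weights(r, lower, upper)
        if val < best_val:
            best_val, best_theta = val, list(theta)
        # Subgradient: differentiate the objective with the adversarial weights held fixed.
        total = sum(w)
        grad = [0.0] * d
        for p, ti, yi, wi, xi in zip(pi, t, y, w, x):
            coef = (wi / total) * (2 * ti - 1) * yi * p * (1.0 - p)
            for j in range(d):
                grad[j] += coef * xi[j]
        theta = [th - (lr / math.sqrt(s)) * g for th, g in zip(theta, grad)]
    return best_theta, best_val
```

Because the objective is nonconvex in the policy parameter, a sketch like this would in practice be run from several random restarts, keeping the parameter with the lowest worst-case regret.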
5 Experiments
Simulated data. We first consider a simple linear model specification demonstrating the possible effects of significant confounding on inverse-propensity weighted estimators.
We introduce the outcome-dependent confounder . We let the nominal propensities be logistic in , with . We let the confounded propensities take on the larger bound, where , if , and the lower bound otherwise. The constant treatment effect is with the linear interaction . The data mean is . The noise term affects outcomes with coefficients , in addition to a uniform noise term .
In Fig. 0(a), we compare results from averaging over 50 datasets drawn from this model, training our robust policies RIC.Log (unbudgeted) and budgeted (RIC.L1) with a budget multiplier , IPW with a probabilistic policy, self-normalized POEM [26], and causal forest estimates of CATE on examples and testing on examples. We consider regret against the control assignment policy for our methods assessing robust improvement (RIC). We vary for each replication. The optimal unrestricted policy achieves average regret of , while SN-POEM remains confounded and achieves . Causal forests achieve . For optimization, we take the best of 15 restarts and a step schedule of . Our conservative robust policies achieve substantial improvement, though the uncertainty-budgeted approaches incur greater variance due to flatter regions of the nonconvex value function. As increases, the learned robust policies for the parametric approaches converge to the all-control policy. Even in this extreme example of confounding, where the true propensities achieve the odds-ratio bounds, the budgeted version is able to attain similar robust improvements, sometimes outperforming the unbudgeted version for . The best improvements are achieved around , consistent with the specification; the policies tend toward control as the possible confounding increases.
Assessment with Clinical Data: International Stroke Trial.
We build an evaluation framework for our methods from real-world data, where the counterfactuals are not known, by simulating confounded selection into a training dataset and estimating out-of-sample policy regret on a held-out “test set” from the completely randomized controlled trial. We study the International Stroke Trial (IST), comparing the aspirin and heparin (high dose) vs. aspirin-only treatment arms from the original factorial design, numbering 7233 cases with [10]. We defer some details about the dataset to Appendix E. Findings from the study suggest a clear reduction in adverse events (recurrent stroke or death) from aspirin, whereas heparin efficacy is inconclusive, since a small (non-significant) benefit on rates of death at 6 months was offset by greater incidence of other adverse events such as hemorrhage or cranial bleeding. We construct an evaluation framework from the dataset as follows: we fix a split into a training set and a held-out test set, and subsample a final set of initial patients, whose data is then used to train treatment assignment policies. We generate nominal selection probabilities into the trial, letting denote inclusion, as , where is rescaled. We consider nominal propensities as . We introduce confounding by censoring the treated patients with the worst 10% of outcomes, and the 10% best patients in the control group.
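The censoring step at the end of this construction can be sketched as follows. Since outcomes are losses, the sketch assumes that the “worst” treated outcomes are the largest losses and the “best” control outcomes are the smallest; the function name and the index-based interface are illustrative assumptions, and the nominal selection step that precedes censoring is not reproduced here.

```python
def censor_indices(t, y, frac=0.10):
    """Return indices kept after outcome-dependent censoring.

    Drops the treated units with the highest losses and the control units
    with the lowest losses, each amounting to a fraction `frac` of its arm.
    """
    treated = [i for i, ti in enumerate(t) if ti == 1]
    control = [i for i, ti in enumerate(t) if ti == 0]
    # Worst treated outcomes first (largest loss), best control outcomes first (smallest loss).
    treated.sort(key=lambda i: y[i], reverse=True)
    control.sort(key=lambda i: y[i])
    dropped = set(treated[: int(frac * len(treated))])
    dropped |= set(control[: int(frac * len(control))])
    return [i for i in range(len(t)) if i not in dropped]
```

Censoring of this kind makes treatment look better than it is in the retained sample, which is the sort of selection that nominal propensities estimated from covariates alone cannot correct.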
The original trial measured a set of clinical outcomes including death, stroke recurrence, adverse effect, and recovery, which we scalarize as a composite loss function. A difference-in-means estimate of the ATE for the composite score in full data is . Without access to the true counterfactual outcomes for patients, our oracle estimates are IPW-based estimates from the held-out RCT data. We estimate regret against the control policy with the empirical self-normalized estimate of . In Fig. 0(b), we evaluate on 10 draws from the dataset, comparing our policies against the vanilla IPW estimator with a probabilistic policy, self-normalized POEM [26], and assigning based on the sign of the CATE prediction from causal forests [29]. The selected datasets average a size of . We evaluate logistic parametric policies (RIC.Lg) and budgeted (RIC.L1) with budget multiplier . For the parametric policies, we optimize with the same parameters as earlier. We evaluate , every 0.025 between and , every 0.2 between and . For small values of , our methods perform similarly to IPW. As increases, our methods achieve policy improvement, though the L1-budgeted method (RIC.L1) achieves worse performance. For , the robust policy essentially learns the all-control policy; our finite-sample regret estimator simply indicates good regret for a negligible number of patients (56).



In Figs. 0(c) and 0(d), we study the behavior of the robust policies. The IST trial recorded a prognosis score of probability of death at 6 months for patients, using an externally validated model, which we do not include in training data but use to assess the validity of our robust policy. In Fig. 0(d), we consider the average prognosis score of death for among patients treated with . In Fig. 0(c), for , the policy considers treating of patients and the subsequent average prognosis score of the population under consideration increases, indicating that the policy is learning and treating on appropriate indicators of severity from the available covariates. For , the noise in the prognosis score is due to the small treated subgroups. Our learned policies suggest that improvements from heparin may be seen in the highest-risk patients, consistent with the findings of [2], a systematic review comparing anticoagulants such as heparin against aspirin. They conclude from a study of trials including IST that heparin provides little therapeutic benefit, with the caveat that the trial evidence base is lacking for the highest-risk patients, where heparin may be of benefit. Thus, our robust method appropriately treats those, and only those, who stand to benefit from the more aggressive treatment regime.
6 Conclusion
We developed a framework for estimating and optimizing for robust policy improvement, which optimizes the minimax regret of a candidate personalized decision policy against a baseline policy. We optimize over uncertainty sets centered at the nominal propensities, and leverage the optimization structure of normalized estimators to perform policy optimization efficiently by subgradient descent on the robust risk. Assessments on synthetic and clinical data demonstrate the benefits of robust policy improvement.
References
 Aronow and Lee [2012] P. Aronow and D. Lee. Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika, 2012.
 Berge and Sandercock [2002] E. Berge and P. A. Sandercock. Anticoagulants versus antiplatelet agents for acute ischaemic stroke. The Cochrane Library of Systematic Reviews, 2002.
 Beygelzimer and Langford [2009] A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009.
 Bottou et al. [2013] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems. Journal of Machine Learning Research, 2013.
 Breiman et al. [1984] L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. Chapman and Hall, 1984.
 Charnes and Cooper [1962] A. Charnes and W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 1962.
 Duchi and Namkoong [2017] J. Duchi and H. Namkoong. Variance-based regularization with convex objectives. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
 Dudik et al. [2014] M. Dudik, D. Erhan, J. Langford, and L. Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
 Fogarty and Hasegawa [2017] C. Fogarty and R. Hasegawa. An extended sensitivity analysis for heterogeneous unmeasured confounding. 2017.
 Group [1997] International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. Lancet, 1997.
 Hasegawa and Small [2017] R. Hasegawa and D. Small. Sensitivity analysis for matched pair analysis of binary data: From worst case to average case analysis. Biometrics, 2017.
 Hsu and Small [2013] J. Y. Hsu and D. S. Small. Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69(4):803–811, 2013.
 Imbens and Rubin [2015] G. Imbens and D. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
 Kallus [2017] N. Kallus. Recursive partitioning for personalization using observational data. Proceedings of the Thirty-fourth International Conference on Machine Learning, 2017.
 Kitagawa and Tetenov [2015] T. Kitagawa and A. Tetenov. Empirical welfare maximization. 2015.
 Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer, 1991.
 Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proceedings of the fourth ACM international conference on web search and data mining, 2011.
 Miratrix et al. [2018] L. W. Miratrix, S. Wager, and J. R. Zubizarreta. Shape-constrained partial identification of a population mean under unknown probabilities of sample selection. Biometrika, 2018.
 Petrik et al. [2016] M. Petrik, M. Ghavamzadeh, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. 29th Conference on Neural Information Processing Systems, 2016.
 Qian and Murphy [2011] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
 Rosenbaum [2002] P. Rosenbaum. Observational Studies. Springer Series in Statistics, 2002.
 Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983.
 Rubin [1974] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974.
 Rubin [1980] D. B. Rubin. Comments on “Randomization analysis of experimental data: The Fisher randomization test comment”. Journal of the American Statistical Association, 75(371):591–593, 1980.
 Still [2018] G. Still. Lectures on parametric optimization: An introduction. Optimization Online, 2018.
 Swaminathan and Joachims [2015a] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. Proceedings of NIPS, 2015a.
 Swaminathan and Joachims [2015b] A. Swaminathan and T. Joachims. Counterfactual risk minimization. Journal of Machine Learning Research, 2015b.
 Thomas et al. [2015] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Wager and Athey [2017] S. Wager and S. Athey. Efficient policy learning. 2017.
 Wang et al. [2017] Y.-X. Wang, A. Agarwal, and M. Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. Proceedings of Neural Information Processing Systems 2017, 2017.
 Zhao et al. [2017] Q. Zhao, D. S. Small, and B. B. Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. arXiv, 2017.
Appendix A Proofs
Proof of Theorem 1.
Let . Then
Since and , Hoeffding’s inequality gives
Let
and note that since we have that satisfies bounded differences with constant . Hence, McDiarmid’s inequality gives
Next, because , a standard symmetrization argument gives that
Further, by the Rademacher comparison lemma [16, Thm. 4.12], we get that
Next, satisfies bounded differences with constant so McDiarmid’s inequality gives
Finally, Hoeffding’s inequality gives that
Combining, we get that with probability at least , assuming that , we have that is bounded above by
Letting and , the above is bounded by so long as . The proof is completed by noting that by assumption of true weights being inside we get that . ∎
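For reference, the two concentration inequalities invoked in this proof can be stated in their generic one-sided forms as follows. The symbols here ($X_i$, $a_i$, $b_i$, $c_i$, $f$, $t$) are generic placeholders and are not tied to the specific quantities of the proof.

```latex
% Hoeffding: X_1, \dots, X_n independent with X_i \in [a_i, b_i]
\Pr\!\left( \frac{1}{n}\sum_{i=1}^{n} \bigl(X_i - \mathbb{E}X_i\bigr) \ge t \right)
  \le \exp\!\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)

% McDiarmid: X_1, \dots, X_n independent and f has bounded differences
% with constants c_1, \dots, c_n
\Pr\!\left( f(X_1,\dots,X_n) - \mathbb{E} f(X_1,\dots,X_n) \ge t \right)
  \le \exp\!\left( -\frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \right)
```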
Equivalence of Fractional Program (5) and Linear Program (6).
We can verify that a feasible solution for one problem yields a feasible solution for the other with the same objective value. For a feasible solution to the fractional program (5), we can generate a feasible solution to the linear program (6) as with the same objective value. In the other direction, we can generate a feasible solution to (5) from a feasible solution to the linear program (6) if we take . This solution has the same objective value since . ∎
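The change of variables behind this equivalence can be checked numerically. The sketch below assumes the fractional program has the form "maximize sum(w*r)/sum(w) subject to a[i] <= w[i] <= b[i]" and that the linear program is its standard Charnes-Cooper reformulation with scaled weights `w_tilde = w * t`, `t = 1 / sum(w)`. The names `r`, `a`, `b`, `w`, `w_tilde`, and `t` are illustrative and need not match the paper's notation.

```python
def charnes_cooper(w):
    """Map a feasible point w of the fractional program to (w_tilde, t)."""
    t = 1.0 / sum(w)
    return [wi * t for wi in w], t


def fractional_objective(w, r):
    """Objective of the (assumed) fractional program: normalized weighted sum."""
    return sum(wi * ri for wi, ri in zip(w, r)) / sum(w)


def linear_objective(w_tilde, r):
    """Objective of the (assumed) linear program: plain weighted sum."""
    return sum(wi * ri for wi, ri in zip(w_tilde, r))


# Hypothetical data: outcomes r and box bounds a <= w <= b.
r = [1.0, -2.0, 3.0]
a = [0.5, 0.5, 0.5]
b = [2.0, 2.0, 2.0]
w = [1.0, 0.5, 2.0]  # any point inside the box

w_tilde, t = charnes_cooper(w)
w_back = [wi / t for wi in w_tilde]  # inverse map from the LP to the FP
```

Under these assumptions, `w_tilde` sums to one, satisfies the scaled box constraints `a[i] * t <= w_tilde[i] <= b[i] * t`, and attains the same objective value as `w`.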
Proof of Thm. 2.
We analyze the program using complementary slackness, which yields an algorithm for finding a solution that generalizes the one in [1].
Note that without loss of generality we can consider for datapoints where without changing the optimization problem. At optimality only one of or will be tight. For the nonbinding primal constraints, by complementary slackness the corresponding dual variable will be 0. We expect constraints to be tight in the dual (since and the constraint is not binding). So the optimal solution to the dual will satisfy:
Note that if and if . Then at optimality, there exists some index (where refers to the th index of the increasing order statistics, an ordering where ).
We can substitute in the solution from the equality constraints and obtain the following equality which holds at optimality:
We discuss how to derive the primal solution from the dual solution: for , take and . Thus by complementary slackness, is the optimal value with the corresponding optimal solution exhibited above.
The optimal such occurs with the order statistic threshold at for . Consider the parametric restriction of the primal program, parametrized by the sum of weights ; the value function is concave in and furthermore concave in the discrete restriction of to the values it takes at the vertex solutions.
∎
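The threshold structure established above suggests a simple search, sketched here under stated assumptions: the problem is taken to be "maximize sum(w*r)/sum(w) subject to a[i] <= w[i] <= b[i]" with all a[i] > 0, and the optimal weights are assumed to sit at the lower bound for the smallest outcomes and at the upper bound for the rest. The function name and the symbols `r`, `a`, `b` are illustrative. For clarity the sketch enumerates every threshold rather than exploiting the concavity noted in the proof.

```python
def max_normalized_sum(r, a, b):
    """Maximize sum(w*r)/sum(w) over the box a <= w <= b by threshold search.

    Sorts units by increasing r, then tries every split k: the k units with
    the smallest r get their lower bound, the remaining units their upper bound.
    """
    n = len(r)
    order = sorted(range(n), key=lambda i: r[i])
    best_val, best_w = float("-inf"), None
    for k in range(n + 1):
        w = [0.0] * n
        for rank, i in enumerate(order):
            w[i] = a[i] if rank < k else b[i]
        val = sum(wi * ri for wi, ri in zip(w, r)) / sum(w)
        if val > best_val:
            best_val, best_w = val, w
    return best_val, best_w


# Hypothetical data with unit-specific bounds.
r = [1.0, -2.0, 3.0, 0.5]
a = [0.5, 0.4, 0.6, 0.5]
b = [2.0, 2.5, 1.5, 3.0]
value, weights = max_normalized_sum(r, a, b)
```

This version costs O(n^2) because each candidate threshold recomputes the sums; maintaining running sums while sweeping the threshold would reduce the sweep to linear time after sorting.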
Appendix B Algorithm details
Subgradients for adversarial value function and logistic assignment policy. .
By Theorem 4.3 (essentially implicit value theorem) of [25], where , the weights achieving the optimal solution. For the linear problem, the parametric coefficients of the objective function are , and is an affine transformation of . By the subgradient chain rule, since is an affine function of , for , we have that .
We use the fact that the derivative of the sigmoid function is . Then by the chain rule, . We consider the vector and express as .
Then, by the chain rule for subgradients with affine functions, . ∎
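The chain-rule step for the logistic policy can be illustrated with a small sketch. It assumes a policy of the form sigmoid(theta . x) and an objective that, once the adversarial weights are held fixed at their optimal values, is a weighted sum of policy probabilities with fixed coefficients `c`. The coefficients `c`, the data `X`, and the function names are hypothetical stand-ins rather than the paper's exact quantities.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def policy_value(theta, X, c):
    """sum_i c[i] * sigmoid(theta . X[i]), with the coefficients c held fixed."""
    return sum(ci * sigmoid(dot(theta, x)) for ci, x in zip(c, X))


def policy_gradient(theta, X, c):
    """Gradient of policy_value with respect to theta via the chain rule."""
    grad = [0.0] * len(theta)
    for ci, x in zip(c, X):
        p = sigmoid(dot(theta, x))
        scale = ci * p * (1.0 - p)  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
        for j, xj in enumerate(x):
            grad[j] += scale * xj
    return grad


# Hypothetical parameters, covariates, and fixed coefficients.
theta = [0.3, -0.7]
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.25]]
c = [0.4, -1.2, 0.9]
grad = policy_gradient(theta, X, c)
```

Holding the coefficients fixed mirrors the subgradient argument: the inner optimal weights are treated as constants when differentiating with respect to the policy parameters.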
Appendix C Additional Algorithms for Policy Optimization
C.1 Optimal Decision Trees
Instead of considering probabilistic treatment policies , we can optimize deterministic assignment policies globally over the space of decision trees via integer programming. We introduce the integer assignment variables and compose the Optimal Classification Tree from [7] with the dual formulation of (5). We combine the primal constraints defining the tree structure with the objective function from the dual formulation. The space of policies is parametrized by the hyperplane and intercept vectors defining axis-aligned splits, where we define a tree structure and a split at a decision node assigns units to the left leaf if , and to the right leaf otherwise. [7] introduces the constraints in the program to enforce the hierarchical split structure, which we reproduce for completeness below. The program tracks a set of branching nodes and a set of leaf nodes , using the binary assignment variables to track assignment of data points to leaf subject to the requirement that every instance is assigned to a leaf node. The binary variables track whether a split occurs at node and maintain split hierarchy consistency.
The additional constraints that allow us to encode the dual objective are as follows: we define the policy assignment indicator where is the policy assignment label of leaf node , and describes whether or not instance is assigned to leaf node . We enforce this with the set of auxiliary big-M constraints for the product of binary variables (for the case of two treatments)
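As a generic illustration of how a product of two binary variables is encoded with linear constraints, the sketch below checks the textbook linearization z <= x, z <= y, z >= x + y - 1 by exhaustive enumeration. This is an assumption about the general technique and may differ from the exact constraints of the formulation; the names `x`, `y`, and `z` are hypothetical.

```python
from itertools import product


def linearization_feasible(x, y, z):
    """True if (x, y, z) satisfies z <= x, z <= y, and z >= x + y - 1."""
    return z <= x and z <= y and z >= x + y - 1


def feasible_products(x, y):
    """All binary z values allowed by the linear constraints for given x, y."""
    return [z for z in (0, 1) if linearization_feasible(x, y, z)]


# For every binary pair, the constraints leave exactly one feasible z.
table = {(x, y): feasible_products(x, y) for x, y in product((0, 1), repeat=2)}
```

For binary variables the only feasible `z` in each case equals `x * y`, which is why the product can be replaced by an auxiliary variable inside an integer program.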
In this formulation, we introduce the binary variable (i.e. a 0-1 indicator version of ) and express in terms of the 0-1 binary variable to clarify the connection to an optimal classification tree formulation with the 0-1 loss.
s.t. 