Confounding-Robust Policy Improvement


Nathan Kallus
School of Operations Research and Information Engineering and Cornell Tech, Cornell University

Angela Zhou
School of Operations Research and Information Engineering, Cornell University

We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the data-generating process. Unlike previous approaches that assume unconfoundedness, i.e., that no unobserved confounders affect both treatment assignment and outcome, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confounding-robust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference “status quo” policy over an uncertainty set around the nominal propensity weights. We prove that if the uncertainty set is well-specified, robust policy learning can do no worse than the baseline and improves upon it only if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies, such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on well-evidenced improvement.

1 Introduction

The problem of learning personalized decision policies to study “what works and for whom” in areas such as medicine and e-commerce often endeavors to draw insights from observational data, since data from randomized experiments may be scarce and costly or unethical to acquire [14, 3, 29, 8, 15]. These and other approaches for drawing conclusions from observational data in the Neyman-Rubin potential outcomes framework generally appeal to methodologies such as inverse-propensity weighting, matching, and balancing, which compare outcomes across groups constructed such that assignment is almost as if at random [22]. These methods rely on the controversial assumption of unconfoundedness, which requires that the data are sufficiently informative of treatment assignment such that no unobserved confounders jointly affect treatment assignment and individual response [23]. This key assumption may be made to hold ex ante by directly controlling the treatment assignment policy as sometimes done in online advertising [4], but in other domains of key interest such as personalized medicine where electronic medical records (EMRs) are increasingly being analyzed ex post, unconfoundedness may never truly hold in fact.

Assuming unconfoundedness, also called ignorability, conditional exogeneity, or selection on observables, is controversial because it is fundamentally unverifiable since the counterfactual distribution is not identified from the data, thus rendering any insights from observational studies vulnerable to this fundamental critique [13]. The growing availability of richer observational data such as found in EMRs renders unconfoundedness more plausible, yet it still may never be fully satisfied in practice. Because unconfoundedness may fail to hold, existing policy learning methods that assume it can lead to personalized decision policies that seek to exploit individual-level effects that are not really there, may intervene where not necessary, and may in fact lead to net harm rather than net good. Such dangers constitute obvious impediments to the use of policy learning to enhance decision making in such sensitive applications as medicine, public policy, and civics.

To address this deficiency, in this paper we develop a framework for robust policy learning and improvement that can ensure that a personalized decision policy derived from observational data, which inevitably may have some unobserved confounding, does no worse than a current policy such as the standard of care and in fact does better if the data can indeed support it. We do so by recognizing and accounting for the potential confounding in the data and require that the learned policy improve upon a baseline no matter the direction of confounding. Thus, we calibrate personalized decision policies to address sensitivity to realistic violations of the unconfoundedness assumption. For the purposes of informing reliable and personalized decision-making that leverages modern machine learning, point identification of individual-level causal effects, which previous approaches rely on, may not be at all necessary for success, but accounting for the lack of identification is.

Functionally, our approach is to optimize a policy to achieve the best worst-case improvement relative to a baseline treatment assignment policy such as treat all or treat none, where the improvement is measured using a weighted average of outcomes and weights take values in an uncertainty set around the nominal inverse propensity weights (IPW). This generalizes the popular class of IPW-based approaches to policy learning, which optimize an unbiased estimator for policy value under unconfoundedness [17, 27, 26]. Unlike standard approaches, in our approach the choice of baseline is material and changes the resulting policy chosen by our method. This framing supports reliable decision-making in practice, as often a practitioner is seeking evidence of substantial improvement upon the standard of care or a default option, and/or the intervention under consideration introduces risk of toxicity or adverse effects and should not be applied without strong evidence.

Our contributions are as follows: we provide a framework for performing policy improvement that is robust in the face of unobserved confounding. Our framework allows for the specification of data-driven uncertainty sets based on a sensitivity parameter $\Gamma$ describing a pointwise multiplicative bound on the odds of treatment, as well as for a global uncertainty budget given by a parameter $\rho$, which can be calibrated against the maximal discrepancy between the true propensities and nominal propensities. Leveraging the optimization structure of the robust subproblem, we provide a set of algorithms for performing policy optimization. We assess performance on a synthetic example as well as a large clinical trial.

2 Problem Statement and Preliminaries

We assume the observational data consist of tuples of random variables $\{(X_i, T_i, Y_i)\}_{i=1}^n$, comprising covariates $X_i \in \mathcal{X}$, assigned treatment $T_i \in \{0, 1\}$, and real-valued outcomes $Y_i \in \mathbb{R}$. Using the Neyman-Rubin potential outcomes framework, we let $Y_i(1)$ and $Y_i(0)$ denote the potential outcomes of applying treatment $1$ and $0$, respectively. We assume that the observed outcome is the potential outcome for the observed treatment, $Y_i = Y_i(T_i)$, encapsulating non-interference and consistency, also known as SUTVA [24]. We also use the convention that outcomes correspond to losses, so that lower outcomes are better.

We consider evaluating and learning a (randomized) treatment assignment policy $\pi : \mathcal{X} \to [0, 1]$ mapping covariates to the probability of assigning treatment. We focus on a policy class $\mathcal{F}$ of restricted complexity. Examples include linear policies $\pi_\theta(x) = \mathbb{I}[\theta^\top x \ge 0]$, logistic policies $\pi_\theta(x) = \sigma(\theta^\top x)$ where $\sigma(z) = 1/(1 + e^{-z})$, or decision trees of bounded depth. We allow the candidate policy to be either deterministic or stochastic, and denote the random variable indicating the realized treatment assignment under $\pi$ for some $x$ as a Bernoulli random variable $Z_\pi(x)$ such that $P(Z_\pi(x) = 1) = \pi(x)$.

The goal of policy evaluation is to assess the policy value

$$V(\pi) = \mathbb{E}\left[\pi(X)\, Y(1) + (1 - \pi(X))\, Y(0)\right],$$

the population average outcome induced by the policy $\pi$. The problem of policy optimization seeks the best such policy over the parametrized function class, $\min_{\pi \in \mathcal{F}} V(\pi)$. Both of these tasks are hindered by residual confounding, since then $V(\pi)$ cannot actually be identified from the data.

Motivated by the Rosenbaum sensitivity model [21] and without loss of generality, we assume that there is an additional but unobserved covariate $U_i$ such that unconfoundedness would hold if we were to control for both $X_i$ and $U_i$, that is, such that $Y_i(t) \perp T_i \mid X_i, U_i$ for $t \in \{0, 1\}$. Equivalently, we can treat the data as collected under an unknown logging policy that based its assignment on both $X_i$ and $U_i$ and that assigned $T_i = 1$ with probability $e(X_i, U_i) = P(T_i = 1 \mid X_i, U_i)$. Here, $e(X_i, U_i)$ is precisely the true propensity score of unit $i$. Since we do not have access to $U_i$ in our data, we instead presume that we have access only to nominal propensities $\hat e(X_i)$, which do not account for the potential unobserved confounding. These are either part of the data or can be estimated directly from the data using a probabilistic classification model such as logistic regression. For compactness, we denote by $e_{T_i}(X_i, U_i)$ and $\hat e_{T_i}(X_i)$ the true and nominal probabilities of receiving the treatment actually observed, e.g., $\hat e_{T_i}(X_i) = T_i \hat e(X_i) + (1 - T_i)(1 - \hat e(X_i))$.
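As one concrete instance of the estimation step just described, nominal propensities can be fit with any probabilistic classifier; the sketch below uses scikit-learn's LogisticRegression (the helper name and estimator choice are ours, not prescribed by the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nominal_propensities(X, T):
    """Fit the nominal propensity model P(T = 1 | X = x) by logistic
    regression and return each unit's nominal probability of receiving
    the treatment it was actually observed to receive."""
    model = LogisticRegression().fit(X, T)
    e1 = model.predict_proba(X)[:, 1]         # estimated P(T = 1 | X_i)
    return np.where(T == 1, e1, 1.0 - e1)     # nominal propensity of observed arm
```

In practice one would cross-fit or hold out data for this step; any calibrated probabilistic classifier could replace the logistic regression.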

2.1 Related Work

Our work builds upon the literatures on policy learning from observational data and on sensitivity analysis in causal inference.

Sensitivity analysis. Sensitivity analysis in causal inference tests the robustness of qualitative conclusions drawn from observational data to model specification or to assumptions such as unconfoundedness. Some approaches for assessing unconfoundedness require auxiliary data or additional structural assumptions, which we do not assume here [13]. We focus on the implications of confounding for personalized treatment decisions. The Rosenbaum model for sensitivity analysis assesses the robustness of randomization inference to the presence of unobserved confounding by considering a uniform bound $\Gamma$ on the impact of confounding on the odds ratio of treatment assignment [21], motivated by a logistic specification. More generally, such a restriction corresponds to a model where the odds ratio for two units with the same covariates $x$, which differs due to the units’ different values of the unobserved confounder, lies in $[\Gamma^{-1}, \Gamma]$, while the confounder itself may be arbitrary. The value of $\Gamma$ can be calibrated against the discrepancies induced by omitting observed variables; determining $\Gamma$ can then be phrased in terms of whether one thinks one has omitted a variable that could have increased or decreased the odds of treatment by as much as, say, gender or age do in the observed data [12].

In the sampling literature, the weight-normalized estimator for a population mean is known as the Hajek estimator, and Aronow and Lee [1] derive sharp bounds on this estimator under a uniform bound on the sampling probabilities, giving a closed-form solution to the resulting linear-fractional program. [31] considers bounds on the Hajek estimator but imposes a parametric model on the treatment assignment probability. [18] considers tightening the bounds on the Hajek estimator by adding shape constraints, such as log-concavity, on the cumulative distribution of outcomes.

Policy learning from observational data under unconfoundedness. A variety of approaches for learning personalized intervention policies that maximize causal effect have been proposed, but all under the assumption of unconfoundedness. These fall under regression-based strategies [20], reweighting-based strategies [3, 14, 15, 27], or doubly robust combinations thereof [8, 29]. Regression-based strategies estimate the conditional average treatment effect (CATE), $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$, either directly or by differencing two regressions, and use it to score the policy. Without unconfoundedness, however, the CATE is not identifiable from the data and these methods carry no guarantees.

Reweighting-based strategies use inverse-probability weighting (IPW) to change measure from the outcome distribution induced by the logging policy to that induced by the policy $\pi$. Specifically, these methods use the fact that, under unconfoundedness, $\hat V_{\mathrm{IPW}}(\pi)$ is unbiased for $V(\pi)$ [17], where

$$\hat V_{\mathrm{IPW}}(\pi) = \frac{1}{n} \sum_{i=1}^n \frac{\pi_{T_i}(X_i)}{\hat e_{T_i}(X_i)}\, Y_i, \qquad \pi_1(x) = \pi(x), \quad \pi_0(x) = 1 - \pi(x). \tag{1}$$

Optimizing $\hat V_{\mathrm{IPW}}(\pi)$ can be phrased as a weighted classification problem [3]. Since dividing by propensities can lead to extreme weights and high-variance estimates, additional strategies such as clipping the propensities away from 0 and normalizing by the sum of weights as a control variate are typically necessary for good performance [26, 30]. With or without these fixes, if there are unobserved confounders, none of these estimates is consistent for $V(\pi)$, and learned policies may introduce more harm than good.
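The reweighting estimator and the two variance fixes just mentioned (clipping and self-normalization) can be sketched as follows; the function and argument names are illustrative:

```python
import numpy as np

def ipw_value(pi_obs, e_obs, Y, clip=0.05, self_normalize=True):
    """IPW estimate of policy value: weight each observed outcome by
    pi_{T_i}(X_i) / e_{T_i}(X_i), where pi_obs and e_obs are the policy's
    and the logging policy's probabilities of the observed treatments.
    Clipping keeps propensities away from 0; self-normalization divides
    by the sum of weights (the Hajek form) as a control variate."""
    w = pi_obs / np.clip(e_obs, clip, 1.0)
    if self_normalize:
        return np.sum(w * Y) / np.sum(w)
    return np.mean(w * Y)
```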

A separate literature in reinforcement learning considers the idea of safe policy improvement by minimizing the regret against a baseline policy, forming an uncertainty set around the presumed unknown transition probabilities between states as in [19], or forming a trust region for safe policy exploration via concentration inequalities on importance-reweighted estimates of policy risk [28].

3 Robust policy evaluation and improvement

Our framework for confounding-robust policy improvement minimizes a bound on the policy regret against a specified baseline policy $\pi_0$, $R(\pi, \pi_0) = V(\pi) - V(\pi_0)$. Our bound is achieved by maximizing a reweighting-based regret estimate over an uncertainty set around the nominal propensities. This ensures that we cannot do worse than $\pi_0$ and may do better, even if the data is confounded.

The baseline policy $\pi_0$ can be any fixed policy that we want to be sure not to do worse than, or not to deviate from unnecessarily. This is usually the current standard of care, established from prior evidence, and can be a policy that actually depends on $x$. Generally, we think of this as the policy that always assigns control, $\pi_0(x) = 0$. Alternatively, if a reliable estimate $\hat\tau$ of the average treatment effect $\tau = \mathbb{E}[Y(1) - Y(0)]$ is available, then $\pi_0$ can be the constant policy that treats everyone if $\hat\tau < 0$ (recalling that outcomes are losses) and treats no one otherwise. In an agnostic extreme, $\pi_0$ can be the complete randomization policy $\pi_0(x) = 1/2$.

3.1 Confounding-robust policy learning by optimizing minimax regret

If we had oracle access to the true inverse propensities $W_i^* = 1/e_{T_i}(X_i, U_i)$, we could form the correct IPW estimate by replacing nominal with true propensities in eq. (1). We may go a step further and, recognizing that $\mathbb{E}[\pi_{T_i}(X_i)\, W_i^*] = 1$, use the empirical sum of weights as a control variate by normalizing our IPW estimate. This gives rise to the following Hajek estimators of $V(\pi)$ and $R(\pi, \pi_0)$, correspondingly:

$$\hat V(\pi; W^*) = \frac{\sum_{i=1}^n W_i^* \pi_{T_i}(X_i)\, Y_i}{\sum_{i=1}^n W_i^* \pi_{T_i}(X_i)}, \qquad \hat R(\pi, \pi_0; W^*) = \frac{\sum_{i=1}^n W_i^* \left( \pi_{T_i}(X_i) - \pi_{0, T_i}(X_i) \right) Y_i}{\sum_{i=1}^n W_i^*}.$$

It follows by Slutsky’s theorem that these estimates remain consistent (if we know $W^*$). Note that had we known $W^*$, both the normalization and the choice of $\pi_0$ would have amounted to constant shifts and scales that would not have changed the choice of $\pi$ minimizing the regret estimate. This will not be true of our bound, where both the normalization and the choice of $\pi_0$ will be material.

Since the oracle weights are unknown, we instead minimize the worst-case possible value of our regret estimate, by ranging over the space of possible values for $W$ that are consistent with the observed data and our assumptions about the confounded data-generating process. Specifically, our model restricts the extent to which unobserved confounding may affect assignment probabilities. We first consider an uncertainty set motivated by the odds-ratio characterization in [21], which restricts how far the weights can vary pointwise from the nominal propensities. Given a bound $\Gamma \ge 1$, the odds-ratio restriction on the true propensities is that they satisfy the following inequalities for all $i$ and all possible $u$:

$$\frac{1}{\Gamma} \;\le\; \frac{\left(1 - \hat e(X_i)\right) e(X_i, u)}{\hat e(X_i) \left(1 - e(X_i, u)\right)} \;\le\; \Gamma. \tag{2}$$
This restriction is motivated by (but is more general than) a logistic model where $\mathrm{logit}\, e(x, u) = g(x) + \gamma u$, $g$ is any function, $u \in [0, 1]$ is bounded without loss of generality, and $\Gamma = e^{\gamma}$. Such a model would necessarily give rise to eq. (2). This restriction also immediately leads to an uncertainty set for the true inverse propensities of the observed treatments of each unit, $W_i^* = 1/e_{T_i}(X_i, U_i)$, which we denote as follows:

$$\mathcal{U}_n^\Gamma = \left\{ W \in \mathbb{R}^n : a_i \le W_i \le b_i, \; i = 1, \dots, n \right\}, \qquad a_i = 1 + \frac{1}{\Gamma}\left( \frac{1}{\hat e_{T_i}(X_i)} - 1 \right), \quad b_i = 1 + \Gamma \left( \frac{1}{\hat e_{T_i}(X_i)} - 1 \right).$$
The corresponding bound on empirical regret is $\bar R_{\mathcal{U}_n^\Gamma}(\pi, \pi_0)$, where for any uncertainty set $\mathcal{U}$ we define

$$\bar R_{\mathcal{U}}(\pi, \pi_0) = \sup_{W \in \mathcal{U}} \; \frac{\sum_{i=1}^n W_i \left( \pi_{T_i}(X_i) - \pi_{0, T_i}(X_i) \right) Y_i}{\sum_{i=1}^n W_i}.$$
We then choose the policy in our class that minimizes this regret bound, i.e.,

$$\hat\pi \in \operatorname{argmin}_{\pi \in \mathcal{F}} \; \bar R_{\mathcal{U}}(\pi, \pi_0). \tag{3}$$
In particular, for our estimate, weight normalization is crucial for enforcing robustness only against consequential realizations of confounding, those that affect the relative weighting of patient outcomes; otherwise the adversary would simply push every weight to its highest possible bound wherever the corresponding term is positive. If the baseline policy $\pi_0$ is in the policy class $\mathcal{F}$, it already achieves 0 regret; thus, minimizing regret necessitates learning regions of treatment assignment where evidence from observed outcomes suggests benefits in terms of decreased loss. Different baseline policies structurally change the solution to the adversarial subproblem by shifting the contribution of the loss term to emphasize improvement upon the baseline.

Budgeted uncertainty sets to address “local” confounding. Our approach can be pessimistic in ensuring robustness against worst-case realizations of unobserved confounding “globally” for each unit, whereas concerns about unobserved confounding may be restricted to subgroup risk factors or outliers. For the Rosenbaum model in hypothesis testing, this has been recognized by [9, 11], who address it by limiting the average of the unobserved propensities by an additional sensitivity parameter. Motivated by this, we next consider an alternative uncertainty set, where we fix a budget $\rho$ for how much the weights can diverge from the nominal inverse propensity weights in total. Specifically, letting $\hat W_i = 1/\hat e_{T_i}(X_i)$ denote the nominal inverse propensity weights, we construct the uncertainty set

$$\mathcal{U}_n^{\Gamma, \rho} = \left\{ W \in \mathcal{U}_n^\Gamma : \sum_{i=1}^n \left| W_i - \hat W_i \right| \le \rho \right\}.$$
When plugged into eq. (3), this provides an alternative policy-choice criterion that is less conservative. We suggest calibrating $\rho$ as a fraction $\beta \in (0, 1)$ of the total deviation allowed by $\mathcal{U}_n^\Gamma$, specifically $\rho = \beta \sum_{i=1}^n (b_i - a_i)$. This is the approach we take in our empirical investigation.
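Under the L1-budget reading above, the worst-case normalized objective over the budgeted set can be written as an explicit linear program after a Charnes-Cooper change of variables; a sketch using `scipy.optimize.linprog` (the variable layout and names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def budgeted_worst_case_mean(r, a, b, w_hat, rho):
    """Maximize sum(W * r) / sum(W) subject to a <= W <= b and
    ||W - w_hat||_1 <= rho, via the Charnes-Cooper transformation:
    variables x = [w (normalized weights), sigma (scaled slacks), t]."""
    n = len(r)
    c = np.concatenate([-r, np.zeros(n + 1)])        # linprog minimizes
    A_eq = np.concatenate([np.ones(n), np.zeros(n + 1)])[None, :]  # sum w = 1
    I, Z = np.eye(n), np.zeros((n, n))
    A_ub = np.vstack([
        np.hstack([ I, Z, -b[:, None]]),             # w_i <= b_i * t
        np.hstack([-I, Z,  a[:, None]]),             # w_i >= a_i * t
        np.hstack([ I, -I, -w_hat[:, None]]),        # w_i - w_hat_i t <= sigma_i
        np.hstack([-I, -I,  w_hat[:, None]]),        # w_hat_i t - w_i <= sigma_i
        np.concatenate([np.zeros(n), np.ones(n), [-rho]])[None, :],  # sum sigma <= rho t
    ])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(4 * n + 1),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (2 * n + 1))
    return -res.fun
```

With a large budget this recovers the pointwise-only worst case; with $\rho = 0$ it pins the weights at their nominal values.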

3.2 The Improvement Guarantee

We next prove that if we appropriately bounded the potential hidden confounding, then our worst-case empirical regret objective is asymptotically an upper bound on the true population regret. On the one hand, since our objective is necessarily non-positive if $\pi_0 \in \mathcal{F}$, this says we do no worse. On the other hand, if our objective is negative, which we can check by simply evaluating it, then we are assured some strict improvement. Our result is generic for both $\mathcal{U}_n^\Gamma$ and $\mathcal{U}_n^{\Gamma, \rho}$.

Our upper bound depends on the complexity of our policy class $\mathcal{F}$. Define its Rademacher complexity:

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\left[ \sup_{\pi \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \epsilon_i\, \pi(X_i) \right| \right], \qquad \epsilon_1, \dots, \epsilon_n \ \text{iid uniform on } \{-1, +1\}.$$

All the policy classes we consider have $O(1/\sqrt{n})$-vanishing complexities, i.e., $\mathfrak{R}_n(\mathcal{F}) = O(1/\sqrt{n})$.

Theorem 1.

Suppose that $W^* \in \mathcal{U}_n$, that $|Y_i| \le B$ almost surely for some $B < \infty$, and that $b_i \le b$ for some $b < \infty$. Then, for any $\delta \in (0, 1)$, we have that with probability at least $1 - \delta$, simultaneously for all $\pi \in \mathcal{F}$,

$$R(\pi, \pi_0) \;\le\; \bar R_{\mathcal{U}_n}(\pi, \pi_0) + O\!\left( \mathfrak{R}_n(\mathcal{F}) + \sqrt{\log(1/\delta)/n} \right). \tag{4}$$
In particular, if we let $\hat\pi$ be as in eq. (3), then eq. (4) holds for $\hat\pi$, which minimizes the right-hand side. So, if the objective is negative, we are (almost) assured of getting some improvement over $\pi_0$. At the same time, so long as $\pi_0 \in \mathcal{F}$, the objective is necessarily non-positive, so we are also (almost) assured of doing no worse than $\pi_0$. All this without really being able to identify any effects, due to the hidden confounding. Thus, Theorem 1 exactly captures the allure of our approach.

4 Optimizing Robust Policies

We next discuss how to solve the policy optimization problem in eq. (3). In the main text, we focus on parametrized policy classes, $\mathcal{F} = \{\pi_\theta : \theta \in \Theta\}$, such as linear policies. In Appendix D we also consider tree-based policies, including forest ensembles. In particular, we provide a greedy recursive partitioning algorithm as well as a mixed-integer programming algorithm to find a globally optimal confounding-robust decision tree.

For the remainder we focus on parameterized policies. We first discuss how to solve the worst-case regret subproblem for a fixed policy, which we will then use to develop our algorithm.

4.1 Dual Formulation of Worst-Case Regret

The minimization over $\pi \in \mathcal{F}$ in eq. (3) involves an inner supremum, namely $\bar R_{\mathcal{U}}(\pi, \pi_0) = \sup_{W \in \mathcal{U}} \hat R(\pi, \pi_0; W)$. Moreover, this supremum over weights does not, on the face of it, appear to be a convex problem. We next proceed to characterize this supremum, formulate it as a linear program, and, by dualizing it, provide an efficient procedure for finding the pessimal weights.

For compactness and generality, we address the optimization problem parameterized by an arbitrary reward vector $r \in \mathbb{R}^n$:

$$\bar Q(r) = \sup_{W} \left\{ \frac{\sum_{i=1}^n r_i W_i}{\sum_{i=1}^n W_i} : W \in \mathcal{U} \right\}. \tag{5}$$
To recover $\bar R_{\mathcal{U}}(\pi, \pi_0)$, we would simply set $r_i = (\pi_{T_i}(X_i) - \pi_{0, T_i}(X_i))\, Y_i$. Since $\mathcal{U}_n^\Gamma$ involves only linear constraints on $W$, eq. (5) for $\mathcal{U} = \mathcal{U}_n^\Gamma$ is a linear-fractional program. We can reformulate it as a linear program by applying the Charnes-Cooper transformation [6]: requiring the weights to sum to 1 and rescaling the pointwise bounds by a nonnegative scale factor $t$. Letting $w_i$ denote the normalized weights, we obtain the following equivalent linear program:

$$\bar Q(r) = \max_{w, t} \left\{ \sum_{i=1}^n r_i w_i : \sum_{i=1}^n w_i = 1, \; a_i t \le w_i \le b_i t, \; t \ge 0 \right\}. \tag{6}$$
The dual problem to eq. (6) has a free dual variable $\lambda$ for the weight-normalization constraint and nonnegative dual variables $u_i$ and $v_i$ for the lower-bound and upper-bound constraints on the weights, respectively, and is given by

$$\min_{\lambda,\, u \ge 0,\, v \ge 0} \left\{ \lambda : v_i - u_i = r_i - \lambda \;\; \forall i, \quad \sum_{i=1}^n b_i v_i \le \sum_{i=1}^n a_i u_i \right\}.$$
We use this to show that solving the adversarial subproblem requires only sorting the data and ternary search to optimize a unimodal function, generalizing the result of Aronow and Lee [1] to arbitrary pointwise bounds on the weights. Crucially, this algorithmically efficient solution allows for faster subproblem solutions when optimizing our regret bound over policies in a given policy class.

Theorem 2 (Normalized optimization solution).

Let $(\cdot)$ denote the ordering such that $r_{(1)} \ge r_{(2)} \ge \cdots \ge r_{(n)}$. Then $\bar Q(r) = \max_{k \in \{0, 1, \dots, n\}} \lambda(k)$, where

$$\lambda(k) = \frac{\sum_{i=1}^{k} b_{(i)} r_{(i)} + \sum_{i=k+1}^{n} a_{(i)} r_{(i)}}{\sum_{i=1}^{k} b_{(i)} + \sum_{i=k+1}^{n} a_{(i)}}.$$

Moreover, $\lambda(k)$ is a discrete concave unimodal function of $k$.
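The threshold structure in Theorem 2 yields a solver-free implementation: sort the rewards, sweep the threshold, and take the best normalized mean. A sketch (names are ours; a linear sweep over the threshold is used instead of ternary search for clarity):

```python
import numpy as np

def worst_case_mean(r, a, b):
    """Maximize sum(W * r) / sum(W) over box constraints a <= W <= b.
    The optimum sets W to its upper bound on the k largest rewards and
    to its lower bound elsewhere, for the best threshold k."""
    order = np.argsort(-r)                  # rewards in decreasing order
    r, a, b = r[order], a[order], b[order]
    num, den = np.sum(a * r), np.sum(a)     # k = 0: all weights at lower bound
    best = num / den
    for k in range(len(r)):                 # move the k-th unit up to its bound
        num += (b[k] - a[k]) * r[k]
        den += b[k] - a[k]
        best = max(best, num / den)
    return best
```

After the O(n log n) sort the sweep is linear; unimodality of the objective in k also permits ternary search, as noted above.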

Next we consider $\mathcal{U}_n^{\Gamma, \rho}$. We can write an extended formulation for the budget constraint using only linear constraints, introducing slack variables $s_i \ge |W_i - \hat W_i|$:

$$\mathcal{U}_n^{\Gamma, \rho} = \left\{ W : a_i \le W_i \le b_i, \; \exists s \ \text{s.t.} \ -s_i \le W_i - \hat W_i \le s_i, \; \sum_{i=1}^n s_i \le \rho \right\}.$$
This immediately shows that eq. (5) over $\mathcal{U}_n^{\Gamma, \rho}$ remains a linear-fractional program. Indeed, letting $\sigma_i = t s_i$, a similar Charnes-Cooper transformation as used above with the additional normalization yields a non-fractional linear programming formulation:

$$\max_{w, \sigma, t} \left\{ \sum_{i=1}^n r_i w_i : \sum_{i=1}^n w_i = 1, \; a_i t \le w_i \le b_i t, \; -\sigma_i \le w_i - \hat W_i t \le \sigma_i, \; \sum_{i=1}^n \sigma_i \le \rho t, \; t \ge 0 \right\}.$$
The corresponding dual problem, with nonnegative multipliers $u_i, v_i$ for the pointwise bounds, $p_i, q_i$ for the two budget-slack constraints, and $\kappa$ for the budget itself, is:

$$\min_{\lambda,\, u, v, p, q, \kappa \ge 0} \left\{ \lambda : v_i - u_i + p_i - q_i = r_i - \lambda \;\; \forall i, \quad p_i + q_i = \kappa \;\; \forall i, \quad \sum_{i=1}^n \left( a_i u_i - b_i v_i + \hat W_i (q_i - p_i) \right) \ge \rho \kappa \right\}.$$

As this remains a linear program, we can easily solve it (or its primal) using off-the-shelf solvers.

4.2 Optimizing Parametric Policies

We next consider a subgradient algorithm (Algorithm 1) that at each iteration solves the worst-case regret subproblem efficiently for a fixed policy in a differentiable (or subdifferentiable) policy class parametrized by $\theta$, such as logistic assignment policies $\pi_\theta(x) = \sigma(\theta^\top x)$. We let $\pi_\theta(X_i)$ denote the policy assignment for each individual unit and $\pi_\theta(X)$ denote the policy assignment vector. We optimize the value function of the inner problem,

$$Q(\theta) = \sup_{W \in \mathcal{U}} \; \frac{\sum_{i=1}^n W_i \left( \pi_{\theta, T_i}(X_i) - \pi_{0, T_i}(X_i) \right) Y_i}{\sum_{i=1}^n W_i}.$$

From the viewpoint of parametric programming, the linear program of eq. (6) is parametrized by $\theta$ insofar as $\theta$ affects the objective coefficients $r$ on the inner decision variable $w$, via the affine transformation $r_i(\theta) = (\pi_{\theta, T_i}(X_i) - \pi_{0, T_i}(X_i))\, Y_i$ of the policy assignment vector. We presume a subgradient $\nabla_\theta \pi_\theta(x)$ of the policy with respect to $\theta$ is provided, and let $\nabla_\theta \pi_\theta(X)$ denote the collection of subgradients over the data. By a theorem of [25] and the subgradient chain rule for affine transformations, a subgradient of the value function at $\theta$ is $\nabla_\theta \hat R(\pi_\theta, \pi_0; W^*(\theta))$, where $W^*(\theta)$ solves the inner supremum. At each iteration, we compute the worst-case weights by an efficient subroutine Weights($\theta$) and perform a subgradient step on $\theta$. Using this method, we can optimize policies over both the unbudgeted uncertainty set $\mathcal{U}_n^\Gamma$ and the budgeted uncertainty set $\mathcal{U}_n^{\Gamma, \rho}$. In Appendix B, we discuss the concrete form this takes for parametric logistic policies.
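A compact sketch of this alternation for logistic policies over the pointwise (unbudgeted) uncertainty set follows. All names are ours; `pi0_obs` is the vector of baseline probabilities of each unit's observed treatment (e.g., `1 - T` for the all-control baseline), and the gradient step treats the worst-case weights as fixed, in the Danskin style described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def worst_case_weights(r, a, b):
    """Maximizer of sum(W * r) / sum(W) over a <= W <= b: upper bounds
    on the largest rewards, lower bounds elsewhere."""
    order = np.argsort(-r)
    W = a.astype(float).copy()
    num = np.sum(a[order] * r[order])
    den = np.sum(a[order])
    best_val, best_k = num / den, 0
    for k, i in enumerate(order):
        num += (b[i] - a[i]) * r[i]
        den += b[i] - a[i]
        if num / den > best_val:
            best_val, best_k = num / den, k + 1
    W[order[:best_k]] = b[order[:best_k]]
    return W

def robust_logistic_policy(X, T, Y, pi0_obs, a, b, steps=500, lr=1.0):
    """Minimize the worst-case normalized regret over logistic policies
    pi_theta(x) = sigmoid(theta @ x), alternating the inner weight
    subproblem with a subgradient step on theta."""
    n, d = X.shape
    theta = np.zeros(d)
    sign = 2 * T - 1                             # +1 if treated, -1 if control
    for t in range(1, steps + 1):
        p = sigmoid(X @ theta)
        pi_obs = np.where(T == 1, p, 1 - p)      # pi_{theta, T_i}(X_i)
        r = (pi_obs - pi0_obs) * Y               # per-unit regret rewards
        W = worst_case_weights(r, a, b)          # adversarial weights
        Ws = W / W.sum()
        grad = X.T @ (Ws * sign * p * (1 - p) * Y)   # gradient at fixed weights
        theta -= (lr / np.sqrt(t)) * grad            # diminishing step sizes
    return theta
```

On data where treatment only adds loss, the learned policy's treatment probabilities shrink toward zero, i.e., toward the all-control baseline.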

5 Experiments

Simulated data. We first consider a simple linear model specification demonstrating the possible effects of significant confounding on inverse-propensity weighted estimators.

We introduce the outcome-dependent confounder . We let the nominal propensities be logistic in , with . We let the confounded propensities take on the larger bound, where , if , and the lower bound otherwise. The constant treatment effect is with the linear interaction . The data mean is . The noise term affects outcomes with coefficients , in addition to a uniform noise term .

Figure 1(a): Policy performance on synthetic data

Algorithm 1: Parametric Subgradient Descent
1: given $\Gamma$, step-schedule exponent, initial iterate $\theta_0$, subproblem routine Weights($\theta$), policy subgradient $\nabla_\theta \pi_\theta$
2: for $t = 1, \dots, T$ do
3:    $W_t \leftarrow$ Weights($\theta_{t-1}$)
4:    $\theta_t \leftarrow \theta_{t-1} - \eta_t \nabla_\theta \hat R(\pi_{\theta_{t-1}}, \pi_0; W_t)$
5: end for

In Fig. 1(a), we compare results from averaging over 50 datasets drawn from this model, training our robust policies RIC.Log (unbudgeted) and RIC.L1 (budgeted) with a budget multiplier , IPW with a probabilistic policy, self-normalized POEM [26], and causal forest estimates of CATE on examples and testing on examples. We consider regret against the control assignment policy for our methods assessing robust improvement (RIC). We vary $\Gamma$ for each replication. The optimal unrestricted policy achieves average regret of , while SNPOEM remains confounded and achieves . Causal forests achieve . For optimization, we take the best of 15 restarts and a step-schedule of . Our conservative robust policies achieve substantial improvement, though the uncertainty-budgeted approaches incur greater variance due to flatter regions of the nonconvex value function. As $\Gamma$ increases, the learned robust policies for the parametric approaches converge to the all-control policy. Even in this extreme example of confounding, where the true propensities attain the odds-ratio bounds, the budgeted version is able to attain similar robust improvements, sometimes outperforming the unbudgeted version for . The best improvements are achieved around , consistent with the specification; the policies tend toward control as the possible confounding increases.

Assessment with Clinical Data: International Stroke Trial.

We build an evaluation framework for our methods from real-world data, where the counterfactuals are not known, by simulating confounded selection into a training dataset and estimating out-of-sample policy regret on a held-out “test set” from the completely randomized controlled trial. We study the International Stroke Trial (IST), comparing the aspirin-plus-heparin (high dose) vs. aspirin-only treatment arms from the original factorial design, numbering 7233 cases [10]. We defer some details about the dataset to Appendix E. Findings from the study suggest a clear reduction in adverse events (recurrent stroke or death) from aspirin, whereas heparin efficacy is inconclusive, since a small (non-significant) benefit in rates of death at 6 months was offset by a greater incidence of other adverse events such as hemorrhage or cranial bleeding. We construct an evaluation framework from the dataset by fixing a split into a training set and a held-out test set, and subsampling a final set of initial patients, whose data is then used to train treatment assignment policies. We generate nominal selection probabilities into the trial, letting denote inclusion, as , where is rescaled. We consider nominal propensities as . We introduce confounding by censoring the treated patients with the worst 10% of outcomes, and the 10% best patients in the control group.

The original trial measured a set of clinical outcomes including death, stroke recurrence, adverse effects, and recovery, which we scalarize as a composite loss function. A difference-in-means estimate of the ATE for the composite score in the full data is . Without access to the true counterfactual outcomes for patients, our oracle estimates are IPW-based estimates from the held-out RCT data. We estimate regret against the control policy with the empirical self-normalized estimate of . In Fig. 1(b), we evaluate on 10 draws from the dataset, comparing our policies against the vanilla IPW estimator with a probabilistic policy, self-normalized POEM [26], and assigning based on the sign of the CATE prediction from causal forests [29]. The selected datasets average a size of . We evaluate logistic parametric policies, unbudgeted (RIC.Log) and budgeted (RIC.L1) with budget multiplier . For the parametric policies, we optimize with the same parameters as earlier. We evaluate $\Gamma$ on a grid: every 0.025 between and , and every 0.2 between and . For small values of $\Gamma$, our methods perform similarly to IPW. As $\Gamma$ increases, our methods achieve policy improvement, though the L1-budgeted method (RIC.L1) achieves worse performance. For large $\Gamma$, the robust policy essentially learns the all-control policy; our finite-sample regret estimate simply indicates good regret for a negligible number of patients (5-6).

(b) Out-of-sample policy regret
(c) Percentage of patients with treatment assigned
(d) Average death prognosis among treated
Figure 1: Comparison of policy performance on clinical trial (IST) data as $\Gamma$ increases

In Figs. 1(c)-1(d), we study the behavior of the robust policies. The IST trial recorded a prognosis score (probability of death at 6 months) for patients, using an externally validated model, which we do not include in the training data but use to assess the validity of our robust policy. In Fig. 1(d), we consider the average prognosis score of death among patients treated under the learned policy. In Fig. 1(c), for , the policy considers treating of patients, and the average prognosis score of the population under consideration increases, indicating that the policy is learning to treat on appropriate indicators of severity from the available covariates. For larger $\Gamma$, the noise in the prognosis score is due to the small treated subgroups. Our learned policies suggest that improvements from heparin may be seen in the highest-risk patients, consistent with the findings of [2], a systematic review comparing anticoagulants such as heparin against aspirin. It concludes from a study of trials including IST that heparin provides little therapeutic benefit, with the caveat that the trial evidence base is lacking for the highest-risk patients, where heparin may be of benefit. Thus, our robust method appropriately treats those, and only those, who stand to benefit from the more aggressive treatment regime.

6 Conclusion

We developed a framework for estimating and optimizing robust policy improvement, which minimizes the worst-case (minimax) regret of a candidate personalized decision policy against a baseline policy. We optimize over uncertainty sets centered at the nominal propensities, and leverage the optimization structure of normalized estimators to perform policy optimization efficiently by subgradient descent on the robust risk. Assessments on synthetic and clinical data demonstrate the benefits of robust policy improvement.


  • Aronow and Lee [2012] P. Aronow and D. Lee. Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika, 2012.
  • Berge and Sandercock [2002] E. Berge and P. A. Sandercock. Anticoagulants versus antiplatelet agents for acute ischaemic stroke. The Cochrane Library of Systematic Reviews, 2002.
  • Beygelzimer and Langford [2009] A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009.
  • Bottou et al. [2013] L. Bottou, J. Peters, J. Quinonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems. Journal of Machine Learning Research, 2013.
  • Breiman et al. [1984] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Chapman and Hall, 1984.
  • Charnes and Cooper [1962] A. Charnes and W. Cooper. Programming with linear fractional functionals. Naval Research Logistics Quarterly, 1962.
  • Duchi and Namkoong [2017] J. Duchi and H. Namkoong. Variance-based regularization with convex objectives. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.
  • Dudik et al. [2014] M. Dudik, D. Erhan, J. Langford, and L. Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
  • Fogarty and Hasegawa [2017] C. Fogarty and R. Hasegawa. An extended sensitivity analysis for heterogeneous unmeasured confounding. 2017.
  • Group [1997] International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19435 patients with acute ischaemic stroke. Lancet, 1997.
  • Hasegawa and Small [2017] R. Hasegawa and D. Small. Sensitivity analysis for matched pair analysis of binary data: From worst case to average case analysis. Biometrics, 2017.
  • Hsu and Small [2013] J. Y. Hsu and D. S. Small. Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69(4):803–811, 2013.
  • Imbens and Rubin [2015] G. Imbens and D. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • Kallus [2017] N. Kallus. Recursive partitioning for personalization using observational data. Proceedings of the Thirty-Fourth International Conference on Machine Learning, 2017.
  • Kitagawa and Tetenov [2015] T. Kitagawa and A. Tetenov. Empirical welfare maximization. 2015.
  • Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer, 1991.
  • Li et al. [2011] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proceedings of the fourth ACM international conference on web search and data mining, 2011.
  • Miratrix et al. [2018] L. W. Miratrix, S. Wager, and J. R. Zubizarreta. Shape-constrained partial identification of a population mean under unknown probabilities of sample selection. Biometrika, 2018.
  • Petrik et al. [2016] M. Petrik, M. Ghavamzadeh, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. 29th Conference on Neural Information Processing Systems, 2016.
  • Qian and Murphy [2011] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
  • Rosenbaum [2002] P. Rosenbaum. Observational Studies. Springer Series in Statistics, 2002.
  • Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983.
  • Rubin [1974] D. Rubin. Estimating causal effect of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974.
  • Rubin [1980] D. B. Rubin. Comments on “randomization analysis of experimental data: The fisher randomization test comment”. Journal of the American Statistical Association, 75(371):591–593, 1980.
  • Still [2018] G. Still. Lectures on parametric optimization: An introduction. Optimization Online, 2018.
  • Swaminathan and Joachims [2015a] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. Proceedings of NIPS, 2015a.
  • Swaminathan and Joachims [2015b] A. Swaminathan and T. Joachims. Counterfactual risk minimization. Journal of Machine Learning Research, 2015b.
  • Thomas et al. [2015] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Wager and Athey [2017] S. Wager and S. Athey. Efficient policy learning. 2017.
  • Wang et al. [2017] Y.-X. Wang, A. Agarwal, and M. Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. Proceedings of Neural Information Processing Systems 2017, 2017.
  • Zhao et al. [2017] Q. Zhao, D. S. Small, and B. B. Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. ArXiv, 2017.

Appendix A Proofs

Proof of Theorem 1.

Let . Then

Since and , Hoeffding’s inequality gives


and note that since we have that satisfies bounded differences with constant . Hence, McDiarmid’s inequality gives

Next, because , a standard symmetrization argument gives that

Further, by the Rademacher comparison lemma [16, Thm. 4.12], we get that

Next, satisfies bounded differences with constant so McDiarmid’s inequality gives

Finally, Hoeffding’s inequality gives that

Combining, we get that with probability at least , assuming that , we have that is bounded above by

Letting and , the above is bounded by so long as . The proof is completed by noting that by assumption of true weights being inside we get that . ∎
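For reference, the two concentration inequalities invoked above are, in their standard forms:

```latex
% Hoeffding's inequality: for independent X_1, ..., X_n with a_i <= X_i <= b_i,
\mathbb{P}\left( \sum_{i=1}^n \big( X_i - \mathbb{E}[X_i] \big) \geq t \right)
  \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)

% McDiarmid's inequality: if f satisfies the bounded-differences property
% with constants c_1, ..., c_n, then for independent X_1, ..., X_n,
\mathbb{P}\left( f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \geq t \right)
  \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^n c_i^2} \right)
```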

Equivalence of Fractional Program (5) and Linear Program (6).

We verify that a feasible solution of one problem yields a feasible solution of the other: for a feasible solution to (FP), we can generate a feasible solution to (LP) as with the same objective value. In the other direction, we can generate a feasible solution to (6) from a feasible solution to the fractional program (5) by taking . This solution has the same objective value since . ∎
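To make the Charnes-Cooper equivalence concrete, here is a small numerical sketch, assuming the inner problem takes the box-constrained form max Σᵢ wᵢYᵢ / Σᵢ wᵢ subject to lo ≤ w ≤ hi with lo > 0 (function names are ours, not from the paper). The LP value is checked against brute-force enumeration of the box vertices, where a linear-fractional objective attains its optimum:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def charnes_cooper_max(y, lo, hi):
    """Maximize sum(w_i * y_i) / sum(w_i) over lo <= w <= hi (lo > 0)
    via the Charnes-Cooper transformation v = t*w with t = 1/sum(w)."""
    y, lo, hi = map(lambda a: np.asarray(a, float), (y, lo, hi))
    n = len(y)
    # Variables: [v_1, ..., v_n, t]. Maximize y^T v  ->  minimize -y^T v.
    c = np.concatenate([-y, [0.0]])
    # Box constraints become v - t*hi <= 0 and -v + t*lo <= 0.
    A_ub = np.block([[np.eye(n), -hi.reshape(-1, 1)],
                     [-np.eye(n), lo.reshape(-1, 1)]])
    b_ub = np.zeros(2 * n)
    # Normalization of the denominator: sum(v) = 1.
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * n + [(0, None)])
    return -res.fun

def brute_force_max(y, lo, hi):
    """A linear-fractional objective over a box is maximized at a vertex,
    so enumerate all w with w_i in {lo_i, hi_i} (small n only)."""
    best = -np.inf
    for corner in product(*zip(lo, hi)):
        w = np.array(corner)
        best = max(best, float(w @ np.asarray(y)) / w.sum())
    return best
```

On random instances the two routines agree to solver tolerance, which is exactly the equivalence argued above.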

Proof of Thm. 2.

We analyze the program using complementary slackness, which yields an algorithm for finding a solution that generalizes the solution of [1].

Note that without loss of generality we can consider for datapoints where without changing the optimization problem. At optimality only one of or will be tight. For the nonbinding primal constraints, by complementary slackness the corresponding dual variable will be 0. We expect constraints to be tight in the dual (since and the constraint is not binding). So the optimal solution to the dual will satisfy:

Note that if and if . Then at optimality, there exists some index (where refers to the th index of the increasing order statistics, an ordering where ).

We can substitute in the solution from the equality constraints and obtain the following equality, which holds at optimality:

We discuss how to derive the primal solution from the dual solution: for , take and . Thus by complementary slackness, is the optimal value with the corresponding optimal solution exhibited above.

The optimal such occurs with the order statistic threshold at for . Consider the parametric restriction of the primal program, parametrized by the sum of weights ; the value function is concave in and, furthermore, concave in the discrete restriction of to the values it takes at the vertex solutions. ∎
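The threshold characterization above suggests a direct algorithm: sort the outcomes and enumerate the order-statistic threshold. Here is a minimal sketch, assuming the adversarial subproblem takes the form max Σᵢ wᵢYᵢ / Σᵢ wᵢ over box constraints aᵢ ≤ wᵢ ≤ bᵢ (the function name is ours):

```python
import numpy as np

def adversarial_mean_max(y, a, b):
    """Maximize sum(w_i * y_i) / sum(w_i) over a_i <= w_i <= b_i by
    enumerating the order-statistic threshold k: set w = b for the k
    largest outcomes and w = a for the rest, the vertex form derived
    from complementary slackness above."""
    y, a, b = map(np.asarray, (y, a, b))
    order = np.argsort(-y)                 # outcomes in decreasing order
    ys, a_s, b_s = y[order], a[order], b[order]
    best = -np.inf
    for k in range(len(y) + 1):
        w = np.concatenate([b_s[:k], a_s[k:]])
        best = max(best, float(w @ ys) / w.sum())
    return best
```

Only n + 1 candidate vertices need to be checked, rather than all 2ⁿ corners of the box: at any optimum, wᵢ = bᵢ exactly when Yᵢ exceeds the optimal value, so the maximizer is always of threshold form in the sorted outcomes.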

Appendix B Algorithm details

Subgradients for the adversarial value function and logistic assignment policy.

By Theorem 4.3 (essentially an implicit value theorem) of [25], where , the weights achieving the optimal solution. For the linear problem, the parametric coefficients of the objective function are , and is an affine transformation of . By the subgradient chain rule, since is an affine function of , for , we have that .

We use the fact that the derivative of the sigmoid function is σ′(z) = σ(z)(1 − σ(z)). Then by the chain rule, . We consider the vector and express as .

Then, by the chain rule for subgradients with affine functions, . ∎
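The chain-rule computation above can be sketched numerically. This is an illustrative simplification, assuming the inner adversarial problem reduces to maximizing over a finite candidate set W of weight vectors (e.g. the vertex solutions) an objective linear in the logistic policy π_θ(x) = σ(θᵀx); the subgradient follows by differentiating at the inner maximizer (a Danskin-type argument, matching the cited parametric-optimization result). All names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value_and_subgradient(theta, X, c, W):
    """Adversarial value V(theta) = max_{w in W} sum_i w_i c_i pi_theta(x_i)
    for the logistic policy pi_theta(x) = sigmoid(theta @ x); the
    subgradient is the gradient of the objective at the optimal w."""
    pi = sigmoid(X @ theta)
    vals = W @ (c * pi)                  # inner objective per candidate w
    w_star = W[np.argmax(vals)]          # inner maximizer
    # d pi / d theta = sigmoid'(z) * x = pi * (1 - pi) * x, then chain rule
    grad = X.T @ (w_star * c * pi * (1.0 - pi))
    return float(vals.max()), grad
```

A finite-difference check on random instances confirms the subgradient wherever the inner maximizer is unique, which holds generically.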

Appendix C Additional Algorithms for Policy Optimization

c.1 Optimal Decision Trees

Instead of considering probabilistic treatment policies , we can optimize deterministic assignment policies globally over the space of decision trees via integer programming. We introduce the integer assignment variables and compose the Optimal Classification Tree from [7] with the dual formulation of (5). We combine the primal constraints defining the tree structure with the objective function from the dual formulation. The space of policies is parametrized by the hyperplane and intercept vectors defining axis-aligned splits, where we define a tree structure and a split at a decision node assigns units to the left leaf if , and to the right leaf otherwise. [7] introduces constraints to enforce the hierarchical split structure, which we reproduce below for completeness. The program tracks a set of branching nodes and a set of leaf nodes , using the binary assignment variables to track assignment of data points to leaf subject to the requirement that every instance is assigned to a leaf node. The binary variables track whether a split occurs at node and maintain split-hierarchy consistency.

The additional constraints that allow us to encode the dual objective are as follows: we define the policy assignment indicator , where is the policy assignment label of leaf node and describes whether instance is assigned to leaf node . We enforce this with the following set of auxiliary big-M constraints for the product of binary variables (for the case of two treatments):

In this formulation, we introduce the binary variable (i.e. a 0-1 indicator version of ) and express in terms of the 0-1 binary variable to clarify the connection to an optimal classification tree formulation with the 0-1 loss.
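The big-M product constraints above follow the standard linearization of a product of two binary variables: an indicator z = p·q is encoded exactly by three linear inequalities. A minimal sketch (variable names ours), verified by enumerating the truth table:

```python
from itertools import product

def product_constraints_hold(z, p, q):
    """Standard linearization of z = p * q for binary p, q:
    z <= p, z <= q, z >= p + q - 1. (The big-M constraints in the
    formulation above reduce to this form with M = 1 for 0-1 variables.)"""
    return z <= p and z <= q and z >= p + q - 1

# For binary p, q the three inequalities force z = p * q exactly:
for p, q in product((0, 1), repeat=2):
    feasible_z = [z for z in (0, 1) if product_constraints_hold(z, p, q)]
    assert feasible_z == [p * q]
```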