Characterization of Overlap in Observational Studies
Abstract
Overlap between treatment groups is required for nonparametric estimation of causal effects. If a subgroup of subjects always receives (or never receives) a given intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of any causal conclusions for new subjects, and can help guide additional data collection. To have impact, these descriptions must be interpretable for downstream users who are not machine learning experts, such as clinicians. We formalize overlap estimation as a problem of finding minimum volume sets and give a method to solve it by reduction to binary classification with Boolean rules. We also generalize our method to estimate overlap in off-policy policy evaluation. Using data from real-world applications, we demonstrate that these rules have comparable accuracy to black-box estimators while maintaining a simple description. In one case study, we perform a user study with clinicians to evaluate rules learned to describe treatment group overlap in post-surgical opioid prescriptions. In another, we estimate overlap in policy evaluation of antibiotic prescription for urinary tract infections.
1 Introduction
To estimate the causal effect of a binary intervention, we must use the outcomes of treated individuals to infer the outcomes of untreated individuals under treatment, and vice versa [21]. However, the set of subjects that were considered for either treatment group (the overlap set) is the only set for which we can learn provably optimal policies without assumptions on function class [27]. This motivates the construction of an explicit, interpretable characterization of the overlap set, to provide domain experts with insight into the relevance of learned policies for new subjects. For example, physicians who practice evidence-based medicine (EBM) need to know whether the results of a clinical trial or observational study apply to their specific patient [29]. Providing inclusion/exclusion criteria is insufficient in either setting, as these often apply to a larger cohort than the overlap set. For instance, if certain subgroups are included in the retrospective cohort, but never received the intervention, then the conclusions of the study should not be considered applicable for that subgroup.
Characterizing overlap between distributions is also relevant beyond causal inference. A closely related problem is unsupervised domain adaptation [3], in which predictive models trained on a source domain are applied to a target domain from which no labels are observed. By describing the overlap between domains, we gain insight into the inputs for which transfer is likely to succeed. Another use case is understanding areas of difficulty in classification tasks. Describing the overlap between inputs from different classes illustrates which inputs are “hard” to classify, and the complement of the overlap (limited to the observed distribution) describes which inputs are “easy”.
With this context in mind, our main contributions are as follows:
(i) We propose desiderata in overlap estimation, and note how existing methods fail to satisfy them.
(ii) We give a method for interpretable characterization of distributional overlap which satisfies these desiderata, by reducing the problem to two binary classification problems, and using a linear programming relaxation of learning optimal Boolean rules.
(iii) We demonstrate that small rule sets often perform comparably to black-box estimators on a suite of real-world tasks.
(iv) We show how a generalized definition and method applies to policy evaluation and apply it to describing overlap in policies for antibiotic prescription.
(v) We evaluate the interpretability of rules for describing treatment group overlap in post-surgical opioid prescription in a user study with medical professionals.
2 Related work
Distributional (treatment group) overlap is a central assumption in the estimation of causal effects from observational data. A simple yet common method to estimate overlap is to compare group-specific covariate bounds and low-order moments [26, 42, 9]. A more flexible approach is to estimate the group propensity, the probability that a subject was prescribed treatment. Propensities bounded away from 0 and 1 at a point indicate that groups overlap [20, 27]. This idea was used by [9] to learn “interpretable study populations”, by identifying the largest axis-aligned box that contains only subjects with bounded propensity. Matching methods [14, 25, 16], which compute an optimal cross-group matching of subjects with a limit on matching distance, may also be used to estimate overlap but do not provide an explicit and generalizable description. Rule-based models have also been widely considered for classification tasks [24, 1, 40, 18, 38, 7, 10, 37] and subgroup discovery [13], but have, to the best of our knowledge, not been applied to support or overlap estimation. Authors have proposed rule-based algorithms for density estimation, using decision trees [23] and Bayesian histograms [11], but small models of two densities are often not straightforward to combine into one small model of their overlap.
3 Problem Statement and Overlap Definition
We address rule-based overlap estimation: characterization of the intersection of two or more populations or densities. Our primary motivation is to aid policy making based on estimates of causal effects, the validity of which relies on knowing and communicating the set of subjects to which the policy applies. This often places restrictions on the overlap description [2]. We identify the following desiderata for estimates of overlap:
(D.1) They include regions where all groups are well-represented;
(D.2) They exclude all other regions, including those outside the support (see Figure 0(a));
(D.3) They can be expressed using a small set of simple rules.
First, we discuss definitions of overlap satisfying D.1 and D.2 and establish notation. We address D.3 in Sections 4 and 5.
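Let subjects $i = 1, \ldots, m$ be observed through samples $(x_i, t_i)$ of covariates $X \in \mathcal{X} \subseteq \mathbb{R}^d$ and a group indicator $T \in \mathcal{T}$. We assume that samples are independently and identically distributed according to a density $p(X, T)$, and that the covariate space $\mathcal{X}$ is bounded. Let $p_t(X) := p(X \mid T = t)$ denote the covariate density of group $t \in \mathcal{T}$. In causal effect estimation, it is most common to assess overlap between two groups with conditional densities $p_0$ and $p_1$ on $\mathcal{X}$, and $\mathcal{T} = \{0, 1\}$. In this case, overlap is often described as either a) the intersection of supports, $\mathrm{supp}(p_0) \cap \mathrm{supp}(p_1)$, where $\mathrm{supp}(p) = \{x \in \mathcal{X} : p(x) > 0\}$, or b) the set of covariate values for which the conditional probability of group membership (referred to as propensity), $\eta_t(x) := p(T = t \mid X = x)$, is bounded away from 0 and 1 [6, 20]. We generalize this notion to an arbitrary set of groups $\mathcal{T}$,
$$B^\epsilon := \{x \in \mathcal{X} : \forall t \in \mathcal{T},\ \eta_t(x) > \epsilon\}. \qquad (1)$$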
Both of these definitions have shortcomings: the former is somewhat vacuous for variables with infinite support (e.g., a normal random variable), and even with finite support, we may wish to restrict it to “essential support” to avoid including distant outliers; the latter does not satisfy D.2 since a point may have bounded propensity but lie outside the support of the population (see Figure 0(a)).
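Our preferred definition combines the propensity-based definition with a generalized notion of support in minimum-volume sets [30], based on the multidimensional quantile function [8]. Let $\mathcal{C}$ be a set of measurable subsets of the covariate space $\mathcal{X}$, let $V(C)$ denote the volume (Lebesgue measure) of a set $C \in \mathcal{C}$, and define $P(C) := \int_C p(x)\,dx$ for a density $p$ on $\mathcal{X}$. An $\alpha$-minimum-volume (MV) set of $p$ is then
$$S^\alpha := \arg\min_{C \in \mathcal{C}} \{V(C) : P(C) \ge \alpha\}. \qquad (2)$$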
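When $\alpha = 1$, $S^\alpha$ corresponds to the support of $p$. $S^\alpha$ may not always be unique, but the difference between any two MV sets (for the same $\alpha$) is small for large $\alpha$. (Footnote 1: For two MV sets $A, B$ where $P(A), P(B) \ge \alpha$ for some $\alpha$, their intersection is large in that $P(A \cap B) = P(A) + P(B) - P(A \cup B) \ge 2\alpha - 1$.) In this work, we consider only $\alpha < 1$ in order to handle distributions with infinite support and unwanted outliers, and refer to $S^\alpha$ as the $\alpha$-support of $p$. Defining overlap as the intersection of group-specific MV sets is feasible but has the downside that empirical estimates scale poorly with the number of groups $|\mathcal{T}|$. Moreover, it does not facilitate the generalization to policy evaluation described at the end of this section, and the intersection of several descriptions is often less interpretable than a single description. Instead, we define the $(\alpha, \epsilon)$-overlap set, for $\alpha, \epsilon \in (0, 1)$, to be
$$\mathcal{O}^{\alpha,\epsilon} := S^\alpha \cap B^\epsilon, \qquad (3)$$
where $S^\alpha$ is the $\alpha$-MV set (2) of the marginal covariate density and $B^\epsilon$ is the bounded-propensity set (1).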
We define the problem of overlap estimation under definition (3) as one of characterizing the set $\mathcal{O}^{\alpha,\epsilon}$ given thresholds $\alpha$ and $\epsilon$. In line with D.3, these characterizations should be useful in policy-making, and interpretable by domain experts, at small or no cost in accuracy. In the sequel, for notational convenience, we sometimes leave out superscripts from $S^\alpha$ and $B^\epsilon$, assuming them fixed.
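To make definition (3) concrete, the following minimal Python sketch checks membership in the overlap set for a single point. It assumes that a support indicator and per-group propensity estimates are already available from some upstream estimators; the function names, the strict inequality, and the example values are illustrative assumptions and not part of the method's specification.

```python
def has_bounded_propensity(propensities, eps):
    """Return True if every group's estimated propensity exceeds eps.

    `propensities` holds one estimate of p(T = t | X = x) per group t.
    """
    return all(p > eps for p in propensities)


def in_overlap(in_support, propensities, eps):
    """A point is in the overlap set only if it lies in the support set
    AND all group propensities are bounded away from zero."""
    return in_support and has_bounded_propensity(propensities, eps)


# A point with balanced propensity that lies outside the support is excluded.
print(in_overlap(False, [0.5, 0.5], eps=0.1))   # False
print(in_overlap(True, [0.3, 0.7], eps=0.1))    # True
print(in_overlap(True, [0.02, 0.98], eps=0.1))  # False
```

The first call illustrates why bounded propensity alone does not satisfy the second desideratum: the propensity is balanced, yet the point is excluded because it falls outside the support.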
Generalization to policy evaluation.
The definition of $B^\epsilon$ in (1) is motivated by causal effect estimation: comparison of outcomes under two or more alternative interventions. We may instead be interested in policy evaluation, which involves estimating the expected outcome under a conditional intervention $\pi$, which assigns (possibly stochastically) a treatment $t$ to each $x$ following a conditional distribution we write as $\pi(t \mid x)$ [22]. To perform this type of evaluation, we only require that the propensity be bounded away from zero for treatments which have non-zero probability under the policy $\pi$. To describe the inputs for which this is satisfied, we may generalize $B^\epsilon$ to be a function of the target policy by defining $B^\epsilon(\pi) := \{x \in \mathcal{X} : \forall t \in \mathcal{T},\ \pi(t \mid x) > 0 \Rightarrow p(T = t \mid X = x) > \epsilon\}$. See the supplementary material for more details, and Section 6.4 for experimental results in this setting. The full application of this modified procedure is also given in the supplement for clarity.
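The policy-dependent condition can be sketched in a few lines of Python. This is an illustration under assumptions: the function name, the list-based representation of the policy and propensity, and the example numbers are hypothetical, and the supplement referenced above remains the authoritative description.

```python
def in_policy_overlap(policy_probs, propensities, eps):
    """Require bounded propensity only for treatments the target policy can assign.

    policy_probs[t]  : probability that the target policy assigns treatment t at x
    propensities[t]  : estimated probability that treatment t was observed at x
    """
    return all(
        p_obs > eps
        for p_pi, p_obs in zip(policy_probs, propensities)
        if p_pi > 0
    )


# The policy never assigns treatment 2 here, so its tiny propensity is irrelevant.
print(in_policy_overlap([0.5, 0.5, 0.0], [0.6, 0.39, 0.01], eps=0.05))  # True
# The policy sometimes assigns treatment 2, which is (almost) never observed.
print(in_policy_overlap([0.5, 0.4, 0.1], [0.6, 0.39, 0.01], eps=0.05))  # False
```

A policy that puts positive probability on every treatment recovers the condition for causal effect estimation, since the filter then keeps all treatments.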
4 OverRule: Rule-Based Overlap Estimation
We propose OverRule, an algorithm for identifying the overlap set $\mathcal{O}^{\alpha,\epsilon}$ by separately estimating the MV support set (2) and the bounded-propensity set (1), thereby satisfying desiderata D.1 and D.2. Our approach to estimating $S^\alpha$ and $B^\epsilon$, described in Sections 4.1 and 4.2, aims to fulfill desideratum D.3 by using Boolean rules, logical formulae in either disjunctive (DNF) or conjunctive (CNF) normal form, which have received renewed attention because of their interpretability [7, 35] (see Figure 3 for an example). In preliminary experiments, we observed that learning rules for $S^\alpha$ and $B^\epsilon$ separately improved interpretability, as it makes clear which rules apply to which task and prevents the capacity of the function from being consumed by one task. The conjunction of the two rules yields a description of $\mathcal{O}^{\alpha,\epsilon}$. OverRule proceeds in the following steps:
(i) Fit an estimate $\hat{S}$ of the marginal support $S^\alpha$ using Boolean rules,
(ii) Fit a group propensity-based estimator indicating membership in $B^\epsilon$,
(iii) Approximate the estimate from (ii) on $\hat{S}$ using Boolean rules.
Our main contribution in this section is to demonstrate how rule learning of $S^\alpha$ and $B^\epsilon$, steps (i) and (iii), can be reduced to binary classification. This enables us to exploit the rich set of existing methods for rule-based classification [10] in our effort to improve the interpretability of the overlap estimate.
4.1 Estimation of $S^\alpha$ as Binary Classification
In the first step of OverRule, we learn a rule set to approximate the $\alpha$-MV set of the marginal distribution $p(X)$, $S^\alpha$, by reducing the MV set problem to one of binary classification between the observed samples and uniform background samples. With a slight abuse of notation, let $\mathcal{C}$ denote a class of subsets of $\mathcal{X}$ where each subset $C \in \mathcal{C}$ corresponds to a candidate MV set. In Section 5, we parameterize $\mathcal{C}$ using Boolean rules, where $C$ is the set of inputs for which a rule holds (see Figure 0(b)). Let $\{x_i\}_{i \in \mathcal{I}}$, with $|\mathcal{I}| = m$, denote the observed set of covariates. Then, the empirical version of (2) specialized to rule sets is as follows:
$$\hat{S} := \arg\min_{C \in \mathcal{C}} \; V(C) + R(C) \quad \text{s.t.} \quad \sum_{i \in \mathcal{I}} \mathbb{1}[x_i \in C] \ge \alpha m, \qquad (4)$$
with the addition of a regularization term $R(C)$ to control the complexity of the rule set.
In practice, the volume $V(C)$ may be difficult to compute during optimization for general classes $\mathcal{C}$, and the size of $\mathcal{C}$ is often too large to allow precomputation of $V(C)$ for all $C \in \mathcal{C}$. In particular, in the case of DNF Boolean rules, each $C$ is a union of several potentially overlapping rules (see Figure 0(b)). Even if the volume spanned by each rule is known or quick to compute on the fly, $V(C)$ may not be.
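To estimate $V(C)$, we use the fact that volume is a uniform measure on $\mathcal{X}$: the volume of $C$ can be estimated as a fraction of the volume of $\mathcal{X}$ by means of uniform samples over $\mathcal{X}$. Let $\mathcal{U}$ be the index set of these uniform samples, with $n = |\mathcal{U}|$. Then $\hat{V}(C) := \frac{V(\mathcal{X})}{n} \sum_{i \in \mathcal{U}} \mathbb{1}[x_i \in C]$ is distributed as a scaled binomial random variable with mean $V(C)$ and variance $V(C)\,(V(\mathcal{X}) - V(C))/n$.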
If $C$ places conditions on only $d'$ of the $d$ dimensions, then the required $n$ would be expected to scale exponentially with $d'$, not $d$. Thus $n$ does not need to be overly large to accurately estimate the volume of the rule set.
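The sampling-based volume estimate described above can be sketched as follows. This is a minimal standard-library illustration, assuming an axis-aligned bounded covariate space; the function name, the example rule, and the sample size are hypothetical choices. The standard error is a plug-in estimate of the square root of the binomial variance stated above.

```python
import random


def estimate_volume(rule, bounds, n_samples, seed=0):
    """Monte Carlo estimate of the volume of {x : rule(x)} inside a box.

    rule   : function mapping a point (list of floats) to True/False
    bounds : list of (low, high) pairs, one per dimension
    Returns (volume_estimate, plug-in standard error).
    """
    rng = random.Random(seed)
    total_volume = 1.0
    for lo, hi in bounds:
        total_volume *= hi - lo
    hits = 0
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        if rule(x):
            hits += 1
    frac = hits / n_samples
    # Scaled binomial: variance is V(X)^2 * q * (1 - q) / n with q = V(C) / V(X).
    std_err = total_volume * (frac * (1 - frac) / n_samples) ** 0.5
    return total_volume * frac, std_err


# A DNF rule made of two overlapping boxes in the unit cube; the third
# dimension is unconstrained. The exact volume is 0.25 + 0.25 - 0.125 = 0.375.
def rule(x):
    return (x[0] < 0.5 and x[1] < 0.5) or (x[0] < 0.25)


volume, std_err = estimate_volume(rule, [(0.0, 1.0)] * 3, n_samples=20000)
print(volume, std_err)
```

Because the two conjunctions overlap, summing their individual volumes would overcount; sampling avoids having to compute the union analytically.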
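Given the above empirical estimator of volume, we reduce support estimation to a classification problem between the marginal density $p(X)$ and a uniform distribution over $\mathcal{X}$. This is inspired by a similar reduction of support estimation (see Conclusion, p.695 in [31]).
$$\hat{S} := \arg\min_{C \in \mathcal{C}} \; \frac{1}{n} \sum_{i \in \mathcal{U}} \mathbb{1}[x_i \in C] + R(C) \quad \text{s.t.} \quad \sum_{i \in \mathcal{I}} \mathbb{1}[x_i \in C] \ge \alpha m, \qquad (5)$$
where $n = |\mathcal{U}|$ is the number of uniform reference samples and the constant factor $V(\mathcal{X})$ is absorbed into the regularization term. Problem (5) is a Neyman-Pearson-like classification problem with a false negative rate constraint of $1 - \alpha$ (instead of the usual false positive constraint).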
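The structure of this constrained classification problem can be illustrated with a brute-force search over a tiny, hand-written candidate set. This is only a sketch: the actual method searches an exponentially large space of Boolean rules (Section 5), and the names `coverage`, `select_rule`, and the per-candidate `penalties` are assumptions introduced for the example.

```python
def coverage(rule, points):
    """Fraction of points for which the rule holds."""
    return sum(1 for x in points if rule(x)) / len(points)


def select_rule(candidates, data, reference, alpha, penalties):
    """Pick the feasible candidate with the smallest penalized reference coverage.

    Feasible means: the rule covers at least a fraction alpha of the observed
    data, i.e. its false negative rate is at most 1 - alpha.
    """
    best_name, best_score = None, float("inf")
    for name, rule in candidates.items():
        if coverage(rule, data) < alpha:
            continue  # violates the constraint on observed samples
        score = coverage(rule, reference) + penalties[name]
        if score < best_score:
            best_name, best_score = name, score
    return best_name


data = [[0.1], [0.2], [0.3], [0.9]]                 # observed samples
reference = [[i / 10 + 0.05] for i in range(10)]    # uniform grid on [0, 1]
candidates = {
    "x < 0.4": lambda x: x[0] < 0.4,
    "x < 1.0": lambda x: x[0] < 1.0,
}
penalties = {"x < 0.4": 0.01, "x < 1.0": 0.01}

print(select_rule(candidates, data, reference, 0.7, penalties))  # x < 0.4
print(select_rule(candidates, data, reference, 0.9, penalties))  # x < 1.0
```

Raising the required coverage from 0.7 to 0.9 forces the larger set, which is the trade-off between volume and mass that the threshold controls.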
4.2 Estimation of $B^\epsilon$ as Binary Classification
Towards estimating $B^\epsilon$, we follow in the tradition of using nonparametric (black-box) estimators of the group propensity to identify balanced cohorts in the study of causal effects [20, 9]. In particular, given an estimator of propensity $\hat{\eta}$, e.g. a random forest model, we assign a label $b_i$ to each data point $x_i$ indicating that propensity is bounded away from 0 and 1 in the following way:
$$b_i := \mathbb{1}[\epsilon < \hat{\eta}(x_i) < 1 - \epsilon]. \qquad (6)$$
Let . Similar to the case of $S^\alpha$, we may now reduce rule set estimation of $B^\epsilon$ to binary classification. Given $\hat{S}$, the minimizer of (5), we restrict attention to samples in $\hat{S}$ and again set up a Neyman-Pearson-like classification problem regarding the intersection as the positive class:
(7) 
The sets and are defined by the solution to (5) and the base estimator (6). To accommodate the policy evaluation setting described in Section 3, we can modify the pseudo-labels defined in (6) to be , where , and solve (7) using in place of . The resulting full procedure is given in the supplement.
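The data preparation for this second classification problem can be sketched as below, for a binary group indicator. The function name and argument layout are hypothetical, and the strict inequalities are an assumption; the sketch only shows how samples outside the estimated support are dropped and how the remaining samples receive pseudo-labels from a base propensity estimator.

```python
def second_stage_dataset(samples, in_support, propensities, eps):
    """Build the labeled dataset for the second classification problem.

    samples      : list of covariate vectors
    in_support   : list of bools, output of the fitted support rules
    propensities : list of estimated p(T = 1 | x) for a binary group indicator
    Returns (kept_samples, labels); label 1 marks bounded propensity.
    """
    kept, labels = [], []
    for x, s, e in zip(samples, in_support, propensities):
        if not s:
            continue  # points outside the estimated support are dropped
        kept.append(x)
        labels.append(1 if eps < e < 1 - eps else 0)
    return kept, labels


samples = [[0], [1], [2], [3]]
in_support = [True, True, False, True]
propensities = [0.5, 0.02, 0.5, 0.97]
print(second_stage_dataset(samples, in_support, propensities, eps=0.05))
# ([[0], [1], [3]], [1, 0, 0])
```

The third sample has a balanced propensity but is removed before labeling, so the second-stage rules never need to spend capacity on describing the support.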
5 Neyman-Pearson Classification with Boolean Rules
In this section, we derive an optimization procedure for (5) in the case where the hypothesis class consists of Boolean DNF rules. The same procedure also solves (7). As the resulting DNF rule learning problem is an integer program (IP), we derive several heuristics for reducing computation.
DNF rules are also commonly known as rule sets, where the conjunctive clauses in the DNF correspond to individual rules in the set. As pointed out by [35], CNF rules can be learned by swapping class labels and fitting a DNF. Figure 0(b) exemplifies a DNF rule in . We assume that base features have been binarized to form literals such as or indexed by a set , as is standard in e.g. decision tree learning. We let index the set of all possible (exponentially many) conjunctions of literals in , e.g. . Then, for , let denote the value taken by the th conjunction at sample . Let the rule set be defined by such that indicates that the th conjunction is used in the rule set.
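As an illustration of the representation described above, the sketch below binarizes numeric features into threshold literals and evaluates a DNF rule set on them. The thresholds, feature indices, and tuple-based literal keys are hypothetical choices made for this example and are not taken from the paper.

```python
def binarize(x, thresholds):
    """Map a numeric feature vector to a dict of Boolean literals.

    thresholds : dict mapping feature index -> list of cut points
    Each cut point c yields two literals: (j, "<=", c) and (j, ">", c).
    """
    literals = {}
    for j, cuts in thresholds.items():
        for c in cuts:
            literals[(j, "<=", c)] = x[j] <= c
            literals[(j, ">", c)] = x[j] > c
    return literals


def dnf_covers(literals, rule_set):
    """A DNF rule set holds if ANY conjunction has ALL of its literals true."""
    return any(all(literals[lit] for lit in conj) for conj in rule_set)


thresholds = {0: [30], 1: [12]}   # e.g. feature 0 = age, feature 1 = education
rule_set = [
    [(0, "<=", 30), (1, "<=", 12)],   # conjunction of two literals
    [(1, ">", 12)],                   # conjunction of one literal
]

print(dnf_covers(binarize([25, 16], thresholds), rule_set))  # True
print(dnf_covers(binarize([40, 10], thresholds), rule_set))  # False
```

Evaluating each conjunction on each sample in this way produces the binary values that the optimization in this section takes as input.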
Recall that indexes samples from a uniform reference measure and define an error variable for in representing the penalty for covering or failing to cover point , depending on its set membership.
Problem (5) may be reformulated accordingly as follows (similar to [7]),
s.t.  (8) 
Problem (8) is an IP with an exponential number of variables and is intractable as written. We follow the column generation approach of [7] to effectively manage the large number of variables and solve (8) approximately. As in that previous work, we bound from above the operators in the constraints of (8) with sums (Hamming loss instead of zero-one loss) as it gives better numerical results. We also let with so that higher-degree conjunctions are more costly. These modifications yield an objective that is linear in , , with the same constraints as (8) except that , has been absorbed into the objective. We then follow the overall procedure in [7] of solving the linear programming (LP) relaxation, using column generation to add variables only as needed.
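The difference between the zero-one and Hamming penalties for a covered reference point, and a degree-dependent complexity cost, can be illustrated as follows. The affine form `lam0 + lam1 * degree` and all names here are assumptions made for this sketch, chosen only to show a cost that grows with the number of literals in a conjunction.

```python
def zero_one_penalty(a_row, w):
    """1 if the point is covered by at least one selected conjunction, else 0."""
    return max(a * s for a, s in zip(a_row, w))


def hamming_penalty(a_row, w):
    """Number of selected conjunctions covering the point (upper-bounds the max)."""
    return sum(a * s for a, s in zip(a_row, w))


def complexity_cost(w, degrees, lam0, lam1):
    """Assumed affine cost per selected conjunction: lam0 + lam1 * degree."""
    return sum(s * (lam0 + lam1 * d) for s, d in zip(w, degrees))


a_row = [1, 1, 0]   # which conjunctions hold at one reference point
w = [1, 1, 1]       # which conjunctions are selected in the rule set
print(zero_one_penalty(a_row, w))  # 1
print(hamming_penalty(a_row, w))   # 2
print(complexity_cost(w, degrees=[1, 2, 3], lam0=0.1, lam1=0.01))
```

Replacing the maximum with a sum keeps the penalty linear in the selection vector, at the price of counting a point once per covering conjunction.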
We make the following departures from [7]. As noted, (8) has a constraint on false negative rate instead of a corresponding objective term and a penalty on rule set complexity while [7] use a constraint. As a result, the LP reduced costs, used in column generation, are different. With , the dual variable for the top right constraint in (8), the reduced cost of conjunction is now , which remains a linear function of , allowing the same column generation method to be used. We also simplify the procedure of [7] to avoid the need for an IP solver by a) solving the column generation problem using a heuristic algorithm from [39], and b) once column generation terminates, we obtain an integer solution by taking the restriction of (8) to the final columns, converting to a weighted set cover problem, and applying a greedy set cover algorithm.
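The final rounding step relies on a greedy algorithm for weighted set cover. The sketch below shows the textbook greedy heuristic with a partial-coverage target, which is one plausible way to respect a budget of uncovered points; how the costs and the coverage target are derived from (8) is not shown, and the function name and inputs are hypothetical.

```python
def greedy_weighted_cover(sets, costs, n_required):
    """Greedy weighted (partial) set cover.

    sets       : list of Python sets of covered point indices, one per conjunction
    costs      : list of positive costs, one per conjunction
    n_required : number of points that must be covered
    Repeatedly picks the conjunction with the lowest cost per newly covered point.
    """
    covered, chosen = set(), []
    while len(covered) < n_required:
        best, best_ratio = None, float("inf")
        for k, s in enumerate(sets):
            gain = len(s - covered)
            if k in chosen or gain == 0:
                continue
            ratio = costs[k] / gain
            if ratio < best_ratio:
                best, best_ratio = k, ratio
        if best is None:
            raise ValueError("cannot cover the required number of points")
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 1, 2, 3, 4, 5}]
costs = [1.0, 1.0, 1.0, 2.5]
print(greedy_weighted_cover(sets, costs, n_required=6))
# ([0, 2], {0, 1, 2, 3, 4, 5})
```

In this example two cheap conjunctions are preferred over the single expensive one that covers everything, since their cost per newly covered point is lower at each step.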
6 Experiments
We compare OverRule to black-box and rule-based estimators of overlap. The first baseline approximates the overlap region with the intersection of covariate (marginal) bounding boxes (CBB) to evoke classical balance checks in causal inference. The bounding boxes are selected to cover the percentiles of the data. Second, we use propensity score estimators as described in (6) with standard logistic regression (PSLR) or $k$-nearest neighbors (PSNN) estimates of the propensity. Finally, we use One-Class Support Vector Machines (OSVM) to first estimate conditional supports and then overlap as their intersection.
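The covariate bounding box baseline can be sketched as follows. The quantile levels are left as parameters because the specific percentiles are not stated here; the nearest-rank quantile and the function names are assumptions for illustration.

```python
def bounding_box(rows, lower_q, upper_q):
    """Per-covariate (low, high) bounds at the given quantiles (nearest rank)."""
    box = []
    for j in range(len(rows[0])):
        col = sorted(r[j] for r in rows)
        lo = col[int(lower_q * (len(col) - 1))]
        hi = col[int(upper_q * (len(col) - 1))]
        box.append((lo, hi))
    return box


def in_all_boxes(x, boxes):
    """True if x lies inside the bounding box of every group."""
    return all(lo <= x[j] <= hi for box in boxes for j, (lo, hi) in enumerate(box))


group0 = [[i] for i in range(11)]      # one covariate with values 0..10
group1 = [[i] for i in range(5, 16)]   # one covariate with values 5..15
boxes = [bounding_box(group0, 0.0, 1.0), bounding_box(group1, 0.0, 1.0)]

print(boxes)                      # [[(0, 10)], [(5, 15)]]
print(in_all_boxes([7], boxes))   # True
print(in_all_boxes([3], boxes))   # False
```

Because each box is a product of marginal intervals, this baseline cannot exclude empty regions inside the box, which is one reason it serves as a simple reference point.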
The PS estimators can be viewed as a binary version of overlap weights [20] and CBB as the standard practice of comparing basic covariate metrics. These baselines are also used as black-box estimators by OverRule. In addition, we compare our results to the MaxBox (MB) framework used by [9] to learn interpretable study populations.
When estimating support in OverRule, we use uniform reference samples where is the number of data samples and their dimension. Parameters were selected from the range for estimation of both and . The choice of had small impact and was set to unless otherwise specified. For propensity-based base estimators, we used a threshold . For $k$-NN we selected $k$ based on held-out accuracy in predicting group membership and used as threshold. For OSVM, we use a Gaussian RBF kernel with bandwidth selected based on the held-out likelihood of a kernel density estimator with the same bandwidth. To select hyperparameters for the rule-based models, and to assess quality when overlap is known, we use the balanced accuracy [5].
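Balanced accuracy is the average of the true positive rate and the true negative rate, which makes it robust to class imbalance between points inside and outside the overlap set. A minimal sketch, assuming binary 0/1 labels with both classes present (the function name is illustrative):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels.

    Assumes both classes occur in y_true; otherwise a rate is undefined.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Predicting "in overlap" everywhere gets 80% plain accuracy on this
# imbalanced example, but only 0.5 balanced accuracy.
print(balanced_accuracy([1, 1, 1, 1, 0], [1, 1, 1, 1, 1]))  # 0.5
```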
We evaluate OverRule in a series of real-world experiments. To give an understandable illustration of our definition and method, we estimate regions of classifier uncertainty in the famous Iris dataset. Estimation of treatment group overlap in the Jobs dataset [19, 32] provides an example with "ground truth". A study of overlap in opioid prescriptions gives us a large-scale real-world example with a user study, and policy evaluation for antibiotic prescriptions points to the versatility of the core methodology. A synthetic experiment showing the utility of characterizing overlap in observational studies of causal effects both before and after estimation can be found in the supplement.
6.1 Classifier Uncertainty: Iris
We use OverRule to identify the overlap between members of two species of Iris, as represented by their sepal and petal length and width, based on the famous Iris dataset. We fit OverRule using a $k$-NN base estimator () and DNF Boolean rules with high regularization (). In Figure 2, we present the rules learned to characterize the overlap set and compare them with the rules learned for a binary classifier of group membership. In contrast, the coefficients of a logistic regression propensity score model reveal very little about which points lie in the overlap set.

6.2 Observational Study: Jobs


In a famous trial performed to study the effects of job training [19, 34], eligible US citizens were randomly selected into (), or left out of () job training programs. The RCT (), which satisfies overlap by definition, has since been combined with nonexperimental control samples (), forming a larger observational set (Jobs), to serve as a benchmark for causal effect estimation [19]. Here, we aim to characterize the overlap between treated and control subjects.
As a consequence of the trial’s eligibility criteria, the experimental and nonexperimental cohorts barely overlap; a standard logistic regression estimator separates the experimental and nonexperimental groups with held-out balanced accuracy of 0.96. Since all treated subjects were part of the experiment, the experimental cohort perfectly represents the overlap region. For this reason, we use the experiment indicator as ground truth for the overlap set. (Footnote 2: This may introduce a small number of false negatives in the label used as ground truth.) In studies of causal effects in this data, the following features were included to adjust for confounding: Age, #Years in education (Educ), Race (black/hispanic/other), Married, No degree (NoDegr), Real earnings in 1974 (RE74) and in 1975 (RE75). These are the features for which we estimate overlap between treated and controls.
The results for Jobs can be seen in Table 1. We present the results for the smallest rules that achieve balanced accuracy within 1% of that of the best performing model within each class. We see that for most base estimators, the OverRule approximations perform slightly worse than the base estimator, but with a simpler description. OverRule compares favorably to MaxBox in all cases. In the supplement, we give plots which show that the held-out balanced accuracy quickly converges with the number of literals in the rules and correlates strongly with the quality by which the rule set approximates the base estimator. The learned rules in Table 1(b) conform to our expectations, as the eligibility criteria for the RCT allow only subjects who were currently unemployed and had been so for most of the time leading up to the trial, factors that correlate with education and marital status [34].
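The selection of the smallest rule set within a tolerance of the best model can be sketched as below. Interpreting "within 1%" as an absolute difference of 0.01 in balanced accuracy is an assumption, as are the function name and the example numbers, which are made up for illustration and are not results from Table 1.

```python
def smallest_within_tolerance(models, tol=0.01):
    """Return the model with the fewest literals whose balanced accuracy
    is within `tol` (absolute) of the best balanced accuracy.

    models : list of (n_literals, balanced_accuracy) pairs
    """
    best = max(acc for _, acc in models)
    eligible = [m for m in models if m[1] >= best - tol]
    return min(eligible, key=lambda m: m[0])


# Hypothetical candidates along a regularization path.
models = [(2, 0.80), (4, 0.885), (9, 0.89), (15, 0.875)]
print(smallest_within_tolerance(models))  # (4, 0.885)
```

This favors the shortest description among near-optimal models, in line with the desideratum that overlap estimates be expressible with a small set of simple rules.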
6.3 Observational Study: Opioid Misuse
Opioid misuse affects millions of Americans and understanding the factors that influence the risk of misuse is of great importance. To this end, [4] and [41] study the effect of choices in opioid prescriptions on the risk of future misuse. In this experiment, we study a group of post-surgical patients who were given opioid prescriptions within 7 days of surgery. We compare patients who were given doses, morphine milligram equivalent (MME), above and below the 85th percentile in the selected cohort, MME=450. We replicate the cohort eligibility criteria of [4], using a subset of the MarketScan insurance claims database. Subjects were represented by basic demographics (age, sex), diagnosis history and procedures billed as surgical on the index date. Note that surgical procedures are not mutually exclusive. We list first-order statistics of these features in the supplement.
We fit an OverRule model (OR) to a random forest base estimator with for and for picked a priori. The hyperparameter was set to for , chosen based on balanced accuracy w.r.t. the base estimator, and for based on accuracy in classifying the reference measure. For comparison, we fit a MaxBox model (MB) [9] to the same base, and another OverRule model describing the complement of (ORC). The balanced accuracy of these models w.r.t. the base were 0.90 (OR), 0.77 (MB) and 0.92 (ORC). In Figure 3, we summarize the rules learned by OR which cover 27% of the overall population. MB learned the rule: Musculoskeletal surg. Mediastinum surg. Male genital surg. Maternity surg. Lumbosacral spondylosis without myelopathy which covers 17% of patients. The rules learned by ORC are presented in the supplement. To evaluate the interpretability of learned overlap sets, we conducted a qualitative user study through a moderated discussion with three participants: two attending surgeons (P1 & P2) and a 4th year medical student (P3) at a large US teaching hospital. Before seeing the outputs of any method, the participants were asked to give their expectations for what to find in the overlap set. The full discussion was transcribed, anonymized, and included in the supplement.
The participants expected that the overlap set would mostly correspond to patients in the higher dose range, as these patients are often considered also for smaller doses, and that overlap would be driven largely by surgery type. All participants expected Musculoskeletal and Cardiovascular surgery patients to be predominantly in the higher dose group, and sometimes in the lower, and one suggested that Maternity surgeries (e.g., C-sections) would be only in the lower range. These comments are all consistent with the findings of OverRule, which identified all of these surgery types as important. MaxBox identified only Musculoskeletal surgery patients as overlapping. One participant expected history of psychiatric disease and Tobacco use disorder to be predictive of higher prescription doses for some patients, and thus overlap. Neither method identified psychiatric disease, but Tobacco use disorder was identified by ORC as anticorrelated with exclusion from overlap (see the supplement).
The participants found the support rules () output by OR (Figure 3 left) intuitive and P1 stated that Endocrine surgeries are not typically followed by opioid prescriptions. They found the MaxBox and OR rule descriptions easy to interpret, and discussion focused on their clinical meaning. The first three propensity overlap rules, B.1 to B.3, were all consistent with expectation as described above, with the caveat that Cardiovascular patients are not typically stratified by Urinary and Genital surgeries. This was later partially explained by catheters being billed as Urinary, and P3 interpreted it as a proxy for more severe Cardiovascular surgeries. P1 pointed out the value in discovering such surprising patterns that may be hidden in black-box analyses. The ORC rules were found hard to interpret due to many double negatives (“excluded from exclusion”), but were ultimately deemed clinically sound.
6.4 Observational Study: Policy Evaluation of Antibiotic Prescription Guidelines
Using the policy evaluation formulation of the bounded-propensity set (see Section 3), we apply OverRule to assess the overlap set for a policy that follows clinical guidelines published by the Infectious Diseases Society of America (IDSA) for treatment of uncomplicated urinary tract infections (UTIs) in female patients [12]. We use data derived from the electronic medical record of two academic medical centers.
We apply the OverRule algorithm to a broad cohort (e.g., including men) of 65,000 UTI patients to test whether or not it can recover a clinically meaningful overlap set. From a qualitative perspective, we discussed the results with an infectious disease specialist, who verified that the resulting rules have a clear clinical interpretation which aligns with how the guidelines are applied in practice, identifying primarily female outpatient cases and uncomplicated female inpatient cases. From a quantitative perspective, we compared the learned region (covering 42k patients, 64% of total) with a subset of patients selected a priori to be eligible for the guidelines (14k patients, 21% of total). We found that the former covers 96% of the latter while also including a much broader, but clinically intuitive, cohort. The experimental setup and results are described in more detail in the supplement.
7 Conclusion
We have presented OverRule, an algorithm for learning rule-based characterizations of overlap between distributions, or the inputs for which policy evaluation is feasible. The algorithm learns to exclude points marginally out-of-distribution, as well as points where either distribution/policy has low density. We evaluated the algorithm in characterizing overlap in causal effect estimation, and demonstrated that our rule descriptions often have similar accuracy to black-box estimators and outperform a competitive baseline. In an application to study treatment group overlap in post-surgical opioid prescription, a qualitative user study found the results interpretable and clinically meaningful. Similar observations were made in an application to evaluation of antibiotic prescription policies.
Acknowledgments
We thank Chloe O’Connell and Charles S. Parsons for providing clinical feedback on the opioid misuse experiment, and Sanjat Kanjilal for providing clinical feedback on the antibiotic prescription experiment. We also thank Bhanukiran Vinzamuri for assistance with the opioids data, David Amirault for insightful suggestions and feedback, and members of the Clinical Machine Learning group for feedback on earlier drafts.
References
 [1] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 35–44, 2017.
 [2] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
 [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
 [4] Gabriel A Brat, Denis Agniel, Andrew Beam, Brian Yorkgitis, Mark Bicket, Mark Homer, Kathe P Fox, Daniel B Knecht, Cheryl N McMahill-Walraven, Nathan Palmer, et al. Postsurgical prescriptions for opioid naive patients and association with overdose and misuse: retrospective cohort study. BMJ, 360:j5790, 2018.
 [5] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE, 2010.
 [6] Alexander D’Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.
 [7] Sanjeeb Dash, Oktay Gunluk, and Dennis Wei. Boolean decision rules via column generation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4660–4670. Curran Associates, Inc., 2018.
 [8] John HJ Einmahl and David M Mason. Generalized quantile processes. The Annals of Statistics, pages 1062–1078, 1992.
 [9] Colin B Fogarty, Mark E Mikkelsen, David F Gaieski, and Dylan S Small. Discrete optimization for interpretable study populations and randomization inference in an observational study of severe sepsis mortality. Journal of the American Statistical Association, 111(514):447–458, 2016.
 [10] Alex A Freitas. Comprehensible classification models: a position paper. ACM SIGKDD explorations newsletter, 15(1):1–10, 2014.
 [11] Siong Thye Goh and Cynthia Rudin. Cascaded high dimensional histograms: A generative approach to density estimation. arXiv preprint arXiv:1510.06779, 2015.
 [12] Kalpana Gupta, Thomas M Hooton, Kurt G Naber, Bjorn Wullt, Richard Colgan, Loren G Miller, Gregory J Moran, Lindsay E Nicolle, Raul Raz, Anthony J Schaeffer, and David E Soper. International clinical practice guidelines for the treatment of acute uncomplicated cystitis and pyelonephritis in women: A 2010 update by the Infectious Diseases Society of America and the European Society for Microbiology and Infectious Diseases. Clinical Infectious Diseases, 52(5):e103–20, mar 2011.
 [13] Francisco Herrera, Cristóbal José Carmona, Pedro González, and María José Del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and Information Systems, 29(3):495–525, 2011.
 [14] Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1):1–24, 2012.
 [15] Herman Kahn. Use of Different Monte Carlo Sampling Techniques. Technical report, RAND Corporation, Santa Monica, California, 1955.
 [16] Nathan Kallus. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321, 2016.
 [17] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 84:1243–1251, 09–11 Apr 2018.
 [18] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1675–1684. ACM, 2016.
 [19] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620, 1986.
 [20] Fan Li, Kari Lock Morgan, and Alan M Zaslavsky. Balancing covariates via propensity score weighting. Journal of the American Statistical Association, 113(521):390–400, 2018.
 [21] Judea Pearl. Causality. Cambridge University Press, 2009.
 [22] Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 759–766, 2000.
 [23] Parikshit Ram and Alexander G Gray. Density estimation trees. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 627–635. ACM, 2011.
 [24] Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987.
 [25] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American Statistical Association, 84(408):1024–1032, 1989.
 [26] Paul R Rosenbaum. Design of observational studies, volume 10. Springer, 2010.
 [27] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
 [28] Guillermo V Sanchez, Ahmed Babiker, Ronald N Master, Tony Luu, Anisha Mathur, and Jose Bordon. Antibiotic Resistance among Urinary Isolates from Female Outpatients in the United States in 2003 and 2012. Antimicrobial Agents and Chemotherapy, 60(5):2680–2683, 2016.
 [29] Connie Schardt, Martha B Adams, Thomas Owens, Sheri Keitz, and Paul Fontelo. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC medical informatics and decision making, 7:16, jun 2007.
 [30] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
 [31] Clayton D Scott and Robert D Nowak. Learning minimum volume sets. Journal of Machine Learning Research, 7(Apr):665–704, 2006.
 [32] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076–3085. JMLR.org, 2017.
 [33] Daniel J Shapiro, Lauri A Hicks, Andrew T Pavia, and Adam L Hersh. Antibiotic prescribing for adults in ambulatory care in the USA, 2007–09. Journal of Antimicrobial Chemotherapy, 69(1):234–240, 2013.
 [34] Jeffrey A Smith and Petra E Todd. Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of econometrics, 125(1-2):305–353, 2005.
 [35] Guolong Su, Dennis Wei, Kush R. Varshney, and Dmitry M. Malioutov. Learning sparse two-level Boolean rules. In Proc. IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), pages 1–6, September 2016.
 [36] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2017.
 [37] Fulton Wang and Cynthia Rudin. Falling rule lists. In Artificial Intelligence and Statistics, pages 1013–1022, 2015.
 [38] Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 18(70):1–37, 2017.
 [39] Dennis Wei, Sanjeeb Dash, Tian Gao, and Oktay Gunluk. Generalized linear rule models. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
 [40] Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian rule lists. In Proc. Int. Conf. Mach. Learn. (ICML), pages 1013–1022, 2017.
 [41] Jinghe Zhang, Vijay Iyengar, Dennis Wei, Bhanukiran Vinzamuri, Hamsa S. Bastani, Alexander R. Macalalad, Anne E. Fischer, Gigi Yuen-Reed, Aleksandra Mojsilovic, and Kush R. Varshney. Exploring the causal relationships between initial opioid prescriptions and outcomes. In AMIA Workshop on Data Mining for Medical Informatics, Washington, DC, November 2017.
 [42] José R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.
Appendix
Appendix A Generalization to Policy Evaluation
In this section we give the detailed algorithm for applying OverRule to policy evaluation, as described in the main paper. In this context, we wish to evaluate not a specific treatment decision (e.g., the average treatment effect of giving a drug vs. withholding it), but rather a conditional policy $\pi(T \mid X)$ representing a personalized treatment regime, which we will refer to as the target policy. This problem falls under the setting of off-policy policy evaluation when the target policy differs from the policy that generated the data, which we observe in the observational data as $p(T \mid X)$.
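Rationale for $\mathcal{B}^{\epsilon}(\pi)$
In the main paper, we drew a connection between the set $\mathcal{B}^{\epsilon}$ and the following set, which is a function of the target policy $\pi$: $\mathcal{B}^{\epsilon}(\pi) := \{x \in \mathcal{X} : \forall t,\ \pi(t \mid x) > 0 \implies p(t \mid x) > \epsilon\}$. In this section, we provide theoretical justification for why we are restricted to this set if we wish to evaluate the policy $\pi$ given samples generated according to $p(T \mid X)$.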
Following similar notation to [17], we let $X \in \mathcal{X}$ correspond to covariates, $Y$ to an outcome of interest, and $T$ to a treatment decision. We write $\pi(t \mid x)$ for the probability of each treatment under the policy, which may be stochastic. We write $Y(t)$ to represent the potential outcome under treatment $t$. In this setting, we wish to evaluate the expected value of $Y$ under the target policy, which we denote as $\mathbb{E}_{\pi}[Y]$. For our purposes, we note the following as motivation for our definition of $\mathcal{B}^{\epsilon}(\pi)$:
Proposition 1 (Informal).
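The expectation $\mathbb{E}_{\pi}[Y]$ is only defined w.r.t. the observed distribution $p(X, T, Y)$ for the subset of $\mathcal{X}$ such that, for all $t$, $\pi(t \mid x) > 0 \implies p(t \mid x) > 0$.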
Proof.
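Under the assumption that ignorability [21] holds, we can write out our desired quantity in terms of the observed distribution $p(X, T, Y)$ as follows:
\begin{align}
\mathbb{E}_{\pi}[Y] &= \int_x p(x) \sum_t \pi(t \mid x) \int_y y \, p(Y(t) = y \mid x) \, dy \, dx \tag{9} \\
&= \int_x p(x) \sum_t \frac{p(t \mid x)}{p(t \mid x)} \, \pi(t \mid x) \int_y y \, p(Y(t) = y \mid x) \, dy \, dx \tag{10} \\
&= \int_x p(x) \sum_t p(t \mid x) \, \frac{\pi(t \mid x)}{p(t \mid x)} \int_y y \, p(y \mid x, t) \, dy \, dx \tag{11} \\
&= \int_x \sum_t \int_y p(x, t, y) \, \frac{\pi(t \mid x)}{p(t \mid x)} \, y \, dy \, dx \tag{12}
\end{align}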
Here, in Equation (10) we multiply by one, in Equation (11) we use the assumption of ignorability to write $p(Y(t) = y \mid x) = p(y \mid x, t)$ and rearrange terms, and in Equation (12) we collect the terms which represent the observed distribution. For our purposes, it is sufficient to look at the integral in Equation (12) to see that it requires that, for all $(x, t)$, the relationship $\pi(t \mid x) > 0 \implies p(t \mid x) > 0$ holds. ∎
The condition given in Proposition 1 is sometimes referred to as the condition of coverage [see 36, Section 5.5] in off-policy evaluation. Rewriting Equation (12) as an expectation over the observed distribution, we can see that this leads naturally to the importance sampling [15] estimator
\begin{equation}
\frac{1}{n} \sum_{i=1}^{n} \frac{\pi(t_i \mid x_i)}{p(t_i \mid x_i)} \, y_i \tag{13}
\end{equation}
which approximates our desired quantity. If $p(t \mid x) < \epsilon$ for some small value of $\epsilon$, then the variance of the importance sampling estimator increases dramatically. This motivates our notion of “strict” coverage: for each value of $x$ and for all actions $t$ such that $\pi(t \mid x) > 0$, the condition $p(t \mid x) > \epsilon$ must hold.
Note that this differs conceptually from the binary treatment case in an important respect: since we are not seeking to contrast all treatments, we do not require that $p(t \mid x) > \epsilon$ for every treatment $t$, but only for those treatments which have positive probability of being taken under the target policy.
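As a rough illustration of why small behavior probabilities are a problem, the following sketch computes importance weights and the importance sampling estimate for a toy dataset. The function names, the `(x, t, y)` sample layout, and the toy policies are assumptions made for this example only and are not taken from the paper's implementation.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int, float]  # (covariate x, treatment t, outcome y); hypothetical layout
Policy = Callable[[int, float], float]  # policy(t, x) -> probability of t given x


def importance_weights(samples: Sequence[Sample], target: Policy, behavior: Policy) -> List[float]:
    """Return pi(t|x) / p(t|x) for each observed sample."""
    weights = []
    for x, t, _ in samples:
        pi = target(t, x)
        # Actions the target policy never takes contribute nothing to the estimate.
        weights.append(0.0 if pi == 0 else pi / behavior(t, x))
    return weights


def is_estimate(samples: Sequence[Sample], target: Policy, behavior: Policy) -> float:
    """Importance sampling estimate of the target policy value."""
    weights = importance_weights(samples, target, behavior)
    return sum(w * y for w, (_, _, y) in zip(weights, samples)) / len(samples)


def behavior(t: int, x: float) -> float:
    # Toy behavior policy: treats half the time when x < 0, very rarely otherwise.
    p_treat = 0.5 if x < 0 else 0.015625
    return p_treat if t == 1 else 1.0 - p_treat


def target(t: int, x: float) -> float:
    # Toy target policy: always treat.
    return 1.0 if t == 1 else 0.0


samples = [(-1.0, 1, 2.0), (-1.0, 0, 0.0), (1.0, 1, 1.0), (1.0, 0, 3.0)]
weights = importance_weights(samples, target, behavior)
estimate = is_estimate(samples, target, behavior)
```

In this toy data a single rarely treated point receives a weight of 64 and dominates the estimate, which is the behavior that strict coverage is meant to rule out.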
Algorithmic Details
As described in the main paper, applying OverRule to the policy evaluation setting only requires a single change to the procedure: the set $\mathcal{B}^{\epsilon}(\pi)$ is used in place of the set $\mathcal{B}^{\epsilon}$ in Equation 7 in Section 4.2. Nonetheless, we provide an explicit self-contained sketch of the procedure here to avoid any confusion:

Given a dataset, find an MV set using the approach given in the main paper.

Using this set, learn the conditional probabilities of each possible treatment $p(t \mid x)$, resulting in estimated propensities $\hat{p}(t \mid x)$.

For each data point in the support set , assign the label
where . The set is the collection of data points such that . Note that we know the target policy that we are evaluating, so we can evaluate for each data point.

Solve the following Neyman-Pearson-like classification problem, using the techniques discussed in the main paper. Note that this is identical to solving Equation 7 in Section 4.2, with the substitution of $\mathcal{B}^{\epsilon}(\pi)$ for $\mathcal{B}^{\epsilon}$:
(14)
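The labeling step above can be sketched as follows, assuming the estimated propensities and the target policy probabilities are available per data point as dictionaries mapping each treatment to a probability. The function name `coverage_labels` and the data layout are hypothetical and chosen only for illustration; the rule-learning step itself is not shown.

```python
from typing import Dict, List, Sequence


def coverage_labels(
    target_probs: Sequence[Dict[int, float]],
    est_propensities: Sequence[Dict[int, float]],
    eps: float,
) -> List[int]:
    """Label a point 1 if every treatment the target policy may take there
    has estimated propensity above eps, and 0 otherwise."""
    labels = []
    for pi_x, p_x in zip(target_probs, est_propensities):
        covered = all(
            p_x.get(t, 0.0) > eps
            for t, prob in pi_x.items()
            if prob > 0  # only treatments with positive target probability matter
        )
        labels.append(1 if covered else 0)
    return labels


# Three toy data points with two treatments each.
target_probs = [{0: 0.0, 1: 1.0}, {0: 0.5, 1: 0.5}, {0: 1.0, 1: 0.0}]
est_propensities = [{0: 0.98, 1: 0.02}, {0: 0.6, 1: 0.4}, {0: 0.98, 1: 0.02}]
labels = coverage_labels(target_probs, est_propensities, eps=0.05)
```

The third point is labeled 1 even though treatment 1 is rarely observed there, because the target policy never selects treatment 1 at that point.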
Appendix B Additional experimental results
B.1 Synthetic task: Estimating causal effects
To illustrate the utility of characterizing overlap before estimating causal effects, we perform a synthetic observational study of the average treatment effect (ATE) of a treatment $T$ on an outcome $Y$ under confounding by a variable $X$. Under the distribution , the true ATE on the overlap set is . We compare estimates of the ATE based on a) all available data and b) an estimate of the overlap region. For global estimators that are sensitive to misspecification, using samples outside of the overlap region often worsens estimates of the ATE.
[Table 2: MAE of the DiM and OLS estimators under three sample selections (all data and the two estimated overlap regions).]
We generate according to a 2D Gaussian mixture model with the mixture component and . We let , with the logistic function, linear weights, a constant offset, and noise. We compare two estimators of the ATE: difference in means (DiM) and ordinary least-squares regression (OLS), in which separate OLS models are fit to the outcome for each of the two treatment groups. We also fit two OverRule models, one with weak regularization and one with no regularization.
We report the mean absolute error (MAE) of the ATE estimate over repeated experiments to measure performance. As seen in Table 2, estimators that use only samples in the overlap region estimated by either OverRule model give better MAE than the same estimators applied to the whole data. OLS performs better than DiM in all three cases, as it adjusts for confounding, and the weakly regularized OverRule model outperforms the unregularized one. The results show that, under model misspecification, restricting the study population to comply with the overlap assumption prior to estimation can be beneficial.
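The two estimators compared above can be sketched as below. This is a minimal illustration on a deterministic toy dataset with a single confounder, not the 2D Gaussian mixture used in the experiment. The names `dim_ate`, `fit_line`, `ols_ate`, and the interval standing in for an estimated overlap region are all assumptions made for this example.

```python
from statistics import mean
from typing import List, Tuple

Row = Tuple[float, int, float]  # (confounder x, treatment t, outcome y); hypothetical layout


def dim_ate(rows: List[Row]) -> float:
    """Difference in means: mean treated outcome minus mean control outcome."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def ols_ate(rows: List[Row]) -> float:
    """Fit one line per treatment group and average the predicted difference over all x."""
    a1, b1 = fit_line([x for x, t, _ in rows if t == 1], [y for _, t, y in rows if t == 1])
    a0, b0 = fit_line([x for x, t, _ in rows if t == 0], [y for _, t, y in rows if t == 0])
    return mean((a1 + b1 * x) - (a0 + b0 * x) for x, _, _ in rows)


# Toy data with y = x + 2 * t, so the true effect is 2; treated units have larger x.
rows = [(0.0, 0, 0.0), (1.0, 0, 1.0), (2.0, 0, 2.0),
        (2.0, 1, 4.0), (3.0, 1, 5.0), (4.0, 1, 6.0)]

# Hypothetical stand-in for an estimated overlap region: 1 <= x <= 3.
overlap_rows = [r for r in rows if 1.0 <= r[0] <= 3.0]

dim_all = dim_ate(rows)
ols_all = ols_ate(rows)
dim_overlap = dim_ate(overlap_rows)
```

On this toy data the difference in means is biased by confounding on the full sample and moves closer to the true effect after restricting to the stand-in overlap region, while the per-group regression recovers the effect because the linear model is correctly specified here.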
B.2 Jobs
In Figure 5 we see the correlation between the held-out AUC for the rule set w.r.t. the experimental label and the AUC for the rule set in approximating the base estimator. AUC is equal to balanced accuracy for binary predictions.
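The equality between AUC and balanced accuracy for binary predictions can be checked numerically. The sketch below computes AUC as the fraction of positive-negative pairs ranked correctly, counting ties as one half, and compares it with the average of the true positive and true negative rates. The helper names and toy labels are assumptions made for this example.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Pairwise AUC: probability a positive outranks a negative, ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels: List[int], preds: List[int]) -> float:
    """Average of the true positive rate and the true negative rate."""
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = sum(1 for l in labels if l == 0)
    tpr = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1) / n_pos
    tnr = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 0) / n_neg
    return (tpr + tnr) / 2


labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 1, 0]  # binary output of a rule set (toy values)

auc_value = auc(labels, [float(p) for p in preds])
bacc_value = balanced_accuracy(labels, preds)
```

With binary scores, every positive-negative pair is either ranked correctly, tied, or ranked incorrectly, and summing these cases gives exactly the average of the two rates.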
Rule S.1  
Age 20  
and  educ 8 
and  Not Black 
and  Not Hispanic 
and  Married 
and  RE74 $6270 
and  RE75 $1162 
or Rule S.2  
Not Hispanic  
and  RE74 $33220 
and  RE75 $32200 
or Rule S.3  
Not black  
and  Hispanic 
and  Education 12 years 
and  RE74 $21900 
and  RE75 $9850 
B.3 Opioids
For a full table of covariate statistics for the Opioids dataset, see the supplementary covariates table. For an illustration of the rules learned by OverRule to describe the complement of the overlap set, see Figure 4.