###### Abstract

Decision makers, such as doctors and judges, make crucial decisions on a daily basis, such as recommending treatments to patients and granting bail to defendants. Such decisions typically involve weighing the potential benefits of taking an action against the costs involved. In this work, we aim to automate this task of learning cost-effective, interpretable and actionable treatment regimes. We formulate this as a problem of learning a decision list (a sequence of if-then-else rules) which maps characteristics of subjects (e.g., diagnostic test results of patients) to treatments. We propose a novel objective to construct a decision list which maximizes outcomes for the population, and minimizes overall costs. We model the problem of learning such a list as a Markov Decision Process (MDP) and employ a variant of the Upper Confidence Bound for Trees (UCT) strategy which leverages customized checks for pruning the search space effectively. Experimental results on real-world observational data capturing judicial bail decisions and treatment recommendations for asthma patients demonstrate the effectiveness of our approach.

Learning Cost-Effective Treatment Regimes Using Markov Decision Processes

Himabindu Lakkaraju (Stanford University), Cynthia Rudin (Duke University)

## 1 Introduction

Figure 1: Example treatment regime expressed as a decision list.

If Spiro-Test = Pos and Prev-Asthma = Yes and Cough = High then C

Else if Spiro-Test = Pos and Prev-Asthma = No then Q

Else if Short-Breath = Yes and Gender = F and Age ≥ 40 and Prev-Asthma = Yes then C

Else if Peak-Flow = Yes and Prev-RespIssue = No and Wheezing = Yes then Q

Else if Chest-Pain = Yes and Prev-RespIssue = Yes and Methacholine = Pos then C

Else Q

Medical and judicial decisions can be complex: they involve careful assessment of the subject’s condition, analysis of the costs associated with the possible actions, and consideration of the consequent outcomes. Further, there might be costs associated with the assessment of the subject’s condition itself (e.g., physical pain endured during medical tests, monetary costs, etc.). For instance, a doctor first diagnoses the patient’s condition by studying the patient’s medical history and ordering a set of relevant tests that are crucial to the diagnosis. In doing so, she also factors in the physical, mental and monetary costs incurred due to each of these tests. Based on the test results, she carefully deliberates over various treatment options, and analyzes the potential side effects as well as the effectiveness of each of these options. Analogously, a judge deciding if a defendant should be granted bail studies the criminal records of the defendant, and inquires about additional information (e.g., the defendant’s personal life or economic status) if needed. She then recommends a course of action that trades off the risk of granting bail to the defendant (the defendant may commit a new crime when out on bail) against the cost of denying bail (adverse effects on the defendant or the defendant’s family, cost of jail to the county).

In practical situations, human decision makers often leverage personal experience to make decisions, without considering data, even if massive amounts of it exist for the problem at hand. There exist domains where machine learning models could potentially help – but they would need to consider all three aspects discussed above: predictions of counterfactuals, costs of gathering information, and costs of treatments. Further, these models must be interpretable in order to create any reasonable chance of a human decision maker actually using them. In this work, we address the problem of learning such cost-effective, interpretable treatment regimes from observational data.

Prior research addresses various aspects of the problem at hand in isolation. For instance, there exists a large body of literature on estimating treatment effects [8, 24, 7], recommending optimal treatments [1, 34, 9], and learning intelligible models for prediction [19, 16, 21, 4]. However, an effective solution for the problem at hand should ideally incorporate all of the aforementioned aspects. Furthermore, existing solutions for learning treatment regimes account for neither the costs associated with gathering the required information nor the treatment costs. The goal of this work is to propose a framework which jointly addresses all of the aforementioned aspects.

We address the problem at hand by formulating it as a task of learning a decision list that maps subject characteristics to treatments (such as the one shown in Figure 1) such that it: 1) maximizes the expectation of a pre-specified outcome when used to assign treatments to a population of interest; 2) minimizes costs associated with assessing subjects’ conditions; and 3) minimizes costs associated with the treatments themselves. We choose decision lists to express the treatment regimes because they are highly intelligible, and therefore, readily employable by decision makers. We propose a novel objective function to learn a decision list optimized with respect to the criteria highlighted above. We prove that optimizing the proposed objective is NP-hard via a reduction from the weighted exact cover problem. We then optimize this objective by modeling it as a Markov Decision Process (MDP) and employing a variant of the Upper Confidence Bound for Trees (UCT) strategy which leverages customized checks for pruning the search space effectively.

We empirically evaluate the proposed framework on two real-world datasets: 1) judicial bail decisions and 2) treatment recommendations for asthma patients. Our results demonstrate that the regimes output by our framework result in improved outcomes compared to state-of-the-art baselines at much lower costs. Further, the treatment regimes output by our approach are less complex and require fewer diagnostic checks to determine the optimal treatment.

## 2 Related Work

Below, we provide an overview of related research on learning treatment regimes, dynamic optimal treatment regimes, subgroup analysis, and interpretable models.

Treatment Regimes. The problem of learning treatment regimes has been extensively studied in the context of medicine and health care. Along the lines of [36], the literature on treatment regimes can be categorized into regression-based methods and policy-search-based methods. Regression-based methods [28, 32, 29, 33, 40, 26] model the conditional distribution of the outcomes given the treatment and characteristics of patients and choose the treatment resulting in the best possible outcome for each individual. Policy-search-based methods search for a policy (a function which assigns treatments to individuals) within a pre-specified class of policies. The policy is chosen to optimize the expected outcome across the population of interest. Examples of such estimators include marginal structural mean models [29], outcome weighted learning [39, 38], and robust marginal mean models [35, 36]. Very few of the aforementioned solutions [36, 25] produce regimes which are intelligible. None of the aforementioned approaches explicitly account for treatment costs or for costs associated with gathering information pertaining to patient characteristics.

While most work on learning treatment regimes has been done in the context of medicine, the same ideas apply to policies in other fields. To the best of our knowledge, this work is the first attempt to extend work on treatment regimes to judicial bail decisions.

Dynamic Treatment Regimes. Recent research in personalized medicine has focused on developing dynamic treatment regimes [15, 37, 34, 9]. The goal is to learn treatment regimes that maximize outcomes for patients in a given population by recommending a sequence of appropriate treatments over time, based on the state of the patient. There has been little attention paid to interpretability in this literature (with the exception of [37]). None of the prior solutions for this problem consider treatment costs or costs associated with diagnosing a patient’s condition.

Subgroup Analysis. The goal of this line of research is to find out whether there exist subgroups of individuals in which a given treatment exhibits heterogeneous effects, and if so, how the treatment effect varies across them. This problem has been well studied [31, 10, 20, 3, 11]. However, identifying subgroups with heterogeneous treatment effects does not readily provide us with regimes.

Interpretable Models. A large body of machine learning literature has focused on developing interpretable models for classification [19, 16, 21, 4] and clustering [12, 18, 17]. To this end, various classes of models such as decision lists [19], decision sets [16], prototype (case) based models [4], and generalized additive models [21] have been proposed. These classes of models were not conceived to model treatment effects. There has been recent work on leveraging decision lists to describe estimated treatment regimes [25, 14, 36]. These solutions do not account for the treatment costs or the costs involved in gathering patient characteristics. They are also constructed using greedy methods, which can degrade the quality of the resulting models.

## 3 Our Framework

First, we formalize the notion of treatment regimes and discuss how to represent them as decision lists. We then propose an objective function for constructing cost-effective treatment regimes.

### 3.1 Input Data and Cost Functions

Consider a dataset $\mathcal{D} = \{(x_1, a_1, y_1), \ldots, (x_N, a_N, y_N)\}$ comprised of $N$ independent and identically distributed observations, each of which corresponds to a subject (individual), potentially from an observational study. Let $x_i = [x_i^{(1)}, \ldots, x_i^{(p)}] \in [\mathcal{V}_1, \ldots, \mathcal{V}_p]$ denote the characteristics of subject $i$. $\mathcal{V}_f$ denotes the set of all possible values that can be assumed by a characteristic $f \in \mathcal{F} = \{1, \ldots, p\}$. Each characteristic can be a binary, categorical or real-valued variable. In the medical setting, example characteristics include a patient’s age, BMI, gender, red blood cell count, glucose level, etc. Let $a_i \in \mathcal{A} = \{1, \ldots, m\}$ and $y_i \in \mathbb{R}$ denote the treatment assigned to subject $i$ and the corresponding outcome respectively. We assume that $y_i$ is defined such that higher values indicate better outcomes. For example, the outcome of a patient can be regarded as a wellness improvement score that indicates the effectiveness of the assigned treatment.

It can be much more expensive to determine certain subject characteristics compared to others. For instance, a patient’s age can be easily retrieved either from previous records or by asking the patient. On the other hand, determining her glucose level requires more comprehensive testing, and is therefore more expensive in terms of the monetary costs, time and effort required from both the patient and the clinicians. We assume access to a function $d: \mathcal{F} \to \mathbb{R}$ which returns the cost of determining any characteristic in $\mathcal{F}$. The cost associated with a given characteristic is assumed to be the same for all the subjects in the population, though the framework can be extended to have patient-specific costs. Analogously, each treatment incurs a cost, and we assume access to a function $d': \mathcal{A} \to \mathbb{R}$ which returns the cost associated with treatment $a \in \mathcal{A}$.

We now discuss the notion of a treatment regime formally, and then introduce the class of models that we employ to express such regimes.

### 3.2 Treatment Regimes

A treatment regime is a function $\pi$ which takes as input the characteristics $x$ of any given subject and maps them to an appropriate treatment $a \in \mathcal{A}$. As discussed, prior studies [30, 23] suggest that decision makers such as doctors and judges who make high-stakes decisions are more likely to trust, and therefore employ, models which are interpretable and transparent. We thus employ decision lists to express treatment regimes (see example in Figure 1); we use the terms decision list and treatment regime interchangeably from here on. A decision list is an ordered list of rules embedded within an if-then-else structure. A treatment regime expressed as a decision list is a sequence of $L$ rules $\pi = [r_1, r_2, \ldots, r_L]$. The last one, $r_L$, is a default rule which applies to all those subjects who do not satisfy any of the previous rules. Each rule $r_j$ (except the default rule) is a tuple of the form $(c_j, a_j)$ where $a_j \in \mathcal{A}$, and $c_j$ represents a pattern which is a conjunction of one or more predicates. Each predicate takes the form $(f, o, v)$ where $f \in \mathcal{F}$, $o \in \{=, \neq, \leq, \geq, <, >\}$, and $v \in \mathcal{V}_f$ denotes some value that can be assumed by the characteristic $f$. For instance, “Age $\geq$ 40 $\wedge$ Gender = Female” is an example of such a pattern. A subject $i$ is said to satisfy rule $r_j$ if his/her characteristics $x_i$ satisfy all the predicates in $c_j$. Let us formally denote this using an indicator function, $\text{satisfy}(x_i, c_j)$, which returns $1$ if $x_i$ satisfies $c_j$ and $0$ otherwise.

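The rules in $\pi$ partition the dataset $\mathcal{D}$ into $L$ groups: $\{\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_L\}$. A group $\mathcal{R}_j$, where $j \in \{1, \ldots, L\}$, is comprised of those subjects that satisfy the pattern $c_j$ of rule $j$ but do not satisfy any of $c_1, c_2, \ldots, c_{j-1}$. This can be formally written as:

$$\mathcal{R}_j = \Big\{ x \in [\mathcal{V}_1, \ldots, \mathcal{V}_p] \;\Big|\; \text{satisfy}(x, c_j) \wedge \bigwedge_{t=1}^{j-1} \neg\, \text{satisfy}(x, c_t) \Big\} \tag{1}$$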

The treatment assigned to each subject by $\pi$ is determined by the group that he/she belongs to. For instance, if subject $i$ with characteristics $x_i$ belongs to group $\mathcal{R}_j$ induced by $\pi$, i.e., $x_i \in \mathcal{R}_j$, then subject $i$ will be assigned the corresponding treatment $a_j$ under the regime $\pi$. More formally,

$$\pi(x_i) = \sum_{l=1}^{L} a_l \, \mathbb{1}(x_i \in \mathcal{R}_l) \tag{2}$$

where $\mathbb{1}(\cdot)$ denotes an indicator function that returns $1$ if the condition within the brackets evaluates to true and $0$ otherwise. Thus, $\pi(x_i)$ returns the treatment assigned to $x_i$.
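The rule-matching logic behind Eqns. 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `satisfy`, `assign_treatment`, the operator strings, and the characteristic keys (`spiro_test`, `prev_asthma`, `cough`) are hypothetical, and the two rules are loosely modeled on the first two rules of Figure 1.

```python
import operator
from typing import Callable, Dict, List, Tuple

Predicate = Tuple[str, str, object]   # (characteristic, operator, value)
Pattern = List[Predicate]             # a conjunction of predicates
Rule = Tuple[Pattern, str]            # (pattern, treatment)

# Hypothetical string encoding of the comparison operators a predicate may use.
OPS: Dict[str, Callable[[object, object], bool]] = {
    "==": operator.eq, "!=": operator.ne,
    "<=": operator.le, ">=": operator.ge,
    "<": operator.lt, ">": operator.gt,
}

def satisfy(x: Dict[str, object], pattern: Pattern) -> bool:
    """True if subject x satisfies every predicate in the pattern."""
    return all(OPS[op](x[f], v) for f, op, v in pattern)

def assign_treatment(x: Dict[str, object], rules: List[Rule], default_treatment: str) -> str:
    """Return the treatment of the first rule x satisfies, else the default."""
    for pattern, treatment in rules:
        if satisfy(x, pattern):
            return treatment
    return default_treatment

# Illustrative rules (assumed encoding of the first two rules in Figure 1).
rules: List[Rule] = [
    ([("spiro_test", "==", "Pos"), ("prev_asthma", "==", "Yes"), ("cough", "==", "High")], "C"),
    ([("spiro_test", "==", "Pos"), ("prev_asthma", "==", "No")], "Q"),
]
patient = {"spiro_test": "Pos", "prev_asthma": "No", "cough": "Low"}
treatment = assign_treatment(patient, rules, default_treatment="Q")
```

Because the rules are scanned in order and the scan stops at the first match, each subject falls into exactly one group, which is what makes the sum in Eqn. 2 pick out a single treatment.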

Similarly, the cost incurred when we assign a treatment to the subject $i$ (treatment cost) according to the regime $\pi$ is given by:

$$\phi(x_i) = d'(\pi(x_i)) \tag{3}$$

where the function $d'$, defined in Section 3.1, takes as input a treatment $a \in \mathcal{A}$ and returns its cost.

We can also define the cost incurred in assessing the condition of a subject (assessment cost) as per the regime $\pi$. Note that a subject $i$ belongs to the group $\mathcal{R}_j$ if and only if the subject does not satisfy the conditions $c_1, \ldots, c_{j-1}$, but satisfies the condition $c_j$ (refer to Eqn. 1). To reach this conclusion, all the characteristics present in the corresponding antecedents $c_1, \ldots, c_j$ must have been measured for subject $i$ and evaluated against the appropriate predicate conditions. This implies that the assessment cost incurred for this subject is the sum of the costs of all the characteristics that appear in $c_1, \ldots, c_j$. If $\mathcal{N}_l$ denotes the set of all the characteristics that appear in $c_1, \ldots, c_l$, the assessment cost of the subject $i$ as per the regime $\pi$ can be written as:

$$\psi(x_i) = \sum_{l=1}^{L} \Big[ \mathbb{1}(x_i \in \mathcal{R}_l) \times \sum_{e \in \mathcal{N}_l} d(e) \Big] \tag{4}$$
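To make the two per-subject costs concrete, the sketch below walks a decision list and accumulates every characteristic seen up to and including the matching rule, which is the quantity the assessment cost sums over. It is an illustration under assumptions: patterns are simplified to equality-only dictionaries, the function and key names are hypothetical, and the mapping of keys to the costs listed in Table 1 (spirometry test = 4, asthma history = 1, cough = 1, quick relief = 10, controller drugs = 15) is our own.

```python
from typing import Dict, List, Tuple

# Simplified rule: (pattern as {characteristic: required value}, treatment).
Rule = Tuple[Dict[str, str], str]

def satisfies(x: Dict[str, str], pattern: Dict[str, str]) -> bool:
    """Equality-only pattern check (a simplification of general predicates)."""
    return all(x.get(f) == v for f, v in pattern.items())

def regime_costs(x, rules: List[Rule], default_treatment: str,
                 char_cost: Dict[str, float], treat_cost: Dict[str, float]):
    """Return (treatment, treatment cost, assessment cost) for one subject."""
    measured = set()
    for pattern, treatment in rules:
        measured |= set(pattern)      # characteristics in c_1..c_l must all be measured
        if satisfies(x, pattern):
            break
    else:
        treatment = default_treatment  # no rule matched: default rule applies
    assessment = sum(char_cost[f] for f in measured)
    return treatment, treat_cost[treatment], assessment

rules: List[Rule] = [
    ({"spiro_test": "Pos", "prev_asthma": "Yes"}, "C"),
    ({"cough": "High"}, "Q"),
]
char_cost = {"spiro_test": 4, "prev_asthma": 1, "cough": 1}
treat_cost = {"Q": 10, "C": 15}

first = regime_costs({"spiro_test": "Pos", "prev_asthma": "Yes", "cough": "Low"},
                     rules, "Q", char_cost, treat_cost)
last = regime_costs({"spiro_test": "Neg", "prev_asthma": "No", "cough": "Low"},
                    rules, "Q", char_cost, treat_cost)
```

A subject caught by the first rule pays only for the characteristics in that rule, whereas a subject who falls through to the default rule pays for every characteristic mentioned anywhere in the list. This is why placing cheap characteristics early in the list lowers the expected assessment cost.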

### 3.3 Objective Function

We now formulate the objective function for learning a cost-effective treatment regime. We first formalize the notions of expected outcome, assessment cost, and treatment cost of a treatment regime $\pi$ with respect to the dataset $\mathcal{D}$.

##### Expected Outcome

Recall that the treatment regime $\pi$ assigns a subject $i$ with characteristics $x_i$ to a treatment $\pi(x_i)$ (Equation 2). The quality of the regime $\pi$ is partly determined by the expected outcome when all the subjects in $\mathcal{D}$ are assigned treatments according to $\pi$. The higher the value of such an expected outcome, the better the quality of the regime $\pi$. There is, however, one caveat to computing the value of this expected outcome: we only observe the outcome $y_i$ resulting from assigning $x_i$ to $a_i$ in the data $\mathcal{D}$, and not any of the counterfactuals. If the regime $\pi$, on the other hand, assigns a different treatment $a' \neq a_i$ to $x_i$, we cannot directly evaluate the policy on $x_i$.

The solutions proposed to compute expected outcomes in settings such as ours can be categorized as: adjustment by regression modeling, adjustment by inverse propensity score weighting, and doubly robust estimation. A detailed treatment of each of these approaches is presented in Lunceford et al. [22]. The success of regression-based modeling and inverse weighting depends heavily on the postulated regression model and the postulated propensity score model respectively. In either case, if the postulated models are not identical to the true models, we have biased estimates of the expected outcome. On the other hand, doubly robust estimation combines the above approaches in such a way that the estimated value of the expected outcome is unbiased as long as one of the postulated models is identical to the true model. The doubly robust estimator for the expected outcome of the regime $\pi$, denoted by $g_1(\pi)$, can be written as:

$$g_1(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} o(i, a), \qquad o(i, a) = \left[ \frac{\mathbb{1}(a_i = a)}{\hat{\omega}(x_i, a)} \big( y_i - \hat{y}(x_i, a) \big) + \hat{y}(x_i, a) \right] \mathbb{1}(\pi(x_i) = a) \tag{5}$$

$\hat{\omega}(x_i, a)$ denotes the probability that the subject $i$ with characteristics $x_i$ is assigned to treatment $a$ in the data $\mathcal{D}$. $\hat{\omega}$ represents the propensity score model. In practice, we fit a multinomial logistic regression model on $\mathcal{D}$ to learn this function. Our framework does not impose any constraints on the functional form of $\hat{\omega}$. Similarly, $\hat{y}(x_i, a)$ denotes the predicted outcome obtained as a result of assigning a subject characterized by $x_i$ to a treatment $a$. $\hat{y}$ corresponds to the outcome regression model and is learned in our experiments by fitting a linear regression model on $\mathcal{D}$ prior to optimizing for the treatment regimes. $\hat{\omega}$ and $\hat{y}$ could be modeled using any other method; this is an entirely separate step from the algorithm discussed here.
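A minimal sketch of a doubly robust value estimate follows, assuming the propensity model and the outcome model have already been fitted and are passed in as plain callables. The function name, the record layout, and the constant toy models are hypothetical and stand in for the fitted logistic and linear regressions mentioned above.

```python
from typing import Callable, Dict, List, Tuple

# (characteristics x_i, observed treatment a_i, observed outcome y_i)
Record = Tuple[Dict[str, float], str, float]

def doubly_robust_value(records: List[Record],
                        policy: Callable[[Dict[str, float]], str],
                        propensity: Callable[[Dict[str, float], str], float],
                        outcome_model: Callable[[Dict[str, float], str], float]) -> float:
    """Average doubly robust outcome estimate for a policy over the records."""
    total = 0.0
    for x, a_obs, y_obs in records:
        a = policy(x)                      # treatment the regime would assign
        y_hat = outcome_model(x, a)        # model-based prediction for that treatment
        if a_obs == a:
            # Observed treatment agrees with the regime: add the
            # inverse-propensity-weighted residual as a correction.
            total += y_hat + (y_obs - y_hat) / propensity(x, a)
        else:
            # No observed outcome for this treatment: fall back on the model.
            total += y_hat
    return total / len(records)

# Toy data and constant models, chosen only so the arithmetic is easy to follow.
records: List[Record] = [({"age": 52.0}, "C", 100.0), ({"age": 31.0}, "Q", 33.0)]
value = doubly_robust_value(
    records,
    policy=lambda x: "C",
    propensity=lambda x, a: 0.5,
    outcome_model=lambda x, a: 60.0,
)
```

In this toy case the first subject contributes 60 + (100 - 60) / 0.5 = 140 and the second contributes the model prediction 60, so the estimate is 100. Small propensities inflate the correction term, which is one reason such estimates are often clipped in practice; no clipping is applied here.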

##### Expected Assessment Cost

Recall that there are assessment costs associated with each subject. These costs are governed by the characteristics that will be used in assessing the subject’s condition and recommending a treatment. The assessment cost $\psi(x_i)$ of a subject $i$ treated using regime $\pi$ is given in Eqn. 4. The expected assessment cost across the entire population can be computed as:

$$g_2(\pi) = \frac{1}{N} \sum_{i=1}^{N} \psi(x_i) \tag{6}$$

It is important to ensure that our learning process favors regimes with smaller values of expected assessment cost. Keeping this cost low also ensures that the full decision list is sparse, which assists with interpretability.

##### Expected Treatment Cost

There is a cost associated with assigning a treatment to any given subject. The treatment cost $\phi(x_i)$ for a subject $i$ who is assigned treatment using regime $\pi$ is given in Eqn. 3. The expected treatment cost across the entire population can be computed as:

$$g_3(\pi) = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \tag{7}$$

The smaller the expected treatment cost of a regime, the more desirable it is in practice. We present the complete objective function below.

##### Complete Objective

We assume access to the following inputs: 1) the observational data $\mathcal{D}$; 2) a set $\mathcal{FP}$ of frequently occurring patterns in $\mathcal{D}$. Recall that each pattern corresponds to a conjunction of one or more predicates. An example pattern is “Age $\geq$ 40 $\wedge$ Gender = Female”. In practice, such patterns can be obtained by running a frequent pattern mining algorithm such as Apriori [2] on the set of subject characteristics $\{x_1, \ldots, x_N\}$; 3) a set of all possible treatments $\mathcal{A} = \{1, \ldots, m\}$.

We define the set of all possible (pattern, treatment) tuples as $\mathcal{L} = \{(c, a) \mid c \in \mathcal{FP}, a \in \mathcal{A}\}$ and $C(\mathcal{L})$ as the set of all possible combinations of $\mathcal{L}$. An element in $\mathcal{L}$ can be thought of as a rule in a decision list and an element in $C(\mathcal{L})$ can be thought of as a list of rules in a decision list (without the default rule). We then search over all elements in the set $C(\mathcal{L}) \times \mathcal{A}$ to find a regime which maximizes the expected outcome (Eqn. 5) while minimizing the expected assessment cost (Eqn. 6) and the expected treatment cost (Eqn. 7), all of which are computed over $\mathcal{D}$. Our objective function can be formally written as:

$$\arg\max_{\pi \in C(\mathcal{L}) \times \mathcal{A}} \; \lambda_1 g_1(\pi) - \lambda_2 g_2(\pi) - \lambda_3 g_3(\pi) \tag{8}$$

where $g_1, g_2, g_3$ are defined in Eqns. 5, 6, 7 respectively, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are non-negative weights that scale the relative influence of the terms in the objective.

###### Theorem 1

Optimizing the objective function in Eqn. 8 is NP-hard. (Please see appendix for details.)

Note that NP-hardness is a worst case categorization only; with an efficient search procedure, it is practical to obtain a good approximation on most reasonably-sized datasets.

### 3.4 Optimizing the Objective

We optimize our objective by modeling it as a Markov Decision Process (MDP) and then employing the Upper Confidence Bound on Trees (UCT) algorithm to find a treatment regime which maximizes Eqn. 8. We also propose and leverage customized checks for guiding the exploration of the UCT algorithm and pruning the search space effectively.

##### Markov Decision Process Formulation

Our goal is to find a sequence of rules which maximizes the objective function in Eqn. 8. To this end, we formulate a fully observable MDP such that the optimal policy of the posited formulation provides a solution to our objective function.

A fully observable MDP is characterized by a tuple $(S, A, T, R)$ where $S$ denotes the set of all possible states, $A$ denotes the set of all possible actions, and $T$ and $R$ represent the transition and reward functions respectively. Below we define each of these in the context of our problem. Figure 2 shows a snapshot of the state space and transitions for a small dataset.

State Space. Conceptually, each state in our state space captures the effect of some partial or fully constructed decision list. To illustrate, let us consider a partial decision list with just one rule “if Age $\geq$ 40 $\wedge$ Gender = Female, then T1”. This partial list implies that: (i) all those subjects that satisfy the condition of the rule are assigned treatment T1, and (ii) the age and gender characteristics will be required in determining treatments for all the subjects in the population.

To capture such information, we represent a state $s \in S$ by a list of tuples $[(\tau_1(s), \sigma_1(s)), \ldots, (\tau_N(s), \sigma_N(s))]$ where each tuple corresponds to a subject in $\mathcal{D}$. $\tau_i(s)$ is a binary vector of length $p$ defined such that $\tau_i^{(j)}(s) = 1$ if the characteristic $j$ will be required for determining subject $i$’s treatment, and 0 otherwise. Further, $\sigma_i(s)$ captures the treatment assigned to subject $i$. If no treatment has been assigned to $i$, then $\sigma_i(s) = 0$.

Note that we have a single start state $s_0$ which corresponds to an empty decision list. $\tau_i(s_0)$ is a vector of $0$s, and $\sigma_i(s_0) = 0$ for all $i$ in $\mathcal{D}$, indicating that no treatments have been assigned to any subject, and no characteristics were deemed as requirements for assigning treatments. Furthermore, a state $s$ is regarded as a terminal state if for all $i$, $\sigma_i(s)$ is non-zero, indicating that treatments have been assigned to all the subjects.

Actions. Each action $\tilde{a}$ can take one of the following forms: 1) a rule $r$, which is a tuple of the form (pattern, treatment), e.g., (Age $\geq$ 40 $\wedge$ Gender = Female, T1). This specifies that subjects who satisfy the conditions in the pattern are prescribed the treatment. Such an action leads to a non-terminal state. 2) a treatment $a \in \mathcal{A}$, which corresponds to the default rule; this action therefore leads to a terminal state.

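Transition and Reward Functions. We have a deterministic transition function which ensures that taking an action $\tilde{a}$ from state $s$ will always lead to the same state $s'$. Let $U(s) = \{i \mid \sigma_i(s) \neq 0\}$ denote the set of all those subjects to whom treatments have already been assigned in state $s$, and let $U^c(s) = \{i \mid \sigma_i(s) = 0\}$ denote the set of all those subjects who have not been assigned a treatment in the state $s$. Let $Q$ denote the set of all those subjects which do not belong to the set $U(s)$ and which satisfy the condition of action $\tilde{a}$. Let $E$ denote the set of all those characteristics in $\mathcal{F}$ which are present in the condition of action $\tilde{a}$. If action $\tilde{a}$ corresponds to a default rule, then $E = \emptyset$ and $Q = U^c(s)$. With this notation in place, the new state $s'$ can be characterized as follows: 1) $\tau_i(s') = \tau_i(s)$ and $\sigma_i(s') = \sigma_i(s)$ for all $i \in U(s)$; 2) $\tau_i^{(j)}(s') = 1$ for all $i \in U^c(s)$ and $j \in E$; 3) $\sigma_i(s') = a$ for all $i \in Q$, where $a$ is the treatment prescribed by $\tilde{a}$.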
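The deterministic transition described above can be sketched as follows. The representation is an assumption made for brevity: each subject's binary vector of required characteristics is stored as a frozenset of characteristic names, an unassigned treatment is `None` rather than 0, patterns are equality-only dictionaries, and a default-rule action is encoded as `(None, treatment)`. All function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Per-subject state: (characteristics required so far, assigned treatment or None).
SubjectState = Tuple[FrozenSet[str], Optional[str]]
# Action: (pattern, treatment) for a rule, or (None, treatment) for the default rule.
Action = Tuple[Optional[Dict[str, str]], str]

def satisfies(x: Dict[str, str], pattern: Dict[str, str]) -> bool:
    """Equality-only pattern check (a simplification of general predicates)."""
    return all(x.get(f) == v for f, v in pattern.items())

def transition(state: List[SubjectState], data: List[Dict[str, str]],
               action: Action) -> List[SubjectState]:
    """Apply one action (append one rule) and return the next state."""
    pattern, treatment = action
    new_chars = frozenset(pattern) if pattern is not None else frozenset()
    next_state: List[SubjectState] = []
    for x, (required, assigned) in zip(data, state):
        if assigned is not None:
            next_state.append((required, assigned))   # already treated: unchanged
            continue
        required = required | new_chars               # rule must be checked for this subject
        if pattern is None or satisfies(x, pattern):
            assigned = treatment                      # subject is captured by this rule
        next_state.append((required, assigned))
    return next_state

def is_terminal(state: List[SubjectState]) -> bool:
    """A state is terminal once every subject has a treatment."""
    return all(assigned is not None for _, assigned in state)

data = [{"cough": "High"}, {"cough": "Low"}]
start: List[SubjectState] = [(frozenset(), None) for _ in data]
after_rule = transition(start, data, ({"cough": "High"}, "C"))
after_default = transition(after_rule, data, (None, "Q"))
```

Storing only the set of required characteristics keeps states small and hashable after conversion to tuples, which is convenient if states are used as keys in a search tree; the binary-vector form in the text carries the same information.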

Similarly, the immediate reward obtained when we reach $s'$ by taking $\tilde{a}$ from the state $s$ can be written as:

where $o(i, a)$ is defined in Eqn. 5, and $d$ and $d'$ are the cost functions for characteristics and treatments respectively (see Section 3.1).
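The immediate reward can be read as the change in the objective caused by a single action. The expression below is a sketch of one form consistent with Eqns. 5 to 8 and with the transition rules, not a formula taken from the text: the symbols $Q$ (subjects newly assigned by the action), $E$ (characteristics in the action's pattern), $U^c(s)$ (subjects unassigned in $s$), $\tau$, and the weights $\lambda_1, \lambda_2, \lambda_3$ are assumed here, $a$ is the treatment prescribed by the action $\tilde{a}$, and $o(i, a)$ is evaluated with $\pi(x_i) = a$.

```latex
R(s, \tilde{a}, s') =
    \frac{\lambda_1}{N} \sum_{i \in Q} o(i, a)
  - \frac{\lambda_2}{N} \sum_{i \in U^c(s)} \sum_{e \in E}
      \mathbb{1}\big(\tau_i^{(e)}(s) = 0\big)\, d(e)
  - \frac{\lambda_3}{N} \, |Q| \, d'(a)
```

Under this form, the first term credits the estimated outcomes of the subjects who have just received a treatment, the second charges each still-unassigned subject for characteristics that the new rule requires and that were not already required, and the third charges the treatment cost for the newly assigned subjects. Summed along a path from the start state to a terminal state, the terms add up to $\lambda_1 g_1(\pi) - \lambda_2 g_2(\pi) - \lambda_3 g_3(\pi)$ for the resulting decision list.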

Table 1: Summary of the bail and asthma datasets.

| | Bail Dataset | Asthma Dataset |
|---|---|---|
| # of Data Points | 86152 | 60048 |
| Characteristics & Costs | age, gender, previous offenses, prior arrests, current charge, SSN (cost = 1); marital status, kids, owns house, pays rent, addresses in past years (cost = 2); mental illness, drug tests (cost = 6) | age, gender, BMI, BP, short breath, temperature, cough, chest pain, wheezing, past allergies, asthma history, family history, has insurance (cost = 1); peak flow test (cost = 2); spirometry test (cost = 4); methacholine test (cost = 6) |
| Treatments & Costs | release on personal recognizance (cost = 20); release on conditions/bond (cost = 40) | quick relief (cost = 10); controller drugs (cost = 15) |
| Outcomes & Scores | no risk (score = 100); failure to appear (score = 66); non-violent crime (score = 33); violent crime (score = 0) | no asthma attack for 4 months (score = 100); no asthma attack for 2 months (score = 66); no asthma attack for 1 month (score = 33); asthma attack in less than 2 weeks (score = 0) |

##### UCT with Customized Pruning

The basic idea behind the Upper Confidence Bound on Trees (UCT) [13] algorithm is to iteratively construct a search tree for some pre-determined number of iterations. At the end of this procedure, the best performing policy or sequence of actions is returned as the output. Each node in the search tree corresponds to a state in the MDP state space and the links in the tree correspond to the actions. UCT employs the UCB-1 metric [6] for navigating through the search space.
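For readers unfamiliar with UCB-1, the selection step at a tree node can be sketched as below. This shows the standard UCB-1 score (mean reward plus an exploration bonus), not the customized variant used in this work; the dictionary layout, the function name, and the default exploration constant are assumptions for illustration. Rewards are assumed to be scaled to a comparable range before use.

```python
import math
from typing import Dict, List

def ucb1_select(children: List[Dict[str, float]],
                exploration: float = math.sqrt(2)) -> int:
    """Return the index of the child with the highest UCB-1 score.

    Each child is a dict with 'visits' and 'total_reward'.
    Unvisited children are tried first.
    """
    parent_visits = sum(child["visits"] for child in children)
    best_index, best_score = -1, -math.inf
    for index, child in enumerate(children):
        if child["visits"] == 0:
            return index
        mean = child["total_reward"] / child["visits"]
        bonus = exploration * math.sqrt(math.log(parent_visits) / child["visits"])
        if mean + bonus > best_score:
            best_index, best_score = index, mean + bonus
    return best_index

children = [
    {"visits": 10, "total_reward": 8.0},   # well explored, high mean
    {"visits": 1, "total_reward": 0.5},    # barely explored, lower mean
]
chosen = ucb1_select(children)
```

Here the rarely visited child is chosen even though its mean reward is lower, because its exploration bonus dominates; as its visit count grows the bonus shrinks and the choice is driven increasingly by the mean.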

We employ a UCT-based algorithm for finding the optimal policy of our MDP formulation, though we leverage customized checks to further guide the exploration process and prune the search space. Recall that each non-terminal state in our state space corresponds to a partial decision list. We exploit the fact that we can upper-bound the value of the objective for any given partial decision list. The upper bound on the objective for any given non-terminal state can be computed by optimistically assuming that: 1) all the subjects who have not been assigned treatments will get the best possible treatments without incurring any treatment cost; and 2) no additional assessments are required by any subject in the population (and hence no additional assessment costs are levied). The upper bound on the incremental reward is thus:
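One expression consistent with the two optimistic assumptions above is sketched here. It is offered as a reconstruction under assumptions rather than as the formula from the text: $U^c(s)$ denotes the subjects without a treatment in state $s$, $\lambda_1$ is the weight on the outcome term, and $o(i, a)$ is the per-subject doubly robust term of Eqn. 5 evaluated as if the regime assigned treatment $a$ to subject $i$.

```latex
\text{UB}(s) = \frac{\lambda_1}{N} \sum_{i \in U^c(s)} \max_{a \in \mathcal{A}} \, o(i, a)
```

No cost terms are subtracted, since the bound assumes that the remaining subjects are treated for free and need no further assessments. Adding this quantity to the objective value already accumulated by the partial list gives an upper bound on the value of any completion of that list.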

During the execution of the UCT procedure, whenever there is a choice to be made about which action needs to be taken, we employ checks based on the upper bound of the objective value of the resulting state. Consider a scenario in which the UCT procedure is currently in state $s$ and needs to choose an action. For each possible action $\tilde{a}$ from state $s$ that does not correspond to a default rule (if the action is a default rule, the corresponding decision list is fully constructed and we can compute the exact value of the objective function), we determine the upper bound on the objective value of the resulting state $s'$. If this value is less than either the highest value encountered previously for a complete rule list, or the objective value corresponding to the best default action from the state $s$, then we block the action $\tilde{a}$ from the state $s$, since the resulting state $s'$ is provably suboptimal.
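The blocking rule can be sketched as a pair of small functions. This is an illustration under assumptions: the function names, the argument layout, and the idea of passing in precomputed best-case outcome estimates for the unassigned subjects are ours, and the numbers in the example are made up.

```python
from typing import Sequence

def upper_bound(partial_value: float,
                best_outcomes_unassigned: Sequence[float],
                outcome_weight: float,
                n_subjects: int) -> float:
    """Optimistic objective value for a partial decision list.

    Unassigned subjects are credited with their best achievable outcome
    estimate and charged no further treatment or assessment cost.
    """
    return partial_value + outcome_weight * sum(best_outcomes_unassigned) / n_subjects

def should_block(bound_next_state: float,
                 best_complete_value: float,
                 best_default_value_here: float) -> bool:
    """Block an action if even its optimistic value cannot beat what is already known."""
    return (bound_next_state < best_complete_value
            or bound_next_state < best_default_value_here)

# Made-up numbers: 4 subjects, 2 still unassigned after the candidate rule.
bound = upper_bound(partial_value=10.0, best_outcomes_unassigned=[80.0, 90.0],
                    outcome_weight=1.0, n_subjects=4)
blocked = should_block(bound, best_complete_value=60.0, best_default_value_here=40.0)
kept = should_block(bound, best_complete_value=50.0, best_default_value_here=40.0)
```

Because the bound is optimistic, blocking on it never discards an action that could lead to a better complete list than the best one already found; it only saves the search from expanding branches that cannot win.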

## 4 Experimental Evaluation

Here, we discuss the detailed experimental evaluation of our framework. First we analyze the outcomes obtained and costs incurred when recommending treatments using our approach. Then, we present an ablation study which explores the contributions of each of the terms in our objective, followed by an analysis on real data.

##### Dataset Descriptions

Our first dataset consists of information pertaining to the bail decisions of about 86K defendants (see Table 1). It captures information about various defendant characteristics such as demographic attributes, past criminal history, personal and health related information for each of the 86K defendants. Further, the decisions made by judges in each of these cases (release without/with conditions) and the corresponding outcomes (e.g., if a defendant committed another crime when out on bail) are also available.

We assigned costs to characteristics and treatments based on discussions with subject matter experts. The characteristics that were harder to obtain were assigned higher costs compared to the ones that were readily available. Similarly, the treatment that placed a higher burden on the defendant (release on condition) was assigned a higher cost. When assigning scores to outcomes, undesirable scenarios (e.g., violent crime when released on bail) received lower scores.

Our second dataset (see Table 1) captures details of about 60K asthma patients [16]. For each of these 60K patients, various attributes such as demographics, symptoms, past health history, and test results have been recorded. Each patient in the dataset was prescribed either quick relief medications or long-term controller drugs. Further, the outcomes in the form of time to the next asthma attack (after the treatment began) were recorded. The longer this interval, the better the outcome, and the higher the outcome score.

We assigned costs to characteristics and treatments based on the inconvenience (physical/mental/monetary) they caused to patients.

##### Baselines

We compared our framework to the following state-of-the-art treatment recommendation approaches: 1) Outcome Weighted Learning (OWL) [39]; 2) Modified Covariate Approach (MCA) [32]; and 3) Interpretable and Parsimonious Treatment Regime Learning (IPTL) [36]. While none of these approaches explicitly account for treatment costs or the costs required for gathering the subject characteristics, MCA and IPTL minimize the number of characteristics/covariates required for deciding the treatment of any given subject. OWL, on the other hand, utilizes all the characteristics available in the data when assigning treatments.

##### Experimental Setting

The objective function that we proposed in Eqn. 8 has three parameters $\lambda_1, \lambda_2, \lambda_3$. These parameters could either be specified by an end-user or learned using a validation set. We set aside 5% of each of our datasets as a validation set to estimate these parameters. We automatically searched the parameter space to find a set of parameters that produced a decision list with the maximum average outcome on the validation set (discussed in detail later) and satisfied some simple constraints such as: 1) average assessment cost $\leq$ 4 on both datasets; 2) average treatment cost $\leq$ 30 for the bail data and $\leq$ 12 for the asthma data. Specifically, we used a coordinate ascent strategy to search the parameter space, updating each parameter while holding the other two parameters constant. The value of each parameter was chosen via a binary search over a fixed interval. We ran the UCT procedure for our approach for 50K iterations. We used both Gaussian and linear kernels for OWL and employed the tuning strategy discussed in Zhao et al. [39]. In the case of IPTL, we capped the number of rules in the treatment regime using the corresponding parameter. We evaluated the performance of our model and other baselines using 10-fold cross validation.
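The parameter search can be pictured with the following sketch of coordinate ascent over three weights. Two simplifications are assumed and should be read as ours: the one-dimensional search along each coordinate is a scan over a small candidate grid rather than a binary search, and the validation score is replaced by a toy function. In practice `score` would train a regime with the given weights, return its average validation outcome, and return negative infinity when a cost constraint is violated.

```python
from typing import Callable, List, Sequence, Tuple

def coordinate_ascent(score: Callable[[List[float]], float],
                      start: Sequence[float],
                      candidates: Sequence[float],
                      sweeps: int = 3) -> Tuple[List[float], float]:
    """Maximize score(params) by changing one parameter at a time."""
    params = list(start)
    best = score(params)
    for _ in range(sweeps):
        improved = False
        for k in range(len(params)):
            for value in candidates:
                trial = params[:k] + [value] + params[k + 1:]   # vary coordinate k only
                trial_score = score(trial)
                if trial_score > best:
                    best, params, improved = trial_score, trial, True
        if not improved:
            break   # a full sweep changed nothing: local optimum on this grid
    return params, best

# Toy stand-in for the validation score, peaked at (2, 1, 3).
def toy_score(p: List[float]) -> float:
    return -((p[0] - 2) ** 2 + (p[1] - 1) ** 2 + (p[2] - 3) ** 2)

best_params, best_score = coordinate_ascent(toy_score, start=[0, 0, 0],
                                            candidates=[0, 1, 2, 3, 4])
```

Coordinate ascent only guarantees a point that cannot be improved by changing a single weight, so the result can depend on the starting point when the validation score is not well behaved; restarting from several initial weights is a common safeguard.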

### 4.1 Quantitative Evaluation

We analyzed the performance of our approach CITR (Cost-effective, Interpretable Treatment Regimes) on various aspects such as outcomes obtained, costs incurred, and intelligibility. We computed the following metrics:

Avg. Outcome. Recall that a treatment regime assigns a treatment to every subject in the population. We used the outcome prediction model (defined in Section 3.3) to obtain an outcome score given the characteristics of the subject and the treatment assigned (we used ground truth outcome scores whenever they were available in the data). We computed the average outcome score of all the subjects in the population.

Avg. Assess Cost. We determined assessment costs incurred by each subject based on what characteristics were used to determine their treatment. We then averaged all such per-subject assessment costs to obtain the average assessment cost.

Avg. # of Characs. We determined the number of characteristics that are used when assigning a treatment to each subject in the population and then computed the average of these numbers.

Avg. Treat Cost. We computed the average of the treatment costs incurred by all the subjects in the population.

List Len. Our approach CITR and the baseline IPTL express treatment regimes as decision lists. In order to compare the complexity of the resulting decision lists, we computed the number of rules in each of these lists.

While higher values of average outcome are preferred, lower values on all of the other metrics are desirable.

##### Results

Table 2 (top panel) presents the values of the metrics computed for our approach as well as the baselines. It can be seen that the treatment regimes produced by our approach result in better average outcomes with lower costs across both datasets. While IPTL and MCA do not explicitly reduce costs, they do minimize the number of characteristics required for determining the treatment of any given subject. Our approach produces regimes with the least cost for a given average number of characteristics required to determine treatment (Avg. # of Characs). It is also interesting that our approach produces more concise lists with fewer rules compared to the baselines. While the treatment costs of all the baselines are similar, there is some variation in the average assessment costs and the outcomes. IPTL turns out to be the best performing baseline in terms of the average outcome, average assessment cost, and average number of characteristics. The last line of Table 2 shows the average outcomes and the average treatment costs computed empirically on the observational data. Both of our datasets are comprised of decisions made by human experts. It is interesting that the regimes learned by algorithmic approaches perform better than human experts on both of the datasets.

| Method | Bail: Avg. Outcome | Bail: Avg. Assess Cost | Bail: Avg. Treat Cost | Bail: Avg. # of Characs | Bail: List Len | Asthma: Avg. Outcome | Asthma: Avg. Assess Cost | Asthma: Avg. Treat Cost | Asthma: Avg. # of Characs | Asthma: List Len |
|---|---|---|---|---|---|---|---|---|---|---|
| CITR | 79.2 | 8.88 | 31.09 | 6.38 | 7 | 74.38 | 13.87 | 11.81 | 7.23 | 6 |
| IPTL | 77.6 | 14.53 | 35.23 | 8.57 | 9 | 71.88 | 18.58 | 11.83 | 7.87 | 8 |
| MCA | 73.4 | 19.03 | 35.48 | 12.03 | - | 70.32 | 19.53 | 12.01 | 10.23 | - |
| OWL (Gaussian) | 72.9 | 28 | 35.18 | 13 | - | 71.02 | 25 | 12.38 | 16 | - |
| OWL (Linear) | 71.3 | 28 | 34.23 | 13 | - | 71.02 | 25 | 12.38 | 16 | - |
| CITR - No Treat | 80.5 | 8.93 | 34.48 | 7.57 | 7 | 77.39 | 14.02 | 12.87 | 7.38 | 7 |
| CITR - No Assess | 81.3 | 13.83 | 32.02 | 9.86 | 10 | 78.32 | 18.28 | 12.02 | 8.97 | 9 |
| CITR - Outcome | 81.7 | 13.98 | 34.49 | 10.38 | 10 | 79.37 | 18.28 | 12.88 | 9.21 | 9 |
| Human | 69.37 | - | 33.39 | - | - | 68.32 | - | 12.28 | - | - |

#### 4.1.1 Ablation Study

We also analyzed the effect of the individual terms of our objective function on the outcomes and the costs incurred. To this end, we experimented with three ablations of our approach: 1) CITR - No Treat, which is obtained by excluding the expected treatment cost term from our objective (Eqn. 9); 2) CITR - No Assess, which is obtained by excluding the expected assessment cost term from our objective (Eqn. 9); 3) CITR - Outcome, which is obtained by excluding both the assessment and treatment cost terms from our objective.

Table 2 (second panel) shows the values of the metrics discussed earlier in this section for all the ablations of our model. As expected, removing the treatment cost term increases the average treatment cost on both datasets (31.09 vs 34.48 on bail data; 11.81 vs 12.87 on asthma data). Likewise, removing the assessment cost term results in regimes with much higher assessment costs (8.88 vs 13.83 on bail data; 13.87 vs 18.28 on asthma data). The length of the list also increases on both datasets when we exclude the assessment cost term. These results demonstrate that each term in our objective function is crucial to producing a cost-effective, interpretable regime.
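The effect of dropping a term can be illustrated with a small sketch. It assumes, purely for illustration, that the objective is a weighted combination of the form (outcome) minus (assessment cost) minus (treatment cost), and that an ablation amounts to setting the corresponding weight to zero. The function names and the unit weights are hypothetical and are not the settings used in our experiments; the numbers are the bail-dataset values reported in Table 2.

```python
# Bail-dataset values from Table 2:
# (avg. outcome, avg. assessment cost, avg. treatment cost).
regimes = {
    "CITR": (79.2, 8.88, 31.09),
    "CITR - No Treat": (80.5, 8.93, 34.48),
    "CITR - No Assess": (81.3, 13.83, 32.02),
    "CITR - Outcome": (81.7, 13.98, 34.49),
}


def objective(metrics, w_outcome=1.0, w_assess=1.0, w_treat=1.0):
    """Illustrative weighted objective; a weight of 0.0 ablates that term."""
    outcome, assess, treat = metrics
    return w_outcome * outcome - w_assess * assess - w_treat * treat


def best(**weights):
    """Name of the regime that scores highest under the given weights."""
    return max(regimes, key=lambda name: objective(regimes[name], **weights))


full = best()                                    # all three terms
no_treat = best(w_treat=0.0)                     # treatment cost term dropped
no_assess = best(w_assess=0.0)                   # assessment cost term dropped
outcome_only = best(w_assess=0.0, w_treat=0.0)   # both cost terms dropped

print(full, no_treat, no_assess, outcome_only, sep="\n")
```

Under these illustrative unit weights, each of the four regimes scores highest on the version of the objective it was learned with, which is consistent with the trade-offs described above.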

### 4.2 Qualitative Analysis

The treatment regimes produced by our approach on the asthma and bail datasets are shown in Figures 1 and 3 respectively.

It can be seen in Figure 1 that the methacholine test, which is more expensive, appears at the end of the regime. This ensures that only a small fraction of the population (8.23%) is burdened by its cost. Furthermore, although the spirometry test is slightly more expensive than patient demographics and symptoms, it would be harder to determine the treatment for a patient without this test. This aligns with research on asthma treatment recommendations [27, 5]. It is also interesting to note that the regime not only accounts for test results on spirometry and peak flow but also assesses whether the patient has a previous history of asthma or respiratory issues. If the test results are positive and the patient has no previous history of asthma or respiratory disorders, then the patient is recommended quick relief drugs. On the other hand, if the test results are positive and the patient has previously suffered from asthma or respiratory issues, then controller drugs are recommended.

If GenderF and Current-Charge Minor and Prev-OffenseNone then RP |

Else if Prev-OffenseYes and Prior-Arrest Yes then RC |

Else if Current-Charge Misdemeanor and Age 30 then RC |

Else if Age 50 and Prior-ArrestNo, then RP |

Else if Marital-StatusSingle and Pays-Rent No and Current-Charge Misd. then RC |

Else if Addresses-Past-Yr 5 then RC |

Else RP |

In the case of the bail dataset, the constructed regime is able to achieve good outcomes without using the most expensive characteristics, such as mental illness tests and drug tests. Personal information characteristics, which are slightly more expensive than defendant demographics and prior criminal history, appear only towards the end of the list, and these checks apply to only 21.23% of the population. It is interesting that the regime uses the defendant’s criminal history as well as personal and demographic information to make recommendations. For instance, females with minor current charges (such as driving offenses) and no prior criminal records are typically released on bail without conditions such as bonds or checking in with the police. On the other hand, defendants who have committed crimes earlier are only granted conditional bail.

## 5 Conclusions

In this work, we proposed a framework for learning cost-effective, interpretable treatment regimes from observational data. To the best of our knowledge, this is the first solution to the problem at hand that addresses all of the following aspects: 1) maximizing the outcomes; 2) minimizing the treatment costs and the costs associated with gathering the information required to determine the treatment; 3) expressing regimes using an interpretable model. We modeled the problem of learning a treatment regime as an MDP and employed a variant of UCT which prunes the search space using customized checks. We demonstrated the effectiveness of our framework on real world data from the judiciary and health care domains.

## References

- [1] Eva-Maria Abulesz and Gerasimos Lyberatos. Novel approach for determining optimal treatment regimen for cancer chemotherapy. International journal of systems science, 19(8):1483–1497, 1988.
- [2] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules.
- [3] James O Berger, Xiaojing Wang, and Lei Shen. A bayesian approach to subgroup identification. Journal of biopharmaceutical statistics, 24(1):110–129, 2014.
- [4] Jacob Bien and Robert Tibshirani. Classification by set cover: The prototype vector machine. arXiv preprint arXiv:0908.2284, 2009.
- [5] Louis-Philippe Boulet, Marie-Ève Boulay, Guylaine Gauthier, Livia Battisti, Valérie Chabot, Marie-France Beauchesne, Denis Villeneuve, and Patricia Côté. Benefits of an asthma education program provided at primary care sites on asthma outcomes. Respiratory medicine, 109(8):991–1000, 2015.
- [6] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
- [7] Johannes AN Dorresteijn, Frank LJ Visseren, Paul M Ridker, Annemarie MJ Wassink, Nina P Paynter, Ewout W Steyerberg, Yolanda van der Graaf, and Nancy R Cook. Estimating treatment effects for individual patients based on the results of randomised clinical trials. Bmj, 343:d5888, 2011.
- [8] Ralph B D’Agostino. Estimating treatment effects using observational data. Jama, 297(3):314–316, 2007.
- [9] Ailin Fan, Wenbin Lu, Rui Song, et al. Sequential advantage selection for optimal treatment regime. The Annals of Applied Statistics, 10(1):32–53, 2016.
- [10] Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. Subgroup identification from randomized clinical trial data. Statistics in medicine, 30(24):2867–2880, 2011.
- [11] Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
- [12] Been Kim, Cynthia Rudin, and Julie A Shah. The bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952–1960, 2014.
- [13] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
- [14] EB Laber and YQ Zhao. Tree-based methods for individualized treatment regimes. Biometrika, 102(3):501–514, 2015.
- [15] Eric B Laber, Daniel J Lizotte, Min Qian, William E Pelham, and Susan A Murphy. Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics, 8(1):1225, 2014.
- [16] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. 2016.
- [17] Himabindu Lakkaraju and Jure Leskovec. Confusions over time: An interpretable bayesian model to characterize trends in decision making. In Advances in Neural Information Processing Systems (NIPS), 2016.
- [18] Himabindu Lakkaraju, Jure Leskovec, Jon Kleinberg, and Sendhil Mullainathan. A bayesian framework for modeling human evaluations. In SIAM SDM, 2015.
- [19] Benjamin Letham, Cynthia Rudin, Tyler H McCormick, David Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
- [20] Wei-Yin Loh, Xu He, and Michael Man. A regression tree approach to identifying subgroups with differential treatment effects. Statistics in medicine, 34(11):1818–1833, 2015.
- [21] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2012.
- [22] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine, 23(19):2937–2960, 2004.
- [23] Douglas B Marlowe, David S Festinger, Karen L Dugosh, Kathleen M Benasutti, Gloria Fox, and Jason R Croft. Adaptive programming improves outcomes in drug court: An experimental trial. Criminal justice and behavior, 39(4):514–532, 2012.
- [24] James J McGough and Stephen V Faraone. Estimating the size of treatment effects: moving beyond p values. Psychiatry (1550-5952), 6(10), 2009.
- [25] Erica EM Moodie, Bibhas Chakraborty, and Michael S Kramer. Q-learning for estimating optimal dynamic treatment rules from observational data. Canadian Journal of Statistics, 40(4):629–645, 2012.
- [26] Erica EM Moodie, Nema Dean, and Yue Ru Sun. Q-learning: Flexible learning about useful utilities. Statistics in Biosciences, 6(2):223–243, 2014.
- [27] Jorge Pereira, Priscilla Porto-Figueira, Carina Cavaco, Khushman Taunk, Srikanth Rapole, Rahul Dhakne, Hampapathalu Nagarajaram, and José S Câmara. Breath analysis as a potential and non-invasive frontier in disease diagnosis: an overview. Metabolites, 5(1):3–55, 2015.
- [28] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
- [29] James M Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics-Theory and methods, 23(8):2379–2412, 1994.
- [30] Richard N Shiffman. Representation of clinical practice guidelines in conventional and augmented decision tables. Journal of the American Medical Informatics Association, 4(5):382–393, 1997.
- [31] Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb):141–158, 2009.
- [32] Lu Tian, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532, 2014.
- [33] Stijn Vansteelandt, Marshall Joffe, et al. Structural nested models and g-estimation: The partially realized promise. Statistical Science, 29(4):707–731, 2014.
- [34] Michael P Wallace and Erica EM Moodie. Personalizing medicine: a review of adaptive treatment strategies. Pharmacoepidemiology and drug safety, 23(6):580–585, 2014.
- [35] Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012.
- [36] Yichi Zhang, Eric B. Laber, Anastasios Tsiatis, and Marie Davidian. Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71(4):895–904, 2015.
- [37] Yichi Zhang, Eric B Laber, Anastasios Tsiatis, and Marie Davidian. Interpretable dynamic treatment regimes. arXiv preprint arXiv:1606.01472, 2016.
- [38] Ying-Qi Zhao, Donglin Zeng, Eric B Laber, Rui Song, Ming Yuan, and Michael Rene Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2015.
- [39] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
- [40] Yufan Zhao, Michael R Kosorok, and Donglin Zeng. Reinforcement learning design for cancer clinical trials. Statistics in medicine, 28(26):3294–3315, 2009.

## Appendix A

### A.1 Proof for Theorem 1

Statement: Optimizing the objective defined in Eqn. 8 is NP-hard.

Proof:
The rough idea behind this proof is to establish the connection between the objective in Eqn. 8 and the weighted exact cover problem.

Our objective function is given by:

(9) |

The goal is to find a sequence of (pattern, treatment) pairs, ending in a default rule, which not only covers all the data points in the dataset but also maximizes the objective given above.

The patterns are drawn from a set of frequently occurring patterns, each of which is a conjunction of one or more predicates.
Examples:

(1) Age 40 ∧ Gender Female;

(2) BMI High;

(3) Gender M ∧ BP High ∧ Age 25

Such patterns are provided as input to us. Each candidate rule pairs one such pattern with a treatment, so an element of the candidate rule set is of the form:

(Age 40 ∧ Gender Female, T1), i.e., each element of this set is a rule. Our goal is now to find an ordered list of rules from the candidate rule set (let us ignore the default rule for a little while) which maximizes the objective in Eqn. 9.

Let us assume the candidate rule set comprises the following rules:

(1) (Age 40 ∧ Gender Female, T1)

(2) (Age 40 ∧ Gender Female, T2)

(3) (BMI High, T1)

(4) (BMI High, T2)

Let us create a new, expanded set of rules from the candidate rule set as follows: for each candidate rule, append the negations of the conditions of all possible combinations of the other candidate rules. Also include in the expanded set all possible combinations of negations of the conditions of the candidate rules. Following our example above, the expanded set looks like this:

(1) (Age 40 ∧ Gender Female, T1)

(2) (Age 40 ∧ Gender Female, T2)

(3) (¬(Age 40 ∧ Gender Female), T1)

(4) (¬(Age 40 ∧ Gender Female), T2)

(5) (¬(BMI High) ∧ Age 40 ∧ Gender Female, T1)

(6) (¬(BMI High) ∧ Age 40 ∧ Gender Female, T2)

(7) (BMI High, T1)

(8) (BMI High, T2)

(9) (¬(BMI High), T1)

(10) (¬(BMI High), T2)

(11) (¬(Age 40 ∧ Gender Female) ∧ BMI High, T1)

(12) (¬(Age 40 ∧ Gender Female) ∧ BMI High, T2)

(13) (¬(Age 40 ∧ Gender Female) ∧ ¬(BMI High), T1)

(14) (¬(Age 40 ∧ Gender Female) ∧ ¬(BMI High), T2)

The problem of finding an ordered sequence of candidate rules (plus a default rule) can now be posed as the problem of finding an unordered set of rules from the expanded set. To illustrate, consider the following decision list constructed from the candidate rules in the above example:

(1) (Age 40 ∧ Gender Female, T1)

(2) T2

This list can now be expressed as an unordered set using the elements of the expanded set as follows:

(Age 40 ∧ Gender Female, T1)

(¬(Age 40 ∧ Gender Female), T2)
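The conversion from an ordered list to an unordered set of mutually exclusive rules can be sketched as follows. This is a minimal illustration rather than part of the proof: the helper names `to_unordered` and `apply_ordered`, the toy subjects, and the `>=` comparison on age are all hypothetical. Each rule's condition is conjoined with the negations of the conditions ranked above it, and the default rule becomes the conjunction of all negations.

```python
from typing import Callable, Dict, List, Tuple

Subject = Dict[str, object]
Condition = Callable[[Subject], bool]


def to_unordered(rules: List[Tuple[Condition, str]],
                 default: str) -> List[Tuple[Condition, str]]:
    """Rewrite an ordered decision list as mutually exclusive rules."""
    unordered = []
    earlier: List[Condition] = []
    for cond, treatment in rules:
        prev = list(earlier)  # conditions of the rules ranked above this one
        unordered.append(
            (lambda s, c=cond, p=prev: c(s) and not any(q(s) for q in p), treatment)
        )
        earlier.append(cond)
    all_conds = list(earlier)
    # The default rule fires only when every listed condition is false.
    unordered.append((lambda s, p=all_conds: not any(q(s) for q in p), default))
    return unordered


def apply_ordered(rules, default, s):
    """Treatment assigned by the ordered list: the first matching rule wins."""
    for cond, treatment in rules:
        if cond(s):
            return treatment
    return default


# Hypothetical two-rule list with default treatment T1.
ordered = [
    (lambda s: s["age"] >= 40 and s["gender"] == "F", "T1"),
    (lambda s: s["bmi"] == "high", "T2"),
]
subjects = [
    {"age": 45, "gender": "F", "bmi": "low"},
    {"age": 30, "gender": "M", "bmi": "high"},
    {"age": 50, "gender": "M", "bmi": "low"},
    {"age": 60, "gender": "F", "bmi": "high"},
]

unordered = to_unordered(ordered, "T1")
# For every subject, collect the treatments of all unordered rules that fire.
matches = [[t for cond, t in unordered if cond(s)] for s in subjects]
print(matches)
```

Every subject satisfies exactly one rule of the unordered set, and that rule assigns the same treatment as the ordered list, which is the exact-cover property the argument relies on.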

We have thus reduced the problem of finding an ordered list of rules to that of finding an unordered set of rules from the expanded set. More specifically, the problem is now reduced to that of choosing a set of rules from the expanded set such that 1) each data point in the data is covered exactly once and 2) the objective function in Eqn. 9 is maximized. This problem can be formally written as:

(10) |

where is an indicator function which is if the rule is chosen to be in the set cover. is the cost associated with choosing the rule which is defined as:

Note that we basically split our complete objective function across the rules that will be chosen to be part of the final set cover. Further, we are dealing with a minimization problem here, so we flip the signs of the terms in the objective (which is a maximization function).

Eqn. 10 is the weighted exact cover problem. Since this problem is NP-hard, optimizing our objective function is also NP-hard.
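For reference, a generic statement of the weighted exact cover problem can be sketched as below. The notation is hypothetical and introduced only for this illustration: $\mathcal{R}'$ stands for the expanded rule set, $z_r$ for the indicator that rule $r$ is chosen, $w_r$ for the cost of choosing rule $r$, and $\mathrm{cover}(r)$ for the set of data points satisfying the condition of $r$.

$$
\min_{z \in \{0,1\}^{|\mathcal{R}'|}} \; \sum_{r \in \mathcal{R}'} w_r \, z_r
\quad \text{subject to} \quad
\sum_{r \in \mathcal{R}' \,:\, x_i \in \mathrm{cover}(r)} z_r = 1 \quad \text{for every data point } x_i .
$$

The equality constraint is what distinguishes exact cover from ordinary set cover: each data point must be covered by exactly one chosen rule, mirroring the fact that a decision list assigns each subject exactly one treatment.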