Simple Rules for Complex Decisions
From doctors diagnosing patients to judges setting bail, experts often base their decisions on experience and intuition rather than on statistical models. While understandable, relying on intuition over models has often been found to result in inferior outcomes. Here we present a new method—select-regress-and-round—for constructing simple rules that perform well for complex decisions. These rules take the form of a weighted checklist, can be applied mentally, and nonetheless rival the performance of modern machine learning algorithms. Our method for creating these rules is itself simple, and can be carried out by practitioners with basic statistics knowledge. We demonstrate this technique with a detailed case study of judicial decisions to release or detain defendants while they await trial. In this application, as in many policy settings, the effects of proposed decision rules cannot be directly observed from historical data: if a rule recommends releasing a defendant that the judge in reality detained, we do not observe what would have happened under the proposed action. We address this key counterfactual estimation problem by drawing on tools from causal inference. We find that simple rules significantly outperform judges and are on par with decisions derived from random forests trained on all available features. Generalizing to 22 varied decision-making domains, we find this basic result replicates. We conclude with an analytical framework that helps explain why these simple decision rules perform as well as they do.
1. Introduction
In decision-making scenarios, experts often choose a course of action based on experience and intuition rather than on statistical analysis (Gigerenzer et al., 2011). This includes doctors classifying patients based on their symptoms (McDonald, 1996), judges setting bail amounts (Dhami, 2003) and making parole decisions (Danziger et al., 2011), and managers determining which customers to target (Wübben and Wangenheim, 2008). A large body of work shows that intuitive judgments are generally inferior to those based on statistical models (Dawes, 1979; Dawes et al., 1989; Tetlock, 2005; Kleinberg et al., 2015, 2017). However, decision makers have consistently eschewed formal decision models in part because it has been difficult to create, understand, and apply them.
Here we present a simple method for constructing simple decision rules that often perform on par with traditional machine learning algorithms. Our select-regress-and-round strategy results in rules that are fast, frugal, and clear: fast in that decisions can be made quickly in one’s mind, without the aid of a computing device; frugal in that they require only limited information to reach a decision; and clear in that they expose the grounds on which classifications are made. Decision rules satisfying these criteria have many benefits. For instance, rules that can be applied quickly and mentally are likely to be adopted and used persistently. In medicine, frugal rules require fewer tests, which saves time, money, and, in the case of triage situations, lives (Marewski and Gigerenzer, 2012). The clarity of simple rules engenders trust from users, providing insight into how systems work and exposing where models may be improved (Gleicher, 2016; Sull and Eisenhardt, 2015). Clarity can even become a legal requirement when society demands to know how algorithmic decisions are being made (Goodman and Flaxman, 2016; Corbett-Davies et al., 2017).
Our results add to a growing literature on interpretable machine learning (Kim et al., 2014, 2015; Ustun and Rudin, 2016; Letham et al., 2015; Lakkaraju et al., 2016). Several methods recently have been introduced to construct the kind of simple decision rules we discuss here, including supersparse linear integer models (SLIM) (Ustun and Rudin, 2016, 2017), Bayesian rule lists (Letham et al., 2015), and interpretable decision sets (Lakkaraju et al., 2016). These methods all produce rules that are easy to interpret and to apply. One important difference between our approach and past techniques is that our rules are also easy to create.
To illustrate our method, we begin with a case study of judicial decisions for pretrial release. We show that simple rules substantially improve upon the efficiency and equity of unaided decisions. In particular, we estimate that judges can detain half as many defendants without appreciably increasing the number that fail to appear at their court dates. Our simple rules perform as well as a black-box, random forest model trained on all available data. (We note that Kleinberg et al. (2017) recently and independently proposed using random forests to assist judicial decisions, but they do not consider simple rules.) We further evaluate the efficacy of our method on 22 datasets from the UCI ML repository and show that in many cases simple rules are competitive with state-of-the-art machine learning algorithms. We conclude with an analytical framework that helps explain why simple decision rules often perform well.
2. Illustration: bail decisions
As an initial example of how to create simple rules that make accurate and transparent decisions, we turn to the domain of pretrial release determinations. In the United States, a defendant is typically arraigned shortly after arrest in a court appearance where he is provided with written notice of the charges alleged by the prosecutor. At this time, a judge must decide whether the defendant, while he awaits trial, should be released on his own recognizance (RoR), or alternatively, subject to monetary bail. In practice, if the judge rules that bail be set, defendants often await trial in jail since many of them do not have the financial resources to post bail. Moreover, when defendants are able to post bail, they often do so by contracting with a bail bondsman and in turn incur hefty fees. The judge, however, has a legal obligation to consider taking measures necessary to secure the defendant’s appearance at required court proceedings. Pretrial release decisions must thus balance flight risk against the high burden that bail requirements place on defendants. In many jurisdictions judges may also consider a defendant’s threat to public safety, but that is not a legally relevant factor for the specific jurisdiction we analyze below.
A key statistical challenge in this setting is that one cannot, with historical data alone, directly observe the effects of hypothetical decision rules. For example, if a proposed policy recommends releasing some defendants who in reality were detained by the judge, one does not observe what would have happened had the rule been followed. This counterfactual estimation problem—also known as offline policy evaluation (Dudík et al., 2011)—is common in many domains. We address it here by adapting tools from causal inference to the policy setting, including the method of Rosenbaum and Rubin (1983a) for assessing the sensitivity of estimated causal effects to an unobserved binary covariate.
Our analysis is based on 165,000 adult cases involving nonviolent offenses charged by a large urban prosecutor’s office and arraigned in criminal court between 2010 and 2015. This set was obtained by starting with a random sample of 200,000 cases provided to us by the prosecutor’s office, and then restricting to those cases involving nonviolent offenses and for which the records were complete and accurate. Our initial sample of 200,000 cases does not include instances where defendants accepted a plea deal at arraignment, obviating the need for a pretrial release decision. For each case, we have a rich set of attributes: 49 features describe characteristics of the current charges (e.g., theft, gun-related), and 15 describe characteristics of the defendant (e.g., gender, age, prior arrests). We also observe whether the defendant was RoR’d, and whether he failed to appear (FTA) at any of his subsequent court dates. We note that even if bail is set, a defendant may still fail to appear since he could post bail and then skip his court date. Overall, 69% of defendants are RoR’d, and 15% of RoR’d defendants fail to appear. Of the remaining 31% of defendants for whom bail is set, 45% are eventually released and 9% fail to appear. As a result, the overall FTA rate is 13%.
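As a rough consistency check, the overall FTA rate can be recovered as a weighted average of the two groups' rates. The calculation below uses the rounded percentages quoted above and assumes that the 9% figure refers to all defendants for whom bail is set, so it is only approximate:

$$0.69 \times 0.15 + 0.31 \times 0.09 = 0.1035 + 0.0279 \approx 0.131,$$

which agrees with the reported overall rate of roughly 13%.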
In our analysis below, we randomly divide the full set of 165,000 cases into three approximately equal subsets; we use the first fold to construct decision rules (both simple and complex), and the second and third to evaluate these rules, as described next.
2.1. Rule construction
We start by constructing traditional (but complex) decision rules for balancing flight risk with the burdens of bail. These rules serve as a benchmark for evaluating the simple rules we create below. On the first fold of the data, we restrict to cases in which the judge RoR’d the defendant, and then train a random forest model to estimate the likelihood an individual fails to appear at any of his subsequent court dates. Random forests are considered to be one of the best off-the-shelf classification algorithms (Fernández-Delgado et al., 2014; Kleinberg et al., 2017), and we fit the model on all available information about the case and the defendant, excluding race. (We use the randomForest package in R, fit with 1,000 trees. We exclude race from the presented results due to legal and policy concerns with basing decisions on protected attributes (Corbett-Davies et al., 2017). We note, however, that including race does not significantly affect performance.) The fitted model lets us compute risk scores (i.e., estimated flight risk if RoR’d) for any defendant. These risk scores can in turn be converted to a binary decision rule by selecting a threshold for releasing individuals. One might, for example, RoR a defendant if and only if his flight risk is below 20%.
We now construct a family of simple rules for making release decisions. We begin by fitting a logistic regression model that estimates a defendant’s flight risk as a function of his age and prior history of failing to appear. These two factors are well understood to be highly predictive in this context, but we later show how such features can be selected in a principled fashion without domain expertise. Specifically, we fit the following model:
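$$\Pr(Y_i = 1) = \operatorname{logit}^{-1}\left(\beta_0 + \sum_{j=1}^{4} \beta_j^{\text{FTA}} H_{ij} + \sum_{k=1}^{7} \beta_k^{\text{age}} A_{ik}\right),$$

where $Y_i \in \{0, 1\}$ indicates whether the $i$-th defendant failed to appear; $H_{ij} \in \{0, 1\}$ indicates the defendant’s number of past failures to appear (exactly one, two, three, or at least four); and $A_{ik} \in \{0, 1\}$ indicates the binned age of the defendant (18–20, 21–25, 26–30, 31–35, 36–40, 41–45, or 46–50). For identifiability, indicator variables for zero past FTAs and age 51-and-older are omitted. As before, this model is fit on the subset of cases in the first fold of data for which the judge released the defendant. Next, we rescale the age and prior FTA coefficients so that they lie in the interval $[-10, 10]$; specifically we multiply each coefficient by the constant

$$\frac{10}{\max\left(|\beta_1^{\text{FTA}}|, \dots, |\beta_4^{\text{FTA}}|, |\beta_1^{\text{age}}|, \dots, |\beta_7^{\text{age}}|\right)}.$$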
Finally, we round the rescaled coefficients to the nearest integer.
|Age bin||Score||Prior FTAs||Score|
|age||8||no prior FTAs||0|
|age||6||1 prior FTA||6|
|age||4||2 prior FTAs||8|
|age||2||3 prior FTAs||9|
|age||0||4 or more prior FTAs||10|
Table 1 shows the result of this procedure. For any defendant, a risk score can be computed by summing the relevant terms in the table. Unsurprisingly, past FTAs are indeed strong predictors of future failure to appear; an individual’s risk also declines with age, in line with conventional wisdom. These risk scores can be converted to a binary decision rule by selecting a threshold for releasing individuals. For example, one might RoR a defendant if and only if his risk score is below 10.5. A graphical representation of that rule is shown in Figure 1.
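The scoring step can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation: the prior-FTA points are copied from Table 1, the age points are passed in directly as read off the table, and the function names `risk_score` and `recommend_ror` are hypothetical.

```python
# Points for prior FTAs, copied from Table 1 (0, 1, 2, or 3 prior FTAs).
PRIOR_FTA_POINTS = {0: 0, 1: 6, 2: 8, 3: 9}
MAX_FTA_POINTS = 10  # Table 1 entry for 4 or more prior FTAs


def risk_score(age_points: int, prior_ftas: int) -> int:
    """Sum the two checklist terms.

    age_points is the age entry read off Table 1 (one of 8, 6, 4, 2, 0);
    it is an input here rather than being derived from the defendant's age.
    """
    if prior_ftas < 0:
        raise ValueError("prior_ftas must be non-negative")
    fta_points = PRIOR_FTA_POINTS.get(prior_ftas, MAX_FTA_POINTS)
    return age_points + fta_points


def recommend_ror(age_points: int, prior_ftas: int, threshold: float = 10.5) -> bool:
    """Recommend RoR if and only if the risk score is below the threshold."""
    return risk_score(age_points, prior_ftas) < threshold
```

For instance, a defendant with 8 age points and no prior FTAs scores 8 and would be released under the 10.5 threshold, while one with 2 age points and three prior FTAs scores 11 and would not.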
2.2. Policy evaluation
There are two key considerations in evaluating a decision rule for pretrial release: (1) the proportion of defendants who are released under the rule; and (2) the resulting proportion who fail to appear at their court proceedings. It is straightforward to estimate the former, since one need only apply the rule to historical data to see what actions would have been recommended. (In theory, implementing a decision rule could alter the equilibrium distribution of defendants. We do not consider such possible effects, and assume the distribution of defendants is not affected by the rule itself.) For example, if defendants are released if and only if their risk score is below 10.5, 84% would be RoR’d; under this rule, bail would be required of only half as many defendants relative to the status quo. Forecasting the proportion who would fail to appear, however, is generally much more difficult. The key problem is that for any particular defendant, we only observe the outcome (i.e., whether or not he failed to appear) conditional on the action the judge ultimately decided to take (i.e., RoR or bail). Since the action taken by the judge may differ from that prescribed by the decision rule, we do not always observe what would have happened under the rule. This problem of offline policy evaluation (Dudík et al., 2011) is a specific instance of the fundamental problem of causal inference.
To rigorously describe the estimation problem and our approach, we first introduce some notation. We denote the observed set of cases by $\Omega = \{(x_i, a_i, r_i)\}$, where $x_i$ is a case, $a_i \in \{\text{RoR}, \text{bail}\}$ is the action taken by the judge, and $r_i \in \{0, 1\}$ indicates whether the defendant failed to appear at his scheduled court date. We write $r_i(\text{RoR})$ and $r_i(\text{bail})$ to mean the potential outcomes, that is, what would have happened under the two possible judicial actions. For any policy $\pi$, our goal is to estimate the FTA rate under the policy:

$$V^\pi = \frac{1}{|\Omega|} \sum_{i} r_i(\pi(x_i)),$$

where $\pi(x)$ denotes the action prescribed under the rule. The key statistical challenge is that only one of the two potential outcomes, $r_i = r_i(a_i)$, is observed. We note that policy evaluation is a generalization of estimating average treatment effects. Namely, the average treatment effect can be expressed as $V^{\pi_{\text{RoR}}} - V^{\pi_{\text{bail}}}$, where $\pi_{\text{RoR}}$ is the policy under which everyone is released and $\pi_{\text{bail}}$ is defined analogously.

Here we take a straightforward and popular statistical approach to estimating $V^\pi$: response surface modeling (Hill, 2012). With response surface modeling, the idea is to use a standard prediction model (e.g., logistic regression or random forest) to estimate the effect on each defendant of each potential judicial action. The model estimates of these potential outcomes are denoted by $\hat{r}_i(t)$, for $t \in \{\text{RoR}, \text{bail}\}$. Our estimate of $V^\pi$ is then given by

$$\hat{V}^\pi = \frac{1}{|\Omega|} \sum_{i} \Big[ r_i \, \mathbb{I}(\pi(x_i) = a_i) + \hat{r}_i(\pi(x_i)) \, \mathbb{I}(\pi(x_i) \neq a_i) \Big],$$

where $\mathbb{I}(\cdot)$ is an indicator function evaluating to 1 if its argument is true and to 0 otherwise. If the prescribed action is in fact taken by the judge, then $r_i(\pi(x_i))$ is directly observed and can be used; otherwise we approximate the potential outcome with $\hat{r}_i(\pi(x_i))$. Table 2 illustrates this method for a hypothetical example.
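The estimator above is straightforward to compute once the model predictions are available. The following is a minimal sketch under stated assumptions: the function name `estimate_fta_rate`, the data layout, and the numbers in the usage note are all hypothetical, and the predictions are taken as given rather than fit.

```python
from typing import Dict, Sequence


def estimate_fta_rate(
    observed_actions: Sequence[str],
    observed_outcomes: Sequence[int],
    policy_actions: Sequence[str],
    predicted: Sequence[Dict[str, float]],
) -> float:
    """Response-surface estimate of the FTA rate under a policy.

    predicted[i][t] is the model's estimated FTA probability for case i
    under action t ("RoR" or "bail"). The observed outcome is used when
    the policy agrees with the judge; the model estimate is used otherwise.
    """
    if not observed_actions:
        raise ValueError("need at least one case")
    total = 0.0
    for a, r, p, r_hat in zip(
        observed_actions, observed_outcomes, policy_actions, predicted
    ):
        total += r if p == a else r_hat[p]
    return total / len(observed_actions)
```

As a usage example with three made-up cases, suppose the judge's actions are RoR, bail, RoR with outcomes 1, 0, 0, and the policy prescribes RoR, RoR, bail. The first case contributes its observed outcome, and the other two contribute the model's estimates for the prescribed actions.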
Response surface modeling implicitly assumes that a judge’s action is ignorable given the observed covariates (i.e., that conditional on the observed covariates, those who are RoR’d are similar to those who are not). Formally, ignorability means that

$$\big(r(\text{RoR}),\, r(\text{bail})\big) \perp a \mid x.$$
This ignorability assumption is unavoidable, and is similarly required for methods based on propensity scores (Rosenbaum and Rubin, 1983b, 1984; Cassel et al., 1976; Robins et al., 1994; Robins and Rotnitzky, 1995; Kang and Schafer, 2007; Dudík et al., 2011). We examine this assumption in detail in Section 2.3, and find that our conclusions are robust to unobserved heterogeneity.
|Proposed action||Observed action||Observed outcome|
To carry out this approach, we derive estimates $\hat{r}_i(t)$ via an $L^1$-regularized logistic regression (lasso) model trained on the second fold of our data. For each individual, the model estimates his likelihood of FTA given all the observed features and the action taken by the judge. In contrast to the rule construction described above, this time we train the model on all cases (not just those for which the judge RoR’d the defendant) and include as a predictor the judge’s action (RoR or bail); we also include the defendant’s race. (Although it is legally problematic to use race when making decisions, its use is acceptable, and indeed often required, when evaluating decisions.) The model was fit with the glmnet package in R. The cv.glmnet method was used to determine the best value for the regularization parameter $\lambda$ with 10-fold cross-validation and 1,000 values of $\lambda$. The model includes all pairwise interactions between the judge’s decision and defendant’s features. We opt for lasso instead of random forest for this prediction task because the latter, while very good for classification, is known to suffer from poor calibration (Niculescu-Mizil and Caruana, 2005), which can in turn yield biased estimates of a policy’s effects. Then, on the third fold of the data, we use the observed and model-estimated outcomes to approximate the overall FTA rate for any decision rule.
Figure 2 shows estimated RoR and FTA rates for a variety of pretrial release rules. Points on the solid line correspond to rules constructed via the random forest model described above for various decision thresholds. The red points correspond to rules based on the simple scoring procedure in Table 1, again corresponding to various decision thresholds. For each rule, the horizontal axis shows the estimated proportion of defendants RoR’d under the rule, and the vertical axis shows the estimated proportion of defendants who would fail to appear at their court dates. The solid black dot shows the status quo: 69% of defendants RoR’d and a 13% FTA rate. Finally, the open circles show the observed RoR and FTA rates for each of the 23 judges in our data who have presided over at least 1,000 cases, sized in proportion to their case load.
The plot illustrates three key points. First, simple rules that consider only two features—age and prior FTAs—perform nearly identically to a random forest that incorporates 64 features. Second, the statistically informed policies in the lower right quadrant all achieve higher rates of RoR and, simultaneously, lower rates of FTA than the status quo. In particular, by releasing defendants if and only if their risk score is below 10.5, we expect to release 84% of defendants while achieving an FTA rate of 14%. Relative to the existing policy, following this rule would not appreciably increase the overall FTA rate—it would increase just 0.3 percentage points, from 13.3% to 13.6%—but only half as many defendants would be required to pay bail. Finally, for nearly every judge, there is a statistical decision rule that simultaneously yields both a higher rate of release and a lower rate of FTA than the judge currently achieves. The statistical decision rules consistently outperform the human decision-makers.
Why do these statistical decision rules outperform the experts? Figure 1 sheds light on this phenomenon. Each cell in the plot corresponds to defendants binned by their age and prior number of FTAs. Under a rule that releases defendants if and only if their risk score is below 10.5, one would release everyone to the left of the solid black line, and set bail for everyone to the right of the line. The number in each cell shows the proportion of defendants in each bin who are currently released, and the cell shading graphically indicates this proportion. Aside from the lowest risk defendants, who have no prior FTAs, the likelihood of being released does not correlate strongly with estimated flight risk. For example, the high risk group of young defendants with four or more prior FTAs is released at about the same 50% rate as the low risk group of older defendants with one prior FTA. This low correlation between flight risk and release decision is in part attributable to extreme differences in release rates across judges, with some releasing more than 90% of defendants and others releasing just 50%. (Defendants are not perfectly randomly assigned to judges for arraignment, but in practice judges see a similar distribution of defendants.) Whereas defendants experience dramatically different outcomes based on the judge they happened to appear in front of, statistical decision rules improve efficiency in part by ensuring consistency.
2.3. Sensitivity to unobserved heterogeneity
As noted above, our estimation strategy assumes that the judicial action taken is ignorable given the observed covariates. Under this ignorability assumption, one can accurately estimate the potential outcomes. Judges, however, might base their decisions in part on information that is not recorded in the data, which could in turn bias our estimates. For example, a judge, upon meeting a defendant, might surmise that his flight risk is higher than one would expect based on the recorded covariates alone, and may accordingly require the defendant to post bail. In this case, since our estimates are based only on the recorded data, we may underestimate the defendant’s counterfactual likelihood of failing to appear if released.
We take two approaches to gauge the robustness of our results to such hidden heterogeneity. First, on each subset of cases handled by a single judge, we use response surface modeling to estimate $V^\pi$. Each judge has idiosyncratic criteria for releasing defendants, as evidenced by the dramatically different release rates across judges; accordingly, the types and proportion of cases for which the policy $\pi$ coincides with the observed action differ from judge to judge. This variation allows us to assess the sensitivity of our estimates to the observed actions $a_i$. In particular, if unobserved heterogeneity were significant, we would expect our estimates to systematically vary depending on the proportion of observed judicial actions that agree with the policy $\pi$. Figure 3 shows the results of this analysis for the simple decision rule described in Figure 1, where each point corresponds to a judge. We find that the FTA rate of the decision rule is consistently estimated to be approximately 12–14%. Moreover, some judges act in concordance with the decision rule in nearly 80% of cases; for this subset of judges, where our estimates are largely based on directly observed outcomes, we again find FTA is estimated at around 12–14%.
As a second robustness check, we adapt the method of Rosenbaum and Rubin (1983a) for assessing the sensitivity of estimated causal effects to an unobserved binary covariate. We specifically tailor their approach to offline policy evaluation. At a high level, we assume there is an unobserved covariate $u \in \{0, 1\}$ that affects both a judge’s decision (RoR or bail) and also the outcome conditional on that action. For example, $u$ might indicate that a defendant is sympathetic, and sympathetic defendants may be more likely to be RoR’d and also more likely to appear at their court proceedings. Our key assumption is that a judge’s action is ignorable given the observed covariates $x$ and the unobserved covariate $u$:

$$\big(r(\text{RoR}),\, r(\text{bail})\big) \perp a \mid x, u. \tag{1}$$

There are four key parameters in this framework: (1) the probability that $u = 1$; (2) the effect of $u$ on the judge’s decision; (3) the effect of $u$ on the defendant’s likelihood of FTA if RoR’d; and (4) the effect of $u$ on the defendant’s likelihood of FTA if bail is set. Our goal is to quantify the extent to which our estimate of $V^\pi$ changes as a function of these parameters.
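Without loss of generality, we can write

$$\Pr(a = \text{RoR} \mid u, x) = \operatorname{logit}^{-1}(\gamma_x + u\,\alpha_x) \tag{2}$$

for appropriately chosen parameters $\gamma_x$ and $\alpha_x$ that depend on the observed covariates $x$. We note that randomness in judicial decisions may arise from a multitude of factors, including idiosyncrasies in how judges are assigned to cases. Here $\alpha_x$ is the change in log-odds of being RoR’d when $u = 1$ versus when $u = 0$. For $t \in \{\text{RoR}, \text{bail}\}$, we can similarly write

$$\Pr(r(t) = 1 \mid u, x) = \operatorname{logit}^{-1}(\beta_{t,x} + u\,\delta_{t,x}) \tag{3}$$

for parameters $\beta_{t,x}$ and $\delta_{t,x}$. In this case, $\delta_{\text{RoR},x}$ is the change in log-odds of failing to appear if RoR’d when $u = 1$ versus when $u = 0$, and $\delta_{\text{bail},x}$ is the corresponding change if bail is set.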
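Now, for any posited values of $\Pr(u = 1 \mid x)$, $\alpha_x$, and $\delta_{t,x}$, we use the observed data to estimate $\gamma_x$ and $\beta_{t,x}$. We do this in three steps. By (2),

$$\Pr(a = \text{RoR} \mid x) = \Pr(u = 1 \mid x)\,\operatorname{logit}^{-1}(\gamma_x + \alpha_x) + \Pr(u = 0 \mid x)\,\operatorname{logit}^{-1}(\gamma_x).$$

The left-hand side of the equation can be estimated with a regression model fit to the data. For fixed values of $\Pr(u = 1 \mid x)$ and $\alpha_x$, the right-hand side is an increasing function of $\gamma_x$ that takes on values from 0 to 1 as $\gamma_x$ goes from $-\infty$ to $+\infty$. There is thus a unique value $\hat{\gamma}_x$ such that the right-hand side equals the estimated $\Pr(a = \text{RoR} \mid x)$. Rosenbaum and Rubin (1983a) derive a simple closed form solution for $\hat{\gamma}_x$, facilitating fast computation on large datasets, which we omit for space.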
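Because the right-hand side is monotone in the intercept, the first step can also be solved numerically. The sketch below uses plain bisection instead of the closed form mentioned above; the function names, the search interval of [-30, 30], and the tolerance are assumptions made for illustration.

```python
import math


def inv_logit(z: float) -> float:
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))


def solve_gamma(p_ror: float, p_u1: float, alpha: float,
                lo: float = -30.0, hi: float = 30.0, tol: float = 1e-10) -> float:
    """Find gamma such that
        p_u1 * inv_logit(gamma + alpha) + (1 - p_u1) * inv_logit(gamma) = p_ror.

    p_ror is the estimated Pr(a = RoR | x); p_u1 and alpha are posited values.
    Bisection works because the left-hand side is increasing in gamma.
    """
    if not 0.0 < p_ror < 1.0:
        raise ValueError("p_ror must lie strictly between 0 and 1")

    def rhs(gamma: float) -> float:
        return p_u1 * inv_logit(gamma + alpha) + (1.0 - p_u1) * inv_logit(gamma)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if rhs(mid) < p_ror:
            lo = mid  # gamma is too small
        else:
            hi = mid  # gamma is large enough
    return (lo + hi) / 2.0
```

When the posited effect `alpha` is zero, the solution reduces to the logit of `p_ror`, which gives a simple sanity check on the routine.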
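Second, we use the fitted values of $\gamma_x$ to estimate the distribution of $u$ given the observed covariates and judicial action. By Bayes’ rule,

$$\Pr(u = 1 \mid a, x) = \frac{\Pr(a \mid u = 1, x)\,\Pr(u = 1 \mid x)}{\Pr(a \mid u = 1, x)\,\Pr(u = 1 \mid x) + \Pr(a \mid u = 0, x)\,\Pr(u = 0 \mid x)}.$$

With $\hat{\gamma}_x$, the terms on the right-hand side can be estimated from (2), and we can thus approximate the left-hand side.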
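Third, we have

$$\begin{aligned}
\Pr(r(t) = 1 \mid a = t, x) &= \Pr(r(t) = 1 \mid a = t, x, u = 0)\,\Pr(u = 0 \mid a = t, x) + \Pr(r(t) = 1 \mid a = t, x, u = 1)\,\Pr(u = 1 \mid a = t, x) \\
&= \Pr(r(t) = 1 \mid x, u = 0)\,\Pr(u = 0 \mid a = t, x) + \Pr(r(t) = 1 \mid x, u = 1)\,\Pr(u = 1 \mid a = t, x) \\
&= \operatorname{logit}^{-1}(\beta_{t,x})\,\Pr(u = 0 \mid a = t, x) + \operatorname{logit}^{-1}(\beta_{t,x} + \delta_{t,x})\,\Pr(u = 1 \mid a = t, x).
\end{aligned}$$

The second equality above follows from the ignorability assumption stated in (1), and the third equality follows from (3). The left-hand side can be approximated by the quantity $\hat{r}(t)$ that we obtain via response surface modeling. Importantly, $\hat{r}(t)$ is a reasonable estimate of $\Pr(r(t) = 1 \mid a = t, x)$ even though it may not be a good estimate of $\Pr(r(t) = 1 \mid x)$. This distinction is indeed the rationale of our sensitivity analysis. Given our above estimate of $\Pr(u \mid a = t, x)$ and our assumed value of $\delta_{t,x}$, the only unknown on the right-hand side is $\beta_{t,x}$. As before, there is a unique value $\hat{\beta}_{t,x}$ that satisfies the constraint.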
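With $\hat{\beta}_{t,x}$ in hand, we can now approximate the potential outcome for the action not taken:

$$\tilde{r}(\bar{a}) = \Pr(r(\bar{a}) = 1 \mid a, x),$$

where $\bar{a} = \text{bail}$ if $a = \text{RoR}$, and vice versa. Specifically, we have

$$\tilde{r}(\bar{a}) = \operatorname{logit}^{-1}(\hat{\beta}_{\bar{a},x})\,\Pr(u = 0 \mid a, x) + \operatorname{logit}^{-1}(\hat{\beta}_{\bar{a},x} + \delta_{\bar{a},x})\,\Pr(u = 1 \mid a, x). \tag{4}$$

Finally, the Rosenbaum and Rubin estimator adapted to policy evaluation is

$$\hat{V}^\pi_{RR} = \frac{1}{|\Omega|} \sum_{i} \Big[ r_i \, \mathbb{I}(\pi(x_i) = a_i) + \tilde{r}_i(\pi(x_i)) \, \mathbb{I}(\pi(x_i) \neq a_i) \Big],$$

where $\tilde{r}_i$ is computed via (4).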
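The three steps can be chained for a single case to obtain the adjusted counterfactual estimate in (4). The sketch below is one possible reading of the procedure, not the authors' code: it solves both monotone equations by bisection rather than in closed form, and every function and argument name is hypothetical.

```python
import math
from typing import Callable


def inv_logit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def solve_increasing(f: Callable[[float], float], target: float,
                     lo: float = -30.0, hi: float = 30.0, tol: float = 1e-10) -> float:
    """Bisection for f(z) = target, assuming f is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def posterior_u1(action_is_ror: bool, gamma: float, alpha: float, p_u1: float) -> float:
    """Step 2: Pr(u = 1 | a, x) by Bayes' rule, with Pr(a | u, x) taken from (2)."""
    def p_action(u: int) -> float:
        p_ror = inv_logit(gamma + u * alpha)
        return p_ror if action_is_ror else 1.0 - p_ror
    numerator = p_action(1) * p_u1
    return numerator / (numerator + p_action(0) * (1.0 - p_u1))


def adjusted_counterfactual(a_is_ror: bool, p_ror: float, r_hat_other: float,
                            p_u1: float, alpha: float, delta_other: float) -> float:
    """Estimate r-tilde for the action NOT taken by the judge.

    p_ror:       estimated Pr(a = RoR | x)
    r_hat_other: response-surface estimate of FTA under the other action
    p_u1, alpha, delta_other: posited sensitivity parameters
    """
    # Step 1: intercept gamma consistent with the observed release rate.
    gamma = solve_increasing(
        lambda g: p_u1 * inv_logit(g + alpha) + (1.0 - p_u1) * inv_logit(g), p_ror)
    # Step 3: beta for the other action, using Pr(u = 1 | a = other action, x).
    q_other = posterior_u1(not a_is_ror, gamma, alpha, p_u1)
    beta = solve_increasing(
        lambda b: (1.0 - q_other) * inv_logit(b) + q_other * inv_logit(b + delta_other),
        r_hat_other)
    # Equation (4): mix over u using Pr(u = 1 | observed action, x).
    q_obs = posterior_u1(a_is_ror, gamma, alpha, p_u1)
    return (1.0 - q_obs) * inv_logit(beta) + q_obs * inv_logit(beta + delta_other)
```

When `alpha` and `delta_other` are both zero, the unobserved covariate has no effect and the adjusted value coincides with the response-surface estimate, which is a useful check before exploring other parameter settings.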
Figure 4 shows the results of computing $\hat{V}^\pi_{RR}$ on our data in two parameter regimes. In the first (left-hand plot), we fix $\Pr(u = 1)$ and consider all combinations of a set of values for $\alpha$, $\delta_{\text{RoR}}$, and $\delta_{\text{bail}}$. All parameters are constant, independent of $x$. We thus assume that, holding the observed covariates fixed, a defendant with $u = 1$ has twice the odds of being RoR’d as one with $u = 0$, and that $u$ can double or halve the odds a defendant fails to appear. For each complex policy (i.e., one based on a random forest), the grey band shows the minimum and maximum value of $\hat{V}^\pi_{RR}$ across all parameters in this set; the error bars on the red points show the analogous quantity for the simple rules. In the right-hand plot, we consider a more extreme situation for these parameters. We find that our estimates are relatively stable in these parameter regimes. In the first case the estimated FTA rate for a given policy typically varies by only half a percentage point. Even in the more extreme setting, policies are typically stable to about one percentage point. It thus seems our conclusions are robust to unobserved heterogeneity across defendants.
3. Select-regress-and-round: A simple method for creating simple rules
We now introduce and evaluate a simple method—select-regress-and-round—that formalizes and generalizes the rule construction procedure we applied for pretrial release decisions. In particular, we dispense with ad hoc feature selection and adopt a standard statistical routine.
3.1. Rule construction
The rules we construct are designed to aid classification or ranking decisions by assigning each item in consideration a score $z$, computed as a linear combination of a subset $S$ of the item features:

$$z = \sum_{j \in S} w_j x_j,$$

where the weights $w_j$ are integers. In the cases we consider, the features themselves are typically 0-1 indicator variables (indicating, for example, whether a person is male, or whether an individual is 26–30 years old), and so the rule reduces to a weighted checklist, in which one simply sums up the (integer) weights of the applicable attributes. Often, one seeks to make binary decisions (e.g., whether to detain or to release an individual), which amounts to setting a threshold and then taking a particular course of action if and only if the score is above that threshold.

This class of rules has two natural dimensions of complexity: the number of features and the magnitude of the weights. Given integers $k \geq 1$ and $M \geq 1$, we apply the following three-step procedure to construct rules with at most $k$ features and integer weights bounded by $M$ (i.e., $|S| \leq k$ and $-M \leq w_j \leq M$).
1. Select. From the full set of features, select $k$ features via forward stepwise regression. For fixed $k$, we note that standard selection metrics (e.g., AIC or BIC) are theoretically guaranteed to yield the same set of features.

2. Regress. Using only these selected features, train an $L^1$-regularized (lasso) logistic regression model to the data, which yields (real-valued) fitted coefficients $\beta_1, \dots, \beta_k$.

3. Round. Rescale the coefficients to be in the range $[-M, M]$, and then round the rescaled coefficients to the nearest integer. Specifically, set

$$w_j = \operatorname{Round}\left(\frac{M \beta_j}{\max_i |\beta_i|}\right).$$

We note that rules constructed in this way may have fewer than $k$ features, since the lasso regression in Step 2 may result in coefficients that are identically zero, and rescaling and rounding coefficients in Step 3 may zero-out additional terms. (We select features in Step 1 with the R package leaps. The models in Step 2 are fit with the R package glmnet. The cv.glmnet method is used to determine the best value of the regularization parameter $\lambda$ with 10-fold cross-validation and 1,000 values of $\lambda$.) This select-regress-and-round strategy for rule construction builds upon findings that “improper” weighting schemes for linear models (e.g., unit weighting) lead to accurate predictions (Guilford, 1942; Dawes, 1979; Gigerenzer and Goldstein, 1996; Goel et al., 2016); in particular, our strategy incorporates feature selection and more general integer weights to generate a richer family of simple rules. We next examine the accuracy of these rules.
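The rounding step is simple enough to illustrate directly. The sketch below covers only Step 3 and assumes the selection and regression steps have already produced real-valued coefficients; the function name and the example coefficients are hypothetical.

```python
from typing import Dict


def round_coefficients(betas: Dict[str, float], M: int) -> Dict[str, int]:
    """Rescale coefficients into [-M, M], round to integers, drop zero weights.

    Note: Python's built-in round() sends exact ties (e.g. 2.5) to the nearest
    even integer; a different tie rule could be substituted if desired.
    """
    largest = max(abs(b) for b in betas.values())
    if largest == 0:
        return {}  # every coefficient is zero, so no feature survives
    weights = {name: round(M * b / largest) for name, b in betas.items()}
    # Features whose weight rounds to zero fall out of the rule entirely.
    return {name: w for name, w in weights.items() if w != 0}
```

For example, with made-up coefficients 1.5, -0.9, and 0.1 and a bound of 3, the first two features receive weights 3 and -2, and the third is dropped because its rescaled weight rounds to zero. This mirrors the remark above that the final rule may use fewer features than were selected.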
3.2. Rule evaluation
We apply the select-regress-and-round procedure to 22 publicly available datasets to examine the tradeoff between complexity and performance. These datasets all come from the UCI ML repository, and were selected according to four criteria: (1) the dataset involves a binary classification (as opposed to a regression) problem (for those datasets whose outcome variable takes more than two values, we set the majority class as the target variable, so that all the tasks we consider involve binary classification); (2) the dataset is provided in a standard and complete form; (3) the dataset involves more than 10 features; and (4) the classification problem is one that a human could plausibly learn to solve with the given features. For example, we included a dataset in which the task was to determine whether cells were malignant or benign based on various biological attributes of the cells, but we excluded image recognition tasks in which the features were represented as pixel values. This fourth requirement limits the scope of our analysis and conclusions to domains in which human decision makers typically act without the aid of a computer. The 22 UCI datasets we consider are: adult, annealing, audiology-std, bank, bankruptcy, car, chess-krvk, chess-krvkp, congress-voting, contrac, credit-approval, ctg, cylinder-bands, dermatology, german_credit, heart-cleveland, ilpd, mammo, mushroom, aus_credit, wine, and wine_qual.
Unlike the judicial decisions discussed in Section 2, outcomes in the domains we consider here are unaffected by a decision maker’s actions. For example, assessing the likelihood a cell is malignant—and then acting on that knowledge—does not change the fact that the cell was either malignant or not at the time of the measurement. In contrast, a judge’s decision to release or detain an individual necessarily alters the defendant’s likelihood of appearing at trial. Further, in the UCI domains, we observe outcomes for every example, not only a subset in which a decision maker chose to act. Decision rules are constructed similarly in both the UCI and bail datasets. Evaluating the resulting rules, however, is significantly easier for the UCI datasets: since outcomes are independent of actions and are observed for all examples, one need not consider subtle issues of causal inference.
On each of the 22 datasets we analyze here, we construct simple rules for a range of the number of features $k$ and the magnitude of the weights $M$. We benchmark the performance of these rules against three standard statistical models: logistic regression, $L^1$-regularized logistic regression, and random forest. These models were fit in R with the glm, glmnet, and randomForest packages, respectively. For the $L^1$-regularized logistic regression models, the cv.glmnet method was used to determine the best value of the regularization parameter $\lambda$ with 10-fold cross-validation and 1,000 values of $\lambda$. We used 1,000 trees for the random forest models. This head-to-head comparison is a difficult test for the simple rules in part because they can only base their predictions on 1 to 10 features. The complex models, in contrast, can train and predict with all features, which number between 11 and 93 with a mean of 38.
Figure 5 shows model performance (measured in terms of mean AUC across the 22 datasets) as a function of model size and coefficient range. The AUC for each model on each dataset is computed via 10-fold cross-validation. We find that simple rules with only five features and integer coefficients between -3 and 3 perform on par with logistic regression and $L^1$-regularized logistic regression trained on the full set of features. For 1 to 10 features, the [-3, 3] model (green line) differs from the unrounded lasso model (black line) by less than 1 percentage point. The performance of the random forest model is somewhat better: trained on all features, random forest achieves mean AUC of 92%; the mean AUC is 87% for simple rules with at most five features and integer coefficients between -3 and 3. Complex prediction methods certainly have their advantages, but the gap in performance between simple rules and fully optimized prediction methods is not as large as one might have thought.
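For readers who want to reproduce the metric, AUC can be computed directly from scores and labels as the fraction of positive-negative pairs that are ranked correctly, counting ties as one half. The sketch below is a plain-Python illustration with made-up scores, not the evaluation code behind Figure 5; in a cross-validated setting it would be applied to the held-out examples of each fold.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical integer rule scores and binary outcomes
toy_scores = [3, 1, 2, 0, 2, -1]
toy_labels = [1, 0, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

Because integer-weighted rules produce many tied scores, the half-credit for ties matters more here than it does for continuous model outputs.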
3.3. Benchmarking to integer programming
The simple rules we construct take the form of a linear scoring rule with integer weights. To produce such rules, mixed-integer programming is a natural alternative to our select-regress-and-round strategy, and supersparse linear integer models (SLIM) (Ustun and Rudin, 2016) is the leading instantiation of that approach. Given constraints on the number of features and the magnitude of the integer weights, SLIM produces rules that optimize for binary classification accuracy (i.e., 0-1 loss).
We compare SLIM to select-regress-and-round on the judicial decision-making problem and on the 22 UCI datasets. Figure 6 (left panel) shows estimated FTA and release rates for the random forest model (black line), our simple rules derived in Section 2 (red points), and the simple rules produced by SLIM (blue points). As with our own simple rules, we constrain SLIM to produce rules based on age and number of past FTAs, with integer weights ranging from -10 to 10. As before, decision rules are constructed from the random forest and select-regress-and-round risk scores by varying the decision threshold; in contrast, multiple rules for SLIM are computed by varying a parameter that specifies the maximum acceptable false positive rate (Ustun and Rudin, 2016). Both methods for producing simple rules perform nearly the same as the random forest model trained on the full set of 64 features.
We next consider the 22 UCI datasets. SLIM is known to work best when the features are discrete (Zeng et al., 2016). We thus pre-process the datasets by discretizing all continuous features into three bins containing an approximately equal number of examples, representing low, medium, and high values of the feature. Integer programming is an NP-hard problem, and so following Ustun and Rudin (2016) we set a time limit for SLIM; they set a 10-minute limit, but we allow up to 6 hours of computation per model. For 7 of the 22 datasets, SLIM found an integer-optimal solution within the time limit, and returned approximate solutions in the remaining 15 cases. Figure 6 (right panel) compares binary classification accuracy of SLIM and select-regress-and-round on the 22 UCI datasets, where each point corresponds to a dataset. Both methods are constrained to produce rules with at most five features and integer coefficients between -3 and 3. We show 0-1 accuracy since SLIM optimizes for this metric, but similar results hold for AUC; accuracy is computed out-of-sample via 10-fold cross-validation. Both methods for producing simple rules yield comparable results. Averaged across all 22 datasets, SLIM and select-regress-and-round both achieve mean accuracy of 86%. Even in the 7 cases where SLIM found integer-optimal solutions, performance is nearly identical to our simple select-regress-and-round strategy.
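The three-bin discretization can be sketched as follows. This is an illustrative Python version under the assumption that cut points are taken at the empirical terciles of each feature; the function name and sample values are hypothetical, and repeated values at a cut point can make the bin sizes unequal.

```python
def tercile_bins(values):
    """Assign each value to bin 0 (low), 1 (medium), or 2 (high)."""
    if len(values) < 3:
        raise ValueError("need at least three values to form three bins")
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the 1/3 and 2/3 positions of the sorted values
    low_cut = ordered[n // 3]
    high_cut = ordered[(2 * n) // 3]
    bins = []
    for v in values:
        if v < low_cut:
            bins.append(0)
        elif v < high_cut:
            bins.append(1)
        else:
            bins.append(2)
    return bins


# Hypothetical continuous feature with nine observations
feature = [22, 35, 41, 19, 58, 30, 47, 26, 63]
feature_bins = tercile_bins(feature)
```

Each binned feature can then be expanded into indicator variables for low, medium, and high values before either method is fit.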
In terms of classification accuracy, select-regress-and-round generates rules on par with those obtained by solving mixed-integer programs. We note, however, two advantages of our approach. First, whereas select-regress-and-round yields results almost instantaneously, integer programs can be computationally expensive to solve. Second, our approach is both conceptually and technically simple, requiring little statistical or computational expertise, and accordingly easing adoption for practitioners.
4. The robustness of classification
Why is it that simple rules often perform as well as the most sophisticated statistical methods? In part it is because binary classification is robust to error in the underlying predictive model, an observation that we formalize in Theorem 4.1 below.
To establish this result, we start by considering the prediction scores generated via a standard statistical method (such as logistic regression trained on the full set of available features), which we call the “true” scores. As in linear discriminant analysis, we assume that the true scores for positive and negative instances are normally distributed with equal variance: $N(\mu^+, \sigma^2)$ and $N(\mu^-, \sigma^2)$, respectively. The homoscedasticity assumption guarantees the Bayes optimal classifier is a threshold rule on the scores. For scores estimated via logistic regression, the normality assumption is reasonable if we consider the scores on the logit scale rather than on the probability scale. Figure 7 (left panel) shows such scores for one of the UCI datasets. We further assume that the process of generating simple rules (both limiting the number of features and restricting the possible values of the weights) can be viewed as adding normal, mean-zero noise to the true scores; Figure 7 (center panel) plots the distribution of this noise for one of the datasets. [Footnote 8: We estimate the noise distribution by taking the difference between the simple and true scores. Before taking the difference, we convert the simple scores to the scale of true scores by dividing the simple scores by the scaling factor used when generating the rule.] Thus, with simple rules, instead of making classification decisions based on the true scores, we assume decisions are made in terms of a noisy approximation. Under this analytic framework, Theorem 4.1 shows that the drop in classification performance (as measured by AUC) can be expressed in terms of the “true AUC” (i.e., the AUC under the true scores) and $\gamma$, the ratio of the noise variance to the within-class variance of the true scores. In particular, we find that when the magnitude of the noise is on par with (or smaller than) the score variance (i.e., $\gamma \le 1$), then the AUC of the noisy approximation is comparable to the true AUC.
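Theorem 4.1.
For a binary classification task, let $Y$ be a continuous random variable that denotes the prediction score of a random instance, and let $Y^+$ and $Y^-$ denote the conditional distributions of $Y$ for positive and negative instances, respectively. Suppose $Y^+ \sim N(\mu^+, \sigma^2)$ and $Y^- \sim N(\mu^-, \sigma^2)$. Then, for $\epsilon \sim N(0, \sigma_\epsilon^2)$ and $\hat{Y} = Y + \epsilon$,
$$\mathrm{AUC}_{\hat{Y}} = \Phi\left(\frac{\Phi^{-1}(\mathrm{AUC}_Y)}{\sqrt{1 + \gamma}}\right),$$
where $\gamma = \sigma_\epsilon^2 / \sigma^2$, and $\Phi$ is the CDF for the standard normal.
Proof. In general, AUC is equal to the probability that a randomly selected positive instance has a higher prediction score than a randomly selected negative instance, and so $\mathrm{AUC}_Y = P(Y^+ > Y^-) = P(Y^+ - Y^- > 0)$. Since $Y^+ - Y^-$ is normally distributed with mean $\mu^+ - \mu^-$ and variance $2\sigma^2$,
$$\mathrm{AUC}_Y = 1 - \Phi\left(-\frac{\mu^+ - \mu^-}{\sqrt{2}\,\sigma}\right) = \Phi\left(\frac{\mu^+ - \mu^-}{\sqrt{2}\,\sigma}\right),$$
where the last equality follows from symmetry of the normal distribution.
Now define $\hat{Y}^+ = Y^+ + \epsilon$, so $\hat{Y}^+ \sim N(\mu^+, \sigma^2 + \sigma_\epsilon^2)$, with $\hat{Y}^-$ defined similarly. A short computation shows that
$$\mathrm{AUC}_{\hat{Y}} = \Phi\left(\frac{\Phi^{-1}(\mathrm{AUC}_Y)}{\sqrt{1 + \gamma}}\right).$$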
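The short computation can be spelled out as follows. The notation is an assumption made for this sketch, since it restates the setup: $\mu^+$ and $\mu^-$ are the class means of the true scores, $\sigma^2$ is their common variance, $\sigma_\epsilon^2$ is the noise variance, $\gamma = \sigma_\epsilon^2/\sigma^2$, and the noise is taken to be independent of the scores, so that the difference of two noisy scores has variance $2(\sigma^2 + \sigma_\epsilon^2)$.

```latex
\begin{align*}
\mathrm{AUC}_{\hat{Y}}
  &= P(\hat{Y}^+ - \hat{Y}^- > 0)
   = \Phi\!\left(\frac{\mu^+ - \mu^-}{\sqrt{2(\sigma^2 + \sigma_\epsilon^2)}}\right) \\
  &= \Phi\!\left(\frac{\mu^+ - \mu^-}{\sqrt{2}\,\sigma}
       \cdot \frac{1}{\sqrt{1 + \sigma_\epsilon^2/\sigma^2}}\right)
   = \Phi\!\left(\frac{\Phi^{-1}(\mathrm{AUC}_Y)}{\sqrt{1 + \gamma}}\right).
\end{align*}
```

The last step uses the noise-free identity $\Phi^{-1}(\mathrm{AUC}_Y) = (\mu^+ - \mu^-)/(\sqrt{2}\,\sigma)$, which follows from applying the same argument to the true scores.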
Theorem 4.1 establishes a direct theoretical link between performance and noise in model specification. To give a better sense of how the analytic expression for the AUC under the noisy scores, $\mathrm{AUC}_{\hat{Y}}$, varies with the true AUC, $\mathrm{AUC}_Y$, and $\gamma$, Figure 7 (right panel) shows this expression for various parameter values. For example, for $\gamma = 1/2$ the expression becomes $\mathrm{AUC}_{\hat{Y}} = \Phi(\Phi^{-1}(\mathrm{AUC}_Y)/\sqrt{1.5})$. That is, if the noise variance is equal to half the within-class variance of the true scores, then the drop in performance is relatively small.
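The closed form implied by the proof, $\Phi(\Phi^{-1}(\mathrm{AUC}_Y)/\sqrt{1+\gamma})$, is easy to evaluate numerically. The sketch below uses Python's standard-library `statistics.NormalDist` for $\Phi$ and its inverse. The function name `noisy_auc` and the inputs (a true AUC of 0.90 with a noise ratio of 0.5) are illustrative assumptions, not figures quoted from the text.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def noisy_auc(true_auc, gamma):
    """AUC after adding normal noise with variance gamma times the within-class variance."""
    if not 0.0 < true_auc < 1.0:
        raise ValueError("true_auc must lie strictly between 0 and 1")
    if gamma < 0:
        raise ValueError("gamma is a variance ratio and cannot be negative")
    z = _STD_NORMAL.inv_cdf(true_auc)  # Phi^{-1}(AUC_Y)
    return _STD_NORMAL.cdf(z / sqrt(1.0 + gamma))


# Illustrative inputs: noise variance equal to half the within-class variance
example = noisy_auc(0.90, 0.5)  # about 0.85
```

With no noise the function returns the true AUC unchanged, and as the noise ratio grows the result shrinks toward 0.5, the AUC of a random ranking.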
While connecting model performance to model noise, Theorem 4.1 leaves unanswered how much noise simple rules add to the underlying scores. This question seems difficult to answer theoretically. We can, however, empirically estimate how much noise simple rules add in the datasets we analyze. [Footnote 9: To estimate $\gamma$ for a specific simple rule on a given dataset, we first compute the average within-class variance of the true scores, where these scores are generated via an $L^1$-regularized logistic regression model. We estimate the noise variance $\sigma_\epsilon^2$ by taking the variance of the noise, as described in Footnote 8.] Across the 22 UCI datasets we consider, we find that rules with five features and a coefficient range of -3 to 3 have a small average value of $\gamma$. This low empirically observed noise is in line with our finding that such simple rules perform well on these datasets.
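The estimation procedure described above can be sketched as below. This is an assumption-laden illustration: `true_scores` stands in for logit-scale scores from the regularized regression, `simple_scores` for the integer-rule scores, and `scale` for the factor used when generating the rule. The within-class variance is taken as the unweighted mean of the two class variances, which is one reading of "average"; all names and numbers are hypothetical.

```python
from statistics import pvariance


def estimate_noise_ratio(true_scores, simple_scores, labels, scale):
    """Estimate the ratio of noise variance to average within-class score variance."""
    # Noise: simple scores mapped back to the true-score scale, minus the true scores
    noise = [s / scale - t for s, t in zip(simple_scores, true_scores)]
    noise_var = pvariance(noise)

    # Average within-class variance of the true scores (assumed unweighted mean)
    pos = [t for t, y in zip(true_scores, labels) if y == 1]
    neg = [t for t, y in zip(true_scores, labels) if y == 0]
    within_var = (pvariance(pos) + pvariance(neg)) / 2.0
    return noise_var / within_var


# Hypothetical scores for six examples (three positive, three negative)
true_scores = [2.0, 1.0, 3.0, -1.0, -2.0, 0.0]
simple_scores = [5, 1, 6, -1, -5, 1]
labels = [1, 1, 1, 0, 0, 0]
ratio = estimate_noise_ratio(true_scores, simple_scores, labels, scale=2.0)
```

A ratio at or below 1 corresponds to the regime in which the noisy rule's AUC stays close to the true AUC.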
5. Conclusion
In this paper we introduced select-regress-and-round, a simple method for constructing decision rules that are fast, frugal, and clear. In an analysis of pretrial release decisions, simple rules outperformed human judges and matched the performance of a sophisticated statistical model. Generalizing this result, in 22 domains of varying size and complexity, the simple mental checklists produced by the select-regress-and-round method rivaled the performance of regularized regression models while using only a fraction of the information.
These results complement a growing body of work in statistics and computer science in which sophisticated algorithms are used to create interpretable scoring systems and rule sets (Ustun and Rudin, 2016; Letham et al., 2015; Lakkaraju et al., 2016; Lakkaraju and Rudin, 2017). Many prior rule construction methods offer great flexibility and performance (Ustun and Rudin, 2017), but in turn require considerable computational expertise to carry out. In contrast, the simple rules in this article can be created by practitioners with only basic statistical knowledge and generic software. For practitioners to favor statistics over intuition, we believe decision rules must not only be simple to apply but also simple to create.
Acknowledgements. We thank Avi Feller, Andrew Gelman, Gerd Gigerenzer, Art Owen, and Berk Ustun for helpful conversations.
- Cassel et al. (1976) Claes M Cassel, Carl E Särndal, and Jan H Wretman. 1976. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63, 3 (1976), 615–620.
- Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230 (2017).
- Danziger et al. (2011) Shai Danziger, Jonathan Levav, and Liora Avnaim-Pesso. 2011. Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences 108, 17 (2011), 6889–6892.
- Dawes (1979) Robyn M Dawes. 1979. The robust beauty of improper linear models in decision making. American Psychologist 34, 7 (1979), 571.
- Dawes et al. (1989) Robyn M Dawes, David Faust, and Paul E Meehl. 1989. Clinical versus actuarial judgment. Science 243, 4899 (1989), 1668–1674.
- Dhami (2003) Mandeep K Dhami. 2003. Psychological models of professional decision making. Psychological Science 14, 2 (2003), 175–180.
- Dudík et al. (2011) Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly Robust Policy Evaluation and Learning. ICML (2011). DOI:http://dx.doi.org/10.1214/14-STS500
- Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res 15, 1 (2014), 3133–3181.
- Gigerenzer and Goldstein (1996) Gerd Gigerenzer and Daniel G Goldstein. 1996. Reasoning the fast and frugal way: models of bounded rationality. Psychological Review 103, 4 (1996), 650.
- Gigerenzer et al. (2011) Gerd Gigerenzer, Ralph Hertwig, and Thorsten Pachur. 2011. Heuristics: The foundations of adaptive behavior. Oxford University Press, Inc.
- Gleicher (2016) Michael Gleicher. 2016. A Framework for Considering Comprehensibility in Modeling. Big Data 4, 2 (2016), 75–88.
- Goel et al. (2016) Sharad Goel, Justin M Rao, and Ravi Shroff. 2016. Precinct or Prejudice? Understanding Racial Disparities in New York City’s Stop-and-Frisk Policy. Annals of Applied Statistics (2016).
- Goodman and Flaxman (2016) Bryce Goodman and Seth Flaxman. 2016. EU regulations on algorithmic decision-making and a right to explanation. arXiv preprint arXiv:1606.08813 (2016).
- Guilford (1942) Joy Paul Guilford. 1942. Fundamental statistics in psychology and education. McGraw-Hill.
- Hill (2012) Jennifer L Hill. 2012. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics (2012).
- Kang and Schafer (2007) Joseph DY Kang and Joseph L Schafer. 2007. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science (2007), 523–539.
- Kim et al. (2014) Been Kim, Cynthia Rudin, and Julie A Shah. 2014. The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. In Advances in Neural Information Processing Systems 27. 1952–1960.
- Kim et al. (2015) Been Kim, Julie A Shah, and Finale Doshi-Velez. 2015. Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction. In Advances in Neural Information Processing Systems 28. 2260–2268.
- Kleinberg et al. (2017) Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human Decisions and Machine Predictions. (2017). http://www.nber.org/papers/w23180 Working paper.
- Kleinberg et al. (2015) Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. 2015. Prediction policy problems. The American Economic Review 105, 5 (2015).
- Lakkaraju et al. (2016) Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining.
- Lakkaraju and Rudin (2017) Himabindu Lakkaraju and Cynthia Rudin. 2017. Learning Cost-Effective Treatment Regimes using Markov Decision Processes. International Conference on Artificial Intelligence and Statistics (AISTATS) (2017).
- Letham et al. (2015) Benjamin Letham, Cynthia Rudin, Tyler H McCormick, and David Madigan. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9, 3 (2015), 1350–1371.
- Marewski and Gigerenzer (2012) Julian N Marewski and Gerd Gigerenzer. 2012. Heuristic decision making in medicine. Dialogues Clin Neurosci 14, 1 (2012), 77–89.
- McDonald (1996) Clement J. McDonald. 1996. Medical Heuristics: The Silent Adjudicators of Clinical Practice. Annals of Internal Medicine 124, 1 Part 1 (1996), 56–62.
- Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning. ACM, 625–632.
- Robins and Rotnitzky (1995) James M Robins and Andrea Rotnitzky. 1995. Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90, 429 (1995), 122–129.
- Robins et al. (1994) James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 427 (1994), 846–866.
- Rosenbaum and Rubin (1983a) Paul R Rosenbaum and Donald B Rubin. 1983a. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society. Series B (Methodological) (1983), 212–218.
- Rosenbaum and Rubin (1983b) Paul R Rosenbaum and Donald B Rubin. 1983b. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
- Rosenbaum and Rubin (1984) Paul R Rosenbaum and Donald B Rubin. 1984. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 387 (1984), 516–524.
- Sull and Eisenhardt (2015) Donald Sull and Kathleen M Eisenhardt. 2015. Simple rules: How to thrive in a complex world. Houghton Mifflin Harcourt.
- Tetlock (2005) Philip Tetlock. 2005. Expert political judgment: How good is it? How can we know? Princeton University Press.
- Ustun and Rudin (2016) Berk Ustun and Cynthia Rudin. 2016. Supersparse linear integer models for optimized medical scoring systems. Machine Learning 102, 3 (2016), 349–391.
- Ustun and Rudin (2017) Berk Ustun and Cynthia Rudin. 2017. Learning Optimized Risk Scores on Large-Scale Datasets. arXiv preprint arXiv:1610.00168 (2017).
- Wübben and Wangenheim (2008) Markus Wübben and Florian V Wangenheim. 2008. Instant customer base analysis: Managerial heuristics often get it right. Journal of Marketing 72, 3 (2008), 82–93.
- Zeng et al. (2016) Jiaming Zeng, Berk Ustun, and Cynthia Rudin. 2016. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society) (2016).