# An Optimization Approach to Learning Falling Rule Lists

## Abstract

A falling rule list is a probabilistic decision list for binary classification, consisting of a series of if-then rules with antecedents in the if clauses and probabilities of the desired outcome (“1”) in the then clauses. Just as in a regular decision list, the order of rules in a falling rule list is important – each example is classified by the first rule whose antecedent it satisfies. Unlike a regular decision list, a falling rule list requires the probabilities of the desired outcome (“1”) to be monotonically decreasing down the list. We propose an optimization approach to learning falling rule lists and “softly” falling rule lists, along with Monte-Carlo search algorithms that use bounds on the optimal solution to prune the search space.

## 1 Introduction

In many real-life scenarios, we want to learn a predictive model that allows us to easily identify the most significant conditions that are predictive of a certain outcome. For example, in health care, doctors often want to know the conditions that signify a high risk of stroke, so that patients with such conditions can be prioritized in receiving treatment. A falling rule list, whose form was first proposed by Wang and Rudin (2015), is a type of model that serves this purpose.

Table 1 shows a falling rule list we learned from the bank-full dataset, which was used by Moro et al. (2011) in their study of applying data mining techniques to direct marketing. As we can see, a falling rule list is a probabilistic decision list for binary classification, consisting of a series of if-then rules with antecedents in the if clauses and probabilities of the desired outcome (“1”) in the then clauses, where the probabilities of the desired outcome (“1”) are monotonically decreasing down the list (hence the name “falling” rule list). The falling rule list in Table 1 has identified clients for whom the previous marketing campaign was successful (“poutcome=success”), and who have no credit in default (“default=no”), as individuals who are most likely to subscribe to a term deposit in the current marketing campaign. Their probability of subscribing is . Of the remaining clients, those who are next most likely to sign up for a term deposit are older people (aged between 60 and 100) with no credit in default. Their probability of subscribing is . The two rightmost columns in Table 1, labeled and , show the number of positive training examples (i.e. clients who subscribe to a term deposit in the current campaign) and of negative training examples, respectively, that satisfy the antecedent in each rule of the falling rule list.

Falling rule lists can provide valuable insight into data – if we know how to construct them well. In this paper, we propose an optimization approach to learning falling rule lists and “softly” falling rule lists, along with Monte-Carlo search algorithms that use bounds on the optimal solution to prune the search space. The falling rule list shown in Table 1 was produced using Algorithm FRL, which we shall introduce later.

Our work lives within several well-established fields, but is the first work we know of to use an optimization approach to handling monotonicity constraints in rule-based models. It relates closely to associative classification (e.g. the RIPPER algorithm (Cohen, 1995) and the CBA algorithm (Liu et al., 1998); see Thabtah (2007) for a comprehensive review) and inductive logic programming (Muggleton and De Raedt, 1994). The proposed algorithms are competitors for decision tree methods like CART (Breiman et al., 1984), ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and C5.0 (Quinlan, 2004), and decision list learning (Rivest, 1987). Almost all methods from this class build decision trees from the top down using greedy splitting criteria. Greedy splitting criteria do not lend themselves naturally to constrained models like falling rule lists. There are some works on decision trees with monotonicity constraints (e.g. Altendorf et al., 2005; Ben-David, 1995; Feelders and Pardoel, 2003), but they focus mostly on enforcing the monotonic relationship between certain attributes and ordinal class labels. In addition, our work relates to those that underline the importance of the interpretability of models (Freitas, 2014; Huysmans et al., 2011; Kodratoff, 1994; Martens and Baesens, 2010).

Wang and Rudin (2015) proposed the form of a falling rule list, and a Bayesian approach to learning falling rule lists (extending the ideas of Letham et al. (2015) and Yang et al. (2017)). The Bayesian approach offers some advantages: e.g. a full posterior over rule lists allows model averaging. However, the optimization perspective has an important computational advantage: the search space is made substantially smaller by the tight bounds presented here. The concept of softly falling rule lists is novel to this paper and has not been done in the Bayesian setting.

## 2 Problem Formulation

We first formalize the notion of an antecedent, of a rule list, of a falling rule list, and of a prefix.

###### Definition 2.1.

An antecedent $a$ on an input domain $\mathcal{X}$ is a Boolean function that outputs true or false. Given an input $x \in \mathcal{X}$, we say that $x$ satisfies the antecedent $a$ if $a(x)$ evaluates to true. For example, (poutcome=success AND default=no) in Table 1 is an antecedent.

###### Definition 2.2.

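A rule list $d$ on an input domain $\mathcal{X}$ is a probabilistic decision list of the following form: “if $x$ satisfies $a^{(d)}_0$, then $\Pr(y=1 \mid x)=\hat{\alpha}^{(d)}_0$; else if $x$ satisfies $a^{(d)}_1$, then $\Pr(y=1 \mid x)=\hat{\alpha}^{(d)}_1$; …; else if $x$ satisfies $a^{(d)}_{|d|-1}$, then $\Pr(y=1 \mid x)=\hat{\alpha}^{(d)}_{|d|-1}$; else $\Pr(y=1 \mid x)=\hat{\alpha}^{(d)}_{|d|}$” where $a^{(d)}_j$ is the $j$-th antecedent in $d$, $j \in \{0,1,\dots,|d|-1\}$, and $|d|$ denotes the size of the rule list, which is defined as the number of rules, excluding the final else clause, in the rule list. We can denote the rule list $d$ as follows:

$$d=\left\{\left(a^{(d)}_0,\hat{\alpha}^{(d)}_0\right),\left(a^{(d)}_1,\hat{\alpha}^{(d)}_1\right),\dots,\left(a^{(d)}_{|d|-1},\hat{\alpha}^{(d)}_{|d|-1}\right),\hat{\alpha}^{(d)}_{|d|}\right\}. \tag{1}$$

The rule list $d$ of Equation (1) is a falling rule list if the following inequalities hold:

$$\hat{\alpha}^{(d)}_0\ge\hat{\alpha}^{(d)}_1\ge\dots\ge\hat{\alpha}^{(d)}_{|d|-1}\ge\hat{\alpha}^{(d)}_{|d|}. \tag{2}$$

For convenience, we sometimes refer to the final else clause in $d$ as the $|d|$-th antecedent in $d$, which is satisfied by all $x \in \mathcal{X}$. We denote the space of all possible rule lists on $\mathcal{X}$ by $\mathcal{D}(\mathcal{X})$.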

###### Definition 2.3.

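A prefix $e$ on an input domain $\mathcal{X}$ is a rule list without the final else clause. We can denote the prefix $e$ as follows:

$$e=\left\{\left(a^{(e)}_0,\hat{\alpha}^{(e)}_0\right),\left(a^{(e)}_1,\hat{\alpha}^{(e)}_1\right),\dots,\left(a^{(e)}_{|e|-1},\hat{\alpha}^{(e)}_{|e|-1}\right)\right\}, \tag{3}$$

where $a^{(e)}_j$ is the $j$-th antecedent in $e$, $j \in \{0,1,\dots,|e|-1\}$, and $|e|$ denotes the size of the prefix, which is defined as the number of rules in the prefix.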

###### Definition 2.4.

Given the rule list $d$ of Equation (1) (or the prefix $e$ of Equation (3)), we say that an input $x$ is captured by the $j$-th antecedent in $d$ (or $e$) if $x$ satisfies $a^{(d)}_j$ (or $a^{(e)}_j$, respectively), and for all $k \in \{0,1,\dots,|d|\}$ (or $k \in \{0,1,\dots,|e|-1\}$, respectively) such that $x$ satisfies $a^{(d)}_k$ (or $a^{(e)}_k$, respectively), $j \le k$ holds – in other words, $a^{(d)}_j$ (or $a^{(e)}_j$, respectively) is the first antecedent that $x$ satisfies. We define the function capt by $\mathrm{capt}(x,d)=j$ (or $\mathrm{capt}(x,e)=j$) if $x$ is captured by the $j$-th antecedent in $d$ (or $e$). Moreover, given the prefix $e$ of Equation (3), we say that an input $x$ is captured by the prefix $e$ if $x$ is captured by some antecedent in $e$, and we define $\mathrm{capt}(x,e)=|e|$ if $x$ is not captured by the prefix $e$.

Let $D=\{(x_i,y_i)\}_{i=1}^{n}$ be the training data, with $x_i \in \mathcal{X}$ and $y_i \in \{1,-1\}$ for each $i \in \{1,2,\dots,n\}$. We now define the empirical positive proportion of an antecedent, and introduce the notion of a rule list (or a prefix) that is compatible with $D$.

###### Definition 2.5.

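Given the training data $D$ and the rule list $d$ of Equation (1) (or the prefix $e$ of Equation (3)), we denote by $n^+_{j,d,D}$, $n^-_{j,d,D}$, $n_{j,d,D}$ (or $n^+_{j,e,D}$, $n^-_{j,e,D}$, $n_{j,e,D}$), the number of positive, negative, and all training inputs captured by the $j$-th antecedent in $d$ (or $e$), respectively, and define the empirical positive proportion of the $j$-th antecedent in $d$ (or $e$), denoted by $\alpha^{(d,D)}_j$ (or $\alpha^{(e,D)}_j$), as:

$$\alpha^{(d,D)}_j=n^+_{j,d,D}/n_{j,d,D}\quad\left(\text{or }\alpha^{(e,D)}_j=n^+_{j,e,D}/n_{j,e,D}\right).$$

Moreover, given the training data $D$ and the prefix $e$ of Equation (3), we denote by $\tilde{n}^+_{e,D}$, $\tilde{n}^-_{e,D}$, $\tilde{n}_{e,D}$, the number of positive, negative, and all training inputs that are not captured by the prefix $e$, and define the empirical positive proportion after the prefix $e$, denoted by $\tilde{\alpha}_{e,D}$, as $\tilde{\alpha}_{e,D}=\tilde{n}^+_{e,D}/\tilde{n}_{e,D}$.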
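Definitions 2.4 and 2.5 translate almost directly into code. The following is a minimal sketch, assuming that an antecedent is represented as a Python predicate on a dictionary-valued input; this representation and the names `capt` and `capture_counts` are illustrative choices, not part of the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: an antecedent is a predicate on one input.
Antecedent = Callable[[dict], bool]


def capt(x: dict, antecedents: Sequence[Antecedent]) -> int:
    """Index of the first antecedent that x satisfies.

    Returns len(antecedents) when x is not captured by any of them,
    matching the convention capt(x, e) = |e| for a prefix e.
    """
    for j, a in enumerate(antecedents):
        if a(x):
            return j
    return len(antecedents)


def capture_counts(
    antecedents: Sequence[Antecedent], X: Sequence[dict], y: Sequence[int]
) -> List[Tuple[int, int]]:
    """(n_plus, n_minus) per antecedent, plus a final pair for the
    training inputs that are not captured by the prefix."""
    counts = [[0, 0] for _ in range(len(antecedents) + 1)]
    for x, label in zip(X, y):
        j = capt(x, antecedents)
        counts[j][0 if label == 1 else 1] += 1
    return [(p, m) for p, m in counts]


def positive_proportion(n_plus: int, n_minus: int):
    """Empirical positive proportion; None when nothing is captured."""
    total = n_plus + n_minus
    return n_plus / total if total > 0 else None
```

The last pair returned by `capture_counts` corresponds to the counts after the prefix, so applying `positive_proportion` to it gives the empirical positive proportion after the prefix.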

###### Definition 2.6.

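Given the training data $D$ and the rule list $d$ of Equation (1) (or the prefix $e$ of Equation (3)), we say that the rule list $d$ (or the prefix $e$) is compatible with $D$ if for all $j \in \{0,1,\dots,|d|\}$ (or $j \in \{0,1,\dots,|e|-1\}$, respectively), the equation $\hat{\alpha}^{(d)}_j=\alpha^{(d,D)}_j$ (or $\hat{\alpha}^{(e)}_j=\alpha^{(e,D)}_j$, respectively) holds. We denote the space of all possible rule lists on $\mathcal{X}$ that are compatible with the training data $D$ by $\mathcal{D}(\mathcal{X},D)$.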

To formulate the problem of learning falling rule lists from data as an optimization program, we first observe that, given a threshold , the rule list of Equation (1) can be viewed as a classifier that predicts for an input only if the inequality holds. Hence, we can define the empirical risk of misclassification by the rule list on the training data as that by the classifier . More formally, we have the following definition.

###### Definition 2.7.

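Given the training data $D$, the rule list $d$ of Equation (1), a threshold $\tau$, and the weight $w$ for the positive class, the empirical risk of misclassification by the rule list $d$ on the training data $D$ with threshold $\tau$ and with weight $w$ for the positive class, denoted by $R(d,D,\tau,w)$, is:

$$R(d,D,\tau,w)=\frac{1}{n}\left(w\sum_{i:y_i=1}\mathbb{1}\left[\hat{\alpha}^{(d)}_{\mathrm{capt}(x_i,d)}\le\tau\right]+\sum_{i:y_i=-1}\mathbb{1}\left[\hat{\alpha}^{(d)}_{\mathrm{capt}(x_i,d)}>\tau\right]\right). \tag{4}$$

If $d$ is compatible with $D$, we can replace $\hat{\alpha}^{(d)}_{\mathrm{capt}(x_i,d)}$ in Equation (4) with $\alpha^{(d,D)}_{\mathrm{capt}(x_i,d)}$. We define the empirical risk of misclassification by the prefix $e$ on the training data $D$ with threshold $\tau$ and with weight $w$ for the positive class, denoted by $R(e,D,\tau,w)$, analogously:

$$R(e,D,\tau,w)=\frac{1}{n}\left(w\sum_{\substack{i:y_i=1\\ \mathrm{capt}(x_i,e)\neq|e|}}\mathbb{1}\left[\hat{\alpha}^{(e)}_{\mathrm{capt}(x_i,e)}\le\tau\right]+\sum_{\substack{i:y_i=-1\\ \mathrm{capt}(x_i,e)\neq|e|}}\mathbb{1}\left[\hat{\alpha}^{(e)}_{\mathrm{capt}(x_i,e)}>\tau\right]\right). \tag{5}$$

If $e$ is compatible with $D$, we can replace $\hat{\alpha}^{(e)}_{\mathrm{capt}(x_i,e)}$ in Equation (5) with $\alpha^{(e,D)}_{\mathrm{capt}(x_i,e)}$. Note that for any rule list $d$ that begins with a given prefix $e$, $R(e,D,\tau,w)$ is the contribution by the prefix $e$ to $R(d,D,\tau,w)$.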
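As an illustration of Equation (4), the weighted risk can be computed from per-antecedent counts rather than per-example sums, because every training input captured by the same antecedent receives the same probability estimate. The sketch below assumes the rule list is summarized by its probability estimates and by the pairs of positive and negative counts captured by each antecedent (with the final else clause last); the function name `empirical_risk` is hypothetical.

```python
def empirical_risk(alpha_hat, counts, tau, w):
    """Weighted misclassification risk of Equation (4), grouped by antecedent.

    alpha_hat: probability estimates, one per antecedent plus the else clause.
    counts:    (n_plus, n_minus) captured by each antecedent, same order.
    """
    n = sum(p + m for p, m in counts)
    cost = 0.0
    for a, (n_plus, n_minus) in zip(alpha_hat, counts):
        if a <= tau:
            cost += w * n_plus  # captured positives are predicted negative
        else:
            cost += n_minus     # captured negatives are predicted positive
    return cost / n
```

For example, with counts `[(8, 2), (3, 3), (1, 9)]`, estimates `[0.8, 0.5, 0.1]`, `tau = 0.5`, and `w = 1`, the three antecedents contribute 2, 3, and 1 weighted errors, respectively, out of 26 training inputs.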

We can formulate the problem of learning falling rule lists as a minimization program of the empirical risk of misclassification, given by Equation (4), with a regularization term that penalizes each rule in $d$ with a cost of $C$ to limit the number of rules, subject to the monotonicity constraint (2). For now, we focus on the problem of learning falling rule lists that are compatible with the training data $D$.

Let $L(d,D,\tau,w,C)=R(d,D,\tau,w)+C|d|$ and $L(e,D,\tau,w,C)=R(e,D,\tau,w)+C|e|$ be the regularized empirical risk of misclassification by the rule list $d$ and by the prefix $e$, respectively, on the training data $D$. The former defines the objective of the minimization program, and the latter gives the contribution by the prefix $e$ to $L(d,D,\tau,w,C)$ for any rule list $d$ that begins with $e$. The following theorem provides a motivation for setting the threshold $\tau$ to $1/(1+w)$ in the minimization program – the empirical risk of misclassification by a given rule list is minimized when $\tau$ is set in this way.

###### Theorem 2.8.

Given the training data $D$, a rule list $d$ that is compatible with $D$, and the weight $w$ for the positive class, we have $R(d,D,1/(1+w),w)\le R(d,D,\tau,w)$ for every threshold $\tau$.
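The intuition behind Theorem 2.8 can be sketched in a few lines; this is an informal argument, and the full proof is in the supplementary material. Writing $n^+_j$, $n^-_j$, and $\alpha_j$ as shorthand for $n^+_{j,d,D}$, $n^-_{j,d,D}$, and $\alpha^{(d,D)}_j$, and grouping the terms of Equation (4) by capturing antecedent, the risk of a compatible rule list is

$$R(d,D,\tau,w)=\frac{1}{n}\sum_{j=0}^{|d|}\left(w\,n^+_j\,\mathbb{1}[\alpha_j\le\tau]+n^-_j\,\mathbb{1}[\alpha_j>\tau]\right).$$

Each summand is at least $\min(w\,n^+_j,\,n^-_j)$, and for any antecedent that captures at least one training input,

$$n^-_j<w\,n^+_j\iff n^+_j+n^-_j<(1+w)\,n^+_j\iff\alpha_j>\frac{1}{1+w}.$$

Thresholding at $\tau=1/(1+w)$ therefore selects the cheaper of the two predictions for every antecedent simultaneously, which is why no other threshold can give a smaller risk.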

For reasons of computational tractability and model interpretability, we further restrict our attention to learning compatible falling rule lists whose antecedents must come from a pre-determined set of antecedents $A$. We now present the optimization program for learning falling rule lists, which forms the basis of the rest of this paper.

###### Program 2.9 (Learning compatible falling rule lists).

$$\min_{d\in\mathcal{D}(\mathcal{X},D)} L(d,D,1/(1+w),w,C)\quad\text{subject to}$$

$$\alpha^{(d,D)}_0\ge\alpha^{(d,D)}_1\ge\dots\ge\alpha^{(d,D)}_{|d|-1}\ge\alpha^{(d,D)}_{|d|}, \tag{6}$$

$$a^{(d)}_j\in A,\ \text{for all } j\in\{0,1,\dots,|d|-1\}. \tag{7}$$

The constraint (6) is exactly the monotonicity constraint (2) for the falling rule lists that are compatible with $D$. The constraint (7) limits the choice of antecedents. An instance of Program 2.9 is defined by the tuple $(D,A,w,C)$.
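To make the objective and the monotonicity constraint of Program 2.9 concrete, here is a small sketch that evaluates both for a compatible rule list. It assumes the list is summarized by the positive and negative counts captured by each antecedent (final else clause last), that every antecedent captures at least one training input, and that the objective is the risk at threshold $1/(1+w)$ plus $C$ per rule; the function names are hypothetical.

```python
def proportions(counts):
    """Empirical positive proportion of each antecedent (else clause last)."""
    return [p / (p + m) for p, m in counts]


def is_falling(counts):
    """Monotonicity constraint (6): proportions never increase down the list."""
    alpha = proportions(counts)
    return all(a >= b for a, b in zip(alpha, alpha[1:]))


def objective(counts, w, C):
    """Regularized risk at threshold 1/(1+w); the else clause is not a rule."""
    n = sum(p + m for p, m in counts)
    tau = 1.0 / (1.0 + w)
    risk = sum(w * p if p / (p + m) <= tau else m for p, m in counts) / n
    return risk + C * (len(counts) - 1)
```

A list whose counts fail `is_falling` is infeasible for Program 2.9 regardless of its objective value.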

## 3 Algorithm

In this section, we outline a Monte-Carlo search algorithm, Algorithm FRL, based on Program 2.9, for learning compatible falling rule lists from data. Given an instance of Program 2.9, the algorithm constructs a compatible falling rule list in each iteration, while keeping track of the falling rule list that has the smallest objective value among all the falling rule lists that the algorithm has constructed so far. At the end of iterations, the algorithm outputs the falling rule list that has the smallest objective value out of the lists it has constructed.

In the process of constructing a falling rule list , the algorithm chooses the antecedents successively, and uses various properties of Program 2.9, presented in Section 4, to prune the search space. In particular, when the algorithm is choosing the -th antecedent in , it considers only those antecedents satisfying the following conditions: (1) the inclusion of as the -th antecedent in gives rise to a rule that respects the monotonicity constraint and the necessary condition for optimality (Corollary 4.5), and (2) the inclusion of as the -th antecedent in gives rise to a prefix such that is feasible for Program 2.9 under the training data (Proposition 4.2), and the best possible objective value achievable by any falling rule list that begins with and is compatible with (Theorem 4.6) is less than the current best objective value . The algorithm terminates the construction of if Inequality (9) in Theorem 4.6 holds. The details of the algorithm can be found in the supplementary material.
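The outline above can be sketched as a short randomized search. This is a simplified illustration under several assumptions, not the authors' implementation: antecedents are Python predicates, the next antecedent is drawn uniformly from the candidate set rather than via the curiosity function of Appendix A, feasibility is checked as "the positive proportion after the new prefix does not exceed that of its last rule", the necessary condition for optimality is taken to be "proportion at least $1/(1+w)$", and `stop_prob` is a hypothetical early-stopping probability.

```python
import random


def search_frl(X, y, antecedents, w, C, iterations=100, stop_prob=0.2, seed=0):
    """Return (objective, chosen antecedent indices, positive proportions)."""
    rng = random.Random(seed)
    n, tau = len(y), 1.0 / (1.0 + w)

    def counts(idx):
        p = sum(1 for i in idx if y[i] == 1)
        return p, len(idx) - p

    P, N = counts(range(n))
    # Start from the trivial list that has only the final else clause.
    best = (min(w * P, N) / n, [], [P / n])

    for _ in range(iterations):
        rest, rules, alphas, L_e = list(range(n)), [], [], 0.0
        while True:
            rp, rm = counts(rest)
            if rules:
                # Inequality (9): one more rule cannot beat stopping here.
                slack = min(w * rp, rm) / n - (1 / alphas[-1] - 1) * rp / n
                if C >= slack or rng.random() < stop_prob:
                    break
            S = []  # candidate next rules that survive all pruning checks
            for l, a in enumerate(antecedents):
                if l in rules:
                    continue
                p, m = counts([i for i in rest if a(X[i])])
                left = [i for i in rest if not a(X[i])]
                lp, lm = counts(left)
                if p + m == 0 or lp + lm == 0:
                    continue
                alpha = p / (p + m)
                if alphas and alpha > alphas[-1]:
                    continue  # would break monotonicity
                if alpha < tau:
                    continue  # assumed necessary condition for optimality
                if lp / (lp + lm) > alpha:
                    continue  # new prefix would not be feasible
                L_new = L_e + C + (w * p if alpha <= tau else m) / n
                bound = L_new + min(
                    (1 / alpha - 1) * lp / n + C, w * lp / n, lm / n
                )  # prefix bound of Equation (8)
                if bound < best[0]:
                    S.append((l, alpha, left, L_new))
            if not S:
                break
            l, alpha, rest, L_e = rng.choice(S)
            rules.append(l)
            alphas.append(alpha)
        rp, rm = counts(rest)
        value = L_e + min(w * rp, rm) / n  # close the list with the else clause
        if value < best[0]:
            best = (value, rules, alphas + [rp / (rp + rm)])
    return best
```

Because the trivial list seeds the incumbent, the returned objective is never worse than that of the list containing only the final else clause, and the feasibility check keeps every constructed list falling.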

## 4 Prefix Bound

The goal of this section is to find a lower bound on the objective value of any compatible falling rule list that begins with a given compatible prefix, which we call a prefix bound, and to prove the various results used in the algorithm. To derive this prefix bound, we first introduce the concept of a feasible prefix, with which it is possible to construct a compatible falling rule list from data.

###### Definition 4.1.

Given the training data $D$ and the set of antecedents $A$, a prefix $e$ is feasible for Program 2.9 under the training data $D$ and the set of antecedents $A$ if $e$ is compatible with $D$, and there exists a falling rule list $d$ such that $d$ is compatible with $D$, the antecedents of $d$ come from $A$, and $d$ begins with $e$.

The following proposition gives necessary and sufficient conditions for a prefix to be feasible.

###### Proposition 4.2.

Given the training data , the set of antecedents , and a prefix that is compatible with and satisfies for all and for all , the following statements are equivalent: (1) is feasible for Program 2.9 under and ; (2) holds; (3) holds.

We now introduce the concept of a hypothetical rule list, whose antecedents do not need to come from the pre-determined set of antecedents .

###### Definition 4.3.

Given a pre-determined set of antecedents $A$, a hypothetical rule list with respect to $A$ is a rule list that contains an antecedent that is not in $A$.

We need the following lemma to prove the necessary condition for optimality (Corollary 4.5), and to derive a prefix bound (Theorem 4.6).

###### Lemma 4.4.

Suppose that we are given an instance $(D,A,w,C)$ of Program 2.9, a prefix $e$ that is feasible for Program 2.9 under $D$ and $A$, and a (possibly hypothetical) falling rule list $d$ that begins with $e$ and is compatible with $D$. Then there exists a falling rule list $d'$, possibly hypothetical with respect to $A$, such that $d'$ begins with $e$, has at most one more rule (excluding the final else clause) following $e$, is compatible with $D$, and satisfies

$$L(d',D,1/(1+w),w,C)\le L(d,D,1/(1+w),w,C).$$

As a special case, if either holds for all , or holds for all , then the falling rule list (i.e. the falling rule list in which the final else clause follows immediately the prefix , and the probability estimate of the final else clause is ) is compatible with and satisfies .

A consequence of the above lemma is that an optimal solution for a given instance of Program 2.9 should not have any antecedent whose empirical positive proportion falls below the threshold $1/(1+w)$.

###### Corollary 4.5.

If is an optimal solution for a given instance of Program 2.9, then we must have for all .

Another implication of Lemma 4.4 is that the objective value of any compatible falling rule list that begins with a given prefix $e$ cannot be less than a lower bound on the objective value of any compatible falling rule list that begins with the same prefix $e$, and has at most one more rule (excluding the final else clause) following $e$. This leads to the following theorem.

###### Theorem 4.6.

Suppose that we are given an instance $(D,A,w,C)$ of Program 2.9 and a prefix $e$ that is feasible for Program 2.9 under $D$ and $A$. Then any falling rule list $d$ that begins with $e$ and is compatible with $D$ satisfies

$$L(d,D,1/(1+w),w,C)\ge L^*(e,D,w,C),$$

where

$$L^*(e,D,w,C)=L(e,D,1/(1+w),w,C)+\min\left(\frac{1}{n}\left(\frac{1}{\alpha^{(e,D)}_{|e|-1}}-1\right)\tilde{n}^+_{e,D}+C,\ \frac{w}{n}\tilde{n}^+_{e,D},\ \frac{1}{n}\tilde{n}^-_{e,D}\right) \tag{8}$$

is a lower bound on the objective value of any compatible falling rule list that begins with $e$, under the instance $(D,A,w,C)$ of Program 2.9. We call $L^*(e,D,w,C)$ the prefix bound for $e$. Further, if

$$C\ge\min\left(\frac{w}{n}\tilde{n}^+_{e,D},\ \frac{1}{n}\tilde{n}^-_{e,D}\right)-\frac{1}{n}\left(\frac{1}{\alpha^{(e,D)}_{|e|-1}}-1\right)\tilde{n}^+_{e,D} \tag{9}$$

holds, then the compatible falling rule list in which the prefix $e$ is followed directly by the final else clause attains the prefix bound $L^*(e,D,w,C)$.
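The prefix bound of Equation (8) and the stopping test of Inequality (9) need only a handful of counts. The sketch below assumes a non-empty compatible prefix summarized by the positive and negative counts captured by each of its rules, together with the counts of the training inputs it leaves uncaptured, and assumes the last rule has a positive empirical positive proportion; the function names are hypothetical.

```python
def prefix_terms(prefix_counts, rest_plus, rest_minus, w, C):
    """Return L(e, ...) and the three candidate terms inside the min of (8)."""
    n = sum(p + m for p, m in prefix_counts) + rest_plus + rest_minus
    tau = 1.0 / (1.0 + w)
    L_e = C * len(prefix_counts)
    for p, m in prefix_counts:
        L_e += (w * p if p / (p + m) <= tau else m) / n
    last_p, last_m = prefix_counts[-1]
    alpha_last = last_p / (last_p + last_m)
    one_more_rule = (1.0 / alpha_last - 1.0) * rest_plus / n + C
    else_negative = w * rest_plus / n  # else clause predicted negative
    else_positive = rest_minus / n     # else clause predicted positive
    return L_e, one_more_rule, else_negative, else_positive


def prefix_bound(prefix_counts, rest_plus, rest_minus, w, C):
    """Lower bound (8) on any compatible falling list that starts with e."""
    L_e, one_more, neg, pos = prefix_terms(
        prefix_counts, rest_plus, rest_minus, w, C
    )
    return L_e + min(one_more, neg, pos)


def can_stop(prefix_counts, rest_plus, rest_minus, w, C):
    """Inequality (9): closing the prefix with the else clause is optimal."""
    _, one_more, neg, pos = prefix_terms(
        prefix_counts, rest_plus, rest_minus, w, C
    )
    return C >= min(neg, pos) - (one_more - C)
```

For instance, with a single rule capturing 8 positives and 2 negatives, 4 positives and 12 negatives left over, `w = 1`, and `C = 0.01`, the bound evaluates to `3/26 + 0.02`, and the stopping test fails because one more rule could still reduce the objective.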

The results presented in this section are used in Algorithm FRL to prune the search space. The proofs can be found in the supplementary material.

## 5 Softly Falling Rule Lists

Program 2.9 and Algorithm FRL have some limitations. Let us consider a toy example, where we have a training set of instances, with positive and negative instances. Suppose that we have an antecedent that is satisfied by positive and negative training instances. If were to be the first rule of a falling rule list that is compatible with , we would obtain a prefix . However, the empirical positive proportion after the prefix is . This violates (2) in Proposition 4.2, so is not a feasible prefix for Program 2.9 under the training data . In fact, if every antecedent in is satisfied by positive and negative instances in the training set , then the only possible compatible falling rule list we can learn using Algorithm FRL is the trivial falling rule list, which has only the final else clause. At the same time, if we consider the rule list , which is compatible with the given toy dataset but is not a falling rule list, we may notice that the two probability estimates in are quite close to each other – it is very likely that the difference between them is due to sampling variability in the dataset itself.

The two limitations of Program 2.9 and Algorithm FRL – the potential non-existence of a feasible non-trivial solution and the rigidity of using empirical positive proportions as probability estimates – motivate us to formulate a new optimization program for learning “softly” falling rule lists, where we remove the monotonicity constraint and instead introduce a penalty term in the objective function that penalizes violations of the monotonicity constraint (6) in Program 2.9. More formally, define a softly falling rule list as a rule list $d$ of Equation (1) with $\hat{\alpha}^{(d)}_j=\min_{k\le j}\alpha^{(d,D)}_k$. Note that any rule list $d$ that is compatible with the given training data $D$ can be turned into a softly falling rule list by setting $\hat{\alpha}^{(d)}_j=\min_{k\le j}\alpha^{(d,D)}_k$. Hence, we can learn a softly falling rule list by first learning a compatible rule list with the “softly falling objective” (denoted by $\tilde{L}$ below), and then transforming the rule list into a softly falling rule list. Let

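$$\tilde{L}(d,D,\tau,w,C,C_1)=L(d,D,\tau,w,C)+C_1\sum_{j=0}^{|d|}\left\lfloor\alpha^{(d,D)}_j-\min_{k<j}\alpha^{(d,D)}_k\right\rfloor_+$$

$$\text{and}\quad\tilde{L}(e,D,\tau,w,C,C_1)=L(e,D,\tau,w,C)+C_1\sum_{j=0}^{|e|-1}\left\lfloor\alpha^{(e,D)}_j-\min_{k<j}\alpha^{(e,D)}_k\right\rfloor_+$$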

be the regularized empirical risk of misclassification by a rule list $d$ and by a prefix $e$, respectively, with a penalty term that penalizes violations of monotonicity in the empirical positive proportions of the antecedents in $d$ and in $e$, respectively. We call $\tilde{L}$ the softly falling objective function, set the threshold $\tau=1/(1+w)$ as before, and obtain the following optimization program:
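The monotonicity penalty and the conversion into a softly falling rule list can both be computed with a single running minimum. The sketch below rests on two assumptions that should be checked against the original paper: the penalty for each antecedent is the positive part of its empirical positive proportion minus the smallest proportion above it (with no penalty for the first antecedent), and the softly falling estimate of each antecedent is the smallest proportion seen up to and including it. The function names are hypothetical.

```python
import math


def monotonicity_penalty(alpha):
    """Sum over j of the positive part of alpha[j] - min(alpha[:j]).

    The first antecedent has nothing above it and contributes nothing
    (an assumed convention for the empty minimum).
    """
    penalty, running_min = 0.0, math.inf
    for a in alpha:
        if running_min != math.inf:
            penalty += max(a - running_min, 0.0)
        running_min = min(running_min, a)
    return penalty


def soften(alpha):
    """Replace each proportion by the running minimum, giving falling estimates."""
    out, running_min = [], math.inf
    for a in alpha:
        running_min = min(running_min, a)
        out.append(running_min)
    return out
```

Multiplying `monotonicity_penalty` by the penalty weight and adding it to the regularized risk gives the softly falling objective under these assumptions; a list that is already falling has zero penalty and is unchanged by `soften`.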

###### Program 5.1 (Learning compatible rule lists with the softly falling objective).
$$\min_{d\in\mathcal{D}(\mathcal{X},D)}\tilde{L}(d,D,1/(1+w),w,C,C_1)\quad\text{subject to}\quad a^{(d)}_j\in A,\ \text{for all } j\in\{0,1,\dots,|d|-1\}.$$

An instance of Program 5.1 is defined by the tuple $(D,A,w,C,C_1)$. Similarly, we have a Monte-Carlo search algorithm, Algorithm softFRL, based on Program 5.1, for learning softly falling rule lists from data. Given an instance $(D,A,w,C,C_1)$ of Program 5.1, this algorithm searches through the space of rule lists that are compatible with $D$ and finds a compatible rule list whose antecedents come from $A$, and whose objective value is the smallest among all the rule lists that the algorithm explores. It then turns this compatible rule list into a softly falling rule list. In the search phase, the algorithm uses the following prefix bound (Theorem 5.2) to prune the search space of compatible rule lists. The details of Algorithm softFRL and the proof of Theorem 5.2 can be found in the supplementary material.

###### Theorem 5.2.

Suppose that we are given an instance $(D,A,w,C,C_1)$ of Program 5.1 and a prefix $e$ that is compatible with $D$. Then any rule list $d$ that begins with $e$ and is compatible with $D$ satisfies

$$\tilde{L}(d,D,1/(1+w),w,C,C_1)\ge\tilde{L}^*(e,D,w,C,C_1),$$

where

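$$\begin{aligned}\tilde{L}^*(e,D,w,C,C_1)=\tilde{L}(e,D,1/(1+w),w,C,C_1)+\min\Bigg(&\frac{1}{n}\left(\frac{1}{\alpha^{(e,D)}_{\min}}-1\right)\tilde{n}^+_{e,D}+C+C_1\left\lfloor\tilde{\alpha}_{e,D}-\alpha^{(e,D)}_{\min}\right\rfloor_+ +\frac{w}{n}\tilde{n}^+_{e,D}\,\mathbb{1}\left[\tilde{\alpha}_{e,D}\ge\alpha^{(e,D)}_{\min}\right],\\ &\inf_{\beta:\zeta<\beta\le 1}g(\beta),\\ &\frac{w}{n}\tilde{n}^+_{e,D}+C_1\left\lfloor\tilde{\alpha}_{e,D}-\alpha^{(e,D)}_{\min}\right\rfloor_+,\\ &\frac{1}{n}\tilde{n}^-_{e,D}+C_1\left\lfloor\tilde{\alpha}_{e,D}-\alpha^{(e,D)}_{\min}\right\rfloor_+\Bigg)\end{aligned} \tag{10}$$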

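is a lower bound on the objective value of any compatible rule list that begins with $e$, under the instance $(D,A,w,C,C_1)$ of Program 5.1. In Equation (10), $\alpha^{(e,D)}_{\min}$, $\zeta$, and $g$ are defined by

$$\alpha^{(e,D)}_{\min}=\min_{k<|e|}\alpha^{(e,D)}_k,$$

$$\zeta=\max\left(\alpha^{(e,D)}_{\min},\ \tilde{\alpha}_{e,D},\ 1/(1+w)\right),$$

$$g(\beta)=\frac{1}{n}\left(\frac{1}{\beta}-1\right)\tilde{n}^+_{e,D}+C+C_1\left(\beta-\alpha^{(e,D)}_{\min}\right).$$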

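Note that $\inf_{\beta:\zeta<\beta\le 1}g(\beta)$ can be computed analytically. Since $g$ is convex on $\beta>0$ with stationary point $\beta^*=\sqrt{\tilde{n}^+_{e,D}/(C_1 n)}$, we have $\inf_{\beta:\zeta<\beta\le 1}g(\beta)=g(\beta^*)$ if $\beta^*$ satisfies $\zeta<\beta^*\le 1$, and $\inf_{\beta:\zeta<\beta\le 1}g(\beta)=\min(g(\zeta),g(1))$ otherwise.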
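The closed form for the infimum of $g$ follows from setting the derivative $-\tilde{n}^+_{e,D}/(n\beta^2)+C_1$ to zero, which gives the stationary point $\sqrt{\tilde{n}^+_{e,D}/(C_1 n)}$; because $g$ is convex on positive $\beta$, the infimum over the interval is either at this point or at an endpoint. The sketch below implements that case analysis; the function name `inf_g` and its argument names are hypothetical, and the case of a zero penalty weight is handled separately as an added convention.

```python
import math


def inf_g(zeta, alpha_min, rest_plus, n, C, C1):
    """Infimum of g over zeta < beta <= 1 for the g defined after Equation (10).

    rest_plus is the number of positive training inputs not captured by
    the prefix; n is the training set size.
    """
    def g(beta):
        return (1.0 / beta - 1.0) * rest_plus / n + C + C1 * (beta - alpha_min)

    if C1 > 0:
        beta_star = math.sqrt(rest_plus / (C1 * n))  # stationary point of g
        if zeta < beta_star <= 1.0:
            return g(beta_star)
    # Otherwise g is monotone on the interval, so an endpoint gives the infimum.
    return min(g(zeta), g(1.0))
```

When the stationary point lies at or below `zeta`, the infimum equals `g(zeta)` but is not attained on the half-open interval, which is harmless for a lower bound.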

## 6 Experiments

In this section, we demonstrate our algorithms for learning falling rule lists using a real-world application – learning the conditions that are predictive of the success of a bank marketing effort, from previous bank marketing campaign data. We used the public bank-full dataset (Moro et al., 2011), which contains observations, with predictor variables that were discretized. We used the frequent pattern growth (FP-growth) algorithm (Han and Pei, 2000) to generate the set of antecedents from the dataset. For reasons of model interpretability and generalizability, we included in the antecedents that have at most predicates, and have at least support within the data that are labeled positive or within the data that are labeled negative. Besides the FP-growth algorithm, there is a vast literature on rule mining algorithms (e.g. Agrawal and Srikant, 1994; Han et al., 2000; Landwehr et al., 2005), and any of these can be used to produce antecedents for our algorithms.

The bank-full dataset is imbalanced – there are only positive instances out of observations. A trivial model that always predicts the negative outcome for a bank marketing campaign will achieve close to accuracy on this dataset, but it will not be useful for the bank to understand what makes a marketing campaign successful. Moreover, when predicting if a future campaign will be successful in finding a client, the bank cares more about “getting the positive right” than about “getting the negative right” – a false negative means a substantial loss in revenue, while a false positive incurs little more than some phone calls.

We compared our algorithms with other classification algorithms in a cost-sensitive setting, where a false negative and a false positive have different costs of misclassification. We generated five random splits into a training and a test set, where of the observations in the original bank-full dataset were placed into the training set. For each training-test split, and for each positive class weight , we learned from the training set: (1) a falling rule list , which is treated as a classifier , using Algorithm FRL with (which is small enough so that no training accuracy will be sacrificed for sparsity), (2) a softly falling rule list , which is treated as a classifier , using Algorithm softFRL with and , (3)–(5) three decision trees using cost-sensitive CART (Breiman et al., 1984), cost-sensitive C4.5 (Quinlan, 1993), and cost-sensitive C5.0 (Quinlan, 2004), respectively, (6) a random forest (Breiman, 2001) of decision trees trained with cost-sensitive CART, (7) a boosted tree classifier using AdaBoost (Freund and Schapire, 1996) on trees trained with cost-sensitive CART, and (8) a decision list using RIPPER (Cohen, 1995). We then computed the true positive rate and the false positive rate on the test set for each classifier. For each split and for each algorithm, we plotted a receiver operating characteristic (ROC) curve on the test set using different values of . Figure 4 shows the ROC curves for one of the training-test splits. The ROC curves for the other training-test splits can be found in the supplementary material. Note that since RIPPER is not a cost-sensitive algorithm, its ROC curve based on different values has only a single point. As we can see, the curves in Figure 4 lie close to each other. This demonstrates the effectiveness of our algorithms in producing falling rule lists that, when used as classifiers, are comparable with classifiers produced by other widely used classification algorithms, in a cost-sensitive setting. This is perhaps surprising, since our models are much more constrained than those produced by other classification methods.

We also plotted the number of antecedents considered by Algorithm FRL and Algorithm softFRL in the process of constructing a rule list at each iteration (Figures 4 and 4), when we applied the two algorithms to the entire dataset. Each curve in either plot corresponds to a rule list constructed in an iteration of the appropriate algorithm. The intensity of the curve is inversely proportional to the iteration number – the larger the iteration number, the lighter the curve is. The number of antecedents considered by Algorithm FRL stays below in all but a few early iterations (despite a choice of antecedents available), and the number considered by either algorithm generally decreases drastically in each iteration after three or four antecedents have been chosen. The curves generally become lighter as we move vertically down the plots, indicating that as we find better rule lists, there are fewer antecedents to consider at each level. Algorithm softFRL needs to consider more antecedents in general since the search space is less constrained. Together, these observations demonstrate that the prefix bounds we have derived for our algorithms are effective in excluding a large portion of the search space of rule lists. The supplementary material contains more rule lists created using our algorithms with different parameter values.

Since this paper was directly inspired by Wang and Rudin (2015), who proposed a Bayesian approach to learning falling rule lists, we conducted a set of experiments comparing their work to ours. We trained falling rule lists on the entire bank-full dataset using both the Bayesian approach and our optimization approach, and plotted the weighted training loss over real runtime for each positive class weight with the threshold set to (by Theorem 2.8, this is the threshold with the least weighted training loss for any given rule list). Since we want to focus our experiments on the efficiency of searching the model space, the runtimes recorded do not include the time for mining the antecedents. Note that the Bayesian approach is not cost-sensitive, and does not optimize the weighted training loss directly. However, in many real-life applications such as predicting the success of a future marketing campaign, it is desirable to minimize the expected weighted loss. Therefore, it is reasonable to compare the two approaches using the weighted training loss to demonstrate the advantages of our optimization approach. We compared the Bayesian approach only with Algorithm FRL, because both methods strictly enforce the monotonicity constraint on the positive proportions of the training data that are classified into each rule. Softly falling rule lists do not strictly enforce the monotonicity constraint, and are therefore not used for comparison. Figure 9 shows the plots of the weighted training loss over real runtime. Due to the random nature of both approaches, the experiments were repeated several times – more plots of the weighted training loss over real runtime for different trials of the same experiment, along with falling rule lists created using both approaches, can be found in the supplementary material. As shown in Figure 9, our optimization approach tends to find a falling rule list with a smaller weighted training loss faster than the Bayesian approach.
This is not too surprising because in our approach, the search space is made substantially smaller by the tight bounds presented here, whereas in the original Bayesian approach, there are no tight bounds on optimal solutions to restrict the search space – even if we constructed bounds for the original Bayesian approach, they would involve loose approximations to gamma functions.

## 7 Conclusion

We have proposed an optimization approach to learning falling rule lists and softly falling rule lists, along with Monte-Carlo search algorithms that use bounds on the optimal solution to prune the search space. A recent work by Angelino et al. (2017) on (non-falling) rule lists showed that it is possible to exhaustively optimize an objective over rule lists, indicating that the space of lists is not as large as one might think. Our search space is a dramatically constrained version of their search space, allowing us to reasonably believe that it can be searched exhaustively. Unfortunately, almost none of the logic of Angelino et al. (2017) can be used here. Indeed, introducing the falling constraint or the monotonicity penalty changes the nature of the problem, and the bounds in our work are entirely different. In addition, the algorithm of Angelino et al. (2017) is not cost-sensitive; supporting cost-sensitivity in this work added another level of complexity to the bounds.

Falling rule lists are optimized for ease-of-use – users only need to check a small number of conditions to determine whether an observation is in a high risk or high probability subgroup. As pointed out by Wang and Rudin (2015), the monotonicity in probabilities in falling rule lists allows doctors to identify the most at-risk patients easily. Typical decision tree methods (CART, C4.5, C5.0) do not have the added interpretability that comes from the falling constraint in falling rule lists: one may have to check many conditions in a decision tree to determine whether an observation is in a high risk or high probability subgroup – even if the decision tree has a small depth, it is possible that high risk subgroups are in different parts of the tree, so that one still has to check many conditions in order to find high risk subgroups. In this sense, falling rule lists and softly falling rule lists are as sparse as we need them to be, and they can provide valuable insight into data.

Supplementary Material and Code: The supplementary material and code are available at https://github.com/cfchen-duke/FRLOptimization.

#### Acknowledgments

This work was sponsored in part by MIT Lincoln Laboratory.

## Appendix A Algorithm FRL

In this section, we present Algorithm FRL in detail. Given an instance of Program 2.9, the algorithm searches through the space of falling rule lists that are compatible with and outputs a compatible falling rule list that respects the constraints of Program 2.9, and whose objective value is the smallest among all the falling rule lists that the algorithm explores. It does so by iterating over steps, in each of which the algorithm constructs a compatible falling rule list , while keeping track of the falling rule list that has the smallest objective value among all the falling rule lists that the algorithm has constructed so far. At the end of iterations, the algorithm outputs the falling rule list that has the smallest objective value out of the lists it has constructed.

In the process of constructing a falling rule list , the algorithm chooses the antecedents successively: first for the antecedent in the top rule, then for the antecedent in the next rule, and so forth. For each antecedent chosen, the algorithm also computes its empirical positive proportion . After rules have been constructed so that currently holds the prefix , the algorithm either: (1) terminates the construction of by computing the empirical positive proportion after , , and then adding to the final else clause with probability estimate , or (2) randomly picks an antecedent from a candidate set of possible next antecedents, computes its empirical positive proportion, and uses these as the next rule for .

The algorithm uses various properties of Program 2.9, which are presented in Section 4, to prune the search space. More specifically, the algorithm terminates the construction of the current list if Inequality (9) in Theorem 4.6 holds. Otherwise, it either terminates the construction with some probability or proceeds to construct a candidate set $S$ of possible next antecedents, as follows. For every antecedent $A_l$ that has not been chosen before, it constructs a candidate next rule from $A_l$ and its empirical positive proportion, computed using Definition 2.5. The algorithm then checks (i) whether the monotonicity constraint and the necessary condition for optimality (Corollary 4.5) are satisfied, (ii) whether the resulting prefix is feasible under Program 2.9, i.e. whether there exists a compatible falling rule list that begins with it (Proposition 4.2), and (iii) whether the best possible objective value achievable by any compatible falling rule list that begins with the resulting prefix (Theorem 4.6) is less than the current best objective value. If all of these conditions are satisfied, the algorithm adds $A_l$ to $S$. Once the construction of $S$ is complete, the algorithm randomly chooses an antecedent $A_l \in S$ with probability $P(A_l \mid S, e, D)$ and uses this antecedent, together with its empirical positive proportion, as the next rule. If $S$ is empty, the algorithm terminates the construction of the list.
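The filtering of candidate antecedents can be sketched as below. The three checks are passed in as callbacks because their exact forms come from Corollary 4.5, Proposition 4.2, and Theorem 4.6, which are not reproduced here. All names (`build_candidate_set`, `proportion_of`, `is_admissible`, `lower_bound`) are hypothetical, and the monotonicity test is assumed to be a non-strict comparison against the proportion of the last rule in the prefix.

```python
from typing import Callable, Iterable, List


def build_candidate_set(
    unused: Iterable[str],
    proportion_of: Callable[[str], float],
    last_proportion: float,
    is_admissible: Callable[[str, float], bool],
    lower_bound: Callable[[str, float], float],
    best_value: float,
) -> List[str]:
    """Return the antecedents that survive all pruning checks.

    is_admissible stands in for the optimality and feasibility conditions;
    lower_bound stands in for the best objective value achievable by any
    list that begins with the extended prefix.
    """
    candidates: List[str] = []
    for antecedent in unused:
        alpha = proportion_of(antecedent)
        if alpha > last_proportion:
            continue  # would violate the (assumed non-strict) falling constraint
        if not is_admissible(antecedent, alpha):
            continue  # fails the optimality or feasibility condition
        if lower_bound(antecedent, alpha) >= best_value:
            continue  # cannot improve on the best list found so far
        candidates.append(antecedent)
    return candidates
```

An empty return value corresponds to the case in which the construction of the current list terminates.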

In practice, we define the probability $P(A_l \mid S, e, D)$ for each $A_l \in S$ by first defining a curiosity function $f_{S,e,D}$ and then normalizing it:

$$P(A_l \mid S, e, D) = \frac{f_{S,e,D}(A_l)}{\sum_{A_{l'}} f_{S,e,D}(A_{l'})}.$$

A possible choice of the curiosity function for use in Algorithm FRL is given by

$$f_{S,e,D}(A_l) = \lambda\,\alpha(A_l, e, D) + (1-\lambda)\,\frac{n^{+}(A_l, e, D)}{\tilde{n}^{+}_{e,D}}, \tag{11}$$

where $\alpha(A_l, e, D)$ is the empirical positive proportion of $A_l$, and $n^{+}(A_l, e, D)$ is the number of positive training inputs captured by $A_l$, should $A_l$ be chosen as the next antecedent after the prefix $e$. The curiosity function given by (11) is a weighted sum of $\alpha(A_l, e, D)$ and $n^{+}(A_l, e, D)/\tilde{n}^{+}_{e,D}$ for each $A_l \in S$: the former encourages the algorithm to choose antecedents that have large empirical positive proportions, and the latter encourages the algorithm to choose antecedents that have large positive supports in the training data not captured by $e$. We used this curiosity function for Algorithm FRL in our experiments.
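As a small illustration of (11) and its normalization, the following sketch scores each candidate and samples one in proportion to its score. It assumes that the denominator in (11) is the number of positive training inputs not captured by the prefix, and that candidates are given as a mapping from a name to the pair (empirical positive proportion, positives captured). The function names and the uniform fallback when all scores are zero are assumptions of this sketch.

```python
import random
from typing import Dict, Tuple


def curiosity(alpha: float, n_pos: int, n_pos_remaining: int, lam: float) -> float:
    """Weighted sum of positive proportion and relative positive support, as in (11)."""
    support = n_pos / n_pos_remaining if n_pos_remaining > 0 else 0.0
    return lam * alpha + (1.0 - lam) * support


def selection_probabilities(
    candidates: Dict[str, Tuple[float, int]], n_pos_remaining: int, lam: float
) -> Dict[str, float]:
    """Normalize the curiosity scores into a probability distribution."""
    scores = {
        name: curiosity(alpha, n_pos, n_pos_remaining, lam)
        for name, (alpha, n_pos) in candidates.items()
    }
    total = sum(scores.values())
    if total == 0.0:  # fallback chosen for this sketch: uniform distribution
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


def sample_antecedent(probabilities: Dict[str, float], rng: random.Random) -> str:
    """Draw one antecedent name according to the given probabilities."""
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With `lam` close to 1 the sampling favors antecedents with high positive proportions, and with `lam` close to 0 it favors antecedents that capture many of the remaining positive examples.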

The pseudocode of Algorithm FRL is shown in Algorithm 10.

## Appendix B Algorithm softFRL

In this section, we present Algorithm softFRL in detail. Given an instance of Program 5.1, the algorithm searches through the space of rule lists that are compatible with the training data $D$ and finds a compatible rule list whose antecedents come from the given set of antecedents and whose objective value is the smallest among all the rule lists that the algorithm explores. It does so over a prescribed number of iterations. In each iteration, the algorithm constructs one compatible rule list, while keeping track of the list that has the smallest objective value among all those constructed so far. After the last iteration, the algorithm transforms this best list into a falling rule list by adjusting its probability estimates.
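The exact transformation rule is not legible in the text above. One natural way to turn an arbitrary rule list into a falling one, assumed here purely for illustration, is to replace each probability estimate by the running minimum of the estimates up to and including that rule. The function name `to_falling` and the list-of-pairs representation are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def to_falling(rule_list: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Replace each estimate by the running minimum so estimates never increase.

    Assumption of this sketch: the final else clause is the last entry of
    the list and is treated like any other rule.
    """
    antecedents = [antecedent for antecedent, _ in rule_list]
    estimates = [estimate for _, estimate in rule_list]
    falling = accumulate(estimates, min)  # running minimum down the list
    return list(zip(antecedents, falling))
```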

In constructing a rule list, the algorithm chooses the antecedents successively: first the antecedent in the top rule, then the antecedent in the next rule, and so forth. For each antecedent chosen, the algorithm also computes its empirical positive proportion. Once some rules have been constructed, so that the list under construction currently holds the prefix $e$, the algorithm either (1) terminates the construction by computing the empirical positive proportion after $e$ and adding the final else clause with the resulting probability estimate, or (2) randomly picks an antecedent from a candidate set $S$ of possible next antecedents, computes its empirical positive proportion, and uses the two together as the next rule.

The algorithm uses Theorem 5.2 to prune the search space. More specifically, the algorithm terminates the construction of the current list if the quantity defined by Equation (10) in Theorem 5.2 is equal to the objective value of the compatible rule list in which the prefix $e$ is followed directly by the final else clause. This equality implies that the latter is an optimal compatible rule list that begins with $e$. Otherwise, the algorithm either terminates the construction with some probability or proceeds to construct a candidate set $S$ of possible next antecedents, as follows. For every antecedent $A_l$ that has not been chosen before, it constructs a candidate next rule from $A_l$ and its empirical positive proportion, computed using Definition 2.5. The algorithm then checks whether the best possible objective value achievable by any compatible rule list that begins with the resulting prefix (Theorem 5.2) is less than the current best objective value. If so, the algorithm adds $A_l$ to $S$. Once the construction of $S$ is complete, the algorithm randomly chooses an antecedent $A_l \in S$ with probability $P(A_l \mid S, e, D)$ and uses this antecedent, together with its empirical positive proportion, as the next rule. If $S$ is empty, the algorithm terminates the construction of the list.

In practice, we define the probability