Abstract
We propose novel methods for the max-cost Discrete Function Evaluation Problem (DFEP) under budget constraints. We are motivated by applications such as clinical diagnosis, where a patient is subjected to a sequence of (possibly expensive) tests before a decision is made. Our goal is to develop strategies for minimizing max-costs. The problem is known to be NP-hard, and greedy methods based on specialized impurity functions have been proposed. We develop a broad class of admissible impurity functions that admit monomials, classes of polynomials, and hinge-loss functions, allowing for flexible impurity design with provably optimal approximation bounds. This flexibility is important for datasets where max-cost can be overly sensitive to "outliers." Outliers bias max-cost towards a few examples that require a large number of tests for classification. We design admissible functions that allow for an accuracy-cost tradeoff and result in guarantees relative to the optimal cost among trees with corresponding classification accuracy levels.
Max-Cost Discrete Function Evaluation Problem under a Budget
Feng Nan (Boston University, fnan@bu.edu), Joseph Wang (Boston University, joewang@bu.edu), Venkatesh Saligrama (Boston University, srv@bu.edu)
1 Introduction
In many applications such as clinical diagnosis, monitoring, and web search, a patient, entity or query is subjected to a sequence of tests before a decision or prediction is made. Tests can be expensive and often complementary, namely, the outcome of one test may render another redundant. The goal in these scenarios is to minimize total test costs with negligible loss in diagnostic performance.
We propose to formulate this problem as an instance of the Discrete Function Evaluation Problem (DFEP). Under this framework, we seek to learn a decision tree that correctly classifies data while minimizing the cost of testing. We then propose methods to trade off accuracy against cost.
An instance of the problem is defined as $I = (S, C, T, \mathbf{c})$. Here $S = \{s_1, \dots, s_n\}$ is the set of objects; $C = \{C_1, \dots, C_m\}$ is a partition of $S$ into classes; $T$ is a set of tests; $\mathbf{c}$ is a cost function that assigns a cost $c(t)$ to each test $t \in T$. Applying test $t$ to object $s$ outputs a discrete value $t(s)$ in a finite set of possible outcomes $\{1, \dots, l_t\}$. $T$ is assumed to be complete in the sense that for any distinct $s_i, s_j \in S$ there exists a $t \in T$ such that $t(s_i) \neq t(s_j)$, so they can be distinguished by $t$. Given an instance of the DFEP, the goal is to build a testing procedure that uses tests in $T$ to determine the class of an unknown object. Formally, any testing procedure can be represented by a decision tree, where every internal node is associated with a test and objects are directed from the root to the corresponding leaves based on the test outcomes at each node. Given instance $I$ and decision tree $D$, the testing cost of $s \in S$, denoted as $\mathrm{cost}(D, s)$, is the sum of all costs incurred along the root-to-leaf path in $D$ traced by $s$. We define the total cost as
$$\mathrm{Cost}(D) = \max_{s \in S} \mathrm{cost}(D, s).$$
This is known as the max-cost testing problem in the DFEP literature and has independently received significant attention [Cicalese et al., 2014, Saettler et al., 2014, Moshkov, 2010, Bellala et al., 2012] because, in real-world problems, the prior probability used to compute the expected testing cost is either unavailable or inaccurate. Another motivation stems from time-critical applications, such as emergency response [Bellala et al., 2012], where violation of a time constraint may lead to unacceptable consequences.
In this paper we propose novel approaches and themes for the max-cost DFEP. It is now well known [Cicalese et al., 2014] that $O(\log n)$ is the best approximation factor for DFEP unless $P = NP$. Greedy methods that achieve an $O(\log n)$ approximation guarantee have been proposed [Cicalese et al., 2014, Saettler et al., 2014, Moshkov, 2010]. These methods often rely on judiciously engineering so-called impurity functions that are surprisingly effective in realizing "optimal" guarantees. The authors of [Cicalese et al., 2014, Moshkov, 2010, Saettler et al., 2014] describe impurity functions based on the notion of Pairs, while the authors of [Bellala et al., 2012] describe more complex impurity functions but require distributional assumptions.
In contrast, we propose a broad class of admissible functions such that any function from this class can be chosen as an impurity function with an $O(\log n)$ approximation guarantee. Our admissible functions are in essence positive, monotone supermodular functions and admit not only Pairs, monomials, and classes of polynomials, but also hinge-loss functions.
We propose new directions for the max-cost DFEP. In contrast to the current emphasis on correct classification, we propose to deliberately trade off cost against accuracy. This perspective can be justified under various scenarios. First, max-cost is overly sensitive to "outliers," namely, a few instances that require prohibitively many tests for correct classification. In these situations max-cost is not representative of most of the data and is biased towards a small subset of objects. Consequently, censoring those few "outliers" is meaningful from the perspective that max-cost then applies to all but a few examples. Second, many applications have hard cost constraints that supersede correct classification of the entire data set, and the goal is a tree that guarantees these cost constraints while minimizing errors.
Our proposed admissible functions are sufficiently general and allow for trading accuracy for cost. In particular, we develop methods with guarantees relative to the optimal cost among trees with a corresponding classification accuracy level. Moreover, we show empirically on a number of examples that the selection of impurity functions plays an important role in this tradeoff. In particular, some admissible functions, such as hinge-loss, are particularly well suited for low budgets, while others are preferable in high-budget scenarios.
Apart from the related approaches already described above, our work is also related to those that generally deal with expected costs [Golovin and Krause, 2011, Golovin et al., 2010, Bellala et al., 2012] or related problems such as the submodular set coverage problem [Guillory and Bilmes, 2010]. At a conceptual level, the main difference in [Guillory and Bilmes, 2010, Golovin and Krause, 2011, Golovin et al., 2010, Bellala et al., 2012] is in the way tests are chosen. Unlike our approach, these methods employ utility functions in the policy space that act on a sequence of observations. [Golovin and Krause, 2011] develops the notion of adaptive submodularity and applies it to automated diagnosis. The proposed adaptive greedy algorithm can handle multiple classes and test outcomes as well as arbitrary test costs, but the approximation factor for the max-cost depends on the prior probability and can be very large in adversarial situations. A popular class of related approximation algorithms is generalized binary search (GBS) [Dasgupta, 2004, Kosaraju et al., 1999, Nowak, 2008]. A special case of this problem, in which each object belongs to a distinct class, is known as the object identification problem [Chakaravarthy et al., 2011] or pool-based active learning [Dasgupta, 2004]. When tests are restricted to binary outcomes and uniform test costs, an $O(\log(1/p_{\min}))$ approximation, where $p_{\min}$ is the minimum probability of any single object, can be obtained [Dasgupta, 2004]. Alternatively, [Gupta et al., 2010] provides an algorithm that leads to an $O(\log n)$ approximation factor for the optimal expected cost with arbitrary test costs and binary test outcomes. With respect to the max-cost, [Hanneke, 2006] gave an approximation algorithm for multiway tests and arbitrary test costs.
Organization:
We present a greedy algorithm in Section 2 and show that, under general assumptions on the impurity function, it leads to an $O(\log n)$ approximation of the optimal tree. We examine the assumptions on impurity functions and use them to define a class of admissible impurity functions in Section 3. Following this, we generalize from the error-free case to a tradeoff between max-cost and error in Section 4. Finally, we demonstrate the performance of the greedy algorithm on real-world data sets in Section 5 and show the advantage of different impurity functions along with the tradeoff between error and max-cost.
2 Greedy Algorithm and Analysis
In this section, we present an analysis of the greedy algorithm GreedyTree. We first show that GreedyTree yields a tree whose max-cost is within $O(\log n)$ of the optimal max-cost for any DFEP. This bound on max-cost holds for any impurity function that satisfies very general criteria, as opposed to a fixed impurity function. In Section 3 we examine the assumptions on the impurity functions and present multiple examples of impurity functions for which this approximation bound holds.
Before beginning the analysis, we first define the following terms: for a given impurity function , is the impurity function on the set of objects ; is the family of decision trees with for any of its leaf ; is the minimum maxcost among all trees in for the given input set of objects ; is the maxcost of the tree constructed by GreedyTree based on impurity function .
For simplicity, we assume that the impurity function takes integer values and that test costs are outcome-independent. Note that restricting to integer-valued impurity functions is not a limitation because of the discrete (finite) nature of the problem: one can always scale any rational-valued impurity function to make it integer-valued. Similarly, it can easily be shown that our result extends to the outcome-dependent cost setting considered in [Saettler et al., 2014] as well.
Given a DFEP, GreedyTree greedily chooses the test with the largest worst-case impurity reduction until all leaves are pure, i.e., until the impurity of every leaf equals zero. Let be the first test selected by GreedyTree. By definition of the max-cost,
where is the set of objects in that has outcome for test . Let be such that . We first provide a lemma to lower bound the optimal cost, which will later be used to prove a bound on the cost of the tree.
Lemma 2.1.
Let be monotone and supermodular, and is the first test chosen by GreedyTree on the set of objects , then
Proof.
Let be a tree with optimal maxcost. Let be an arbitrarily chosen internal node in , let be the test associated with and let be the set of objects associated with the leaves of the subtree rooted at . Let be such that is maximized and be such that is maximized. We then have:
(1) 
The first inequality follows from the definition of . The second inequality follows from the greedy choice at the root. To show the last inequality, we have to show . This follows from the fact that and and therefore , where the first inequality follows from monotonicity and the second follows from the definition of supermodularity.
For a node , let be the set of objects associated with the leaves of the subtree rooted at . Let be a root-to-leaf path on as follows: is the root of the tree, and for each the node is a child of associated with the branch of that maximizes , where is the test associated with . It follows from (1) that
(2) 
Since the cost of the path from to is no larger than the max-cost of the tree , we have that
∎
Using Lemma 2.1, we can now state the main theorem of this section which bounds the cost of the greedily constructed tree.
Theorem 2.2.
GreedyTree constructs a decision tree achieving factor approximation of the optimal maxcost in on the set of objects if is nonnegative, monotone, supermodular with .
Proof.
(3)  
(4)  
(5)  
(6)  
(7) 
The inequality in (4) follows from the fact that . (5) follows from Lemma 2.1. The first term in (6) follows from the inequality for and the second term follows from the induction hypothesis that for each , . If for some set of objects , we define .
We can verify the base case of the induction as follows. If , which is the smallest nonzero impurity of on subsets of objects , we claim that the optimal decision tree chooses the test with the smallest cost among those that can reduce the impurity function :
Suppose otherwise: the optimal tree first chooses a test with a child node such that and later chooses another test such that all the child nodes of by have zero impurity. Then could have been chosen in the first place to reduce all child nodes of to zero impurity, by supermodularity of , and therefore this cannot be the optimal ordering of tests. On the other hand, in GreedyTree for those tests that cannot reduce impurity and for those tests that can. So the algorithm picks, among the tests that can reduce impurity, the one with the smallest cost. Thus, we have shown that for the base case. ∎
Given that , the optimal order approximation for the DFEP problem is , which is achieved by GreedyTree. This approximation is not dependent on a particular impurity function, but instead holds for any function which satisfies the assumptions. In Section 3, we define a family of impurity functions that satisfy these assumptions.
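The greedy rule analyzed in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `pairs`, `split`, `greedy_tree`, and `max_cost` are hypothetical, objects are represented as (test outcomes, label) tuples, the impurity is the number of object pairs with different labels, and the score of a test is assumed to be its worst-case impurity reduction divided by its cost.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Assumed representation: outcomes[t] is the result of test t on the object.
Obj = Tuple[Tuple[int, ...], Hashable]


def pairs(objs: Sequence[Obj]) -> int:
    """Impurity: number of object pairs that carry different labels."""
    counts: Dict[Hashable, int] = {}
    for _, label in objs:
        counts[label] = counts.get(label, 0) + 1
    n = len(objs)
    return (n * n - sum(c * c for c in counts.values())) // 2


def split(objs: Sequence[Obj], t: int) -> Dict[int, List[Obj]]:
    """Group objects by their outcome on test t."""
    groups: Dict[int, List[Obj]] = {}
    for obj in objs:
        groups.setdefault(obj[0][t], []).append(obj)
    return groups


def greedy_tree(objs: Sequence[Obj], costs: Sequence[float], impurity=pairs) -> dict:
    """Recursively pick the test with the best worst-case reduction per unit cost."""
    f_here = impurity(objs)
    if f_here == 0:  # pure leaf: stop
        return {"label": objs[0][1] if objs else None}
    best_t, best_score = None, 0.0
    for t, cost in enumerate(costs):
        worst = max(impurity(g) for g in split(objs, t).values())
        score = (f_here - worst) / cost  # assumption: cost-normalized score
        if score > best_score:
            best_t, best_score = t, score
    if best_t is None:
        raise ValueError("no test reduces impurity; the test set is not complete")
    children = {v: greedy_tree(g, costs, impurity) for v, g in split(objs, best_t).items()}
    return {"test": best_t, "children": children}


def max_cost(tree: dict, costs: Sequence[float]) -> float:
    """Largest total test cost along any root-to-leaf path."""
    if "test" not in tree:
        return 0.0
    return costs[tree["test"]] + max(max_cost(c, costs) for c in tree["children"].values())


toy = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "B"), ((1, 1), "B")]
print(max_cost(greedy_tree(toy, [1.0, 1.0]), [1.0, 1.0]))  # 1.0
print(max_cost(greedy_tree(toy, [5.0, 1.0]), [5.0, 1.0]))  # 6.0
```

In the second call the cheap but uninformative test is chosen first, so the greedy tree costs 6 while testing the expensive feature alone would cost 5. This illustrates that the guarantee is an approximation bound rather than exact optimality.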
3 Admissible Functions
A fundamental element of constructing decision trees is the impurity function, which measures the disagreement of labels between a set of objects. Many impurity functions have been proposed for constructing decision trees, and the choice of impurity function can have a significant impact on the performance of the tree. In this section we examine the assumptions placed on the impurity function by Lemma 2.1 and Theorem 2.2 which we use to define a class of functions we call admissible impurity functions and provide examples of admissible impurity functions.

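A function $F$ of a set of objects is admissible if it satisfies the following five properties: (1) Nonnegativity: $F(G) \geq 0$ for any set of objects $G$; (2) Purity: $F(G) = 0$ if $G$ consists of objects of the same class; (3) Monotonicity: $F(G) \geq F(R)$ for any $R \subseteq G$; (4) Supermodularity: $F(G \cup \{j\}) - F(G) \geq F(R \cup \{j\}) - F(R)$ for any $R \subseteq G$ and object $j \notin G$; (5) $\log F(S) = O(\log n)$.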
A wide range of functions falls into the class of admissible impurity functions. We propose a general family of polynomial functions which we show is admissible. Given a set of objects $G$, $n_G^i$ denotes the number of objects in $G$ that belong to class $i$.
Lemma 3.1.
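Suppose there are $k$ classes in $G$. Any polynomial function of $n_G^1, \dots, n_G^k$ with nonnegative terms such that $n_G^1, \dots, n_G^k$ do not appear as singleton terms is admissible. Formally, if
$$F(G) = \sum_{i} \gamma_i \prod_{j=1}^{k} \left(n_G^j\right)^{p_{ij}}, \qquad (8)$$
where the $\gamma_i$'s are nonnegative, the $p_{ij}$'s are nonnegative integers, and for each $i$ there exist at least 2 nonzero $p_{ij}$'s, then $F$ is admissible.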
Proof.
Properties (1),(2),(3) and (5) are obviously true. To show is supermodular, suppose and object and belongs to class , we have
where the first summation index set is the set of terms that involve . The inequality follows because can be expanded so the negative term can be canceled, leaving a sumofproducts form for , which is termbyterm dominated by that of . ∎
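Lemma 3.1 can be sanity-checked numerically on a small instance. The sketch below brute-forces nonnegativity, purity, monotonicity, and supermodularity over every pair of nested subsets of a six-object set. It is an illustration under assumptions, not part of the proof: `poly_impurity` and `is_admissible_on` are hypothetical names, a polynomial is encoded as a list of (coefficient, exponent-per-class) terms, and supermodularity is taken in the standard increasing-marginal-gain form.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

Term = Tuple[float, Tuple[int, ...]]  # (coefficient, exponent for each class count)


def poly_impurity(counts: Sequence[int], terms: Sequence[Term]) -> float:
    """Evaluate sum_i coef_i * prod_j counts[j] ** exps_i[j]."""
    return sum(coef * prod(n ** p for n, p in zip(counts, exps)) for coef, exps in terms)


def is_admissible_on(labels: Sequence[int], num_classes: int, terms: Sequence[Term]) -> bool:
    """Brute-force properties (1)-(4) over all subsets R of G of the labeled set."""
    n = len(labels)
    cache: Dict[int, float] = {}

    def f(mask: int) -> float:
        if mask not in cache:
            counts = [0] * num_classes
            for i in range(n):
                if (mask >> i) & 1:
                    counts[labels[i]] += 1
            cache[mask] = poly_impurity(counts, terms)
        return cache[mask]

    def is_pure(mask: int) -> bool:
        return len({labels[i] for i in range(n) if (mask >> i) & 1}) <= 1

    for g in range(1 << n):
        if f(g) < 0 or (is_pure(g) and f(g) != 0):
            return False  # nonnegativity or purity violated
        r = g
        while True:  # enumerate every submask r of g
            if f(r) > f(g):
                return False  # monotonicity violated
            for j in range(n):
                bit = 1 << j
                if not (g & bit) and (f(g | bit) - f(g)) < (f(r | bit) - f(r)):
                    return False  # supermodularity violated
            if r == 0:
                break
            r = (r - 1) & g
    return True


labels = [0, 0, 1, 1, 2, 2]
pairs_terms: List[Term] = [(1, (1, 1, 0)), (1, (1, 0, 1)), (1, (0, 1, 1))]
cubic_terms: List[Term] = [(2, (2, 1, 0)), (1, (1, 1, 1))]
singleton_terms: List[Term] = [(1, (1, 0, 0))]  # a count appears alone

print(is_admissible_on(labels, 3, pairs_terms))      # True
print(is_admissible_on(labels, 3, cubic_terms))      # True
print(is_admissible_on(labels, 3, singleton_terms))  # False (purity fails)
```

The singleton polynomial fails because a set containing only class-0 objects is pure yet has nonzero value, which is why the lemma excludes singleton terms.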
A special case of a polynomial impurity function is the previously proposed Pairs function [Saettler et al., 2014, Cicalese et al., 2014, Moshkov, 2010]. Two objects are defined as a pair if they are of different classes, with the Pairs function equal to the total number of pairs in the set $G$:
$$P(G) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_G^i \, n_G^j,$$
where $k$ is the number of distinct classes in set $G$.
Corollary 3.2.
The Pairs impurity function is admissible.
As a corollary of Theorem 2.2 and Corollary 3.2, we see that the approximation for Pairs and outcome-dependent cost holds for multiple test outcomes as well, extending the binary outcome setting shown in [Saettler et al., 2014].
Another family of admissible impurity functions is the Powers function.
Corollary 3.3.
Powers function
(9) is admissible for .
Note that Pairs can be viewed as a special case of the Powers function when . An important property of the Powers impurity functions is that, for any power , the function is zero only if the objects in the set all belong to the same class. As a result, using any of these Powers impurity functions in GreedyTree results in an error-free tree with near-optimal cost.
Another interesting admissible impurity used in Section 4 is the hingedPairs function defined:
(10) where . This function differs from the Powers impurity function due to the fact that for a , the function need not imply that all objects in belong to the same class. In the next section, we will discuss how this allows for trees to be constructed incorporating classification error. We include the proof of the following lemma in the Appendix.
Lemma 3.4.
In the multiclass setting, is admissible.
Impurity Function Selection: While all admissible impurity functions enjoy the same approximation guarantee on the optimal max-cost, they lead to different trees depending on the problem. To illustrate this point, consider the toy example in Figure 1. A set has 30 objects in Class 1 (circles) and 30 objects in Class 2 (triangles). Two tests $t_1$ and $t_2$ are available to the algorithm. Test $t_1$ separates 20 objects of Class 2 from the rest of the objects, while $t_2$ evenly divides the objects into halves with an equal number of objects from Class 1 and Class 2 in either half. Intuitively, $t_2$ is not a useful test from a classification point of view because it does not separate objects based on class at all. This is reflected in the right plot of Figure 1: choosing $t_2$ increases cost but does not reduce classification error, while choosing $t_1$ reduces the error from $1/2$ to $1/6$. If the impurity function chosen is the Pairs function, test $t_2$ will be chosen because Pairs is biased towards tests with balanced test outcomes. In contrast, the hinged-Pairs function leads to test $t_1$, and therefore may be preferable in this case (for more details on this example see the Appendix). Although both impurity functions are admissible and return trees with near-optimal guarantees, empirical performance can differ greatly and is strongly dependent on the structure of the data. In practice, we find that choosing the tree with the lowest classification error across a variety of impurity functions yields improved performance compared to a single impurity function strategy.
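The Pairs side of this toy example can be reproduced with a few lines of arithmetic. The sketch below uses only the class counts stated above; the names `pairs`, `worst_case_reduction`, and `majority_error` are hypothetical, and majority-vote labeling of leaves is an assumption made to compute the error. The hinged-Pairs side is not reproduced here.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

Counts = Tuple[int, int]  # (number of Class 1 objects, number of Class 2 objects)


def pairs(counts: Sequence[int]) -> int:
    """Number of object pairs drawn from different classes."""
    n = sum(counts)
    return (n * n - sum(c * c for c in counts)) // 2


def worst_case_reduction(root: Counts, children: List[Counts]) -> int:
    """Pairs impurity removed in the worst child produced by a test."""
    return pairs(root) - max(pairs(c) for c in children)


def majority_error(children: List[Counts]) -> Fraction:
    """Error when each leaf predicts its majority class (an assumption)."""
    total = sum(sum(c) for c in children)
    wrong = sum(sum(c) - max(c) for c in children)
    return Fraction(wrong, total)


ROOT: Counts = (30, 30)
UNBALANCED: List[Counts] = [(30, 10), (0, 20)]  # test that peels off 20 Class 2 objects
BALANCED: List[Counts] = [(15, 15), (15, 15)]   # test that halves the set evenly

print(pairs(ROOT))                             # 900
print(worst_case_reduction(ROOT, UNBALANCED))  # 600
print(worst_case_reduction(ROOT, BALANCED))    # 675
print(majority_error(UNBALANCED))              # 1/6
print(majority_error(BALANCED))                # 1/2
```

With equal test costs, the balanced test has the larger worst-case Pairs reduction (675 versus 600) even though it leaves the classification error at 1/2, which is the bias discussed above.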
4 Tradeoff Bounds
Up to this point, we have focused on constructing error-free trees. Unfortunately, the max-cost criterion is highly sensitive to outliers, and therefore often yields trees with unnecessarily large maximum depth to accommodate a small subset of outliers in the data set. Refer to the synthetic experiment in Section 5 for such an example. To overcome the sensitivity to outliers, we present an approach to constructing near-optimal trees with nonzero error rates.
Early stopping:
Instead of requiring all leaves to have zero impurity () in Algorithm 1, we can stop the recursion as soon as all leaves have impurity below a threshold (). This allows a tradeoff between error and cost. Let denote the set of trees with for all leaves and let denote the optimal max-cost among all trees in .
Similar to the error-free setting, the approximation of the optimal max-cost still holds for early stopping, as shown next. The proofs of Lemma 4.1 and Theorem 4.2 are similar to those of Lemma 2.1 and Theorem 2.2, and we include them in the Appendix.
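The effect of early stopping can be seen on a tiny example with one outlier. The sketch below is illustrative only: `greedy_max_cost` is a hypothetical name, all tests are assumed to have unit cost, the impurity is the number of differently labeled object pairs, and leaves are assumed to predict their majority label. The data set has 16 objects described by 4 binary tests; the label equals the first test outcome except for one flipped object.

```python
from itertools import product
from typing import Dict, Hashable, List, Sequence, Tuple

Obj = Tuple[Tuple[int, ...], Hashable]  # (test outcomes, label)


def pairs(labels: Sequence[Hashable]) -> int:
    """Number of object pairs with different labels."""
    counts: Dict[Hashable, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return (n * n - sum(c * c for c in counts.values())) // 2


def greedy_max_cost(objs: Sequence[Obj], num_tests: int, threshold: int = 0) -> Tuple[int, int]:
    """Return (max number of tests on any path, misclassified objects)."""
    labels = [y for _, y in objs]
    f_here = pairs(labels)
    if f_here <= threshold:  # early stopping: leaf is pure enough
        errors = len(labels) - max(labels.count(y) for y in set(labels))
        return 0, errors
    best_gain, best_groups = 0, None
    for t in range(num_tests):
        groups: Dict[int, List[Obj]] = {}
        for obj in objs:
            groups.setdefault(obj[0][t], []).append(obj)
        worst = max(pairs([y for _, y in g]) for g in groups.values())
        if f_here - worst > best_gain:  # largest worst-case reduction
            best_gain, best_groups = f_here - worst, groups
    if best_groups is None:
        raise ValueError("no test reduces impurity; the test set is not complete")
    results = [greedy_max_cost(g, num_tests, threshold) for g in best_groups.values()]
    return 1 + max(d for d, _ in results), sum(e for _, e in results)


# Label is the first bit, except for the single outlier (1, 1, 1, 1).
data = [(bits, 0 if bits == (1, 1, 1, 1) else bits[0]) for bits in product((0, 1), repeat=4)]

print(greedy_max_cost(data, 4, threshold=0))  # (4, 0)
print(greedy_max_cost(data, 4, threshold=7))  # (1, 1)
```

Tolerating a leaf impurity of 7 cuts the max-cost from 4 tests to 1 at the price of a single misclassified object.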
Lemma 4.1.
Let be an admissible function and is the first test chosen by GreedyTree on the set of objects , then
Theorem 4.2.
GreedyTree constructs a decision tree achieving factor approximation of the optimal maxcost in on the set of objects if is admissible.
Hinged-Pairs:
Similar to early stopping, we can also use the hinged-Pairs function (10) with in GreedyTree to allow an error-cost tradeoff. We first establish an error upper bound for trees in .
Lemma 4.3.
For a multiclass input set with classes, the classification error of any tree in with leaves is bounded by , where we set .
Proof.
Suppose is the largest class in leaf . For , if , we have , which implies . So
If , we have . So for any leaf we have . The overall error bound thus follows. ∎
Often in practice a tree may contain a relatively large number of leaves but only a small fraction of them contain most of the objects. A more refined upper bound on the error is given by the following lemma, which we prove in the Appendix.
Lemma 4.4.
Consider a multiclass input set with classes and . For any tree with leaves, given any , let be the smallest integer such that the largest leaves of have more than of the total number of objects . Then the classification error is bounded by .
Denote as the class of trees with classification error less than or equal to on the set of input . We can further derive a useful relation between and .
Lemma 4.5.
For any multiclass input set with classes, .
Proof.
To show , for any tree , we have , where is the number of leaves and is the number of objects in leaf that are not from the majority class: . This implies for all leaves of . Suppose is the class with most number of objects in leaf : . It is not hard to see for any class
which implies . Thus we have . Thus . follows from Lemma 4.3. ∎
The main theorem of this section is the following.
Theorem 4.6.
In multiclass classification with classes, if is the decision tree returned by GreedyTree using hingedPairs (setting ) applied on the set of objects, then we have the following:
Proof.
The above theorem states that for a given error parameter , a greedy tree can be constructed using hinged-Pairs by setting , with the max-cost guaranteed to be within an $O(\log n)$ factor of the best possible max-cost among all decision trees that have classification error less than or equal to . To our knowledge this is the first bound relating classification error to cost, which provides a theoretical basis for the accuracy-cost tradeoff.
5 Experimental Results
Figure 4: Comparison of classification error vs. max-cost for the Powers impurity function in (9) for and the hinged-Pairs impurity function in (10). Note that for both House Votes and WBCD, the depth tree is not included as the error decreases dramatically using a single test. In many cases, the hinged-Pairs impurity function outperforms the Powers impurity functions for trees with smaller max-costs, whereas the Powers impurity function outperforms the hinged-Pairs function for larger max-costs.

We first demonstrate the effect of outliers using a simple synthetic example, where a small set of outliers dramatically increases the max-cost of the tree. We show that allowing a small number of errors in the tree drastically reduces the cost of the tree, allowing for efficient trees to be constructed in the presence of outliers. Next, we demonstrate the ability to construct decision trees on real-world data sets. We observe a similar behavior to the synthetic data set on many of these data sets, where allowing a small amount of error results in trees with significantly lower cost. Additionally, we see the effect of impurity function choice on the performance of the trees. For all real data sets, we present the performance of the Powers impurity function presented in Eq. (9) with and error introduced by early stopping, as well as the hinged-Pairs impurity function presented in Eq. (10) with error introduced by varying the parameter .
Synthetic Example: Here we consider a multiclass classification example to demonstrate the effect a small set of objects can have on the max-cost of the tree. Consider a data set composed of 1024 objects belonging to 4 classes with 10 binary tests available. Assume that the set of tests is complete, that is, no two objects have the same set of test outcomes. Note that by fixing the order of the tests, the set of test outcomes maps each object to an integer in the range $[0, 1023]$. From this mapping, we give the objects in the ranges , , , and the labels , , , and , respectively, and the objects , , , and the labels , , , and , respectively (Figure 2 shows the data projected onto the first two tests). Suppose each test carries a unit cost. By Kraft's Inequality [Cover and Thomas, 1991], the optimal max-cost required to correctly classify every object is 10; however, using only and as selected by the greedy algorithm leads to a correct classification of all but 4 objects, as shown in Figure 3. For this type of data set, a constant-sized set of outliers can change the tree from one with a constant max-cost to one with a max-cost.
Data Sets: We compare performance using 9 data sets from the UCI Repository [Frank and Asuncion, 2010]. We assume that all tests (features) have a uniform cost. For each data set, we replace non-unique objects with a single instance labeled with the most common label among the duplicates, so that every data set is complete (and can be perfectly classified by a decision tree). Additionally, continuous features are transformed to discrete features by quantizing to 10 uniformly spaced levels. More details on the data sets used can be found in the Appendix.
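The two preprocessing steps described above can be sketched as follows. This is one plausible reading rather than the authors' exact pipeline: `quantize` and `deduplicate` are hypothetical names, "uniformly spaced levels" is interpreted as equal-width bins between the column minimum and maximum, and label ties among duplicates are broken by first occurrence.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def quantize(column: Sequence[float], levels: int = 10) -> List[int]:
    """Map a continuous feature to integer levels 0..levels-1 using equal-width bins."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: a single level
        return [0] * len(column)
    # The maximum value would land in bin `levels`, so clip it to the last bin.
    return [min(int((x - lo) * levels / (hi - lo)), levels - 1) for x in column]


def deduplicate(
    rows: Sequence[Sequence[int]], labels: Sequence[Hashable]
) -> List[Tuple[Tuple[int, ...], Hashable]]:
    """Collapse identical feature rows into one object with the most common label."""
    votes: Dict[Tuple[int, ...], Counter] = {}
    for row, label in zip(rows, labels):
        votes.setdefault(tuple(row), Counter())[label] += 1
    return [(row, counter.most_common(1)[0][0]) for row, counter in votes.items()]


print(quantize([0.0, 0.42, 1.0]))  # [0, 4, 9]
print(deduplicate([(0, 1), (0, 1), (0, 1), (1, 1)], ["a", "b", "a", "b"]))
# [((0, 1), 'a'), ((1, 1), 'b')]
```

After deduplication no two objects share the same feature vector, so a tree that uses all features can separate every pair of objects, which is the completeness assumption used throughout.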
Error vs. Cost Tradeoff: Fig. 4 shows the tradeoff between classification error and max-cost, which suggests two key trends. First, it appears that many data sets, such as house votes, Statlog DNA, Wisconsin breast cancer, and mammography, can be classified with minimal error using few tests. Intuitively, this small error appears to correspond to a small subset of outlier objects that require a large number of tests to classify correctly, while the majority of the data can be classified with a small number of tests. Second, empirical evidence suggests that the optimal choice of impurity function depends on the desired max-cost of the tree. For trees with a smaller budget (and therefore lower depth), the hinged-Pairs impurity function outperforms the Powers impurity function with early stopping, whereas for a larger budget (and greater depth), the Powers impurity function outperforms hinged-Pairs. This matches our intuitive understanding of the impurity functions, as the Powers impurity function is biased towards tests that evenly divide the data, whereas hinged-Pairs puts more emphasis on classification performance.
6 Conclusion
We characterize a broad class of admissible impurity functions that can be used in a greedy algorithm to yield guarantees relative to the optimal max-cost. We give examples of such admissible functions and demonstrate that they have different empirical properties even though they all enjoy the same guarantee. We further design admissible functions to allow for an accuracy-cost tradeoff and provide a bound relating classification error to cost. Finally, through real-world data sets we demonstrate that our algorithm can indeed censor the outliers and achieve high classification accuracy using low max-cost. To visualize such outliers, we construct a 2D synthetic experiment and show that our algorithm successfully identifies these as outliers.
References
 [Bellala et al., 2012] Bellala, G., Bhavnani, S., and Scott, C. (2012). Group-based active query selection for rapid diagnosis in time-critical situations. Information Theory, IEEE Transactions on, 58(1):459–478.
 [Chakaravarthy et al., 2011] Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., and Mohania, M. K. (2011). Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22.
 [Cicalese et al., 2014] Cicalese, F., Laber, E. S., and Saettler, A. M. (2014). Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Proceedings, pages 414–422. JMLR.org.
 [Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York, NY, USA.
 [Dasgupta, 2004] Dasgupta, S. (2004). Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344. MIT Press.
 [Frank and Asuncion, 2010] Frank, A. and Asuncion, A. (2010). UCI machine learning repository.
 [Golovin and Krause, 2011] Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42:427–486.
 [Golovin et al., 2010] Golovin, D., Krause, A., and Ray, D. (2010). Near-optimal Bayesian active learning with noisy observations. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 766–774.
 [Guillory and Bilmes, 2010] Guillory, A. and Bilmes, J. A. (2010). Interactive submodular set cover. In Fürnkranz, J. and Joachims, T., editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 415–422. Omnipress.
 [Gupta et al., 2010] Gupta, A., Nagarajan, V., and Ravi, R. (2010). Approximation algorithms for optimal decision trees and adaptive TSP problems. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming, ICALP'10, pages 690–701, Berlin, Heidelberg. Springer-Verlag.
 [Hanneke, 2006] Hanneke, S. (2006). The cost complexity of interactive learning. unpublished.
 [Kosaraju et al., 1999] Kosaraju, S. R., Przytycka, T. M., and Borgstrom, R. S. (1999). On an optimal split tree problem. In Proceedings of the 6th International Workshop on Algorithms and Data Structures, WADS '99, pages 157–168, London, UK. Springer-Verlag.
 [Moshkov, 2010] Moshkov, M. J. (2010). Greedy algorithm with weights for decision tree construction. Fundam. Inf., 104(3):285–292.
 [Nowak, 2008] Nowak, R. (2008). Generalized binary search. In Proceedings of the 46th Allerton Conference on Communications, Control, and Computing, pages 568–574.
 [Saettler et al., 2014] Saettler, A., Laber, E., and Cicalese, F. (2014). Trading off worst and expected cost in decision tree problems and a value dependent model. ArXiv, pages 1–13.
Appendix
Proof of Lemma 3.4
Before showing admissibility of the hingedPairs function in the multiclass setting, we first show is admissible for the binary setting.
Lemma .1.
Consider the binary classification setting, let
where . is admissible.
Proof.
All the properties are obviously true except supermodularity. To show supermodularity, suppose and object . Suppose belongs to the first class. We need to show
(11) Consider 4 cases:
(1) : The right hand side of (11) is 0 and (11) holds because of monotonicity of .
(2) : (11) reduces to , which is true by monotonicity.
(3) : Note that implies that which further implies . Thus the left-hand side is
The right-hand side is
If , because implies . So .
(4) : We have
This completes the proof. ∎
Now we are ready to generalize from the binary hinged-Pairs function to the multiclass hinged-Pairs function. Again, all properties are obviously true except supermodularity. Supermodularity follows from the fact that each term in the sum is supermodular according to Lemma .1.
Proof of Lemma 4.4
We begin by considering any leaf of , suppose is the largest class in . For , if , we have
, which implies . So
If , we have . Let be the number of objects in leaf that are not from the majority class: . So for any leaf we have .
Now we enumerate the leaves of in nonincreasing order according to the number of objects they contain. Let be the set of the first leaves. By definition of , the total number of objects contained in is .
The overall error bound is obtained by considering leaves in and the complement separately:
where we have used the fact that and that .
Details of Computation in Figure 1
If Pairs is used, we can compute the impurity of each set of interest: ; according to Algorithm 1, we can compute , so the balanced test will be chosen. On the other hand, the impurities for the hinged-Pairs function with are ; again we can compute , so the test that separates objects by class will be chosen. The above example shows that Pairs has a stronger preference for balanced tests and may in some cases lead to poor classification results.
Details of Data Sets
The house votes data set is composed of the voting records for 435 members of the U.S. House of Representatives (342 unique voting records) on 16 measures, with a goal of identifying the party of each member. The sonar data set contains 208 sonar signatures, each composed of energy levels (quantized to 10 levels) in 60 different frequency bands, with a goal of identifying […]. The ionosphere data set has 351 (350 unique) radar returns, each composed of 34 responses (quantized to 10 levels), with a goal of identifying whether an event represents a free electron in the ionosphere. The Statlog DNA data set is composed of 3186 (3001 unique) DNA sequences with 180 features, with a goal of predicting whether the sequence represents a boundary of DNA to be spliced in or out. The Boston housing data set contains 13 attributes (quantized to 10 levels) pertaining to 506 (469 unique) different neighborhoods around Boston, with a goal of predicting into which quartile the median income of the neighborhood falls. The soybean data set is composed of 307 examples (303 unique) described by 34 categorical features, with a goal of predicting which of 19 diseases is afflicting the soybean plant. The pima data set is composed of 8 features (with continuous features quantized to 10 levels) corresponding to medical information and tests for 768 patients (753 unique feature patterns), with a goal of diagnosing diabetes. The Wisconsin breast cancer data set contains 30 features corresponding to properties of a cell nucleus for 569 samples, with a goal of identifying whether the cell is malignant or benign. The mammography data set contains 6 features from mammography scans (with age quantized into 10 bins) for 830 patients, with a goal of classifying the lesions as malignant or benign.