###### Abstract

We propose novel methods for the max-cost Discrete Function Evaluation Problem (DFEP) under budget constraints. We are motivated by applications such as clinical diagnosis, where a patient is subjected to a sequence of (possibly expensive) tests before a decision is made. Our goal is to develop strategies for minimizing max-costs. The problem is known to be NP-hard, and greedy methods based on specialized impurity functions have been proposed. We develop a broad class of admissible impurity functions that admit monomials, classes of polynomials, and hinge-loss functions, and that allow for flexible impurity design with provably optimal approximation bounds. This flexibility is important for datasets where max-cost can be overly sensitive to “outliers.” Outliers bias max-cost towards a few examples that require a large number of tests for classification. We design admissible functions that allow for an accuracy-cost trade-off and result in guarantees of the optimal cost among trees with corresponding classification accuracy levels.

Max-Cost Discrete Function Evaluation Problem under a Budget

Feng Nan (Boston University, fnan@bu.edu), Joseph Wang (Boston University, joewang@bu.edu), Venkatesh Saligrama (Boston University, srv@bu.edu)

## 1 Introduction

In many applications such as clinical diagnosis, monitoring, and web search, a patient, entity or query is subjected to a sequence of tests before a decision or prediction is made. Tests can be expensive and often complementary, namely, the outcome of one test may render another redundant. The goal in these scenarios is to minimize total test costs with negligible loss in diagnostic performance.

We propose to formulate this problem as an instance of the Discrete Function Evaluation Problem (DFEP). Under this framework, we seek to learn a decision tree that correctly classifies data while minimizing the cost of testing. We then propose methods to trade off accuracy and cost.

An instance of the problem is defined as $I = (S, C, T, c)$. Here $S = \{s_1, \dots, s_n\}$ is the set of $n$ objects; $C$ is a partition of $S$ into classes; $T$ is a set of tests; $c$ is a cost function that assigns a cost $c(t)$ to each test $t \in T$. Applying test $t$ to object $s$ outputs a discrete value $t(s)$ in a finite set of possible outcomes. $T$ is assumed to be complete in the sense that for any distinct $s_1, s_2 \in S$ there exists a $t \in T$ such that $t(s_1) \neq t(s_2)$, so they can be distinguished by $t$. Given an instance of the DFEP, the goal is to build a testing procedure that uses tests in $T$ to determine the class of an unknown object. Formally, any testing procedure can be represented by a decision tree, where every internal node is associated with a test and objects are directed from the root to the corresponding leaves based on the test outcomes at each node. Given an instance $I$ and a decision tree $D$, the testing cost of $s \in S$, denoted $\mathrm{cost}(D, s)$, is the sum of all costs incurred along the root-to-leaf path in $D$ traced by $s$. We define the total cost as

$$\mathrm{Cost}_W(D) = \max_{s \in S} \mathrm{cost}(D, s).$$

This is known as the max-cost testing problem in the DFEP literature and has independently received significant attention [Cicalese et al., 2014, Saettler et al., 2014, Moshkov, 2010, Bellala et al., 2012] due to the fact that in real world problems, the prior probability used to compute the expected testing cost is either unavailable or inaccurate. Another motivation stems from time-critical applications, such as emergency response [Bellala et al., 2012], where violation of a time-constraint may lead to unacceptable consequences.
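The max-cost definition above can be made concrete with a small sketch. The `Node` representation, the function names, and the toy instance are assumptions introduced for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node: internal if `test` is set, a leaf otherwise."""
    test: Optional[int] = None  # index of the test applied at this node
    children: Dict[int, "Node"] = field(default_factory=dict)  # outcome -> subtree


def path_cost(tree: Node, obj: Sequence[int], costs: Sequence[float]) -> float:
    """Sum of test costs along the root-to-leaf path traced by one object."""
    total, node = 0.0, tree
    while node.test is not None:
        total += costs[node.test]
        node = node.children[obj[node.test]]
    return total


def max_cost(tree: Node, objects, costs) -> float:
    """Worst-case testing cost over all objects."""
    return max(path_cost(tree, s, costs) for s in objects)


# Toy instance: each object is a tuple of test outcomes.
# Test 0 costs 1 and test 1 costs 3 (illustrative values).
objects = [(0, 0), (0, 1), (1, 0)]
costs = [1.0, 3.0]
tree = Node(test=0, children={
    0: Node(test=1, children={0: Node(), 1: Node()}),
    1: Node(),
})
print(max_cost(tree, objects, costs))  # 4.0: outcome 0 on test 0 pays 1 + 3
```

Only the most expensive root-to-leaf path matters here, which is why a few hard-to-separate objects can dominate the objective.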

In this paper we propose novel approaches and themes for the max-cost DFEP. It is now well known [Cicalese et al., 2014] that $O(\log n)$ is the best approximation factor for DFEP unless $P = NP$. Greedy methods that achieve an $O(\log n)$ approximation guarantee have been proposed [Cicalese et al., 2014, Saettler et al., 2014, Moshkov, 2010]. These methods often rely on judiciously engineering so-called impurity functions that are surprisingly effective in realizing “optimal” guarantees. The authors of [Cicalese et al., 2014, Moshkov, 2010, Saettler et al., 2014] describe impurity functions based on the notion of Pairs, while the authors of [Bellala et al., 2012] describe more complex impurity functions but require distributional assumptions.

In contrast, we propose a broad class of admissible functions such that any function from this class can be chosen as an impurity function with an $O(\log n)$ approximation guarantee. Our admissible functions are in essence positive, monotone, supermodular functions and admit not only Pairs, monomials, and classes of polynomials, but also hinge-loss functions.

We propose new directions for the max-cost DFEP. In contrast to the current emphasis on correct classification, we propose to deliberately trade off cost against accuracy. This perspective can be justified under various scenarios. First, max-cost is overly sensitive to “outliers,” namely, a few instances that require prohibitively many tests for correct classification. In these situations max-cost is not representative of most of the data and is biased towards a small subset of objects. Consequently, censoring those few “outliers” is meaningful from the perspective that max-cost applies to all but a few examples. Second, many applications have hard cost constraints that supersede correct classification of the entire data set, and the goal is a tree that guarantees these cost constraints while minimizing errors.

Our proposed admissible functions are sufficiently general and allow for trading accuracy for cost. In particular, we develop methods with guarantees of the optimal cost among trees with a corresponding classification accuracy level. Moreover, we show empirically on a number of examples that the selection of impurity functions plays an important role in this trade-off. In particular, some admissible functions, such as hinge-loss, are particularly well suited for low budgets, while others are preferable in high-budget scenarios.

Apart from the related approaches already described above, our work is also related to those that generally deal with expected costs [Golovin and Krause, 2011, Golovin et al., 2010, Bellala et al., 2012] or related problems such as the submodular set cover problem [Guillory and Bilmes, 2010]. At a conceptual level, the main difference in [Guillory and Bilmes, 2010, Golovin and Krause, 2011, Golovin et al., 2010, Bellala et al., 2012] is in the way tests are chosen. Unlike our approach, these methods employ utility functions in the policy space that act on a sequence of observations. [Golovin and Krause, 2011] develops the notion of adaptive submodularity and applies it to automated diagnosis. The proposed adaptive greedy algorithm can handle multiple classes and test outcomes as well as arbitrary test costs, but the approximation factor for the max-cost depends on the prior probability and can be very large in adversarial situations. A popular class of related approximation algorithms is generalized binary search (GBS) [Dasgupta, 2004, Kosaraju et al., 1999, Nowak, 2008]. A special case of this problem, where each object belongs to a distinct class, is known as the object identification problem [Chakaravarthy et al., 2011] or pool-based active learning [Dasgupta, 2004]. When tests are restricted to binary outcomes and uniform test costs, an approximation guarantee that depends on the minimum probability of any single object can be obtained [Dasgupta, 2004]. Alternatively, [Gupta et al., 2010] provides an algorithm with an approximation guarantee for the optimal expected cost with arbitrary test costs and binary test outcomes. With respect to the max-cost, [Hanneke, 2006] gave an approximation guarantee for multiway tests and arbitrary test costs.

### Organization:

We present a greedy algorithm in Section 2 which, as we show, leads to an $O(\log n)$ approximation of the optimal tree under general assumptions on the impurity function. We examine the assumptions on impurity functions and use them to define a class of admissible impurity functions in Section 3. Following this, we generalize from the error-free case to a trade-off between max-cost and error in Section 4. Finally, we demonstrate the performance of the greedy algorithm on real-world data sets in Section 5 and show the advantage of different impurity functions along with the trade-off between error and max-cost.

## 2 Greedy Algorithm and Analysis

In this section, we present an analysis of the greedy algorithm GreedyTree. We first show that GreedyTree yields a tree whose max-cost is within a factor of $O(\log n)$ of the optimal max-cost for any DFEP. This bound on max-cost holds for any impurity function that satisfies a very general set of criteria, as opposed to a fixed impurity function. In Section 3 we examine the assumptions on the impurity functions and present multiple examples of impurity functions for which this approximation bound holds.

Before beginning the analysis, we first define the following terms: for a given impurity function $F$, $F(G)$ is the impurity on the set of objects $G$; $\mathcal{D}_F$ is the family of decision trees with $F(L) = 0$ for every leaf $L$; $\mathrm{OPT}(G)$ is the minimum max-cost among all trees in $\mathcal{D}_F$ for the given input set of objects $G$; $\mathrm{Cost}_F(G)$ is the max-cost of the tree constructed by GreedyTree based on impurity function $F$.

For simplicity, we assume that the impurity function takes integer values and that test costs are outcome-independent. Note that restricting to integer-valued impurity functions is not a limitation because of the discrete (finite) nature of the problem: one can always scale any rational-valued impurity function to make it integer-valued. Similarly, it can easily be shown that our result extends to the outcome-dependent cost setting considered in [Saettler et al., 2014] as well.

Given a DFEP, GreedyTree greedily chooses the test with the largest worst-case impurity reduction relative to its cost until all leaves are pure, i.e., their impurity equals zero. Let $\tau$ be the first test selected by GreedyTree. By definition of the max-cost,

$$\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} = \frac{c(\tau) + \max_i \mathrm{Cost}_F(S_\tau^i)}{\mathrm{OPT}(S)},$$

where $S_\tau^i$ is the set of objects in $S$ that have outcome $i$ for test $\tau$. Let $q$ be such that $\mathrm{Cost}_F(S_\tau^q) = \max_i \mathrm{Cost}_F(S_\tau^i)$. We first provide a lemma to lower bound the optimal cost, which will later be used to prove a bound on the cost of the tree.
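The greedy rule described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes the selection criterion is the ratio of test cost to worst-case impurity reduction (the quantity that appears in the proof of Lemma 2.1), uses the Pairs impurity as a default, and represents objects as hypothetical `(outcomes, label)` tuples.

```python
from collections import Counter, defaultdict


def pairs(labels):
    """Pairs impurity: number of object pairs with different class labels."""
    counts = Counter(labels).values()
    total = sum(counts)
    return (total * total - sum(c * c for c in counts)) // 2


def split(objects, t):
    """Group (outcomes, label) objects by their outcome on test t."""
    groups = defaultdict(list)
    for outcomes, label in objects:
        groups[outcomes[t]].append((outcomes, label))
    return groups


def greedy_tree(objects, costs, impurity=pairs):
    """Return a nested dict tree; a leaf is the list of objects that reach it."""
    f_s = impurity([label for _, label in objects])
    if f_s == 0:
        return objects  # zero impurity: stop and make a leaf
    best_t, best_ratio = None, float("inf")
    for t, c in enumerate(costs):
        groups = split(objects, t)
        worst_child = max(impurity([l for _, l in g]) for g in groups.values())
        reduction = f_s - worst_child  # worst-case impurity reduction of test t
        if reduction > 0 and c / reduction < best_ratio:
            best_t, best_ratio = t, c / reduction
    if best_t is None:
        raise ValueError("no test reduces impurity; the test set is not complete")
    children = {o: greedy_tree(g, costs, impurity)
                for o, g in split(objects, best_t).items()}
    return {"test": best_t, "children": children}


def max_cost(tree, costs):
    """Max-cost of a tree built by greedy_tree."""
    if isinstance(tree, list):
        return 0.0
    return costs[tree["test"]] + max(max_cost(ch, costs)
                                     for ch in tree["children"].values())


# Toy instance: two binary tests with costs 1 and 2, three classes.
objects = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "b"), ((1, 1), "c")]
costs = [1.0, 2.0]
tree = greedy_tree(objects, costs)
print(tree["test"], max_cost(tree, costs))  # 0 3.0
```

In this toy instance both tests give the same worst-case reduction at the root, so the cheaper test is applied first and the more expensive test is used only on the branch that still mixes classes.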

###### Lemma 2.1.

Let $F$ be monotone and supermodular, and let $\tau$ be the first test chosen by GreedyTree on the set of objects $S$. Then

$$\frac{c(\tau)\, F(S)}{F(S) - F(S_\tau^q)} \le \mathrm{OPT}(S).$$

###### Proof.

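Let $D^*$ be a tree with optimal max-cost. Let $v$ be an arbitrarily chosen internal node in $D^*$, let $\gamma$ be the test associated with $v$, and let $R \subseteq S$ be the set of objects associated with the leaves of the subtree rooted at $v$. Let $i$ be such that $c(\tau)/(F(S) - F(S_\tau^i))$ is maximized and $j$ be such that $c(\gamma)/(F(S) - F(S_\gamma^j))$ is maximized. We then have:

$$\frac{c(\tau)}{F(S) - F(S_\tau^q)} \le \frac{c(\tau)}{F(S) - F(S_\tau^i)} \le \frac{c(\gamma)}{F(S) - F(S_\gamma^j)} \le \frac{c(\gamma)}{F(R) - F(R_\gamma^j)}. \quad (1)$$

The first inequality follows from the definition of $i$. The second inequality follows from the greedy choice at the root. To show the last inequality, we have to show $F(S) - F(S_\gamma^j) \ge F(R) - F(R_\gamma^j)$. This follows from the fact that $S_\gamma^j \cup R \subseteq S$ and $S_\gamma^j \cap R = R_\gamma^j$, and therefore $F(S) \ge F(S_\gamma^j \cup R) \ge F(S_\gamma^j) + F(R) - F(R_\gamma^j)$, where the first inequality follows from monotonicity and the second follows from the definition of supermodularity.

For a node $v$, let $S(v)$ be the set of objects associated with the leaves of the subtree rooted at $v$. Let $v_1, v_2, \dots, v_p$ be a root-to-leaf path on $D^*$ defined as follows: $v_1$ is the root of the tree, and for each $i = 1, \dots, p-1$ the node $v_{i+1}$ is a child of $v_i$ associated with the branch $j$ of $t_i$ that maximizes $c(t_i)/(F(S) - F(S_{t_i}^j))$, where $t_i$ is the test associated with $v_i$. It follows from (1) that

$$\left[F(S(v_i)) - F(S(v_{i+1}))\right] \frac{c(\tau)}{F(S) - F(S_\tau^q)} \le c(t_i). \quad (2)$$

Since the cost of the path from $v_1$ to $v_p$ is no larger than the max-cost of $D^*$, and since $S(v_1) = S$ and $F(S(v_p)) = 0$, we have that

$$\mathrm{OPT}(S) \ge \sum_{i=1}^{p-1} c(t_i) \ge \frac{c(\tau)}{F(S) - F(S_\tau^q)} \sum_{i=1}^{p-1} \left(F(S(v_i)) - F(S(v_{i+1}))\right) = \frac{c(\tau)\left(F(S) - F(S(v_p))\right)}{F(S) - F(S_\tau^q)} = \frac{c(\tau)\, F(S)}{F(S) - F(S_\tau^q)}.$$

∎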

Using Lemma 2.1, we can now state the main theorem of this section which bounds the cost of the greedily constructed tree.

###### Theorem 2.2.

GreedyTree constructs a decision tree achieving an $O(\log n)$-factor approximation of the optimal max-cost in $\mathcal{D}_F$ on the set $S$ of objects if $F$ is non-negative, monotone, and supermodular with $\log F(S) = O(\log n)$.

###### Proof.
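$$\begin{aligned}
\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} &= \frac{c(\tau) + \mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S)} && (3) \\
&\le \frac{c(\tau)}{\mathrm{OPT}(S)} + \frac{\mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S_\tau^q)} && (4) \\
&\le \frac{F(S) - F(S_\tau^q)}{F(S)} + \frac{\mathrm{Cost}_F(S_\tau^q)}{\mathrm{OPT}(S_\tau^q)} && (5) \\
&\le \log\left(\frac{F(S)}{F(S_\tau^q)}\right) + \log\left(F(S_\tau^q)\right) + 1 && (6) \\
&= \log(F(S)) + 1 = O(\log n). && (7)
\end{aligned}$$

The inequality in (4) follows from the fact that $\mathrm{OPT}(S) \ge \mathrm{OPT}(S_\tau^q)$. (5) follows from Lemma 2.1. The first term in (6) follows from the inequality $1 - y \le \log(1/y)$ for $0 < y \le 1$, applied with $y = F(S_\tau^q)/F(S)$, and the second term follows from the induction hypothesis that for each $G \subset S$, $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) \le \log(F(G)) + 1$. If $F(G) = 0$ for some set of objects $G$, no test is needed on $G$, so both costs are zero and the ratio $\mathrm{Cost}_F(G)/\mathrm{OPT}(G)$ is fixed by convention.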

We can verify the base case of the induction as follows. If $F(G) = 1$, which is the smallest non-zero impurity of $F$ on subsets of objects $G \subseteq S$, we claim that the optimal decision tree chooses the test with the smallest cost among those that can reduce the impurity function $F$:

$$\mathrm{OPT}(G) = \min_{t \,:\, F(G_t^i) = 0,\ \forall i \in \text{outcomes}} c(t).$$

Suppose otherwise, that is, the optimal tree first chooses a test $t$ with a child node $G'$ such that $F(G') = 1$ and later chooses another test $t'$ such that all the child nodes of $G'$ by $t'$ have zero impurity. Then $t'$ could have been chosen in the first place to reduce all child nodes of $G$ to zero impurity by supermodularity of $F$, and therefore this cannot be the optimal ordering of tests. On the other hand, in GreedyTree the ratio $c(t)/(F(G) - \max_i F(G_t^i))$ is infinite for those tests that cannot reduce impurity and equals $c(t)$ for those tests that can. So the algorithm picks the test with the smallest cost among those that can reduce impurity. Thus, we have shown that $\mathrm{Cost}_F(G) = \mathrm{OPT}(G)$ for the base case. ∎

Given that $\log F(S) = O(\log n)$, the optimal order of approximation for the DFEP is $O(\log n)$, which is achieved by GreedyTree. This approximation is not dependent on a particular impurity function, but instead holds for any function that satisfies the assumptions. In Section 3, we define a family of impurity functions that satisfy these assumptions.

## 3 Admissible Impurity Functions

A fundamental element of constructing decision trees is the impurity function, which measures the disagreement of labels among a set of objects. Many impurity functions have been proposed for constructing decision trees, and the choice of impurity function can have a significant impact on the performance of the tree. In this section we examine the assumptions placed on the impurity function by Lemma 2.1 and Theorem 2.2, use them to define a class of functions we call admissible impurity functions, and provide examples of admissible impurity functions.

• A function $F$ of a set of objects is admissible if it satisfies the following five properties: (1) Non-negativity: $F(G) \ge 0$ for any set of objects $G$; (2) Purity: $F(G) = 0$ if $G$ consists of objects of the same class; (3) Monotonicity: $F(G) \ge F(R)$ for all $R \subseteq G$; (4) Supermodularity: $F(G \cup j) - F(G) \ge F(R \cup j) - F(R)$ for any $R \subseteq G$ and object $j \notin G$; (5) $\log(F(S)) = O(\log n)$.

A wide range of functions falls into the class of admissible impurity functions. We propose a general family of polynomial functions which we show is admissible. Given a set of objects $G$, $n_G^i$ denotes the number of objects in $G$ that belong to class $i$.

###### Lemma 3.1.

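Suppose there are $k$ classes in $G$. Any polynomial function of $n_G^1, \dots, n_G^k$ with non-negative terms such that $n_G^1, \dots, n_G^k$ do not appear as singleton terms is admissible. Formally, if

$$F(G) = \sum_{i=1}^{M} \gamma_i \left(n_G^1\right)^{p_{i1}} \left(n_G^2\right)^{p_{i2}} \cdots \left(n_G^k\right)^{p_{ik}}, \quad (8)$$

where the $\gamma_i$'s are non-negative, the $p_{ij}$'s are non-negative integers, and for each $i$ there exist at least 2 non-zero $p_{ij}$'s, then $F$ is admissible.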

###### Proof.

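Properties (1), (2), (3) and (5) are obviously true. To show that $F$ is supermodular, suppose $R \subseteq G$ and that object $\hat{j} \notin G$ belongs to class $j$. We have

$$\begin{aligned}
F(R \cup \hat{j}) - F(R) &= \sum_{i \in I_j} \gamma_i \left[ \left(n_R^1\right)^{p_{i1}} \cdots \left(n_R^j + 1\right)^{p_{ij}} \cdots \left(n_R^k\right)^{p_{ik}} - \left(n_R^1\right)^{p_{i1}} \cdots \left(n_R^j\right)^{p_{ij}} \cdots \left(n_R^k\right)^{p_{ik}} \right] \\
&\le \sum_{i \in I_j} \gamma_i \left[ \left(n_G^1\right)^{p_{i1}} \cdots \left(n_G^j + 1\right)^{p_{ij}} \cdots \left(n_G^k\right)^{p_{ik}} - \left(n_G^1\right)^{p_{i1}} \cdots \left(n_G^j\right)^{p_{ij}} \cdots \left(n_G^k\right)^{p_{ik}} \right] \\
&= F(G \cup \hat{j}) - F(G),
\end{aligned}$$

where the summation index set $I_j$ is the set of terms that involve class $j$. The inequality follows because $\left(n_R^j + 1\right)^{p_{ij}}$ can be expanded so that the negative term cancels, leaving a sum-of-products form for $R$ that is term-by-term dominated by that of $G$. ∎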

A special case of the polynomial impurity function is the previously proposed Pairs function [Saettler et al., 2014, Cicalese et al., 2014, Moshkov, 2010]. Two objects are defined as a pair if they are of different classes, with the Pairs function equal to the total number of pairs in the set $G$:

$$P(G) = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_G^i\, n_G^j,$$

where $k$ is the number of distinct classes in set $G$.
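As a sanity check on the admissibility properties, the following sketch evaluates the Pairs function on per-class count vectors and brute-forces properties (1) to (4) on a small grid. The function names and the grid-based check are illustrative assumptions; the check is numerical evidence on small cases, not a proof.

```python
from itertools import product


def pairs(counts):
    """Pairs impurity from per-class counts: sum over i < j of n_i * n_j."""
    total = sum(counts)
    return (total * total - sum(c * c for c in counts)) // 2


def check_admissible_on_grid(impurity, k=3, max_count=4):
    """Brute-force non-negativity, purity, monotonicity and supermodularity
    over all count vectors R <= G (componentwise) with entries <= max_count."""
    grid = list(product(range(max_count + 1), repeat=k))
    for g in grid:
        if impurity(g) < 0:
            return False  # violates non-negativity
        if sum(1 for c in g if c > 0) <= 1 and impurity(g) != 0:
            return False  # violates purity
        for r in grid:
            if any(ri > gi for ri, gi in zip(r, g)):
                continue  # R must be a subset of G
            if impurity(r) > impurity(g):
                return False  # violates monotonicity
            for j in range(k):  # add one object of class j to both sets
                g_plus = tuple(c + (i == j) for i, c in enumerate(g))
                r_plus = tuple(c + (i == j) for i, c in enumerate(r))
                if impurity(g_plus) - impurity(g) < impurity(r_plus) - impurity(r):
                    return False  # violates supermodularity
    return True


print(pairs((30, 30)))                   # 900
print(check_admissible_on_grid(pairs))   # True
print(check_admissible_on_grid(sum))     # False: set size violates purity
```

Working in count space suffices here because the impurity depends on a set only through its class counts, and any componentwise-smaller count vector corresponds to some subset.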

###### Corollary 3.2.

The Pairs impurity function is admissible.

As a corollary of Theorem 2.2 and Corollary 3.2, we see that the $O(\log n)$ approximation for Pairs and outcome-dependent costs holds for multiple test outcomes as well, extending the binary outcome setting shown in [Saettler et al., 2014].

Another family of admissible impurity functions is the Powers function.

###### Corollary 3.3.

The Powers function

$$F(G) = \left(\sum_{i=1}^{k} n_G^i\right)^{l} - \sum_{i=1}^{k} \left(n_G^i\right)^{l} \quad (9)$$

is admissible for any integer $l \ge 2$.

Note that Pairs can be viewed as a special case of the Powers function when $l = 2$ (up to a constant factor of 2). An important property of the Powers impurity functions is the fact that for any power $l \ge 2$, the function is zero only if the objects in the set all belong to the same class. As a result, using any of these Powers impurity functions in GreedyTree results in an error-free tree with near-optimal cost.
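The relationship between Powers and Pairs can be checked numerically. The sketch below assumes count vectors as input and uses illustrative function names; for $l = 2$ the Powers value is exactly twice the Pairs value, since $(\sum_i n_i)^2 - \sum_i n_i^2 = 2\sum_{i<j} n_i n_j$.

```python
def powers(counts, l=2):
    """Powers impurity of Eq. (9) evaluated on per-class counts."""
    total = sum(counts)
    return total ** l - sum(c ** l for c in counts)


def pairs(counts):
    """Pairs impurity: sum over i < j of n_i * n_j."""
    total = sum(counts)
    return (total * total - sum(c * c for c in counts)) // 2


for counts in [(30, 30), (3, 2, 1), (7, 0, 0)]:
    # For l = 2 the two functions agree up to the constant factor 2.
    print(counts, powers(counts, 2), 2 * pairs(counts), powers(counts, 3))
```

The pure vector `(7, 0, 0)` gives zero for every power, while any vector with two non-empty classes gives a strictly positive value.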

Another interesting admissible impurity function, used in Section 4, is the hinged-Pairs function, defined as

$$P_\alpha(G) = \sum_{i \neq j} \left[ \left[n_G^i - \alpha\right]_+ \left[n_G^j - \alpha\right]_+ - \alpha^2 \right]_+, \quad (10)$$

where $[x]_+ = \max(x, 0)$. This function differs from the Powers impurity function in that, for $\alpha > 0$, $P_\alpha(G) = 0$ need not imply that all objects in $G$ belong to the same class. In the next section, we will discuss how this allows for trees to be constructed incorporating classification error. We include the proof of the following lemma in the Appendix.

###### Lemma 3.4.

In the multi-class setting, $P_\alpha(G)$ is admissible.
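A direct transcription of Eq. (10) helps show how the hinge lets an impure set have zero impurity. The sketch below assumes per-class counts as input and sums over ordered pairs $i \neq j$ as the equation is written; the value $\alpha = 8$ is an arbitrary illustrative choice.

```python
def hinge(x):
    """[x]_+ = max(x, 0)."""
    return max(x, 0)


def hinged_pairs(counts, alpha):
    """Hinged-Pairs impurity of Eq. (10), summed over ordered pairs i != j."""
    total = 0
    for i, ni in enumerate(counts):
        for j, nj in enumerate(counts):
            if i != j:
                total += hinge(hinge(ni - alpha) * hinge(nj - alpha) - alpha ** 2)
    return total


print(hinged_pairs((30, 30), 0))   # 1800: with alpha = 0 this is twice Pairs
print(hinged_pairs((30, 30), 8))   # 840
print(hinged_pairs((30, 10), 8))   # 0: an impure set with zero impurity
```

With `alpha = 8`, a set holding 30 objects of one class and 10 of another already counts as "done", which is the mechanism that allows a controlled amount of classification error.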

Impurity Function Selection: While all admissible impurity functions enjoy the $O(\log n)$ approximation of the optimal max-cost, they lead to different trees depending on the problem. To illustrate this point, consider the toy example in Figure 1. A set has 30 objects in Class 1 (circles) and 30 objects in Class 2 (triangles). Two tests $t_1$ and $t_2$ are available to the algorithm. Test $t_1$ separates 20 objects of Class 2 from the rest of the objects, while $t_2$ evenly divides the objects into halves with an equal number of objects from Class 1 and Class 2 in either half. Intuitively, $t_2$ is not a useful test from a classification point of view because it does not separate objects based on class at all. This is reflected in the right plot of Figure 1: choosing $t_2$ increases cost but does not reduce classification error, while choosing $t_1$ reduces the error to $1/6$. If the impurity function chosen is the Pairs function, test $t_2$ will be chosen because Pairs is biased towards tests with balanced outcomes. In contrast, the hinged-Pairs function leads to test $t_1$, and therefore may be preferable in this case (for more details on this example see the Appendix). Although both impurity functions are admissible and return trees with near-optimal guarantees, empirical performance can differ greatly and is strongly dependent on the structure of the data. In practice, we find that choosing the tree with the lowest classification error across a variety of impurity functions yields improved performance compared to a single impurity function strategy.
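The Pairs side of this toy example can be reproduced with a few lines. The sketch assumes that the two tests have equal cost (the costs are not stated in the text), names the tests `t1` and `t2` for illustration, and measures error with a majority vote in each leaf.

```python
from fractions import Fraction


def pairs(counts):
    """Pairs impurity: sum over i < j of n_i * n_j."""
    total = sum(counts)
    return (total * total - sum(c * c for c in counts)) // 2


def worst_case_reduction(parent, children):
    """Impurity reduction guaranteed by a test, whatever its outcome."""
    return pairs(parent) - max(pairs(c) for c in children)


def error_after_split(children):
    """Fraction of misclassified objects under a majority vote per leaf."""
    n = sum(sum(c) for c in children)
    wrong = sum(sum(c) - max(c) for c in children)
    return Fraction(wrong, n)


parent = (30, 30)                       # (Class 1, Class 2) counts
splits = {
    "t1": [(0, 20), (30, 10)],          # separates 20 objects of Class 2
    "t2": [(15, 15), (15, 15)],         # balanced halves
}
for name, children in splits.items():
    print(name, worst_case_reduction(parent, children), error_after_split(children))
# t1 600 1/6
# t2 675 1/2
```

Under Pairs the balanced test `t2` has the larger worst-case reduction (675 versus 600), so it is preferred at equal cost even though it leaves the classification error at 1/2.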

## 4 Trade-off Between Max-Cost and Error

Up to this point, we have focused on constructing error-free trees. Unfortunately, the max-cost criterion is highly sensitive to outliers, and therefore often yields trees with unnecessarily large maximum depth to accommodate a small subset of outliers in the data set. Refer to the synthetic experiment in Section 5 for such an example. To overcome the sensitivity to outliers, we present an approach to constructing near-optimal trees with non-zero error rates.

### Early-stopping:

Instead of requiring all leaves to have zero impurity ($F(L) = 0$) in Algorithm 1, we can stop the recursion as soon as all leaves have impurity at or below a threshold ($F(L) \le \delta$). This allows a trade-off between error and cost. Let $\mathcal{D}_{F:\delta}$ denote the set of trees with $F(L) \le \delta$ for all leaves $L$, and let $\mathrm{OPT}_{F:\delta}(S)$ denote the optimal max-cost among all trees in $\mathcal{D}_{F:\delta}$.

Similar to the error-free setting, the $O(\log n)$ approximation of the optimal max-cost still holds for early stopping, as shown next. The proofs of Lemma 4.1 and Theorem 4.2 are similar to those of Lemma 2.1 and Theorem 2.2, and we include them in the Appendix.

###### Lemma 4.1.

Let $F$ be an admissible function and let $\tau$ be the first test chosen by GreedyTree on the set of objects $S$. Then

$$\frac{c(\tau)\left(F(S) - \delta\right)}{F(S) - F(S_\tau^q)} \le \mathrm{OPT}_{F:\delta}(S).$$

###### Theorem 4.2.

GreedyTree constructs a decision tree achieving an $O(\log n)$-factor approximation of the optimal max-cost in $\mathcal{D}_{F:\delta}$ on the set $S$ of objects if $F$ is admissible.

### Hinged-Pairs:

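Similar to early stopping, we can also use the hinged-Pairs function (10) with $\alpha > 0$ in GreedyTree to allow an error-cost trade-off. We first establish an error upper bound for trees in $\mathcal{D}_{P_\alpha:0}$, the set of trees whose leaves all have zero hinged-Pairs impurity.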

###### Lemma 4.3.

For a multi-class input set $S$ with $k$ classes, the classification error of any tree in $\mathcal{D}_{P_\alpha:0}$ with $l$ leaves is bounded by $k(k-1)\, l\, \epsilon$, where we set $\alpha = \epsilon n$.

###### Proof.

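Suppose $j$ is the largest class in leaf $L$. For $i \neq j$, if $n_L^i > \alpha$, we have $n_L^i n_L^j - \alpha\left(n_L^i + n_L^j\right) \le 0$, which implies $\frac{n_L^i n_L^j}{n_L^i + n_L^j} \le \alpha$. So

$$n_L^i \le k\, \frac{n_L^i\, n_L^j}{n_L} \le k\alpha = k\epsilon n.$$

If $n_L^i \le \alpha$, we have $n_L^i \le \alpha \le k\epsilon n$. So for any leaf $L$, the number of objects that are not from the majority class is at most $k(k-1)\epsilon n$. The overall error bound thus follows. ∎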

Often in practice a tree may contain a relatively large number of leaves but only a small fraction of them contain most of the objects. A more refined upper bound on the error is given by the following lemma, which we prove in the Appendix.

###### Lemma 4.4.

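Consider a multi-class input set $S$ with $k$ classes and $n$ objects, and set $\alpha = \epsilon n$. For any tree $D \in \mathcal{D}_{P_\alpha:0}$ with $l$ leaves, given any $\eta \in (0, 1)$, let $l_\eta$ be the smallest integer such that the $l_\eta$ largest leaves of $D$ have more than $1 - \eta$ of the total number of objects $n$. Then the classification error is bounded by $k(k-1)\, l_\eta\, \epsilon + \frac{k-1}{k}\, \eta$.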
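To see how the refined bound can improve on the leaf-count bound of Lemma 4.3, the following sketch evaluates both expressions, as read from the proofs in this paper, on a hypothetical tree with two large leaves and many tiny ones. The leaf sizes and the values of $\epsilon$ and $\eta$ are made up for illustration.

```python
from fractions import Fraction


def coarse_bound(k, num_leaves, eps):
    """Error bound k(k-1) * l * eps that counts every leaf."""
    return k * (k - 1) * num_leaves * eps


def refined_bound(k, leaf_sizes, eps, eta):
    """Error bound k(k-1) * l_eta * eps + (k-1)/k * eta, where l_eta is the
    smallest number of largest leaves holding more than (1 - eta) of the objects."""
    n = sum(leaf_sizes)
    covered, l_eta = 0, 0
    for size in sorted(leaf_sizes, reverse=True):
        covered += size
        l_eta += 1
        if covered > (1 - eta) * n:
            break
    return k * (k - 1) * l_eta * eps + Fraction(k - 1, k) * eta


# Hypothetical binary problem: 2 large leaves and 40 singleton leaves.
leaf_sizes = [480, 480] + [1] * 40
eps, eta = Fraction(1, 1000), Fraction(1, 20)
print(coarse_bound(2, len(leaf_sizes), eps))     # 21/250
print(refined_bound(2, leaf_sizes, eps, eta))    # 29/1000
```

Here the two largest leaves already hold more than 95% of the objects, so only they contribute the per-leaf term and the remaining mass is charged at the rate $(k-1)/k$.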

Denote by $\mathcal{D}_{E:\epsilon}$ the class of trees with classification error less than or equal to $\epsilon$ on the input set $S$. We can further derive a useful relation between $\mathcal{D}_{E:\epsilon}$ and $\mathcal{D}_{P_\alpha:0}$.

###### Lemma 4.5.

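For any multi-class input set $S$ with $k$ classes and $\alpha = \epsilon n$, we have $\mathcal{D}_{E:\epsilon} \subseteq \mathcal{D}_{P_\alpha:0}$, and every tree in $\mathcal{D}_{P_\alpha:0}$ with $l$ leaves has classification error at most $k(k-1)\, l\, \epsilon$.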

###### Proof.

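To show $\mathcal{D}_{E:\epsilon} \subseteq \mathcal{D}_{P_\alpha:0}$, note that for any tree $D \in \mathcal{D}_{E:\epsilon}$ we have $\sum_{m=1}^{l} \tilde{n}_{L_m} \le \epsilon n$, where $l$ is the number of leaves and $\tilde{n}_L$ is the number of objects in leaf $L$ that are not from the majority class: $\tilde{n}_L = n_L - \max_i n_L^i$. This implies $\tilde{n}_L \le \epsilon n$ for all leaves $L$ of $D$. Suppose $j$ is the class with the largest number of objects in leaf $L$: $j = \arg\max_i n_L^i$. It is not hard to see that for any class $i \neq j$,

$$\frac{n_L^i\, n_L^j}{n_L^i + n_L^j} \le n_L^i \le \epsilon n,$$

which implies that the corresponding term of (10) is zero. The same bound holds for any pair of non-majority classes, so we have $P_\alpha(L) = 0$. Thus $D \in \mathcal{D}_{P_\alpha:0}$. The error bound for trees in $\mathcal{D}_{P_\alpha:0}$ follows from Lemma 4.3. ∎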
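The first inclusion can be checked by brute force on small leaves. The sketch below assumes $\alpha = \epsilon n$ and enumerates count vectors with three classes; the function names, the grid size, and the values of $n$ and $\epsilon$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def hinged_pairs(counts, alpha):
    """Hinged-Pairs impurity of Eq. (10), summed over ordered pairs i != j."""
    total = 0
    for i, ni in enumerate(counts):
        for j, nj in enumerate(counts):
            if i != j:
                term = max(ni - alpha, 0) * max(nj - alpha, 0) - alpha ** 2
                total += max(term, 0)
    return total


def minority_count(counts):
    """Number of objects in a leaf that are not from its majority class."""
    return sum(counts) - max(counts)


def low_error_leaves_have_zero_impurity(n, eps, k=3, max_count=8):
    """Check: minority count <= eps * n implies hinged-Pairs impurity 0."""
    alpha = eps * n
    for counts in product(range(max_count + 1), repeat=k):
        if minority_count(counts) <= alpha and hinged_pairs(counts, alpha) != 0:
            return False
    return True


print(low_error_leaves_have_zero_impurity(20, Fraction(1, 10)))  # True
# The converse fails: zero impurity does not force a small minority count.
print(hinged_pairs((30, 10), 8), minority_count((30, 10)))       # 0 10
```

The second line of output is the reason the reverse direction only holds with the extra factor from Lemma 4.3.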

The main theorem of this section is the following.

###### Theorem 4.6.

In multi-class classification with $k$ classes, if $D$ is the decision tree returned by GreedyTree using hinged-Pairs (setting $\alpha = \epsilon n$) applied on the set $S$ of objects, then we have the following:

$$\mathrm{Cost}_{P_\alpha}(S) \le O(\log n)\, \mathrm{OPT}_{P_\alpha:0}(S) \le O(\log n)\, \mathrm{OPT}_{E:\epsilon}(S).$$

###### Proof.

The first inequality follows from Theorem 2.2 and the second inequality follows from Lemma 4.5. ∎

The above theorem states that for a given error parameter $\epsilon$, a greedy tree can be constructed using hinged-Pairs by setting $\alpha = \epsilon n$, with the max-cost guaranteed to be within an $O(\log n)$ factor of the best possible max-cost among all decision trees that have classification error less than or equal to $\epsilon$. To our knowledge this is the first bound relating classification error to cost, which provides a theoretical basis for the accuracy-cost trade-off.

## 5 Experimental Results

We first demonstrate the effect of outliers using a simple synthetic example, where a small set of outliers dramatically increases the max-cost of the tree. We show that allowing a small number of errors in the tree drastically reduces the cost of the tree, allowing for efficient trees to be constructed in the presence of outliers. Next, we demonstrate the ability to construct decision trees on real-world data sets. We observe behavior similar to that of the synthetic data set on many of these data sets, where allowing a small amount of error results in trees with significantly lower cost. Additionally, we see the effect of impurity function choice on the performance of the trees. For all real data sets, we present the performance of the Powers impurity function of Eq. (9), with error introduced by early stopping, as well as the hinged-Pairs impurity function of Eq. (10), with error introduced by varying the parameter $\alpha$.

Synthetic Example: Here we consider a multi-class classification example to demonstrate the effect a small set of objects can have on the max-cost of the tree. Consider a data set composed of 1024 objects belonging to 4 classes with 10 binary tests available. Assume that the set of tests is complete, that is, no two objects have the same set of test outcomes. Note that by fixing the order of the tests, the set of test outcomes maps each object to an integer in the range $[0, 1023]$. From this mapping, the objects in four ranges of integers are given the four class labels, one label per range, and four individual objects are labeled separately (Figure 2 shows the data projected to the first two tests). Suppose each test carries a unit cost. By Kraft's Inequality [Cover and Thomas, 1991], the optimal max-cost in order to correctly classify every object is 10; however, using only the two tests selected by the greedy algorithm leads to a correct classification of all but 4 objects, as shown in Figure 3. For this type of data set, a constant-sized set of outliers can change a tree with a constant max-cost into a tree with a much larger max-cost (here 10 instead of 2).

Data Sets: We compare performance using 9 data sets from the UCI Repository [Frank and Asuncion, 2010]. We assume that all tests (features) have a uniform cost. For each data set, we replace non-unique objects with a single instance using the most common label for the objects, allowing every data set to be complete (perfectly classified by the decision trees). Additionally, continuous features are transformed to discrete features by quantizing to 10 uniformly spaced levels. More details on the data sets used can be found in the Appendix.
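The two preprocessing steps described above can be sketched as follows. This is one plausible reading, not the authors' code: it assumes that "10 uniformly spaced levels" means equal-width bins between the minimum and maximum of each feature, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def quantize(values, levels=10):
    """Map continuous values to integer levels 0..levels-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant feature: a single level
    width = (hi - lo) / levels
    return [min(int((v - lo) / width), levels - 1) for v in values]


def deduplicate(rows, labels):
    """Collapse identical feature rows into one object with the most common label."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[tuple(row)][label] += 1
    return [(row, counter.most_common(1)[0][0]) for row, counter in votes.items()]


print(quantize([0, 5, 10]))  # [0, 5, 9]
rows = [(1, 2), (1, 2), (1, 2), (3, 4)]
labels = ["a", "b", "a", "c"]
print(deduplicate(rows, labels))  # [((1, 2), 'a'), ((3, 4), 'c')]
```

After deduplication no two objects share the same feature vector, so the set of tests separates every pair of distinct objects, which is the completeness condition the analysis relies on.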

Error vs. Cost Trade-Off: Fig. 4 shows the trade-off between classification error and max-cost, which suggests two key trends. First, it appears that many data sets, such as house votes, Statlog DNA, Wisconsin breast cancer, and mammography, can be classified with minimal error using few tests. Intuitively, this small error appears to correspond to a small subset of outlier objects that require a large number of tests to classify correctly, while the majority of the data can be classified with a small number of tests. Second, empirical evidence suggests that the optimal choice of impurity function depends on the desired max-cost of the tree. For trees with a smaller budget (and therefore lower depth), the hinged-Pairs impurity function outperforms the Powers impurity function with early stopping, whereas for a larger budget (and greater depth), the Powers impurity function outperforms hinged-Pairs. This matches our intuitive understanding of the impurity functions, as the Powers impurity function is biased towards tests that evenly divide the data, whereas hinged-Pairs puts more emphasis on classification performance.

## 6 Conclusion

We characterize a broad class of admissible impurity functions that can be used in a greedy algorithm to yield $O(\log n)$ guarantees of the optimal max-cost. We give examples of such admissible functions and demonstrate that they have different empirical properties even though they all enjoy the $O(\log n)$ guarantee. We further design admissible functions to allow for an accuracy-cost trade-off and provide a bound relating classification error to cost. Finally, through real-world data sets we demonstrate that our algorithm can indeed censor the outliers and achieve high classification accuracy using low max-cost. To visualize such outliers we construct a 2-D synthetic experiment and show that our algorithm successfully identifies them as outliers.

## References

• [Bellala et al., 2012] Bellala, G., Bhavnani, S., and Scott, C. (2012). Group-based active query selection for rapid diagnosis in time-critical situations. Information Theory, IEEE Transactions on, 58(1):459–478.
• [Chakaravarthy et al., 2011] Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., and Mohania, M. K. (2011). Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans. Algorithms, 7(2):15:1–15:22.
• [Cicalese et al., 2014] Cicalese, F., Laber, E. S., and Saettler, A. M. (2014). Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Proceedings, pages 414–422. JMLR.org.
• [Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience, New York, NY, USA.
• [Dasgupta, 2004] Dasgupta, S. (2004). Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344. MIT Press.
• [Frank and Asuncion, 2010] Frank, A. and Asuncion, A. (2010). UCI machine learning repository.
• [Golovin and Krause, 2011] Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42:427–486.
• [Golovin et al., 2010] Golovin, D., Krause, A., and Ray, D. (2010). Near-optimal bayesian active learning with noisy observations. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 766–774.
• [Guillory and Bilmes, 2010] Guillory, A. and Bilmes, J. A. (2010). Interactive submodular set cover. In Fürnkranz, J. and Joachims, T., editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 415–422. Omnipress.
• [Gupta et al., 2010] Gupta, A., Nagarajan, V., and Ravi, R. (2010). Approximation algorithms for optimal decision trees and adaptive tsp problems. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming, ICALP’10, pages 690–701, Berlin, Heidelberg. Springer-Verlag.
• [Hanneke, 2006] Hanneke, S. (2006). The cost complexity of interactive learning. unpublished.
• [Kosaraju et al., 1999] Kosaraju, S. R., Przytycka, T. M., and Borgstrom, R. S. (1999). On an optimal split tree problem. In Proceedings of the 6th International Workshop on Algorithms and Data Structures, WADS ’99, pages 157–168, London, UK. Springer-Verlag.
• [Moshkov, 2010] Moshkov, M. J. (2010). Greedy algorithm with weights for decision tree construction. Fundam. Inf., 104(3):285–292.
• [Nowak, 2008] Nowak, R. (2008). Generalized binary search. In Proceedings of the 46th Allerton Conference on Communications, Control, and Computing, pages 568–574.
• [Saettler et al., 2014] Saettler, A., Laber, E., and Cicalese, F. (2014). Trading off worst and expected cost in decision tree problems and a value dependent model. ArXiv, pages 1–13.

## Appendix

### Proof of Lemma 3.4

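Before showing admissibility of the hinged-Pairs function in the multiclass setting, we first show that $P_\alpha$ is admissible for the binary setting.

###### Lemma .1.

Consider the binary classification setting and let

$$P_\alpha(G) = \left[ \left[n_G^1 - \alpha\right]_+ \left[n_G^2 - \alpha\right]_+ - \alpha^2 \right]_+.$$

Then $P_\alpha$ is admissible.

###### Proof.

All the properties are obviously true except supermodularity. To show supermodularity, suppose $R \subseteq G$ and object $j \notin G$. Suppose $j$ belongs to the first class. We need to show

$$P_\alpha(G \cup j) - P_\alpha(G) \ge P_\alpha(R \cup j) - P_\alpha(R). \quad (11)$$

Consider 4 cases:
(1) $P_\alpha(R \cup j) = 0$: The right hand side of (11) is 0 and (11) holds because of monotonicity of $P_\alpha$.
(2) $P_\alpha(G) = 0$: Then $P_\alpha(R) = 0$ as well, and (11) reduces to $P_\alpha(G \cup j) \ge P_\alpha(R \cup j)$, which is true by monotonicity.
(3) $P_\alpha(R) = 0$, $P_\alpha(R \cup j) > 0$, $P_\alpha(G) > 0$: Note that $P_\alpha(G) > 0$ implies that $n_G^1 > \alpha$ and $n_G^2 > \alpha$, which further implies $P_\alpha(G \cup j) > 0$. Thus the left hand side is

$$P_\alpha(G \cup j) - P_\alpha(G) = (n_G^1 - \alpha + 1)(n_G^2 - \alpha) - \alpha^2 - \left((n_G^1 - \alpha)(n_G^2 - \alpha) - \alpha^2\right) = n_G^2 - \alpha.$$

The right hand side is

$$P_\alpha(R \cup j) = (n_R^1 - \alpha + 1)(n_R^2 - \alpha) - \alpha^2 = (n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2 + (n_R^2 - \alpha).$$

Since $P_\alpha(R \cup j) > 0$ we have $n_R^2 > \alpha$. If $n_R^1 \ge \alpha$, then $(n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2 \le 0$ because $P_\alpha(R) = 0$; if $n_R^1 < \alpha$, the product $(n_R^1 - \alpha)(n_R^2 - \alpha)$ is negative. In both cases $P_\alpha(R \cup j) - P_\alpha(R) \le n_R^2 - \alpha \le n_G^2 - \alpha$.
(4) $P_\alpha(R) > 0$: We have

$$P_\alpha(G \cup j) - P_\alpha(G) = n_G^2 - \alpha \ge n_R^2 - \alpha = P_\alpha(R \cup j) - P_\alpha(R).$$

This completes the proof. ∎

Now we are ready to generalize from the binary hinged-Pairs function to the multiclass hinged-Pairs function. Again, all properties are obviously true except supermodularity. Supermodularity follows from the fact that each term in the sum is supermodular according to Lemma .1.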

### Proof of Lemma 4.4

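We begin by considering any leaf $L$ of $D$. Suppose $j$ is the largest class in $L$. For $i \neq j$, if $n_L^i > \alpha$, we have

$$\left[ \left[n_L^i - \alpha\right]_+ \left[n_L^j - \alpha\right]_+ - \alpha^2 \right]_+ = \max\left(n_L^i n_L^j - \alpha\left(n_L^i + n_L^j\right),\, 0\right) = 0,$$

which implies $\frac{n_L^i n_L^j}{n_L^i + n_L^j} \le \alpha$. So

$$n_L^i \le k\, \frac{n_L^i\, n_L^j}{n_L} \le k\alpha = k\epsilon n.$$

If $n_L^i \le \alpha$, we have $n_L^i \le \alpha \le k\epsilon n$. Let $\tilde{n}_L$ be the number of objects in leaf $L$ that are not from the majority class: $\tilde{n}_L = n_L - n_L^j$. So for any leaf $L$ we have $\tilde{n}_L \le k(k-1)\epsilon n$.

Now we enumerate the leaves of $D$ in non-increasing order according to the number of objects they contain. Let $A$ be the set of the first $l_\eta$ leaves. By definition of $l_\eta$, the total number of objects contained in $A$ is more than $(1 - \eta) n$.

The overall error bound is obtained by considering the leaves in $A$ and in the complement $\bar{A}$ separately:

$$\frac{\sum_{L \in A} \tilde{n}_L + \sum_{L \in \bar{A}} \tilde{n}_L}{n} \le \frac{k(k-1)\, \epsilon\, l_\eta\, n + \frac{k-1}{k}\, \eta\, n}{n} = k(k-1)\, l_\eta\, \epsilon + \frac{k-1}{k}\, \eta,$$

where we have used the fact that $\tilde{n}_L \le \frac{k-1}{k}\, n_L$ and that $\sum_{L \in \bar{A}} n_L \le \eta n$.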

### Details of Computation in Figure 1

If Pairs is used, we can compute the impurity of each set of interest; applying the selection rule of Algorithm 1 to these values, test $t_2$ will be chosen. On the other hand, computing the impurities for the hinged-Pairs function in the same way, test $t_1$ will be chosen. The above example shows that Pairs has a stronger preference for balanced tests and may in some cases lead to poor classification results.

### Details of Data Sets

The house votes data set is composed of the voting records for 435 members of the U.S. House of Representatives (342 unique voting records) on 16 measures, with a goal of identifying the party of each member. The sonar data set contains 208 sonar signatures, each composed of energy levels (quantized to 10 levels) in 60 different frequency bands, with a goal of identifying the type of object that produced each signature. The ionosphere data set has 351 (350 unique) radar returns, each composed of 34 responses (quantized to 10 levels), with a goal of identifying if an event represents a free electron in the ionosphere. The Statlog DNA data set is composed of 3186 (3001 unique) DNA sequences with 180 features, with a goal of predicting whether the sequence represents a boundary of DNA to be spliced in or out. The Boston housing data set contains 13 attributes (quantized to 10 levels) pertaining to 506 (469 unique) different neighborhoods around Boston, with a goal of predicting into which quartile the median income of the neighborhood falls. The soybean data set is composed of 307 examples (303 unique) with 34 categorical features, with a goal of predicting which of 19 diseases is afflicting the soybean plant. The pima data set is composed of 8 features (with continuous features quantized to 10 levels) corresponding to medical information and tests for 768 patients (753 unique feature patterns), with a goal of diagnosing diabetes. The Wisconsin breast cancer data set contains 30 features corresponding to properties of a cell nucleus for 569 samples, with a goal of identifying if the cell is malignant or benign. The mammography data set contains 6 features from mammography scans (with age quantized into 10 bins) for 830 patients, with a goal of classifying the lesions as malignant or benign.
