Decision Trees for Function Evaluation
Simultaneous Optimization of
Worst and Expected Cost
(A preliminary version of this paper was accepted for presentation at ICML 2014.)
In several applications of automatic diagnosis and active learning, a central problem is the evaluation of a discrete function by adaptively querying the values of its variables until the values read uniquely determine the value of the function. In general, reading the value of a variable might involve some cost, either computational or a fee to be paid for the experiment required to obtain the value. This cost should be taken into account when deciding which variable to read next. The goal is to design a strategy for evaluating the function that incurs little cost (in the worst case or in expectation, according to a prior distribution on the possible assignments to the variables).
Our algorithm builds a strategy (decision tree) which attains a logarithmic approximation simultaneously for the expected and the worst-case cost spent. This is best possible under the assumption that P ≠ NP.
In order to introduce the problem we analyze in the paper, let us start with some motivating examples.
In high-frequency trading, an automatic agent decides the next action to be performed, such as sending or canceling a buy/sell order, on the basis of some market variables as well as private variables (e.g., stock price, traded volume, volatility, order book distributions, as well as complex relations among these variables). For instance, in , the trading strategy is learned in the form of a discrete function, described as a table, that has to be evaluated whenever a new scenario is faced and an action (sell/buy) has to be taken. The rows of the table represent the possible scenarios of the market and the columns represent the variables taken into account by the agent to distinguish among the different scenarios. For each scenario, there is an associated action. Every time an action needs to be taken, the agent can identify the scenario by computing the value of every single variable and then proceed with the associated action. However, recomputing all the variables every time might be very expensive. By taking into account the structure of the function/table together with information on the probability distribution over the scenarios of the market, and the fact that some variables are more expensive (or time-consuming) to compute than others, the algorithm can limit itself to recomputing only some variables whose values determine the action to be taken. Such an approach can significantly speed up the evaluation of the function. Since market conditions change on a millisecond basis, being able to react very quickly to a new scenario is the key to a profitable strategy.
In a classical Bayesian active learning problem, the task is to select the right hypothesis from a possibly very large set H. Each hypothesis h ∈ H is a mapping from a set Q, called the query/test space, to a set L of labels. It is assumed that the functions in H are pairwise distinct, i.e., for each pair of them there is at least one point of Q where they differ. There is one function h* which provides the correct labeling of the space, and the task is to identify it through queries/tests. A query/test coincides with an element q ∈ Q, and its result is the value h*(q). Each test has an associated cost that must be paid in order to acquire the response, since the process of labeling an example may be expensive in terms of either time or money (e.g., annotating a document). The goal is to identify the correct hypothesis spending as little as possible. For instance, in automatic diagnosis, H represents the set of possible diagnoses and Q the set of symptoms or medical tests, with h* being the exact diagnosis that has to be reached while limiting the cost of the examinations.
In , a more general variant of the problem was considered where, rather than the diagnosis, it is important to identify the therapy (e.g., in cases of poisoning it is important to quickly understand which antidote to administer rather than to identify the exact poison). This problem can be modeled by defining a partition of H, with each class of the partition representing a subset of diagnoses which require the same therapy. The problem is then to identify the class of h* rather than h* itself. This model has also been studied by Golovin et al.  to tackle the problem of erroneous test responses in Bayesian active learning.
The above examples can all be cast into the following general problem.
The Discrete Function Evaluation Problem (DFEP). An instance of the problem is defined by a quintuple (O, C, T, p, c), where O is a set of objects, C is a partition of O into classes, T is a set of tests, p is a probability distribution on O, and c is a cost function assigning to each test t a cost c(t). A test t, when applied to an object o, incurs cost c(t) and outputs a number from a finite set of possible outcomes. It is assumed that the set of tests is complete, in the sense that for any two distinct objects there exists a test whose outcomes on the two objects differ. The goal is to define a testing procedure which uses tests from T and minimizes the testing cost (in expectation and/or in the worst case) for identifying the class of an unknown object chosen according to the distribution p.
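The completeness assumption is easy to check on a concrete instance. The sketch below uses an illustrative dictionary encoding of the test outcomes (the encoding, not the paper's notation, is our assumption):

```python
def is_complete(outcomes):
    """Check the completeness assumption of the DFEP: every two
    distinct objects are told apart by at least one test.

    `outcomes[o][t]` is the outcome of test t on object o.
    """
    objs = list(outcomes)
    tests = list(outcomes[objs[0]]) if objs else []
    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            # If no test distinguishes a from b, the set of tests
            # cannot identify the class of an arbitrary object.
            if all(outcomes[a][t] == outcomes[b][t] for t in tests):
                return False
    return True
```

For example, an instance with three objects and two binary tests is complete as long as no two objects share the same outcome vector.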
The DFEP can be rephrased in terms of minimizing the cost of evaluating a discrete function that maps points (corresponding to objects) from some finite subset of the product of the tests' outcome sets into values (corresponding to classes), where an object corresponds to the point obtained by applying each test of T to the object.
This perspective motivates the name we chose for the problem. However, for the sake of uniformity with more recent work [15, 4] we employ the definition of the problem in terms of objects/tests/classes.
Decision Tree Optimization. Any testing procedure can be represented by a decision tree, which is a tree where every internal node is associated with a test and every leaf is associated with a set of objects that belong to the same class. More formally, a decision tree D for a set of objects S is a leaf associated with class C if every object of S belongs to the same class C. Otherwise, the root r of D is associated with some test t, and the children of r are decision trees for the sets S_1, ..., S_l, where S_i is the subset of S whose objects output i for test t.
Given a decision tree D, rooted at r, we can identify the class of an unknown object o by following a path from r to a leaf as follows: first, we ask for the result of the test associated with r when performed on o; then, we follow the branch of r associated with the result of the test to reach a child r' of r; next, we apply the same steps recursively to the decision tree rooted at r'. The procedure ends when a leaf is reached, which determines the class of o.
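The evaluation procedure can be sketched as follows, using a simple dictionary-based tree representation (the node fields `test`, `cost`, `children`, and `label` are illustrative, not taken from the paper):

```python
def classify(tree, apply_test):
    """Identify the class of the hidden object by walking the tree.

    `tree` is either {"label": C} (a leaf) or
    {"test": t, "cost": c_t, "children": {outcome: subtree}}.
    `apply_test(t)` returns the outcome of test t on the hidden object.
    Returns (class_label, total_testing_cost).
    """
    cost = 0
    node = tree
    while "label" not in node:
        outcome = apply_test(node["test"])  # pay for the test at this node
        cost += node["cost"]
        node = node["children"][outcome]    # follow the matching branch
    return node["label"], cost
```

The total cost accumulated along the walk is exactly the testing cost of the object in this tree, which is the quantity the algorithms below try to keep small.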
We define cost(D, o) as the sum of the costs of the tests on the root-to-leaf path from the root of D to the leaf associated with object o. Then, the worst testing cost and the expected testing cost of D are, respectively, defined as

worstcost(D) = max over o ∈ O of cost(D, o),    expcost(D) = Σ over o ∈ O of p(o) · cost(D, o).
Figure 1 shows an instance of the DFEP and a decision tree for it, along with the tree's worst testing cost and expected testing cost.
Our Results. Our main result is an algorithm that builds a decision tree whose expected testing cost and worst testing cost are at most O(log n) times the minimum possible expected testing cost and the minimum possible worst testing cost, respectively, where n is the number of objects. In other words, the decision tree built by our algorithm achieves simultaneously the best approximation achievable with respect to both the expected testing cost and the worst testing cost. In fact, for the special case where each object defines a distinct class—known as the identification problem—neither the minimization of the expected testing cost nor the minimization of the worst testing cost admits a sublogarithmic approximation unless P = NP, as shown in  and in , respectively. In addition, in Section 4, we show that the same inapproximability result holds in general for any fixed number of classes.
It should be noted that, in general, there are instances for which the decision tree that minimizes the expected testing cost has worst testing cost much larger than that achieved by the decision tree with minimum worst testing cost, and there are instances where the converse happens. Therefore, it is reasonable to ask whether it is possible to construct decision trees that are efficient with respect to both performance criteria. This may be important in practical applications where only an inaccurate estimate of the probability distribution is available. Also, in medical applications like the one described in , a very high cost (equivalently, a significantly time-consuming therapy identification) might have disastrous or even deadly consequences. In such cases, besides minimizing the expected testing cost, it is important to guarantee that the worst testing cost is also not large compared with the optimal worst testing cost.
With respect to the minimization of the expected testing cost, our result improves upon the previous O(log(1/p_min)) approximation shown in  and , where p_min is the minimum positive probability among the objects in O. From the results in those papers, a logarithmic approximation could be attained only in the particular case of uniform costs, via a technique used in .
From a high-level perspective, our method closely follows the one used by Gupta et al.  for obtaining the O(log n) approximation for the expected testing cost in the identification problem. Both constructions of the decision tree consist of building a path (backbone) that splits the input instance into smaller ones, for which decision trees are recursively constructed and attached as children of the nodes in the path.
A closer look, however, reveals that our algorithm is much simpler than the one presented in . First, it is more transparently linked to the structure of the problem, which remained somewhat hidden in , where the result was obtained via an involved mapping from adaptive TSP. Second, our algorithm avoids expensive computational steps such as the Sviridenko procedure  and some non-intuitive/redundant steps that are used to select the tests for the backbone of the tree. In fact, we believe that providing an algorithm that is much simpler to implement, together with an alternative proof of the result in , is an additional contribution of this paper.
State of the art. The DFEP has recently been studied under the names of class equivalence problem  and group identification problem , and long before it had been described in the excellent survey by Moret . Both  and  give approximation algorithms for the version of the DFEP where the expected testing cost has to be minimized and both the probabilities and the testing costs are non-uniform. In addition, when the testing costs are uniform, both algorithms can be converted into O(log n)-approximation algorithms via Kosaraju's approach . The algorithm in  is more general because it addresses multiway tests rather than binary ones. For the minimization of the worst testing cost, Moshkov studied the problem in the general case of multiway tests and non-uniform costs and provided an O(log n)-approximation in . The same paper also proves that no o(log n)-approximation algorithm is possible under the standard complexity assumption P ≠ NP. The minimization of the worst testing cost is also investigated in  under the framework of covering and learning.
The particular case of the DFEP where each object belongs to a different class—known as the identification problem—has been more extensively investigated [11, 1, 5, 6]. Neither the minimization of the worst testing cost nor that of the expected testing cost admits a sublogarithmic approximation unless P = NP, as proved by  and . For the expected testing cost, in the variant with multiway tests, non-uniform probabilities, and non-uniform testing costs, an approximation is given by Guillory and Bilmes in . Gupta et al.  improved this result to O(log n), employing new techniques that do not rely on Generalized Binary Search (GBS)—the basis of all previous strategies.
Approximation algorithms for the minimization of the worst testing cost for the identification problem have been given by Arkin et al.  for binary tests and uniform costs, and by Hanneke  for the case of multiway tests and non-uniform testing costs.
In the case of Boolean functions, the DFEP is also known as Stochastic Boolean Function Evaluation (SBFE), where the distribution over the possible assignments is a product distribution, defined by assuming that each variable has a given probability of being one, independently of the values of the other variables. Another difference with respect to the DFEP as presented here is that in Stochastic Boolean Function Evaluation the common assumption is that the complete set of associations between the assignments of the variables and the values of the function is provided, either directly or via a representation of the function, e.g., in terms of its DNF or CNF. The present definition of the DFEP considers the more general problem where only a sample of the Boolean function is given, and from this we want to construct a decision tree with minimum expected cost that exactly fits the sample.
Results on the exact solution of the SBFE problem for different classes of Boolean functions can be found in the survey paper . In a recent paper, Deshpande et al.  provide a constant-factor approximation algorithm for evaluating Boolean linear threshold formulas and an O(log kd)-approximation algorithm for the evaluation of CDNF formulas, where k is the number of clauses of the input CNF and d is the number of terms of the input DNF. The same result had previously been obtained by Kaplan et al.  for the case of monotone formulas and the uniform distribution (in a slightly different setting). Both algorithms of  are based on reducing the problem to Stochastic Submodular Set Cover, introduced by Golovin and Krause , and on providing a new algorithm for this latter problem.
Other special cases of the DFEP like the evaluation of AND/OR trees (a.k.a. read-once formulas) and the evaluation of Game Trees (a central task in the design of game procedures) are discussed in [36, 34, 17]. In , Charikar et al. considered discrete function evaluation from the perspective of competitive analysis; results in this alternative setting are also given in [24, 8].
Given an instance I of the DFEP, we will denote by OPT_E(I) (resp. OPT_W(I)) the expected testing cost (resp. worst testing cost) of a decision tree with minimum possible expected testing cost (resp. worst testing cost) over the instance I. When the instance is clear from the context, we will also write OPT_E(O) (resp. OPT_W(O)) for the above quantities, referring only to the set O of objects involved. We use p_min to denote the smallest non-zero probability among the objects in O.
Let I = (O, C, T, p, c) be an instance of the DFEP and let S be a subset of O. In addition, let p' and C' be, respectively, the restrictions of p and C to the set S. Our first observation is that every decision tree for the instance I is also a decision tree for the instance (S, C', T, p', c). The following proposition immediately follows.
Let I be an instance of the DFEP and let S be a subset of O. Then, OPT_E(S) ≤ OPT_E(O) and OPT_W(S) ≤ OPT_W(O), where OPT_E(S) and OPT_W(S) refer to the instance obtained by restricting p and C to S.
One of the measures of progress of our strategy is expressed in terms of the number of pairs of objects belonging to different classes that are present in the set of objects consistent with the tests already performed. The following definition formalizes this concept of pairs for a given set of objects.
Definition 1 (Pairs).
Let I be an instance of the DFEP and let S be a set of objects. We say that two objects constitute a pair of S if they both belong to S but come from different classes. We denote by P(S) the number of pairs of S. In formulae, we have

P(S) = Σ over 1 ≤ i < j ≤ k of n_i(S) · n_j(S),

where k is the number of classes and n_i(S) denotes the number of objects in S belonging to class i.
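The quantity P(S) is cheap to compute from the class counts alone, since the double sum equals (|S|² − Σ_i n_i(S)²)/2. A minimal sketch:

```python
from collections import Counter

def num_pairs(classes):
    """Number of pairs of objects from different classes.

    `classes` lists the class label of each object in S.  P(S) equals
    (|S|^2 - sum_i n_i^2) / 2, where n_i counts the objects of class i:
    all unordered pairs minus the same-class ones.
    """
    counts = Counter(classes)
    n = len(classes)
    return (n * n - sum(v * v for v in counts.values())) // 2
```

For instance, four objects with class labels A, A, B, C form 6 unordered pairs, one of which is same-class, so P(S) = 5.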
As an example, the number of pairs and the corresponding set of pairs for the set of objects in Figure 1 can be read directly from this definition.
We will use o* to denote the initially unknown object whose class we want to identify. Let t_1, ..., t_j be a sequence of tests applied to identify the class of o* (it corresponds to a path in the decision tree), and let S be the set of objects that agree with the outcomes of all tests in the sequence. If P(S) = 0, then all objects in S belong to the same class, which must coincide with the class of the selected object o*. Hence, P(S) = 0 indicates the identification of the class of the object o*. Notice that o* itself might still be unknown when the condition P(S) = 0 is reached.
For each test t and each possible outcome i, let S_i^t be the set of objects of S for which the outcome of test t is i. For a test t, the outcome resulting in the largest number of pairs is of special interest for our strategy. We denote by σ(t, S) the set among S_1^t, S_2^t, ... with the largest number of pairs (ties are broken arbitrarily). We denote by σ̄(t, S) the set of objects not included in σ(t, S), i.e., we define σ̄(t, S) = S \ σ(t, S). Whenever S is clear from the context, we write σ(t) and σ̄(t) instead of σ(t, S) and σ̄(t, S).
Given a set of objects S, each test t produces a tripartition of the pairs of S: those with both objects in σ(t, S) (the outcome set of t with the largest number of pairs), those with both objects in σ̄(t, S) = S \ σ(t, S), and those with one object in each. We say that the pairs with both objects in σ̄(t, S) are kept by t, and the pairs with one object from σ(t, S) and one object from σ̄(t, S) are separated by t. We also say that a pair is covered by the test t if it is either kept or separated by t. Analogously, we say that a test t covers an object o if o ∈ σ̄(t, S).
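The tripartition can be sketched concretely. The helper below picks the outcome group with the largest number of pairs as σ(t, S) and then counts kept and separated pairs (the data layout is our own illustrative choice):

```python
from itertools import combinations

def pair_count(group):
    """Pairs of objects from different classes within `group`."""
    return sum(1 for (_, c1), (_, c2) in combinations(group, 2) if c1 != c2)

def kept_separated(objects, outcome):
    """Classify the pairs of `objects` with respect to one test.

    `objects` is a list of (object_id, class_label); `outcome` maps
    object_id to the test's outcome.  sigma is the outcome group with
    the largest number of pairs; pairs with both objects outside sigma
    are *kept*, pairs split between sigma and its complement are
    *separated*, and covered = kept + separated.
    """
    groups = {}
    for o in objects:
        groups.setdefault(outcome[o[0]], []).append(o)
    sigma = max(groups.values(), key=pair_count)   # ties broken arbitrarily
    sigma_ids = {o[0] for o in sigma}
    kept = separated = 0
    for (a, ca), (b, cb) in combinations(objects, 2):
        if ca == cb:
            continue                                # same class: not a pair
        inside = (a in sigma_ids) + (b in sigma_ids)
        if inside == 0:
            kept += 1
        elif inside == 1:
            separated += 1
    return kept, separated
```

Pairs with both objects inside σ(t, S) are exactly the ones left uncovered, i.e., the pairs that survive into the largest branch.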
For any set of objects S, the probability of S is p(S) = Σ over o ∈ S of p(o).
3 Logarithmic approximation for the Expected Testing Cost and the Worst Case Testing Cost
In this section, we describe our algorithm DecTree and analyze its performance. The concept of the separation cost of a sequence of tests will turn out to be useful for defining and analyzing our algorithm.
The separation cost of a sequence of tests. Given an instance I of the DFEP, for a sequence of tests Q = t_1, ..., t_m we define the separation cost of Q in the instance I, denoted by sepcost_I(Q), as follows. Fix an object o, and let S_j denote the set of objects surviving the first j − 1 tests, i.e., S_1 = O and S_{j+1} = σ(t_j, S_j). If there exists an index j such that o ∈ σ̄(t_j, S_j), then, taking the first such j, we set cost_Q(o) = c(t_1) + ... + c(t_j). If o ∈ σ(t_j, S_j) for each j, then we set cost_Q(o) = c(t_1) + ... + c(t_m). Thus cost_Q(o) denotes the cost of separating o in the instance I by means of the sequence Q. Then, the separation cost of Q (in the instance I) is defined by

sepcost_I(Q) = Σ over o ∈ O of p(o) · cost_Q(o).
In addition, we define c(Q) as the total cost of the sequence Q, i.e., c(Q) = c(t_1) + ... + c(t_m).
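The definition above can be sketched as code. The function `sigma` below stands for the map (t, S) ↦ σ(t, S) and is passed in as a callable; charging each object the prefix cost up to the first test that covers it follows the definition directly:

```python
def separation_cost(sequence, objects, prob, sigma):
    """Separation cost of a test sequence, following the definition above.

    `sequence` is a list of (test, cost) pairs; `sigma(test, S)` returns
    the surviving (largest-pair-count) subset of S under `test`.  An
    object is charged the total cost of the prefix of tests up to the
    first one that covers it (drops it from the surviving set); objects
    never covered pay the cost of the whole sequence.
    """
    total = {o: 0.0 for o in objects}
    S = set(objects)
    prefix = 0.0
    for test, cost in sequence:
        prefix += cost
        survivors = set(sigma(test, S))
        for o in S - survivors:
            total[o] = prefix          # first test covering o
        S = survivors
    for o in S:                        # objects never covered
        total[o] = prefix
    return sum(prob[o] * total[o] for o in objects)
```

The total cost c(Q) is simply the final value of `prefix`, i.e., the sum of all test costs in the sequence.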
Lower bounds on the cost of an optimal decision tree for the DFEP. We denote by sepcost*(I) the minimum separation cost in I attainable by a sequence of tests which covers all the pairs of O, and by totcost*(I) the minimum total cost attainable by a sequence of tests which covers all the pairs of O.
The following theorem gives lower bounds on both the expected testing cost and the worst testing cost for any instance of the DFEP.
Theorem 1. For any instance I of the DFEP, it holds that OPT_E(I) ≥ sepcost*(I) and OPT_W(I) ≥ totcost*(I).
Let D be a decision tree for the instance I. Let v_1, ..., v_m be the nodes on the root-to-leaf path of D such that, for each i, the node v_{i+1} is the child of v_i on the branch stemming from v_i which is associated with σ(t_i), where t_i is the test associated with v_i, and the leaf is the child of v_m associated with the objects in σ(t_m).
Abusing notation, let us denote by Q = t_1, ..., t_m the resulting sequence of tests. In particular, Q is the sequence of tests performed according to the strategy defined by D when the object whose class we want to identify is such that it belongs to σ(t_i) for each test t_i performed in the sequence.
Notice that, by construction, Q is a sequence of tests covering all the pairs of O.
Claim. For each object o, it holds that cost_Q(o) ≤ cost(D, o).
Indeed, if o ∈ σ(t_i) for each i, then the root-to-leaf path followed when o is the chosen object is exactly v_1, ..., v_m, so that cost_Q(o) = c(t_1) + ... + c(t_m) = cost(D, o). Conversely, let t_j be the first test in Q for which o ∉ σ(t_j). Then t_1, ..., t_j is a prefix of the sequence of tests on the root-to-leaf path followed when o is the object chosen. It follows that cost_Q(o) = c(t_1) + ... + c(t_j) ≤ cost(D, o). The claim is proved.
In order to prove the first statement of the theorem, we let D be a decision tree which achieves the minimum possible expected cost, i.e., expcost(D) = OPT_E(I). Then, we have

OPT_E(I) = Σ over o ∈ O of p(o) · cost(D, o) ≥ Σ over o ∈ O of p(o) · cost_Q(o) = sepcost_I(Q) ≥ sepcost*(I),

where the first inequality follows from the above claim and the last one from the fact that Q covers all the pairs of O.
In order to prove the second statement of the theorem, we let D be a decision tree which achieves the minimum possible worst testing cost, i.e., worstcost(D) = OPT_W(I). Let o be an object such that o ∈ σ(t_i) for each i. Then, by the above claim, it follows that

OPT_W(I) ≥ cost(D, o) = cost_Q(o) = c(Q) ≥ totcost*(I).

The proof is complete. ∎
The following subadditivity property will be useful.
Proposition 2 (Subadditivity).
Let O_1, ..., O_r be a partition of the object set O. We have Σ over i of OPT_E(O_i) ≤ OPT_E(O) and max over i of OPT_W(O_i) ≤ OPT_W(O), where OPT_E(O_i) and OPT_W(O_i) are, respectively, the minimum expected testing cost and the minimum worst testing cost when the set of objects is O_i.
The optimization of submodular functions of sets of tests. Let I = (O, C, T, p, c) be an instance of the DFEP. A set function f : 2^T → ℝ is submodular and non-decreasing if, for every A ⊆ B ⊆ T and every t ∈ T \ B, it holds that f(A ∪ {t}) − f(A) ≥ f(B ∪ {t}) − f(B) (submodularity) and f(A) ≤ f(B) (non-decreasing).
It is easy to verify that the functions mapping a set of tests A ⊆ T into, respectively, the number of pairs covered by the tests in A and the probability of the set of objects covered by the tests in A are non-negative, non-decreasing, submodular set functions. In words, the first function counts the pairs covered by the tests in A, while the second maps A into the probability of the set of objects covered by the tests in A.
Let B be a positive integer. Consider the following optimization problem, defined over a non-negative, non-decreasing, submodular function f:

maximize f(A) subject to A ⊆ T and Σ over t ∈ A of c(t) ≤ B.    (7)
The following theorem summarizes results from [Theorems 2 and 3].
Let G be the sequence of all the tests selected by Adapted-Greedy, i.e., the concatenation of the two possible outputs in line 7. Then, the total cost of the tests in G is bounded in terms of the budget B, and the value of f on the selected tests is within a constant factor of the optimum of problem (7).
Our algorithm for building a decision tree will employ this greedy heuristic to find approximate solutions to the optimization problem (7) over the submodular set functions defined above.
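Since Algorithm 1 itself is not reproduced in this excerpt, the sketch below shows the standard cost-benefit greedy with the best-single-element safeguard that such budgeted schemes use; the names and the exact stopping rule (stop at the first test that would overflow the budget) are our assumptions, not the paper's:

```python
def adapted_greedy(tests, cost, f, budget):
    """Greedy cost-benefit heuristic for max{ f(A) : c(A) <= B }.

    Repeatedly picks the test maximizing marginal gain per unit cost
    (costs assumed positive) while the budget is not exhausted, then
    returns the better of the greedy set and the best single affordable
    test -- the standard safeguard behind constant-factor guarantees
    for budgeted submodular maximization.
    """
    A, spent = [], 0
    remaining = [t for t in tests if cost[t] <= budget]
    while remaining:
        best = max(remaining, key=lambda t: (f(A + [t]) - f(A)) / cost[t])
        if f(A + [best]) - f(A) <= 0:
            break                       # no test adds value
        if spent + cost[best] > budget:
            break                       # simplification: stop at overflow
        A.append(best)
        spent += cost[best]
        remaining.remove(best)
    singles = [t for t in tests if cost[t] <= budget]
    best_single = max(singles, key=lambda t: f([t]), default=None)
    if best_single is not None and f([best_single]) > f(A):
        return [best_single]
    return A
```

Plugging in the pair-coverage function for f yields a budgeted pair-covering step; plugging in the probability-coverage function yields the probability-mass step used by the algorithm.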
3.1 Achieving logarithmic approximation
We will show that Algorithm 2 attains a logarithmic approximation for the DFEP. The algorithm consists of 4 blocks. The first block (lines 1-2) is the basis of the recursion, which returns a leaf if all objects belong to the same class. If there are exactly two objects, belonging to different classes, the algorithm returns a tree consisting of a root and two leaves, one for each object, where the root is associated with the cheapest test that separates the two objects. Clearly, this tree is optimal for both the expected testing cost and the worst testing cost.
The second block (line 3) calls the procedure FindBudget to define the budget allowed for the tests selected in the third and fourth blocks. FindBudget finds the smallest budget B such that Adapted-Greedy(B) returns a set of tests covering at least a prescribed fraction of the pairs of the current set of objects.
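Since the coverage achieved by Adapted-Greedy can only improve as the budget grows, the smallest sufficient budget can be located by search over the budget values. The sketch below abstracts Adapted-Greedy behind a `coverage(B)` callable and locates the least sufficient integer budget by binary search; the search strategy is an implementation choice of ours, not prescribed by the paper:

```python
def find_budget(coverage, target, max_budget):
    """Smallest integer budget B in [1, max_budget] with coverage(B) >= target.

    `coverage(B)` stands for the number of pairs covered by
    Adapted-Greedy when run with budget B, assumed non-decreasing in B
    and reaching `target` by max_budget.
    """
    lo, hi = 1, max_budget
    while lo < hi:
        mid = (lo + hi) // 2
        if coverage(mid) >= target:
            hi = mid                    # mid suffices: look for smaller
        else:
            lo = mid + 1                # mid too small
    return lo
```

Each probe of `coverage` costs one run of the greedy, so the search adds only a logarithmic factor in the budget range.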
The third (lines 4-10) and the fourth (lines 11-17) blocks are responsible for the construction of the backbone of the decision tree (see Fig. 2) and for calling DecTree recursively to construct the decision trees that are children of the nodes in the backbone.
The third block (the while loop in lines 4-10) constructs the first part of the backbone by iteratively selecting the test that covers the maximum uncovered probability mass per unit of testing cost (line 5). The selected test t induces a partition on the set of objects that have not been covered yet. In lines 7 and 8, the procedure is recursively called for each set of this partition except for the one contained in σ(t). With reference to Figure 2, these calls build the subtrees rooted at nodes that are not on the backbone but are children of some node on it.
Similarly, the fourth block (the repeat-until loop in lines 11-17) constructs the second part of the backbone by iteratively selecting the test that covers the maximum number of uncovered pairs per unit of testing cost (line 12). Line 18 is responsible for building a decision tree for the objects that are not covered by the tests in the backbone.
We shall note that both the third and the fourth block of the algorithm are based on the adapted greedy heuristic of Algorithm 1. In fact, the test selected in line 5 (third block) corresponds to the test selected in Algorithm 1 because, right before the selection of the i-th test, the set of available tests and the set of still-uncovered objects coincide in the two procedures. A similar argument shows that the test selected in line 12 (fourth block) corresponds to the test selected in Algorithm 1. These connections will allow us to apply both Theorem 2 and Corollary 1 to analyze the cost and the coverage of these sequences.
Let Q denote the sequence of tests obtained by concatenating the tests selected in the while loop and in the repeat-until loop of the execution of DecTree over instance I. We delay to the next section the proof of the following key result.
Theorem 3. There exists a constant such that, for any instance I of the DFEP, the sequence Q covers at least a constant fraction of the pairs of O, while its separation cost and total cost are bounded by constant multiples of OPT_E(I) and OPT_W(I), respectively.
Applying Theorem 3 to each recursive call of DecTree, we can prove the following theorem about the approximation guaranteed by our algorithm, both in terms of worst testing cost and expected testing cost.
For any instance I of the DFEP, the algorithm DecTree outputs a decision tree with expected testing cost at most O(log n) · OPT_E(I) and with worst testing cost at most O(log n) · OPT_W(I).
For any instance I, let D be the decision tree produced by the algorithm DecTree. First, we prove the approximation for the expected testing cost. Let β be a constant chosen sufficiently large with respect to the constant given in the statement of Theorem 3. Let us assume by induction that the algorithm guarantees the claimed approximation, for the expected testing cost, on every instance whose set of objects has a smaller number of pairs.
Let us consider the instances on which the algorithm is recursively called in lines 8, 15 and 18. We then have the following chain of inequalities.
The first equality follows from the recursive way in which the algorithm DecTree builds the decision tree. Inequality (9) follows from (8) by the subadditivity property (Proposition 2) and simple algebraic manipulations. Inequality (10) follows from Theorem 3 together with Theorem 1. Inequality (11) follows by induction on the number of pairs of each recursively solved instance.
To prove that the inequality in (12) holds, we have to argue that every recursively solved instance contains at most a constant fraction of the pairs. Consider a test t selected as in lines 8 and 15, and the partition it induces on the current set of objects. First, the number of pairs of each part other than σ(t) is at most that of σ(t), since σ(t) is, by definition, the set with the maximum number of pairs in this partition. It remains to show that the instance recursively solved in line 18 also has at most a constant fraction of the pairs. This is true because its number of pairs equals the number of pairs not covered by the backbone sequence Q, which is bounded by Theorem 3.
Now, we prove the approximation for the worst testing cost of the tree D. Let us assume by induction that the worst testing cost is within the claimed bound for every instance whose set of objects has a smaller number of pairs. We then have the following chain of inequalities.
Inequality (14) follows from the subadditivity property (Proposition 2) for the worst testing cost. Inequality (15) follows from Theorem 1. The inequality in (3.1) follows from Theorem 3, from the induction hypothesis, and from the fact, mentioned above, that every recursively solved instance has at most a constant fraction of the pairs.
Since the number of pairs is at most quadratic in n, it follows that the algorithm provides an O(log n) approximation for both the expected testing cost and the worst testing cost.
The previous theorem shows that the algorithm DecTree provides simultaneously a logarithmic approximation for the minimization of the expected testing cost and of the worst testing cost. We would like to remark that this is an interesting feature of our algorithm. In this respect, let us consider the following instance of the DFEP (which is also an instance of the identification problem mentioned in the introduction). There are n objects, each in its own class, with a suitably skewed probability distribution; the set of tests is in one-to-one correspondence with the set of all binary strings of length n, so that the test corresponding to a binary string b outputs 0 (respectively, 1) for the i-th object if and only if the i-th bit of b is 0 (respectively, 1). Moreover, all tests have unit cost. This instance is also an instance of the problem of constructing an optimal prefix coding binary tree, which can be solved by Huffman's algorithm . Let D_E and D_W be, respectively, the decision trees with minimum expected testing cost and minimum worst testing cost for this example. Using Huffman's algorithm, it is not difficult to verify that, for a sufficiently skewed distribution, D_E has small expected cost but worst testing cost linear in n, while D_W has worst testing cost logarithmic in n but much larger expected cost. This example shows that the minimization of the expected testing cost may result in a high worst testing cost and, vice versa, the minimization of the worst testing cost may result in a high expected testing cost. Clearly, in real situations presenting such a dichotomy, the ability of our algorithm to optimize both cost measures simultaneously might provide a significant gain over strategies that only guarantee competitiveness with respect to one measure.
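The dichotomy can be checked numerically. The sketch below builds a Huffman tree for a geometrically decreasing distribution (our illustrative choice, since the instance's exact probabilities are not reproduced in this excerpt) and compares its worst leaf depth with that of a balanced tree:

```python
import heapq
from math import ceil, log2

def huffman_depths(probs):
    """Leaf depths of a Huffman tree for the given probabilities."""
    # Heap entries: (weight, unique tiebreak, depths of leaves in subtree).
    heap = [(p, i, [0]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (p1 + p2, count, [d + 1 for d in d1 + d2]))
    return heap[0][2]

n = 16
# Dyadic distribution 1/2, 1/4, ..., 2^-(n-1), 2^-(n-1): Huffman yields a
# caterpillar with worst depth n - 1, while a balanced tree over n leaves
# has worst depth ceil(log2 n) but expected depth log2 n.
probs = [2.0 ** -(i + 1) for i in range(n - 1)] + [2.0 ** -(n - 1)]
depths = huffman_depths(probs)
print(max(depths), ceil(log2(n)))
```

With unit-cost tests, leaf depth equals testing cost, so the minimum-expected-cost tree here has worst cost n − 1 against the optimal worst cost of roughly log2 n.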
3.2 The proof of Theorem 3
We now return to the proof of Theorem 3, which will go through three lemmas.
For any instance I of the DFEP, the budget B returned by the procedure FindBudget satisfies B ≤ OPT_W(I).
Let us consider the problem in equation (7) with the function that measures the number of pairs covered by a set of tests. Let g(B) be the number of pairs covered by the solution constructed by Adapted-Greedy when the budget—the right-hand side of the constraint in (7)—is B. By construction, FindBudget finds the smallest B such that g(B) reaches the prescribed coverage.
Let Q' be a sequence that covers all pairs in O and whose total cost satisfies c(Q') ≤ OPT_W(I). Arguing by contradiction, we can show that the budget returned by FindBudget is at most c(Q'). Suppose that this were not the case; then Q' would be a sequence covering all the pairs using tests of total cost not larger than some budget smaller than the one returned. By Theorem 2, the procedure Adapted-Greedy provides a constant-factor approximation of the maximum number of pairs coverable with a given budget. Therefore, when run with this smaller budget, Adapted-Greedy would be guaranteed to produce a sequence reaching the prescribed coverage. However, by the minimality of the budget returned by FindBudget, such a sequence does not exist. Since this contradiction follows from the hypothesis, it must hold that the budget is at most c(Q') ≤ OPT_W(I), as desired. ∎
Given an instance I, for a sequence of tests Q and a real x ≥ 0, let sepcost_x(Q) be the separation cost of Q when every non-covered object is charged x, that is, cost_Q(o) is replaced by x for every object o not covered by Q.
The proof of the following technical lemma is deferred to the appendix.
Let Q1 be the sequence obtained by concatenating the tests selected in the while loop of Algorithm 2. Then, the separation cost and the total cost of Q1 are bounded by a positive constant times, respectively, OPT_E(I) and the budget B calculated at line 3.
The sequence Q covers at least a constant fraction of the pairs of O, and its total cost is bounded by a constant times the budget B.
The sequence Q can be decomposed into the sequences Q1 and Q2 that are constructed, respectively, in the while loop and in the repeat-until loop of the algorithm DecTree (see also Fig. 2).
It follows from the definition of the budget B that there is a sequence of tests, say G, of total cost not larger than B, that covers at least the prescribed fraction of the pairs of the instance I. Let q be the number of pairs of the instance I covered by the sequence Q1. Thus, the tests in G that do not belong to Q1 cover at least the remaining required pairs in the set of objects not covered by Q1.
The sequence Q2 coincides with the concatenation of the two possible outputs of the procedure Adapted-Greedy (Algorithm 1) when it is executed on the instance defined by: the objects not covered by Q1; the tests not in Q1; the submodular set function counting covered pairs; and the budget B. By Corollary 1, we have that the total cost of Q2 is bounded in terms of B and that Q2 covers at least a constant fraction of the still-uncovered pairs.
Therefore, putting everything together, we have that Q covers at least the prescribed fraction of the pairs and that its total cost is bounded as claimed. ∎
The proof of Theorem 3 will now follow by combining the previous three lemmas.
To prove the bound on the total cost, we decompose Q into Q1 and Q2, the sequences of tests selected in the while loop and in the repeat-until loop of Algorithm 2, respectively.
For , let . In addition, let be the set of objects which are not covered by the tests in Thus,