A Tight Analysis of Greedy Yields Subexponential Time Approximation for Uniform Decision Tree
Abstract
Decision Tree is a classic formulation of active learning: given $n$ hypotheses with nonnegative weights summing to 1 and a set of tests that each partition the hypotheses, output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. Previous works showed that the greedy algorithm achieves an $O(\log n)$ approximation ratio for this problem and that it is NP-hard to beat an $O(\log n)$ approximation, settling the complexity of the problem. However, for Uniform Decision Tree, i.e. Decision Tree with uniform weights, the story is more subtle. The greedy algorithm's $O(\log n)$ approximation ratio is the best known, but the largest approximation ratio known to be NP-hard is $4-\varepsilon$. We prove that the greedy algorithm gives an $O(\frac{\log n}{\log C_{OPT}})$ approximation for Uniform Decision Tree, where $C_{OPT}$ is the cost of the optimal tree, and show this is best possible for the greedy algorithm. As a corollary, we resolve a conjecture of Kosaraju, Przytycka, and Borgstrom [KPB99]. Our results also hold for instances of Decision Tree whose weights are not too far from uniform. Leveraging this result, we exhibit a subexponential algorithm that yields an $O(1/\alpha)$ approximation to Uniform Decision Tree in time $2^{\tilde{O}(n^{\alpha})}$. As a corollary, achieving any superconstant approximation ratio on Uniform Decision Tree is not NP-hard, assuming the Exponential Time Hypothesis. This work therefore adds approximating Uniform Decision Tree to a small list of natural problems that have subexponential algorithms but no known polynomial time algorithms. Like the greedy analysis, our analysis of the subexponential algorithm gives similar approximation guarantees even for slightly nonuniform weights.
1 Introduction
In Decision Tree (also known as Split Tree), one is given $n$ hypotheses with nonnegative weights summing to 1 and a set of $K$-ary tests that each partition the hypotheses, and must output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. Decision Tree is a classic problem that arises naturally in active learning [Das04, Now11, GB09] and hypothesis identification [Mor82]. Active learning with a well-specified and finite hypothesis class with noiseless tests is precisely Decision Tree where the tests are data points and the answers are their labels. Decision Tree was first proved to be NP-hard by Hyafil and Rivest [HR76]. Since then, a large number of works have provided algorithms for this problem [GG74, Lov85, KPB99, Das04, CPR11, CPRS09, GB09, GNR10, CJLM10, AH12].
A natural algorithm for Decision Tree is the greedy algorithm, which creates a decision tree by iteratively choosing the test that most evenly splits the set of remaining hypotheses. For binary tests ($K=2$), there is a natural notion of “most even split,” but for $K>2$, there are multiple possible definitions (see discussion in Section 2). It is well known that the greedy algorithm achieves an $O(\log n)$ approximation ratio for Decision Tree assuming all weights are at least . It was first shown for binary tests and uniform weights (every hypothesis has weight $1/n$) [KPB99, AH12], then for $K$-ary tests [CPRS09], and finally for general nonuniform weights [GB09]. Furthermore, it is NP-hard to achieve an $o(\log n)$ approximation ratio for Decision Tree [CPR11], settling the complexity of approximating Decision Tree.
However, there are still gaps in our knowledge. For Uniform Decision Tree, i.e. Decision Tree with uniform weights, the $O(\log n)$ approximation given by the greedy algorithm was previously the best known approximation achievable in polynomial time. Chakaravarthy et al. [CPR11] proved that it is NP-hard to give a $(4-\varepsilon)$ approximation, giving the best known hardness of approximation result, and they asked whether the gap between the best approximation and hardness results could be improved. Previously, it was not even known whether the greedy algorithm could beat the $O(\log n)$ approximation ratio in previous analyses: the best lower bound on the greedy algorithm’s approximation ratio is $\Omega(\frac{\log n}{\log\log n})$ [KPB99, Das04]. In the setting where the optimal solution to Uniform Decision Tree has cost $O(\log n)$, Kosaraju et al. [KPB99] showed that the greedy algorithm indeed gives an $O(\frac{\log n}{\log\log n})$ approximation, and they conjectured that the greedy algorithm gives an $O(\frac{\log n}{\log\log n})$ approximation in general.
1.1 Our contributions
We summarize the main contributions of our work below. The approximation guarantees of our algorithms are captured in Figure 1.

Greedy algorithm. We give a new analysis of the greedy algorithm, showing that it gives a approximation for Decision Tree, where $C_{OPT}$ is the cost of the optimal tree, , and . This implies an $O(\frac{\log n}{\log C_{OPT}})$ approximation for instances of Uniform Decision Tree and of Decision Tree whose weights are close to uniform. As $C_{OPT} \ge \log n$ always for binary tests, this proves the conjecture of Kosaraju et al. [KPB99].

Subexponential algorithm. Leveraging the above greedy analysis, for $\alpha \in (0,1)$, we give a subexponential time approximation algorithm for Uniform Decision Tree. (Throughout this work, subexponential means $2^{O(n^{\alpha})}$ for some absolute $\alpha < 1$. We make a distinction when referring to $2^{n^{o(1)}}$ runtimes.) Assuming the Exponential Time Hypothesis (ETH) [IP01, IPZ01], which states that there are no $2^{o(n)}$ time algorithms for 3SAT, this algorithm implies that any superconstant approximation of Uniform Decision Tree is not NP-hard. Our work adds Uniform Decision Tree to a select group of natural problems whose time complexity is known to be subexponential (and, for some approximation ratios, $2^{n^{o(1)}}$) but not known to be polynomial. Famous examples of such problems include Factoring [LLMP93], Unique Games [Kho02, ABS15], Graph Isomorphism [Bab16], and approximating Nash Equilibrium [LMM03, Rub18], with the latter two having quasipolynomial time algorithms. As in our greedy analysis, our subexponential algorithm gives a similar approximation guarantee even for slightly nonuniform weights, in particular when .

Approximation ratio tightness. We prove that the $O(\frac{\log n}{\log C_{OPT}})$ approximation ratio for the greedy algorithm is tight for Uniform Decision Tree. We also prove that the term in the approximation ratio for the greedy algorithm is necessary, in the sense that no algorithm can give a approximation for Decision Tree when for some unless P = NP.

Repeatable, noisy tests. Kääriäinen [Kää06] provides a method to convert a solution for Decision Tree into a solution for a variant of Decision Tree that handles noisy, repeatable tests. An immediate corollary of our greedy result is that the cost of a solution for the noisy problem derived from the greedy algorithm is at most . Previously, this cost was bounded by .
1.2 Techniques
Our work gives a new analysis of the greedy algorithm for Decision Tree. When the weights are uniform, similar to [AH12], we account for the cost of the greedy tree by summing the number of leaves under each vertex in the tree. However, rather than accounting for all the vertices at once, we separately analyze the vertices with “imbalanced” splits and those with “balanced” splits. A global entropy argument accounts for the vertices with balanced splits, and to account for the vertices with imbalanced splits, we use the fact that greedy gives a constant factor approximation for Min Sum Set Cover [FLT04]. Putting the two together gives the desired approximation result. For nonuniform weights, we additionally prove and use a generalization of a result on the greedy algorithm’s performance for Set Cover [Lov75, Joh74, Chv79, Ste74].
For the subexponential algorithm, we leverage our new result that the greedy algorithm gives an $O(\frac{\log n}{\log C_{OPT}})$ approximation. We first run the greedy algorithm. If the greedy algorithm returns a tree with cost at least , we return the greedy tree knowing we have a approximation. Otherwise, we find by brute force the optimal tree up to depth in time , then recurse.
1.3 Organization of paper
In Section 2, we formally introduce notation used throughout the paper. In Section 3, we state our results. In Section 4, we sketch a proof of Theorem 3.1, our approximation guarantee for the greedy algorithm on Decision Tree. In Section 5, we state the subexponential algorithm and give a sketch of the analysis. In Section 6, we describe some related work. In Section 7, we conclude with some open problems.
We defer many details of our proofs to the appendices. For convenience, since the proof of Theorem 3.1 is involved, we include a simplified proof of Theorem 3.1 specialized to Uniform Decision Tree with binary tests in Appendix A, and prove the full theorem in Appendix B. A lemma on the greedy algorithm’s performance in a generalization of Set Cover that is used in the proof of Theorem 3.1 is proved in Appendix C. We formally analyze our subexponential algorithm in Appendix D. In Appendices E and F, we prove Propositions 3.3 and 3.4, which show two ways that Theorem 3.1 is tight. In Appendix G, we demonstrate a rounding trick that allows us to assume without changing the difficulty of approximating Decision Tree.
2 Preliminaries
For a positive integer , let . All logs are base 2 when the base is not specified. The Decision Tree problem is as follows: given a set of hypotheses with probabilities summing to 1, and distinct $K$-ary tests, output a decision tree with hypotheses as leaves, such that the weighted average of the depth of the leaves is minimal. Formally, a $K$-ary test is a map . We refer to as the branching factor of the test , and the elements of as the possible answers to the tests. We think of a test as defining a $K$-way partition of . A decision tree is a rooted tree such that each interior vertex has the index of some test, and the edge to the th child of is labeled with . We say that a hypothesis is consistent with a vertex if, in the path from the root to that vertex, the edge following any vertex has label . We let denote the set of hypotheses that are consistent with . We say a decision tree is complete if, for all , there exists a (unique) leaf such that , and for a complete decision tree , let denote the depth of this vertex . The cost of a complete decision tree is defined to be the average depth of the leaves, weighted by , i.e.
(1) 
We set to be a complete decision tree that minimizes (in general, there may be more than one optimal decision tree), and abbreviate .
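To make the objective concrete, the following is a minimal Python sketch that computes the optimal cost of a tiny instance by exhaustive search with memoization. The encoding is an assumption made only for illustration: hypotheses are the indices 0, ..., n-1, `weights[h]` is the weight of hypothesis h, and `tests[t][h]` is the answer of test t on hypothesis h. The name `optimal_cost` is hypothetical, and the search takes exponential time, so it is only meant for very small examples.

```python
from functools import lru_cache


def optimal_cost(tests, weights):
    """Minimum weighted average depth over all complete decision trees.

    Illustrative encoding (an assumption, not the paper's notation):
    tests[t][h] is the answer of test t on hypothesis h.
    """
    n = len(weights)

    @lru_cache(maxsize=None)
    def best(hyps):
        # hyps is a frozenset of hypotheses still consistent with the vertex.
        # Returns the least possible sum of weight * (depth below this vertex).
        if len(hyps) <= 1:
            return 0.0
        total = sum(weights[h] for h in hyps)
        best_val = None
        for t in tests:
            parts = {}
            for h in hyps:
                parts.setdefault(t[h], set()).add(h)
            if len(parts) < 2:
                continue  # this test does not separate anything here
            # Every hypothesis under this vertex moves one level deeper,
            # which contributes `total`, plus the cost inside each child.
            val = total + sum(best(frozenset(p)) for p in parts.values())
            if best_val is None or val < best_val:
                best_val = val
        if best_val is None:
            raise ValueError("tests cannot distinguish the remaining hypotheses")
        return best_val

    return best(frozenset(range(n)))


if __name__ == "__main__":
    tests = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
    weights = [0.25, 0.25, 0.25, 0.25]
    print(optimal_cost(tests, weights))
```

On four uniformly weighted hypotheses with binary tests, no tree can have average depth below $\log 4 = 2$, and the first two tests in the example achieve it.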
This paper is concerned with the greedy algorithm for Decision Tree. We call a decision tree greedy if the test of each interior vertex minimizes the (weighted) number of hypotheses of the largest partition in ’s partitioning of . Formally, a decision tree is greedy if, for all interior vertices , we have
(2) 
where for . Given a Uniform Decision Tree instance, we let be a complete, greedy decision tree, choosing one arbitrarily if there is more than one. For brevity, we write .
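The greedy rule described above can be sketched in a few lines of Python. This is an illustration under assumed conventions rather than the paper's own pseudocode: hypotheses are indices, `tests[t][h]` is the answer of test t on hypothesis h, a leaf is represented by its hypothesis index, and an interior vertex by a pair `(test_index, children)`. The names `greedy_tree` and `cost` are hypothetical. Ties are broken by taking the first minimizing test.

```python
from collections import defaultdict


def split(tests, t, hyps):
    """Partition the hypotheses in `hyps` by their answer to test t."""
    parts = defaultdict(list)
    for h in hyps:
        parts[tests[t][h]].append(h)
    return parts


def greedy_tree(tests, weights, hyps):
    """Build a greedy tree on the hypotheses `hyps` (a list of indices).

    At each vertex, pick the test whose heaviest answer class is lightest,
    i.e. the best split under a worst-case answer.
    """
    if len(hyps) == 1:
        return hyps[0]
    best_t, best_parts, best_key = None, None, None
    for t in range(len(tests)):
        parts = split(tests, t, hyps)
        if len(parts) < 2:
            continue  # a test that separates nothing is never useful
        key = max(sum(weights[h] for h in part) for part in parts.values())
        if best_key is None or key < best_key:
            best_t, best_parts, best_key = t, parts, key
    if best_t is None:
        raise ValueError("tests cannot distinguish the remaining hypotheses")
    children = {a: greedy_tree(tests, weights, part) for a, part in best_parts.items()}
    return (best_t, children)


def cost(tree, weights, depth=0):
    """Weighted average depth of the leaves (weights are assumed to sum to 1)."""
    if not isinstance(tree, tuple):
        return weights[tree] * depth
    return sum(cost(child, weights, depth + 1) for child in tree[1].values())


if __name__ == "__main__":
    tests = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1]]
    weights = [0.25, 0.25, 0.25, 0.25]
    tree = greedy_tree(tests, weights, [0, 1, 2, 3])
    print(tree, cost(tree, weights))
```

In this small example the greedy tree uses the even split at the root and has cost 2, which matches the optimum; in general the greedy cost can be larger than the optimal cost.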
We remark that, when $K>2$, our notion of a “greedy” algorithm for Decision Tree is not the only one. As mentioned in the previous paragraph, our definition of greedy chooses, at each vertex in the decision tree, the test that minimizes the (weighted) number of candidate hypotheses, assuming a worst-case answer to the test. Our definition corresponds to the definition by [CPRS09], but other choices include maximizing the (weighted) number of pairs of hypotheses that are distinguished [CPR11, GB09] and maximizing the mutual information between the test and the remaining hypotheses [ZRB05]. For binary tests, $K=2$, these definitions are all equivalent.
Define as Decision Tree with the guarantee that . In this notation, is Uniform Decision Tree.
3 Our results
3.1 Greedy algorithm
The main driver of this paper is Theorem 3.1, which relates the cost of greedy to the optimal cost for Decision Tree.
Theorem 3.1.
For any instance of Decision Tree on $n$ hypotheses, we have
(3) 
Our theorem holds for any branching factor $K$, and for the cost of any greedy tree. Note that the greedy algorithm always gives an $O(\frac{\log n}{\log\log n})$ approximation for Uniform Decision Tree, resolving the conjecture of [KPB99]. Additionally, if $C_{OPT}$ is $n^{\alpha}$ for a constant $\alpha$ and the weights are uniform, then the greedy algorithm obtains a constant approximation.
For the simpler case when $K=2$ and the weights are uniform, we give a sketch of the proof in Section 4 and a full proof in Appendix A. The full result is sketched in Section 4 and proven in full in Appendix B. For Uniform Decision Tree, the constant 12 can be improved to 6, and, when is sufficiently large, , so that greedy gives a approximation (see Section 4).
3.2 Subexponential algorithm
Using Theorem 3.1, we give a subexponential algorithm that achieves a constant factor approximation for the Decision Tree problem when the weights are close to uniform.
Theorem 3.2.
For any and , there exists an approximation algorithm for with runtime . For Uniform Decision Tree, for any , we can achieve a approximation in the same runtime.
In Section 5, the subexponential algorithm is stated and an analysis is sketched. The analysis is given formally in Appendix D. Importantly, this result implies that achieving a superconstant approximation ratio is not NP-hard, given the Exponential Time Hypothesis. As an informal proof, suppose for contradiction there was a polynomial time reduction from 3SAT to achieving an $f(n)$ approximation ratio for Uniform Decision Tree for some $f(n) \to \infty$ as $n \to \infty$. By Theorem 3.2, there exists a $2^{n^{o(1)}}$ time algorithm to achieve an $f(n)$ approximation for Uniform Decision Tree, and thus a $2^{o(n)}$ time algorithm to solve 3SAT, contradicting the Exponential Time Hypothesis. This adds approximating Uniform Decision Tree to a list of interesting natural problems that have subexponential or $2^{n^{o(1)}}$ time algorithms but are not known to be in P. Figure 1 illustrates the contrast between Decision Tree and Uniform Decision Tree.
3.3 Approximation ratio tightness
We also show that the $O(\frac{\log n}{\log C_{OPT}})$ approximation ratio is tight up to a constant factor for the greedy algorithm by generalizing the example given by [Das04]. The proof is given in Appendix E.
Proposition 3.3.
There exists an such that for all and any , there exists an instance of Uniform Decision Tree with branching factor 2 for which
(4) 
We also show that, when the weights are nonuniform, the term in the approximation ratio of Theorem 3.1 is computationally necessary.
Proposition 3.4.
Let . Then, for sufficiently large, approximating to a factor of is NPhard.
In other words, even if the ratio is guaranteed to be for a constant , one cannot give a approximation algorithm unless P = NP. The proof is given in Appendix F.
3.4 Decision tree with noise
Theorem 3.1 implies an improved black-box result for a noisy variant of Decision Tree. Kääriäinen [Kää06] considers a variant of Decision Tree with binary tests where the output of each test may be corrupted by i.i.d. noise. Formally, there exists such that querying any test on any hypothesis outputs the correct answer with probability and the wrong answer with probability , for some . Tests are repeatable, with each one producing different draws of the noise. Kääriäinen [Kää06] gives an algorithm that turns a decision tree of cost for the noiseless problem into a decision tree with cost for the noisy problem by repeating queries sufficiently many times.
Combining Kääriäinen’s result with the greedy algorithm for Uniform Decision Tree gives an algorithm for the noisy problem using an average of queries. Previously, using the bound , the noisy problem’s cost was bounded by . However, by Theorem 3.1, we have , so we in fact have cost at most , improving the cost ratio to the optimal solution of the noiseless problem by a nearly quadratic factor.
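As a rough illustration of why repeating a noisy query helps, the sketch below computes the exact probability that a majority vote over r independent repetitions returns the wrong answer. Using majority vote is an assumption made for illustration; it is one standard way to repeat queries and is not claimed to be the exact procedure of [Kää06]. The function name `majority_error` and the parameter `p_correct` (the probability that a single query answers correctly) are hypothetical.

```python
from math import comb


def majority_error(p_correct, r):
    """Probability that the majority of r independent noisy answers is wrong.

    Assumes binary answers, an odd number r of repetitions, and that each
    repetition is correct independently with probability p_correct.
    """
    if r % 2 == 0:
        raise ValueError("use an odd number of repetitions to avoid ties")
    q = 1.0 - p_correct
    # The vote is wrong exactly when more than half of the answers are wrong.
    return sum(comb(r, k) * q**k * p_correct ** (r - k) for k in range(r // 2 + 1, r + 1))


if __name__ == "__main__":
    for r in (1, 3, 5, 11, 21):
        print(r, majority_error(0.75, r))
```

When each answer is correct with probability bounded above 1/2, the error of the vote decays exponentially in r, so a modest number of repetitions per query suffices to make every query on a root-to-leaf path reliable.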
4 Sketch of proof of Theorem 3.1
In this section, we sketch a proof of Theorem 3.1. We first sketch the proof assuming that the branching factor is 2, so that the decision tree is binary, and that the distribution is uniform (every hypothesis has weight $1/n$). Since the proof of Theorem 3.1 is involved, we give the details of this easier result in Appendix A. At the end of the section, we give the additional ideas necessary to complete the full proof of Theorem 3.1. The details of the full proof are given in Appendix B.
4.1 Uniform weights and binary tests
Recall that . By a simple double counting argument (Lemma A.2), we can account the cost of the greedy tree by summing the weights of the vertices rather than summing the depths of leaves. That is,
(5) 
where the sum is over the interior vertices of .
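The double counting identity is easy to check numerically. In the uniform case it says that the sum of the leaf depths equals the sum, over interior vertices, of the number of leaves below that vertex, since a leaf at depth d lies below exactly d interior vertices. The sketch below verifies this on a small tree; the nested-list encoding (an interior vertex is a list of children, a leaf is any other object) and the function names are assumptions made for illustration.

```python
def leaves(tree):
    """Number of leaves under a vertex; a leaf is any non-list object."""
    if not isinstance(tree, list):
        return 1
    return sum(leaves(child) for child in tree)


def total_depth(tree, depth=0):
    """Sum of the depths of all leaves."""
    if not isinstance(tree, list):
        return depth
    return sum(total_depth(child, depth + 1) for child in tree)


def interior_leaf_counts(tree):
    """Sum, over interior vertices, of the number of leaves below the vertex."""
    if not isinstance(tree, list):
        return 0
    return leaves(tree) + sum(interior_leaf_counts(child) for child in tree)


if __name__ == "__main__":
    example = [["a", ["b", "c"]], "d"]
    # Depths are a:2, b:3, c:3, d:1; interior vertices have 4, 3 and 2 leaves.
    print(total_depth(example), interior_leaf_counts(example))
```

Dividing both sides by the number of hypotheses gives the identity in terms of the average depth and the uniform weights of the interior vertices.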
Defining balanced and imbalanced vertices.
We then define balanced and imbalanced vertices using a parameter . These definitions are crucial to the proof. A vertex is imbalanced if there exists an integer (called the level) such that and . Here, is the child of containing a smaller weight of hypotheses in its subtree. We say is balanced if it is not imbalanced. (We remark that imbalanced vertices can have arbitrarily close to , so the hypotheses at vertex are not necessarily split in an imbalanced way. However, as we show (Lemma A.7), all balanced vertices are in fact split in a balanced way with , hence the terminology.) Note that imbalanced vertices exist only for , where . We prove a structural result (Lemma A.5) that shows that the level imbalanced vertices of can be partitioned into downward paths, which we call chains, such that, for all , each leaf has vertices from at most one level chain among its ancestors. The parameter quantifies how many chains we consider: smaller means fewer, longer chains, and larger means more, shorter chains. We optimize the choice of at the end of this proof sketch. In the remainder of the proof, we bound the weight of the balanced and imbalanced vertices separately.
Bounding the weight of balanced vertices.
To bound the weight of balanced vertices, we use an entropy argument. We consider the random variable corresponding to a uniformly random hypothesis from . On one hand, this random variable has entropy $\log n$. On the other hand, we can take a uniformly random hypothesis from by an appropriate random walk down the decision tree. Starting from the root, at each vertex, we step to a child with probability proportional to the number of hypotheses in that child’s subtree. The total entropy of this process is given by , where is the entropy of the random walk’s step at . A simple argument (Lemma A.7) shows that, for all balanced vertices , we have and hence . We thus have
(6) 
Hence,
(7) 
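The entropy accounting behind this argument can be checked on a small example. By the chain rule, the entropy of the random walk equals the sum, over interior vertices, of the probability of reaching the vertex times the entropy of the step taken there, and for uniform weights the total is $\log n$. The sketch below verifies this numerically; the nested-list tree encoding and the function names are assumptions made for illustration.

```python
import math


def leaves(tree):
    """Number of leaves under a vertex; a leaf is any non-list object."""
    if not isinstance(tree, list):
        return 1
    return sum(leaves(child) for child in tree)


def walk_entropy(tree, n):
    """Entropy (in bits) of the leaf-proportional random walk below `tree`.

    Each interior vertex contributes (probability of reaching it) times
    (entropy of the step to a child), where n is the total number of leaves.
    """
    if not isinstance(tree, list):
        return 0.0
    size = leaves(tree)
    probs = [leaves(child) / size for child in tree]
    step_entropy = -sum(p * math.log2(p) for p in probs)
    return (size / n) * step_entropy + sum(walk_entropy(child, n) for child in tree)


if __name__ == "__main__":
    example = [["a", ["b", "c"]], "d"]
    print(walk_entropy(example, 4), math.log2(4))
```

Since the steps at balanced vertices each carry a non-negligible amount of entropy and the total is only $\log n$, the total weight of the balanced vertices cannot be large, which is the content of the bound above.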
Bounding the weight of imbalanced vertices
To bound the cost of imbalanced vertices, we crucially use a connection to Min Sum Set Cover (MSSC). In MSSC, one is given a universe and sets , and needs to construct an ordering of the sets that minimizes the cost: the cost of a solution is the average of the cover times of the elements in the universe . That is, the cost of a solution is
(8) 
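For concreteness, here is a minimal Python sketch of Min Sum Set Cover as described above, together with its greedy algorithm, which repeatedly picks the set covering the most still-uncovered elements. The encoding (sets as Python `set` objects, an ordering as a list of set indices, cover times counted from 1) and the names `mssc_cost` and `greedy_mssc` are assumptions made for illustration.

```python
def mssc_cost(order, sets, universe):
    """Average cover time: the position (from 1) of the first set containing x."""
    total = 0
    for x in universe:
        total += next(i for i, s in enumerate(order, start=1) if x in sets[s])
    return total / len(universe)


def greedy_mssc(sets, universe):
    """Greedy ordering: always take the set covering the most uncovered elements.

    Stops once everything is covered; ties go to the smallest set index.
    """
    uncovered = set(universe)
    remaining = set(range(len(sets)))
    order = []
    while uncovered:
        if not remaining:
            raise ValueError("the sets do not cover the universe")
        best = max(sorted(remaining), key=lambda s: len(sets[s] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        order.append(best)
        remaining.remove(best)
        uncovered -= sets[best]
    return order


if __name__ == "__main__":
    universe = {1, 2, 3, 4, 5, 6}
    sets = [{1, 2, 3, 4}, {5, 6}, {1, 5}, {4, 6}]
    order = greedy_mssc(sets, universe)
    print(order, mssc_cost(order, sets, universe))
```

In the example, greedy first covers four elements at time 1 and then the remaining two at time 2, for an average cover time of 4/3.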
A result by Feige, Lovász, and Tetali [FLT04] shows that the greedy algorithm gives a 4-approximation of MSSC, and they show this is tight by proving that finding a $(4-\varepsilon)$ approximation of MSSC is NP-hard. On the lower bound side, a connection between MSSC and Decision Tree was already known: Chakaravarthy et al. [CPR11] proved that it is NP-hard to approximate Uniform Decision Tree with ratio better than $4-\varepsilon$ by a reduction from MSSC. The key technical contribution of our work is showing that there is also a connection on the upper bound side. Roughly, one can read off a greedy solution to MSSC from the greedy tree by choosing to be the sets of hypotheses consistent with the minority child of the ’s, starting with the minority child of . The MSSC cover time corresponds to the depth of a hypothesis in the tree past . Bounding the weight of imbalanced vertices works as follows.

1. For each chain , define a corresponding instance (Definition A.9) of Min Sum Set Cover induced by the chain as follows:
Universe , the set of all hypotheses that are consistent with .

For , the set is the set of hypotheses in that give the minority answer of test with respect to hypotheses . (See Figure 2.)

For each , a singleton set . These sets, while mostly trivial, are included for technical reasons.
Note we have a total of sets, so that a solution is a permutation . The sets for are chosen so that the second step below holds.

2. Prove that the weight of a chain is bounded by the cost of a greedy solution to MSSC (Lemma A.13), and hence, using a result of Feige, Lovász, and Tetali (Theorem A.12), by 4 times the optimal cost of MSSC (Corollary A.14). That is, there exists a greedy solution to such that
(9)
This step is somewhat technical, as one must show that the greediness of the greedy decision tree produces a greedy solution to . The choice of is natural: for , let be the index of the test used at vertex in the chain (see Figure 3). However, showing that this is in fact a greedy solution to is a subtle argument that depends on the carefully chosen definition of a chain.

3. Prove that, for any integer , the sum, over all level chains , of the optimal cost of MSSC, is bounded by (Lemma A.15). Hence,
(10)
This step is also technical, as one must draw the connection between the optimal MSSC solution and the optimal decision tree.

In total, we have
(11)
where the first inequality is by part 2 and the second inequality is by part 3. In other words, for any integer , the sum of the weights of all level chains is at most . Hence, the sum of the weights of vertices in any chain, and thus the total weight of all imbalanced vertices, is at most (Lemma A.16), where is the number of levels. As , we have
(12)
To finish the proof, we bound
(13) 
The above is optimized roughly when , giving the desired bound of . Observing that $C_{OPT} \ge \log n$ always, we conclude that the greedy algorithm gives an $O(\frac{\log n}{\log\log n})$ approximation. If is sufficiently large, taking yields .
4.2 General weights and larger $K$
The proof of the general Theorem 3.1 follows similarly to the specific case given above. The two differences are that Theorem 3.1 is stated for general $K$ and for general, not-necessarily-uniform distributions .
Adapting the proof to general $K$ is the easier step. The main difference is the definition of an imbalanced vertex. Now, we say a vertex is imbalanced if there is an integer such that and , where is the total weight of hypotheses in the subtrees of all children of except the majority vertex, , the child of with the largest weight of hypotheses. Under this definition, a similar analysis follows. Note that could be much larger than in this case, but this does not affect the proof much. A little more care is needed in the entropy argument for balanced vertices, and with the MSSC instance defined by a path now taking to be all hypotheses that do not take the majority answer of with respect to the MSSC universe. Note that, if we specialize to $K=2$, the value is simply .
In the weighted case, we again define to be imbalanced if there is an integer such that and . We again bound the cost of the balanced vertices by an entropy argument, and the cost of the imbalanced vertices via a connection to Min Sum Set Cover. However, because the entities are now weighted, we need to consider the greedy algorithm for a weighted generalization of MSSC called Weighted Min Sum Set Cover (WMSSC). In order to make the connection between the greedy decision tree and the greedy solution to WMSSC, we need a somewhat technical definition: call a vertex heavy if is consistent with and . Define if there exists such that is heavy, and set otherwise. One can easily check that, for any vertex , there is at most one such that is heavy, so is well defined. Now, we follow the argument in the uniform case, bounding
(14) 
where and are the greedy solution and optimal solution, respectively, to the corresponding WMSSC. The first inequality holds because for all and every imbalanced vertex is in some chain (it is an inequality rather than an equality because some imbalanced vertices may be in multiple chains). The second inequality holds by a technical lemma (Lemma B.15) comparing the greedy decision tree with a greedy solution to WMSSC. Just as for MSSC, the greedy algorithm gives a 4-approximation for WMSSC, so the third inequality holds. Additionally, for all , we can still bound , the sum of all WMSSC costs in a single level, by , so the fourth inequality holds. To finish:
(15) 
The last inequality (Lemma B.20) comes from comparing, for fixed , the vertices of the greedy tree that are heavy to an appropriate Set Cover instance, and using the fact that the greedy algorithm on a weighted generalization of Set Cover gives a approximation (Theorem B.19).
5 Sketch of proof of Theorem 3.2
5.1 Algorithm
We describe the algorithm that achieves a approximation for . In Appendix D, we give the details and describe how the same algorithm with minor adjustments gives an improved approximation guarantee for Uniform Decision Tree.
The key idea in the algorithm is that, if the optimal tree has cost at least , then greedy gives an approximation by Theorem 3.1. Fix . Our algorithm first computes the greedy tree. If the cost of the greedy tree is at least , we simply return the greedy tree. Otherwise, we perform an exhaustive search over decision trees of depth at most such that all hypotheses not consistent with vertices at depth are uniquely distinguished. We choose such a tree with minimum cost (see definition of below). Finally, at each leaf of at depth , we recursively compute a decision tree that distinguishes the hypotheses consistent with . The runtime of this algorithm is dominated by the exhaustive search, which we can solve in time using a divide-and-conquer algorithm.
Let denote the cost of a decision tree with respect to hypothesis set , given by
(16) 
where is the depth of the deepest vertex of consistent with . In this way, we have . To solve the Decision Tree instance, we run FullTree below.
5.2 Analysis sketch
We now sketch an analysis of the algorithm. First, it is easy to check that FullTree returns a valid decision tree. By Theorem 3.1, when the greedy tree is used in the recursive call FullTree, it gives a approximation to the instance induced by . Hence, by careful bookkeeping, the greedy trees included in the output tree contribute at most to the cost (Lemma D.4). If the greedy tree is not used, then, in the optimal tree, the average depth of hypotheses is at most . Hence, by a simple counting argument, at each recursive call, the fraction of undistinguished hypotheses shrinks by a factor of , so the maximum depth of recursive calls is (Lemma D.6). Careful bookkeeping shows that, for any , the outputs to PartialTree called from the th level of recursion collectively contribute at most to the cost of the output tree (Lemma D.5). Hence, the trees computed by exhaustive search across all levels of recursion contribute a cost of . Hence, the cost of our output tree is .
6 Related Work
Several other works analyze Decision Tree in a variety of settings, aiming for the gold standard $O(\log n)$ approximation. While we examined the case with $K$-ary tests and nonuniform weights, we assumed that the tests had equal costs. Other works [GB09, GNR10] analyze the case where the test costs are nonuniform. [GB09] shows that the greedy algorithm yields an $O(\log n)$ approximation when either the costs are nonuniform or the weights are nonuniform (with the rounding trick) but not both. [GNR10] introduces a new algorithm that achieves an $O(\log n)$ approximation with both nonuniform weights and costs.
In this work we studied the average depth of decision trees. We remark that, in the worst-case decision tree problem, where the cost of a tree is defined to be the maximum depth of a leaf in the tree, the approximability is known. The greedy algorithm gives an $O(\log n)$ approximation [AMM98], and obtaining an $o(\log n)$ approximation is NP-hard [LN04].
For the worst-case decision tree problem, there is a line of work that examines the absolute query rate rather than the query rate relative to the optimal. In this line of work, the chief goal is to identify conditions where the greedy algorithm achieves the information-theoretically optimal rate $O(\log n)$. One such condition that ensures the $O(\log n)$ rate is “sample-rich” [NJC12], which states that every binary partition of the hypotheses has a test with matching preimages. [Now09, Now11] introduced the more lenient neighborly condition, which requires that every two tests be connected by a sequence of tests where neighboring tests disagree on at most hypotheses. An even more general condition is the split-neighborly condition [ML18], which is satisfied if every two tests are connected by a sequence of tests where neighboring tests must have every subset of the disagreeing hypotheses be evenly split by some other test.
7 Conclusion
There are two primary open questions left by our work. First, could one prove hardness of approximation results for Uniform Decision Tree for ratios larger than $4-\varepsilon$? It would be interesting to prove either NP-hardness results for larger constant factor approximations, or fine-grained complexity results for larger approximation ratios such as in [MR17]. On the flip side, could one find faster, perhaps polynomial time algorithms for approximating Uniform Decision Tree for ratios where we now have subexponential algorithms?
8 Acknowledgements
The authors thank Joshua Brakensiek for helpful discussions and feedback on an earlier draft of this paper.
References
 [ABS15] Sanjeev Arora, Boaz Barak, and David Steurer. Subexponential algorithms for unique games and related problems. J. ACM, 62(5):42:1–42:25, 2015.
 [AH12] Micah Adler and Brent Heeringa. Approximating optimal binary decision trees. Algorithmica, 62(3–4):1112–1121, 2012.
 [AMM98] Esther M Arkin, Henk Meijer, Joseph SB Mitchell, David Rappaport, and Steven S Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(03):343–363, 1998.
 [Bab16] László Babai. Graph isomorphism in quasipolynomial time [extended abstract]. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18–21, 2016, pages 684–697, 2016.
 [Chv79] Vasek Chvatal. A greedy heuristic for the set-covering problem. Mathematics of operations research, 4(3):233–235, 1979.
 [CJLM10] Ferdinando Cicalese, Tobias Jacobs, Eduardo Laber, and Marco Molinaro. On greedy algorithms for decision trees. In International Symposium on Algorithms and Computation, pages 206–217. Springer, 2010.
 [CPR11] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Transactions on Algorithms, volume 7, pages 15:1–15:22, 2011.
 [CPRS09] Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, and Yogish Sabharwal. Approximating decision trees with multiway branches. In Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5–12, 2009, Proceedings, Part I, pages 210–221, 2009.
 [Das04] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pages 337–344, 2004.
 [FLT04] Uriel Feige, László Lovász, and Prasad Tetali. Approximating min sum set cover. Algorithmica, 40(4):219–234, 2004.
 [GB09] Andrew Guillory and Jeff Bilmes. Average-case active learning with costs. In International Conference on Algorithmic Learning Theory, pages 141–155. Springer, 2009.
 [GG74] M. R. Garey and Ronald L. Graham. Performance bounds on the splitting algorithm for binary testing. Acta Inf., 3:347–355, 1974.
 [GK11] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
 [GNR10] Anupam Gupta, Viswanath Nagarajan, and R Ravi. Approximation algorithms for optimal decision trees and adaptive TSP problems. In International Colloquium on Automata, Languages, and Programming, pages 690–701. Springer, 2010.
 [HR76] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., 5(1):15–17, 1976.
 [IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367–375, 2001.
 [IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512–530, 2001.
 [Joh74] David S Johnson. Approximation algorithms for combinatorial problems. Journal of computer and system sciences, 9(3):256–278, 1974.
 [Kää06] Matti Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2006.
 [Kho02] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19–21, 2002, Montréal, Québec, Canada, pages 767–775, 2002.
 [KPB99] S Rao Kosaraju, Teresa M Przytycka, and Ryan Borgstrom. On an optimal split tree problem. In Workshop on Algorithms and Data Structures, pages 157–168. Springer, 1999.
 [LLMP93] Arjen K Lenstra, Hendrik W Lenstra, Mark S Manasse, and John M Pollard. The number field sieve. In The development of the number field sieve, pages 11–42. Springer, 1993.
 [LMM03] Richard J. Lipton, Evangelos Markakis, and Aranyak Mehta. Playing large games using simple strategies. In Proceedings 4th ACM Conference on Electronic Commerce (EC-2003), San Diego, California, USA, June 9–12, 2003, pages 36–41, 2003.
 [LN04] Eduardo S Laber and Loana Tito Nogueira. On the hardness of the minimum height decision tree problem. Discrete Applied Mathematics, 144(1–2):209–212, 2004.
 [Lov75] László Lovász. On the ratio of optimal integral and fractional covers. Discrete mathematics, 13(4):383–390, 1975.
 [Lov85] Donald W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Inf., 22(1):101–114, 1985.
 [ML18] Stephen Mussmann and Percy Liang. Generalized binary search for split-neighborly problems. arXiv preprint arXiv:1802.09751, 2018.
 [Mor82] Bernard ME Moret. Decision trees and diagrams. ACM Computing Surveys (CSUR), 14(4):593–623, 1982.
 [Mos12] Dana Moshkovitz. The projection games conjecture and the NP-hardness of ln n-approximating set-cover. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 276–287. Springer, 2012.
 [MR17] Pasin Manurangsi and Aviad Rubinstein. Inapproximability of VC dimension and Littlestone’s dimension. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7–10 July 2017, pages 1432–1460, 2017.
 [NJC12] Mohammad Naghshvar, Tara Javidi, and Kamalika Chaudhuri. Noisy Bayesian active learning. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 1626–1633. IEEE, 2012.
 [Now09] Robert Nowak. Noisy generalized binary search. In Advances in neural information processing systems, pages 1366–1374, 2009.
 [Now11] Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
 [Rub18] Aviad Rubinstein. Inapproximability of Nash equilibrium. SIAM J. Comput., 47(3):917–959, 2018.
 [Ste74] Sherman K Stein. Two combinatorial covering theorems. Journal of Combinatorial Theory, Series A, 16(3):391–397, 1974.
 [ZRB05] Alice X Zheng, Irina Rish, and Alina Beygelzimer. Efficient test selection in active diagnosis via entropy approximation. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 675–682. AUAI Press, 2005.
Appendix A Proof of Theorem 3.1 for uniform weights and binary tests
We prove the special case of Theorem 3.1 in which the tests are binary and the weights are uniform; that is, we show that the greedy algorithm gives a approximation for Uniform Decision Tree with binary tests. Throughout this section, we have a Uniform Decision Tree instance with hypotheses and tests .
Theorem A.1.
For any instance of the Uniform Decision Tree problem on hypotheses with branching factor 2, and any greedy tree with average cost , we have
(17) 
A.1 Notation
We use the following notation for our proof. These notations help us reason about the greedy tree. We write to mean that is a vertex of tree , and we write to mean that is an interior vertex. We say the length of a path in the tree is the number of edges along the path. For , we say is an ancestor of if there is a (possibly degenerate) path from to going down the tree. In particular, is an ancestor of . We write this as . We call a descendant of if and only if is an ancestor of . For , let denote the set of hypotheses consistent with . For a subset of hypotheses, denote its weight or cost by . For brevity, let , denoting the weight of vertex , and we say the weight of a set of vertices is the sum of the weights of the individual vertices in the set.
A.2 The basic argument
The following lemma shows that, rather than computing the cost of the greedy tree by summing the depths of the leaves associated with the hypotheses, we can instead compute it by summing the weights of the vertices of the tree.
Lemma A.2.
We have .
Proof.
We have,
(18) 
where, in the third equality, we switched the order of summation. ∎
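As a sanity check of Lemma A.2, the following minimal Python sketch builds a greedy tree on a small uniform instance and compares the two ways of accounting for the cost. The instance (six hypotheses with threshold tests), the helper names `split`, `greedy_tree`, `cost_by_leaves`, and `cost_by_vertices`, and the reading of the lemma as a sum over interior vertices are assumptions made for illustration. The greedy rule is taken to be "maximize the weight of the minority side", as described in the proof of Lemma A.3.

```python
from fractions import Fraction

def split(test, hyps):
    """Partition hyps by the answer of a binary test (a 0/1 tuple indexed by hypothesis)."""
    ones = frozenset(h for h in hyps if test[h] == 1)
    return ones, hyps - ones

def greedy_tree(hyps, tests):
    """Return a node (hyps, child, child); a leaf is (hyps, None, None)."""
    if len(hyps) <= 1:
        return (hyps, None, None)
    # Greedy rule (assumed): pick the test whose smaller (minority) side is largest.
    best = max(tests, key=lambda t: min(len(part) for part in split(t, hyps)))
    a, b = split(best, hyps)
    if not a or not b:
        raise ValueError("no test separates the remaining hypotheses")
    return (hyps, greedy_tree(a, tests), greedy_tree(b, tests))

def cost_by_leaves(node, n, depth=0):
    """Average leaf depth under uniform weights 1/n."""
    hyps, left, right = node
    if left is None:
        return Fraction(depth * len(hyps), n)
    return cost_by_leaves(left, n, depth + 1) + cost_by_leaves(right, n, depth + 1)

def cost_by_vertices(node, n):
    """Sum of the weights of the interior vertices, each weight being |L(v)| / n."""
    hyps, left, right = node
    if left is None:
        return Fraction(0)
    return Fraction(len(hyps), n) + cost_by_vertices(left, n) + cost_by_vertices(right, n)

# Hypothetical instance: hypotheses 0..5 and threshold tests "h < k" for k = 1..5.
n = 6
tests = [tuple(int(h < k) for h in range(n)) for k in range(1, n)]
tree = greedy_tree(frozenset(range(n)), tests)

assert cost_by_leaves(tree, n) == cost_by_vertices(tree, n)
print(cost_by_leaves(tree, n))  # 8/3
```

Each hypothesis contributes its weight once to every interior vertex on its root-to-leaf path, which is why the two sums agree on any tree, not only on this instance.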
At a high level, our proof defines balanced and imbalanced vertices (next subsection) using a parameter and bounds the weight of the balanced and imbalanced vertices separately. We bound the weight of the balanced vertices by an entropy argument, and the weight of the imbalanced vertices by partitioning the imbalanced vertices into paths, called chains, and bounding the weight of each chain separately. Overall, we get the following bound.
(19) 
Choosing gives .
For the rest of the proof, fix . Additionally, for convenience and without loss of generality, assume that our instance is nontrivial, i.e. there is some test such that both of and have at least 2 hypotheses, as otherwise the greedy tree is optimal and the theorem holds.
A.3 More notation: Majority and minority answers
We define majority and minority answers, edges, and children. These definitions are useful for defining balanced and imbalanced vertices. We later show that imbalanced vertices form paths whose edges are majority edges; we call these paths chains. We then analyze the balanced and imbalanced vertices separately, and in particular analyze each path of majority edges separately.
For each vertex in the greedy tree, let denote the test used at . For each vertex , label its children by and so that , with ties broken by labeling by the vertex corresponding to a test outputting 1. (Any tie-breaking procedure suffices, as long as the tie-breaking is consistent with the and notation in the next paragraph.) (A vertex could have only one child if its test does not distinguish any pair of hypotheses at that vertex, but such a test is useless and never appears in either the greedy or the optimal tree, so we assume such vertices do not exist.) Accordingly, we have for all . Call the edge from to a majority edge, and the edge from to a minority edge. This is illustrated in Figure 4.
To reason about the greedy tree precisely, we use the following more technical notation. For test and hypotheses , let be the answer to test that accounts for the maximum weight of hypotheses in , and let be the other index, with ties broken by . In other words, and are chosen so that . We call the majority answer of test with respect to hypothesis set . Call the other answer the minority answer of test with respect to hypothesis set . For all and , let
(20) 
We think of () as the set of hypotheses that, under test , output the majority (minority) answer to test with respect to set . Note that, with the above notation, we have and .
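The majority and minority sets of a test with respect to a hypothesis set can be illustrated with a short Python sketch. The function name `majority_minority`, the encoding of a test as a 0/1 tuple, and the example data are hypothetical. Breaking ties toward answer 1 is an assumption that mirrors the tie-breaking described for the children above.

```python
def majority_minority(test, hyps):
    """Split hyps into (majority side, minority side) under a binary test.

    `test` is a tuple of 0/1 answers indexed by hypothesis. Weights are
    uniform here, so comparing weights reduces to comparing set sizes.
    A tie is resolved by treating answer 1 as the majority answer
    (an assumption made for this sketch).
    """
    hyps = frozenset(hyps)
    ones = frozenset(h for h in hyps if test[h] == 1)
    zeros = hyps - ones
    return (ones, zeros) if len(ones) >= len(zeros) else (zeros, ones)

# The same test can have different majority answers on different hypothesis sets.
test = (1, 1, 0, 1, 0, 0, 0)
maj_a, min_a = majority_minority(test, {0, 1, 2, 3})
maj_b, min_b = majority_minority(test, {2, 3, 4})
print(sorted(maj_a), sorted(min_a))  # [0, 1, 3] [2]
print(sorted(maj_b), sorted(min_b))  # [2, 4] [3]
```

On the first set the majority answer is 1, while on the second it is 0, which is why the notation is indexed by both the test and the hypothesis set.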
The following is a key property of the greedy tree : the weight of hypotheses consistent with the minority child decreases as we descend the tree.
Lemma A.3.
For any vertices of with a descendant of , we have .
Proof.
Because was constructed greedily, for all , the test was chosen to maximize the weight of , the hypotheses in giving the minority answer . Hence, any other test, in particular the test chosen at vertex , has at most as large a weight of hypotheses of that give the minority answer of with respect to hypotheses . That is, we have . Therefore,
(21) 
The second inequality holds because . The third inequality holds because test defines a partition of into two parts, and is one of the two parts, so is one of or . ∎
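Lemma A.3 can be checked empirically on a small instance. The sketch below, under the assumption that greedy picks the test maximizing the size of the minority side, builds the greedy tree recursively and records, for every interior vertex other than the root, the minority size of its parent together with its own minority size. Comparing parent and child suffices, since the inequality then extends to all ancestors by transitivity. The instance and the names `split` and `minority_pairs` are hypothetical.

```python
def split(test, hyps):
    """Partition hyps by the answer of a binary test (a 0/1 tuple indexed by hypothesis)."""
    ones = frozenset(h for h in hyps if test[h] == 1)
    return ones, hyps - ones

def minority_pairs(hyps, tests, parent_min=None):
    """Build the greedy tree on hyps and return (parent minority size, own minority size)
    for every interior vertex that has a parent."""
    if len(hyps) <= 1:
        return []  # leaves have no test and hence no minority child
    # Greedy rule (assumed): maximize the size of the smaller side of the split.
    best = max(tests, key=lambda t: min(len(part) for part in split(t, hyps)))
    a, b = split(best, hyps)
    own_min = min(len(a), len(b))
    if own_min == 0:
        raise ValueError("no test separates the remaining hypotheses")
    here = [] if parent_min is None else [(parent_min, own_min)]
    return here + minority_pairs(a, tests, own_min) + minority_pairs(b, tests, own_min)

# Hypothetical instance: 9 hypotheses, one threshold test, one parity test,
# and all singleton tests (which guarantee every hypothesis can be identified).
n = 9
tests = [tuple(int(h < 3) for h in range(n)),
         tuple(int(h % 2 == 0) for h in range(n))]
tests += [tuple(int(h == k) for h in range(n)) for k in range(n)]

pairs = minority_pairs(frozenset(range(n)), tests)
assert all(child <= parent for parent, child in pairs)
```

With uniform weights, set sizes are proportional to weights, so the size comparison is equivalent to the weight comparison in the lemma.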
A.4 Defining balanced and imbalanced vertices
In the following definition, we identify balanced vertices and imbalanced vertices. By Lemma A.2, we can separately bound the weights of the balanced and imbalanced vertices.
Definition A.4.
Let be a positive integer.

We say a vertex is level imbalanced if and .

We say a vertex is imbalanced if it is level imbalanced for some , and balanced otherwise.

We say a level imbalanced vertex is minimal if no descendant of is also a level imbalanced vertex, and a level imbalanced vertex is maximal if no ancestor of is level imbalanced.
Let
(22) 
and note that level imbalanced vertices exist only for . The following lemma proves a structural result about imbalanced vertices, with the punchline being item (iii), which permits Definition A.6. For an illustration, see Figure 5.
Lemma A.5.
Let be a positive integer.
(i) If is a level imbalanced vertex, then, among the children of , only can be a level imbalanced vertex.
(ii) Additionally, if and are level imbalanced vertices and is an ancestor of , then every vertex on the path from to is a level imbalanced vertex.
(iii) Finally, the set of level imbalanced vertices can be partitioned into vertex disjoint paths, each of which connects a maximal level imbalanced vertex to a minimal level imbalanced vertex and contains only majority edges.
Proof.
For (i), note that if is level imbalanced, then , so cannot be level imbalanced. Hence, among the children of , only can be level imbalanced.
For (ii), let be three vertices in the tree. Suppose that and are level imbalanced. We know that , and Lemma A.3 gives . Hence is level imbalanced.
For (iii), note that each level imbalanced vertex has a maximal level imbalanced ancestor (possibly itself), so we may partition the level imbalanced vertices into sets based on their maximal level imbalanced ancestor. We claim each set in the partition is a connected path. Let be the (unique) maximal level imbalanced vertex in a set . For , if has a level imbalanced child, let be that child, which is unique by (i) and in by definition. Let be the largest index such that is defined. Then has no level imbalanced children. We claim are the only vertices in the set . Suppose not. Let be the largest index such that has a level imbalanced descendant not among . Then, by (ii), every vertex on the path from to is level imbalanced. If , this means has a level imbalanced child, contradicting the maximality of . Otherwise, as is maximal, is not on the path from to , in which case, by (ii), is level imbalanced, which contradicts (i). Thus, we always have a contradiction, so is the path . By (i), every edge along is a majority edge. This completes the proof. ∎
Lemma A.5 motivates the following definition.
Definition A.6.
Let be a positive integer. A level chain,