Evaluation of DNF Formulas
Stochastic Boolean Function Evaluation (SBFE) is the problem of determining the value of a given Boolean function $f$ on an unknown input $x$, when each bit $x_i$ of $x$ can only be determined by paying a given associated cost $c_i$. Further, $x$ is drawn from a given product distribution: for each $x_i$, $\Pr[x_i = 1] = p_i$, and the bits are independent. The goal is to minimize the expected cost of evaluation. This problem has been studied in the Operations Research literature, where it is known as “sequential testing” of Boolean functions. It has also been studied in learning theory in the context of learning with attribute costs.
In this paper, we study the complexity of the SBFE problem for classes of DNF formulas. We consider both exact and approximate versions of the problem for subclasses of DNF, for arbitrary costs and product distributions, and for unit costs and/or the uniform distribution. Because of the NP-hardness of satisfiability, the general SBFE problem is easily shown to be NP-hard for arbitrary DNF formulas .
We consider the SBFE problem for monotone $k$-DNF and $k$-term DNF formulas. We use a simple reduction to show that the SBFE problem for $k$-DNF is NP-hard, even for $k = 2$. We present an algorithm for evaluating monotone $k$-DNF that achieves a solution that is within a factor of $4/\rho^k$ of optimal, where $\rho$ is either the minimum $p_i$ value or the minimum $1 - p_i$ value, whichever is smaller. We present an algorithm for evaluating monotone $k$-term DNF with an approximation factor of $\max\{2k, \frac{2}{\rho}(1 + \ln k)\}$. We also prove that the SBFE problem for monotone $k$-term DNF can be solved exactly in polynomial time for constant $k$.
Previously, Kaplan et al. gave an approximation algorithm solving the SBFE problem for CDNF formulas (and decision trees) for the special case of unit costs, the uniform distribution, and monotone CDNF formulas . CDNF formulas are formulas consisting of a DNF formula together with an equivalent CNF formula, so the size of the input depends both on the size of the CNF and the size of the DNF. Having both formulas makes the evaluation problem easier. They showed that their algorithm achieves a solution whose cost is within an factor of the expected certificate cost, where is the number of terms of the DNF, and is the number of clauses. The expected certificate cost is a lower bound on the cost of the optimal solution. Deshpande et al. subsequently gave an algorithm solving the unrestricted SBFE problem for CDNF formulas, whose solution is within a factor of of optimal, for arbitrary costs, arbitrary probabilities, and without the monotonicity assumption . Thus the Deshpande et al. result solves a more general problem than that of Kaplan et al., but their approximation bound is weaker because it is not in terms of expected certificate cost.
The Kaplan et al. algorithm uses a round robin technique that alternates between two processes, one of which attempts to achieve a 0-certificate and one which attempts to achieve a 1-certificate. The technique requires unit costs. We show how to modify the technique to handle non-unit costs, with no change in the approximation bound. The algorithm can also be trivially extended to remove the uniform distribution restriction, changing the approximation bound to .
We do not know how to remove the assumption of Kaplan et al. that the CDNF formula is monotone, while still achieving an approximation factor that is within of the expected certificate cost. We do show, however, that this approximation factor is close to optimal, even for the special case they considered. We prove that, with respect to the expected certificate cost, the approximation factor must be at least , for any constant where .
This proof also implies that the (optimal) average depth of a decision tree computing a Boolean function can be exponentially larger than the average certificate size for that function (i.e., the average of the minimum-size certificates for all assignments). In contrast, the depth complexity of a decision tree for a function (a worst-case measure) is at most quadratic in its certificate complexity.
2 Stochastic Boolean Function Evaluation
The formal definition of the Stochastic Boolean Function Evaluation (SBFE) problem is as follows. The input is a representation of a Boolean function $f(x_1, \ldots, x_n)$ from a fixed class of representations, a probability vector $p = (p_1, \ldots, p_n)$, where $0 < p_i < 1$, and a real-valued cost vector $c = (c_1, \ldots, c_n)$. An algorithm for this problem must compute and output the value of $f$ on an $x \in \{0,1\}^n$ drawn randomly from the product distribution defined by $p$, i.e., the distribution where $\Pr[x_i = 1] = p_i$ and the $x_i$ are independent. However, the algorithm is not given direct access to $x$. Instead, it can discover the value of any $x_i$ only by “testing” it, at a cost of $c_i$. The algorithm must perform the tests sequentially, each time choosing the next test to perform. The algorithm can be adaptive, so the choice of the next test can depend on the outcomes of the previous tests. The expected cost of the algorithm is the cost it incurs on a random $x$ drawn from this distribution. (Note that since each $p_i$ is strictly between 0 and 1, the algorithm must continue doing tests until it has obtained a 0-certificate or 1-certificate for the function.) The algorithm is optimal if it has the minimum possible expected cost with respect to this distribution.
We consider the running time of the algorithm to be the (worst-case) time it takes to determine the single next variable to be tested, or to compute the value of $f(x)$ after the last test result is received. The algorithm corresponds to a Boolean decision tree (testing strategy) computing $f$, indicating the adaptive sequence of tests.
SBFE problems arise in many different application areas. For example, in medical diagnosis, the $x_i$ might correspond to medical tests performed on a given patient, where $f(x) = 1$ if the patient should be diagnosed as having a particular disease. In query optimization in databases, $f$ could correspond to a Boolean query, on predicates corresponding to $x_1, \ldots, x_n$, that has to be evaluated for every tuple in the database in order to find tuples satisfying the query [11, 13, 4, 14].
There are polynomial-time algorithms solving the SBFE problem exactly for a small number of classes of Boolean formulas, including read-once DNF formulas and $k$-of-$n$ formulas (see  for a survey of exact algorithms). There is a naive approximation algorithm for evaluating any function under any distribution that achieves an approximation factor of $n$: simply test the variables in increasing order of their costs. This follows easily from the fact that the cost incurred by the naive algorithm in evaluating function $f$ on an input $x$ is at most $n$ times the cost of the min-cost certificate for $f$ contained in $x$.
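As a rough illustration of the naive strategy just described, the following Python sketch tests variables in increasing cost order and stops as soon as the tested values form a certificate. The names `forced_value` and `naive_evaluate`, and the brute-force certificate check (exponential in the number of untested variables, so only usable on tiny instances), are assumptions made for this example and are not taken from the paper.

```python
from itertools import product

def forced_value(f, n, known):
    """Return the value of f forced by the partial assignment `known`
    (a dict index -> bit), or None if `known` is not yet a certificate."""
    free = [i for i in range(n) if i not in known]
    values = set()
    for bits in product([0, 1], repeat=len(free)):
        x = [0] * n
        for i, v in known.items():
            x[i] = v
        for i, b in zip(free, bits):
            x[i] = b
        values.add(int(bool(f(x))))
        if len(values) > 1:
            return None
    return values.pop()

def naive_evaluate(f, n, costs, x):
    """Test variables in increasing cost order until f(x) is certified.
    Returns (value of f(x), total cost paid)."""
    known, spent = {}, 0.0
    value = forced_value(f, n, known)
    for i in sorted(range(n), key=lambda i: (costs[i], i)):
        if value is not None:
            break
        known[i] = x[i]          # "test" variable i
        spent += costs[i]
        value = forced_value(f, n, known)
    return value, spent

# Hypothetical example: f = (x0 AND x1) OR x2 with costs 1, 2, 3.
f = lambda x: (x[0] and x[1]) or x[2]
print(naive_evaluate(f, 3, [1, 2, 3], [1, 1, 0]))
```

On the input `[1, 1, 0]` the sketch stops after the two cheapest tests, because setting the first two variables to 1 already certifies the value 1.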
Deshpande et al. explored a generic approach to developing approximation algorithms for SBFE problems, called the $Q$-value approach. It involves reducing the problem to an instance of Stochastic Submodular Set Cover and then solving it using the Adaptive Greedy algorithm of Golovin and Krause. They proved that the $Q$-value approach does not yield a sublinear approximation bound for evaluating $k$-DNF formulas, even for $k = 2$. They also developed a new algorithm for solving Stochastic Submodular Set Cover, called Adaptive Dual Greedy, and used it to obtain a 3-approximation algorithm solving the SBFE problem for linear threshold formulas.
Table 1 summarizes work on the SBFE problem for classes of DNF formulas, and for monotone versions of those classes. The table includes both previous results and the results in this paper.
|DNF formula||general case||monotone case|
|read-once DNF||•-time algorithm [12, 8]|
The abbreviations uc and ud are used to refer to unit costs and uniform distribution, respectively. refers to the number of terms in the DNF, refers to the number of clauses in the CNF. is the minimum value of any or . Citations of results from this paper are enclosed in parentheses and include the section number. All approximation factors are with respect to E[CERT], the expected certificate cost, except for the CDNF bound of . That bound is with respect to E[OPT], the expected cost of the optimal strategy, which is lower bounded by E[CERT].
A literal is a variable or its negation. A term is a possibly empty conjunction ($\wedge$) of literals. If the term is empty, all assignments satisfy it. A clause is a possibly empty disjunction ($\vee$) of literals. If the clause is empty, no assignments satisfy it. The size of a term or clause is the number of literals in it.
A DNF (disjunctive normal form) formula is either the constant 0, the constant 1, or a formula of the form $T_1 \vee \cdots \vee T_k$, where $k \geq 1$ and each $T_i$ is a term. Likewise, a CNF (conjunctive normal form) formula is either the constant 0, the constant 1, or a formula of the form $C_1 \wedge \cdots \wedge C_m$, where each $C_i$ is a clause.
A $k$-term DNF is a DNF formula consisting of at most $k$ terms. A $k$-DNF is a DNF formula where each term has size at most $k$. The size of a DNF (CNF) formula is the number of its terms (clauses); if it is the constant 0 or 1, its size is 1. A DNF formula is monotone if it contains no negations. A read-once DNF formula is a DNF formula where each variable appears at most once.
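Given a Boolean function $f : \{0,1\}^n \to \{0,1\}$, a partial assignment $b \in \{0,1,*\}^n$ is a 0-certificate (1-certificate) of $f$ if $f(a) = 0$ ($f(a) = 1$) for all $a \in \{0,1\}^n$ such that $a_i = b_i$ for all $i$ with $b_i \neq *$. It is a certificate for $f$ if it is either a 0-certificate or a 1-certificate. Given a cost vector $c$, the cost of a certificate $b$ is $\sum_{j : b_j \neq *} c_j$. We say that input $x$ contains certificate $b$ if $x_i = b_i$ for all $i$ such that $b_i \neq *$. The variables in a certificate $b$ are the $x_i$ such that $b_i \neq *$. If $x$ contains $b$ and $S$ is a superset of the variables in $b$, then we say that $S$ contains $b$.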
The expected certificate cost of a function $f$, with respect to cost vector $c$ and probability vector $p$, is $E[\mathrm{CERT}(x)]$, where the expectation is with respect to $x$ drawn from the product distribution defined by $p$, and $\mathrm{CERT}(x)$ is the minimum cost of a certificate of $f$ contained in $x$.
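To make the definition concrete, here is a small brute-force Python sketch that computes the minimum cost of a certificate contained in an input and then averages it over the product distribution. The function names `min_cert_cost` and `expected_cert_cost` are hypothetical, and the enumeration over all subsets and all inputs is an assumption chosen for clarity; it is exponential in the number of variables and intended only for toy examples.

```python
from itertools import combinations, product

def min_cert_cost(f, n, costs, x):
    """Minimum cost of a certificate of f contained in input x (brute force)."""
    best = None
    for r in range(n + 1):
        for S in combinations(range(n), r):
            free = [i for i in range(n) if i not in S]
            values = set()
            # Fixing x on S is a certificate iff f is constant on all completions.
            for bits in product([0, 1], repeat=len(free)):
                y = list(x)
                for i, b in zip(free, bits):
                    y[i] = b
                values.add(int(bool(f(y))))
            if len(values) == 1:
                cost = sum(costs[i] for i in S)
                if best is None or cost < best:
                    best = cost
    return best

def expected_cert_cost(f, n, costs, probs):
    """Expected certificate cost, where Pr[x_i = 1] = probs[i] independently."""
    total = 0.0
    for x in product([0, 1], repeat=n):
        weight = 1.0
        for i in range(n):
            weight *= probs[i] if x[i] else 1.0 - probs[i]
        total += weight * min_cert_cost(f, n, costs, x)
    return total

# Hypothetical example: f = x0 AND x1, unit costs, uniform distribution.
and2 = lambda x: x[0] and x[1]
print(expected_cert_cost(and2, 2, [1, 1], [0.5, 0.5]))
```

For the two-variable AND, three of the four inputs contain a 0-certificate of size 1 and the remaining input needs both variables, which gives an expected certificate cost of 1.25.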
Given a Boolean function , let denote the minimum expected cost of any algorithm solving the SBFE for , in the unit-cost, uniform distribution case. Let denote the expected certificate cost, in the unit cost, uniform distribution case.
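The set covering problem is as follows: Given a ground set $A = \{a_1, \ldots, a_m\}$ of elements, a set $\mathcal{S} = \{S_1, \ldots, S_n\}$ of subsets of $A$, and a positive integer $k$, does there exist $\mathcal{S}' \subseteq \mathcal{S}$ such that $\bigcup_{S_i \in \mathcal{S}'} S_i = A$ and $|\mathcal{S}'| \leq k$? Each set $S_i$ is said to cover the elements it contains. Thus the set covering problem asks whether $A$ has a “cover” of size at most $k$.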
4 Hardness of the SBFE problem for monotone DNF
Before presenting approximation algorithms solving the SBFE problem for classes of monotone DNF, we begin by discussing the hardness of the exact problem.
Greiner et al. showed that the SBFE problem for CNF formulas is NP-hard, as follows. If a CNF formula is unsatisfiable, then no tests are necessary to determine its value on an assignment $x$. If there were a polynomial-time algorithm solving the SBFE problem for CNF formulas, we could use it to solve SAT: given a CNF formula $\phi$, we could run the SBFE algorithm on $\phi$ (with arbitrary $p$ and $c$), and just observe whether the algorithm begins by choosing a variable to test, or whether it immediately outputs 0 as the value of the formula. Thus the SBFE problem on CNF formulas is NP-hard, and by duality, the same is true for DNF formulas.
Moreover, if $P \neq NP$, we cannot approximate the SBFE problem for DNF within any factor $\tau > 1$. If a $\tau$-approximation algorithm existed, then on a tautological DNF $\phi$, the algorithm would have to immediately output 1 as the value of $\phi$, because the optimal cost is 0 and $\tau \cdot 0 = 0$. On non-tautological $\phi$, the algorithm would instead have to specify a variable to test.
The SBFE problem for DNF is still NP-hard even when the DNF is monotone. To show this, we use an approach used by Cox  in proving NP-hardness of linear threshold evaluation. Intuitively, in an instance of SBFE with unit costs if the probabilities are very close to 0 (or 1), then the expected cost of evaluation is dominated by the cost of evaluating the given function on a specific input . That cost is minimized by testing only the variables in a minimum-cost certificate for on . The idea, then, is to show hardness of the SBFE problem for a class of formulas by reducing an NP-hard problem to the problem of finding, given and a particular input , a smallest size certificate of contained in . Cox reduced from Knapsack, and here we reduce from Vertex-Cover. The following lemma is implicit in the proof of Lemma 1 of Cox:
Let be a Boolean decision tree computing Boolean function . For , let and let . Let , let be the vector of unit costs, and let denote the all 0’s assignment. If with respect to and , has minimum expected evaluation cost over all decision trees computing , then the variables tested along the path corresponding to in are precisely those set to in a min-cost certificate for contained in .
If $P \neq NP$, there is no polynomial time algorithm solving the SBFE problem for monotone DNF. This holds even with unit costs, and even for $k$-DNF where $k = 2$. Also, if $P \neq NP$, the SBFE problem for monotone DNF, even with unit costs, cannot be approximated to within a factor of less than $c \ln n$, for some constant $c > 0$.
Suppose there is a polynomial-time algorithm ALG for the SBFE problem for monotone 2-DNF, with unit costs and arbitrary probabilities. We show this algorithm could be used to solve the Vertex Cover problem: Given a graph $G = (V, E)$, find a minimum-size vertex cover for $G$, i.e., a minimum-size set of vertices $V' \subseteq V$ such that for each edge $(u, v) \in E$, $\{u, v\} \cap V' \neq \emptyset$.
The reduction is as follows. Given graph $G = (V, E)$, construct a monotone 2-DNF formula $\phi$ whose variables $x_v$ correspond to the vertices $v \in V$, and whose terms $x_u x_v$ correspond to the edges $(u, v)$ in $E$. Consider the all 0’s assignment $0^n$. Since a 0-certificate for $\phi$ must set each term of $\phi$ to 0, any min-cost certificate for $\phi$ contained in $0^n$ must also be a minimum-size vertex cover for $G$. Thus by the previous lemma, one can find a minimum-size vertex cover for $G$ by using ALG to evaluate $\phi$ on input $0^n$, with unit costs and the probabilities given in Lemma 1, and observing which variables are tested.
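The correspondence used in this reduction can be checked on small graphs with the Python sketch below. It builds the monotone 2-DNF from an edge list and finds, by exhaustive search, a minimum set of variables whose zero values falsify every term, which is exactly a minimum vertex cover. The function names are hypothetical and the exhaustive search merely stands in for the assumed algorithm ALG; it is not a polynomial-time procedure.

```python
from itertools import combinations

def graph_to_monotone_2dnf(edges):
    """One term (a set of two vertex-variables) per edge."""
    return [frozenset(edge) for edge in edges]

def is_zero_certificate(terms, zero_vars):
    """Setting `zero_vars` to 0 falsifies the DNF iff every term contains one of them."""
    return all(term & zero_vars for term in terms)

def min_zero_certificate(terms):
    """Smallest 0-certificate contained in the all-zeros assignment (brute force)."""
    variables = sorted(set().union(*terms)) if terms else []
    for r in range(len(variables) + 1):
        for subset in combinations(variables, r):
            if is_zero_certificate(terms, set(subset)):
                return set(subset)

# Hypothetical example: the path 1 - 2 - 3 has the minimum vertex cover {2}.
path_terms = graph_to_monotone_2dnf([(1, 2), (2, 3)])
print(min_zero_certificate(path_terms))
```

Under unit costs, the size of the returned set is the cost of the min-cost certificate contained in the all-zeros input.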
A more general version of this reduction can be used to reduce the general Set Cover problem to the SBFE problem for monotone DNF (with terms of arbitrary length). The non-approximability bound in the theorem then follows from the inapproximability result for Set Cover .
Given the difficulty of exactly solving the SBFE problem for monotone DNF formulas, we now consider approximation algorithms.
5 Approximation algorithms for the evaluation of monotone $k$-DNF and $k$-term DNF
5.1 Monotone $k$-DNF formulas
In this section, we will present a polynomial time algorithm for evaluating monotone $k$-DNF formulas. To evaluate $f$ we will alternate between two algorithms, Alg0 and Alg1, each of which performs tests on the variables $x_i$. Alg0 tries to find a min-cost 0-certificate for $f$, and Alg1 tries to find a min-cost 1-certificate for $f$. As soon as one of these algorithms succeeds in finding a certificate, we know the value of $f(x)$, and can output it.
This basic approach was used previously by Kaplan et al.  in their algorithm for evaluating monotone CDNF formulas in the unit cost, uniform distribution case. They used a standard greedy set-cover algorithm for both Alg0 and Alg1, with a strict round-robin policy that alternated between doing one test of Alg0 and one test of Alg1. Our algorithm uses a dual greedy set-cover algorithm for Alg0 and a different, simple algorithm for Alg1. The strict round-robin policy used by Kaplan et al. is only suitable for unit costs, and our algorithm has to handle arbitrary costs. Our algorithm uses a modified round-robin protocol instead. We begin by presenting that protocol.
Although we will use the protocol with a particular Alg0 and Alg1, it works for any Alg0 and Alg1 that “try” to find 0-certificates and 1-certificates respectively. In the case of Alg0, this means that Alg0 will succeed in outputting a 0-certificate of $f$ contained in $x$ if $f(x) = 0$, and will eventually terminate and report failure otherwise. Similarly, Alg1 will output a 1-certificate contained in $x$ if $f(x) = 1$, and will report failure otherwise.
The modified round-robin protocol works as follows. It maintains two values: and , where is the cumulative cost of all tests performed so far in Alg0, and is the cumulative cost of all tests performed so far in Alg1. At each step of the protocol, each of Alg0 and Alg1 independently determines a test to be performed next and the protocol chooses one of them. (Initially, the two tests are the first tests of Alg0 and Alg1 respectively.) Let and denote the respective costs of these tests. Let denote the next test of Alg1 and let denote the next test of Alg0. To choose which test to perform, the protocol uses the following rule: if it performs test , otherwise it performs test .
The result of the test is given to the algorithm to which it belongs, and that algorithm continues until it either (1) computes a new next test, (2) terminates successfully and outputs a certificate, or (3) terminates by reporting failure. In the first case, the protocol again chooses between the next test of Alg0 and Alg1, using the rule above. In the second, the protocol terminates because one of the algorithms has output a certificate. In the third, the protocol runs the other algorithm (the one that did not terminate) until completion, performing all of its remaining tests. That algorithm is guaranteed to output a certificate, because if doesn’t have a 0-certificate for , it must have a 1-certificate, and vice-versa.
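The control flow of the protocol can be sketched in Python as follows. Several things here are assumptions made for illustration rather than details confirmed by the text: the selection rule (perform Alg0's next test when its cumulative cost plus the cost of that test is at most the corresponding quantity for Alg1), the modelling of Alg0 and Alg1 as generators, and the two toy certificate searches `toy_alg0` and `toy_alg1`, which are much simpler than the algorithms used in the paper.

```python
def modified_round_robin(alg0, alg1, costs, x):
    """Interleave two certificate searches; return (f(x), [cost of Alg0, cost of Alg1]).
    Each algorithm is a generator that yields the index of the variable to test,
    receives the value via send(), and returns True (certificate) or False (failure)."""
    algs = (alg0, alg1)
    spent = [0.0, 0.0]       # cumulative cost of tests performed by Alg0 / Alg1
    pending = [None, None]   # next test requested by each algorithm
    result = [None, None]    # None = still running, True / False = finished

    def resume(b, value=None):
        try:
            pending[b] = next(algs[b]) if value is None else algs[b].send(value)
        except StopIteration as stop:
            pending[b], result[b] = None, bool(stop.value)

    def perform(b):
        i = pending[b]
        spent[b] += costs[i]   # no sharing: each algorithm pays for its own tests
        resume(b, x[i])

    resume(0)
    resume(1)
    while result[0] is None and result[1] is None:
        c0, c1 = costs[pending[0]], costs[pending[1]]
        # Assumed selection rule (see the hedged description above).
        perform(0 if spent[0] + c0 <= spent[1] + c1 else 1)
    if True not in result:     # one algorithm failed: run the other to completion
        survivor = 0 if result[0] is None else 1
        while result[survivor] is None:
            perform(survivor)
    return (0 if result[0] else 1), spent

def toy_alg0(terms, n):
    """Toy 0-certificate search: test variables in index order."""
    zeros = set()
    for i in range(n):
        if all(set(t) & zeros for t in terms):
            return True
        if (yield i) == 0:
            zeros.add(i)
    return all(set(t) & zeros for t in terms)

def toy_alg1(terms):
    """Toy 1-certificate search: evaluate the terms one after another."""
    known = {}
    for term in terms:
        satisfied = True
        for i in sorted(term):
            if i not in known:
                known[i] = yield i
            if known[i] == 0:
                satisfied = False
                break
        if satisfied:
            return True
    return False

terms = [(0, 1), (2,)]       # hypothetical formula (x0 AND x1) OR x2
print(modified_round_robin(toy_alg0(terms, 3), toy_alg1(terms), [1, 1, 1], [1, 1, 0]))
```

In this sketch the algorithm that produces the certificate ends with a cumulative cost at least as large as the other's, so the total cost is at most twice the cost of the successful algorithm alone.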
Note that it would be possible for the above protocol to share information between Alg0 and Alg1, so that if $x_i$ was tested by Alg0, Alg1 would not need to retest $x_i$. However, to simplify the analysis, we do not have the protocol do such sharing. We now show that the following invariant holds at the end of each step of the protocol, provided that neither Alg0 nor Alg1 terminated in that iteration.
At the end of each step of the above modified round-robin protocol, if was tested in that step, then . Otherwise, if was tested, then at the end of the step.
The invariant clearly holds after the first step. Suppose it is true at the end of the th step, and without loss of generality assume that was tested during that step. Thus at the end of the th step.
Consider the st step. Note that the value of is the same in this step as in the previous one, because in the previous step, we did not execute the next step of Alg0. There are 2 cases, depending on which if-condition is satisfied when the rule is applied in this step, or .
Case 1: is satisfied.
Then is tested in this step and increases by . We show that and at the end of the step, which is what we need. At the start of the step, and at the end, is augmented by , so . Consequently, . Further, by assumption, at the start of the step, and hence at the end, .
Case 2: is satisfied [and by assumption at the start]
Then is tested in this step, and increases by . We show that and at the end of the step. By the condition in the case, at the start of the step, so at the end, , and hence . Further, by assumption at the start, and since only was increased, this also holds at the end.
We can now prove the following lemma:
If , then at the end of the modified round-robin protocol, . The lemma holds true symmetrically if .
There are two ways for the protocol to terminate. Either Alg0 or Alg1 is detected to have succeeded at the start of the repeat loop, or within the loop, one fails and the other is run to successful termination.
Suppose the former, and without loss of generality suppose it is Alg0 that succeeded. It follows that it was that was tested at the end of the previous step (unless this is the first step, which would be an easy case), because otherwise, the success of Alg0 would have been detected in an earlier step.
Thus at the end of the last step, by Lemma 2, .
Suppose instead that one algorithm fails, and without loss of generality, suppose it was Alg0, and thus we ran Alg1 to termination. Since Alg0 did not fail in a prior step, it follows that in the previous step, was tested (unless this is the first step, which would be an easy case). Thus at the end of the previous step, by the invariant, and so . We have to run at least one step of Alg1 when we run it to termination. Thus running Alg1 to termination augments by , and so at the end of the algorithm, we have .
We now describe the particular Alg0 and Alg1 that we use in our algorithm for evaluating monotone -DNF. We describe Alg0 first. Since is a monotone function, the variables in any 0-certificate for must all be set to 0. Consider an assignment such that . Let . Finding a min-cost 0-certificate for contained in is equivalent to solving the set-cover instance where the elements to be covered are the terms , and for each , there is a corresponding subset .
Suppose . If Alg0 was given both and as input, it could find an approximate solution to this set cover instance using Hochbaum’s Dual Greedy algorithm for (weighted) set cover . This algorithm selects items to place in the cover, one by one, based on a certain greedy choice rule.
Alg0 is not given the set of variables that equal 0 in $x$, however. It can only discover the values of variables by testing them. We get around this as follows. Alg0 begins running Hochbaum’s algorithm, using the assumption that all variables are in that set. Each time that algorithm chooses a variable $x_i$ to place in the cover, Alg0 tests the variable $x_i$. If the test reveals that $x_i = 0$, Alg0 continues directly to the next step of Hochbaum’s algorithm. If, however, the test reveals that $x_i = 1$, it removes $x_i$ from consideration, and uses the greedy choice rule to choose the best variable from the remaining variables. The variables that are placed in the cover by Alg0 in this case are precisely those that would have been placed in the cover if we had run Hochbaum’s algorithm with the true set of 0-valued variables as input.
Hochbaum’s algorithm is guaranteed to construct a cover whose total cost is within a factor of $\alpha$ of the optimal cover, where $\alpha$ is the maximum number of subsets in which any ground element appears. Since each term can contain a maximum of $k$ literals, each term can be covered at most $k$ times. It follows that when $f(x) = 0$, Alg0 outputs a certificate that is within a factor of at most $k$ of the minimum cost certificate of $f$ contained in $x$.
If $f(x) = 1$, Alg0 will eventually test all elements without having constructed a cover, at which point it will terminate and report failure.
We now describe Alg1. Alg1 begins by evaluating the min-cost term $T$ of $f$, where the cost of a term is the sum of the costs of the variables in it. (In the unit-cost case, this is the shortest term. If there is a tie for the min-cost term, Alg1 breaks the tie in some suitable way, e.g., by the lexicographic ordering of the terms.) The evaluation is done by testing the variables of $T$ one by one in increasing cost order until a variable is found to equal 0, or all variables have been found to equal 1. (For variables $x_i$ with equal cost, Alg1 breaks ties in some suitable way, e.g., in increasing order of their indices $i$.) In the latter case, Alg1 terminates and outputs the certificate setting the variables in the term to 1.
Otherwise, for each tested variable in $T$, Alg1 replaces all occurrences of that variable in $f$ with its tested value. It then simplifies the formula (deleting terms with 0’s and deleting 1’s from terms, and optionally making the resulting formula minimal). Let $f'$ denote the simplified formula. Because $T$ was not satisfied, $f'$ does not contain any satisfied terms. If $f'$ is identically 0, $x$ does not contain a 1-certificate and Alg1 terminates unsuccessfully. Otherwise, Alg1 proceeds recursively on the simplified formula, which contains only untested variables.
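The description of Alg1 above translates fairly directly into code. The Python sketch below is one possible reading, written iteratively instead of recursively; the representation of a monotone DNF as a list of sets of variable indices, the specific tie-breaking keys, and the function name `alg1` are assumptions made for this example. It assumes strictly positive costs and skips the optional minimization step.

```python
def alg1(terms, costs, x):
    """Search for a 1-certificate of a monotone DNF contained in x.
    Returns (found, total cost of the tests performed)."""
    terms = [frozenset(t) for t in terms]
    tested, spent = {}, 0.0
    while terms:
        # Min-cost term; ties broken by the sorted list of variable indices.
        term = min(terms, key=lambda t: (sum(costs[i] for i in t), sorted(t)))
        satisfied = True
        # Test its variables in increasing cost order, ties by index.
        for i in sorted(term, key=lambda i: (costs[i], i)):
            tested[i] = x[i]
            spent += costs[i]
            if x[i] == 0:
                satisfied = False
                break
        if satisfied:
            return True, spent
        # Simplify: drop terms containing a variable found to be 0,
        # and delete variables found to be 1 from the remaining terms.
        simplified = []
        for t in terms:
            if any(tested.get(i) == 0 for i in t):
                continue
            simplified.append(frozenset(i for i in t if i not in tested))
        terms = simplified
    return False, spent      # formula became identically 0: no 1-certificate

# Hypothetical example: (x0 AND x1) OR (x2 AND x3 AND x4) with unit costs.
print(alg1([(0, 1), (2, 3, 4)], [1, 1, 1, 1, 1], [1, 0, 1, 1, 1]))
```

In the example the shorter term is tried first and fails on its second variable, after which the remaining term is evaluated in full.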
Having presented our Alg0 and Alg1, we are ready to prove the main theorem of this section.
The evaluation problem for monotone $k$-DNF can be solved by a polynomial-time approximation algorithm computing a strategy that is within a factor of $4/\rho^k$ of the expected certificate cost.
Let be the input monotone -DNF, defined on . We will also use to denote the function computed by this formula.
Let Alg be the algorithm for evaluating that alternates between the Alg0 and Alg1 algorithms just described, using the modified round-robin protocol.
Let and . Let denote the expected cost incurred by the round-robin algorithm in evaluating on random . Let denote the cost incurred by running the algorithm on . Let denote the probability that with respect to the product distribution . Thus is equal to . Similarly, let denote the cost of the minimum cost certificate of contained in . We need to show that the ratio between and is at most .
We consider first the costs incurred by Alg0 on inputs . Following the approach of Kaplan et al., we divide the tests performed by Alg0 into two categories, which we call useful and useless, and amortize the cost of the useless tests by charging them to the useful tests. More particularly, we say that a test on variable is useful to Alg0 if ( is added to the 0-certificate in this case) and useless if . The number of useful tests on is equal to the size of the certificate output by Alg0, and thus the total cost of the useful tests Alg0 performs on is at most .
Let denote the cost incurred by Alg0 alone when running Alg to evaluate on , and let denote the cost incurred by Alg1 alone. Suppose Alg0 performs a useless test on an , finding that . Let be the assignment produced from by setting to . Because and is monotone, too. Because and differ in only one bit, if Alg0 tests on assignment , it will test on , and that test will be useful. Thus each useless test performed by Alg0 on corresponds to a distinct useful test performed on an . When is tested, the probability that it is 1 is , and the probability that it is 0 is . Each useless test contributes to the expected cost, whereas each useful test contributes . If we multiply the contribution of the useful test by , we get the contribution of both a useful and a useless test, namely . To charge the cost of a useless test to its corresponding useful test, we can therefore multiply the cost of the useful test by (so that if, for example, , we charge double for the useful test). Because for all , it follows that . Hence,
We will now show, by induction on the number of terms of , that .
If has only one term, it has at most variables. In this case, Alg1 is just using the naïve algorithm which tests the variables in increasing cost order until the function value is determined. Since the cost of using the naïve algorithm on in this case is at most times , , and for all , it follows that . Thus we have the base case.
Assume for the purpose of induction that holds for having at most terms. Suppose has terms. Let denote the min-cost term. Let denote the cost of , and the number of variables in , so . If does not satisfy term , then after Alg1 evaluates term on , the results of the tests performed in the evaluation correspond to a partial assignment to the variables in . More particularly, if Alg1 tested exactly variables of , the test results correspond to the partial assignment setting the cheapest variables of to 1 and the th to 0, leaving all other variables in unassigned. There are thus possible values for . Let denote this set of partial assignments .
For , let denote the formula obtained from by replacing any occurrences of variables in by their assigned values in (if a variable in is not assigned in , then occurrences of those variables are left unchanged). Let is identically 0 , and let . For any , the cost incurred by Alg1 in evaluating on is at most . For , Alg1 only evaluates , so its total cost on is at most . Let denote the joint probability of obtaining the observed values of those variables tested in . More formally, if is the set of variables tested in , . We thus have the following recursive expression:
where is a random assignment to the variables of not assigned values in , chosen independently according to the relevant parameters of .
For any satisfying , since is min-cost and is monotone, . Let , and let be the partial assignment representing the results of the tests Alg1 performed in evaluating on . Let be the restriction of to the variables of not assigned values by . Any certificate for that is contained in can be converted into a certificate for , contained in , by simply removing the variables assigned values by . It follows that .
Since , the probability that satisfies the first term is at least . By ignoring the we get
The ratio between the first term in the expression bounding , to the first term in the expression bounding , is equal to . By induction, for each , . Thus
Clearly, . By Lemma 3, the cost incurred by Alg on any is at most twice the cost incurred by Alg1 alone on that . Thus . Further, because contributes to both and to the summation over .
It follows from the above that is at most , since .
5.2 Monotone $k$-term DNF formulas
We can use techniques from the previous subsection to obtain results for the class of monotone $k$-term DNF formulas as well. In Section 6, we will present an exact algorithm whose running time is exponential in $k$. Here we present an approximation algorithm that runs in time polynomial in $n$, with no dependence on $k$.
The evaluation problem for monotone $k$-term DNF can be solved by a polynomial-time approximation algorithm computing a strategy that is within a factor of $\max\{2k, \frac{2}{\rho}(1 + \ln k)\}$ of the minimum-cost certificate.
Let be the input monotone -term DNF, defined on .
Just as in the proof of Theorem 5.1, we will utilize a modified round robin protocol that alternates between one algorithm for finding a 0-certificate (Alg0) and one for finding a 1-certificate (Alg1). Again, let and .
However, in this case Alg0 will use Greedy, Chvátal’s well-known greedy algorithm for weighted set cover , instead of the Dual Greedy algorithm of Hochbaum. The standard greedy algorithm simply maximizes, at each iteration, “bang for the buck” by selecting the subset that covers the largest number of uncovered elements relative to the cost of selecting that subset. Greedy yields a approximation, where is the number of ground elements in the set cover instance and is the th harmonic number, which is upper bounded by . Once again, we will view the terms as ground elements and the variables that evaluate to 0 as the subsets. Since has at most terms, there are at most ground elements. On any , Greedy will yield a certificate that is within a factor of of the min-cost 0-certificate , and thus the cost incurred by the useful tests on (tests on where ) is at most . By multiplying by the charge to the variables that evaluate to 0, to account for the useless tests, we get that the cost incurred by Alg0 on , for , is at most .
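For readers unfamiliar with the greedy rule mentioned here, the following Python sketch shows the standard weighted greedy set cover on a fully known instance, together with the harmonic number that bounds its approximation ratio. It does not model the adaptive setting in which Alg0 must test variables to learn which subsets are available, and the function names and the dictionary-based instance format are assumptions made for this example.

```python
def greedy_set_cover(universe, subsets, costs):
    """Weighted greedy set cover: repeatedly pick the subset with the smallest
    cost per newly covered element. Returns the chosen subset names in order,
    or None if the universe cannot be covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue
            ratio = costs[name] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            return None
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

def harmonic(m):
    """m-th harmonic number, which is at most 1 + ln m."""
    return sum(1.0 / j for j in range(1, m + 1))

# Hypothetical instance: elements play the role of terms, subsets the role of variables.
subsets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}, "D": {1, 2}}
costs = {"A": 3.0, "B": 1.0, "C": 1.0, "D": 1.5}
print(greedy_set_cover({1, 2, 3, 4}, subsets, costs), harmonic(4))
```

In this instance the greedy rule first picks `B` (cost 0.5 per new element) and then `D` (cost 0.75 per new element).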
Alg1 in this case simply evaluates term by term, each time choosing the remaining term of minimum cost and evaluating all of the variables in it. Without loss of generality, let be the first (cheapest) term evaluated by Alg1, and be the th term evaluated. Suppose . If falsifies terms through and then satisfies , is precisely the cost of , and Alg1 terminates after evaluating . Since none of the costs of the first terms exceeds the cost of , the total cost of evaluating is at most times the cost of . Hence, Alg1 incurs a cost of at most .
By executing the two algorithms according to the modified round robin protocol, we can solve the problem of evaluating monotone $k$-term DNF with cost no greater than double the cost incurred by Alg1, when $f(x) = 1$, and no more than double the cost incurred by Alg0, when $f(x) = 0$. Hence the total cost of the algorithm is within a factor of $\max\{2k, \frac{2}{\rho}(1 + \ln k)\}$ of the cost of the min-cost certificate for $x$.
We now prove that the problem of exactly evaluating monotone $k$-term DNF can be solved in polynomial time for constant $k$.
6 Exact learning of monotone $k$-term DNF
In this section, we provide an exact algorithm for evaluating $k$-term DNF formulas in polynomial time for constant $k$. First, we will adapt results from Greiner et al. to show some properties of optimal strategies for monotone DNF formulas. Then we will use these properties to compute an optimal strategy for monotone $k$-term DNF formulas.
Greiner et al. consider evaluating read-once formulas with the minimum expected cost. Each read-once formula can be described by a rooted and-or tree where each leaf node is labeled with a test and each internal node is labeled as either an or-node or an and-node. The simplest read-once formulas are the simple AND and OR functions, where the depth of the and-or tree is 1. Other read-once formulas can be obtained by taking the AND or OR of other read-once formulas over disjoint sets of variables. In the and-or tree, an internal node whose children include at least one leaf is called a leaf-parent, leaves with the same parent are called leaf-siblings (or siblings), and the set of all children of a leaf-parent is called a sibling class. Intuitively, the siblings have the same effect on the value of the read-once formula. The ratio of a variable is defined to be . Further, tests and are R-equivalent if they are leaf-siblings and . An R-class is an equivalence class with respect to the relation of being R-equivalent. Greiner et al. show that, for any and-or tree (WLOG they assume that leaf-parents are OR nodes), there is an optimal strategy that satisfies the following conditions:
For any sibling tests and such that , is not performed before on any root-to leaf path of .
For any R-class , is contiguous with respect to .
We observe that by redefining siblings and sibling classes, corresponding properties hold for general monotone DNF formulas. In a DNF formula, we define a sibling class to be a maximal set of variables that appear in exactly the same set of terms. All the other definitions can be adapted accordingly. In this case, the ratio of a variable is . For instance, all variables are siblings in an AND function, whereas no two variables are siblings in an OR function.
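The redefinition of sibling classes for DNF formulas can be illustrated with a short sketch. It groups variables by the exact set of terms in which they appear. The function name `sibling_classes` and the representation of a monotone DNF as a list of sets of variable names are assumptions made for this illustration, not notation from the paper.

```python
from collections import defaultdict

def sibling_classes(terms):
    """Group the variables of a monotone DNF into sibling classes.

    `terms` is a list of sets of variable names (hypothetical encoding:
    each set is one term, i.e. a conjunction of unnegated variables).
    Returns a dict mapping a frozenset of term indices to the set of
    variables that appear in exactly those terms.
    """
    # For every variable, record the indices of the terms containing it.
    membership = defaultdict(set)
    for index, term in enumerate(terms):
        for var in term:
            membership[var].add(index)

    # Variables with identical membership sets form one sibling class.
    classes = defaultdict(set)
    for var, term_indices in membership.items():
        classes[frozenset(term_indices)].add(var)
    return dict(classes)

if __name__ == "__main__":
    # AND of three variables: a single term, so one sibling class.
    print(sibling_classes([{"a", "b", "c"}]))
    # OR of three variables: three one-variable terms, three classes.
    print(sibling_classes([{"a"}, {"b"}, {"c"}]))
```

Because each class is keyed by a non-empty set of term indices, a formula with k terms can produce at most 2^k − 1 keys.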
The proof of Theorem 20 in [8] can be adapted to monotone DNF formulas: every step of the proof carries over under the new definitions of siblings and of the ratio of a variable.
Theorem 6.1. For any monotone DNF formula, there exists an optimal testing strategy that satisfies conditions (a) and (b) stated above.
In other words, there exists an optimal strategy such that on any root-to-leaf path, sibling tests appear in non-decreasing order of their ratios. Further, for this strategy, sibling tests with the same ratio (that is, tests in the same R-class) appear consecutively on any root-to-leaf path. By a duality argument, a similar result holds for monotone CNF formulas by defining the ratio of a variable as and a sibling class as a set of variables that appear in exactly the same set of clauses. For a $k$-term monotone DNF formula there are at most $2^k - 1$ sibling classes, since each sibling class corresponds to a non-empty subset of the terms of the formula. Next, we provide a dynamic-programming method to find an optimal strategy.
Theorem 6.2. The evaluation problem for monotone $k$-term DNF formulas, over a product distribution on the input and with arbitrary costs, can be solved exactly in polynomial time for constant $k$.
We will use a dynamic programming method similar to that used in  for building decision trees for functions defined by truth tables. We use notation consistent with that paper.
Let be the function that is defined by We will construct a table indexed by partial assignments to the sibling classes. By Theorem 6.1, there is an optimal evaluation order of the variables within each sibling class. Let index the sibling classes in arbitrary order. For each sibling class , let us rename the variables contained in it , where refers to the position of the variable in the testing order according to their ratios , and where refers to the number of variables within the class . Hence, for each class we will have states in : not evaluated, variable evaluated to 1, variable evaluated to 1, …variable evaluated to 1, any variable evaluated to 0. (Due to monotonicity, the evaluation of any variable to 0 ends the evaluation of the entire class.) Given the optimal ordering, the knowledge of which variable of a sibling class was evaluated last is sufficient to determine which variable should be evaluated next within that class. Given a partial assignment that is being evaluated under an optimal testing strategy, let denote the variable that will be evaluated next for each class (that is, that the values of variables for all have already been revealed).
At each position in the table, we will place the decision tree with the minimum expected cost that computes the function , where is the function defined by projected over the partial assignment . Then, once the full table has been constructed, (the value for the empty assignment) will provide the minimum cost decision tree for .
For any Boolean function , let denote the size of the minimum cost decision tree consistent with . For any partial assignment and any variable not assigned a value in , let denote the partial assignment created by assigning the value to to extend . Let denote the cost of evaluating , let denote the probability that , and let denote the probability that .
We can construct the table using dynamic programming by following these rules:
1. For any complete assignment , the minimum size decision tree has a cost of 0, since no variables need to be evaluated to determine its value. Hence, the value .
2. For any partial assignment such that there exists a variable that has not yet been evaluated and , then = and the entry .
3. For any partial assignment that does not meet conditions 1 or 2, then
Then we can fill in the entry for by finding the next variable that has the minimum cost testing strategy, placing it at the root of a tree and creating left and right subtrees accordingly.
Since there are sibling classes and each can have at most variables, we can construct in time . Since , the dynamic program will run in time .
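The recurrence behind such a table can be sketched on small instances. The code below is a simplified brute-force version, not the table from the proof: it memoizes over all partial assignments to individual variables rather than over states of sibling classes, so it runs in time exponential in the number of variables. It assumes that `p[i]` is the probability that variable `i` equals 1 and that a determined function costs 0 to evaluate; the function name and input encoding are hypothetical.

```python
from functools import lru_cache

def min_expected_cost(terms, cost, p):
    """Minimum expected evaluation cost of a monotone DNF (brute force).

    terms: list of sets of variable indices (one set per term).
    cost:  cost[i] is the cost of testing variable i.
    p:     p[i] is the assumed probability that variable i equals 1.
    """
    n = len(cost)
    terms = [frozenset(t) for t in terms]

    def value(assign):
        # Returns 1 or 0 if the partial assignment fixes f, else None.
        undecided = False
        for term in terms:
            if all(assign[i] == 1 for i in term):
                return 1          # a fully satisfied term forces f = 1
            if not any(assign[i] == 0 for i in term):
                undecided = True  # this term can still become true
        return None if undecided else 0

    @lru_cache(maxsize=None)
    def best(assign):
        if value(assign) is not None:
            return 0.0            # nothing left to test
        options = []
        for i in range(n):
            if assign[i] is None:
                on = assign[:i] + (1,) + assign[i + 1:]
                off = assign[:i] + (0,) + assign[i + 1:]
                # Pay for the test, then recurse on both outcomes.
                options.append(cost[i] + p[i] * best(on)
                               + (1 - p[i]) * best(off))
        return min(options)

    return best((None,) * n)

if __name__ == "__main__":
    # f = x0 x1 OR x0 x2, unit costs, uniform distribution.
    print(min_expected_cost([{0, 1}, {0, 2}], [1, 1, 1], [0.5, 0.5, 0.5]))
```

Restricting the choice at each state to the next variable of each sibling class, as in the proof, is what reduces the number of states from exponential in the number of variables to polynomial for constant k.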
The evaluation problem for monotone -term DNF, restricted to the uniform distribution on input and unit costs, can be solved exactly in polynomial time for .
Under the uniformity assumption the ratios are the same for all variables. Hence, each sibling class will be evaluated as a single block and tested in an arbitrary order until either a variable evaluates to 0 or a term evaluates to 1, or until the sibling class is exhausted. Since we will evaluate each sibling class together, we can view each class as a single variable. Then we have a -term DNF defined over variables. Let be the set of the new variables. For each , let denote the number of “real” variables in .
We can then find the optimal strategy using a dynamic programming method as before. The first two rules are as in the previous program. We will modify the third rule as follows:
For any partial assignment that does not meet the first two conditions, then
which follows directly from the unit costs and uniform probabilities.
The size of the table will be only ; hence we can determine the optimal testing strategy over the sibling classes in time .
7 Expected certificate cost and optimal expected evaluation cost
Some of the approximation bounds discussed in this paper are with respect to the optimal expected cost of an evaluation strategy, while others are in terms of the expected certificate cost of the function, which lower bounds the former quantity. It has been previously observed in  that for arbitrary probabilities, there can be a gap of between the two measures. In what follows, we prove that even in the unit-cost, uniform distribution case, the ratio between these two can be extremely large: for any constant where . (Note that in the unit-cost case, both measures are at most .) We also show near-optimality of the CDNF approximation bound of Kaplan et al. and give a gap between two complexity measures related to decision trees for Boolean functions.
Theorem 7.1. Let be a constant such that . Let be a read-once DNF formula on variables where each term is of length , and every variable appears in exactly one term. Then , and .
Let designate the base 2 log. Let and . So is a read-once -term -DNF formula.
We begin by showing a lower bound on . The probability that a term is equal to 1 is . An optimal strategy for evaluating read-once DNF formulas is known. It works by evaluating each term in decreasing order of the term’s optimal expected evaluation cost, until either a term evaluates to 1, or all terms have evaluated to 0 [12, 8]. Since we are considering the unit-cost, uniform-distribution case, all terms are symmetric and can be evaluated in arbitrary order. The probability that this strategy for evaluating evaluates at least terms, and all evaluate to 0, is
We first show that for . We use the standard inequality that says that for all , From this inequality, it follows that
It is easy to show using simple algebra that
Raising both sides to the power yields the desired result that for .
Since the probability that this optimal strategy evaluates at least terms is at least , and each evaluation costs at least 1, it follows that .
We now upper bound . By definition,
Again using the inequality , we get that . Since approaches 0 as approaches infinity, it is less than 1 for large enough . It follows that for sufficiently large , and since , .
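The gap between the two measures can be checked numerically on small read-once DNF formulas. The sketch below assumes unit costs and the uniform distribution, with `m` disjoint terms of `k` variables each (the names `m` and `k` are parameters of this illustration only). It uses two facts that follow from the definitions: the term-by-term strategy tests a term only if all earlier terms were 0, and a minimum certificate has size `k` when the formula is 1 (one satisfied term) and size `m` when it is 0 (one falsified variable per term). The function `brute_force` is included only to cross-check the closed forms on tiny instances.

```python
from itertools import product

def expected_eval_cost(m, k):
    """Expected number of tests of the term-by-term strategy."""
    # Within a term, test j+1 is performed iff the first j bits were 1.
    term_cost = sum(0.5 ** j for j in range(k))
    p_term_zero = 1.0 - 0.5 ** k
    # Term t+1 is evaluated iff the first t terms all evaluated to 0.
    return term_cost * sum(p_term_zero ** t for t in range(m))

def expected_certificate_cost(m, k):
    """Expected size of a minimum certificate."""
    p_f_zero = (1.0 - 0.5 ** k) ** m
    return k * (1.0 - p_f_zero) + m * p_f_zero

def brute_force(m, k):
    """Average both quantities over all 2^(m*k) assignments."""
    total_tests = total_cert = 0
    for bits in product((0, 1), repeat=m * k):
        tests, satisfied = 0, False
        for t in range(m):
            for b in bits[t * k:(t + 1) * k]:
                tests += 1
                if b == 0:
                    break          # term is 0, move to the next term
            else:
                satisfied = True   # all k bits were 1, so f = 1
                break
        total_tests += tests
        total_cert += k if satisfied else m
    count = 2 ** (m * k)
    return total_tests / count, total_cert / count

if __name__ == "__main__":
    print(expected_eval_cost(3, 2), expected_certificate_cost(3, 2))
    print(brute_force(3, 2))
    # A larger instance: 64 terms of length 4 (256 variables).
    print(expected_eval_cost(64, 4), expected_certificate_cost(64, 4))
```

As the number of terms grows while each term stays short, the first quantity grows roughly like the reciprocal of the probability that a term is 1, while the second stays close to the term length.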
For any constant , where , there is a constant where . For large enough , . We thus have the following corollary.
There exists a Boolean function such that
for any constant , where ,
Theorem 7.1 and the above corollary can also be interpreted as results on average-case analogues of depth-complexity and certificate-complexity. The depth complexity of a Boolean function is the minimum, over all decision trees for , of the depth of that tree. Note that the depth of the tree is the worst-case (i.e., maximum), over all assignments to the variables of that function, of the number of tests (decisions) induced by the tree on assignment . The certificate complexity of a Boolean function is the worst-case (i.e., maximum), over all input assignments , of the smallest 0-certificate or 1-certificate of that is contained in . The average depth-complexity and average certificate-complexity of a Boolean function can be defined analogously, with worst-case replaced by average case. Thus the average depth-complexity is equal to , and the average certificate-complexity is equal to .
We can also use Theorem 7.1 to show near-optimality of the approximation bound achieved by Kaplan et al. for monotone CDNF evaluation, with respect to , the expected certificate cost under unit costs and the uniform distribution. The function computed by the formula in Theorem 7.1 has a CNF formula with clauses. Thus in this case is , which is . The strategy computed by any approximation algorithm for this problem cannot do better than the optimal strategy, so its expected cost must be at least times larger than . It follows that the approximation bound of Kaplan et al. has a matching lower bound of (for ), with respect to the expected certificate cost.
We do not know, however, whether it is possible for a polynomial-time algorithm to achieve an approximation factor much better than with respect to the expected cost of the optimal strategy, . We have no non-trivial lower bound for the approximation algorithm in this case; clearly such a lower bound would have to depend on complexity theoretic assumptions.
Acknowledgments
Sarah R. Allen was partially supported by an NSF Graduate Research Fellowship under Grant 0946825 and by NSF grant CCF-1116594. Lisa Hellerstein was partially supported by NSF Grants 1217968 and 0917153. Devorah Kletenik was partially supported by NSF Grant 0917153. Tonguç Ünlüyurt was partially supported by TUBITAK 2219 programme. Part of this research was performed while Tonguç Ünlüyurt was visiting faculty at Polytechnic Institute of NYU and Sarah Allen was a student there.
References
[1] H. Buhrman and R. de Wolf. Complexity measures and decision tree complexity: a survey. Theoretical Computer Science, 288(1):21–43, 2002.
[2] V. Chvátal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
[3] L. Cox, Y. Qiu, and W. Kuehner. Heuristic least-cost computation of discrete classification functions with uncertain argument values. Annals of Operations Research, 21:1–29, 1989.
[4] A. Deshpande and L. Hellerstein. Flow algorithms for parallel query optimization. In ICDE, 2008.
[5] A. Deshpande, L. Hellerstein, and D. Kletenik. Approximation algorithms for stochastic boolean function evaluation and stochastic submodular set cover. 2013. http://arxiv.org/abs/1303.0726.
[6] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45:314–318, 1998.
[7] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. JAIR, 2011.
[8] R. Greiner, R. Hayward, M. Jankowska, and M. Molloy. Finding optimal satisficing strategies for and-or trees. Artif. Intell., 170(1):19–58, 2006.
[9] D. Guijarro, V. Lavín, and V. Raghavan. Exact learning when irrelevant variables abound. In EuroCOLT, 1999.
[10] D. S. Hochbaum. Approximation algorithms for the set covering and vertex cover problems. SIAM J. Comput., 11(3):555–556, 1982.
[11] T. Ibaraki and T. Kameda. On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst., 9(3):482–502, 1984.
[12] H. Kaplan, E. Kushilevitz, and Y. Mansour. Learning with attribute costs. In STOC, pages 356–365, 2005.
[13] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. In VLDB, 1986.
[14] U. Srivastava, K. Munagala, J. Widom, and R. Motwani. Query optimization over web services. In VLDB, 2006.
[15] T. Ünlüyurt. Sequential testing of complex systems: a review. Discrete Applied Mathematics, 142(1-3):189–205, 2004.