# From Exact Learning to Computing Boolean Functions and Back Again

###### Abstract

The goal of the paper is to relate complexity measures associated with the evaluation of Boolean functions (certificate complexity, decision tree complexity) and learning dimensions used to characterize exact learning (teaching dimension, extended teaching dimension). The high level motivation is to discover non-trivial relations between exact learning of an unknown concept and testing whether an unknown concept is part of a concept class or not. Concretely, the goal is to provide lower and upper bounds of complexity measures for one problem type in terms of the other.

## 1 Introduction

The problem of learning a function from a concept class should be connected, or easy to relate, to the problem of deciding whether the function is in the function class or not. Imagine that one searches for an element in an ordered set: the time it takes to find the element in the worst case scales with the logarithm of the size of the set. The same is true (in the worst case) for testing whether the element is in the set or not. Thus, in exact learning, a hypothesis space can be viewed as playing the role of a (partially) ordered set and a target function as playing the role of the element that is searched for or tested for membership.

Our main goal is to discuss relations between learning and computing Boolean functions in a setting where a friendly ’teacher’ provides the shortest proofs to exactly identify a function in a class or to evaluate it (we will use the terms ’evaluate’ and ’compute’ interchangeably in the paper). On the learning side this protocol is known as exact learning with a teacher [6] while on the computational side it is known as the non-deterministic decision tree model [5]. We will focus both on the worst case versions of the complexity measures for exact learning and computation of Boolean functions and their average case counterparts ([10], [11]) as it has been observed that the worst case complexity measures are sometimes unreasonably large even for simple concept classes.

A natural way to interpret the non-deterministic decision tree model and the protocol for learning with a teacher is as best case (but non-trivial) scenarios for evaluation and learning. We are interested in these protocols as any hardness results in such settings establish a natural limit for any other evaluation or exact learning protocol. That being said, we will also investigate the aforementioned relations in a setting where the agent has the power to do queries. In this context, we will briefly study the relations between the decision tree complexity of Boolean functions (on the evaluation side) and the query complexity (usually measured using a combinatorial measure called extended teaching dimension [7]) of exact learning with membership queries [1].

Motivation. Our motivation is two-fold. From a purely theoretical perspective we think it is interesting to formally study relations between combinatorial measures like teaching dimension and certificate / decision tree complexity as such relations could provide useful tools for proving lower and upper bounds in learning theory.

From a more applied perspective, the motivation is very similar to the one connecting learning and property testing (viewed as a relaxation of the learning problem [14]). The intuition is that evaluating whether a particular concept is part of a concept class or is ’far’ from being in the concept class should be an easier problem than learning the concept accurately. Thus, if multiple hypothesis classes are candidates for being parent classes for the target concept (like in agnostic learning), it might be worth running a testing algorithm before actually learning the concept to determine which function class to use as a hypothesis space or, alternatively, which function classes to eliminate from consideration.

Related Work. On the learning side, since the introduction of the teaching protocol and its associated notion of teaching dimension [6], there have been several papers that describe bounds for this complexity measure for various concept classes: (monotone) monomials, (monotone) DNFs, geometrical concepts, juntas, linear threshold functions ([2], [11]). One of the early observations was that sometimes, even for simple concept classes, the worst case teaching dimension was trivially large, which contradicts the intuition that ’teaching’ should be relatively ’easy’ for naturally occurring concept classes. There are several relatively recent attempts ([3], [16]) to change the model so as to better capture this intuition by allowing the learner or the teacher to assume more about each other.

Another perspective on better capturing the overall difficulty of learning a concept class in the teaching model is to consider the average case version of the teaching dimension. The general case for a function class of size was solved by [10] who proved that samples are enough to learn any function class, while there exists a function class for which samples are necessary. For particular concept classes, somewhat surprisingly, the average case bounds are actually much smaller: some hypothesis classes (DNFs [11], LTFs [2]) have bounds on the average teaching dimension that scale with while others (juntas [11]) are even independent of .

One intuitive reason for these gaps is that the general case upper bound is actually uninformative for large concept classes (when is large——, a better upper bound is the trivial that shows the learner all instances), whereas the proofs for particular concept classes actually take advantage of the specific structure of a class to derive meaningful upper bounds.

On the computational side, the (non-)deterministic decision tree model is relatively well understood (see [5] for an excellent survey on the topic). Complexity measures like certificate complexity, sensitivity, block sensitivity, decision tree complexity are used to quantify the difficulty of evaluating a Boolean function when access to the inputs is provided either by an ’all-knowing teacher’ (the non-deterministic decision tree model) or via a query oracle (the deterministic decision tree model). While most of the results deal with the worst case versions of the aforementioned complexity measures, bounds for some of their average case versions have appeared in the literature. Among them, we mention the results from [4], which address the gap between average block sensitivity and average sensitivity of a Boolean function, a well known open problem for the worst case versions of the complexity measures.

Contributions. The first result (Section 3) is that the teaching dimension and the certificate that a function is part of a hypothesis class (i.e. -certificate complexity) play a dual role: when a class is ’easy’ to teach it is ’hard’ to certify its membership and vice-versa. The second contribution (Section 4) is to give lower bounds for the general case of the average non-membership certificate size. The results have several applications to learning and computing Boolean functions. Finally, we will describe structural properties of Boolean functions that point to connections between learning and computation in a setting that relates the (easier) teaching model with the (harder) query model (Section 5).

## 2 Setting and Notation

Let with a class of Boolean functions and let be its complement (we denote ). itself can be seen as a Boolean function, , iff . We will usually consider to be and we will label elements as instances or examples and as the Boolean variables that describe . In what follows, it is assumed that both the nature and the agent know , with nature choosing in an adversarial manner while the agent is not aware of the identity of . We will first describe the learning problem, then the computation problem and then discuss how they are related.

Learning with a Teacher. On the learning side, we will focus on exact learning with a teacher in the loop. In this protocol, nature chooses and the learner knows is in but is not aware of its identity. The learner receives samples (pairs of with ) from a ’teacher’, without knowing whether the teacher is well-intentioned or not. The goal of the learner is to uniquely identify the hidden function using as few samples as possible. The teacher is an optimal algorithm, aware of the identity of , that gives the learner the most informative set of instances so that the learner uniquely identifies the target concept as fast as possible. The teacher is not allowed to make any assumptions about the learning algorithm, other than assuming it is consistent (i.e. that it maintains a hypothesis space consistent with the set of revealed samples).

In this protocol, learning stops when the consistent hypothesis space of the learner has size 1 and thus only contains the target hypothesis. For the purpose of this paper, the learner and the teacher are assumed to have unbounded computational power to compute updates to the hypothesis space and optimal sample sets (computational issues are treated in [15] and [6]). One intuitive perspective for this learning protocol is that it is the best case scenario of exact learning with membership queries [1], where the learner always guesses the best possible queries to find the target hypothesis. In the model of exact learning with membership queries, the teacher is removed from the protocol, and the learner is responsible for deciding which inputs to query for labels with the same goal of minimizing the number of samples until the target concept is discovered.

We will now define a complexity measure (the teaching dimension) for learning a fixed function in the protocol of learning with a teacher.

###### Definition 1

For a fixed , a minimum size teaching set is a set of samples that uniquely identifies among all functions in with a size that is minimal among all possible teaching sets for . The teaching dimension of (with respect to ) is . The teaching dimension of is .

Intuitively, a teaching set is a shortest ’proof’ that certifies the identity of the initially hidden target concept. The teaching dimension is simply the maximum size of such a ’proof’ over the entire hypothesis space.

To capture the difficulty of learning a hypothesis class as a whole we will define average , which has some interesting combinatorial properties ([10]).

###### Definition 2

The average teaching dimension of is .
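To make Definitions 1 and 2 concrete, the following brute-force sketch computes the teaching dimension of every concept in a small class, together with the worst case and the average. It assumes that concepts are encoded as 0/1 tuples indexed by the instances and that the average is the uniform mean over the class; the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def teaching_dimension(f, concept_class):
    """Size of a smallest instance set on which f disagrees with every other concept."""
    n = len(f)
    others = [g for g in concept_class if g != f]
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            # subset is a teaching set if every rival g is contradicted somewhere on it
            if all(any(f[x] != g[x] for x in subset) for g in others):
                return size
    raise AssertionError("unreachable: distinct concepts differ on the full instance set")


def worst_and_average_td(concept_class):
    """Worst case and (assumed uniform) average teaching dimension of the class."""
    sizes = [teaching_dimension(f, concept_class) for f in concept_class]
    return max(sizes), sum(sizes) / len(sizes)


n = 4
# Singletons: concept i labels only instance i with 1.
singletons = [tuple(int(i == j) for j in range(n)) for i in range(n)]
# Singletons together with the all-0 concept.
with_empty = singletons + [(0,) * n]

print(worst_and_average_td(singletons))
print(worst_and_average_td(with_empty))
```

On four instances, adding the all-0 concept raises the worst case from 1 to 4 while the average stays below 2, which is the kind of gap between worst case and average case that motivates Definition 2.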

Computation in the Decision Tree Model. On the evaluation side, we will focus on ’proofs’ that certify what is the value of a Boolean function on an unknown input (with usually ). We will thus focus on certificate complexity, which quantifies the difficulty of computing a Boolean function in the non-deterministic decision tree computation model.

Let’s assume is fixed and known to both the nature and the agent. The protocol of interaction is as follows: nature chooses an input ( with or ) without revealing it to the agent, and offers query access to the bits that define . For any query , it reveals the correct bit value of the previously unknown bit in . Now we can define certificate complexity for a fixed input and for a function:

###### Definition 3

For a fixed function and a fixed and unknown input with , a minimal -certificate of on is a minimal size query set that fixes the value of on to . The -certificate complexity is the size of such a minimal query set.

###### Definition 4

The -certificate complexity of a Boolean function is

, where . Symmetrically, . And then .

An intuitive way to interpret certificate complexity is that it quantifies what is the minimal number of examples (pairs value of on a friendly ’teacher’ (that knows ) must reveal to certify to an agent what is the value of on .

While there is no previous definition for the notion of average certificate complexity in the literature, it is natural to define it in a similar manner (and for similar reasons) as for the average teaching dimension:

###### Definition 5

The average -certificate complexity of a Boolean function is . We can symmetrically define and .
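As an illustration of Definitions 3 to 5, the sketch below computes certificate sizes by exhaustive search for a Boolean function on a few bits. It assumes the function is given as a Python callable returning 0 or 1 and that the average is taken uniformly over the inputs mapped to the given value; both the helper names and this averaging convention are assumptions made for the example.

```python
from itertools import combinations, product


def certificate_size(f, x):
    """Fewest bits of x that, once revealed, force every consistent input to the value f(x)."""
    n = len(x)
    target = f(x)
    for size in range(n + 1):
        for bits in combinations(range(n), size):
            consistent = (y for y in product((0, 1), repeat=n)
                          if all(y[i] == x[i] for i in bits))
            if all(f(y) == target for y in consistent):
                return size


def certificate_summary(f, n):
    """Map each output value b to (worst case, average) certificate size over f^{-1}(b)."""
    sizes = {0: [], 1: []}
    for x in product((0, 1), repeat=n):
        sizes[f(x)].append(certificate_size(f, x))
    return {b: (max(s), sum(s) / len(s)) for b, s in sizes.items() if s}


def or3(x):
    """OR of three bits, used only as a toy example."""
    return int(any(x))


summary = certificate_summary(or3, 3)
print(summary)
```

For OR, a single revealed 1 certifies the value 1, whereas certifying the value 0 requires all three bits.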

We will now define block sensitivity, another well studied complexity measure for computing Boolean functions, as we will need it later in the paper.

###### Definition 6

A Boolean function is sensitive to a set on if , where is the input with bits in flipped to the opposite values. Then the block sensitivity of on , is the size of the largest set of disjoint sets with the property that is sensitive to each set on . Also, the block sensitivity of is the maximum block sensitivity over all inputs : .

The definition of average block sensitivity is natural and follows similarly to the definition of average teaching dimension and average certificate complexity. It is worth noting though that Definition 7 is the same as that introduced in [4] (as other notions of average block sensitivity have been studied).

###### Definition 7

The average block sensitivity of a Boolean function is .
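Block sensitivity can also be computed by brute force for small functions, which may help in reading Definitions 6 and 7. The sketch below encodes blocks as bitmasks and searches for the largest family of pairwise disjoint sensitive blocks; taking the average uniformly over all inputs is an assumption of the example, and the function names are hypothetical.

```python
from itertools import product


def block_sensitivity_at(f, x):
    """Largest number of pairwise disjoint blocks whose flip changes f at x."""
    n = len(x)
    base = f(x)

    def flipped(mask):
        return tuple(bit ^ ((mask >> i) & 1) for i, bit in enumerate(x))

    # Every non-empty block (as a bitmask) to which f is sensitive on x.
    sensitive = [m for m in range(1, 1 << n) if f(flipped(m)) != base]

    def most_disjoint(start, used):
        best = 0
        for k in range(start, len(sensitive)):
            if sensitive[k] & used == 0:
                best = max(best, 1 + most_disjoint(k + 1, used | sensitive[k]))
        return best

    return most_disjoint(0, 0)


def block_sensitivity_summary(f, n):
    """Worst case and (assumed uniform) average block sensitivity over all inputs."""
    values = [block_sensitivity_at(f, x) for x in product((0, 1), repeat=n)]
    return max(values), sum(values) / len(values)


def or3(x):
    """OR of three bits, used only as a toy example."""
    return int(any(x))


print(block_sensitivity_summary(or3, 3))
```

The search is exponential in the number of bits and is meant only for toy sizes.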

### 2.1 Connecting Learning and Computation

If in section 2 we set and we re-label as , we can interpret as Boolean functions (with each being a truth table and thus a complete description of ). Thus is a complete description of a hypothesis class with iff . The interaction protocol for both learning and evaluation proceeds in the same manner: at each step, a teacher reveals the value of an unknown function on an input instance from . This is the sense in which we connect exact learning with a teacher and evaluation of Boolean functions in the non-deterministic decision tree computation model.

To gain more intuition, if one imagines the function class as a matrix with the rows being all elements in and the columns being the inputs in , then, fixing the interaction protocol to contain an optimal teacher aware of the identity of a hidden row, learning is about identifying the hidden row among the subset of rows that determine to be , while evaluation is about determining whether a hidden row is part of a chosen subset of rows (that define ) or not.

Another intuitive perspective is through the lens of hypergraphs: fixing a vertex set, learning is about identifying a hidden edge from a set of edges that form a hypergraph (the function class ) while evaluation is about determining whether a given subset of vertices is an edge in the hypergraph or not.

### 2.2 Simple Examples

In this section we will describe bounds for several simple concept classes with the goal of building intuition and exhibiting extreme values for , and .

Powerset. This class is a trivial example for which (all functions defined on are part of the concept class). It is easy to see that since to locate a particular concept one needs to query all examples (otherwise there will be at least two concepts that are identical on all previous instances). And since is constant.

Singletons. , with iff (i.e. the all-0 vector with the -th coordinate flipped to ). Then because it is enough to show the -bit of the target function to uniquely identify it among all . Nevertheless certifying that an is (or isn’t) in is hard in the worst case. If nature chooses the all- function as a target, certifying that is not part of will require seeing all inputs as, at any intermediate time, there will be at least a function in consistent with . So . Similarly .

Singletons with empty set. . For this function class, teaching becomes hard, as teaching requires seeing all examples to differentiate it from the other functions. So . Teaching the other functions is easy though, as showing the -bit is enough to certify what the function is. So . The -certificate is small as any function not in is evaluated to for at least two examples, so showing these two examples is enough to certify that the function is not in and thus . The -certificate is large though since for any , at least examples must be shown, so .

The dictator function. , with iff for some fixed (half of the Boolean functions defined on are in ). The learning problem is hard () as it reduces to learning a Powerset class on bits. However, , since membership to can be decided if the value of the function on is revealed.

## 3 Teaching and Certifying Membership

In this section we will study connections between teaching a function in a hypothesis class (conditioned on knowing that the function is indeed in the class) and proving that the function is part of the class (with no prior knowledge other than the knowledge of ). We will begin with a simple fact that is meant to illustrate the improvement in the next subsection:

###### Fact 3.1

For any fixed instance space , any function class and any , .

### 3.1 A Lower Bound Technique

The upper bound from Fact 3.1 is almost tight in the worst case—the example of Singletons with empty set from section 2.2 shows that the sum of the two quantities can be . The lower bound, on the other hand, is very loose as the next theorem shows:

###### Theorem 3.2

For any fixed instance space , any function class and any , .

###### Proof

Let’s assume nature chooses as the target hypothesis but does not reveal to the agent whether or . Let be the minimal teaching set for and be the smallest -certificate for .

Let’s assume that the goal of the teacher is to reveal the identity of to the learner using samples . But, since in this protocol the learner is not aware of whether is evaluated to or by , it has to be the case that to uniquely identify it, it must see the value of the function on all (otherwise there will always be another function consistent with the examples seen so far that is evaluated to the opposite value by ). This argument is equivalent to learning a function in the case of the Powerset function class from Section 2.2.

Now let’s describe an alternative strategy for the teacher that has the same effect of uniquely identifying . In the first epoch, the teacher reveals all the samples from . This will ’certify’ to any consistent learner that (since no such that is consistent with ). We are now in the standard exact learning setting where the agent knows but does not know its identity. In the second epoch, the teacher will reveal all the samples in to the learner—since there is no point in presenting elements from their (possibly non-empty) intersection twice. This strategy will uniquely identify without any prior knowledge about its membership to .

But we know that to uniquely identify we need exactly samples. So it has to be the case that .∎

Since the above relation holds for any function it must hold for the average and worst case values of the complexity measures:

###### Corollary 1

For any , and .
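Theorem 3.2 can be sanity-checked by exhaustive search on small classes. The sketch below reads the statement, following the union argument in the proof, as: for every concept in the class, its teaching dimension plus the size of its smallest membership certificate is at least the number of instances. This reading, the encoding of concepts as 0/1 tuples, and all function names are assumptions made for the example.

```python
from itertools import combinations, product


def smallest_set(m, is_good):
    """Size of a smallest subset of range(m) accepted by is_good (the full set must be accepted)."""
    for size in range(m + 1):
        for subset in combinations(range(m), size):
            if is_good(subset):
                return size


def teaching_dimension(f, concepts):
    others = [g for g in concepts if g != f]
    return smallest_set(len(f), lambda s: all(any(f[x] != g[x] for x in s) for g in others))


def membership_certificate(f, concepts):
    """1-certificate size: every truth table agreeing with f on the set lies in the class."""
    m = len(f)
    members = set(concepts)
    tables = list(product((0, 1), repeat=m))
    return smallest_set(m, lambda s: all(t in members for t in tables
                                         if all(t[x] == f[x] for x in s)))


def sums(concepts):
    """Teaching dimension plus membership certificate size, for each concept in the class."""
    return [teaching_dimension(f, concepts) + membership_certificate(f, concepts)
            for f in concepts]


m = 4
singletons = [tuple(int(i == j) for j in range(m)) for i in range(m)]
with_empty = singletons + [(0,) * m]
powerset = list(product((0, 1), repeat=m))

for name, concepts in [("singletons", singletons),
                       ("with_empty", with_empty),
                       ("powerset", powerset)]:
    print(name, min(sums(concepts)), ">=", m)
```

On these toy classes the bound is met with equality by the Powerset class and by the singleton concepts of ’Singletons with empty set’.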

### 3.2 Certifying Membership is (Usually) Hard

In this section we will present a first application of Theorem 3.2. We will show that, for some of the standard concept classes in learning theory, certifying membership is hard, meaning all input variables need to be queried to determine whether an unknown function is part of the function class.

| Concept class | TD | aTD |
|---|---|---|
| Monotone Monomials | [6] | [11] |
| Monomials | [6] | [11] |
| Monotone k-term DNF | [11] | [11] |
| k-term DNF | [11] | |
| LTF | [2] | [2] |
| k-Juntas | [11] | [11] |

In Table 1 we present known lower and upper bounds for and for a few hypothesis classes encountered in learning theory. It is important to note that scales at most logarithmically with the size of the input space (constant for -juntas and logarithmic for (monotone) conjunctions, (monotone) DNFs, LTFs). Thus, from Corollary 1, for such concept classes with , it follows that: .

### 3.3 Sparse Boolean Functions are Hard to Compute

In this section we will present another application of Theorem 3.2. We will show that ’sparse’ Boolean functions have large certificate complexity. By sparse we mean Boolean functions with a ’small’ (roughly the size of ) Hamming weight (number of ’s) in the output truth table. The following theorem makes this precise:

###### Theorem 3.3

For any sets and and any Boolean function , , let . If , then .

###### Proof

It is worth mentioning that while the teaching dimension is used in the proof, the result is strictly about the hardness of computing .

## 4 Teaching and Certifying Non-Membership

In this section we will study bounds and relations between learning a function in a class and proving that the function is not in the class. The high-level intuition of why the two problems are similar comes from the similarity with the problem of searching for an element in an unordered / ordered set: in the worst case, searching for an element that is part of the set is as hard as searching for the element if it is not in the set. We will present a set of results that are a first indication that, at least in the average case, the learning problem and the non-membership decision problem are similar.

We will first deal with the worst case for .

###### Theorem 4.1

For any instance space , any function class , with , and any , .

###### Proof

From Bondy’s theorem ([9], Theorem ), we know there exists a set of coordinates of size that, when revealed, uniquely identifies any target function . Let’s choose .

Let’s first assume that is consistent with on . If we reveal the labels for all the coordinates in , will be the only function consistent with . It must then be the case that revealing just another coordinate will lead to a certificate of size for . If, on the other hand, for any , is inconsistent with on , revealing the labels of all the coordinates in will lead to an upper bound for the size of any certificate that .∎
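The combinatorial fact used in the proof, in its standard form, says that any class of distinct concepts admits a set of at most one fewer coordinates than the number of concepts on which all concepts remain pairwise distinct. The sketch below searches for a smallest such set by brute force. Concepts are assumed to be 0/1 tuples and the function name is hypothetical.

```python
from itertools import combinations


def smallest_distinguishing_set(concepts):
    """Smallest coordinate set on which all concepts are pairwise distinct."""
    m = len(concepts[0])
    for size in range(m + 1):
        for coords in combinations(range(m), size):
            restricted = {tuple(f[x] for x in coords) for f in concepts}
            if len(restricted) == len(concepts):
                return coords


m = 4
singletons = [tuple(int(i == j) for j in range(m)) for i in range(m)]
with_empty = singletons + [(0,) * m]

for concepts in (singletons, with_empty):
    coords = smallest_distinguishing_set(concepts)
    print(len(concepts), "concepts are separated by", len(coords), "coordinates")
```

Both toy classes need exactly one coordinate fewer than the number of concepts, so the bound cannot be improved in general.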

The following corollary follows immediately:

###### Corollary 2

For any function class , with , .

The bound on is tight, as the example of Singletons from Section 2.2 demonstrates, which thus settles the worst case bounds for .

### 4.1 Bounds for

We will begin by proving a lower bound for . The result is a simple application of a theorem from [4] that puts a lower bound on the average block sensitivity of a Boolean function.

###### Theorem 4.2

For any set , there exists a function such that .

###### Proof

We will choose to be the Rubinstein function (more details in the proof for Theorem 4.3) and apply the result from Proposition in [4] that . But, for any , since a certificate for an input must contain at least a bit from each sensitive block otherwise the value of the function can be flipped by an adversary (see [5], Proposition ). And so the relation must hold for the average case as well since .∎

Now we will prove that the same property holds even if we restrict our attention to which requires more work.

###### Theorem 4.3

For any set , there exists a function such that .

###### Proof

Let . We will choose again to be the Rubinstein function on and we define it as: the variables are partitioned in pieces of size each, is iff there exists at least one piece of the partition that has consecutive variables equal to and the rest . We will count the number of inputs that are evaluated to , i.e. , with .

If we fix a piece of the partition, there are configurations of the input variables in that piece that lead to being evaluated to . And so configurations don’t ’contribute’ to making to be . But, if in each piece there is a configuration that doesn’t ’contribute’ to , then . There are thus such configurations.

We know from Theorem 4.2 that:

where the second inequality follows by upper bounding all by the maximum possible certificate complexity, i.e. the size of , . It can then be shown that: and and thus .∎

Before we continue, it is interesting to remark the similarity (at a high level) of the result from Theorem 4.3 with Theorem from [10] that describes a lower bound of on . The lower bound tools are very different (Rubinstein function for and projective planes for ), but, since they lead to an identical lower bound for related complexity measures, it would be interesting to see if there are some deep connections between them.

We will now prove a (weak) lower bound tool for the relationship between and .

###### Theorem 4.4

For any , there exists a function class such that .

The theorem states that and can be simultaneously ’large’ (), a statement that is not immediately obvious (at least given the simple concept classes considered in Section 2.2).

###### Proof

Let’s consider (the complement of the ’Singletons with empty set’ concept class). Then . Also . So and thus there exists no such that the sum is smaller than . ∎

## 5 Connection with Membership Query Learning

For the purpose of this section we will focus on learning and computing with queries (the membership query learning and deterministic decision tree computation models) as this perspective will allow us to get more intuition about the structure of the function .

In a manner similar to the way we have defined teaching dimension for the protocol of exact learning with a teacher, we will define to be the (worst case) optimal learning bound for learning a function class in the exact learning with membership queries protocol. Also, in a similar manner as for the certificate complexity definition in the non-deterministic decision tree model, we define to be the (worst case) optimal complexity of computing a Boolean function in the deterministic decision tree model (for more formal definitions see [1] and [5]).

In [7] Hegedűs introduced a complexity measure for bounding called the Extended Teaching Dimension, which, as the name suggests, is inspired by the definition of the Teaching Dimension. We will define this complexity measure and then describe a result that establishes a connection between and .

###### Definition 8 ([7])

A set is a specifying set (SPS) for an arbitrary concept with respect to the hypothesis class if there is at most one concept in that is consistent with on . Then the Extended Teaching Dimension () of is the minimal integer such that there exists a specifying set of size at most for any concept .
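Definition 8 can be explored on toy classes with the brute-force sketch below. It ranges over all 0/1 labelings of the instances (not only those in the class), finds a smallest specifying set for each, and returns the maximum. The tuple encoding and the function names are assumptions of the example.

```python
from itertools import combinations, product


def specifying_set_size(f, concepts):
    """Smallest set on which at most one concept of the class is consistent with f."""
    m = len(f)
    for size in range(m + 1):
        for subset in combinations(range(m), size):
            consistent = sum(all(g[x] == f[x] for x in subset) for g in concepts)
            if consistent <= 1:
                return size


def extended_teaching_dimension(concepts):
    """Maximum, over every labeling of the instances, of the smallest specifying set size."""
    m = len(concepts[0])
    return max(specifying_set_size(f, concepts) for f in product((0, 1), repeat=m))


m = 4
singletons = [tuple(int(i == j) for j in range(m)) for i in range(m)]
with_empty = singletons + [(0,) * m]

print(extended_teaching_dimension(singletons))
print(extended_teaching_dimension(with_empty))
```

In both toy classes the hardest labeling to specify is the all-0 one, whether or not it belongs to the class.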

###### Theorem 5.1

For any function class , .

###### Proof

A specifying set for any is also a teaching set for , as it uniquely identifies the function among all other functions in . Also, a specifying set for any is ’almost’ a certificate that is not in as it differentiates from all other functions in with the exception of at most one function. Revealing an extra instance is thus sufficient to differentiate from all and thus to obtain a certificate for .

Let for some fixed . Then there must be at least a function that has a minimal specifying set of size exactly . Let’s assume, wlog, that such a function is unique. Let’s first consider the case of . Since , it means all have a teaching set of size . can’t have a teaching set with a size smaller than since such a teaching set would also be a specifying set of size which is not possible given the assumption of uniqueness. And since it means . Now let’s pick an arbitrary . Since is the unique function with a specifying set of size , it means that and thus which thus proves the desired relation for this case.

The second case with is treated similarly. ∎

Theorem 5.1 directly leads to the following corollary (since is a lower bound for ) stating that learning with membership queries is at least as hard as certifying non-membership:

###### Corollary 3

For any function class , .

### 5.1 Is weakly symmetric for natural learning problems?

We will now give a result that follows from Theorem 3.2 and connects learning and computation in the query model:

###### Corollary 4

For any instance space , any fixed function class and any , .

The proof is immediate: the teaching dimension is a lower bound for the optimal membership query bound [6] and the certificate complexity is a lower bound for the decision tree complexity [5], so applying Theorem 3.2 gives the desired relation.

A natural question is how useful is this bound for standard concept classes from learning theory. It is this question that we address in this subsection where we describe an interesting structural property of .

We will begin with a few (informal) definitions (see [12] for a complete reference). In the deterministic decision tree computation model, a Boolean function is labeled evasive if, in the worst case, all input variables need to be queried to determine the value of the function. Several results describe sufficient conditions for large classes of Boolean functions to be evasive. An interesting class of Boolean functions are graph properties. A graph property is a class of graphs (on a fixed number of vertices) that remains unchanged for any permutation of the vertices (graph connectivity for example). The variables for a graph property are the possible edges of a graph.

By construction, a graph property can be encoded as a weakly symmetric Boolean function on the edges. Weakly symmetric Boolean functions are a generalization of symmetric Boolean functions. A Boolean function is weakly symmetric if, for any pair of variables, there exists a permutation of all variables that permutes the variables in the pair, such that the function remains unchanged.

Graph properties are weakly symmetric since all permutations on the vertex set induce a set of permutations of the edge set which leave the function unchanged. A general hardness result (the Rivest-Vuillemin theorem [13]) for computing weakly symmetric Boolean functions (and implicitly graph properties) states that any non-constant weakly symmetric Boolean function defined on a number of variables that is the power of a prime number is evasive.
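For very small numbers of variables, weak symmetry can be tested directly. The sketch below reads the definition as transitivity: for every pair of variable positions there is a permutation of the variables that maps the first position to the second and leaves the function unchanged. This reading is an assumption, and the function names are hypothetical.

```python
from itertools import permutations, product


def invariance_group(f, n):
    """All permutations sigma of the n variables under which f is unchanged on every input."""
    inputs = list(product((0, 1), repeat=n))
    return [sigma for sigma in permutations(range(n))
            if all(f(x) == f(tuple(x[sigma[i]] for i in range(n))) for x in inputs)]


def is_weakly_symmetric(f, n):
    """True if the invariance group moves every variable position to every other one."""
    group = invariance_group(f, n)
    return all(any(sigma[i] == j for sigma in group)
               for i in range(n) for j in range(n))


def cyclic_adjacent_ones(x):
    """1 iff two cyclically adjacent variables are both 1 (a toy example)."""
    n = len(x)
    return int(any(x[i] == 1 and x[(i + 1) % n] == 1 for i in range(n)))


def first_bit(x):
    """A function that depends only on its first variable."""
    return x[0]


print(is_weakly_symmetric(cyclic_adjacent_ones, 4))
print(is_weakly_symmetric(first_bit, 4))
```

The first toy function is invariant under rotations of its variables but not under all permutations, so it is weakly symmetric without being symmetric. The search enumerates all permutations and is therefore only usable for a handful of variables.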

This brings us to the point of connection with the Boolean function as we’ve defined it in Section 2.1. The intuition is that in the same way that permuting vertices doesn’t change a graph property, permuting the input variables for ’s doesn’t change the Boolean function , or in other words the definition of a function class () is invariant to permutations of (the bits of ). Moreover, by construction, has a number of inputs which is a power of a prime ( - all the possible inputs that can be defined on the original input variables) and natural concept classes are not trivial.

So, if and when the intuition that is weakly symmetric is correct, we can actually apply the aforementioned result to show that is evasive and in turn that (and implicitly that ). In such situations the bound from Corollary 4 is not very useful as it puts no constraints on the optimal membership query bound.

Interestingly though, the above intuition is false in general. For example the following theorem shows a natural concept class that leads to a function that is not weakly symmetric.

###### Theorem 5.2

If is the class of monotone monomials of size exactly , (viewed as a Boolean function with input ’bits’ from and inputs from ) is not weakly symmetric. (A Boolean function is representable by a monomial of size exactly if it has a monomial representation of size and no monomial representation for any .)

###### Proof

Let and for any let be the weight of (the number of bits in that are ).

Let’s consider and such that . Then it must be the case that (there exist bits among the bits of that are ). Let be an input in such that . Then since can’t encode a monotone monomial of size .

Now let’s consider an arbitrary permutation that changes with . This means that will induce a Boolean function that will be evaluated to for . But such a function can’t be a monotone monomial of size exactly since and can’t be evaluated to . This means that any permutation that changes and will change the function . So we have found a pair of variables for which no permutation of the other variables (that permutes the two) leaves unchanged. Thus can’t be weakly symmetric. ∎

For such a concept class there is thus hope that a relation like the one from Corollary 4 might be useful. However, there are interesting concept classes that lead to a weakly symmetric function . An example is the class of monomials of size exactly :

###### Theorem 5.3

If is the class of monomials of size exactly k, is weakly symmetric.

###### Proof

Let and the extended set of variables indexed in that contains the variables and their complements: with for and for .

Let’s fix and let be the set of variables that have identical values for and and , the complement of . We will construct a permutation over based on and that will induce a permutation over which will in turn induce a permutation over . will have the property that i.e. the permutation that induces on the set of possible functions leaves unchanged, which is what we need to show.

For any , let and for any , let . In other words any variable on which and agree will remain unchanged, while any variable for which there is disagreement will be negated. From the construction of it follows that and , as desired (where, as above, is the permutation induced by on ).

We will first consider ’s such that , which means is a monomial of size . In the expression of , will either leave a variable unchanged or it will replace it with its negation. But that means that (where as above is the induced permutation over ) will still be a monomial of size exactly (albeit a different one), so .

The second case considers functions such that . Let be the number of terms in the minimal DNF representation of . If then is representable by a monomial and, since , is a -monomial with or . But negating any subset of variables from will not increase or decrease the number of variables in the conjunction (as the variables are uniquely represented in the conjunction and the expression can’t be reduced in any way), so for .

If , let’s assume that has an -term DNF representation for some . But this means that (we apply the induced permutation a second time) will also have an -term DNF representation. But since (as applying two times only doubly negates a subset of the variables), we get a contradiction. So has an -term DNF representation with and can’t be a monomial. Thus . ∎
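The construction in the proof can be checked exhaustively for small parameters. The sketch below is an illustration under assumptions, not the paper's formal argument: concepts are encoded as sets of satisfying points, the permutation associated with a pair of points negates exactly the coordinates on which the two points disagree, and all function names are hypothetical. It verifies that this permutation exchanges the two points and maps the class of monomials of size exactly k onto itself.

```python
from itertools import combinations, product


def points(n):
    """All input points of {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))


def monomials(n, k):
    """Satisfying sets of all conjunctions of k literals over distinct variables."""
    return {frozenset(x for x in points(n)
                      if all(x[i] == s for i, s in zip(idx, signs)))
            for idx in combinations(range(n), k)
            for signs in product((0, 1), repeat=k)}


def flip_map(x, y):
    """Permutation of {0,1}^n negating exactly the bits where x and y differ."""
    mask = tuple(a ^ b for a, b in zip(x, y))
    return lambda z: tuple(a ^ m for a, m in zip(z, mask))


def induced(f, pi, n):
    """Satisfying set of the function z -> f(pi(z))."""
    return frozenset(z for z in points(n) if pi(z) in f)


def swaps_preserve_class(concept_class, n):
    """True if, for every pair of points, the flip map keeps the class fixed."""
    target = set(concept_class)
    for x, y in combinations(points(n), 2):
        pi = flip_map(x, y)
        assert pi(x) == y and pi(y) == x  # the map exchanges the two points
        if {induced(f, pi, n) for f in target} != target:
            return False
    return True


print(swaps_preserve_class(monomials(3, 2), 3))  # prints True
```

Because the induced map on functions is a bijection (applying it twice gives the identity), mapping the class onto itself also means that non-members stay non-members, which corresponds to the second case of the proof. Running the same check on a class of monotone monomials returns False, in line with Theorem 5.2, although that only rules out this particular family of permutations.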

It is easy to extend the proof and show that the class of monomials of size at most also leads to a weakly symmetric function.

## 6 Discussion

As mentioned in the introduction, the combination of evaluation and learning is a characteristic of property testing. A natural question is whether we can design an exact (i.e. non-distributional) property testing protocol that is useful. As we saw in Section 3, in the exact setting we are considering, whenever we are able to positively test for membership, the learning problem is hard, and vice versa. So, in contrast to the commonly used property testing protocol (which is defined with respect to some distribution over the instance space), we can’t expect two-sided property testers (that certify both membership and non-membership) to be combined successfully with exact learners. It is still possible, however, to combine learners with algorithms that certify non-membership, with potential applications to agnostic exact learning.

Regarding other future directions, one natural thing to study is a general upper bound on that only depends on the size of the concept class. Moreover, as mentioned in the text, the lower bounds for and use different tools to obtain a similar result, and these tools are often encountered in proofs for lower bounds, so perhaps understanding their connections would be beneficial in its own right.

Another interesting research direction is to study bounds for and for particular concept classes. Several results exist ([7] and [8]) for but they do not cover all natural concept classes. Another hope is that deriving upper bounds for and would in turn lead to a deeper understanding of the gap between the worst case upper bound for and the upper bounds for particular concept classes.

On another topic, as described in Section 5, interesting connections exist between the membership query learning and deterministic decision tree frameworks. One interesting direction would be to further investigate what other function classes lead to weakly symmetric functions, as both positive and negative answers would potentially help in revealing new connections between learning and evaluation.

### Acknowledgments

The author would like to thank Rocco Servedio, Michael Saks and Chris Mesterharm for their valuable comments and feedback.

## References

- [1] D. Angluin. Queries and concept learning. Mach. Learn., 2:319–342.
- [2] M. Anthony, G. Brightwell, and J. Shawe-Taylor. On specifying boolean functions by labelled examples. Discrete Applied Mathematics, 61(1):1–25, 1995.
- [3] F. J. Balbach. Measuring teachability using variants of the teaching dimension. Theor. Comput. Sci., 397(1-3):94–113, May 2008.
- [4] A. Bernasconi. Sensitivity vs. block sensitivity (an average-case study). Inf. Process. Lett., 59(3):151–157, Aug. 1996.
- [5] H. Buhrman and R. de Wolf. Complexity measures and decision tree complexity: A survey. Theoretical Computer Science, 288(1):21–43, 2002.
- [6] S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:303–314, 1992.
- [7] T. Hegedűs. Generalized teaching dimensions and the query complexity of learning. In COLT ’95, pages 108–117, 1995.
- [8] L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? J. ACM, 43(5):840–862, Sept. 1996.
- [9] S. Jukna. Extremal combinatorics - with applications in computer science. Texts in theoretical computer science. Springer, 2001.
- [10] E. Kushilevitz, N. Linial, Y. Rabinovich, and M. Saks. Witness sets for families of binary vectors. J. Comb. Theory Ser. A, 73:376–380.
- [11] H. K. Lee, R. A. Servedio, and A. Wan. DNF are teachable in the average case. Mach. Learn., 69:79–96.
- [12] L. Lovász and N. E. Young. Lecture notes on evasiveness of graph properties. CoRR, cs.CC/0205031, 2002.
- [13] R. L. Rivest and J. Vuillemin. On recognizing graph properties from adjacency matrices. Theoretical Computer Science, 3(3):371 – 384, 1976.
- [14] D. Ron. Property testing: A learning theory perspective. Found. Trends Mach. Learn., 1(3):307–402, Mar. 2008.
- [15] R. A. Servedio. On the limits of efficient teachability. Inf. Process. Lett., 79, 2001.
- [16] S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning. J. Mach. Learn. Res., 12:349–384, Feb. 2011.