# Learning circuits with few negations

Eric Blais (eric.blais@uwaterloo.ca), University of Waterloo. This work was completed as a Simons Postdoctoral Fellow at the Massachusetts Institute of Technology.

Clément L. Canonne (ccanonne@cs.columbia.edu), Columbia University.

Igor C. Oliveira (oliveira@cs.columbia.edu), Columbia University.

Rocco A. Servedio (rocco@cs.columbia.edu), Columbia University. Supported by NSF grants CCF-1115703 and CCF-1319788.

Li-Yang Tan (liyang@cs.columbia.edu), Columbia University.

###### Abstract

Monotone Boolean functions, and the monotone Boolean circuits that compute them, have been intensively studied in complexity theory. In this paper we study the structure of Boolean functions in terms of the minimum number of negations in any circuit computing them, a complexity measure that interpolates between monotone functions and the class of all functions. We study this generalization of monotonicity from the vantage point of learning theory, giving near-matching upper and lower bounds on the uniform-distribution learnability of circuits in terms of the number of negations they contain. Our upper bounds are based on a new structural characterization of negation-limited circuits that extends a classical result of A. A. Markov. Our lower bounds, which employ Fourier-analytic tools from hardness amplification, give new results even for circuits with no negations (i.e. monotone functions).

## 1 Introduction

A monotone Boolean function $f\colon\{0,1\}^n\to\{0,1\}$ is one that satisfies $f(x)\le f(y)$ whenever $x\preceq y$, where $\preceq$ denotes the bitwise partial order on $\{0,1\}^n$. The structural and combinatorial properties of monotone Boolean functions have been intensively studied for many decades; see e.g. [Kor03] for an in-depth survey. Many famous results in circuit complexity deal with monotone functions, including celebrated lower bounds on monotone circuit size and monotone formula size (see e.g. [RW90, Raz85] and numerous subsequent works).

Monotone functions are also of considerable interest in computational learning theory, in particular with respect to the model of learning under the uniform distribution. In an influential paper, Bshouty and Tamon [BT96] showed that any monotone Boolean function can be learned from uniform random examples to error in time . They also gave a lower bound, showing that no algorithm running in time for any can learn arbitrary monotone functions to accuracy (Many other works in learning theory such as [Ang88, KV94, BBL98, AM02, Ser04, OS07, OW09] deal with learning monotone functions from a range of different perspectives and learning models, but we limit our focus in this paper to learning to high accuracy with respect to the uniform distribution.)

### 1.1 Beyond monotonicity: Inversion complexity, alternations, and Markov’s theorem.

Given the importance of monotone functions in complexity theory and learning theory, it is natural to consider various generalizations of monotonicity. One such generalization arises from the simple observation that monotone Boolean functions are precisely the functions computed by monotone Boolean circuits, i.e. circuits which have only AND and OR gates but no negations. Given this, an obvious generalization of monotonicity is obtained by considering functions computed by Boolean circuits that have a small number of negation gates. The inversion complexity of , denoted , is defined to be the minimum number of negation gates in any AND/OR/NOT circuit (with access to constant inputs 0/1) that computes . We write to denote the class of -variable Boolean functions that have .

Another generalization of monotonicity is obtained by starting from an alternate characterization of monotone Boolean functions. A function $f\colon\{0,1\}^n\to\{0,1\}$ is monotone if and only if the value of $f$ “flips” from 0 to 1 at most once as the input ascends any chain in $\{0,1\}^n$ from $0^n$ to $1^n$. (Recall that a chain of length $\ell$ is an increasing sequence $(x^1,\dots,x^\ell)$ of vectors in $\{0,1\}^n$, i.e. for every $j\in\{1,\dots,\ell-1\}$ we have $x^j\prec x^{j+1}$.) Thus, it is natural to consider a generalization of monotonicity that allows more than one such “flip” to occur. We make this precise with the following notation and terminology: given a Boolean function $f$ and a chain $X=(x^1,\dots,x^\ell)$, a position $j\in[\ell-1]$ is said to be alternating with respect to $f$ if $f(x^j)\neq f(x^{j+1})$. We write $A(f,X)$ to denote the set of alternating positions in $X$ with respect to $f$, and we let $a(f,X)=|A(f,X)|$ denote its size. We write $a(f)$ to denote the maximum of $a(f,X)$ taken over all chains $X$ in $\{0,1\}^n$, and we say that $f$ is $k$-alternating if $a(f)\le k$.
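The alternation count can be computed by brute force on small cubes. The sketch below is only an illustration: the function names are ours, and it relies on the observation that it suffices to consider maximal chains (each step turns a single 0 into a 1), since refining a chain can never remove a flip.

```python
from itertools import product

def alternation_profile(f, n):
    """Map each x in {0,1}^n to the maximum number of flips of f
    along a chain that starts at x (computed over maximal chains)."""
    a = {}
    # Visit points by decreasing Hamming weight, so every upper
    # neighbour of x is already in the table when x is processed.
    for x in sorted(product((0, 1), repeat=n), key=sum, reverse=True):
        best = 0
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                best = max(best, a[y] + (f(x) != f(y)))
        a[x] = best
    return a

def alternations(f, n):
    """Maximum number of flips of f along any chain in {0,1}^n."""
    return alternation_profile(f, n)[(0,) * n]

parity = lambda x: sum(x) % 2
conj = lambda x: int(all(x))
print(alternations(parity, 4), alternations(conj, 4))
```

On four variables the parity function flips at every step of a maximal chain, while the monotone AND function flips exactly once.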

A celebrated result of A. A. Markov from 1957 [Mar57] gives a tight quantitative connection between the inversion and alternation complexities defined above:

###### Markov’s Theorem.

Let be a function which is not identically 0. Then (i) if , then ; and (ii) if , then

This robustness motivates the study of circuits which contain few negation gates, and indeed such circuits have been studied in complexity theory. Amano and Maruoka [AM05] have given bounds on the computational power of such circuits, showing that circuits for the clique function which contain fewer than many negation gates must have superpolynomial size. Other works have studied the effect of limiting the number of negation gates in formulas [Mor09a], bounded-depth circuits [ST03], constant-depth circuits [SW93] and non-deterministic circuits [Mor09b]. In the present work, we study circuits with few negations from the vantage point of computational learning theory, giving both positive and negative results.

### 1.2 Our results

We begin by studying the structural properties of functions that are computed or approximated by circuits with few negation gates. In Section 2 we establish the following extension of Markov’s theorem:

###### Theorem 1.1.

Let be a -alternating Boolean function. Then , where each is monotone and is either the parity function or its negation. Conversely, any function of this form is -alternating.

Theorem 1.1 along with Markov’s theorem yields the following characterization of :

###### Corollary 1.1.

Every can be expressed as where is either or its negation, each is monotone, and

A well-known consequence of Markov’s theorem is that every Boolean function is exactly computed by a circuit which has only negation gates, and as we shall see an easy argument shows that every Boolean function is -approximated by a circuit with negations. In Section 2 we note that no significant savings are possible over this easy upper bound:

###### Theorem 1.2.

For almost every function , any Boolean circuit that -approximates must contain negations.

We then turn to our main topic of investigation, the uniform-distribution learnability of circuits with few negations. We use our new extension of Markov’s theorem, Theorem 1.1, to obtain a generalization of the Fourier-based uniform-distribution learning algorithm of Bshouty and Tamon [BT96] for monotone circuits:

###### Theorem 1.3.

There is a uniform-distribution learning algorithm which learns any unknown from random examples to error in time

Theorem 1.3 immediately leads to the following question: can an even faster learning algorithm be given for circuits with negations, or is the running time of Theorem 1.3 essentially the best possible? Interestingly, prior to our work a matching lower bound for Theorem 1.3 was not known even for the special case of monotone functions (corresponding to ). As mentioned earlier, Bshouty and Tamon proved that to achieve accuracy any learning algorithm needs time for any (see Section 3.2.1 for a slight sharpening of this statement). For larger values of , though, the strongest previous lower bound was due to Blum, Burch and Langford [BBL98]. Their Theorem 10 implies that any membership-query algorithm that learns monotone functions to error (for any ) must run in time (in fact, must make at least this many membership queries). However, this lower bound does not differentiate between the number of membership queries required to learn to high accuracy versus “moderate” accuracy – say, versus . Thus the following question was unanswered prior to the current paper: what is the best lower bound that can be given, both as a function of and , on the complexity of learning monotone functions to accuracy ?

We give a fairly complete answer to this question, providing a lower bound as a function of and on the complexity of learning circuits with negations. Our lower bound essentially matches the upper bound of Theorem 1.3, and is thus simultaneously essentially optimal in all three parameters and for a wide range of settings of and . Our lower bound result is the following:

###### Theorem 1.4.

For any and any , , any membership-query algorithm that learns any unknown function to error must make membership queries.

We note that while our algorithm uses only uniform random examples, our lower bound holds even for the stronger model in which the learning algorithm is allowed to make arbitrary membership queries on points of its choosing.

Theorem 1.4 is proved using tools from the study of hardness amplification. The proof involves a few steps. We start with a strong lower bound for the task of learning to high accuracy the class of balanced monotone Boolean functions (reminiscent of the lower bound obtained by Bshouty and Tamon). Then we combine hardness amplification techniques and results on the noise sensitivity of monotone functions in order to get stronger and more general lower bounds for learning monotone Boolean functions to moderate accuracy. Finally, we use hardness amplification once more to lift this result into a lower bound for learning circuits with few negations to moderate accuracy. An ingredient employed in this last stage is to use a -alternating combining function which “behaves like” the parity function on (roughly) variables; this is crucial in order for us to obtain our essentially optimal final lower bound of for circuits with negations. These results are discussed in more detail in Section 3.2.

## 2 Structural facts about computing and approximating functions with low inversion complexity

### 2.1 An extension of Markov’s theorem.

We begin with the proof of our new extension of Markov’s theorem. For any let be the characteristic function of . For and , we write to denote

$$a_f(x) \stackrel{\rm def}{=} \max\{a(f,X) : X \text{ is a chain that starts at } x\},$$

and note that . For let us write to denote , and let We note that partition the set of all inputs: for all , and .

We will need the following simple observation:

###### Observation 2.0.

Fix any and any . If and then for some Furthermore, if then .

###### Theorem 1.1.

(Restated) Fix and let Then , where

• the functions are monotone for all ,

• is if and if ,

and is the parity function on variables. Conversely, for any monotone Boolean functions , any Boolean function of the form is -alternating.

###### Proof.

Claim (i) follows immediately from Observation 2.0 above. The proof of (ii) is by induction on . In the base case , we have that is a constant function and the claim is immediate.

For the inductive step, suppose that the claim holds for all functions that have . We define as Observation 2.0 implies that for all and , and in particular, . Therefore we may apply the inductive hypothesis to and express it as Since for , we may use this along with the fact that to get:

and the inductive hypothesis holds (note that ).

The converse is easily verified by observing that any chain in can induce at most possible vectors of values for because of their monotonicity. ∎
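The decomposition can be checked numerically on small examples. The sketch below is an illustration under our own assumptions rather than the construction used in the proof: it takes $k$ to be the alternation count of $f$, sets $f_i(x)=1$ exactly when fewer than $i$ flips remain on any chain starting at $x$, and combines the $f_i$ by parity, negated when $f(0^n)=1$. All names are hypothetical.

```python
from itertools import product

def alternation_profile(f, n):
    """Max number of flips of f on a chain starting at x, for every x."""
    a = {}
    for x in sorted(product((0, 1), repeat=n), key=sum, reverse=True):
        best = 0
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                best = max(best, a[y] + (f(x) != f(y)))
        a[x] = best
    return a

def parity_of_monotone(f, n):
    """Return (k, [f_1..f_k], h) with f(x) = h(f_1(x), ..., f_k(x))."""
    a = alternation_profile(f, n)
    zero = (0,) * n
    k = a[zero]
    # Assumed choice: f_i is the indicator of "fewer than i flips remain".
    # It is monotone because the profile never increases as x ascends.
    fs = [lambda x, i=i: int(a[x] < i) for i in range(1, k + 1)]
    negate = f(zero)  # use the negated parity when f(0^n) = 1

    def h(bits):
        return (sum(bits) + negate) % 2

    return k, fs, h

def is_monotone(g, n):
    """Check g(x) <= g(y) across every edge x < y of the cube."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0 and g(x) > g(x[:i] + (1,) + x[i + 1:]):
                return False
    return True
```

For any truth table `f` on a handful of variables, evaluating `h([fi(x) for fi in fs])` at every point should reproduce `f(x)`, and each `fi` should pass `is_monotone`.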

Theorem 1.1 along with Markov’s theorem immediately yields the following corollary:

###### Corollary 1.1.

Every can be expressed as where is either or its negation, each is monotone, and

### 2.2 Approximation.

As noted earlier, Markov’s theorem implies that every -variable Boolean function can be exactly computed by a circuit with (essentially) negations (since for all ). If we set a less ambitious goal of approximating Boolean functions (say, having a circuit correctly compute on a fraction of all inputs), can significantly fewer negations suffice?

We first observe that every Boolean function is -close (with respect to the uniform distribution) to a function that has . The function is obtained from simply by setting for all inputs that have Hamming weight outside of ; a standard Chernoff bound implies that and disagree on at most inputs. Markov’s theorem then implies that the inversion complexity is at most . Thus, every Boolean function can be approximated to high accuracy by a circuit with only negations.
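The truncation argument can be made concrete with exact binomial counts. In the sketch below the half-width $w=\sqrt{n\ln(2/\varepsilon)/2}$ is our own choice, made so that Hoeffding's inequality bounds the discarded mass by $\varepsilon$; the constants in the paper's band may differ, and the function names are hypothetical.

```python
from math import comb, log, sqrt

def band_halfwidth(n, eps):
    """Half-width w chosen so that 2*exp(-2*w**2/n) equals eps."""
    return sqrt(n * log(2 / eps) / 2)

def mass_outside_band(n, w):
    """Exact fraction of points whose Hamming weight j has |j - n/2| > w."""
    outside = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) > w)
    return outside / 2 ** n

def layers_in_band(n, w):
    """Number of Hamming-weight layers j with |j - n/2| <= w."""
    return sum(1 for j in range(n + 1) if abs(j - n / 2) <= w)

n, eps = 100, 0.05
w = band_halfwidth(n, eps)
print(mass_outside_band(n, w), layers_in_band(n, w))
```

A function that is constant outside the band can only flip inside it or at its two boundaries, so its alternation count is at most roughly the number of layers returned by `layers_in_band`.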

We now show that this upper bound is essentially optimal: for almost every Boolean function, any -approximating circuit must contain at least negations. To prove this, we recall the notion of the total influence of a Boolean function : this is

$$\mathbf{Inf}[f]=\sum_{i=1}^{n}\mathbf{Inf}_i[f],\qquad\text{where}\qquad \mathbf{Inf}_i[f]=\Pr_{x\in\{0,1\}^n}\big[f(x)\neq f(x^{\oplus i})\big]$$

and denotes with its -th coordinate flipped. The total influence of is easily seen to equal , where is the fraction of all edges in the Boolean hypercube that are bichromatic, i.e. have In Appendix A.1 we prove the following lemma:

###### Lemma 2.0.

Suppose is such that . Then .
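For small $n$ the total influence can be computed exactly. The following sketch (function names are ours) evaluates the sum of coordinate influences and, separately, the fraction of bichromatic hypercube edges, so that the identity "total influence equals $n$ times the bichromatic fraction" can be checked directly.

```python
from itertools import product

def total_influence(f, n):
    """Sum over i of Pr_x[f(x) != f(x with coordinate i flipped)]."""
    points = list(product((0, 1), repeat=n))
    total = 0.0
    for i in range(n):
        flips = sum(
            f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:]) for x in points
        )
        total += flips / len(points)
    return total

def bichromatic_fraction(f, n):
    """Fraction of hypercube edges whose endpoints get different values."""
    edges = 0
    bichromatic = 0
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:  # count each edge once, from its lower endpoint
                y = x[:i] + (1,) + x[i + 1:]
                edges += 1
                bichromatic += f(x) != f(y)
    return bichromatic / edges

maj3 = lambda x: int(sum(x) >= 2)
print(total_influence(maj3, 3), 3 * bichromatic_fraction(maj3, 3))
```

Both printed quantities agree for the three-bit majority function, as the identity predicts.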

It is easy to show that a random function has influence with probability . Given this, Lemma 2.0, together with the elementary fact that if is -close to then , directly yields the following:

###### Theorem 1.2.

With probability , any -approximator for a random function must have inversion complexity

###### Remark 2.0.

The results in this section (together with simple information-theoretic arguments showing that random functions are hard to learn) imply that one cannot expect to have a learning algorithm (even to constant accuracy) for the class of circuits with negations in time significantly better than . As we shall see in Section 3.1, for any fixed it is possible to learn to accuracy in time .

## 3 Learning circuits with few negations

### 3.1 A learning algorithm for Cnt.

We sketch the learning algorithm and analysis of Bshouty and Tamon [BT96]; using the results from Section 2 our Theorem 1.3 will follow easily from their approach. Our starting point is the simple observation that functions with good “Fourier concentration” can be learned to high accuracy under the uniform distribution simply by estimating all of the low-degree Fourier coefficients. This fact, established by Linial, Mansour and Nisan, is often referred to as the “Low-Degree Algorithm:”

###### Theorem 3.1 (Low-Degree Algorithm ([LMN93])).

Let be a class of Boolean functions such that for and ,

$$\sum_{|S|>\tau}\hat f(S)^2\le\varepsilon$$

for any . Then can be learned from uniform random examples in time .
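The idea behind the Low-Degree Algorithm can be sketched in a few lines: estimate every Fourier coefficient of degree at most $\tau$ from the sample, and output the sign of the resulting low-degree polynomial. The code below is a minimal illustration with hypothetical names, labels in $\{-1,+1\}$, and an arbitrary sample size; it does not reproduce the sample-complexity or running-time analysis of [LMN93].

```python
import random
from itertools import combinations, product

def chi(S, x):
    """Character chi_S(x) = (-1)^(sum of x_i for i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def low_degree_learn(examples, n, degree):
    """Estimate all Fourier coefficients with |S| <= degree from labelled
    examples (x, y), y in {-1, +1}, and return a sign hypothesis."""
    subsets = [S for d in range(degree + 1)
               for S in combinations(range(n), d)]
    m = len(examples)
    coeffs = {S: sum(y * chi(S, x) for x, y in examples) / m
              for S in subsets}

    def hypothesis(x):
        value = sum(c * chi(S, x) for S, c in coeffs.items())
        return 1 if value >= 0 else -1

    return hypothesis, coeffs

def majority3(x):
    """Monotone target: +1 iff at least two of the first three bits are 1."""
    return 1 if x[0] + x[1] + x[2] >= 2 else -1

def demo(n=5, degree=1, m=10000, seed=0):
    """Learn majority3 from m uniform examples; return the exact error."""
    rng = random.Random(seed)
    examples = []
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        examples.append((x, majority3(x)))
    h, _ = low_degree_learn(examples, n, degree)
    cube = list(product((0, 1), repeat=n))
    return sum(h(x) != majority3(x) for x in cube) / len(cube)
```

In this toy example the degree-1 part of the three-bit majority already has the same sign as the function at every point, so the hypothesis is accurate even though some Fourier weight sits at degree 3.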

Using the fact that every monotone function has total influence , and the well-known Fourier expression for total influence, a simple application of Markov’s inequality let Bshouty and Tamon show that every monotone function has

$$\sum_{|S|>\sqrt{n}/\varepsilon}\hat f(S)^2\le\varepsilon.$$

Together with Theorem 3.1, this gives their learning result for monotone functions.

Armed with Corollary 1.1, it is straightforward to extend this to the class . Corollary 1.1 and a union bound immediately give that every has , so the Fourier expression for influence and Markov’s inequality give that

$$\sum_{|S|>O(2^t)\sqrt{n}/\varepsilon}\hat f(S)^2\le\varepsilon$$

for Theorem 1.3 follows immediately using the Low-Degree Algorithm.

An immediate question is whether this upper bound on the complexity of learning is optimal; we give an affirmative answer in the next subsection.

### 3.2 Lower bounds for learning.

As noted in the introduction, we prove information-theoretic lower bounds against learning algorithms that make a limited number of membership queries. We start by establishing a new lower bound on the number of membership queries that are required to learn monotone functions to high accuracy, and then build on this to provide a lower bound for learning Our query lower bounds are essentially tight, matching the upper bounds (which hold for learning from uniform random examples) up to logarithmic factors in the exponent.

We first state the results; the proofs are deferred to Section 3.2.1. We say that a Boolean function is balanced if

###### Theorem 3.2.

There exists a class of balanced -variable monotone Boolean functions such that for any , , learning to accuracy requires membership queries.

This immediately implies the following corollary, which essentially closes the gap in our understanding of the hardness of learning monotone functions:

###### Corollary 3.2.

For any bounded away from , learning -variable monotone functions to accuracy requires queries.

Using this class as a building block, we obtain the following hardness of learning result for the class of -alternating functions:

###### Theorem 3.3.

For any function , there exists a class of balanced -alternating -variable Boolean functions such that, for any sufficiently large and such that (i) , and (ii) , learning to accuracy requires membership queries.

(We note that the tradeoff between the ranges of and that is captured by condition (ii) above seems to be inherent to our approach and not a mere artifact of the analysis; see Section 3.2.1.) This theorem immediately yields the following:

###### Corollary 3.3.

Learning the class of -alternating functions to accuracy in the uniform-distribution membership-query model requires membership queries, for any and .

###### Corollary 3.3.

For , learning to accuracy requires membership queries, for any .

#### 3.2.1 Proofs.

We require the following standard notion of composition for two functions and :

###### Definition 3.3 (Composition).

For and , we denote by the Boolean function on inputs defined by

$$(g\otimes f)(x)\stackrel{\rm def}{=}g(\underbrace{f,\dots,f}_{r})(x)=g\big(f(x_1,\dots,x_m),\dots,f(x_{(r-1)m+1},\dots,x_{rm})\big)$$

Similarly, for any and a class of Boolean functions on variables, we let

$$g\otimes\mathcal{F}_m=\{g\otimes f : f\in\mathcal{F}_m\}$$

and .
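As a concrete reading of the composition operator, the sketch below builds $g\otimes f$ for a combining function $g$ on $r$ bits and a base function $f$ on $m$ bits by applying $f$ to $r$ consecutive, disjoint blocks of $m$ input bits. The helper name `compose` is ours.

```python
def compose(g, f, r, m):
    """Return the function x -> g(f(block_1), ..., f(block_r)) on r*m bits,
    where block_j consists of bits (j-1)*m+1 .. j*m of x."""
    def composed(x):
        if len(x) != r * m:
            raise ValueError("expected an input of length r * m")
        blocks = (tuple(x[j * m:(j + 1) * m]) for j in range(r))
        return g(tuple(f(block) for block in blocks))
    return composed

xor2 = lambda y: (y[0] + y[1]) % 2
and2 = lambda x: x[0] & x[1]
h = compose(xor2, and2, r=2, m=2)
print(h((1, 1, 0, 1)))  # XOR(AND(1,1), AND(0,1)) = XOR(1, 0) = 1
```

Composing a whole class with $g$ then amounts to applying `compose(g, f, r, m)` to every `f` in the class.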

Overview of the arguments. Our approach is based on hardness amplification. In order to get our lower bound against learning -alternating functions, we (a) start from a lower bound ruling out very high-accuracy learning of monotone functions; (b) use a suitable monotone combining function to get an XOR-like hardness amplification, yielding a lower bound for learning (a subclass of) monotone functions to moderate accuracy; (c) repeat this approach on this subclass with a different (now -alternating) combining function to obtain our final lower bound, for learning -alternating functions to moderate accuracy.

$$\underbrace{\text{high-accuracy monotone}}_{(a)}\;\xrightarrow[\text{monotone}]{\bigotimes\text{-like}}\;\underbrace{\text{moderate-accuracy monotone}}_{(b)}\;\xrightarrow[k\text{-alternating}]{\bigotimes\text{-like}}\;\underbrace{\text{moderate-accuracy }k\text{-alternating}}_{(c)}\tag{1}$$

In more detail, in both steps (b) and (c) the idea is to take as base functions the hard class from the previous step (respectively “monotone hard to learn to high accuracy”, and “monotone hard to learn to moderate accuracy”), and compose them with a very noise-sensitive function in order to amplify hardness. Care must be taken to ensure that the combining function satisfies several necessary constraints (being monotone for (b) and -alternating for (c), and being as sensitive as possible to the correct regime of noise in each case).

##### Useful tools.

We begin by recalling a few notions and results that play a crucial role in our approach.

###### Definition 3.3 (Noise stability).

For , the noise stability of at is

$$\mathrm{Stab}_\eta(f)\stackrel{\rm def}{=}1-2\Pr[f(x)\neq f(y)]$$

where is drawn uniformly at random from and is obtained from by independently for each bit having (i.e., and are -correlated).
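Noise stability can be evaluated exactly on small cubes by summing over all pairs $(x,y)$. The sketch below assumes the standard convention for $\eta$-correlated pairs, namely that each bit of $y$ differs from the corresponding bit of $x$ independently with probability $(1-\eta)/2$; the function name is ours.

```python
from itertools import product

def noise_stability(f, n, eta):
    """Exact Stab_eta(f) = 1 - 2 * Pr[f(x) != f(y)] for eta-correlated x, y.
    Assumed convention: each bit is flipped with probability (1 - eta) / 2."""
    flip = (1 - eta) / 2
    points = list(product((0, 1), repeat=n))
    disagree = 0.0
    for x in points:
        for y in points:
            d = sum(a != b for a, b in zip(x, y))  # Hamming distance
            if f(x) != f(y):
                disagree += flip ** d * (1 - flip) ** (n - d)
    return 1 - 2 * disagree / len(points)

parity = lambda x: sum(x) % 2
print(noise_stability(parity, 3, 0.5))
```

Under this convention the parity function on $n$ bits has stability $\eta^n$ and a single-variable dictator has stability $\eta$, which gives two quick sanity checks.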

###### Definition 3.3 (Bias and expected bias).

The bias of a Boolean function is the quantity , while the expected bias of at is defined as , where is a random restriction on coordinates where each coordinate is independently left free with probability and set to 0 or 1 with same probability .

###### Fact 3.3 (Proposition 4.0.11 from [O’d03]).

For and , we have

$$\frac12+\frac12\,\mathrm{Stab}_{1-2\delta}(f)\le\mathrm{ExpBias}_{2\delta}(f)\le\frac12+\frac12\sqrt{\mathrm{Stab}_{1-2\delta}(f)}.$$
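The two-sided bound above can be checked by exhaustive enumeration on very small functions. The sketch below is an illustration under stated assumptions: the bias of a function is taken to be $\max(\Pr[f=0],\Pr[f=1])$, random restrictions leave each coordinate free with the given probability and otherwise fix it to a uniform bit, and $\eta$-correlated pairs flip each bit with probability $(1-\eta)/2$. All function names are hypothetical.

```python
from itertools import product
from math import sqrt

def bias(f, restriction):
    """max(Pr[f=0], Pr[f=1]) over the free (None) coordinates."""
    free = [i for i, v in enumerate(restriction) if v is None]
    ones = 0
    for bits in product((0, 1), repeat=len(free)):
        x = list(restriction)
        for i, b in zip(free, bits):
            x[i] = b
        ones += f(tuple(x))
    p1 = ones / 2 ** len(free)
    return max(p1, 1 - p1)

def expected_bias(f, n, delta):
    """Average bias over random restrictions with free-probability delta."""
    total = 0.0
    for rho in product((None, 0, 1), repeat=n):
        stars = sum(v is None for v in rho)
        prob = delta ** stars * ((1 - delta) / 2) ** (n - stars)
        total += prob * bias(f, rho)
    return total

def noise_stability(f, n, eta):
    """Exact 1 - 2 * Pr[f(x) != f(y)], bits flipped w.p. (1 - eta) / 2."""
    flip = (1 - eta) / 2
    points = list(product((0, 1), repeat=n))
    disagree = 0.0
    for x in points:
        for y in points:
            d = sum(a != b for a, b in zip(x, y))
            if f(x) != f(y):
                disagree += flip ** d * (1 - flip) ** (n - d)
    return 1 - 2 * disagree / len(points)

maj3 = lambda x: int(sum(x) >= 2)
d = 0.1
stab = noise_stability(maj3, 3, 1 - 2 * d)
print(0.5 + 0.5 * stab, expected_bias(maj3, 3, 2 * d), 0.5 + 0.5 * sqrt(stab))
```

For the three-bit majority the middle value printed lies between the two outer ones; for parity the lower bound is met with equality, since a restricted parity is perfectly balanced unless every coordinate is fixed.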

Building on Talagrand’s probabilistic construction [Tal96] of a class of functions that are sensitive to very small noise, Mossel and O’Donnell [MO03] gave the following noise stability upper bound. (We state below a slightly generalized version of their Theorem 3, which follows from their proof with some minor changes; see Appendix A.2 for details of these changes.)

###### Theorem 3.4 (Theorem 3 of [MO03]).

There exists an absolute constant and an infinite family of balanced monotone functions such that holds for all sufficiently large , as long as .

Applying Fact 3.3, it follows that for the Mossel-O’Donnell function on inputs and any as above, we have

$$\frac12\le\mathrm{ExpBias}_\gamma(g_r)\le\frac12+\frac12\sqrt{1-K\tau}\le 1-\frac{K}{4}\tau\tag{2}$$

for .

We will use the above upper bound on expected bias together with the following key tool from [FLS11], which gives a hardness amplification result for uniform distribution learning. This result builds on the original hardness amplification ideas of O’Donnell [O’D03]. (We note that the original theorem statement from [FLS11] deals with the running time of learning algorithms, but inspection of the proof shows that the theorem also applies to the number of membership queries that the learning algorithms perform.)

###### Theorem 3.5 (Theorem 12 of [FLS11]).

Fix , and let be a class of -variable Boolean functions such that for every , . Let be a uniform distribution membership query algorithm that learns to accuracy using queries. Then there exists a uniform-distribution membership query algorithm that learns to accuracy using membership queries.

Hardness of learning monotone functions to high accuracy. At the bottom level, corresponding to step (a) in (1), our approach relies on the following simple claim which states that monotone functions are hard to learn to very high accuracy. (We view this claim as essentially folklore; as noted in the introduction it slightly sharpens a lower bound given in [BT96]. A proof is given for completeness in Appendix A.3.)

###### Claim 3.5 (A slice of hardness).

There exists a class of balanced monotone Boolean functions and a universal constant such that, for any constants , learning to error requires at least membership queries.

We now prove Theorem 3.2, i.e. we establish a stronger lower bound (in terms of the range of accuracy it applies to) against learning the class of monotone functions. We do this by amplifying the hardness result of Claim 3.5: we compose the “mildly hard” class of functions with a monotone function (the Mossel-O’Donnell function of Theorem 3.4) that is very sensitive to small noise. Intuitively, the noise rate here is comparable to the error rate from Claim 3.5.

###### Proof of Theorem 3.2.

We will show that there exists an absolute constant such that for any sufficiently large and , there exist , (both of which are ) such that learning the class of (balanced) functions on variables to accuracy requires at least membership queries.

By contradiction, suppose we have an algorithm which, for all as above, learns the class to accuracy using membership queries. We show that this implies that for infinitely many values of , one can learn to error with membership queries, in contradiction to Claim 3.5.

Fix any large enough and , and choose satisfying and where is the constant from Theorem 3.4. Note that this implies so indeed both and are (Intuitively, the value is the error we want to achieve to get a contradiction, while the value is the error we can get from Theorem 3.5.) Note that we indeed can use the Mossel-O’Donnell function from Theorem 3.4, which requires – for our choice of , this is equivalent to . Finally, set .

We apply Theorem 3.5 with , and . (Note that all functions in are balanced, and thus trivially satisfy the condition that , and recall that is the accuracy the theorem guarantees against the original class .) With these parameters we have

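$$\mathrm{ExpBias}_\gamma(g)+\epsilon\;\underset{\text{Eq. (2)}}{\le}\;1-\frac{K}{4}\cdot\frac{5\tau}{K}+\frac{\tau}{4}=1-\tau\le\mathrm{accuracy}(A).$$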

Theorem 3.5 gives that there exists a learning algorithm learning to accuracy with membership queries, that is, many queries. However, we have , where the inequality comes from observing that (so that it suffices to pick satisfying ). This contradicts Claim 3.5 and proves the theorem. ∎

###### Remark 3.5 (Improving this result).

Proposition 1 of [MO03] gives a lower bound on the best noise stability that can be achieved by any monotone function. If this lower bound were in fact tight — that is, there exists a family of monotone functions such that for all , — then the above lower bound could be extended to an (almost) optimal range of , i.e. for any fixed superconstant function.

From hardness of learning monotone functions to hardness of learning -alternating functions. We now establish the hardness of learning -alternating functions. Hereafter we denote by the class of “hard” monotone functions from Theorem 3.2. Since is balanced and every has bias zero, it is easy to see that is a class of balanced functions.

We begin by recalling the following useful fact about the noise stability of functions that are close to :

###### Fact 3.5 (e.g., from the proof of Theorem 9 in [BT13]).

Let . If is a Boolean function on variables which -approximates , then for all ,

$$\mathrm{Stab}_{1-2\delta}(f)\le(1-2\eta)^2(1-2\delta)^r+4\eta(1-\eta).\tag{3}$$

We use the above fact to define a function that is tailored to our needs: that is, a -alternating function that is very sensitive to noise and is defined on roughly inputs. Without the last condition, one could just use , but in our context this would only let us obtain a (rather than a ) in the exponent of the lower bound, because of the loss in the reduction. To see why, observe that by using a combining function on variables instead of , the number of variables of the combined function would be only . However, to get a contradiction with the hardness of monotone functions we shall need , where , as the hardness amplification lemma requires the error to scale down with the number of combined functions.

###### Definition 3.5.

For any odd , let be the symmetric Boolean function on inputs defined as follows: for all ,

In particular, is -alternating, and agrees with on the middle layers of the hypercube. By an additive Chernoff bound, one can show that is -close to , for .

(The above definition can be straightforwardly extended to not necessarily odd, resulting in a similar -alternating perfectly balanced function that agrees with on middle layers of the cube and is below and above those layers. For the sake of simplicity we leave out the detailed description of the other cases.)

###### Proof of Theorem 3.3.

will be defined as the class for some and such that (see below). It is easy to check that functions in are balanced and -alternating. We show below that for sufficiently large, and , learning to accuracy requires membership queries.

By contradiction, suppose we have an algorithm learning for all as above the class of -alternating functions to accuracy using membership queries, where is a universal constant to be determined during the analysis. We claim that this implies that for infinitely many values of , one can learn to some range of accuracies with a number of membership queries contradicting the lower bound of Theorem 3.2.

Fix any large enough, and as above (which in particular impose ). The constraints we impose on , and are the following:

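$$mr=n;\qquad\mathrm{ExpBias}_\tau(\mathrm{PAR}'_{k,r})+\varepsilon\le 1-\varepsilon;\qquad m=\omega_n(1);\qquad\tau\ge\frac{1}{m^{1/6}};\tag{4}$$

$$\beta\,\frac{k\sqrt{n}}{\varepsilon}<\alpha\,\frac{\sqrt{m}}{\tau},\tag{5}$$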

where the constraints in (4) are for us to apply the previous theorems and lemmas, while (5) is needed to ultimately derive a contradiction.

One can show that by taking and the second constraint of (4) is satisfied, as then (for the derivation, see Appendix Section A.4). Then, with the first constraint of (4), we get (omitting for simplicity the floors) , so as long as , the third constraint of (4) is met as well. With these settings, the final constraint of (4) can be rewritten as As , it is sufficient to have which holds because of the lower bound on .

It only remains to check Constraint (5) holds:

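$$\frac{k\sqrt{n}}{\varepsilon}=\frac{100k\sqrt{n}}{\tau r}=\frac{100k}{\sqrt{r}}\cdot\frac{\sqrt{m}}{\tau}\le\left(\frac{100\sqrt{2\ln 5}}{1-2\ln 5/k^2}\right)\frac{\sqrt{m}}{\tau}\le 300\sqrt{2\ln 5}\cdot\frac{\sqrt{m}}{\tau},$$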

where the first inequality holds because as and the second holds because So for the right choice of , e.g. , , and (5) is satisfied.

It now suffices to apply Theorem 3.5 to , with parameters and , on algorithm , which has accuracy . Since the functions of are unbiased, it follows that there exists an algorithm learning to accuracy , with , making only

membership queries, which contradicts the lower bound of Theorem 3.2. ∎

###### Remark 3.5 (On the relation between ε and k).

The tradeoff in the ranges for and appears to be inherent to this approach. Namely, it comes essentially from Constraint (4), itself deriving from the hypotheses of Theorem 3.2. However, even getting an optimal range in the latter would still require , which along with and imposes and .

## References

• [AM02] K. Amano and A. Maruoka. On learning monotone Boolean functions under the uniform distribution. In Proceedings of the 13th International Conference on Algorithmic Learning Theory (ALT), pages 57–68, 2002.
• [AM05] K. Amano and A. Maruoka. A Superpolynomial Lower Bound for a Circuit Computing the Clique Function with At Most $(1/6)\log\log n$ Negation Gates. SIAM Journal on Computing, 35(1):201–216, 2005.
• [Ang88] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
• [BBL98] A. Blum, C. Burch, and J. Langford. On learning monotone Boolean functions. In Proceedings of the Thirty-Ninth Annual Symposium on Foundations of Computer Science, pages 408–415, 1998.
• [BT96] N. Bshouty and C. Tamon. On the Fourier spectrum of monotone functions. Journal of the ACM, 43(4):747–770, 1996.
• [BT13] E. Blais and L-Y. Tan. Approximating Boolean functions with depth-2 circuits. Electronic Colloquium on Computational Complexity (ECCC), 20:51, 2013.
• [FLS11] V. Feldman, H. K. Lee, and R. A. Servedio. Lower bounds and hardness amplification for learning shallow monotone formulas. Journal of Machine Learning Research - Proceedings Track, 19:273–292, 2011.
• [Kor03] A. D. Korshunov. Monotone Boolean functions. Russian Mathematical Surveys (Uspekhi Matematicheskikh Nauk), 58(5):929–1001, 2003.
• [KV94] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.
• [LMN93] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform and learnability. Journal of the ACM, 40(3):607–620, 1993.
• [Mar57] A. A. Markov. On the inversion complexity of systems of functions. Doklady Akademii Nauk SSSR, 116:917–919, 1957. English translation in [Mar58].
• [Mar58] A. A. Markov. On the inversion complexity of a system of functions. Journal of the ACM, 5(4):331–334, October 1958.
• [MO03] E. Mossel and R. O’Donnell. On the noise sensitivity of monotone functions. Random Structures and Algorithms, 23(3):333–350, 2003.
• [Mor09a] H. Morizumi. Limiting Negations in Formulas. In ICALP, pages 701–712, 2009.
• [Mor09b] H. Morizumi. Limiting negations in non-deterministic circuits. Theoretical Computer Science, 410(38-40):3988–3994, 2009.
• [O’D03] R. O’Donnell. Computational applications of noise sensitivity. PhD thesis, MIT, June 2003.
• [OS07] R. O’Donnell and R. Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827–844, 2007.
• [OW09] R. O’Donnell and K. Wimmer. KKL, Kruskal-Katona, and monotone nets. In Proc. 50th IEEE Symposium on Foundations of Computer Science (FOCS), 2009.
• [Raz85] A. Razborov. Lower bounds on the monotone complexity of some Boolean functions. Doklady Akademii Nauk SSSR, 281:798–801, 1985. English translation in: Soviet Mathematics Doklady 31:354–357, 1985.
• [RW90] R. Raz and A. Wigderson. Monotone circuits for matching require linear depth. In Proceedings of the 22nd ACM Symposium on Theory of Computing, pages 287–292, 1990.
• [Ser04] R. Servedio. On learning monotone DNF under product distributions. Information and Computation, 193(1):57–74, 2004.
• [ST03] S. Sung and K. Tanaka. Limiting Negations in Bounded-Depth Circuits: an Extension of Markov’s Theorem. In ISAAC, pages 108–116, 2003.
• [SW93] M. Santha and C. Wilson. Limiting negations in constant depth circuits. SIAM Journal on Computing, 22(2):294–302, 1993.
• [Tal96] M. Talagrand. How much are increasing sets positively correlated? Combinatorica, 16(2):243–258, 1996.

## Appendix A Proofs

### A.1 Proof of Lemma 2.0.

Suppose for some : this means that at least an fraction of all edges are bichromatic. Define the weight level (denoted ) to be the set of all edges going from a vertex of Hamming weight to a vertex of Hamming weight (in particular, ), and consider weight levels (the “middle levels”) for . (We suppose without loss of generality that is a whole number.) Now, the fraction of all edges which do not lie in these middle levels is at most

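$$\frac{1}{n2^{n-1}}\cdot 2\sum_{k=0}^{\frac n2-a\sqrt n-1}|W_k|\le\frac{2n}{n2^{n-1}}\sum_{k=0}^{\frac n2-a\sqrt n-1}\binom{n}{k}\le\frac{4}{2^n}\sum_{k=0}^{\frac n2-a\sqrt n-1}\binom{n}{k}\le 4e^{-2a^2}=\frac{\alpha}{2}.$$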

So no matter how many of these edges are bichromatic, it must still be the case that at least an fraction of all edges in the “middle levels” are bichromatic.

Since the ratio

$$\frac{|W_{n/2}|}{|W_{n/2-a\sqrt n}|}=\frac{\frac n2\binom{n}{n/2}}{\left(\frac n2+a\sqrt n\right)\binom{n}{n/2-a\sqrt n}}$$

converges monotonically from below (when goes to infinity) to , any two weight levels amongst the middle ones have roughly the same number of edges, up to a multiplicative factor . Setting and , this implies that at least a fraction of the weight levels in the middle levels have at least a fraction of their edges being bichromatic. (Indeed, otherwise we would have, letting denote the number of bichromatic edges in weight layer ,

$$\frac{\alpha}{2}\cdot\underbrace{\sum_{k=\frac n2-a\sqrt n}^{\frac n2+a\sqrt n-1}|W_k|}_{\text{total}}$$

So , which gives , a contradiction.)

Let be this collection of at least weight levels (from the middle ones) that each have at least a fraction of edges being bichromatic, and write to denote the fraction of bichromatic edges in , so that for each it holds that . Consider a random chain from to . The marginal distribution according to which an edge is drawn from any given fixed weight level is uniform on , so by linearity, the expected number of bichromatic edges in a random chain is at least , and hence some chain must have that many bichromatic edges. ∎

### A.2 Derivation of Theorem 3.4 using Theorem 3 of [MO03].

The original theorem is stated for , with the upper bound being . However, the proof of [MO03] goes through for our purposes until the very end, where they set