PAC Quasi-automatizability of Resolution over Restricted Distributions

# PAC Quasi-automatizability of Resolution over Restricted Distributions

Brendan Juba
Harvard University
bjuba@alum.mit.edu
Supported by ONR grant number N000141210358.
###### Abstract

We consider principled alternatives to unsupervised learning in data mining by situating the learning task in the context of the subsequent analysis task. Specifically, we consider a query-answering (hypothesis-testing) task: In the combined task, we decide whether an input query formula is satisfied over a background distribution by using input examples directly, rather than invoking a two-stage process in which (i) rules over the distribution are learned by an unsupervised learning algorithm and (ii) a reasoning algorithm decides whether or not the query formula follows from the learned rules. In a previous work [15], we observed that the learning task could satisfy numerous desirable criteria in this combined context – effectively matching what could be achieved by agnostic learning of CNFs from partial information – that are not known to be achievable directly. In this work, we show that likewise, there are reasoning tasks that are achievable in such a combined context that are not known to be achievable directly (and indeed, have been seriously conjectured to be impossible, cf. Alekhnovich and Razborov [1]). Namely, we test for a resolution proof of the query formula of a given size in quasipolynomial time (that is, “quasi-automatizing” resolution). The learning setting we consider is a partial-information, restricted-distribution setting that generalizes learning parities over the uniform distribution from partial information, another task that is known not to be achievable directly in various models (cf. Ben-David and Dichterman [5] and Michael [20]).

## 1 Introduction

When learning is employed in data mining and artificial intelligence, the objective is not simply to form hypotheses that are supported by the data, but to draw further inferences based on these learned hypotheses. The use of logical inference on such learned hypotheses is problematic since learning algorithms (namely, PAC-learning algorithms here [25]) do not guarantee that the hypotheses they produce can be interpreted as “valid” in the standard sense of mathematical logic. Motivated by this lacuna, Valiant [26] introduced PAC-Semantics (for a logic) to enable the analysis of logical inferences drawn from learned hypotheses. Such logical inferences are only interesting in a partial-information model, since otherwise a query about a candidate conclusion can be evaluated directly by evaluating the query formula on each complete example and considering the fraction of examples that satisfy the query.

Valiant’s work analyzed a two-stage process in which in the first stage, learning algorithms are used to extract explicit premises from examples drawn from a background distribution. (“Supervision” is only provided in that the rules predict individual bits of the examples based on others.) In the second stage, conclusions are drawn from these premises by applying logical inferences. A stronger approach would be to ask, given a query formula as input, if there exist some premises that can be verified to hold on the (incomplete) examples, for which logical inference could derive the given query. The work of Khardon and Roth [17] on learning to reason suggests that it may be advantageous to take the two tasks together rather than in two separate stages as done by Valiant: they show that by processing the data directly, it is possible to answer queries (in a slightly different model) for which the reasoning problem is NP-hard, and for which the hypotheses that would support such queries are general DNFs, and hence not known to be learnable. In a follow-up work, Khardon and Roth [18] showed similar results for a answering a more limited class of queries from incomplete information.

One of Khardon and Roth’s major objectives in their works was to provide tractable reasoning by avoiding the use of theorem-proving techniques. Indeed, in most approaches to these problems of learning and reasoning that are actually used in practice (e.g., based on graphical models) the resulting reasoning problem is known to be #P-hard, even for highly restricted classes of formulas (cf. Roth [23]). Khardon and Roth’s approach [18] could still only answer CNF queries under partial information that were, for example, Horn clauses or of low width. Our recent work [15] established that the combined problem of evaluating a query formula from partial examples using theorem-proving is actually no more difficult than the “second stage” problem of exact, classical reasoning for all “natural” (in the sense of Beame et al. [4]) fragments of proof systems.111We note some similarly motivating results of Alekhnovich et al. [2], showing how theorem-proving algorithms can solve classical learning problems, and thus that learning is easier than theorem-proving over the same representations. Such results raised the possibility that in some cases, such theorem-proving under PAC-Semantics may actually be easier than in the classical case. The results of the current work suggest that this is actually so: we quasi-automatize222Automatization is the standard terminology for the problem of deciding whether or not a proof of a formula exists in a given (fragment of a) proof system. Quasi-automatization means that there is a quasipolynomial time algorithm for this problem. the (general, dag-like) resolution proof system under PAC-Semantics (i.e., when given incomplete examples) with restricted distributions, whereas the best known automatization of resolution, first obtained by Clegg et al. [10], achieves a running time of for polynomial-size resolution proofs over variables. Moreover, results by Alekhnovich and Razborov [1] suggest that perhaps resolution is not quasi-automatizable at all.

In this work (in contrast to [15]), we consider problems in which the background distribution (and hence learning problem) is restricted. The main example we consider is a generalization of learning parities under the uniform distribution: we note that one way of viewing the (standard) problem of learning parities over bits is that there is an unknown parity constraint over the bits; here, we consider an “unsupervised” setting where there is again no distinguished label bit, and there may be several such constraints. We refer to such distributions as affine distributions. Such problems may easily be impossible under partial information: in the (weakly) Restricted Focus of Attention (wRFA) model, Ben-David and Dichterman [5] showed that learning a large parity becomes impossible if an insufficient number of bits can be simultaneously viewed. We remark that such parity formulas are also natural and interesting from the standpoint of the reasoning problem, since they presented classic examples of theorems that were hard for resolution [24].

Our theorems are actually established for a more general class of distributions that merely feature a “correlation gap”: when we condition on a conjunction, the other bits are either highly biased or feature bounded bias (with the bounds on these biases parameterizing the “gap” that gives these distributions their name). Such distributions are reasonably natural. For example, some simple topic models, e.g., along the lines of the pure document model underlying the analysis of Latent Semantic Indexing given by Papadimitriou et al. [22], feature such correlation gaps. Roughly, the presence or absence of “primary terms” in a document determine the relative prevalence or absence of other primary terms by determining the latent topic variable, and other words do not have such an effect.

### 1.1 A closer look at our results

To get a sense of how automatization under PAC-Semantics could possibly be easier than the usual, “worst-case” automatization, we first consider the case of automatizing over the uniform distribution in Section 3.1. We observe that clauses of at least logarithmic width have a satisfied literal – i.e., are witnessed satisfied – in our partial examples with high probability. Using the techniques of our previous work [15], it therefore suffices to search for proofs of small width for each of the examples, which is known to be automatizable.

We then turn to the learning context, in which there are parity constraints on the distribution that could be useful as premises in a small proof of the query. Our algorithm distinguishes when an input CNF can be refuted by a small resolution proof on the basis of some learnable formulas over the unknown distribution (e.g., some small parity constraints) from when the input CNF is satisfied with moderate probability, that is, distinguishing when its negation (a DNF) can be proved by a small resolution proof, from when its negation is falsified with moderately large probability:

###### Theorem 1 (Main theorem, cf. Theorem 19)

There is an algorithm that, given an input CNF , a bound , , and access to partial examples from an affine distribution in which indices are hidden independently with probability in each example, and given that either

• is satisfied by with probability at least or

• there is a resolution refutation of size of for CNFs and such that

• consists of -valid clauses and

• is witnessed to evaluate to true with probability on the partial examples

decides which case holds with probability , and runs in time .

Our main theorem supports the learning of two kinds of hypotheses, denoted and in the theorem statement. The conditions on correspond essentially to standard, “realizable” learning. Although in contrast to the usual learning set-up we allow some potentially noticeable counterexamples to the target hypothesis, these counterexamples are required to be very rare relative to the accuracy parameter given to the algorithm. Since the parity constraints in an affine distribution hold perfectly, the learnability of such a establishes the learnability of the parity constraints of an affine distribution in the context of a resolution proof, which after all can only operate on clauses. We remark that our previous work [15] could not exploit such hypotheses in general, due to the fact that we did not restrict our attention to the independent masking processes.

By contrast, the conditions required in our main theorem on the second hypothesis allow it to have an arbitrary error tolerance given as an input parameter to the algorithm. In this case, our algorithm is essentially detecting when there exist any CNF formulas that suffice to complete a proof of the query, that are -close to being valid. For example, in the simple topic model distributions, clauses encoding a rule that the presence of one primary term implies that at least one out of of the other primary terms for the same topic also appears is a -valid clause that might be useful in reasoning about such documents. The permission of an arbitrary flaw in the formulas here may be seen as analogous to the agnostic learning setting in traditional concept learning [16] and we feel that it is crucial to the utility of such algorithms in artificial intelligence, for example. The conditions we impose on this second hypothesis follow the conceptual developments from our previous work [15] and this groundwork merely serves as a starting point in the present setting. Still, for the reader’s sake, we will briefly review why we deem the restrictions imposed on this hypothesis to be reasonable and how they lead to a tractable learning problem. Agnostic learning is notoriously hard, and the main reason we avoid this problem is by declining to specify any explicit representation of a relation that is -close to valid,333Surely, this is a step beyond merely specifying another explicit representation that lies outside the target concept class as done in improper learning, and hence we count it as a separate issue. a problem that is conveniently irrelevant to our ultimate goal of answering a query here. The more serious difficulty that impacts the class of hypotheses we ultimately ask to learn is that determining whether or not a formula is satisfied in the context of partial information can be intractable or even impossible, depending on the level of generality; in our previous work we observed that this leads to a serious obstacle since the ability to detect when queries are provable implies that we can detect that the premises are satisfied. As this may not be feasible, it is therefore unreasonable to hope for an algorithm that utilizes any CNF that is -close to being valid as a hypothesis; what we can hope for instead (and is achieved by the previous work) is that we can utilize hypotheses that could be efficiently verified to be -close to being valid if only we had identified such a hypothesis. (Note that even in the usual complete information agnostic learning setting, as hard as it is, this condition is always satisfied.) The solution proposed in the previous work was to use a simple definition of witnessed evaluation that suffices to enable efficient certification that a formula is satisfied in a partial example. As illustrated in our warm-up, this definition moreover turns out to be strong enough to enable solutions to the combined learning and reasoning problem.

The core of the proof of the main theorem is a structural result for resolution proofs over correlation gap distributions: every formula with a polynomial-size resolution refutation (possibly using clauses that hold over the background distribution as additional premises) simplifies to a logarithmic-width refutation using logarithmic-width clauses that are learnable from the background distribution. Essentially, we only need to focus on clauses in the proof that cannot be learned, and the key observation is that the only way a clause of sufficient (logarithmic) width could not be witnessable (and so unlearnable) is if many of its literals are biased to be false; this may happen if the literals of the clause appear in some even-parity constraint, for example. There is then some other (small) learnable clause that can be used to eliminate each such literal, and so by a sequence of resolution steps, we can reduce the wide clauses of the proof down to logarithmic width. We similarly establish the learnability of the nearly-realizable by showing that we only need to consider logarithmic-width subclauses for the proof. These techniques actually also allow us to determine whether or not a given CNF query is highly valid directly: In Section 4, we exhibit an algorithm that distinguishes when an input CNF is satisfied with high probability over the background distribution from when it is falsified with some moderate probability, without reference to resolution proofs. This latter result may be understood as lifting the restrictions on the CNF query in Khardon and Roth’s work [18], given some new restrictions on the distribution and obscuring of information.444By contrast, our main theorem essentially certifies the validity of DNF queries, which seems to require the existence of simple proofs in the context of incomplete information—a further discussion of this point appears in Section 2, in the context of the formal set-up. These techniques for learning to reason about CNFs from partial information are new to this work, and seem to rely on both the correlation gap and the relative simplicity of the process that hides information.

## 2 Our setting: background and approach

#### PAC-Semantics.

PAC-Semantics (for a logic) was introduced by Valiant [26] to capture the kind of validity possessed by statements produced by PAC-learning algorithms; for example, if one uses a PAC-learning algorithm to learn a conjunction to match the “labels” given by another variable from examples from a distribution , then the formula is (probably) approximately valid in the following sense:

###### Definition 2 ((1−ϵ)-valid)

Given a distribution over , we say that a Boolean formula is -valid if . If , we say that the formula is perfectly valid.

Of course, the definition makes sense for other kinds of formulas (not just equivalences). It is not hard to show (by a union bound) that any classical logical inference can be applied to formulas possessing this weaker kind of validity, as long as we allow for further loss in the approximation.555It also is not hard to show that as long as the distributions are arbitrary, the union bound is tight here [15].

###### Proposition 3 (Classical reasoning in PAC-Semantics [15])

Let be formulas such that each is -valid under a common distribution for some . Suppose that (in the classical sense). Then is -valid under for .

The main problem that we wish to address, introduced in prior work [15], is that of deciding the degree of validity of a given query formula using examples from an unknown distribution. In the present work, we will see how to obtain stronger results (than [15] in particular) for a moderately restricted class of distributions, in which the correlation between variables is either strong or weak (and not of moderate strength):

###### Definition 4 (Correlation gap)

We will say that a distribution over has a width- correlation gap if for any conjunction of literals satisfied with nonzero probability and any variable , either and or else for some , . In the former case, we say that is -balanced for , and in the latter we say that, respectively, (for ) or (for ) is -implied by , denoted as (or , respectively).

A simple example of a large collection of distributions that have a very strong correlation gap of up to width and serve as a motivating example, is the class of affine distributions:

###### Definition 5 (Affine distribution)

For any solvable linear system over , the distribution over that is uniform over solutions to the linear system is an affine distribution.

Affine distributions turn out to have a correlation gap, since intuitively, if a conjunction determines the value of another variable in a linear constraint, that literal is easily seen to be -implied, and otherwise the literal turns out to be uniformly distributed (and thus, -balanced).

###### Definition 6 (Constraints on clauses)

We say that there is a constraint on a clause in an affine distribution given by the linear system if there is a linear combination of the rows of such that the only nonzero entries are in indices for which the corresponding variable appears in a literal of .

###### Lemma 7

Let be a clause and be an affine distribution such that there are no constraints on . Then the marginal distribution over the variables appearing in is uniform.

Proof:   Let be the linear system defining . Let denote the variables in and denote the variables not appearing in , and let be the submatrix of formed by columns corresponding to variables in , and be the submatrix formed by columns of corresponding to variables not in . Note that by converting the matrix into reduced form, we may verify that if for some there were no solutions to the system , then there must be some linear combination of the rows of yielding the vector. The corresponding linear combination of the rows of the linear system would be a constraint on , which does not exist by hypothesis. Therefore, for every , the system has solutions, and furthermore, since is the same for every assignment , it always has the same number of solutions. Thus, as is uniform over these solutions (and ), the marginal distribution over the variables in is indeed uniform as claimed.

Correlation gap distributions are reasonably natural. For example, the simple “pure document” probabilistic corpus model introduced by Papadimitriou et al. [22] to analyze Latent Semantic Indexing generally features a nontrivial correlation gap:

###### Example 8 (Pure document topic model [22])

The pure document topic model is a probabilistic model of document generation. We suppose that documents are represented by the set of words appearing in them. A document is then generated in a two-stage process in which a (latent) topic variable is first sampled, where each outcome of the topic variable is associated with one member of a family of disjoint sets of primary terms. The set of words associated with this topic is obtained by the union of these primary terms with a set of generic terms (shared across topics); each word has an associated probability in the range for some small constants and . The document itself is then sampled by independently tossing a biased coin for each word in the topic set, including it with its given probability. If the overall probability of the topic words appearing in the document is at most , which holds, for example, if each topic has sufficiently small probability of being chosen (relative to ), then the distribution has a width- -correlation gap.

#### Partial assignments.

Answering queries using (complete) examples drawn from a distribution is trivially accomplished. By contrast, the task is much more difficult when some of the variables’ settings are deleted from the examples, resulting in partial example assignments. In the present work, we will focus on a very simple process for generating partial assignments that chooses whether to delete each entry from the assignment by tossing a biased coin. In learning theory, such a model first appeared in the work of Decatur and Gennaro [11]. Formally:

###### Definition 9 (Partial assignments and masking)

A partial assignment is an element of . We say that a partial assignment is consistent with an assignment if whenever , .

A mask is a function taking assignments to consistent partial assignments. A masking process is a mask-valued random variable. We denote the distribution obtained by applying a masking process to a distribution over assignments by .

The independent masking process with parameter is the following masking process : for every , for every , independently with probability , and otherwise .

We note that a related set-up appears in recent works by Dvir et al. [13], Wigderson and Yehudayoff [27], and Moitra and Saks [21] with the latter result(s) being particularly relevant: in our terms, they essentially show that there is a polynomial time algorithm (in , and , for constant ) that for any distribution on , completely recovers to within an additive at each point in given access to . As a consequence, they obtain (in particular) an algorithm for answering queries that is efficient when has sparse support: we simply recover the distribution (to within an additive error, say) and add up the mass assigned to points where the formula is true to obtain an additive estimate of the degree of validity of the query under .

We will be interested in evaluating queries for distributions that have exponentially large support, such as affine distributions. In order to do so, we will need to impose further restrictions on the formula, namely (looking ahead) that the formula has a small resolution proof from clauses that are either always true or merely witnessed to be true on most examples from :

###### Definition 10 (Witnessing)

We define a clause to be witnessed to evaluate to true on a partial assignment if there is some literal ( or ) in the clause such that ( and) . We say that the clause is witnessed to evaluate to false if for every literal in the clause, ( and) .

A very natural and related notion is using the partial assignment as a restriction to simplify a formula by “plugging in the known variables.” For the cases of interest (CNFs here) this operation can be simply captured by the following definition.

###### Definition 11 (Restricted formula)

Given a partial assignment and a clause , the restriction of under , denoted , is the tautological formula if is witnessed to evaluate to true on , and otherwise it is given by the set of literals in for which . For a CNF , if some clause is witnessed to evaluate to false, then the restriction of under is defined to be the contradictory formula ; otherwise, it is the conjunction of restrictions of clauses not witnessed to evaluate to true.

#### Theorem-proving versus other approaches to answering queries.

When a (small) resolution proof exists, this implies that the query has the form of a DNF formula. This naturally raises the question of whether techniques similar to those used by Wigderson and Yehudayoff [27] (in particular) could be applied to the partial assignments corresponding to the DNF’s terms to evaluate such a DNF directly, without turning to a resolution proof. The difficulty that arises with such an approach is that when many DNF terms correspond to partial assignments that are consistent with one another, determining the weight of their disjunction by such techniques seems to involve an (exponentially large) inclusion-exclusion expression.666Wigderson and Yehudayoff avoid the need for inclusion-exclusion by only considering complete assignments, which of course correspond to disjoint events. While one could try to bound the number of terms in the sum by considering only terms of bounded width using the techniques we develop here (cf. Lemmas 15 and 17 and the approach of Theorem 19), decreasing the error per omitted term by increasing the width bound increases the number of terms at a greater rate, so that a nontrivial bound on the error cannot be obtained. By contrast, the existence of a small a priori bound on the size of a resolution proof of the formula enables us to derive a related bound on the width we need to consider (naturally, this is made precise in the proof of Theorem 19).

Khardon and Roth [18], in their work on “learning to reason,” had a different approach to an essentially similar problem—among others, they considered a problem where one is given a distribution over partial assignments that are consistent with satisfying assignments of some background formula (generally a polynomial-size DNF), and they wish to decide whether the query formula is either entailed by the background formula or whether it is falsified with some given probability (promised that one of these two cases holds). When the query is a -CNF (say for ), they answer the query by constructing a relatively small set of partial assignments and testing whether the query has an unsatisfied clause on any of the partial assignments; this approach again answers queries while avoiding any consideration of theorem-proving. The size of the set (and therefore also the running time) is polynomial in and exponential in , and so it is quasipolynomial for . It turns out that our work can be used to extend their scheme to answer general CNF queries under distributions with strong correlation gaps and independent masking by first reducing the width of the (important) clauses in the query to , as we show in Section 4. The reader should note that this approach only works for testing whether the query is almost perfectly valid.777Although the masking process is simple enough to allow us to compute (e.g.) unbiased estimates of the probability that a given (set of) clause(s) is witnessed unsatisfied, recovering an estimate of the true probability that the formula is unsatisfied seems to again involve a potentially exponentially large inclusion-exclusion calculation.

We stress that these two approaches apply to largely orthogonal classes of formulas: the latter approach answers CNF queries, whereas the former, resolution-based approach answers DNF queries. For both approaches, there is an important asymmetry: in the first approach, it is the existence of a small resolution proof that restricts the class of formulas that can be used as hypotheses and queries, whereas for the second, it is a matter of the asymmetric way the error arises when testing the query on a partial assignment.

#### Resolution.

We now briefly review the resolution proof system, which plays a central role in this work. Resolution is a proof system for establishing the validity of a DNF: it is given by a refutation of its (CNF) negation.

###### Definition 12 (Resolution)

A resolution refutation of a CNF is given by a sequence of clauses such that is the (unsatisfiable) empty clause and for each in the proof, one of the following holds:

1. (Axiom) is a clause of

2. (Weakening) some for is a subclause of

3. (Cut) there exist clauses and (for some variable ) with such that . The literals and are said to be a complementary pair.

Note that resolution is sound: each derived in a step of the proof is a consequence of . Thus, the derivation of the empty clause implies that is unsatisfiable. It is standard to note that weakening is unnecessary in a resolution refutation: if we simulate a resolution proof without using the weakening rule (applying the cut rule to the original clauses instead of weakened ones at each step) we end up with a legal derivation that has a subclause of the original derivation at each step, which must also end with the empty clause. We include weakening as a derivation rule because it is useful in the analysis of resolution. In particular, we will be interested in the result of “plugging in” a partial assignment to each step of a resolution refutation:

###### Definition 13 (Restricted proof)

Given a resolution refutation and partial assignment , the restriction of under , denoted , is the proof obtained by substituting for each clause appearing in .

Using the weakening rule, the following (folklore) observation is easily established.

###### Proposition 14 (Resolution is restriction-closed)

For any CNF with a resolution refutation and any partial assignment , the restriction is a resolution refutation of .

We will see the value of Proposition 14 illustrated simply in Section 3.1. (It will, of course, also play a key role in establishing the main theorem, Theorem 19.)

## 3 Automatizability of resolution

### 3.1 Width-based automatizability for the uniform distribution

We start with the simpler special case of the uniform distribution (which we denote ), which will introduce a useful lemma for the general case, and help develop some intuition. The key observation here is that clauses that are sufficiently wide are almost always witnessed to evaluate to true:

###### Lemma 15

Let be a clause such that for any literal of and any subclause of without , is -balanced for , and suppose has width at least . Then is witnessed to evaluate to true on with probability .

Proof:   We note that drawn from are of the form for drawn from and drawn from independently. Suppose we construct a sequence of literals and subclauses of as we sample and in the following way: put , and fix the entries of in order until we encounter some unmasked entry corresponding to some literal in ; we then put in . Each literal of is thus included in some independently with probability . Now, is witnessed true on precisely when some is set to true in , and if the first literals are set to false in , then since is -balanced for , each is set to true in with probability at least when the first are set to false. Thus, the probability that none of these literals in a clause of width is satisfied by is at most , which for is at most .

Since the uniform distribution is -balanced, Lemma 15 establishes that any wide clauses that appear in a resolution refutation are witnessed to evaluate to true with high probability over , and hence in a proof of size , a union bound establishes that every clause of width is substituted by in the restriction . Thus, the width-based algorithm first studied by Galil [14] that searches for refutations using clauses of width at most (and thus runs in time ) finds a refutation when one exists with probability . The following theorem is then almost immediate:

###### Theorem 16

There is a quasipolynomial time algorithm that, given an input CNF , size bound , , and access to examples from , and given that either

• is satisfied by with probability at least or

• there exists a CNF that is witnessed to evaluate to true with probability at least under such that there is a resolution refutation of size of

decides which case holds with probability .

Proof:   The algorithm takes a sample of partial assignments of size and for each partial assignment uses the width-based algorithm for to check for a width- refutation of ; if more than an fraction of these refutations fail, the algorithm rejects and otherwise it accepts. It is immediate that (for constant ) the algorithm runs in quasipolynomial time in , , , and . We thus turn to considering correctness.

If is valid, then Hoeffding’s inequality shows that with probability , at least an fraction of the assignments drawn from satisfy ; since this satisfying assignment is consistent with any partial assignment in the support of , is satisfiable, and hence by the soundness of resolution, no refutation exists for these partial assignments, and we see that the algorithm rejects.

In the second case, we note first that (again by Hoeffding’s inequality) with probability , every clause of the unknown formula is witnessed to evaluate to true in at least a -fraction of the partial assignments. Let be the size refutation of . It then follows from Lemma 15 (cf. the above discussion) that with probability , for this fraction of the (out of ), is a width- refutation of since the clauses from and clauses of of width greater than all simplify to . Thus, in this case, the algorithm accepts with probability , as needed.

### 3.2 Augmenting the width-based algorithm with learning

While the uniform distribution illustrates how the reasoning problem may become easier in the context of a distribution, in a sense there is no learning problem if the distribution is known to be uniform: a given formula indicates which settings of the variables are “positive” or “negative” examples, and the entire learning question for a given formula merely concerns whether the distribution assigns high weight to the negative examples. We now turn to considering our learning problem.

Distributions with a correlation gap turn out to be easy to work with because we only need to consider narrow clauses: for starters, Lemma 15 guarantees that (sub)clauses of sufficient width ( here) for which every variable is balanced in (for every small subset of the rest) are witnessed with high probability. Naturally, in an affine distribution, we can make a similar claim if all of the literals of a sufficiently wide clause are each involved in some constraint that -implies one of them. More generally:

###### Lemma 17

Let be a clause of width for and be a distribution with a width- correlation gap such that for every subclause of of size there is some further subclause with . Then with probability , there is an unmasked literal of that is true with probability under .

Proof:   Let be the first literals of , and suppose is drawn from and is drawn from . Given (of width ), some for the subclause is -implied by . Thus, with probability at least , either is satisfied in or some other literal of is satisfied in . We let be the other satisfied literal of in the latter case, and let it be in the former case. Now, we construct by removing from and taking the next literal of , and we repeat. Note that since has width at least , we find at least such literals . With probability at least , at least one of these literals is unmasked. The first such unmasked literal is as needed.

The final case to consider is when there is a subclause of width at most (noting ) for which there is a literal such that its negation is implied by the negation of the clause. In this case, as we elaborate on next, there is a small clause that we can learn that can be used to reduce the width of the first clause. If the clause is sufficiently wide and yet not witnessed to evaluate to true, it must be because there are many such literals. Using these learned clauses to eliminate these literals from the given clause enables us to find an equivalent clause that satisfies our width bound.

#### Analysis of Algorithm 1.

We first note that the learned clauses are highly valid under and contain clauses corresponding to whenever for :

###### Lemma 18

Suppose is a distribution with a width- correlation gap. Let be the conjunction of all clauses of width at most that are witnessed to evaluate to false in at most a -fraction of a sample of partial assignments from (for as given in Algorithm 1). Then with probability at least , is -valid and contains all -valid clauses of width at most , including specifically such that .

Proof:   Since each clause in has width at most , every literal in each such clause is simultaneously not set to by with probability at least ; if a clause is not valid, then it is witnessed to evaluate to false on with probability at least . Therefore, Hoeffding’s inequality implies that it is only not witnessed to evaluate to false sufficiently often to eliminate it from with probability at most . Since there are fewer than clauses of width at most , by a union bound over these clauses, we find that with probability at least , these clauses are all -valid. Again, by a union bound over the clauses, this means that their conjunction, , is -valid.

Now, when a clause of width at most is -valid (including such that ), Hoeffding’s inequality similarly guarantees that will be witnessed to evaluate to false on in a -fraction of partial assignments with probability at most . Therefore, all of these clauses appear in except with probability , and a final union bound gives that both conditions hold with probability .

The analysis of the algorithm is now a generalization of the analysis of the width-based algorithm discussed in Section 3.1. The main twist is that we may need to modify the resolution derivation by introducing learned clauses in order to obtain a low-width derivation.

###### Theorem 19

Given an input CNF , a bound , , and access to examples from for a distribution that has a width- correlation gap, (where ) and given that either

• is satisfied by with probability at least or

• there exist CNFs and such that

• consists of -valid clauses and

• is witnessed to evaluate to true with probability under

and there is a resolution refutation of size of

Algorithm 1 decides which case holds with probability , and runs in time .

Proof:   We note that the running time bound is essentially immediate from the description of the algorithm, as the width-based algorithm, for width , runs in time . We therefore turn to considering the correctness of the algorithm. The first case is similarly simple: by Proposition 3, if is -valid under , then since the learned CNF is -valid with probability at least by Lemma 18, then is -valid—meaning that with probability at least , a masked example is drawn from for which is satisfiable. It then follows by Hoeffding’s inequality that with probability greater than , for at least an -fraction of the actual examples, will be satisfiable, and therefore by the soundness of resolution, at least an -fraction of the iterations must fail to find a refutation (of the consequence ), leading the algorithm to reject as needed with probability at least .

It only remains to establish that the algorithm accepts with probability at least when there exists a size- refutation from some almost perfectly valid and some that is witnessed to evaluate to true with probability under . We assume (WLOG) that every clause of appears in the refutation. We now begin by writing the size- proof as two subsets of clauses, and where consists of clauses which either

1. contain a subclause of width (for, recall, ) for which every further subclause of has balanced for or

2. contain a subclause of width for which every (sub-)subclause of width has a further subclause for and satisfying ,

and consists of the rest of the clauses.

We first note that the clauses in (and in ) are simultaneously witnessed in most of the partial examples: we first note that by Hoeffding’s inequality, with probability at least , each clause in is witnessed to evaluate to true in at least partial examples; likewise, for the clauses of which have subclauses for which every width subclause has an implied literal, by a union bound over all literals, the probability that any of them is both a -implied literal indicated by Lemma 17 and false in any example is at most . (Note that with probability at least , every such clause has some such unmasked literal in each example.) Therefore, by Hoeffding’s inequality, these clauses are all witnessed to evaluate to true in at least of the examples with probability at least . Similarly, by Lemma 15, the clauses with -balanced subclauses of width at least are witnessed to evaluate to true in each partial example with probability at least . Hence by a union bound over the clauses and examples, every clause in (and ) is ultimately (simultaneously) witnessed to evaluate to true in at least of the examples considered in the main loop of the algorithm with probability .

We now note that for every clause in , every subclause of width greater than must have a subclause of width with some literal such that for a further subclause , —every such must have a subclause with a literal that is not balanced (or else the clause would be in by the first condition) and the further subclause cannot have for every width subclause (or else it would be in by the second condition). Therefore, by Lemma 18, the clause is added to by the algorithm. Moreover, since the clauses of are -valid, and would reach width after eliminating fewer than such literals, for the clauses of in there is a -CNF consisting of subclauses of every clause of appearing in that (by Proposition 3) are -valid, and hence also added to by Lemma 18.

Similarly, for every input clause (from ) in , we can always derive a subclause of width at most using these clauses—as long as our subclause of has width greater than , some subclause of width must contain a subclause for which a clause is in , which we can use to reduce its size further, so the clauses of corresponding to clauses of in have width at most .

To conclude, we observe that since all of the clauses of (and ) are witnessed to evaluate to true in at least of the partial examples considered by the algorithm in the main loop with probability , for each such partial example , by Proposition 14 is a resolution refutation of (restrictions of) the input and clauses of , consisting only of (restrictions of) clauses from . Now, we know that there is a -CNF consisting of subclauses of every clause of that is included in with probability at least , and for each input clause in there is a subclause of width less than that can be derived by the algorithm. Therefore, by induction on the steps of , we know that by carrying out the steps of the proof with the derived subclauses, we can always derive some subclause of each th clause in , which is in , and hence has some further subclause of width at most which can be derived in width : If the next step of is obtained by applying the cut rule to and , we can apply the cut rule to the derived width- subclauses and to obtain a subclause of the next step of width at most . Since this next step is in , this clause can be further reduced to width as noted above, completing the induction step. Since, finally, the only subclause of the empty clause (derived in the final step) is the empty clause itself, this ultimately yields a width- refutation of . Since the algorithm therefore derives the empty clause in at least out of the examples with probability at least , it therefore accepts with probability at least in this case, as needed.

#### Remarks on the algorithm.

One could divide the derivation steps of our algorithm into “steps of ” and “intermediate derivations,” where the intermediate derivations only involve the width CNF , and serve to find a width- subclause of the next step of . Based on this observation, we could have obtained a slightly more complicated algorithm that only uses dynamic programming over width- clauses, that checks after each derivation to see if the clause can be reduced to width by applications of the cut rule to clauses from that strictly reduce the size of the intermediate clause (and discarding the clause if such a reduction is not possible). One might prefer it because its complexity is rather than , as acheived by our algorithm.

In contrast to the simple algorithm of Section 3.1 (and more generally, the generic algorithms from prior work [15]) Theorem 19 also permits the proof to utilize an arbitrary highly valid CNF, in addition to one that is witnessed with probability . This is accomplished partially due to the simplicity of the distribution – we know that the clauses of any such CNF are either witnessed or are essentially equivalent to a short clause – and partially due to the simplicity of the masking process, which allows us to identify all such short clauses.

Clearly, our algorithm avoids the sources of intractability indicated by the various hardness results for the tasks in the two stages. It is easiest to see how the problem of Tseitin’s hard tautologies [24] are avoided in the problem we consider: we only certify the validity of a DNF when there exists a small resolution proof for us to discover, and the proof complexity results merely establish that no such small resolution proof exists. So, such an example is “out of bounds” for our algorithm. The way we circumvent difficulties in learning is perhaps more interesting. Recall that Valiant’s original work in this area [26] concerned algorithms that learned a collection of formulas and subsequently reasoned about what these learned formulas entailed. Following our previous work [15], our objective in learning is ultimately to determine whether or not there exist formulas that would suffice to complete a proof of a query. Here, we avoid a first common source of intractability by not aiming to find CNF representations of our hypotheses—essentially as done by “improper” learning algorithms. Arguably, this is the first step in circumventing the hardness of learning parities identified by Michael [20] (but note also that it is still not clear how to test whether a parity constraint of moderate size is satisfied in our affine distributions). Ultimately, the hardness of learning parities in the wRFA model [5] is circumvented by our structural result about the resolution proofs under our distributions: since we can show it suffices to only consider logarithmic-width clauses, we effectively only need to learn logarithmic-size parities, which is feasible in our partial-information model.

## 4 Deciding near-perfect validity of CNF queries

We now briefly note that our techniques also allow us to strengthen the results of Khardon and Roth [18] on learning to reason for the special case of distributions with a width- correlation gap (such as affine distributions) for a quasipolynomial function and independent masking: we show how to decide whether a (general) CNF query is almost perfectly valid or significantly invalid. In particular, this avoids the requirement that the queries have small proofs, but the reader should notice that it applies only to establishing the validity of CNF (not DNF) queries, only for “nearly perfect” implications, and only for distinguishing “nearly perfect” queries from defective ones.

###### Theorem 20

Given an input CNF , , and access to examples from for a distribution with a width- correlation gap, and given that either is at least -valid under or is at most -valid, Algorithm 2 decides which case holds with probability in quasipolynomial time.

Proof:   We begin by noting that the running time of the algorithm is indeed quasipolynomial in , , , , and . Now, towards correctness, we observe that by Lemma 18, the formula consisting of the conjunction of learned clauses is valid with probability . We further note that since the clauses of have a simple resolution derivation from clauses of and .

Let us consider the first case, when is at least -valid. Here, since , is at least -valid by Proposition 3. Therefore, by Hoeffding’s inequality, in a sample of size from , it is satisfied in at least of the (underlying, full) examples with probability at least . Since the partial examples are consistent with these full examples, this means in particular that no clause of can be witnessed to evaluate to false in more than of the partial examples, so we see that the algorithm accepts with probability at least in this case, as needed.

Now, turning to the second case, we split the clauses of into two formulas, and as follows: let consist of those clauses having either literals that are balanced given any subset of the rest (of the ) under or a subclause of width such that every width subclause has a literal such that for a further subclause , , and let consist of the rest. Clauses in are simultaneously true in most of the partial examples: essentially by Lemma 15 with and the definition of implied literals, each clause in is true in each underlying example (though not necessarily witnessed) with probability at least . Hence by a union bound over the clauses, every clause in is simultaneously true with probability . By contrast, though, since is at most -valid, we see that with probability at least , some clause in must evaluate to false.

Let be the clauses of corresponding to clauses from . For every subclause of a clause of of width greater than , some further subclause of width at most must have , as if every literal is balanced in any small subclause, the orignal clause would have been in by the first condition, and if some width subclause had in some further subclause of every width subclause of , it would be in by the second condition. Much as in the proof of Theorem 19, by an analogue of Lemma 18, since this is -valid, it is in with probability , and hence there is some clause of that can be used to eliminate one of the literals on this width- subclause. Therefore, we find that the clauses of must have width at most with probability .

Now, since reveals each literal with probability independently, given that the clauses of have width at most , each clause in is completely revealed with probability at least , and hence some clause in is witnessed to evaluate to false in the partial examples sampled from with probability at least . Therefore, by Hoeffding’s inequality, some clause is witnessed to evaluate to false in at least a -fraction of the partial examples with probability at least given that the clauses have width at most , which we argued occurs with probability at least . Therefore, the algorithm rejects with probability at least in this case, as needed.

## 5 Directions for future research

Several problems present themselves as natural directions for future work. The most pressing of these is, can the restriction to distributions with a correlation gap be lifted? That is, how can we efficiently reason about “medium-strength” correlations? Although the ultimate objective of such work would be to strengthen these results to the distribution-free PAC setting, any work that handled a class of distributions that exhibited such correlations would also be of interest. A similar direction would be to obtain results for a more general class of masking processes; although it seems that our results generalize to masking distributions that simultaneously reveal any width- set of literals with non-negligible probability (for ) such as -wise independent distributions (Wigderson and Yehudayoff [27] make a similar observation about their algorithm), it would be desirable to find other, perhaps weaker properties that would also permit relatively efficient algorithms.

Of course, the results of this work beg the question so far as the classical (quasi)automatizability of resolution is concerned. Although there are families of counterexamples [8, 3] showing that a purely width (and/or treelike) based approach to finding small resolution proofs such as pursued by Ben-Sasson and Wigderson [6] cannot beat the current best-known bound of , it does not rule out other approaches. Since our algorithm and analysis essentially establish that every resolution proof over distributions with a correlation gap has a low-width approximate version using the learned clauses, it seems significant for our algorithm that the learned formula may not have a small-width derivation. Unfortunately, it is not clear how one might hope to exploit this in the absence of a distribution. Still, if any algorithm could find resolution derivations in quasipolynomial time, then using the results of our previous work [15], this would also immediately resolve both of the questions suggested in the previous paragraph.

The other natural direction in which one might hope to strengthen our results involves extending them to stronger proof systems than resolution, such as cutting planes or polynomial calculus (with resolution, aka PCR). We already observed in previous work [15] that there are natural fragments of these proof systems (most already well studied) for which the combined learning and reasoning problem is tractable. The question would be whether, as with width-restricted resolution, we could use these algorithms as a starting point to obtain algorithms for the unrestricted proof system in the context of reasoning about a distribution.

There are, of course, limits to what we can hope for: the strong non-automatizability results for systems such as bounded-depth Frege based on cryptographic hardness assumptions [7, 9, 19] are established by formulas equipped with a natural background distribution (over the variables encoding parameters generated by the Diffie-Hellman key exchange protocol [12]), in which moreover the pattern of masking is independent of the underlying example. (The bits are not masked with the same probability as in , though.) Although these results were not stated in the context of PAC-Semantics and the distributions were ignored, they should carry over to an appropriate generalization of the setting we considered here, still within the framework of the prior work [15]. One might reasonably ask if some analogous negative result can be proved for based on leakage-resilient cryptography (as leaks a constant fraction of the secret parameters).

## Acknowledgements

The author would like to thank Paul Beame, Eli Ben-Sasson, and Leslie Valiant for comments and conversations that helped shape this work.

## References

• [1] Michael Alekhnovich and Alexander A. Razborov. Resolution is not automatizable unless W[P] is tractable. SIAM J. Comput., 38(4):1347–1363, 2008.
• [2] Misha Alekhnovich, Mark Braverman, Vitaly Feldman, Adam R. Klivans, and Toniann Pitassi. The complexity of properly learning simple concept classes. JCSS, 74(1):16–34, 2008.
• [3] Albert Atserias and María Luisa Bonet. On the automatizability of resolution and related propositional proof systems. Inf. Comp., 189:182–201, 2004.
• [4] Paul Beame, Henry Kautz, and Ashish Sabharwal. Towards understanding and harnessing the potential of clause learning. JAIR, 22:319–351, 2004.
• [5] Shai Ben-David and Eli Dichterman. Learning with restricted focus of attention. JCSS, 56(3):277–298, 1998.
• [6] Eli Ben-Sasson and Avi Wigderson. Short proofs are narrow – resolution made simple. J. ACM, 48(2):149–169, 2001.
• [7] Maria Luisa Bonet, Carlos Domingo, Ricard Gavaldá, Alexis Maciel, and Toniann Pitassi. Non-automatizability of bounded-depth frege proofs. Comput. Complex., 13:47–68, 2004.
• [8] María Luisa Bonet and Nicola Galesi. Optimality of size-width tradeoffs for resolution. Computational Complexity, 10(4):261–276, 2001.
• [9] Maria Luisa Bonet, Toniann Pitassi, and Ran Raz. On interpolation and automization for Frege proof systems. SIAM J. Comput., 29(6):1939–1967, 2000.
• [10] Matthew Clegg, Jeff Edmonds, and Russell Impagliazzo. Using the Gröbner basis algorithm to find proofs of unsatisfiability. In Proc. 28th STOC, pages 174–183, 1996.
• [11] Scott E. Decatur and Rosario Gennaro. On learning from noisy and incomplete examples. In Proc. 8th COLT, pages 353–360, 1995.
• [12] Whitfield Diffie and Martin Hellman. New directions in cryptography. IEEE Trans. Inform. Theory, 22:423–439, 1976.
• [13] Zeev Dvir, Anup Rao, Avi Wigderson, and Amir Yehudayoff. Restriction access. In Proc. 3rd ITCS, 2012.
• [14] Zvi Galil. On resolution with clauses of bounded size. SIAM J. Comput., 6:444–459, 1977.
• [15] Brendan Juba. Implicit learning of common sense for reasoning. To appear in IJCAI’13, 2013. Preliminary version: Learning implicitly in reasoning in PAC-Semantics, arXiv:1209.0056v1 [cs.AI].
• [16] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Towards efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.
• [17] Roni Khardon and Dan Roth. Learning to reason. J. ACM, 44(5):697–725, 1997.
• [18] Roni Khardon and Dan Roth. Learning to reason with a restricted view. Machine Learning, 35:95–116, 1999.
• [19] Jan Krajíček and Pavel Pudlák. Some consequences of cryptographical conjectures for and EF. In D. Leivant, editor, Logic and Computational Complexity, volume 960 of LNCS, pages 210–220. Springer, Berlin, 1995.
• [20] Loizos Michael. Partial observability and learnability. Artificial Intelligence, 174(11):639–669, 2010.
• [21] Ankur Moitra and Michael Saks. A polynomial time algorithm for lossy population recovery. arXiv:1302.1515, 2013.
• [22] Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. JCSS, 61(2):217–235, 2000.
• [23] Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1–2):273–302, 1996.
• [24] G. S. Tseitin. On the complexity of derivation in propositional calculus. In A. O. Slisenko, editor, Studies in constructive mathematics and mathematical logic, part 2, pages 115–125. Consultants Bureau, New York, 1970.
• [25] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 18(11):1134–1142, 1984.
• [26] Leslie G. Valiant. Robust logics. Artificial Intelligence, 117:231–253, 2000.
• [27] Avi Wigderson and Amir Yehudayoff. Population recovery and partial identification. In Proc. 53rd FOCS, 2012.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters