# Optimal Bounds on Approximation of Submodular and XOS Functions by Juntas

## Abstract

We investigate the approximability of several classes of real-valued functions by functions of a small number of variables (juntas). Our main results are tight bounds on the number of variables required to approximate a function within -error over the uniform distribution:

• If is submodular, then it is -close to a function of variables. This is an exponential improvement over previously known results [FKV13]. We note that variables are necessary even for linear functions.

• If is fractionally subadditive (XOS) it is -close to a function of variables. This result holds for all functions with low total -influence and is a real-valued generalization of Friedgut’s theorem for boolean functions. We show that variables are necessary even for XOS functions.

As applications of these results, we provide learning algorithms over the uniform distribution. For XOS functions, we give a PAC learning algorithm that runs in time . For submodular functions we give an algorithm in the more demanding PMAC learning model [BH12] which requires a multiplicative factor approximation with probability at least over the target distribution. Our uniform distribution algorithm runs in time . This is the first algorithm in the PMAC model that can achieve a constant approximation factor arbitrarily close to 1 for all submodular functions (even over the uniform distribution). It relies crucially on our bounds for approximation by juntas. As follows from the lower bounds in [FKV13] both of these algorithms are close to optimal. We also give applications for proper learning, testing and agnostic learning of these classes.

## 1 Introduction

In this paper, we study the structure and learnability of several classes of real-valued functions over the uniform distribution on the Boolean hypercube . The primary class of functions that we consider is the class of submodular functions. Submodularity, a discrete analog of convexity, has played an essential role in combinatorial optimization [Edm70, Lov83, Que95, Fra97, FFI01]. Recently, interest in submodular functions has been revived by new applications in algorithmic game theory as well as machine learning. In machine learning, several applications [GKS05, KGGK06, KSG08] have relied on the fact that the information provided by a collection of sensors is a submodular function. In algorithmic game theory, submodular functions have found application as valuation functions with the property of diminishing returns [BLN06, DS06, Von08]. Along with submodular functions, other related classes have been studied in the algorithmic game theory context: coverage functions, gross substitutes, fractionally subadditive (XOS) functions, etc. It turns out that these classes are all contained in a broader class, that of self-bounding functions, introduced in the context of concentration of measure inequalities [BLM00]. We refer the reader to Section 2 for definitions and relationships of these classes.

Our focus in this paper is on structural properties of these classes of functions, specifically on their approximability by juntas (functions of a small number of variables) over the uniform distribution on . Approximation of various function classes by juntas is one of the fundamental topics in Boolean function analysis [NS92, Fri98, Bou02, FKN02] with a growing number of applications in learning theory, computational complexity and algorithms [DS05, CKK06, KR06, OS07, KR08, GMR12, FKV13]. A classical result in this area is Friedgut’s theorem [Fri98] which states that every boolean function is -close to a function of variables, where is the total influence of (see Sec. 4.1 for the formal definition). Such a result is not known for general real-valued functions, and in fact one natural generalization of Friedgut’s theorem is known not to hold [OS07]. However, it was recently shown [FKV13] that every submodular function with range is close in -norm to a -junta. Stronger results are known in the special case when a submodular function only takes different values (for some small ). For this case Blais et al. prove existence of a junta of size [BOSY13] and Feldman et al. give a bound [FKV13].

As in [FKV13], our interest in approximation by juntas is motivated by applications to learning of submodular and XOS functions. The question of learning submodular functions from random examples was first formally considered by Balcan and Harvey [BH12], who motivate it by learning of valuation functions. Reconstruction of submodular functions up to some multiplicative factor from value queries (which allow the learner to ask for the value of the function at any point) was also considered by Goemans et al. [GHIM09]. These works and widespread applications of submodular functions have recently led to significant attention to several additional variants of the problem of learning and testing submodular functions as well as their structural properties [GHRU11, SV11, CKKL12, BDF12, BCIW12, RY13, FKV13, BOSY13]. We survey related work in more detail in Sections 1.1 and 1.2.

### 1.1 Our Results

Our work addresses the following two questions: (i) what is the optimal size of junta that -approximates a submodular function, and in particular whether the known bounds are optimal; (ii) which more general classes of real-valued functions can be approximated by juntas, and in particular whether XOS functions have such approximations.

In short, we provide the following answers: (i) For submodular functions with range , the optimal -approximating junta has size . This is an exponential improvement over the bounds in [FKV13, BOSY13] which shows that submodular functions behave almost as linear functions (which are submodular) and are simpler than XOS functions which require a -junta to approximate. This result is proved using new techniques. (ii) All functions with range and constant total -influence can be approximated in -norm by a -junta. We show that this captures submodular functions, XOS and even self-bounding functions. This result is a real-valued generalization of Friedgut’s theorem and is proved using the same technique.

We now describe these structural results formally and then describe new learning and testing algorithms that rely on them.

#### Structural results

Our main structural result is an approximation of submodular functions by juntas.

###### Theorem 1.1.

For any and any submodular function , there exists a submodular function depending only on a subset of variables , , such that .

We also show that this result extends to arbitrary product distributions, with a dependence on the bias of the distribution (see Appendix A). In the special case of submodular functions that take values in , our result can be simplified to give a junta of size ( being the disagreement probability). This is an exponential improvement over bounds in both [FKV13] and [BOSY13] (see Corollary 3.8 for a formal statement).

Proof technique. Our proof is based on a new procedure that selects variables to be included in the approximating junta for a submodular function . We view the hypercube as subsets of and refer to as the marginal value of variable on set . Iteratively, we add a variable if its marginal value is large enough with probability at least taken over sparse random subsets of the variables that are already chosen. One of the key pieces of the proof is the use of a “boosting lemma” on down-monotone events of Goemans and Vondrák [GV06]. We use it to show that our criterion for selection of the variables implies that with very high probability over a random and uniform choice of a subset of the selected variables, the marginal value of each of the variables that are excluded is small. The probability of having small marginal value is high enough to apply a union bound over all excluded variables. Bounded marginal values are equivalent to the function being Lipschitz in all the excluded variables which allows us to apply concentration of Lipschitz submodular functions to replace the functions of excluded variables by constants. Concentration bounds for submodular functions were first given by Boucheron et al. [BLM00] and are also a crucial component of some of the prior works in this area [BH12, GHRU11, FKV13].

One application of this procedure allows us to reduce the number of variables from to . This process can be repeated until the number of variables becomes .

Using a more involved argument based on the same ideas we show that monotone submodular functions can with high probability be multiplicatively approximated by a junta. Formally, is a multiplicative -approximation to over a distribution , if . In the PMAC learning model, introduced by Balcan and Harvey [BH12] a learner has to output a hypothesis that multiplicatively -approximates the unknown function. It is a relaxation of the worst case multiplicative approximation used in optimization but is more demanding than the /-approximation that is the main focus of our work. We prove the following:

###### Theorem 1.2.

For every monotone submodular function and every , there is a monotone submodular function depending only on a subset of variables such that is a multiplicative -approximation of over the uniform distribution.

We then show that broader classes of functions such as XOS and self-bounding can also be approximated by juntas, although of an exponentially larger size. We denote by the total -influence of and by the total -influence of (see Sec. 4.1 for definitions). We prove the result via the following generalization of the well-known Friedgut’s theorem for boolean functions.

###### Theorem 1.3.

Let be any function and . There exists a function depending only on a subset of variables , such that . For a submodular, XOS or self-bounding , , giving .

Friedgut’s theorem gives approximation by a junta of size for a boolean . For a boolean function, the total influence (also referred to as average sensitivity) is equal to both and (up to a fixed constant factor). Previously it was observed that Friedgut’s theorem is not true if is used in place of in the statement [OS07]. However, we show that with an additional factor which is just polynomial in one can obtain a generalization. O’Donnell and Servedio [OS07] generalized Friedgut’s theorem to bounded discretized real-valued functions. They prove a bound of , where is the discretization step. This special case is easily implied by our bound. Technically, our proof is a simple refinement of the proof of Friedgut’s theorem.

The second component of this result is a simple proof that self-bounding functions (and hence submodular and XOS) have constant total -influence. An immediate implication of this fact alone is that self-bounding functions can be approximated by functions of Fourier degree . For the special case of submodular functions this was proved by Cheraghchi et al. also using Fourier analysis, namely, by bounding the noise stability of submodular functions [CKKL12]. Our more general proof is substantially simpler.

We show that this result is almost tight, in the sense that even for XOS functions variables are necessary for an -approximation in (see Thm. 5.2). Thus we obtain an almost complete picture, in terms of how many variables are needed to achieve an -approximation depending on the target function — see Figure 1.

#### Applications

We provide several applications of our structural results to learning and testing. These applications are based on new algorithms as well as standard approaches to learning over the uniform distribution.

For submodular functions our main application is a PMAC learning algorithm over the uniform distribution.

###### Theorem 1.4.

There exists an algorithm that given and access to random and uniform examples of a submodular function , with probability at least , outputs a function which is a multiplicative -approximation to (over the uniform distribution). Further, runs in time and uses examples.

We remark that this algorithm works even for non-monotone submodular functions and does not in fact rely on our multiplicative-approximation junta result (Theorem 1.2, which works only for monotone submodular functions). Instead, we bootstrap the -approximation result (Theorem 1.1) as follows. Theorem 1.1 guarantees an -approximating junta of size . The main challenge here is that the criterion for including variables used in the proof of Theorem 1.1 cannot be (efficiently) evaluated using random examples alone. Instead we give a general algorithm to find a larger approximating junta whenever an approximating junta exists. This algorithm relies only on submodularity of the function and in our case finds a junta of size . From there one can easily use brute force to find a -junta in time .

We show that using the function returned by this building block we can partition the domain into subcubes such that on a constant fraction of those subcubes gives a multiplicative approximation. We then apply the building block recursively for levels.

In addition, the algorithm for finding close-to-optimal -approximating junta allows us to learn properly (by outputting a submodular function) in time . Using a standard transformation we can also test whether the input function is submodular or -far (in ) from submodular, in time and using just random examples. (Using earlier results, this would have been possible only in time doubly-exponential in .) We give the details of these results in Section 6.

For XOS functions, we give a PAC learning algorithm with error using the junta and low Fourier degree approximation for self-bounding functions (Theorem 1.3).

###### Theorem 1.5.

There exists an algorithm that given and access to random uniform examples of an XOS function , with probability at least , outputs a function , such that . Further, runs in time and uses random examples.

In this case the algorithm is fairly standard: we use the fact that XOS functions are monotone and hence their influential variables can be detected from random examples (as for example in [Ser04]). Given the influential variables we can exploit the low Fourier degree approximation to find a hypothesis using regression over the low degree parities (as done in [FKV13]).

This algorithm naturally extends to any monotone real-valued function of low total -influence, of which XOS functions are a special case. Using the algorithm in Theorem 1.5 we also obtain a PMAC-learning algorithm for XOS functions using the same approach as we used for submodular functions. However the dependence of the running time and sample complexity on and is doubly-exponential in this case (see Cor. 6.14 for details). To our knowledge, this is the first PMAC learning algorithm for XOS functions that can achieve constant approximation factor in polynomial time for all XOS functions.

Organization. We present a detailed discussion of the classes of functions that we consider and technical preliminaries in Section 2. The proof of our main structural result (Thm. 1.1) is presented in Section 3.1. Its extension to multiplicative approximation of monotone submodular functions (Thm. 1.2) is given in Section 3.2. An extension to the case of general product distributions is presented in Appendix A. In Section 4 we give the proof of real-valued generalization of Friedgut’s theorem (Thm. 1.3). Section 5 gives examples of functions that prove tightness of our bounds for submodular and XOS functions. The details of our algorithmic applications to PAC and PMAC learning are in Section 6. We state several implications of our structural results to agnostic learning and testing in Section 7.

### 1.2 Related Work

Reconstruction of submodular functions up to some multiplicative factor (on every point) from value queries was first considered by Goemans et al. [GHIM09]. They show a polynomial-time algorithm for reconstructing monotone submodular functions with -factor approximation and prove a nearly matching lower-bound. This was extended to the class of all subadditive functions in [BDF12] which studies small-size approximate representations of valuation functions (referred to as sketches). Theorem 1.2 shows that allowing an error probability (over the uniform distribution) makes it possible to get a multiplicative -approximation using a -sized sketch. This sketch can be found in polynomial time using value queries (see Section 3.2).

Balcan and Harvey initiated the study of learning submodular functions from random examples coming from an unknown distribution and introduced the PMAC learning model described above [BH12]. They give an O()-factor PMAC learning algorithm and show an information-theoretic -factor impossibility result for submodular functions. Subsequently, Balcan et al. gave a distribution-independent PMAC learning algorithm for XOS functions that achieves an -approximation and showed that this is essentially optimal [BCIW12]. They also give a PMAC learning algorithm in which the number of clauses defining the target XOS function determines the running time and the approximation factor that can be achieved (for polynomial-size XOS functions it implies -approximation factor in time for any ).

The lower bound in [BH12] also implies hardness of learning of submodular function with (or )-error: it is impossible to learn a submodular function in time within any nontrivial -error over general distributions. We emphasize that these strong lower bounds rely on a very specific distribution concentrated on a sparse set of points, and show that this setting is very different from the setting of uniform/product distributions which is the focus of this paper.

For product distributions, Balcan and Harvey show that 1-Lipschitz monotone submodular functions of minimum nonzero value at least have concentration properties implying a PMAC algorithm with a multiplicative -approximation [BH12]. The approximation is by a constant function and the algorithm they give approximates the function by its mean on a small sample. Since a constant is a function of variables, their result can be viewed as an extreme case of approximation by a junta. Our result gives multiplicative -approximation for arbitrarily small . The main point of Theorem 1.2, perhaps surprising, is that the number of required variables grows only polynomially in and logarithmically in .

Learning of submodular functions with additive rather than multiplicative guarantees over the uniform distribution was first considered by Gupta et al. who were motivated by applications in private data release [GHRU11]. They show that submodular functions can be -approximated by a collection of -Lipschitz submodular functions. Concentration properties imply that each -Lipschitz submodular function can be -approximated by a constant. This leads to a learning algorithm running in time , which however requires value queries in order to build the collection. Cheraghchi et al. use an argument based on noise stability to show that submodular functions can be approximated in by functions of Fourier degree [CKKL12]. This leads to an learning algorithm which uses only random examples and, in addition, works in the agnostic setting. Most recently, Feldman et al. show that the decomposition from [GHRU11] can be computed by a low-rank binary decision tree [FKV13]. They then show that this decision tree can then be pruned to obtain depth decision tree that approximates a submodular function. This construction implies approximation by a -junta of Fourier degree . They used these structural results to give a PAC learning algorithm running in time . Note that our multiplicative -approximation in this case implies -error (but -error gives no multiplicative guarantees). In [FKV13] it is also shown that random examples (or even value queries) are necessary to PAC learn monotone submodular functions to -error of . This implies that our learning algorithms for submodular and XOS functions cannot be substantially improved.

In a recent work, Raskhodnikova and Yaroslavtsev consider learning and testing of submodular functions taking values in the range (referred to as pseudo-Boolean) [RY13]. The error of a hypothesis in their framework is the probability that the hypothesis disagrees with the unknown function. They build on the approach from [GHRU11] to show that pseudo-Boolean submodular functions can be expressed as -DNF and then apply Mansour’s algorithm for learning DNF [Man95] to obtain a -time PAC learning algorithm using value queries. In this special case the results in [FKV13] give approximation of submodular functions by junta of size and PAC learning algorithm from random examples. In an independent work, Blais et al. prove existence of a junta of size and use it to give an algorithm for testing submodularity using value queries [BOSY13].

It is interesting to remark that several largely unrelated methods point to approximating junta being of exponential size, namely, pruned decision trees in [FKV13]; Friedgut’s theorem based analysis in this work; two Sunflower lemma-style arguments in [BOSY13]. However, unexpectedly (at least for the authors), a polynomial-size junta suffices.

Previously, approximations by juntas of size polynomial in were only known in some simple special cases of submodular functions. Boolean submodular functions are disjunctions and hence, over the uniform distribution, can be approximated by an -junta. It can be easily seen that linear functions are approximable by -juntas. Coverage functions which are non-negative linear combinations of monotone disjunctions have been recently shown to be approximable by -juntas [FK14]. More generally, for Boolean functions the results in [DS09] imply that linear threshold functions with constant total influence can be -approximated by a junta of size polynomial in . In both [DS09] and [FK14] the techniques are unrelated to ours.

## 2 Preliminaries

### 2.1 Classes of valuation functions

Let us describe several classes of functions on the discrete cube, which can be also equivalently viewed as set functions. The functions in these classes share some form of the property of “forbidden complementarities” — e.g., cannot be more than . These functions could be monotone or non-monotone; we call a function monotone if whenever .

Linear functions. Linear (or additive) functions are functions in the form . This is the smallest class in the hierarchy that we consider here.

Submodular functions. Submodular functions are defined by the condition for all . A monotone submodular function can be viewed as a valuation on sets with the property of diminishing returns: the marginal value of an element, , cannot increase if we enlarge the set . Non-monotone submodular functions play a role in combinatorial optimization, primarily as generalizations of the cut function in a graph, , which is known to be submodular. Another important subclass of monotone submodular functions is the class of rank functions of matroids: , where is the family of independent sets in a matroid. In fact, it is known that a function of this type is submodular if and only if forms a matroid.
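To make the definition concrete, the following is a minimal sketch (not part of the original development) that checks the standard submodularity inequality $f(A \cup B) + f(A \cap B) \le f(A) + f(B)$ by brute force on a small ground set. The 4-cycle graph and the helper names `subsets`, `is_submodular`, `is_monotone` and `cut` are illustrative assumptions; the check is exponential in the ground set size and is meant only for toy examples.

```python
from itertools import combinations

def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground, tol=1e-12):
    """Brute-force check of f(A | B) + f(A & B) <= f(A) + f(B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(f(a | b) + f(a & b) <= f(a) + f(b) + tol
               for a in all_sets for b in all_sets)

def is_monotone(f, ground):
    """Brute-force check of f(A) <= f(B) whenever A is a subset of B."""
    all_sets = list(subsets(ground))
    return all(f(a) <= f(b) for a in all_sets for b in all_sets if a <= b)

# Hypothetical example graph: a cycle on four vertices.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]

def cut(s):
    """Cut function: number of edges with exactly one endpoint in s."""
    return sum((u in s) != (v in s) for u, v in EDGES)

if __name__ == "__main__":
    print("cut is submodular:", is_submodular(cut, range(4)))
    print("cut is monotone:", is_monotone(cut, range(4)))
    print("|S|^2 is submodular:", is_submodular(lambda s: len(s) ** 2, range(3)))
```

The cut function passes the submodularity check but fails monotonicity, which matches its role above as the basic example of a non-monotone submodular function.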

Fractionally subadditive functions (XOS). A set function is fractionally subadditive if whenever and .

This class is broader than that of (nonnegative) monotone submodular functions (but does not contain non-monotone functions, since fractionally-subadditive functions are monotone by definition). For fractionally subadditive functions such that , there is an equivalent definition known as “XOS” or maximum of non-negative linear functions [Fei06]: is XOS iff , where any positive integer and ’s are arbitrary non-negative real-valued weights (note that for every , is a non-negative linear function).
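As an illustration of the XOS representation, here is a small sketch that builds a function as a maximum of non-negative linear functions and checks by brute force that it is monotone and subadditive but not submodular. The weight vectors and the helper names (`make_xos`, `subsets`) are hypothetical choices for this example only, and the sketch verifies plain subadditivity rather than the full fractional condition.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            yield frozenset(combo)

def make_xos(weight_vectors):
    """Return f(S) = max over clauses w of sum_{i in S} w[i]."""
    def f(s):
        return max(sum(w[i] for i in s) for w in weight_vectors)
    return f

# Hypothetical clauses on three elements: f(S) = max(|S & {0, 1}|, |S & {2}|).
f = make_xos([(1, 1, 0), (0, 0, 1)])
ALL = list(subsets(3))

monotone = all(f(a) <= f(b) for a in ALL for b in ALL if a <= b)
subadditive = all(f(a | b) <= f(a) + f(b) for a in ALL for b in ALL)
submodular = all(f(a | b) + f(a & b) <= f(a) + f(b) for a in ALL for b in ALL)

if __name__ == "__main__":
    print("monotone:", monotone)
    print("subadditive:", subadditive)
    print("submodular:", submodular)
```

Submodularity fails on the pair $A = \{0, 2\}$, $B = \{1, 2\}$: here $f(A \cup B) + f(A \cap B) = 2 + 1$ while $f(A) + f(B) = 1 + 1$.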

It is instructive to consider again the example of rank functions: . As we mentioned, is submodular exactly when forms a matroid. In contrast, is XOS for any down-closed set system (satisfying ; this follows from an equivalent formulation of a rank function for down-closed set systems, ). In this sense, XOS is a significantly broader class than submodular functions. Another manifestation of this fact is that optimization problems like admit constant-factor approximation algorithms using polynomially many value queries to when is submodular, but no such algorithms exist for XOS functions.

Subadditive functions. Subadditive functions are defined by the condition for all . Subadditive functions are more general than submodular and fractionally subadditive functions. In fact, subadditive functions are in some sense much less structured than fractionally subadditive functions. It is easy to verify that every function is subadditive. While submodular and fractionally subadditive functions satisfy “dimension-free” concentration bounds, this is not true for subadditive functions (see [Von10] for more details).

Self-bounding functions. Self-bounding functions were defined by Boucheron, Lugosi and Massart [BLM00] and further generalized by McDiarmid and Reed [MR06] as a unifying class of functions that enjoy strong concentration properties. Self-bounding functions are defined generally on product spaces ; here we restrict our attention to the hypercube, i.e. the case where . We identify functions on with set functions on in a natural way. By and , we denote the all-zeroes and all-ones vectors in respectively (corresponding to and sets).

###### Definition 2.1.

For a function and any , let . Then is -self-bounding, if for all and ,

$$f(x) - \min_{x_i} f(x) \le 1, \tag{1}$$

$$\sum_{i=1}^{n} \bigl(f(x) - \min_{x_i} f(x)\bigr) \le a f(x) + b. \tag{2}$$

In this paper, we are primarily concerned with -self-bounding functions, to which we also refer as -self-bounding functions. Note that the definition implies that for every -self-bounding function. Self-bounding functions include (-Lipschitz) fractionally subadditive functions. To subsume -Lipschitz non-monotone submodular functions, it is sufficient to consider the slightly more general -self-bounding functions — see [Von10]. The -Lipschitz condition will not play a role in this paper, as we normalize functions to have values in the range.
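Conditions (1) and (2) can be verified directly on small examples. The sketch below is an illustration under the assumption that $\min_{x_i} f(x)$ denotes the smaller of the two values of $f$ obtained by setting coordinate $i$ of $x$ to 0 or to 1; the function names and the two test functions are hypothetical.

```python
from itertools import product

def flip_min(f, x, i):
    """min over the two settings of coordinate i, other coordinates fixed."""
    lo = x[:i] + (0,) + x[i + 1:]
    hi = x[:i] + (1,) + x[i + 1:]
    return min(f(lo), f(hi))

def is_self_bounding(f, n, a=1.0, b=0.0, tol=1e-12):
    """Brute-force check of conditions (1) and (2) over all of {0,1}^n."""
    for x in product((0, 1), repeat=n):
        gaps = [f(x) - flip_min(f, x, i) for i in range(n)]
        if any(g > 1 + tol for g in gaps):       # condition (1)
            return False
        if sum(gaps) > a * f(x) + b + tol:       # condition (2)
            return False
    return True

def truncated_count(x):
    """Rank function of a uniform matroid of rank 2: min(|x|, 2)."""
    return min(sum(x), 2)

def parity(x):
    """Parity of the number of ones; every coordinate is pivotal."""
    return sum(x) % 2

if __name__ == "__main__":
    print("min(|x|, 2):", is_self_bounding(truncated_count, 5))
    print("parity:", is_self_bounding(parity, 5))
```

The rank function satisfies both conditions with $a = 1$, $b = 0$, whereas parity violates (2) at any point of odd weight, since all $n$ coordinates contribute a gap of 1 while $f(x) = 1$.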

Self-bounding functions satisfy dimension-free concentration bounds, based on the entropy method of Boucheron, Lugosi and Massart [BLM00]. Currently this is the most general class of functions known to satisfy such concentration bounds. The entropy method for self-bounding functions is general enough to rederive bounds such as Talagrand’s concentration inequality. An example of a self-bounding function (related to applications of Talagrand’s inequality) is a function with the property of small certificates: has small certificates, if it is 1-Lipschitz and whenever , there is a set of coordinates , , such that if , then . Such functions often arise in combinatorics, by defining to equal the maximum size of a certain structure appearing in . Another well-studied class of self-bounding functions arises from Rademacher averages which are widely used to measure the complexity of model classes in statistical learning theory [Kol01, BM02]. See [BLB03] for a more detailed discussion and additional examples.

The definition of self-bounding functions is more symmetric than that of submodular functions: note that the definition does not change if we swap the meaning of and for any coordinate. This is a natural property in the setting of machine learning; the learnability of functions on should not depend on switching the meaning of and for any particular coordinate.

### 2.2 Norms and discrete derivatives

The and -norms of are defined by and , respectively, where is the uniform distribution.

###### Definition 2.2 (Discrete derivatives).

For , and , let denote the vector in that equals with -th coordinate set to . For a function and index we define . We also define .

A function is monotone (non-decreasing) if and only if for all and , . For a submodular function, , by considering the submodularity condition for , , , and .
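The two characterizations above are easy to test numerically. The following sketch assumes the usual conventions $\partial_i f(x) = f(x_{i \leftarrow 1}) - f(x_{i \leftarrow 0})$ and $\partial_{i,j} f = \partial_i \partial_j f$; the helper names and the two example functions are hypothetical.

```python
from itertools import product

def set_coord(x, i, b):
    """Return x with its i-th coordinate set to b."""
    return x[:i] + (b,) + x[i + 1:]

def partial(f, i):
    """First discrete derivative of f in direction i."""
    return lambda x: f(set_coord(x, i, 1)) - f(set_coord(x, i, 0))

def second_partial(f, i, j):
    """Second discrete derivative: apply direction j, then direction i."""
    return partial(partial(f, j), i)

def is_monotone(f, n):
    """All first derivatives are non-negative everywhere."""
    return all(partial(f, i)(x) >= 0
               for i in range(n)
               for x in product((0, 1), repeat=n))

def has_nonpositive_second_partials(f, n):
    """All mixed second derivatives are non-positive everywhere."""
    return all(second_partial(f, i, j)(x) <= 0
               for i in range(n) for j in range(n) if i != j
               for x in product((0, 1), repeat=n))

def budget(x):
    """min(|x|, 2): a monotone submodular function."""
    return min(sum(x), 2)

def conjunction(x):
    """x_0 AND x_1: monotone, but exhibits complementarity."""
    return x[0] * x[1]

if __name__ == "__main__":
    print(is_monotone(budget, 4), has_nonpositive_second_partials(budget, 4))
    print(is_monotone(conjunction, 2), has_nonpositive_second_partials(conjunction, 2))
```

For the conjunction, $\partial_{0,1} f = 1 > 0$ at every point, so it is monotone but fails the second-derivative condition and is therefore not submodular.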

Absolute error vs. error relative to norm: In our results, we typically assume that the values of are in a bounded interval , and our goal is to learn with an additive error of . Some prior work considered an error relative to the norm of , for example at most [CKKL12]. In fact, it is known that for a non-negative submodular or XOS function , [Fei06, FMV07a] and hence this does not make much difference. If we scale by , we obtain a function with values in and learning the original function within an additive error of is equivalent to learning the scaled function within an error of .

## 3 Junta Approximations of Submodular Functions

First we turn to the class of submodular functions and their approximations by functions of a small number of variables.

### 3.1 Additive Approximation For Submodular Functions

Here we prove Theorem 1.1, a bound of on the size of a junta needed to approximate a submodular function bounded by within an additive error of . The core of our proof is the following (seemingly weaker) statement. We remark that in this paper all logarithms are base 2.

###### Lemma 3.1.

For any and any submodular function , there exists a submodular function depending only on a subset of variables , , such that .

Note that if and , Lemma 3.1 reduces the number of variables to rather than a constant. However, we show that this is enough to prove Theorem 1.1, effectively by repeating this argument. In fact, it was previously shown [FKV13] that submodular functions can be -approximated by functions of variables. One application of Lemma 3.1 to this result brings the number of variables down to , and another repetition of the same argument brings it down to . This is a possible way to prove Theorem 1.1. Nevertheless, we do not need to rely on this previous result, and we can derive Theorem 1.1 directly from Lemma 3.1 as follows.

###### Proof of Theorem 1.1.

Let be a submodular function. We shall prove a bound of for the size of the approximating junta.

Observe that this bound holds trivially for , because then we are allowed to choose . For contradiction, suppose that there is for which the statement of Theorem 1.1 does not hold. Let be the set of all for which the statement does not hold, and pick an such that . Then, the statement still holds for .

By the statement of Theorem 1.1 for , there is a subset of variables of size and a submodular function depending only on , such that . Now let us apply Lemma 3.1 to with parameter . Thus, there exists a submodular function such that , and depends only on a subset of variables , . We have , and therefore (using ). We conclude that as required in Theorem 1.1. By the triangle inequality, we have . However, this would mean that the statement of Theorem 1.1 holds for as well, which is a contradiction. ∎

In the rest of this section, our goal is to prove Lemma 3.1.

What we need. Our proof relies on two previously known facts: a concentration result for submodular functions, and a “boosting lemma” for down-monotone events.

Concentration of submodular functions. It is known that a -Lipschitz nonnegative submodular function is concentrated within a standard deviation of [BLM00, Von10]. This fact was also used in previous work on learning of submodular functions [BH12, GHRU11, FKV13]. Exponential tail bounds are known in this case, but we do not even need this. We quote the following result which follows from the Efron-Stein inequality (the first part is stated as Corollary 2 in [BLB03], Section 2.2; the second part follows easily from the same proof).

###### Lemma 3.2.

For any self-bounding function under a product distribution,

 Var[f]≤E[f].

For any -self-bounding function under a product distribution,

 Var[f]≤aE[f].

We use the fact that -Lipschitz monotone submodular functions are self-bounding, and -Lipschitz nonmonotone submodular functions are -self-bounding (see [Von10]). By scaling, we obtain the following for -Lipschitz submodular functions (see also [FKV13]).

###### Corollary 3.3.

For any -Lipschitz monotone submodular function under a product distribution,

 Var[f]≤αE[f].

For any -Lipschitz (nonmonotone) submodular function under a product distribution,

 Var[f]≤2αE[f].
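As a sanity check of the variance bounds, the following minimal sketch evaluates both sides exactly by enumerating the hypercube under the uniform distribution. The two functions (a small coverage function and the cut function of a 5-cycle) and all names are illustrative assumptions, not examples taken from the paper.

```python
from itertools import product


def mean_and_variance(f, n):
    """Exact mean and variance of f under the uniform distribution on {0,1}^n."""
    values = [f(x) for x in product((0, 1), repeat=n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


# Monotone example (hypothetical): coordinate i selects SETS[i] and f counts
# the covered items. One coordinate changes f by at most 3, so alpha = 3.
SETS = [{0, 1}, {1, 2}, {2, 3, 4}, {0, 4}, {5}]


def coverage(x):
    covered = set()
    for chosen, items in zip(x, SETS):
        if chosen:
            covered |= items
    return len(covered)


# Nonmonotone example (hypothetical): cut function of a 5-cycle.
# Flipping one vertex changes the cut by at most its degree, so alpha = 2.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]


def cut(x):
    return sum(1 for u, v in EDGES if x[u] != x[v])


cov_mean, cov_var = mean_and_variance(coverage, len(SETS))
cut_mean, cut_var = mean_and_variance(cut, 5)

print("monotone bound holds:", cov_var <= 3 * cov_mean)
print("nonmonotone bound holds:", cut_var <= 2 * 2 * cut_mean)
```

Enumeration is only feasible for small n; it is used here because it gives exact values rather than sampling estimates.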

Boosting lemma for down-monotone events. The following was proved as Lemma 3 in [GV06].

###### Lemma 3.4.

Let be down-monotone (if and coordinate-wise, then ). For , define

$$\sigma_p = \Pr[X(p) \in F]$$

where is a random subset of , each element sampled independently with probability . Then

$$\sigma_p = (1-p)^{\phi(p)}$$

where is a non-decreasing function for .
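The boosting lemma can be checked numerically on a small example. The sketch below reads the lemma as $\sigma_p = (1-p)^{\phi(p)}$, builds a down-monotone family as the down-closure of a few hand-picked sets (a hypothetical choice), computes $\sigma_p$ exactly, and tests whether $\phi(p) = \log \sigma_p / \log(1-p)$ is non-decreasing on a grid.

```python
from itertools import product
from math import log


def down_closure(n, maximal_sets):
    """All 0/1 points whose support lies inside one of the given sets."""
    family = []
    for x in product((0, 1), repeat=n):
        support = {i for i, bit in enumerate(x) if bit}
        if any(support <= m for m in maximal_sets):
            family.append(x)
    return family


def sigma(family, n, p):
    """Pr[X(p) in family], each coordinate set to 1 independently w.p. p."""
    return sum(p ** sum(x) * (1 - p) ** (n - sum(x)) for x in family)


def phi(family, n, p):
    """Exponent phi(p) defined by sigma_p = (1 - p) ** phi(p)."""
    return log(sigma(family, n, p)) / log(1 - p)


N = 5
FAMILY = down_closure(N, [{0, 1, 2}, {2, 3}, {4}])  # illustrative family
GRID = [k / 20 for k in range(1, 20)]
PHI_VALUES = [phi(FAMILY, N, p) for p in GRID]

# Small tolerance to absorb floating-point rounding.
IS_NON_DECREASING = all(a <= b + 1e-9 for a, b in zip(PHI_VALUES, PHI_VALUES[1:]))
print("phi non-decreasing on grid:", IS_NON_DECREASING)
```

A grid check is of course not a proof; it only illustrates the monotonicity that the lemma asserts for every down-monotone family.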

The proof of Lemma 3.1. Given a submodular function , let denote the multilinear extension of : where has independent random 0/1 coordinates with expectations . We also denote by the characteristic vector of a set .

###### Algorithm 3.5.

Given , produce a small set of important coordinates as follows (for parameters ):

• Set .

• As long as there is such that , include in .
(This step is sufficient for monotone submodular functions.)

• As long as there is such that , include in .
(This step deals with non-monotone submodular functions.)

• Return .
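The selection procedure can be sketched in code for small instances, with the probabilities computed exactly by enumeration. Two points are assumptions rather than statements of the paper: the inclusion test "probability greater than 1/2" is inferred from the selection rule quoted in the proof of Lemma 3.7, and the second loop is implemented by running the first loop on the complemented function $x \mapsto f(\mathbf{1}-x)$, following the remark in the proof of Lemma 3.6. All function names are hypothetical.

```python
from itertools import product


def marginal(f, x, i):
    """Discrete derivative: f with x_i = 1 minus f with x_i = 0."""
    hi = x[:i] + (1,) + x[i + 1:]
    lo = x[:i] + (0,) + x[i + 1:]
    return f(hi) - f(lo)


def prob_large_marginal(f, n, i, chosen, delta, alpha):
    """Exact Pr[marginal of i exceeds alpha] at a delta-sample of `chosen`."""
    total = 0.0
    for bits in product((0, 1), repeat=len(chosen)):
        x = [0] * n
        weight = 1.0
        for j, bit in zip(chosen, bits):
            x[j] = bit
            weight *= delta if bit else 1 - delta
        if marginal(f, tuple(x), i) > alpha:
            total += weight
    return total


def greedy_select(f, n, delta, alpha):
    """Add coordinates while some unchosen one passes the probability test."""
    chosen = []
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if i not in chosen and prob_large_marginal(f, n, i, chosen, delta, alpha) > 0.5:
                chosen.append(i)
                changed = True
    return chosen


def important_coordinates(f, n, delta, alpha):
    """Union of the coordinates selected for f and for its complement (assumed)."""
    s = greedy_select(f, n, delta, alpha)
    f_bar = lambda x: f(tuple(1 - bit for bit in x))
    t = greedy_select(f_bar, n, delta, alpha)
    return sorted(set(s) | set(t))


# Illustrative monotone submodular function with range in [0, 1].
WEIGHTS = (0.6, 0.6, 0.05, 0.05, 0.05, 0.05)


def budget_additive(x):
    return min(1.0, sum(w * bit for w, bit in zip(WEIGHTS, x)))


print(important_coordinates(budget_additive, 6, delta=0.25, alpha=0.1))
```

The enumeration costs time exponential in the number of chosen coordinates, so this is a toy illustration of the selection rule, not an efficient implementation.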

The intuition here (for monotone functions) is that we include greedily all variables whose contribution is significant, when measured at a random point where the variables chosen so far are set to with a (small) probability . The reason for this is that we can bound the number of such variables, and at the same time we can prove that the contribution of unchosen variables is very small with high probability, when the variables in are assigned uniformly at random (this part uses the boosting lemma). This is helpful in estimating the approximation error of this procedure.

First, we bound the number of variables chosen by the procedure. The argument is essentially that if the procedure had selected too many variables, their expected cumulative contribution would exceed the bounded range of the function. This argument would suffice for monotone submodular functions. The final proof is somewhat technical because of the need to deal with potentially negative discrete derivatives of non-monotone submodular functions.

###### Lemma 3.6.

The number of variables chosen by the procedure above is .

###### Proof.

For each , let be the subset of variables in included before the selection of . For a set let denote . Further, for , let us define to be the set where iff and ; in other words, these are all the elements in that have a marginal contribution more than to the previously included elements.

For each variable included in , we have by definition . Since each appears in with probability , and (independently) with probability at least , we get that each element of appears in with probability at least . In expectation, . Also, for any set and each , submodularity implies that , since . Now we get that

$$f(R_+) = f(\mathbf{0}) + \sum_{i \in R_+} \partial_i f\big(\mathbf{1}_{(R_+)_{<i}}\big) > \alpha |R_+|.$$

From here we obtain that

$$\mathbb{E}\big[f(S(\delta)_+)\big] > \alpha\, \mathbb{E}\big[|S(\delta)_+|\big] \ge \tfrac{1}{2} \alpha \delta |S|.$$

This implies that , otherwise the expectation would exceed the range of , which is .

To bound the size of we observe that the function defined as for every is submodular and for every , . The criterion for including variables in is the same as the criterion for including variables in , applied to the function in place of . Therefore, by an analogous argument, we cannot include more than elements in , hence . ∎
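The counting step can be written out explicitly. The following is a sketch under the assumption (not restated in this passage) that $f$ takes values in $[0,1]$; the constants are those implied by that assumption. Chaining the expectation bound with $f \le 1$ gives

$$\tfrac{1}{2}\,\alpha\,\delta\,|S| \;\le\; \alpha\,\mathbb{E}\big[|S(\delta)_+|\big] \;<\; \mathbb{E}\big[f(S(\delta)_+)\big] \;\le\; 1 \quad\Longrightarrow\quad |S| < \frac{2}{\alpha\delta}.$$

Applying the same chain to the complemented function would bound $|T|$ by the same quantity, so under this assumption the returned set has fewer than $4/(\alpha\delta)$ elements.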

The next step in the analysis replaces the condition used by Algorithm 3.5 by a probability bound exponentially small in $1/\delta$. The tool that we use here is the “boosting lemma” (Lemma 3.4), which amplifies the probability bound from $1/2$ to $2^{-1/(2\delta)}$ as the sampling probability goes from $\delta$ to $1/2$.

###### Lemma 3.7.

With the same notation as above, if , then for any

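$$\Pr\big[\partial_i f(\mathbf{1}_{J'(1/2)}) > \alpha\big] \le 2^{-1/(2\delta)}$$

and

$$\Pr\big[\partial_i f(\mathbf{1}_{J \setminus J'(1/2)}) < -\alpha\big] \le 2^{-1/(2\delta)}.$$

###### Proof.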

Let us prove the first inequality; the second one will be similar. First, we know by the selection rule of the algorithm that for any ,

$$\Pr\big[\partial_i f(\mathbf{1}_{S(\delta)}) > \alpha\big] \le 1/2.$$

By submodularity of we get that for any ,

$$\Pr\big[\partial_i f(\mathbf{1}_{J'(\delta)}) > \alpha\big] \le 1/2.$$

Denote by the family of points such that . By the submodularity of , which is equivalent to partial derivatives being non-increasing, is a down-monotone set: if , then . If we define as in Lemma 3.4, we have . Therefore, by Lemma 3.4, where is a non-decreasing function. For , we get , which implies (note that for any ). As is non-decreasing, we must also have . This means . Recall that so this proves the first inequality.

For the second inequality, we denote similarly . Again, this is a down-monotone set by the submodularity of . By the selection rule of the algorithm, This implies by Lemma 3.4 that . This proves the second inequality. ∎
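The arithmetic behind the boosting step can be checked numerically. The sketch below assumes the exponent form $\sigma_p = (1-p)^{\phi(p)}$ of the boosting lemma and $\delta \le 1/2$. For each $\delta$ it computes the smallest exponent compatible with $\sigma_\delta \le 1/2$, compares it with $1/(2\delta)$, and compares the resulting bound on $\sigma_{1/2}$ with $2^{-1/(2\delta)}$. The function name and the grid of values are illustrative.

```python
from math import log


def smallest_exponent(delta):
    """Smallest phi satisfying (1 - delta) ** phi <= 1/2."""
    return log(0.5) / log(1 - delta)


ROWS = []
for delta in (0.01, 0.05, 0.1, 0.25, 0.5):
    phi_min = smallest_exponent(delta)
    # Since phi is non-decreasing, phi(1/2) >= phi(delta) >= phi_min,
    # so sigma_{1/2} = (1/2) ** phi(1/2) is at most (1/2) ** phi_min.
    boosted_bound = 0.5 ** phi_min
    target_bound = 2 ** (-1 / (2 * delta))
    ROWS.append((delta, phi_min, 1 / (2 * delta), boosted_bound, target_bound))

for delta, phi_min, claimed, boosted, target in ROWS:
    print(f"delta={delta}: phi_min={phi_min:.3f} vs 1/(2 delta)={claimed:.3f}; "
          f"bound={boosted:.3e} vs target={target:.3e}")
```

The comparison is tight at $\delta = 1/2$, where both exponents equal 1.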

###### Proof of Lemma 3.1.

Given a submodular function , we construct a set of coordinates as described above, with parameters and . Lemma 3.6 guarantees that .

Let us use to denote the -tuple of coordinates of indexed by . Consider the subcube of where the coordinates on are fixed to be . In the following, all expectations are over a uniform distribution on the respective subcube, unless otherwise indicated. We denote by the restriction of to this subcube, . We define to be the function obtained by replacing each by its expectation over the respective subcube:

$$h(x) = \mathbb{E}\big[f_{x_{J'}}\big] = \mathbb{E}_{y \in \{0,1\}^{\overline{J'}}}\big[f(x_{J'}, y)\big].$$

Obviously depends only on the variables in and it is easy to see that it is submodular with range in . It remains to estimate the distance of from . Observe that

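$$\begin{aligned}
\|f-h\|_2^2 &= \mathbb{E}_{x \in \{0,1\}^{J}}\big[(f(x)-h(x))^2\big] \\
&= \mathbb{E}_{x_{J'} \in \{0,1\}^{J'}}\, \mathbb{E}_{y \in \{0,1\}^{\overline{J'}}}\big[(f(x_{J'},y)-h(x_{J'},y))^2\big] \\
&= \mathbb{E}_{x_{J'} \in \{0,1\}^{J'}}\, \mathbb{E}_{y \in \{0,1\}^{\overline{J'}}}\big[(f_{x_{J'}}(y)-\mathbb{E}[f_{x_{J'}}])^2\big] \\
&= \mathbb{E}_{x_{J'} \in \{0,1\}^{J'}}\big[\mathrm{Var}[f_{x_{J'}}]\big].
\end{aligned}$$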
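The identity above can be verified by brute force on a small example. In this minimal sketch, `keep` plays the role of $J'$, the averaging function $h$ is computed by enumerating the free coordinates, and the squared distance is evaluated both directly and as an average of subcube variances. The test function and all names are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance


def restriction_values(f, n, keep, assignment):
    """Values of f on the subcube where coordinates in `keep` are fixed."""
    free = [j for j in range(n) if j not in keep]
    values = []
    for y in product((0, 1), repeat=len(free)):
        x = [0] * n
        for j, bit in zip(keep, assignment):
            x[j] = bit
        for j, bit in zip(free, y):
            x[j] = bit
        values.append(f(tuple(x)))
    return values


def squared_distance_two_ways(f, n, keep):
    """Return (E[(f - h)^2], E over fixed parts of Var of the restriction)."""
    direct, variances = [], []
    for assignment in product((0, 1), repeat=len(keep)):
        values = restriction_values(f, n, keep, assignment)
        h_value = mean(values)  # h is constant on this subcube
        direct.extend((v - h_value) ** 2 for v in values)
        variances.append(pvariance(values))
    # All subcubes have the same size, so plain averages are uniform averages.
    return mean(direct), mean(variances)


WEIGHTS = (0.6, 0.6, 0.05, 0.05, 0.05, 0.05)


def budget_additive(x):
    return min(1.0, sum(w * bit for w, bit in zip(WEIGHTS, x)))


LHS, RHS = squared_distance_two_ways(budget_additive, 6, keep=[0, 1])
print(LHS, RHS)
```

The two quantities agree up to floating-point rounding, which is the content of the chain of equalities: averaging over the free coordinates makes the squared error on each subcube equal to the variance of the restricted function.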

We partition the points into two classes:

1. Call bad, if there is such that

• , or

In particular, we call bad for the coordinate where this happens.

2. Call good otherwise, i.e. for every we have

• , and

Consider a good point and the restriction of to the respective subcube, . The condition above means that for every , the marginal value of is at most at the bottom of this subcube, and at least at the top of this subcube. By submodularity, it means that the marginal values are between , for all points of this subcube. Hence,