Black Box Submodular Maximization: Discrete and Continuous Settings



In this paper, we consider the problem of black box continuous submodular maximization where we only have access to the function values and no information about the derivatives is provided. For a monotone and continuous DR-submodular function, and subject to a bounded convex body constraint, we propose Black-box Continuous Greedy, a derivative-free algorithm that provably achieves the tight $[(1-1/e)OPT - \epsilon]$ approximation guarantee with $O(d/\epsilon^3)$ function evaluations. We then extend our result to the stochastic setting where function values are subject to stochastic zero-mean noise. It is through this stochastic generalization that we revisit the discrete submodular maximization problem and use the multi-linear extension as a bridge between discrete and continuous settings. Finally, we extensively evaluate the performance of our algorithm on continuous and discrete submodular objective functions using both synthetic and real data.

1 Introduction

Black-box optimization, also known as zeroth-order or derivative-free optimization, has been extensively studied in the literature (Conn et al., 2009; Bergstra et al., 2011; Rios and Sahinidis, 2013; Shahriari et al., 2016). In this setting, we assume that the objective function is unknown and we can only obtain zeroth-order information such as (stochastic) function evaluations.

Fueled by a growing number of machine learning applications, black-box optimization methods are usually considered in scenarios where gradients (i.e., first-order information) are 1) difficult or slow to compute, e.g., graphical model inference (Wainwright et al., 2008) and structure predictions (Taskar et al., 2005; Sokolov et al., 2016), or 2) inaccessible, e.g., hyper-parameter tuning for natural language processing or image classification (Snoek et al., 2012; Thornton et al., 2013) and black-box attacks for finding adversarial examples (Chen et al., 2017c; Ilyas et al., 2018). Even though heuristics such as random or grid search, with undesirable dependencies on the dimension, are still used in some applications (e.g., parameter tuning for deep networks), there has been a growing number of rigorous methods to address the convergence rate of black-box optimization in convex and non-convex settings (Wang et al., 2017; Balasubramanian and Ghadimi, 2018; Sahu et al., 2018).

The focus of this paper is constrained continuous DR-submodular maximization over a bounded convex body. We aim to design an algorithm that uses only zeroth-order information while avoiding expensive projection operations. Note that one way optimization methods can deal with constraints is to apply the projection oracle once the proposed iterates land outside the feasibility region. However, computing the projection in many constrained settings is computationally prohibitive (e.g., projection over bounded trace norm matrices, flow polytope, matroid polytope, rotation matrices). In such scenarios, projection-free algorithms, a.k.a. Frank-Wolfe (Frank and Wolfe, 1956), replace the projection with a linear program. Indeed, our proposed algorithm efficiently combines the zeroth-order information with solving a series of linear programs to ensure convergence to a near-optimal solution.

Continuous DR-submodular functions are an important subset of non-convex functions that can be minimized exactly (Bach, 2016; Staib and Jegelka, 2017) and maximized approximately (Bian et al., 2017a, b; Hassani et al., 2017; Mokhtari et al., 2018a; Hassani et al., 2019; Zhang et al., 2019b). This class of functions generalizes the notion of diminishing returns, usually defined over discrete set functions, to the continuous domains. They have found numerous applications in machine learning including MAP inference in determinantal point processes (DPPs) (Kulesza et al., 2012), experimental design (Chen et al., 2018c), resource allocation (Eghbali and Fazel, 2016), and mean-field inference in probabilistic models (Bian et al., 2018), among many others.

Motivation: Computing the gradient of a continuous DR-submodular function has been shown to be computationally prohibitive (or even intractable) in many applications. For example, the objective function of influence maximization is defined via specific stochastic processes (Kempe et al., 2003; Rodriguez and Schölkopf, 2012), and computing or estimating the gradient of the multilinear extension would require a relatively high computational complexity. In the problem of D-optimal experimental design, the gradient of the objective function involves inversion of a potentially large matrix (Chen et al., 2018c). Moreover, when one attacks a submodular recommender model, only black-box information is available and the service provider is unlikely to provide additional first-order information (this is known as the black-box adversarial attack model) (Lei et al., 2019).

There has been very recent progress on developing zeroth-order methods for constrained optimization problems in convex and non-convex settings (Ghadimi and Lan, 2013; Sahu et al., 2018). Such methods typically assume the objective function is defined on the whole $\mathbb{R}^d$ so that they can sample points from a proper distribution defined on $\mathbb{R}^d$. For DR-submodular functions, this assumption might be unrealistic, since many DR-submodular functions might be only defined on a subset of $\mathbb{R}^d$, e.g., the multi-linear extension (Vondrák, 2008), a canonical example of DR-submodular functions, is only defined on a unit cube. Moreover, they can only guarantee to reach a first-order stationary point. However, Hassani et al. (2017) showed that for a monotone DR-submodular function, the stationary points can only guarantee a $1/2$ approximation to the optimum. Therefore, if a state-of-the-art zeroth-order non-convex algorithm is used for maximizing a monotone DR-submodular function, it is likely to terminate at a suboptimal stationary point whose approximation ratio is only $1/2$.

Our contributions: In this paper, we propose a derivative-free and projection-free algorithm, Black-box Continuous Greedy (BCG), that maximizes a monotone continuous DR-submodular function over a bounded convex body $\mathcal{K} \subseteq \mathbb{R}^d$. We consider three scenarios:

  • In the deterministic setting, where function evaluations can be obtained exactly, BCG achieves the tight $[(1-1/e)OPT - \epsilon]$ approximation guarantee with $O(d/\epsilon^3)$ function evaluations.

  • In the stochastic setting, where function evaluations are noisy, BCG achieves the tight $[(1-1/e)OPT - \epsilon]$ approximation guarantee with $O(d^3/\epsilon^5)$ function evaluations.

  • In the discrete setting, Discrete Black-box Greedy (DBG) achieves the tight $[(1-1/e)OPT - \epsilon]$ approximation guarantee with $O(d^5/\epsilon^5)$ function evaluations.

Function | Additional Assumptions | Function Queries
continuous DR-submodular | monotone, $G$-Lip., $L$-smooth | [Theorem 1]
stoch. conti. DR-submodular | monotone, $G$-Lip., $L$-smooth | [Theorem 2]
discrete submodular | monotone | [Theorem 3]
Table 1: Number of function queries in different settings, where is the diameter of .

All the theoretical results are summarized in Table 1.

We would like to note that in the discrete setting, due to the conservative upper bounds for the Lipschitz and smoothness parameters of general multilinear extensions, and the variance of the gradient estimators subject to noisy function evaluations, the required number of function queries in theory is larger than the best known result in Mokhtari et al. (2018a, b). However, our experiments show that empirically, our proposed algorithm often requires significantly fewer function evaluations and less running time, while achieving a practically similar utility.

Novelty of our work: All the previous results in constrained DR-submodular maximization assume access to (stochastic) gradients. In this work, we address a harder problem, i.e., we provide the first rigorous analysis when only (stochastic) function values can be obtained. More specifically, with the smoothing trick (Flaxman et al., 2005), one can construct an unbiased gradient estimator via function queries. However, this estimator has a large variance which may cause FW-type methods to diverge. To overcome this issue, we build on the momentum method proposed by Mokhtari et al. (2018a) in which they assumed access to the first-order information.

Given a point $x$, the smoothed version of $F$ at $x$ is defined as $\mathbb{E}_{v \sim B^d}[F(x + \delta v)]$. If $x$ is close to the boundary of the domain $\mathcal{D}$, the point $x + \delta v$ may fall outside of $\mathcal{D}$, leaving the smoothed function undefined for many instances of DR-submodular functions (e.g., the multilinear extension is only defined over the unit cube). Thus the vanilla smoothing trick will not work. To this end, we transform the domain $\mathcal{D}$ and constraint set $\mathcal{K}$ in a proper way and run our zeroth-order method on the transformed constraint set $\mathcal{K}'$. Importantly, we retrieve the same convergence rate of $O(T^{-1/3})$ as in Mokhtari et al. (2018a) with a minimum number of function queries in different settings (continuous, stochastic continuous, discrete).

We further note that by using more recent variance reduction techniques (Zhang et al., 2019b), one might be able to reduce the required number of function evaluations.

1.1 Further Related Work

Submodular functions (Nemhauser et al., 1978), which capture the intuitive notion of diminishing returns, have become increasingly important in various machine learning applications. Examples include graph cuts in computer vision (Jegelka and Bilmes, 2011a, b), data summarization (Lin and Bilmes, 2011b, a; Tschiatschek et al., 2014; Chen et al., 2018a, 2017b), influence maximization (Kempe et al., 2003; Rodriguez and Schölkopf, 2012; Zhang et al., 2016), feature compression (Bateni et al., 2019), network inference (Chen et al., 2017a), active and semi-supervised learning (Guillory and Bilmes, 2010; Golovin and Krause, 2011; Wei et al., 2015), crowd teaching (Singla et al., 2014), dictionary learning (Das and Kempe, 2011), fMRI parcellation (Salehi et al., 2017), compressed sensing and structured sparsity (Bach, 2010; Bach et al., 2012), fairness in machine learning (Balkanski and Singer, 2015; Celis et al., 2016), and learning causal structures (Steudel et al., 2010; Zhou and Spanos, 2016), to name a few. Continuous DR-submodular functions naturally extend the notion of diminishing returns to the continuous domains (Bian et al., 2017b). Monotone continuous DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods (Bian et al., 2017b; Hassani et al., 2017; Mokhtari et al., 2018a). Bandit maximization of monotone continuous DR-submodular functions (Zhang et al., 2019a) is a closely related setting to ours. However, to the best of our knowledge, none of the existing work has developed a zeroth-order algorithm for maximizing a monotone continuous DR-submodular function. For a detailed review of derivative-free optimization (DFO) and black-box optimization (BBO), we refer interested readers to the book by Audet and Hare (2017).

2 Preliminaries

Submodular Functions

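We say a set function $f: 2^{\Omega} \to \mathbb{R}_{\ge 0}$ is submodular if it satisfies the diminishing returns property: for any $A \subseteq B \subseteq \Omega$ and $x \in \Omega \setminus B$, we have

$$f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B).$$

In words, the marginal gain of adding an element $x$ to a subset $A$ is no less than that of adding $x$ to its superset $B$.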

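For the continuous analogue, consider a function $F: \mathcal{X} \to \mathbb{R}_{\ge 0}$, where $\mathcal{X} = \prod_{i=1}^{d} \mathcal{X}_i$, and each $\mathcal{X}_i$ is a compact subset of $\mathbb{R}_{\ge 0}$. We define $F$ to be continuous submodular if $F$ is continuous and for all $x, y \in \mathcal{X}$, we have

$$F(x) + F(y) \ge F(x \vee y) + F(x \wedge y),$$

where $\vee$ and $\wedge$ are the component-wise maximizing and minimizing operators, respectively.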

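The continuous function $F$ is called DR-submodular (Bian et al., 2017b) if $F$ is differentiable and

$$\nabla F(x) \ge \nabla F(y) \quad \text{for all } x \le y.$$

An important implication of DR-submodularity is that the function $F$ is concave in any non-negative direction, i.e., for $x \le y$, we have

$$F(y) \le F(x) + \langle \nabla F(x), y - x \rangle.$$

The function $F$ is called monotone if for $x \le y$, we have $F(x) \le F(y)$.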

Smoothing Trick

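For a function $F$ defined on $\mathbb{R}^d$, its $\delta$-smoothed version $\tilde{F}_\delta$ is given as

$$\tilde{F}_\delta(x) = \mathbb{E}_{v \sim B^d}[F(x + \delta v)],$$

where $v$ is chosen uniformly at random from the $d$-dimensional unit ball $B^d$. In words, the function $\tilde{F}_\delta$ at any point $x$ is obtained by "averaging" $F$ over a ball of radius $\delta$ around $x$. In the sequel, we omit the subscript $\delta$ for the sake of simplicity and use $\tilde{F}$ instead of $\tilde{F}_\delta$.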

Lemma 1 below shows that under the Lipschitz assumption for $F$, the smoothed version $\tilde{F}$ is a good approximation of $F$, and also inherits the key structural properties of $F$ (such as monotonicity and submodularity). Thus one can (approximately) optimize $F$ via optimizing $\tilde{F}$.

Lemma 1 (Proof in Appendix A).

If $F$ is monotone continuous DR-submodular and $G$-Lipschitz continuous on $\mathbb{R}^d$, then so is $\tilde{F}$ and

$$|\tilde{F}(x) - F(x)| \le \delta G.$$

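An important property of $\tilde{F}$ is that one can obtain an unbiased estimation for its gradient $\nabla \tilde{F}$ by a single query of $F$. This property plays a key role in our proposed derivative-free algorithms.

Lemma 2 (Lemma 6.5 in (Hazan, 2016)).

Given a function $F$ on $\mathbb{R}^d$, if we choose $u$ uniformly at random from the $(d-1)$-dimensional unit sphere $S^{d-1}$, then we have

$$\nabla \tilde{F}(x) = \mathbb{E}_{u \sim S^{d-1}}\left[\frac{d}{\delta} F(x + \delta u)\, u\right].$$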
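As a numerical sanity check of the one-point identity in Lemma 2, the following minimal sketch averages many single-query estimates and compares the result with a known gradient. The linear test function, the vector `A`, the query point, and the sample count are all illustrative assumptions that do not come from the paper; a linear function is used because smoothing leaves it unchanged, so the gradient of its smoothed version is exactly `A`.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def one_point_estimate(F, x, delta, rng):
    """Single-query estimate (d / delta) * F(x + delta * u) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    value = F([xi + delta * ui for xi, ui in zip(x, u)])
    return [(d / delta) * value * ui for ui in u]


def averaged_estimate(F, x, delta, n_samples, seed=0):
    """Average independent one-point estimates of the smoothed gradient."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = one_point_estimate(F, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


# Hypothetical linear test function F(x) = <A, x> (an assumption for illustration).
A = [1.0, 2.0, 3.0]


def linear_F(x):
    return sum(a * xi for a, xi in zip(A, x))


estimate = averaged_estimate(linear_F, [0.1, 0.2, 0.3], delta=0.5, n_samples=200_000)
print(estimate)  # close to A, up to Monte Carlo noise
```

Each individual estimate is very noisy (its magnitude scales with $d/\delta$), which is why a large number of samples is needed here even in three dimensions.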


3 DR-Submodular Maximization

In this paper, we mainly focus on the constrained optimization problem:

$$\max_{x \in \mathcal{K}} F(x),$$

where $F$ is a monotone continuous DR-submodular function on $\mathcal{X}$, and the constraint set $\mathcal{K} \subseteq \mathcal{X}$ is convex and compact.

For first-order monotone DR-submodular maximization, one can use Continuous Greedy (Calinescu et al., 2011; Bian et al., 2017b), a variant of the Frank-Wolfe Algorithm (Frank and Wolfe, 1956; Jaggi, 2013; Lacoste-Julien and Jaggi, 2015), to achieve the $(1-1/e)$ approximation guarantee. At iteration $t$, the FW variant first maximizes the linearization of the objective function $F$:

$$v_t = \arg\max_{v \in \mathcal{K}} \langle v, \nabla F(x_t) \rangle.$$

Then the current point $x_t$ moves in the direction of $v_t$ with a step size $\gamma_t$:

$$x_{t+1} = x_t + \gamma_t v_t.$$

Hence, by solving linear optimization problems, the iterates are updated without resorting to the projection oracle.
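The two updates above can be illustrated with a minimal first-order sketch. Everything concrete in it is an assumption made for illustration: the separable objective $\sum_i \log(1 + w_i x_i)$ (monotone and DR-submodular on the unit cube), the weights `W`, the budget constraint $\{0 \le x \le 1, \sum_i x_i \le 2\}$, and the constant step size $1/T$. For this constraint set the linear maximization step has a closed form: fill the coordinates with the largest positive gradient entries first.

```python
import math

W = [5.0, 4.0, 0.1, 0.1]  # hypothetical objective weights
BUDGET = 2.0              # hypothetical constraint: 0 <= x <= 1, sum(x) <= BUDGET


def F(x):
    """Separable concave objective, monotone and DR-submodular on [0, 1]^d."""
    return sum(math.log(1.0 + w * xi) for w, xi in zip(W, x))


def grad_F(x):
    return [w / (1.0 + w * xi) for w, xi in zip(W, x)]


def linear_maximizer(g, budget):
    """argmax of <v, g> over {0 <= v <= 1, sum(v) <= budget}."""
    v = [0.0] * len(g)
    remaining = budget
    for i in sorted(range(len(g)), key=lambda j: g[j], reverse=True):
        if g[i] <= 0.0 or remaining <= 0.0:
            break
        v[i] = min(1.0, remaining)
        remaining -= v[i]
    return v


def continuous_greedy(T):
    """Start at the origin and take T steps of size 1/T along the LP solutions."""
    x = [0.0] * len(W)
    for _ in range(T):
        v = linear_maximizer(grad_F(x), BUDGET)
        x = [xi + vi / T for xi, vi in zip(x, v)]
    return x


x_final = continuous_greedy(T=100)
print(x_final, F(x_final))
```

Because the final iterate is an average of $T$ feasible points, it stays inside the constraint set without any projection.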

Here we introduce our main algorithm Black-box Continuous Greedy, which assumes access only to function values (i.e., zeroth-order information). This algorithm is partially based on the idea of Continuous Greedy. The basic idea is to utilize the function evaluations of $F$ at carefully selected points to obtain unbiased estimations of the gradient of the smoothed version, $\nabla \tilde{F}$. By extending Continuous Greedy to the derivative-free setting and using recently proposed variance reduction techniques, we can then optimize $\tilde{F}$ near-optimally. Finally, by Lemma 1 we show that the obtained optimizer also provides a good solution for $F$.

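Recall that continuous DR-submodular functions are defined on a box $\mathcal{D} = \prod_{i=1}^{d} \mathcal{X}_i$. To simplify the exposition, we can assume, without loss of generality, that the objective function $F$ is defined on $\mathcal{D} = \prod_{i=1}^{d} [0, a_i]$ (Bian et al., 2017a). Moreover, we note that since $\tilde{F}(x) = \mathbb{E}_{v \sim B^d}[F(x + \delta v)]$, for $x$ close to $\partial \mathcal{D}$ (the boundary of $\mathcal{D}$), the point $x + \delta v$ may fall outside of $\mathcal{D}$, leaving the function $\tilde{F}$ undefined.

To circumvent this issue, we shrink the domain $\mathcal{D}$ by $\delta$. Precisely, the shrunk domain is defined as

$$\mathcal{D}'_\delta = \{x \in \mathcal{D} \mid d(x, \partial \mathcal{D}) \ge \delta\}.$$

Since we assume $\mathcal{D} = \prod_{i=1}^{d} [0, a_i]$, the shrunk domain is $\mathcal{D}'_\delta = \prod_{i=1}^{d} [\delta, a_i - \delta]$. Then for all $x \in \mathcal{D}'_\delta$, we have $x + \delta v \in \mathcal{D}$. So $\tilde{F}$ is well-defined on $\mathcal{D}'_\delta$. By Lemma 1, the optimum of $\tilde{F}$ on the shrunk domain $\mathcal{D}'_\delta$ will be close to that of $F$ on the original domain $\mathcal{D}$, if $\delta$ is small enough. Therefore, we can first optimize $\tilde{F}$ on $\mathcal{D}'_\delta$, then approximately optimize $\tilde{F}$ (and thus $F$) on $\mathcal{D}$. For simplicity of analysis, we also translate the shrunk domain $\mathcal{D}'_\delta$ by $-\delta \mathbf{1}$, and denote it as $\mathcal{D}_\delta = \prod_{i=1}^{d} [0, a_i - 2\delta]$.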

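Besides the domain $\mathcal{D}$, we also need to consider the transformation on the constraint set $\mathcal{K}$. Intuitively, if there is no translation, we should consider the intersection of $\mathcal{K}$ and the shrunk domain $\mathcal{D}'_\delta$. But since we translate $\mathcal{D}'_\delta$ by $-\delta \mathbf{1}$, the same transformation should be performed on $\mathcal{K}$. Thus, we define the transformed constraint set as the translated intersection (by $-\delta \mathbf{1}$) of $\mathcal{D}'_\delta$ and $\mathcal{K}$:

$$\mathcal{K}' = (\mathcal{D}'_\delta \cap \mathcal{K}) - \delta \mathbf{1}.$$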

It is well known that the FW Algorithm is sensitive to the accuracy of the gradient, and may have arbitrarily poor performance with stochastic gradients (Hazan and Luo, 2016; Mokhtari et al., 2018b). Thus we incorporate two methods of variance reduction into our proposed algorithm Black-box Continuous Greedy, which correspond to Step 7 and Step 8 in Algorithm 1, respectively. First, instead of the one-point gradient estimation in Lemma 2, we adopt the two-point estimator of $\nabla \tilde{F}(x)$ (Agarwal et al., 2010; Shamir, 2017):

$$\frac{d}{2\delta}\left(F(x + \delta u) - F(x - \delta u)\right) u,$$

where $u$ is chosen uniformly at random from the unit sphere $S^{d-1}$. We note that this two-point estimator is an unbiased gradient estimator with less variance than the one-point estimator. We also average over a mini-batch of independently sampled two-point estimators for further variance reduction. The second variance-reduction technique is the momentum method used in (Mokhtari et al., 2018a) to estimate the gradient by a vector $\bar{g}_t$ which is updated at each iteration as follows:

$$\bar{g}_t = (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, g_t.$$

Here $\rho_t$ is a given step size, $\bar{g}_0$ is initialized as an all-zero vector $\mathbf{0}$, and $g_t$ is an unbiased estimate of the gradient at iterate $x_t$. As $\bar{g}_t$ is a weighted average of the previous gradient approximation $\bar{g}_{t-1}$ and the newly updated stochastic gradient $g_t$, it has a lower variance compared with $g_t$. Although $\bar{g}_t$ is not an unbiased estimation of the true gradient, its error will approach zero as time proceeds. The detailed description of Black-box Continuous Greedy is provided in Algorithm 1.
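To make the effect of the two variance-reduction techniques concrete, the sketch below repeatedly draws two-point estimates at a fixed point and tracks the momentum average. The linear objective with gradient `A`, the radius 0.1, and the schedule $\rho_t = t^{-2/3}$ are illustrative assumptions, not the settings analyzed in the paper. The squared error of the momentum average after 500 steps is compared with the average squared error of the raw estimates.

```python
import math
import random

A = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical: gradient of the linear F below


def F(x):
    return sum(a * xi for a, xi in zip(A, x))


def sample_unit_sphere(d, rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def two_point_estimate(F, x, delta, rng):
    """Two-point estimate (d / (2 delta)) * (F(x + delta u) - F(x - delta u)) * u."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    plus = F([xi + delta * ui for xi, ui in zip(x, u)])
    minus = F([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d / (2.0 * delta) * (plus - minus)
    return [scale * ui for ui in u]


def squared_error(g):
    return sum((gi - a) ** 2 for gi, a in zip(g, A))


rng = random.Random(0)
x = [0.5] * len(A)
g_bar = [0.0] * len(A)
raw_errors = []
for t in range(1, 501):
    g = two_point_estimate(F, x, 0.1, rng)
    rho = 1.0 / t ** (2.0 / 3.0)  # assumed schedule; rho_1 = 1 so g_bar starts at g_1
    g_bar = [(1.0 - rho) * gb + rho * gi for gb, gi in zip(g_bar, g)]
    raw_errors.append(squared_error(g))

mean_raw_error = sum(raw_errors) / len(raw_errors)
momentum_error = squared_error(g_bar)
print(mean_raw_error, momentum_error)
```

In an actual run of the algorithm the iterate moves between steps, so the momentum average also carries a bias from stale gradients; this sketch isolates only the variance effect.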

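1:  Input: constraint set $\mathcal{K}$, iteration number $T$, radius $\delta$, step size $\rho_t$, batch size $B_t$
2:  Output: $x_{T+1} + \delta \mathbf{1}$
3:  $x_1 \leftarrow \mathbf{0}$, $\bar{g}_0 \leftarrow \mathbf{0}$
4:  for $t = 1$ to $T$ do
5:     Sample $u_{t,1}, \ldots, u_{t,B_t}$ i.i.d. from $S^{d-1}$
6:     For $i = 1$ to $B_t$, let $y^{+}_{t,i} \leftarrow \delta \mathbf{1} + x_t + \delta u_{t,i}$, $y^{-}_{t,i} \leftarrow \delta \mathbf{1} + x_t - \delta u_{t,i}$ and evaluate $F(y^{+}_{t,i})$, $F(y^{-}_{t,i})$
7:     $g_t \leftarrow \frac{1}{B_t} \sum_{i=1}^{B_t} \frac{d}{2\delta} [F(y^{+}_{t,i}) - F(y^{-}_{t,i})]\, u_{t,i}$
8:     $\bar{g}_t \leftarrow (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, g_t$
9:     $v_t \leftarrow \arg\max_{v \in \mathcal{K}'} \langle v, \bar{g}_t \rangle$
10:    $x_{t+1} \leftarrow x_t + \frac{1}{T} v_t$
11:  end for
12:  Output $x_{T+1} + \delta \mathbf{1}$
Algorithm 1 Black-box Continuous Greedy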
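The description of Black-box Continuous Greedy given in the text (shifted query points, mini-batched two-point estimates, momentum averaging, a linear maximization over the transformed constraint set, and a step of size $1/T$) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the objective, the weights `W`, the budget constraint $\{0 \le x \le 1, \sum_i x_i \le 2\}$, the schedule $\rho_t = 2/(t+3)^{2/3}$, and all parameter values are hypothetical. For this constraint the transformed set is $\{0 \le v \le 1 - 2\delta, \sum_i v_i \le 2 - d\delta\}$.

```python
import math
import random

W = [5.0, 4.0, 0.1, 0.1]  # hypothetical objective weights
BUDGET = 2.0              # hypothetical constraint: 0 <= x <= 1, sum(x) <= BUDGET


def F(x):
    """Hypothetical monotone DR-submodular objective; only its values are queried."""
    return sum(math.log(1.0 + w * xi) for w, xi in zip(W, x))


def sample_unit_sphere(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]


def lmo_shifted(g, delta):
    """argmax of <v, g> over {0 <= v <= 1 - 2 delta, sum(v) <= BUDGET - d delta}."""
    cap = 1.0 - 2.0 * delta
    remaining = BUDGET - len(g) * delta
    v = [0.0] * len(g)
    for i in sorted(range(len(g)), key=lambda j: g[j], reverse=True):
        if g[i] <= 0.0 or remaining <= 0.0:
            break
        v[i] = min(cap, remaining)
        remaining -= v[i]
    return v


def black_box_continuous_greedy(T, delta, batch, seed=0):
    rng = random.Random(seed)
    d = len(W)
    x = [0.0] * d
    g_bar = [0.0] * d
    for t in range(1, T + 1):
        g = [0.0] * d
        for _ in range(batch):
            u = sample_unit_sphere(d, rng)
            # Query points are shifted by delta so they stay inside [0, 1]^d.
            y_plus = [delta + xi + delta * ui for xi, ui in zip(x, u)]
            y_minus = [delta + xi - delta * ui for xi, ui in zip(x, u)]
            scale = d / (2.0 * delta) * (F(y_plus) - F(y_minus)) / batch
            g = [gi + scale * ui for gi, ui in zip(g, u)]
        rho = 2.0 / (t + 3) ** (2.0 / 3.0)  # assumed momentum schedule
        g_bar = [(1.0 - rho) * gb + rho * gi for gb, gi in zip(g_bar, g)]
        v = lmo_shifted(g_bar, delta)
        x = [xi + vi / T for xi, vi in zip(x, v)]
    return [xi + delta for xi in x]  # translate back to the original domain


solution = black_box_continuous_greedy(T=200, delta=0.05, batch=4)
print(solution, F(solution))
```

Each iteration uses `2 * batch` function evaluations and no gradient or projection call; the only constraint-dependent component is the linear maximization routine.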
Theorem 1 (Proof in Appendix B).

For a monotone continuous DR-submodular function , which is also -Lipschitz continuous and -smooth on a convex and compact constraint set , if we set in Algorithm 1, then we have

where is a constant, , and is the global maximizer of on .

Remark 1.

By setting , , and , the error term (RHS) is guaranteed to be at most . Also, the total number of function evaluations is at most .

We can also extend Algorithm 1 to the stochastic case in which we obtain information about $F$ only through its noisy function evaluations $\hat{F}(x) = F(x) + \xi$, where $\xi$ is stochastic zero-mean noise. In particular, in Step 6 of Algorithm 1, we obtain independent stochastic function evaluations $\hat{F}$ at the two query points, instead of the exact function values $F$. For unbiased function evaluation oracles with uniformly bounded variance, we have the following theorem.

Theorem 2 (Proof in Appendix C).

Under the condition of Theorem 1, if we further assume that for all , and , then we have

where is a constant, and is the global maximizer of on .

Remark 2.

By setting , , and , the error term (RHS) is at most . The total number of evaluations is at most .

4 Discrete Submodular Maximization

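In this section, we describe how Black-box Continuous Greedy can be used to solve a discrete submodular maximization problem with a general matroid constraint, i.e., $\max_{S \in \mathcal{I}} f(S)$, where $f$ is a monotone submodular set function and $\mathcal{I}$ is a matroid.

For any monotone submodular set function $f: 2^{\Omega} \to \mathbb{R}_{\ge 0}$, its multilinear extension $F: [0,1]^d \to \mathbb{R}_{\ge 0}$, defined as

$$F(x) = \sum_{S \subseteq \Omega} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j),$$

is monotone and DR-submodular (Calinescu et al., 2011). Here, $d = |\Omega|$ is the size of the ground set $\Omega$. Equivalently, we have $F(x) = \mathbb{E}_{S \sim x}[f(S)]$, where $S \sim x$ means that each element $i \in \Omega$ is included in $S$ with probability $x_i$ independently.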
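The two equivalent forms of the multilinear extension can be compared numerically. The sketch below evaluates the sum over all subsets exactly and also estimates the expectation by sampling random sets. The coverage instance `COVER`, the point `x`, and the sample count are hypothetical choices made for illustration.

```python
import itertools
import random

# Hypothetical coverage instance: ground-set element i covers the items in COVER[i].
COVER = [{1, 2}, {2, 3}, {4}]


def f(S):
    """Coverage set function: number of distinct items covered by the indices in S."""
    covered = set()
    for i in S:
        covered |= COVER[i]
    return len(covered)


def multilinear_exact(x):
    """Sum over all 2^d subsets, weighted by their inclusion probabilities."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(x)):
        prob = 1.0
        for bit, xi in zip(mask, x):
            prob *= xi if bit else 1.0 - xi
        S = [i for i, bit in enumerate(mask) if bit]
        total += prob * f(S)
    return total


def multilinear_sampled(x, n_samples, seed=0):
    """Monte Carlo estimate: include element i independently with probability x[i]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        S = [i for i, xi in enumerate(x) if rng.random() < xi]
        total += f(S)
    return total / n_samples


x = [0.5, 0.2, 0.9]
exact = multilinear_exact(x)
approx = multilinear_sampled(x, n_samples=20_000)
print(exact, approx)
```

The exact evaluation costs $2^d$ calls to $f$, while each sampled estimate costs one call per sample, which is what makes the sampled form usable for larger ground sets.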

It can be shown that in lieu of solving the discrete optimization problem $\max_{S \in \mathcal{I}} f(S)$, one can solve the continuous optimization problem $\max_{x \in \mathcal{K}} F(x)$, where $\mathcal{K} = \mathrm{conv}\{\mathbf{1}_I : I \in \mathcal{I}\}$ is the matroid polytope (Calinescu et al., 2011). This equivalence is obtained by showing that (i) the optimal values of the two problems are the same, and (ii) for any fractional vector $x \in \mathcal{K}$ we can deploy efficient, lossless rounding procedures that produce a set $S \in \mathcal{I}$ such that $\mathbb{E}[f(S)] \ge F(x)$ (e.g., pipage rounding (Ageev and Sviridenko, 2004; Calinescu et al., 2011) and contention resolution (Chekuri et al., 2014)). So we can view $F$ as the underlying function that we intend to optimize, and invoke Black-box Continuous Greedy. As a result, we want $F$ to be $G$-Lipschitz and $L$-smooth as in Theorem 1. The following lemma shows these properties are satisfied automatically if $f$ is bounded.

Lemma 3.

For a submodular set function $f$ defined on $\Omega$ with $\sup_{X \subseteq \Omega} |f(X)| \le M$, its multilinear extension $F$ is $2M\sqrt{d}$-Lipschitz and $4M\sqrt{d(d-1)}$-smooth.

We note that the bounds for Lipschitz and smoothness parameters actually depend on the norms that we consider. However, different norms are equivalent up to a factor that may depend on the dimension. If we consider another norm, some dimension factors may be absorbed into the norm. Therefore, we only study the Euclidean norm in Lemma 3.

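We further note that computing the exact value of $F$ is difficult as it requires evaluating $f$ over all the subsets $S \subseteq \Omega$. However, one can construct an unbiased estimate for the value $F(x)$ by simply sampling a random set $S \sim x$ and returning $f(S)$ as the estimate. We present our algorithm in detail in Algorithm 2, where we have $\mathcal{D}'_\delta = [\delta, 1 - \delta]^d$, since the multilinear extension $F$ is defined on $[0,1]^d$, and thus $\mathcal{K}' = (\mathcal{D}'_\delta \cap \mathcal{K}) - \delta \mathbf{1}$. We state the theoretical result formally in Theorem 3.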

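1:  Input: matroid constraint $\mathcal{I}$, transformed constraint set $\mathcal{K}' = (\mathcal{D}'_\delta \cap \mathcal{K}) - \delta \mathbf{1}$ where $\mathcal{K} = \mathrm{conv}\{\mathbf{1}_I : I \in \mathcal{I}\}$, number of iterations $T$, radius $\delta$, step size $\rho_t$, batch size $B_t$, sample size $l_{t,i}$
2:  Output: $X_{T+1}$
3:  $x_1 \leftarrow \mathbf{0}$, $\bar{g}_0 \leftarrow \mathbf{0}$
4:  for $t = 1$ to $T$ do
5:     Sample $u_{t,1}, \ldots, u_{t,B_t}$ i.i.d. from $S^{d-1}$
6:     For $i = 1$ to $B_t$, let $y^{+}_{t,i} \leftarrow \delta \mathbf{1} + x_t + \delta u_{t,i}$, $y^{-}_{t,i} \leftarrow \delta \mathbf{1} + x_t - \delta u_{t,i}$, independently sample subsets $Y^{+}_{t,i}$ and $Y^{-}_{t,i}$ for $l_{t,i}$ times according to $y^{+}_{t,i}$, $y^{-}_{t,i}$, get sampled subsets $Y^{\pm}_{t,i,j}$, evaluate the function values $f(Y^{\pm}_{t,i,j})$, and calculate the averages $\bar{f}^{\pm}_{t,i} \leftarrow \frac{1}{l_{t,i}} \sum_{j=1}^{l_{t,i}} f(Y^{\pm}_{t,i,j})$
7:     $g_t \leftarrow \frac{1}{B_t} \sum_{i=1}^{B_t} \frac{d}{2\delta} [\bar{f}^{+}_{t,i} - \bar{f}^{-}_{t,i}]\, u_{t,i}$
8:     $\bar{g}_t \leftarrow (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, g_t$
9:     $v_t \leftarrow \arg\max_{v \in \mathcal{K}'} \langle v, \bar{g}_t \rangle$
10:    $x_{t+1} \leftarrow x_t + \frac{1}{T} v_t$
11:  end for
12:  Output $X_{T+1}$, obtained by rounding $x_{T+1} + \delta \mathbf{1}$
Algorithm 2 Discrete Black-box Greedy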
Theorem 3 (Proof in Appendix E).

For a monotone submodular set function with , if we set in Algorithm 2, then we have

where , is a constant, is the global maximizer of under matroid constraint .

Remark 3.

By setting , , and , the error term (RHS) is at most . The total number of evaluations is at most .

We note that in Algorithm 2, each sample average computed in Step 6 is an unbiased estimate of the multilinear extension $F$ at the corresponding query point. As a result, we can analyze the algorithm under the framework of stochastic continuous submodular maximization. By applying Theorem 2, Lemma 3, and the facts directly, we can also attain Theorem 3.

5 Experiments

In this section, we will compare Black-box Continuous Greedy (BCG) and Discrete Black-box Greedy (DBG) with the following baselines:

  • Zeroth-Order Gradient Ascent (ZGA) is the projected gradient ascent algorithm equipped with the same two-point gradient estimator as BCG uses. Therefore, it is a zeroth-order projected algorithm.

  • Stochastic Continuous Greedy (SCG) is the state-of-the-art first-order algorithm for maximizing continuous DR-submodular functions Mokhtari et al. (2018a, b). Note that it is a projection-free algorithm.

  • Gradient Ascent (GA) is the first-order projected gradient ascent algorithm Hassani et al. (2017).

The algorithms stop once a given number of iterations is reached. Moreover, the batch sizes in Algorithm 1 and in Algorithm 2 are both 1. Therefore, in the experiments, DBG uses 1 query per iteration while SCG uses queries.

We perform four sets of experiments which are described in detail in the following. The first two sets of experiments are maximization of continuous DR-submodular functions, which Black-box Continuous Greedy is designed to solve. The last two are submodular set maximization problems, to which we apply Discrete Black-box Greedy. The function values at different rounds and the execution times are presented in Figs. 1 and 2. The first-order algorithms (SCG and GA) are marked in orange, and the zeroth-order algorithms are marked in blue.

Figure 1: Function value vs. number of oracle queries. Panels: (a) NQP, (b) Topic summarization, (c) Active set selection, (d) Influence maximization. Note that every chart has dual horizontal axes. Orange lines use the orange horizontal axes above while blue lines use the blue ones below.

Figure 2: Relative running time normalized with respect to BCG (for continuous DR-submodular maximization in the first two sets of experiments) and DBG (for submodular set maximization in the last two sets of experiments). Panels: (a) NQP, (b) Topic summarization, (c) Active set selection, (d) Influence maximization.

Non-convex/non-concave Quadratic Programming (NQP): In this set of experiments, we apply our proposed algorithm and the baselines to the problem of non-convex/non-concave quadratic programming. The objective function is of the form $F(x) = \frac{1}{2} x^\top H x + b^\top x$, where $x$ is a 100-dimensional vector, $H$ is a 100-by-100 matrix, and every component of $H$ is an i.i.d. random variable whose distribution is equal to that of the negated absolute value of a standard normal distribution. The constraints are , , and . To guarantee that the gradient is non-negative, we set . One can observe from Fig. 1(a) that the function value that BCG attains is only slightly lower than that of the first-order algorithm SCG. The final function value that BCG attains is similar to that of ZGA.
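A small instance of such a quadratic objective can be checked numerically for monotonicity and the DR property. The sketch below makes several assumptions that are not taken from the paper: the dimension is reduced to 5, the matrix is symmetrized so that the gradient is simply $Hx + b$, the linear term is chosen as $b = -H\mathbf{1}$, and the domain is the unit cube. Under these assumptions the gradient equals $H(x - \mathbf{1})$, which is non-negative on the cube and non-increasing in $x$.

```python
import random

D = 5  # small hypothetical dimension for illustration
rng = random.Random(0)

# Entries distributed as -|N(0, 1)|; symmetrized (an assumption) so grad F(x) = H x + b.
raw = [[-abs(rng.gauss(0.0, 1.0)) for _ in range(D)] for _ in range(D)]
H = [[0.5 * (raw[i][j] + raw[j][i]) for j in range(D)] for i in range(D)]
b = [-sum(H[i]) for i in range(D)]  # assumed choice b = -H 1


def F(x):
    quad = sum(x[i] * H[i][j] * x[j] for i in range(D) for j in range(D))
    return 0.5 * quad + sum(bi * xi for bi, xi in zip(b, x))


def grad_F(x):
    return [sum(H[i][j] * x[j] for j in range(D)) + b[i] for i in range(D)]


def check_monotone_dr(trials=1000):
    """Check grad F(y) >= 0 and grad F(x) >= grad F(y) for random pairs x <= y."""
    for _ in range(trials):
        x = [rng.random() for _ in range(D)]
        y = [xi + (1.0 - xi) * rng.random() for xi in x]
        gx, gy = grad_F(x), grad_F(y)
        if any(g < -1e-9 for g in gy):
            return False
        if any(a < c - 1e-9 for a, c in zip(gx, gy)):
            return False
    return True


holds = check_monotone_dr()
print(holds)
```

The non-positive off-diagonal and diagonal entries of `H` are what give the DR property here; the function is neither convex nor concave in general.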

Topic Summarization: Next, we consider the topic summarization problem (El-Arini et al., 2009; Yue and Guestrin, 2011), which is to maximize the probabilistic coverage of selected articles on news topics. Each news article is characterized by its topic distribution, which is obtained by applying latent Dirichlet allocation to the corpus of Reuters-21578, Distribution 1.0. The number of topics is set to 10. We will choose from 120 news articles. The probabilistic coverage of a subset of news articles (denoted by ) is defined by , where is the topic distribution of article . The multilinear extension function of is , where  Iyer et al. (2014). The constraint is , , . It can be observed from Fig. 1(b) that the proposed BCG algorithm achieves the same function value as the first-order algorithm SCG and outperforms the other two. As shown in Fig. 2, BCG is the most efficient method. The two projection-free algorithms BCG and SCG run faster than the projected methods ZGA and GA. We will elaborate on the running time later in this section.

Active Set Selection: We study the active set selection problem that arises in Gaussian process regression (Mirzasoleiman et al., 2013). We use the Parkinsons Telemonitoring dataset, which is composed of biomedical voice measurements from people with early-stage Parkinson's disease (Tsanas et al., 2010). Let denote the data matrix. Each row is a voice recording while each column denotes an attribute. The covariance matrix is defined by , where is set to . The objective function of the active set selection problem is defined by , where and is the principal submatrix indexed by . The 22 attributes are partitioned into 5 disjoint subsets with sizes 4, 4, 4, 5 and 5, respectively. The problem is subject to a partition matroid requiring that at most one attribute should be active within each subset. Since this is a submodular set maximization problem, evaluating the gradient (i.e., obtaining an unbiased estimate of the gradient) required by the first-order algorithms SCG and GA needs $2d$ function value queries. To be precise, the $i$-th component of the gradient is $\mathbb{E}_{S \sim x}[f(S \cup \{i\}) - f(S \setminus \{i\})]$ and requires two function value queries. It can be observed from Fig. 1(c) that DBG outperforms the other zeroth-order algorithm ZGA. Although its performance is slightly worse than that of the two first-order algorithms SCG and GA, it requires significantly fewer function value queries than the two first-order methods (as discussed above).

Influence Maximization: In the influence maximization problem, we assume that every node in the network is able to influence all of its one-hop neighbors. The objective of influence maximization is to select a subset of nodes in the network, called the seed set (and denoted by ), so that the total number of influenced nodes, including the seed nodes, is maximized. We choose the social network of Zachary's karate club (Zachary, 1977) in this study. The subjects in this social network are partitioned into three disjoint groups, whose sizes are 10, 14, and 10 respectively. The chosen seed nodes should be subject to a partition matroid; i.e., we will select at most two subjects from each of the three groups. Note that this problem is also a submodular set maximization problem. Similar to the situation in the active set selection problem, first-order algorithms need $2d$ function value queries to obtain an unbiased estimate of the gradient. We can observe from Fig. 1(d) that DBG attains a better influence coverage than the other zeroth-order algorithm ZGA. Again, even though SCG and GA achieve a slightly better coverage, due to their first-order nature, they require a significantly larger number of function value queries.
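The one-hop influence objective is a coverage function, and its diminishing-returns property can be verified by brute force on a toy instance. The graph, the two-group partition, and the per-group limit below are hypothetical and much smaller than the karate club network; the exhaustive search over feasible seed sets only serves as a reference on this toy instance and is not one of the algorithms compared in the experiments.

```python
import itertools

# Hypothetical undirected graph given as adjacency lists (not the karate club data).
NEIGHBORS = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
GROUPS = [[0, 1], [2, 3, 4]]  # hypothetical partition of the nodes
PER_GROUP = 1                 # hypothetical limit: at most one seed per group


def influence(S):
    """Number of nodes that are seeds or one-hop neighbors of a seed."""
    reached = set(S)
    for node in S:
        reached.update(NEIGHBORS[node])
    return len(reached)


def is_independent(S):
    """Partition matroid: at most PER_GROUP seeds from each group."""
    return all(sum(1 for n in S if n in group) <= PER_GROUP for group in GROUPS)


def has_diminishing_returns():
    """Check f(A + x) - f(A) >= f(B + x) - f(B) for all A subset of B and x not in B."""
    nodes = list(NEIGHBORS)
    subsets = [set(c) for r in range(len(nodes) + 1)
               for c in itertools.combinations(nodes, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for x in nodes:
                if x in B:
                    continue
                gain_A = influence(A | {x}) - influence(A)
                gain_B = influence(B | {x}) - influence(B)
                if gain_A < gain_B:
                    return False
    return True


feasible = [S for r in range(len(NEIGHBORS) + 1)
            for S in itertools.combinations(NEIGHBORS, r) if is_independent(S)]
best = max(feasible, key=influence)
print(has_diminishing_returns(), best, influence(best))
```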

Running Time

The running times of our proposed algorithms and the baselines are presented in Fig. 2 for the above-mentioned experimental set-ups. There are two main conclusions. First, the two projection-based algorithms (ZGA and GA) require significantly more running time than the projection-free algorithms (BCG, DBG, and SCG), as the projection-based algorithms require solving quadratic optimization problems whereas projection-free ones require solving linear optimization problems, which can be solved more efficiently. Second, when we compare first-order and zeroth-order algorithms, we can observe that zeroth-order algorithms (BCG, DBG, and ZGA) run faster than their first-order counterparts (SCG and GA).


The above experimental results show the following major advantages of our method over the baselines including SCG and ZGA.

  • BCG/DBG is at least twice as fast as SCG and ZGA in all tasks in terms of running time (Figs. 2(a), 2(b), 2(c) and 2(d)).

  • DBG requires remarkably fewer function evaluations in the discrete setting (Figs. 1(c) and 1(d)).

  • In addition to saving function evaluations, BCG/DBG achieves an objective function value comparable to that of the first-order baselines SCG and GA.

Furthermore, we note that the number of first-order queries required by SCG is only half the number required by BCG. However, as is shown in Figs. 2(a) and 2(b), BCG runs significantly faster than SCG since a zeroth-order evaluation is faster than a first-order one.

In the topic summarization task (Fig. 1(b)), BCG exhibits a similar performance to that of the first-order baselines SCG and GA, in terms of the attained objective function value. In the other three tasks, BCG/DBG runs notably faster while achieving only a slightly inferior function value. Therefore, BCG/DBG is particularly preferable in large-scale machine learning tasks and in applications where the total number of function evaluations or the running time is subject to a budget.

6 Conclusion

In this paper, we presented Black-box Continuous Greedy, a derivative-free and projection-free algorithm for maximizing a monotone and continuous DR-submodular function subject to a general convex body constraint. We showed that Black-box Continuous Greedy achieves the tight $[(1-1/e)OPT - \epsilon]$ approximation guarantee with $O(d/\epsilon^3)$ function evaluations. We then extended the algorithm to the stochastic continuous setting and the discrete submodular maximization problem. Our experiments on both synthetic and real data validated the performance of our proposed algorithms. In particular, we observed that Black-box Continuous Greedy practically achieves the same utility as Continuous Greedy while being far more efficient in terms of the number of function evaluations.


LC is supported by the Google PhD Fellowship. HH is supported by AFOSR Award 19RT0726, NSF HDR TRIPODS award 1934876, NSF award CPS-1837253, NSF award CIF-1910056, and NSF CAREER award CIF-1943064. AK is partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), and AFOSR (FA9550-18-1-0160).

Appendix A Proof of Lemma 1


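Using the assumption that $F$ is $G$-Lipschitz continuous, we have

$$|\tilde{F}(x) - \tilde{F}(y)| = \left|\mathbb{E}_{v \sim B^d}[F(x + \delta v) - F(y + \delta v)]\right| \le \mathbb{E}_{v \sim B^d}\left[|F(x + \delta v) - F(y + \delta v)|\right] \le G \|x - y\|,$$

and

$$|\tilde{F}(x) - F(x)| = \left|\mathbb{E}_{v \sim B^d}[F(x + \delta v) - F(x)]\right| \le \mathbb{E}_{v \sim B^d}[G \delta \|v\|] \le \delta G.$$

If $F$ is $G$-Lipschitz continuous and monotone continuous DR-submodular, then $F$ is differentiable. For $x \le y$, we also have

$$\nabla F(x) \ge \nabla F(y), \qquad F(x) \le F(y).$$

By definition of $\tilde{F}$, we have $\tilde{F}$ is differentiable and for $x \le y$,

$$\nabla \tilde{F}(x) - \nabla \tilde{F}(y) = \mathbb{E}_{v \sim B^d}[\nabla F(x + \delta v) - \nabla F(y + \delta v)] \ge 0,$$

$$\tilde{F}(x) - \tilde{F}(y) = \mathbb{E}_{v \sim B^d}[F(x + \delta v) - F(y + \delta v)] \le 0,$$

i.e., $\nabla \tilde{F}(x) \ge \nabla \tilde{F}(y)$ and $\tilde{F}(x) \le \tilde{F}(y)$. So $\tilde{F}$ is also a monotone continuous DR-submodular function. ∎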

Appendix B Proof of Theorem 1

In order to prove Theorem 1, we need the following variance reduction lemmas [Shamir, 2017, Chen et al., 2018b], where the second one is a slight improvement of Lemma 2 in [Mokhtari et al., 2018a] and Lemma 5 in [Mokhtari et al., 2018b].

Lemma 4 (Lemma 10 of [Shamir, 2017]).

It holds that


where is a constant.

Lemma 5 (Theorem 3 of [Chen et al., 2018b]).

Let be a sequence of points in such that for all with fixed constants and . Let be a sequence of random variables such that and for every , where is the -field generated by and . Let be a sequence of random variables where is fixed and subsequent are obtained by the recurrence


with . Then, we have


where .

Now we turn to prove Theorem 1.

Proof of Theorem 1.

First of all, we note that technically we need the iteration number , which always holds in practical applications.

Then we show that , . By the definition of , we have . Since ’s are non-negative vectors, we know that ’s are also non-negative vectors and that . It suffices to show that . Since is a convex combination of and ’s are in , we conclude that . In addition, since ’s are also in , is also in . Therefore our final choice resides in the constraint .

Let and the shrunk domain (without translation) . By Jensen’s inequality and the fact has -Lipschitz continuous gradients, we have




Let . Since , we have . We know and


Since we assume that is monotone continuous DR-submodular, by Lemma 1, is also monotone continuous DR-submodular. As a result, is concave along non-negative directions, and is entry-wise non-negative. Thus we have


Since , we deduce


Therefore, we obtain