Black Box Submodular Maximization: Discrete and Continuous Settings
Abstract
In this paper, we consider the problem of black-box continuous submodular maximization, where we only have access to function values and no information about the derivatives is provided. For a monotone continuous DR-submodular function subject to a bounded convex body constraint, we propose Black-box Continuous Greedy, a derivative-free algorithm that provably achieves the tight $(1-1/e)$ approximation guarantee using only function evaluations. We then extend our result to the stochastic setting, where function values are subject to stochastic zero-mean noise. It is through this stochastic generalization that we revisit the discrete submodular maximization problem and use the multilinear extension as a bridge between the discrete and continuous settings. Finally, we extensively evaluate the performance of our algorithms on continuous and discrete submodular objective functions, using both synthetic and real data.
1 Introduction
Black-box optimization, also known as zeroth-order or derivative-free optimization, refers to optimizing an objective function using only its function values, without access to derivatives. Fueled by a growing number of machine learning applications, black-box optimization methods are usually considered in scenarios where gradients (i.e., first-order information) are 1) difficult or slow to compute, e.g., graphical model inference (Wainwright et al., 2008) and structured prediction (Taskar et al., 2005; Sokolov et al., 2016), or 2) inaccessible, e.g., hyperparameter tuning for natural language processing or image classification (Snoek et al., 2012; Thornton et al., 2013) and black-box attacks for finding adversarial examples (Chen et al., 2017c; Ilyas et al., 2018). Even though heuristics such as random or grid search, with undesirable dependencies on the dimension, are still used in some applications (e.g., parameter tuning for deep networks), there has been a growing number of rigorous methods that address the convergence rate of black-box optimization in convex and nonconvex settings (Wang et al., 2017; Balasubramanian and Ghadimi, 2018; Sahu et al., 2018).
The focus of this paper is constrained continuous DR-submodular maximization over a bounded convex body. We aim to design an algorithm that uses only zeroth-order information while avoiding expensive projection operations. Note that one way optimization methods can deal with constraints is to apply a projection oracle once the proposed iterates land outside the feasible region. However, computing the projection in many constrained settings is computationally prohibitive (e.g., projection onto bounded trace-norm matrices, the flow polytope, the matroid polytope, or rotation matrices). In such scenarios, projection-free algorithms, a.k.a. Frank-Wolfe methods (Frank and Wolfe, 1956), replace the projection with a linear program. Indeed, our proposed algorithm efficiently combines zeroth-order information with the solution of a series of linear programs to ensure convergence to a near-optimal solution.
Continuous DR-submodular functions are an important subset of nonconvex functions that can be minimized exactly (Bach, 2016; Staib and Jegelka, 2017) and maximized approximately (Bian et al., 2017a, b; Hassani et al., 2017; Mokhtari et al., 2018a; Hassani et al., 2019; Zhang et al., 2019b). This class of functions generalizes the notion of diminishing returns, usually defined over discrete set functions, to continuous domains. They have found numerous applications in machine learning, including MAP inference in determinantal point processes (DPPs) (Kulesza et al., 2012), experimental design (Chen et al., 2018c), resource allocation (Eghbali and Fazel, 2016), and mean-field inference in probabilistic models (Bian et al., 2018), among many others.
Motivation: Computing the gradient of a continuous DR-submodular function has been shown to be computationally prohibitive (or even intractable) in many applications. For example, the objective function of influence maximization is defined via specific stochastic processes (Kempe et al., 2003; Rodriguez and Schölkopf, 2012), and computing/estimating the gradient of the multilinear extension would require a relatively high computational complexity. In the problem of D-optimal experimental design, the gradient of the objective function involves the inversion of a potentially large matrix (Chen et al., 2018c). Moreover, when one attacks a submodular recommender model, only black-box information is available, and the service provider is unlikely to provide additional first-order information (this is known as the black-box adversarial attack model) (Lei et al., 2019).
There has been very recent progress on developing zeroth-order methods for constrained optimization problems in convex and nonconvex settings (Ghadimi and Lan, 2013; Sahu et al., 2018). Such methods typically assume the objective function is defined on the whole ambient space $\mathbb{R}^d$ so that they can sample points from a proper distribution defined on $\mathbb{R}^d$. For DR-submodular functions, this assumption might be unrealistic, since many DR-submodular functions are only defined on a subset of $\mathbb{R}^d$; e.g., the multilinear extension (Vondrák, 2008), a canonical example of a DR-submodular function, is only defined on the unit cube. Moreover, such methods can only guarantee convergence to a first-order stationary point. However, Hassani et al. (2017) showed that for a monotone DR-submodular function, stationary points can only guarantee a $1/2$ approximation to the optimum. Therefore, if a state-of-the-art zeroth-order nonconvex algorithm is used for maximizing a monotone DR-submodular function, it is likely to terminate at a suboptimal stationary point whose approximation ratio is only $1/2$.
Our contributions: In this paper, we propose a derivative-free and projection-free algorithm, Black-box Continuous Greedy (BCG), that maximizes a monotone continuous DR-submodular function over a bounded convex body. We consider three scenarios:

- In the deterministic setting, where function evaluations can be obtained exactly, BCG achieves the tight $(1-1/e)$ approximation guarantee using only function evaluations.
- In the stochastic setting, where function evaluations are noisy, BCG achieves the tight $(1-1/e)$ approximation guarantee using only noisy function evaluations.
- In the discrete setting, Discrete Black-box Greedy (DBG) achieves the tight $(1-1/e)$ approximation guarantee using only function evaluations.
Function | Additional Assumptions | Function Queries
continuous DR-submodular | monotone, Lip., smooth | [Theorem 1]
stoch. continuous DR-submodular | monotone, Lip., smooth | [Theorem 2]
discrete submodular | monotone | [Theorem 3]
All the theoretical results are summarized in Table 1.
We would like to note that in the discrete setting, due to the conservative upper bounds on the Lipschitz and smoothness parameters of general multilinear extensions, and the variance of the gradient estimators subject to noisy function evaluations, the number of function queries required in theory is larger than the best known result of Mokhtari et al. (2018a, b). However, our experiments show that, empirically, our proposed algorithm often requires significantly fewer function evaluations and less running time, while achieving a practically similar utility.
Novelty of our work: All previous results in constrained DR-submodular maximization assume access to (stochastic) gradients. In this work, we address a harder problem, i.e., we provide the first rigorous analysis when only (stochastic) function values can be obtained. More specifically, with the smoothing trick (Flaxman et al., 2005), one can construct an unbiased gradient estimator via function queries. However, this estimator has a large variance, which may cause FW-type methods to diverge. To overcome this issue, we build on the momentum method proposed by Mokhtari et al. (2018a), in which access to first-order information was assumed.
Given a point $x$, the smoothed version of $f$ at $x$ is defined as $\tilde{f}(x) = \mathbb{E}_{v \sim B^d}[f(x + \delta v)]$, where $v$ is drawn uniformly from the unit ball $B^d$. If $x$ is close to the boundary of the domain, the point $x + \delta v$ may fall outside of the domain, leaving the smoothed function undefined for many instances of DR-submodular functions (e.g., the multilinear extension is only defined over the unit cube). Thus the vanilla smoothing trick will not work. To this end, we transform the domain and the constraint set in a proper way and run our zeroth-order method on the transformed constraint set. Importantly, we retrieve the same convergence rate as in Mokhtari et al. (2018a) with a minimal number of function queries in the different settings (continuous, stochastic continuous, and discrete).
We further note that by using more recent variance reduction techniques (Zhang et al., 2019b), one might be able to reduce the required number of function evaluations.
1.1 Further Related Work
Submodular functions (Nemhauser et al., 1978), which capture the intuitive notion of diminishing returns, have become increasingly important in various machine learning applications. Examples include graph cuts in computer vision (Jegelka and Bilmes, 2011a, b), data summarization (Lin and Bilmes, 2011b, a; Tschiatschek et al., 2014; Chen et al., 2018a, 2017b), influence maximization (Kempe et al., 2003; Rodriguez and Schölkopf, 2012; Zhang et al., 2016), feature compression (Bateni et al., 2019), network inference (Chen et al., 2017a), active and semi-supervised learning (Guillory and Bilmes, 2010; Golovin and Krause, 2011; Wei et al., 2015), crowd teaching (Singla et al., 2014), dictionary learning (Das and Kempe, 2011), fMRI parcellation (Salehi et al., 2017), compressed sensing and structured sparsity (Bach, 2010; Bach et al., 2012), fairness in machine learning (Balkanski and Singer, 2015; Celis et al., 2016), and learning causal structures (Steudel et al., 2010; Zhou and Spanos, 2016), to name a few. Continuous DR-submodular functions naturally extend the notion of diminishing returns to continuous domains (Bian et al., 2017b). Monotone continuous DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods (Bian et al., 2017b; Hassani et al., 2017; Mokhtari et al., 2018a). Bandit maximization of monotone continuous DR-submodular functions (Zhang et al., 2019a) is a setting closely related to ours. However, to the best of our knowledge, none of the existing works has developed a zeroth-order algorithm for maximizing a monotone continuous DR-submodular function. For a detailed review of derivative-free and black-box optimization, we refer interested readers to the book by Audet and Hare (2017).
2 Preliminaries
Submodular Functions
We say a set function $f: 2^{\Omega} \to \mathbb{R}$, defined on a ground set $\Omega$, is submodular if it satisfies the diminishing returns property: for any $A \subseteq B \subseteq \Omega$ and $e \in \Omega \setminus B$, we have
(1) $f(A \cup \{e\}) - f(A) \ge f(B \cup \{e\}) - f(B)$.
In words, the marginal gain of adding an element $e$ to a subset $A$ is no less than that of adding $e$ to its superset $B$.
For the continuous analogue, consider a function $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} = \prod_{i=1}^{d} \mathcal{X}_i$ and each $\mathcal{X}_i$ is a compact subset of $\mathbb{R}$. We define $f$ to be continuous submodular if $f$ is continuous and for all $x, y \in \mathcal{X}$, we have
(2) $f(x) + f(y) \ge f(x \vee y) + f(x \wedge y)$,
where $\vee$ and $\wedge$ are the component-wise maximizing and minimizing operators, respectively.
The continuous function $f$ is called DR-submodular (Bian et al., 2017b) if $f$ is differentiable and $\nabla f(x) \ge \nabla f(y)$ for all $x \le y$. An important implication of DR-submodularity is that the function $f$ is concave along any nonnegative direction, i.e., for $x \le y$, we have
(3) $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle$.
The function $f$ is called monotone if for $x \le y$, we have $f(x) \le f(y)$.
Smoothing Trick
For a function $f$ defined on $\mathcal{X}$, its smoothed version $\tilde{f}_\delta$ is given as
(4) $\tilde{f}_\delta(x) = \mathbb{E}_{v \sim B^d}\,[f(x + \delta v)]$,
where $v$ is chosen uniformly at random from the $d$-dimensional unit ball $B^d$. In words, the smoothed function at any point $x$ is obtained by “averaging” $f$ over a ball of radius $\delta$ around $x$. In the sequel, we omit the subscript $\delta$ for the sake of simplicity and use $\tilde{f}$ instead of $\tilde{f}_\delta$.
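To make the construction concrete, here is a minimal pure-Python sketch of a Monte Carlo approximation of the smoothed function; the helper names `sample_unit_ball` and `smoothed_value` are ours, not from the paper.

```python
import math
import random

def sample_unit_ball(d, rng):
    """Draw v uniformly at random from the d-dimensional unit ball B^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    r = rng.random() ** (1.0 / d)  # radius with density proportional to r^(d-1)
    return [r * x / norm for x in g]

def smoothed_value(f, x, delta, n_samples, seed=0):
    """Monte Carlo estimate of the smoothed function
    f_delta(x) = E_{v ~ B^d}[ f(x + delta * v) ]."""
    rng = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        v = sample_unit_ball(d, rng)
        total += f([xi + delta * vi for xi, vi in zip(x, v)])
    return total / n_samples
```

For a linear $f$, the symmetry of the ball makes the smoothed value coincide with $f(x)$, which gives a quick sanity check of the sampler.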
Lemma 1 below shows that under the Lipschitz assumption for $f$, the smoothed version $\tilde{f}$ is a good approximation of $f$, and also inherits the key structural properties of $f$ (such as monotonicity and DR-submodularity). Thus one can (approximately) optimize $f$ via optimizing $\tilde{f}$.
Lemma 1 (Proof in Appendix A).
If $f$ is monotone continuous DR-submodular and $L$-Lipschitz continuous on $\mathcal{X}$, then so is $\tilde{f}$, and
(5) $|\tilde{f}(x) - f(x)| \le \delta L$ wherever $\tilde{f}$ is defined.
An important property of $\tilde{f}$ is that one can obtain an unbiased estimate of its gradient by a single query of $f$. This property plays a key role in our proposed derivative-free algorithms.
Lemma 2 (Lemma 6.5 in (Hazan, 2016)).
Given a function $f$ on $\mathcal{X}$, if we choose $u$ uniformly at random from the $d$-dimensional unit sphere $S^{d-1}$, then we have
(6) $\mathbb{E}_u\big[\tfrac{d}{\delta}\, f(x + \delta u)\, u\big] = \nabla \tilde{f}_\delta(x)$.
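A sketch of this one-point gradient estimator, using the standard Gaussian-normalization sampler for the unit sphere (all function names are illustrative):

```python
import math
import random

def sample_unit_sphere(d, rng):
    """Draw u uniformly at random from the unit sphere S^{d-1}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def one_point_gradient(f, x, delta, rng):
    """Single-sample estimator (d/delta) * f(x + delta*u) * u, whose
    expectation is the gradient of the delta-smoothed version of f."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    fx = f([xi + delta * ui for xi, ui in zip(x, u)])
    return [(d / delta) * fx * ui for ui in u]
```

Averaging many such samples for a linear function recovers its gradient, since $\mathbb{E}[u u^\top] = I/d$ on the sphere.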
3 DRSubmodular Maximization
In this paper, we mainly focus on the constrained optimization problem:
(7) $\max_{x \in \mathcal{K}} f(x)$,
where $f$ is a monotone continuous DR-submodular function on $\mathcal{X}$, and the constraint set $\mathcal{K} \subseteq \mathcal{X}$ is convex and compact.
For first-order monotone DR-submodular maximization, one can use Continuous Greedy (Calinescu et al., 2011; Bian et al., 2017b), a variant of the Frank-Wolfe algorithm (Frank and Wolfe, 1956; Jaggi, 2013; Lacoste-Julien and Jaggi, 2015), to achieve the tight $(1-1/e)$ approximation guarantee. At iteration $t$, the FW variant first maximizes the linearization of the objective function:
(8) $v_t = \arg\max_{v \in \mathcal{K}} \langle v, \nabla f(x_t) \rangle$.
Then the current point moves in the direction of $v_t$ with a step size $1/T$:
(9) $x_{t+1} = x_t + \tfrac{1}{T}\, v_t$.
Hence, by solving linear optimization problems, the iterates are updated without resorting to a projection oracle.
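As an illustration of updates (8)-(9), the following sketch runs the Frank-Wolfe style iterations with a toy linear-maximization oracle for the polytope $\{v \in [0,1]^d : \sum_i v_i \le k\}$; this particular oracle and the exact-gradient interface are our own simplifying assumptions, not the paper's general setting.

```python
def lp_oracle(grad, k):
    """argmax_{v in C} <v, grad> for C = {v in [0,1]^d : sum(v) <= k}:
    put mass 1 on (up to) the k coordinates with the largest positive gradient."""
    idx = sorted(range(len(grad)), key=lambda i: grad[i], reverse=True)
    v = [0.0] * len(grad)
    for i in idx[:k]:
        if grad[i] > 0:
            v[i] = 1.0
    return v

def continuous_greedy(grad_f, d, k, T):
    """Frank-Wolfe style updates x_{t+1} = x_t + v_t / T, starting at 0.
    The final point is the average of the oracle outputs, hence feasible."""
    x = [0.0] * d
    for _ in range(T):
        v = lp_oracle(grad_f(x), k)
        x = [xi + vi / T for xi, vi in zip(x, v)]
    return x
```

With a constant gradient (a linear objective), the oracle repeatedly picks the same top-$k$ coordinates, so the iterate converges to their indicator vector.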
Here we introduce our main algorithm, Black-box Continuous Greedy, which assumes access only to function values (i.e., zeroth-order information). This algorithm is partially based on the idea of Continuous Greedy. The basic idea is to utilize function evaluations of $f$ at carefully selected points to obtain unbiased estimates of the gradient of the smoothed version $\tilde{f}$. By extending Continuous Greedy to the derivative-free setting and using recently proposed variance reduction techniques, we can then optimize $\tilde{f}$ near-optimally. Finally, by Lemma 1, we show that the obtained maximizer of $\tilde{f}$ also provides a good solution for $f$.
Recall that continuous DR-submodular functions are defined on a box $\mathcal{X}$. To simplify the exposition, we can assume, without loss of generality, that the objective function is defined on $[0,1]^d$ (Bian et al., 2017a). Moreover, we note that for $x$ close to the boundary of $[0,1]^d$, the point $x + \delta v$ may fall outside of $[0,1]^d$, leaving the function $f$ undefined.
To circumvent this issue, we shrink the domain by $\delta$. Precisely, the shrunk domain is defined as
(10) $\mathcal{X}_\delta = \{x \in [0,1]^d : \delta \le x_i \le 1 - \delta,\ \forall i\}$.
Since we assume $\delta < 1/2$, the shrunk domain $\mathcal{X}_\delta$ is nonempty. Then for all $x \in \mathcal{X}_\delta$ and all $v \in B^d$, we have $x + \delta v \in [0,1]^d$. So $\tilde{f}$ is well-defined on $\mathcal{X}_\delta$. By Lemma 1, the optimum of $\tilde{f}$ on the shrunk domain will be close to that of $f$ on the original domain $[0,1]^d$, if $\delta$ is small enough. Therefore, we can first optimize $\tilde{f}$ on $\mathcal{X}_\delta$, and thereby approximately optimize $f$ on $[0,1]^d$. For simplicity of analysis, we also translate the shrunk domain by $-\delta \mathbf{1}$ so that it contains the origin.
Besides the domain, we also need to consider the transformation of the constraint set $\mathcal{K}$. Intuitively, if there were no translation, we should consider the intersection of $\mathcal{K}$ and the shrunk domain $\mathcal{X}_\delta$. But since we translate the domain by $-\delta \mathbf{1}$, the same transformation should be performed on the constraint set. Thus, we define the transformed constraint set $\mathcal{K}'$ as the translated intersection of $\mathcal{K}$ and $\mathcal{X}_\delta$:
(11) $\mathcal{K}' = (\mathcal{K} \cap \mathcal{X}_\delta) - \delta \mathbf{1}$.
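The domain/constraint transformation of (10)-(11) can be sketched with membership oracles; the box-with-budget constraint used below is a hypothetical example, not one of the paper's benchmarks.

```python
def shrunk_domain_member(x, delta):
    """Test x in X_delta = {x in [0,1]^d : delta <= x_i <= 1 - delta}."""
    return all(delta <= xi <= 1.0 - delta for xi in x)

def transformed_member(is_in_K, delta):
    """Membership oracle for K' = (K intersect X_delta) - delta * 1:
    y lies in K' iff y + delta*1 lies in both K and the shrunk domain."""
    def member(y):
        x = [yi + delta for yi in y]
        return is_in_K(x) and shrunk_domain_member(x, delta)
    return member
```

For example, with $\mathcal{K} = \{x \in [0,1]^2 : x_1 + x_2 \le 1\}$ and $\delta = 0.1$, the origin belongs to $\mathcal{K}'$ (it maps back to $(0.1, 0.1)$), while points whose preimage leaves the shrunk box or violates the budget do not.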
It is well known that the FW algorithm is sensitive to the accuracy of the gradient, and may have arbitrarily poor performance with stochastic gradients (Hazan and Luo, 2016; Mokhtari et al., 2018b). Thus we incorporate two variance-reduction methods into our proposed algorithm Black-box Continuous Greedy, corresponding to Step 7 and Step 8 in Algorithm 1, respectively. First, instead of the one-point gradient estimator in Lemma 2, we adopt the two-point estimator of (Agarwal et al., 2010; Shamir, 2017):
(12) $g = \tfrac{d}{2\delta}\,\big[f(x + \delta u) - f(x - \delta u)\big]\, u$,
where $u$ is chosen uniformly at random from the unit sphere $S^{d-1}$. We note that (12) is an unbiased estimator of the smoothed gradient with lower variance than the one-point estimator. We also average over a mini-batch of independently sampled two-point estimators for further variance reduction. The second variance-reduction technique is the momentum method used in (Mokhtari et al., 2018a), which estimates the gradient by a vector $\bar{g}_t$ updated at each iteration as follows:
(13) $\bar{g}_t = (1 - \rho_t)\, \bar{g}_{t-1} + \rho_t\, g_t$.
Here $\rho_t$ is a given step size, $\bar{g}_0$ is initialized as the all-zero vector $\mathbf{0}$, and $g_t$ is an unbiased estimate of the gradient at iterate $x_t$. As $\bar{g}_t$ is a weighted average of the previous gradient approximation $\bar{g}_{t-1}$ and the newly updated stochastic gradient $g_t$, it has a lower variance than $g_t$. Although $\bar{g}_t$ is not an unbiased estimate of the true gradient, its error approaches zero as time proceeds. The detailed description of Black-box Continuous Greedy is provided in Algorithm 1.
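A minimal sketch of the two variance-reduction ingredients, the two-point estimator (12) and the momentum update (13); function names are ours.

```python
import math
import random

def sample_unit_sphere(d, rng):
    """Draw u uniformly at random from the unit sphere S^{d-1}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(v * v for v in g))
    return [v / n for v in g]

def two_point_gradient(f, x, delta, rng):
    """Estimator (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    an unbiased estimate of the smoothed gradient with lower variance
    than the one-point version."""
    d = len(x)
    u = sample_unit_sphere(d, rng)
    fp = f([xi + delta * ui for xi, ui in zip(x, u)])
    fm = f([xi - delta * ui for xi, ui in zip(x, u)])
    return [(d / (2.0 * delta)) * (fp - fm) * ui for ui in u]

def momentum_update(g_bar, g, rho):
    """Averaged direction: g_bar <- (1 - rho) * g_bar + rho * g."""
    return [(1.0 - rho) * gb + rho * gi for gb, gi in zip(g_bar, g)]
```

For a linear function the difference $f(x+\delta u) - f(x-\delta u)$ cancels the value at $x$ entirely, which is exactly why the two-point form is less noisy than the one-point estimator.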
Theorem 1 (Proof in Appendix B).
For a monotone continuous DR-submodular function $f$, which is also Lipschitz continuous and smooth on a convex and compact constraint set $\mathcal{K}$, with appropriately chosen step sizes in Algorithm 1, we have
where $c$ is a constant and $x^*$ is the global maximizer of $f$ on $\mathcal{K}$.
Remark 1.
By choosing the smoothing radius, the number of iterations, and the batch size appropriately, the error term (the RHS) is guaranteed to be at most $\epsilon$ for any desired $\epsilon > 0$. The total number of function evaluations is bounded as summarized in Table 1.
We can also extend Algorithm 1 to the stochastic case, in which we obtain information about $f$ only through noisy function evaluations $f(x) + \xi$, where $\xi$ is stochastic zero-mean noise. In particular, in Step 6 of Algorithm 1, we obtain independent stochastic function evaluations at the points $x + \delta u$ and $x - \delta u$, instead of the exact function values. For unbiased function evaluation oracles with uniformly bounded variance, we have the following theorem.
Theorem 2 (Proof in Appendix C).
Under the conditions of Theorem 1, if we further assume that the function evaluation oracle is unbiased with uniformly bounded variance, then we have
where $c$ is a constant and $x^*$ is the global maximizer of $f$ on $\mathcal{K}$.
Remark 2.
By choosing the smoothing radius, the number of iterations, and the batch size appropriately, the error term (the RHS) is at most $\epsilon$. The total number of evaluations is bounded as summarized in Table 1.
4 Discrete Submodular Maximization
In this section, we describe how Black-box Continuous Greedy can be used to solve a discrete submodular maximization problem with a general matroid constraint, i.e., $\max_{S \in \mathcal{I}} f(S)$, where $f$ is a monotone submodular set function and $\mathcal{M} = (\Omega, \mathcal{I})$ is a matroid.
For any monotone submodular set function $f$, its multilinear extension $F: [0,1]^d \to \mathbb{R}$, defined as
(14) $F(x) = \sum_{S \subseteq \Omega} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j)$,
is monotone and DR-submodular (Calinescu et al., 2011). Here, $d = |\Omega|$ is the size of the ground set $\Omega$. Equivalently, we have $F(x) = \mathbb{E}[f(R_x)]$, where $R_x$ is a random set in which each element $i$ is included independently with probability $x_i$.
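The sampling view $F(x) = \mathbb{E}[f(R_x)]$ suggests a simple unbiased estimator; the sketch below also includes a brute-force evaluation of (14) for sanity checks on tiny ground sets (helper names are illustrative).

```python
import itertools
import random

def multilinear_estimate(f, x, n_samples, seed=0):
    """Unbiased Monte Carlo estimate of F(x) = E[f(R_x)], where R_x
    includes each element i independently with probability x_i."""
    rng = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        R = {i for i in range(d) if rng.random() < x[i]}
        total += f(R)
    return total / n_samples

def multilinear_exact(f, x):
    """Exact F(x) by enumerating all subsets (feasible only for small d)."""
    d = len(x)
    val = 0.0
    for S in itertools.chain.from_iterable(
            itertools.combinations(range(d), r) for r in range(d + 1)):
        p = 1.0
        for i in range(d):
            p *= x[i] if i in S else 1.0 - x[i]
        val += f(set(S)) * p
    return val
```

For instance, with the monotone submodular toy function $f(S) = \min(|S|, 2)$ and $x = (0.5, 0.5, 0.5)$, the exact value is $3/8 \cdot 1 + 3/8 \cdot 2 + 1/8 \cdot 2 = 1.375$, and the sampler concentrates around the same value.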
It can be shown that, in lieu of solving the discrete optimization problem, one can solve the continuous optimization problem $\max_{x \in \mathcal{P}} F(x)$, where $\mathcal{P}$ is the matroid polytope (Calinescu et al., 2011). This equivalence is obtained by showing that (i) the optimal values of the two problems are the same, and (ii) for any fractional vector $x \in \mathcal{P}$ we can deploy efficient, lossless rounding procedures that produce a set $S$ with $f(S) \ge F(x)$ (e.g., pipage rounding (Ageev and Sviridenko, 2004; Calinescu et al., 2011) and contention resolution (Chekuri et al., 2014)). So we can view $F$ as the underlying function that we intend to optimize, and invoke Black-box Continuous Greedy. As a result, we need $F$ to be Lipschitz and smooth, as in Theorem 1. The following lemma shows that these properties are satisfied automatically if $f$ is bounded.
Lemma 3.
For a submodular set function $f$ with bounded function values, its multilinear extension $F$ is Lipschitz and smooth.
We note that the bounds on the Lipschitz and smoothness parameters depend on the norms under consideration. However, different norms are equivalent up to a factor that may depend on the dimension, so if we consider another norm, some dimension factors may be absorbed into the norm. Therefore, we only study the Euclidean norm in Lemma 3.
We further note that computing the exact value of $F(x)$ is difficult, as it requires evaluating $f$ over all subsets $S \subseteq \Omega$. However, one can construct an unbiased estimate of $F(x)$ by simply sampling a random set $R_x$ and returning $f(R_x)$ as the estimate. We present our algorithm in detail in Algorithm 2, where the shrinking construction of Section 3 applies with the domain $[0,1]^d$, since $F$ is defined on the unit cube. We state the theoretical result formally in Theorem 3.
Theorem 3 (Proof in Appendix E).
For a monotone submodular set function $f$ with bounded function values, with appropriately chosen step sizes in Algorithm 2, we have
where $c$ is a constant and $S^*$ is the global maximizer of $f$ under the matroid constraint $\mathcal{I}$.
Remark 3.
By choosing the smoothing radius, the number of iterations, and the batch size appropriately, the error term (the RHS) is at most $\epsilon$. The total number of evaluations is bounded as summarized in Table 1.
We note that in Algorithm 2, the sampled value $f(R_x)$ is an unbiased estimate of $F(x)$, and the same holds for the estimates at the perturbed points. As a result, we can analyze the algorithm within the framework of stochastic continuous submodular maximization: by applying Theorem 2 and Lemma 3 directly, we can also obtain Theorem 3.
5 Experiments
In this section, we compare Black-box Continuous Greedy (BCG) and Discrete Black-box Greedy (DBG) with the following baselines:
- Zeroth-Order Gradient Ascent (ZGA) is the projected gradient ascent algorithm equipped with the same two-point gradient estimator that BCG uses. Therefore, it is a zeroth-order projected algorithm.
- Gradient Ascent (GA) is the first-order projected gradient ascent algorithm of Hassani et al. (2017).
The stopping criterion for all algorithms is reaching a given number of iterations. Moreover, the batch sizes in Algorithm 1 and Algorithm 2 are both set to 1. Therefore, in the experiments, DBG uses one function query per iteration, while SCG requires substantially more queries per iteration to estimate the gradient.
We perform four sets of experiments, described in detail in the following. The first two are maximization problems for continuous DR-submodular functions, which Black-box Continuous Greedy is designed to solve. The last two are submodular set maximization problems, to which we apply Discrete Black-box Greedy. The function values at different rounds and the execution times are presented in Figs. 1 and 2, respectively. The first-order algorithms (SCG and GA) are marked in orange, and the zeroth-order algorithms are marked in blue.
Non-convex/non-concave Quadratic Programming (NQP): In this set of experiments, we apply our proposed algorithm and the baselines to the problem of non-convex/non-concave quadratic programming. The objective function is of the form $f(x) = \tfrac{1}{2} x^\top H x + h^\top x$, where $x$ is a 100-dimensional vector and $H$ is a 100-by-100 matrix whose components are i.i.d. random variables distributed as the negated absolute value of a standard normal. The constraints form a polytope inside the box $[0,1]^{100}$. To guarantee that the gradient is nonnegative over the box, we set $h$ accordingly. One can observe from Fig. 1(a) that the function value that BCG attains is only slightly lower than that of the first-order algorithm SCG. The final function value that BCG attains is similar to that of ZGA.
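A toy construction of such an NQP instance can be sketched as follows; the specific choice of $h$ below is one illustrative way to enforce a nonnegative gradient on the box, assumed for this sketch rather than taken from the experimental setup.

```python
import random

def make_nqp(d, seed=0):
    """Toy NQP instance f(x) = 0.5 * x^T H x + h^T x with H symmetric and
    entrywise nonpositive (negated absolute normals), so all mixed second
    derivatives are nonpositive and f is continuous DR-submodular on the box."""
    rng = random.Random(seed)
    H = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            v = -abs(rng.gauss(0.0, 1.0))
            H[i][j] = H[j][i] = v
    # Gradient is H x + h; for x in [0,1]^d, (H x)_i >= sum_j H_ij,
    # so h_i = -sum_j H_ij guarantees a nonnegative gradient on the box.
    h = [-sum(row) for row in H]
    def f(x):
        quad = sum(H[i][j] * x[i] * x[j] for i in range(d) for j in range(d))
        return 0.5 * quad + sum(hi * xi for hi, xi in zip(h, x))
    return f, H, h
```

The nonnegative gradient makes $f$ monotone along nonnegative directions inside the box, which is the property the monotone DR-submodular framework requires.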
Topic Summarization: Next, we consider the topic summarization problem (El-Arini et al., 2009; Yue and Guestrin, 2011), which is to maximize the probabilistic coverage of selected articles on news topics. Each news article is characterized by its topic distribution, obtained by applying latent Dirichlet allocation to the corpus of Reuters-21578, Distribution 1.0. The number of topics is set to 10, and we choose from 120 news articles. The probabilistic coverage of a subset $X$ of news articles is defined by $f(X) = \sum_{j} \big[1 - \prod_{i \in X} (1 - p_i(j))\big]$, where $p_i$ is the topic distribution of article $i$. The multilinear extension of $f$ admits the closed form $F(x) = \sum_{j} \big[1 - \prod_{i} (1 - x_i\, p_i(j))\big]$ (Iyer et al., 2014). The problem is subject to a polytope constraint. It can be observed from Fig. 1(b) that the proposed BCG algorithm achieves the same function value as the first-order algorithm SCG and outperforms the other two. As shown in Fig. 2(a), BCG is the most efficient method. The two projection-free algorithms, BCG and SCG, run faster than the projected methods ZGA and GA. We will elaborate on the running time later in this section.
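The probabilistic coverage objective and its closed-form multilinear extension can be sketched as follows, with `P[i][j]` denoting the probability that article $i$ covers topic $j$ (an assumed data layout).

```python
def coverage(X, P):
    """f(X) = sum_j [1 - prod_{i in X} (1 - P[i][j])]: the probability that
    topic j is covered by at least one selected article, summed over topics."""
    m = len(P[0])
    total = 0.0
    for j in range(m):
        miss = 1.0
        for i in X:
            miss *= 1.0 - P[i][j]
        total += 1.0 - miss
    return total

def coverage_multilinear(x, P):
    """Closed-form multilinear extension
    F(x) = sum_j [1 - prod_i (1 - x_i * P[i][j])]."""
    m = len(P[0])
    total = 0.0
    for j in range(m):
        miss = 1.0
        for i in range(len(x)):
            miss *= 1.0 - x[i] * P[i][j]
        total += 1.0 - miss
    return total
```

At integral points the closed form coincides with the set function, so no sampling is needed to evaluate $F$ for this objective.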
Active Set Selection: We study the active set selection problem that arises in Gaussian process regression (Mirzasoleiman et al., 2013). We use the Parkinsons Telemonitoring dataset, which is composed of biomedical voice measurements from people with early-stage Parkinson's disease (Tsanas et al., 2010). Let $Z$ denote the data matrix, where each row is a voice recording and each column an attribute. The covariance matrix $\Sigma$ is defined via a kernel on the attributes of $Z$ with a fixed bandwidth. The objective function of the active set selection problem is $f(S) = \log\det(\mathbf{I} + \Sigma_{S,S})$, where $\Sigma_{S,S}$ is the principal submatrix of $\Sigma$ indexed by $S$. The 22 attributes are partitioned into 5 disjoint subsets with sizes 4, 4, 4, 5, and 5, respectively. The problem is subject to a partition matroid requiring that at most one attribute be active within each subset. Since this is a submodular set maximization problem, the first-order algorithms SCG and GA need many function value queries to evaluate the gradient (i.e., to obtain an unbiased estimate of the gradient): the $i$-th component of the gradient is an expected marginal gain, and estimating it requires two function value queries per coordinate. It can be observed from Fig. 1(c) that DBG outperforms the other zeroth-order algorithm, ZGA. Although its performance is slightly worse than that of the two first-order algorithms SCG and GA, it requires significantly fewer function value queries (as discussed above).
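A sketch of the log-determinant objective on a small covariance matrix, using a pure-Python Cholesky factorization; the helper names are ours.

```python
import math

def cholesky_logdet(A):
    """log det of a symmetric positive-definite matrix via Cholesky:
    det(A) = prod_i L[i][i]^2, so log det(A) = 2 * sum_i log L[i][i]."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)
                logdet += 2.0 * math.log(L[i][i])
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return logdet

def active_set_objective(S, Sigma):
    """f(S) = log det(I + Sigma_{S,S}), with Sigma_{S,S} the principal
    submatrix of the covariance matrix indexed by S; f(empty set) = 0."""
    idx = sorted(S)
    if not idx:
        return 0.0
    A = [[(1.0 if a == b else 0.0) + Sigma[a][b] for b in idx] for a in idx]
    return cholesky_logdet(A)
```

Adding the identity keeps the matrix positive definite even for degenerate covariance submatrices, so the objective is always well defined and monotone in $S$.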
Influence Maximization: In the influence maximization problem, we assume that every node in the network is able to influence all of its one-hop neighbors. The objective is to select a subset of nodes, called the seed set, so that the total number of influenced nodes, including the seed nodes, is maximized. We use the social network of Zachary's karate club (Zachary, 1977) in this study. The subjects in this social network are partitioned into three disjoint groups of sizes 10, 14, and 10, respectively. The chosen seed nodes are subject to a partition matroid: we select at most two subjects from each of the three groups. Note that this is also a submodular set maximization problem. As in the active set selection problem, the first-order algorithms need many function value queries to obtain an unbiased estimate of the gradient. We observe from Fig. 1(d) that DBG attains a better influence coverage than the other zeroth-order algorithm, ZGA. Again, even though SCG and GA achieve slightly better coverage, due to their first-order nature they require a significantly larger number of function value queries.
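The one-hop influence objective can be sketched as follows, with the adjacency structure given as a dict of neighbor lists (an assumed representation of the graph).

```python
def influence(S, adj):
    """f(S) = |S union N(S)|: the seed nodes plus all of their one-hop
    neighbors, i.e., the number of influenced nodes under the one-hop model."""
    covered = set(S)
    for v in S:
        covered.update(adj.get(v, ()))
    return len(covered)
```

This is a coverage function, hence monotone and submodular: the marginal gain of a new seed can only shrink as the already-covered set grows.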
Running Time
The running times of our proposed algorithms and the baselines are presented in Fig. 2 for the above-mentioned experimental setups. There are two main conclusions. First, the two projection-based algorithms (ZGA and GA) incur significantly higher time complexity than the projection-free algorithms (BCG, DBG, and SCG), as the projection-based algorithms require solving quadratic optimization problems, whereas the projection-free ones solve linear optimization problems, which can be done more efficiently. Second, comparing first-order and zeroth-order algorithms, we observe that the zeroth-order algorithms (BCG, DBG, and ZGA) run faster than their first-order counterparts (SCG and GA).
Summary
The experimental results above show the following major advantages of our method over the baselines, including SCG and ZGA.
- BCG/DBG is at least twice as fast as SCG and ZGA in all tasks in terms of running time (Figs. 2(a)-2(d)).
- DBG requires remarkably fewer function evaluations in the discrete setting (Figs. 1(c) and 1(d)).
- In addition to saving function evaluations, BCG/DBG achieves an objective function value comparable to that of the first-order baselines SCG and GA.
Furthermore, we note that the number of first-order queries required by SCG is only half the number of zeroth-order queries required by BCG. However, as shown in Figs. 2(a) and 2(b), BCG runs significantly faster than SCG, since a zeroth-order evaluation is faster than a first-order one.
In the topic summarization task (Fig. 1(b)), BCG exhibits performance similar to that of the first-order baselines SCG and GA in terms of the attained objective function value. In the other three tasks, BCG/DBG runs notably faster while achieving an only slightly inferior function value. Therefore, BCG/DBG is particularly preferable in large-scale machine learning tasks and in applications where the total number of function evaluations or the running time is subject to a budget.
6 Conclusion
In this paper, we presented Black-box Continuous Greedy, a derivative-free and projection-free algorithm for maximizing a monotone continuous DR-submodular function subject to a general convex body constraint. We showed that Black-box Continuous Greedy achieves the tight $(1-1/e)$ approximation guarantee using only function evaluations. We then extended the algorithm to the stochastic continuous setting and to the discrete submodular maximization problem. Our experiments on both synthetic and real data validate the performance of the proposed algorithms. In particular, we observed that Black-box Continuous Greedy practically achieves the same utility as Continuous Greedy while being far more efficient in terms of the number of function evaluations.
Acknowledgements
LC is supported by the Google PhD Fellowship. HH is supported by AFOSR Award 19RT0726, NSF HDR TRIPODS award 1934876, NSF award CPS1837253, NSF award CIF1910056, and NSF CAREER award CIF1943064. AK is partially supported by NSF (IIS1845032), ONR (N000141912406), and AFOSR (FA95501810160).
Appendix A Proof of Lemma 1
Proof.
Using the assumption that $f$ is Lipschitz continuous, we have
(15)  
(16)  
(17)  
(18) 
and
(19)  
(20)  
(21)  
(22) 
If $f$ is Lipschitz continuous and monotone continuous DR-submodular, then $f$ is differentiable. For $x \le y$, we also have
(23) 
and
(24) 
By the definition of $\tilde{f}$, we have that $\tilde{f}$ is differentiable, and for $x \le y$,
(25)  
(26)  
(27)  
(28) 
and
(29)  
(30)  
(31)  
(32) 
i.e., $\nabla \tilde{f}(x) \ge \nabla \tilde{f}(y)$ for $x \le y$. So $\tilde{f}$ is also a monotone continuous DR-submodular function. ∎
Appendix B Proof of Theorem 1
In order to prove Theorem 1, we need the following variance reduction lemmas [Shamir, 2017; Chen et al., 2018b], where the second is a slight improvement of Lemma 2 in [Mokhtari et al., 2018a] and Lemma 5 in [Mokhtari et al., 2018b].
Lemma 4 (Lemma 10 of [Shamir, 2017]).
It holds that
(33) 
(34) 
where is a constant.
Lemma 5 (Theorem 3 of [Chen et al., 2018b]).
Let be a sequence of points in such that for all with fixed constants and . Let be a sequence of random variables such that and for every , where is the field generated by and . Let be a sequence of random variables where is fixed and subsequent are obtained by the recurrence
(35) 
with . Then, we have
(36) 
where .
Now we turn to prove Theorem 1.
Proof of Theorem 1.
First of all, we note that technically we need the iteration number $T$ to be sufficiently large, which always holds in practical applications.
Then we show that every iterate lies in the feasible region. Since the oracle outputs $v_t$ are nonnegative vectors, the iterates, as convex combinations of the $v_t$'s, are also nonnegative. Moreover, since each iterate is a convex combination of points in the (convex) transformed constraint set, all iterates remain in that set. Therefore our final choice resides in the constraint set.
Consider the smoothed objective on the shrunk domain (without translation). By Jensen's inequality and the fact that the smoothed function has Lipschitz continuous gradients, we have
(37) 
Thus,
(38)  
(39)  
(40)  
(41) 
Let . Since , we have . We know and
(42) 
Since we assume that $f$ is monotone continuous DR-submodular, by Lemma 1, $\tilde{f}$ is also monotone continuous DR-submodular. As a result, $\tilde{f}$ is concave along nonnegative directions, and $\nabla \tilde{f}$ is entrywise nonnegative. Thus we have
(43)  
(44) 
Since , we deduce
(45)  
(46)  
(47)  
(48) 
Therefore, we obtain