Parallel Algorithm for Non-Monotone DR-Submodular Maximization

Alina Ene, Department of Computer Science, Boston University, aene@bu.edu.
Huy L. Nguyễn, College of Computer and Information Science, Northeastern University, hlnguyen@cs.princeton.edu.

Abstract

In this work, we give a new parallel algorithm for the problem of maximizing a non-monotone diminishing returns submodular function subject to a cardinality constraint. For any desired accuracy $\epsilon$, our algorithm achieves a approximation using parallel rounds of function evaluations. The approximation guarantee nearly matches the best approximation guarantee known for the problem in the sequential setting, and the number of parallel rounds is nearly optimal for any constant $\epsilon$. Previous algorithms achieve worse approximation guarantees using parallel rounds. Our experimental evaluation suggests that our algorithm obtains solutions whose objective value nearly matches the value obtained by the state-of-the-art sequential algorithms, and that it outperforms previous parallel algorithms in number of parallel rounds, iterations, and solution quality.

1 Introduction

In this paper, we study parallel algorithms for the problem of maximizing a non-monotone DR-submodular function subject to a single cardinality constraint. (Footnote 1: A DR-submodular function is a continuous function with the diminishing returns property: if $\vec{x} \le \vec{y}$ coordinate-wise then $\nabla f(\vec{x}) \ge \nabla f(\vec{y})$ coordinate-wise.) The problem is a generalization of submodular maximization subject to a cardinality constraint. Many recent works have shown that DR-submodular maximization has a wide range of applications beyond submodular maximization. These applications include maximum a-posteriori (MAP) inference for determinantal point processes (DPP), mean-field inference in log-submodular models, quadratic programming, and revenue maximization in social networks [16, 13, 6, 14, 17, 5, 4].

The problem of maximizing a DR-submodular function subject to a convex constraint is a notable example of a non-convex optimization problem that can be solved with provable approximation guarantees. The continuous Greedy algorithm [18], developed in the context of the multilinear relaxation framework, applies more generally to maximizing DR-submodular functions that are monotone increasing (if $\vec{x} \le \vec{y}$ coordinate-wise then $f(\vec{x}) \le f(\vec{y})$). Chekuri et al. [7] developed algorithms for both monotone and non-monotone DR-submodular maximization subject to packing constraints that are based on the continuous Greedy and multiplicative weights update framework. The work [5] generalized continuous Greedy for submodular functions to the DR-submodular case and developed Frank-Wolfe-style algorithms for maximizing non-monotone DR-submodular functions subject to general convex constraints.

A significant drawback of these algorithms is that they are inherently sequential and adaptive. In fact, the highly adaptive nature of these algorithms goes back to the classical greedy algorithm for submodular functions: the algorithm sequentially selects the next element based on the marginal gain on top of previous elements. In certain settings such as feature selection [15], evaluating the objective function is time-consuming and is the main bottleneck of the optimization algorithm, so parallelization is a must. Recent lines of work have focused on addressing these shortcomings and understanding the trade-offs between approximation guarantee, parallelization, and adaptivity. Starting with the work of Balkanski and Singer [3], there have been very recent efforts to understand the trade-off between approximation guarantee and adaptivity for submodular maximization [3, 9, 2, 12, 8, 1]. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially many parallel queries. Recently, the work [11] gave an algorithm for maximizing a submodular function subject to a cardinality constraint in rounds and approximation. For the general setting of DR-submodular functions with packing constraints, the work [10] gave an algorithm with rounds and approximation. In the special case of constraint, this algorithm uses rounds.

In this work, we develop a new algorithm for DR-submodular maximization subject to a single cardinality constraint using rounds of adaptivity and obtaining approximation. For constant $\epsilon$, the number of rounds is almost a quadratic improvement from in the previous work to the nearly optimal rounds.

Theorem 1.

Let be a DR-submodular function and . For every , there is an algorithm for the problem with the following guarantees:

• The algorithm is deterministic if provided oracle access for evaluating $f$ and its gradient $\nabla f$;

• The algorithm achieves an approximation guarantee of ;

• The number of rounds of adaptivity is .

2 Preliminaries

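In this paper, we consider non-negative functions $f: [0,1]^n \to \mathbb{R}_+$ that are diminishing returns submodular (DR-submodular). A function $f$ is DR-submodular if for all $\vec{x} \le \vec{y}$ (where $\le$ is coordinate-wise), all $i \in [n]$, and all $\delta \ge 0$ such that $\vec{x} + \delta \vec{1}_i$ and $\vec{y} + \delta \vec{1}_i$ are still in $[0,1]^n$, it holds

$$f(\vec{x} + \delta \vec{1}_i) - f(\vec{x}) \ge f(\vec{y} + \delta \vec{1}_i) - f(\vec{y}),$$

where $\vec{1}_i$ is the $i$-th basis vector, i.e., the vector whose $i$-th entry is $1$ and all other entries are $0$.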

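If $f$ is differentiable, $f$ is DR-submodular if and only if $\nabla f(\vec{x}) \ge \nabla f(\vec{y})$ for all $\vec{x} \le \vec{y}$. If $f$ is twice-differentiable, $f$ is DR-submodular if and only if all the entries of the Hessian are non-positive, i.e., $\frac{\partial^2 f}{\partial x_i \partial x_j}(\vec{x}) \le 0$ for all $i, j \in [n]$.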
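As an illustration of the definition and the Hessian characterization, the following minimal sketch numerically checks the diminishing returns inequality on random pairs $\vec{x} \le \vec{y}$. The quadratic test function, the helper names (`quad`, `dr_gap`, `check_dr`), and the sampling scheme are assumptions made for this example and are not part of the paper.

```python
import random

def quad(x, H, b):
    """f(x) = <b, x> - 0.5 * x^T H x. If every entry of H is non-negative,
    every Hessian entry of f is non-positive, so f is DR-submodular."""
    n = len(x)
    lin = sum(b[i] * x[i] for i in range(n))
    quad_term = sum(H[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return lin - 0.5 * quad_term

def dr_gap(f, x, y, i, delta):
    """[f(x + delta*e_i) - f(x)] - [f(y + delta*e_i) - f(y)].
    Non-negative whenever f is DR-submodular and x <= y."""
    xp = list(x)
    xp[i] += delta
    yp = list(y)
    yp[i] += delta
    return (f(xp) - f(x)) - (f(yp) - f(y))

def check_dr(f, n, trials=200, seed=0):
    """Return the smallest DR gap seen over random x <= y in [0,1]^n."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        y = [rng.uniform(0.0, 0.9) for _ in range(n)]
        x = [rng.uniform(0.0, yi) for yi in y]      # x <= y coordinate-wise
        i = rng.randrange(n)
        delta = rng.uniform(0.0, 1.0 - y[i])        # keeps both points in [0,1]^n
        worst = min(worst, dr_gap(f, x, y, i, delta))
    return worst

# Hypothetical instance: non-negative H, so the check should pass.
H = [[1.0, 0.5, 0.2], [0.5, 2.0, 0.3], [0.2, 0.3, 1.5]]
b = [3.0, 3.0, 3.0]
worst_quad = check_dr(lambda x: quad(x, H, b), 3)

# f(x) = x_0 * x_1 has a positive off-diagonal Hessian entry and fails the check.
worst_product = check_dr(lambda x: x[0] * x[1], 2)
```

A random check of this kind can only refute DR-submodularity, not prove it; for the quadratic above the property follows from the sign of the Hessian entries.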

For simplicity, throughout the paper, we assume that $f$ is differentiable. We assume that we are given black-box access to an oracle for evaluating $f$ and its gradient $\nabla f$. It is convenient to extend the function to as follows: , where .

An example of a DR-submodular function is the multilinear extension of a submodular function. The multilinear extension $f$ of a submodular function $F$ is defined as follows:

$$f(\vec{x}) = \mathbb{E}[F(R(\vec{x}))] = \sum_{S \subseteq V} F(S) \prod_{i \in S} \vec{x}_i \prod_{i \in V \setminus S} (1 - \vec{x}_i),$$

where $R(\vec{x})$ is a random subset of $V$ where each $i \in V$ is included independently at random with probability $\vec{x}_i$.
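To make the two equivalent forms of the multilinear extension concrete, the sketch below evaluates it both by the exact sum over all subsets and by sampling the random set. The function names and the small cut-function instance are hypothetical choices for illustration; the exact version takes time exponential in the ground set size and is only meant for tiny examples.

```python
import itertools
import random

def multilinear_exact(F, x):
    """Sum over all subsets S of F(S) * prod_{i in S} x_i * prod_{i not in S} (1 - x_i)."""
    n = len(x)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        prob = 1.0
        for i, bit in enumerate(mask):
            prob *= x[i] if bit else (1.0 - x[i])
        S = frozenset(i for i, bit in enumerate(mask) if bit)
        total += F(S) * prob
    return total

def multilinear_sampled(F, x, samples=20000, seed=0):
    """Monte Carlo estimate of E[F(R(x))], where R(x) includes each i
    independently with probability x_i."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        R = frozenset(i for i, xi in enumerate(x) if rng.random() < xi)
        acc += F(R)
    return acc / samples

# Hypothetical submodular function: the cut function of the path 0 - 1 - 2.
EDGES = [(0, 1), (1, 2)]

def cut(S):
    """Number of edges with exactly one endpoint in S (non-monotone submodular)."""
    return sum(1 for u, v in EDGES if (u in S) != (v in S))

exact_half = multilinear_exact(cut, [0.5, 0.5, 0.5])
sampled_half = multilinear_sampled(cut, [0.5, 0.5, 0.5])
exact_integral = multilinear_exact(cut, [1.0, 0.0, 1.0])
```

At an integral point the extension agrees with the set function itself, which is why `exact_integral` equals the cut value of the set {0, 2}.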

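Basic notation. We use e.g. $\vec{x} = (\vec{x}_1, \dots, \vec{x}_n)$ to denote a vector in $\mathbb{R}^n$. We use the following vector operations: $\vec{x} \vee \vec{y}$ is the vector whose $i$-th coordinate is $\max\{\vec{x}_i, \vec{y}_i\}$; $\vec{x} \wedge \vec{y}$ is the vector whose $i$-th coordinate is $\min\{\vec{x}_i, \vec{y}_i\}$; $\vec{x} \circ \vec{y}$ is the vector whose $i$-th coordinate is $\vec{x}_i \cdot \vec{y}_i$. We write $\vec{x} \le \vec{y}$ to denote that $\vec{x}_i \le \vec{y}_i$ for all $i \in [n]$. Let $\vec{0}$ (resp. $\vec{1}$) be the $n$-dimensional all-zeros (resp. all-ones) vector. Let $\vec{1}_S$ denote the indicator vector of $S \subseteq V$, i.e., the vector that has a $1$ in entry $i$ if and only if $i \in S$.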

We will use the following result that was shown in previous work [7].

Lemma 2 ([7], Lemma 7).

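Let $f$ be a DR-submodular function. For all $\vec{x}^*$ and $\vec{z}$ in $[0,1]^n$, $f(\vec{x}^* \vee \vec{z}) \ge (1 - \|\vec{z}\|_\infty) f(\vec{x}^*)$.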

3 The algorithm

In this section, we present an idealized version of our algorithm where we assume that we can compute exactly the step size on line 16. The idealized algorithm is given in Algorithm 1. In the appendix (Section B), we show how to implement that step efficiently and incur only additive error in the approximation.

The algorithm takes as input a target value $M$ and it achieves the desired approximation if $M$ is an approximation of the optimal function value $f(\vec{x}^*)$, i.e., we have . As noted in previous work [10], it is straightforward to approximately guess such a value using a single parallel round.

Finding the step size on line 16. As mentioned earlier, we assume that we can find the step exactly. In the appendix, we show that we can efficiently find approximately using -ary search for suitable . We can choose to obtain different trade-offs between the number of parallel rounds and total running time, see Section B in the appendix for more details.
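The round versus work trade-off of a multi-way search can be sketched as follows. This is a generic illustration, not the procedure from Section B: the arity symbol is lost in this copy of the text, so the sketch calls it `b`, and `pred` is a hypothetical predicate that is true on a prefix of the interval, since the exact condition defining the step size is not recoverable from this section. Each round evaluates `b - 1` points that do not depend on one another, so they could be queried in parallel.

```python
def bary_search(pred, lo, hi, b, tol):
    """Approximate sup{eta in [lo, hi] : pred(eta)} to within tol.

    Assumes pred is True on a prefix of [lo, hi], with pred(lo) True.
    Returns (eta, rounds) where pred(eta) holds and rounds counts the
    sequential rounds, each of which makes b - 1 independent evaluations.
    """
    rounds = 0
    while hi - lo > tol:
        step = (hi - lo) / b
        points = [lo + step * t for t in range(1, b)]
        answers = [pred(p) for p in points]   # independent, parallelizable
        # Count how many leading grid points satisfy the predicate.
        m = 0
        for ok in answers:
            if not ok:
                break
            m += 1
        # The boundary lies in the sub-interval after the last True point.
        lo, hi = lo + step * m, lo + step * (m + 1)
        rounds += 1
    return lo, rounds

# Hypothetical monotone predicate with boundary at 0.3.
eta4, rounds4 = bary_search(lambda e: e <= 0.3, 0.0, 1.0, b=4, tol=1e-3)
eta2, rounds2 = bary_search(lambda e: e <= 0.3, 0.0, 1.0, b=2, tol=1e-3)
```

With a larger arity the interval shrinks by a factor of `b` per round, so fewer sequential rounds are needed at the cost of more evaluations per round.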

Finding the step size on line 18. We have and

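$$\sum_{i \in [n]} \vec{z}_i(\eta) = \sum_{i \in S} \left(\vec{z}_i + \eta (1 - \vec{z}_i)\right) + \sum_{i \notin S} \vec{z}_i = \sum_{i \in [n]} \vec{z}_i + \eta \sum_{i \in S} (1 - \vec{z}_i)$$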

Additionally, for each , we have and thus . Therefore is the minimum between and the following value:

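$$\frac{\epsilon j k - \sum_{i \in [n]} \vec{z}_i}{\sum_{i \in S} (1 - \vec{z}_i)} = \frac{\epsilon j k - \|\vec{z}\|_1}{|S| - \|\vec{z} \circ \vec{1}_S\|_1}$$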
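The closed form above can be turned into a short computation. In this minimal sketch the function names are hypothetical, and the first term of the minimum is passed in as the parameter `eta_cap` because its value is not legible in this copy of the text. The update rule $\vec{z}_i + \eta(1 - \vec{z}_i)$ for $i \in S$ is taken from the displayed sum.

```python
def second_step_size(z, S, eps, j, k, eta_cap):
    """Largest eta <= eta_cap with sum_i z_i(eta) <= eps * j * k, where
    z_i(eta) = z_i + eta * (1 - z_i) for i in S and z_i(eta) = z_i otherwise.

    eta_cap stands in for the other term of the minimum (an assumption here).
    """
    budget = max(0.0, eps * j * k - sum(z))       # remaining l1 budget
    slack = sum(1.0 - z[i] for i in S)            # equals |S| - ||z o 1_S||_1
    if slack <= 0.0:
        return eta_cap                            # the update would not change z
    return min(eta_cap, budget / slack)

def apply_step(z, S, eta):
    """Move the coordinates in S a fraction eta of the way towards 1."""
    return [zi + eta * (1.0 - zi) if i in S else zi for i, zi in enumerate(z)]

# Hypothetical numbers: budget eps * j * k = 1.0, current l1 norm 0.6.
z = [0.1, 0.2, 0.0, 0.3]
S = {0, 2}
eta = second_step_size(z, S, eps=0.5, j=1, k=2, eta_cap=1.0)
z_new = apply_step(z, S, eta)
eta_capped = second_step_size(z, S, eps=0.5, j=1, k=2, eta_cap=0.1)
```

When the cap is not binding, the updated vector meets the $\ell_1$ budget $\epsilon j k$ with equality; when the cap binds, the budget is left partly unused.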

4 Analysis of the approximation guarantee

In this section, we show that Algorithm 1 achieves a approximation. Recall that we assume that is computed exactly on line 16. In Section B of the appendix, we show how to extend the algorithm and the analysis so that the algorithm efficiently computes a suitable approximation to that suffices for obtaining a approximation.

In the following, we refer to each iteration of the outer for loop as a phase. We refer to each iteration of the inner while loop as an iteration. Note that the update vectors are non-negative in each iteration of the algorithm, and thus the vectors remain non-negative throughout the algorithm and they can only increase. Additionally, since , we have throughout the algorithm. We will also use the following observations repeatedly, whose straightforward proofs are deferred to Section A of the appendix. By DR-submodularity, since the relevant vectors can only increase in each coordinate, the relevant gradients can only decrease in each coordinate. This implies that, for every , we have . Additionally, for every , we have .

We will need an upper bound on the $\ell_1$ and $\ell_\infty$ norms of $\vec{x}$ and $\vec{z}$. Since $\vec{x} \le \vec{z}$, it suffices to upper bound the norms of $\vec{z}$ (the $\ell_1$ norm bound will be used to show that the final solution is feasible, and the $\ell_\infty$ norm bound will be used to derive the approximation guarantee). We do so in the following lemma.

Lemma 3.

Consider phase $j$ of the algorithm (the $j$-th iteration of the outer for loop). Throughout the phase, the algorithm maintains the invariant that $\|\vec{z}\|_1 \le \epsilon j k$ and $\|\vec{z}\|_\infty \le 1 - (1-\epsilon)^j + \epsilon^2$.

Proof.

We show that the invariants are maintained using induction on the number of iterations of the inner while loop in phase . Let be the vector right before the update on line 21 and let be the vector right after the update. By the induction hypothesis, we have . If , we have , and the invariant is maintained. Therefore we may assume that . By the definition of , we have . We have . Thus the invariant is maintained.

Next, we show the upper bound on the norm. Note that , where is the step size chosen on line 19. Thus we have , where the last inequality is by the choice of . ∎

Theorem 4.

Consider a phase of the algorithm (an iteration of the outer for loop). Let and be the vector at the beginning and end of the phase. We have

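$$f(\vec{x}_{\mathrm{end}}) - f(\vec{x}_{\mathrm{start}}) \ge (1 - 5\epsilon)\,\epsilon \left((1-\epsilon)^j f(\vec{x}^*) - f(\vec{x}_{\mathrm{end}}) - 3\epsilon f(\vec{x}^*)\right)$$
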
Proof.

We consider two cases, depending on whether the threshold at the end of the phase is equal to or not.

Case 1: we have . Note that the phase terminates with in this case. We fix an iteration of the phase that updates and on lines 20-23, and analyze the gain in function value in the current iteration. We let denote the vectors right before the update on lines 20-23. Let be the vector right after the update on line 20, and let be the vector right after the update on line 21.

We have:

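$$\begin{aligned}
f(\vec{x}') - f(\vec{x}) &\overset{(a)}{\ge} \langle \nabla f(\vec{x}'), \eta (\vec{1} - \vec{x}) \circ \vec{1}_{T(\eta)} \rangle \\
&= \langle (\vec{1} - \vec{x}) \circ \nabla f(\vec{x}'), \eta \vec{1}_{T(\eta)} \rangle \\
&\overset{(b)}{\ge} \langle \vec{g}(\eta), \eta \vec{1}_{T(\eta)} \rangle \\
&\overset{(c)}{\ge} \eta\, v_{\mathrm{start}} \underbrace{|S(\eta)|}_{\ge (1-\epsilon)|S|} \\
&\overset{(d)}{\ge} (1-\epsilon)\, \eta\, v_{\mathrm{start}} |S|
\end{aligned}$$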

In (a), we used the fact that and is concave in non-negative directions.

We can show (b) as follows. We have and thus by DR-submodularity. Additionally, for every coordinate , we have . Therefore, for every , we have .

In (c), we have used that , for all , and for all .

We can show (d) as follows. Since , we have , where the first inequality is by Lemma 14 and the second inequality is by the choice of .

Let $\eta_t$ and $S_t$ denote $\eta$ and $S$ in iteration $t$ of the phase (note that we are momentarily overloading $\eta_1$ and $\eta_2$ here, and they temporarily stand for the step size in iterations $1$ and $2$, and not for the step sizes on lines 16 and 18). By summing up the above inequality over all iterations, we obtain:

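$$\begin{aligned}
f(\vec{x}_{\mathrm{end}}) - f(\vec{x}_{\mathrm{start}}) &\ge (1-\epsilon)\, v_{\mathrm{start}} \sum_t \eta_t |S_t| \\
&\ge (1-\epsilon)\, v_{\mathrm{start}} \underbrace{\|\vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{start}}\|_1}_{\ge \epsilon k} \\
&\overset{(a)}{\ge} (1-\epsilon)\, v_{\mathrm{start}}\, \epsilon k \\
&\overset{(b)}{=} \epsilon (1-\epsilon) \left(\left((1-\epsilon)^j - 2\epsilon\right) M - f(\vec{x}_{\mathrm{start}})\right) \\
&\overset{(c)}{\ge} \epsilon (1-\epsilon) \left((1-\epsilon)^j f(\vec{x}^*) - f(\vec{x}_{\mathrm{start}}) - 3\epsilon M\right)
\end{aligned}$$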

We can show (a) as follows. Recall that we have . Since , we have .

In (b), we used the definition of on line 7.

In (c), we used that .

Case 2: we have . Note that this implies that , since line 13 was executed at least once during the phase.

Let be the following subset of the coordinates:

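$$A := \left\{ i \in [n] : (1 - (\vec{z}_{\mathrm{end}})_i)\, \nabla_i f(\vec{z}_{\mathrm{end}}) \ge \frac{v_{\mathrm{end}}}{1-\epsilon} \right\}$$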

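Lemma 5.

We have

$$f(\vec{x}_{\mathrm{end}}) - f(\vec{x}_{\mathrm{start}}) \ge (1-\epsilon) \left( v_{\mathrm{end}} \|(\vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{start}}) \circ \vec{1}_{\bar{A}}\|_1 + \langle \nabla f(\vec{z}_{\mathrm{end}}), (\vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{start}}) \circ \vec{1}_A \rangle \right)$$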

Proof.

Fix an iteration of the phase that updates and on lines 20-23. Let denote the vectors right before the update on lines 20-23. Let be the vector right after the update on line 20, and let be the vector right after the update on line 21.

We have:

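$$\begin{aligned}
f(\vec{x}') - f(\vec{x}) &\overset{(a)}{\ge} \langle \nabla f(\vec{x}'), \eta (\vec{1} - \vec{x}) \circ \vec{1}_{T(\eta)} \rangle \\
&= \langle (\vec{1} - \vec{x}) \circ \nabla f(\vec{x}'), \eta \vec{1}_{T(\eta)} \rangle \\
&\overset{(b)}{\ge} \langle \vec{g}(\eta), \eta \vec{1}_{T(\eta)} \rangle \\
&= \langle \vec{g}(\eta), \eta \vec{1}_{S(\eta)} \rangle + \langle \vec{g}(\eta), \eta \vec{1}_{T(\eta) \setminus S(\eta)} \rangle \\
&= \eta\, v_{\mathrm{end}} \underbrace{|S(\eta)|}_{\ge (1-\epsilon)|S|} + \underbrace{\langle \vec{g}(\eta) - v_{\mathrm{end}} \vec{1}, \eta \vec{1}_{S(\eta)} \rangle}_{\ge 0} + \underbrace{\langle \vec{g}(\eta), \eta \vec{1}_{T(\eta) \setminus S(\eta)} \rangle}_{\ge 0} \\
&\overset{(c)}{\ge} (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| + \langle \vec{g}(\eta) - v_{\mathrm{end}} \vec{1}, \vec{1}_{S(\eta)} \rangle + \langle \vec{g}(\eta), \vec{1}_{T(\eta) \setminus S(\eta)} \rangle \right) \\
&= (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| - v_{\mathrm{end}} |S(\eta)| + \langle \vec{g}(\eta), \vec{1}_{T(\eta)} \rangle \right) \\
&= (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| - v_{\mathrm{end}} |S(\eta)| + \langle \vec{g}(\eta), \vec{1}_{T(\eta) \setminus A} \rangle + \langle \vec{g}(\eta), \vec{1}_{T(\eta) \cap A} \rangle \right) \\
&\overset{(d)}{=} (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| - v_{\mathrm{end}} |S(\eta)| + \langle \vec{g}(\eta), \vec{1}_{T(\eta) \setminus A} \rangle + \langle \vec{g}(\eta), \vec{1}_{S \cap A} \rangle \right) \\
&\overset{(e)}{\ge} (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| - v_{\mathrm{end}} |S(\eta)| + \langle \vec{g}(\eta), \vec{1}_{S(\eta) \setminus A} \rangle + \langle \vec{g}(\eta), \vec{1}_{S \cap A} \rangle \right) \\
&\overset{(f)}{\ge} (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S| - v_{\mathrm{end}} |S(\eta)| + v_{\mathrm{end}} |S(\eta) \setminus A| + \langle \vec{g}(\eta), \vec{1}_{S \cap A} \rangle \right) \\
&= (1-\epsilon)\, \eta \left( v_{\mathrm{end}} \left(|S| - |S(\eta) \cap A|\right) + \langle \vec{g}(\eta), \vec{1}_{S \cap A} \rangle \right) \\
&\ge (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S \setminus A| + \langle \vec{g}(\eta), \vec{1}_{S \cap A} \rangle \right) \\
&= (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S \setminus A| + \langle \nabla f(\vec{z}(\eta)), (\vec{1} - \vec{z}(\eta)) \circ \vec{1}_{S \cap A} \rangle \right) \\
&\overset{(g)}{\ge} (1-\epsilon)\, \eta \left( v_{\mathrm{end}} |S \setminus A| + \langle \nabla f(\vec{z}_{\mathrm{end}}), (\vec{1} - \vec{z}(\eta)) \circ \vec{1}_{S \cap A} \rangle \right)
\end{aligned}$$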

In (a), we used the fact that and is concave in non-negative directions.

We can show (b) as follows. We have and thus by DR-submodularity. Additionally, for every coordinate , we have . Therefore, for every , we have .

We can show (c) as follows. Since , we have , where the first inequality is by Lemma 14 and the second inequality is by the choice of . By the definition of , we have for every . By the definition of , we have for every .

Equality (d) follows from the fact that , which we can show as follows. We have , and is the set of all coordinates with negative gradient . Thus it suffices to show that the coordinates in have positive gradient . For every , we have , where the first inequality is by DR-submodularity (since ) and the second inequality is by the definition of and the fact that for all (Lemma 3). Moreover, we have for all (Lemma 3). Thus for all , and hence .

In (e), we have used that and is non-negative on the coordinates of .

In (f), we have used that, for all .

In (g), we have used that and thus by DR-submodularity.

Let $\eta_t$ and $S_t$ denote $\eta$ and $S$ in iteration $t$ of the phase (note that we are momentarily overloading $\eta_1$ and $\eta_2$, and they temporarily stand for the step size in iterations $1$ and $2$, and not for the step sizes on lines 16 and 18). By summing up the above inequality over all iterations, we obtain:

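$$\begin{aligned}
f(\vec{x}_{\mathrm{end}}) - f(\vec{x}_{\mathrm{start}}) &\ge (1-\epsilon) \left( \sum_t v_{\mathrm{end}}\, \eta_t |S_t \setminus A| + \sum_t \langle \nabla f(\vec{z}_{\mathrm{end}}), \eta_t (\vec{1} - \vec{z}(\eta)) \circ \vec{1}_{S_t \cap A} \rangle \right) \\
&\ge (1-\epsilon) \left( v_{\mathrm{end}} \|(\vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{start}}) \circ \vec{1}_{\bar{A}}\|_1 + \langle \nabla f(\vec{z}_{\mathrm{end}}), (\vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{start}}) \circ \vec{1}_A \rangle \right)
\end{aligned}$$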

We will also need the following lemmas.

Lemma 6.

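For every $i \in A$, we have:

$$(\vec{z}_{\mathrm{end}})_i - (\vec{z}_{\mathrm{start}})_i \ge (1 - 3\epsilon)\, \epsilon \left(1 - (\vec{z}_{\mathrm{end}})_i\right)$$
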
Proof.

Since was empty at the previous threshold , we have or . If it is the latter, the claim follows, since . Therefore we may assume it is the former. By Lemma 3, . Therefore

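$$(\vec{z}_{\mathrm{end}})_i - (\vec{z}_{\mathrm{start}})_i \ge \epsilon (1-\epsilon)^{j-1} - \epsilon^2 \ge (1 - 3\epsilon)\, \epsilon (1-\epsilon)^{j-1} \ge (1 - 3\epsilon)\, \epsilon \left(1 - (\vec{z}_{\mathrm{end}})_i\right)$$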

where in the second inequality we used that $(1-\epsilon)^{j-1} \ge 1/3$ for sufficiently small $\epsilon$ (since $j \le 1/\epsilon$). ∎

Lemma 7.

We have:

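$$\langle \nabla f(\vec{z}_{\mathrm{end}}) \vee \vec{0}, (\vec{1} - \vec{z}_{\mathrm{end}}) \circ \vec{x}^* \rangle \ge \left((1-\epsilon)^j - \epsilon^2\right) f(\vec{x}^*) - f(\vec{x}_{\mathrm{end}})$$
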
Proof.

We have:

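$$\begin{aligned}
\langle \nabla f(\vec{z}_{\mathrm{end}}) \vee \vec{0}, (\vec{1} - \vec{z}_{\mathrm{end}}) \circ \vec{x}^* \rangle &\overset{(a)}{\ge} \langle \nabla f(\vec{z}_{\mathrm{end}}) \vee \vec{0}, \vec{x}^* \vee \vec{z}_{\mathrm{end}} - \vec{z}_{\mathrm{end}} \rangle \\
&\overset{(b)}{\ge} f(\vec{z}_{\mathrm{end}} \vee \vec{x}^*) - f(\vec{z}_{\mathrm{end}}) \\
&\overset{(c)}{\ge} f(\vec{z}_{\mathrm{end}} \vee \vec{x}^*) - f(\vec{x}_{\mathrm{end}}) \\
&\overset{(d)}{\ge} (1 - \|\vec{z}_{\mathrm{end}}\|_\infty) f(\vec{x}^*) - f(\vec{x}_{\mathrm{end}}) \\
&\overset{(e)}{\ge} \left((1-\epsilon)^j - \epsilon^2\right) f(\vec{x}^*) - f(\vec{x}_{\mathrm{end}})
\end{aligned}$$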

In (a), we used that for all .

In (b), we used the fact that $f$ is concave in non-negative directions.

In (c), we used the fact that the algorithm maintains the invariant that $f(\vec{x}) \ge f(\vec{z})$ via the update on line 23.

In (d), we used Lemma 2.

In (e), we used Lemma 3. ∎

Recall that the phase terminates with either or . We consider each of these cases in turn.

Lemma 8.

Suppose that . We have:

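$$f(\vec{x}_{\mathrm{end}}) - f(\vec{x}_{\mathrm{start}}) \ge (1 - 5\epsilon)\, \epsilon \left((1-\epsilon)^j f(\vec{x}^*) - f(\vec{x}_{\mathrm{end}}) - 2\epsilon f(\vec{x}^*)\right)$$
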
Proof.

By Lemma 5, we have:

 f(→xend)−f(→xstart) ≥(1−ϵ)⟨∇f(→zend),(→zend−→zstart)∘→1A⟩ =(1−ϵ)(⟨∇f(→zend),(→zend−→zstart−ϵ(1−3ϵ)(→1−→zend)∘→x∗)∘→1A⟩≥0 +⟨∇f(→zend),ϵ(1−3ϵ)(→1−→zend)∘→x∗∘→1A⟩) (a)≥(1−4ϵ)ϵ⟨∇f(→zend),(→1−→zend)∘→x∗∘→