
# Safe Element Screening for Submodular Function Minimization

Weizhong Zhang, Bin Hong, Lin Ma, Wei Liu, Tong Zhang
Tencent AI Lab, Shenzhen, China
State Key Lab of CADCG, College of Computer Science, Zhejiang University
###### Abstract

Submodular functions are discrete analogs of convex functions and have applications in various fields, including machine learning, computer vision and signal processing. However, in large-scale applications, solving Submodular Function Minimization (SFM) problems remains challenging. In this paper, we make the first attempt to extend screening, an emerging technique in large-scale sparse learning, to SFM in order to accelerate its optimization process. Specifically, we propose a novel safe element screening method that quickly identifies, during the optimization process, the elements that are guaranteed to be included in (we refer to them as active) or excluded from (inactive) the final optimal solution of SFM. The method is based on a careful study of the relationships between SFM and the corresponding convex proximal problems, as well as an accurate estimation of the optimum of the proximal problem. By removing the inactive elements and fixing the active ones, the problem size can be dramatically reduced, leading to great savings in computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first screening method in the field of SFM and even in combinatorial optimization, and thus points out a new direction for accelerating SFM algorithms. Experimental results on both synthetic and real datasets demonstrate the significant speedups gained by our screening method.

## 1 Introduction

Submodular functions [9] are a special class of set functions that have rich structure and many links with convex functions. They arise naturally in many domains, such as clustering [17], image segmentation [12, 4], document summarization [13] and social networks [11]. Most of these applications can ultimately be reduced to a Submodular Function Minimization (SFM) problem, which takes the form of

$$\min_{A \subseteq V} F(A), \tag{SFM}$$

where $F$ is a submodular function defined on a set $V$. The problem of SFM has been extensively studied for several decades [6, 15, 10, 16, 8], and many algorithms (both exact and approximate) have been developed from the perspectives of combinatorial optimization and convex optimization. The best-known result is that SFM is solvable in strongly polynomial time [10]. Unfortunately, due to the high-degree polynomial dependence, applying submodular functions to large-scale problems such as image segmentation [4] and speech analysis [14], which involve a huge number of variables, remains challenging.

Screening [7] is an emerging technique that has proved effective in accelerating large-scale sparse model training. It is motivated by a well-known property of sparse models: a significant portion of the coefficients in their optimal solutions (resp. the optimal solutions of their dual problems) are zero, that is, the corresponding features (resp. samples) are irrelevant to the final learned models. Screening methods aim to quickly identify these irrelevant features and/or samples and remove them from the datasets before or during the training process. Thus, the problem size can be reduced dramatically, leading to substantial savings in computational cost. The framework of the screening methods is given in Algorithm 1. Since screening methods are independent of the optimization algorithm, they can be flexibly integrated with any solver. In recent years, specific screening methods have been developed for most of the traditional sparse models, such as sparse logistic regression [24], lasso [22, 25], tree guided group lasso [23] and SVM [19, 27]. Empirical studies indicate that the speedups they achieve can be orders of magnitude.

The binary attribute of SFM (each element in $V$ must be either in or not in the optimal solution) motivates us to introduce the key idea of screening into SFM to accelerate its optimization process. The most intuitive approach is to identify the elements that are guaranteed to be included in or excluded from the minimizer of SFM prior to or while actually solving it. Then, by fixing the identified active elements and removing the inactive ones, we only need to solve a small-scale problem. However, existing screening methods are all developed for convex models and cannot be applied to SFM directly. The reason is that they all depend heavily on KKT conditions (see Algorithm 1), which do not exist in SFM problems.

In this paper, to improve the efficiency of SFM algorithms, we propose a novel Inactive and Active Element Screening (IAES) framework for SFM, which consists of two kinds of screening rules, i.e., Inactive Elements Screening (IES) and Active Elements Screening (AES). As analyzed above, the major challenge in developing IAES is the absence of KKT conditions. We bypass this obstacle by carefully studying the relationship between SFM and convex optimization, which can be regarded as another form of KKT conditions. We find that SFM is closely related to a particular convex primal and dual problem pair Q-P and Q-D (see Section 2), that is, the minimizer of SFM can be obtained from the positive components of the optimum of Q-P. Hence, the proposed IAES identifies the active and inactive elements by estimating lower and upper bounds of the components of the optimum of problem Q-P. Thus, one of our major technical contributions is a novel framework (Section 3) for deriving accurate estimations of the optimum of problem Q-P, developed by carefully studying the strong convexity of the corresponding primal and dual objective functions, the structure of the base polyhedra and the optimality conditions of the SFM problem.

We integrate IAES with the solver for problems Q-P and Q-D. As the solver proceeds and the estimation becomes more and more accurate, IAES can identify more and more elements. By fixing the active elements and removing the inactive ones, the problem size can be reduced gradually. IAES is safe in the sense that it never sacrifices any accuracy in the final output. To the best of our knowledge, IAES is the first screening method in the domain of SFM or even combinatorial optimization. Moreover, compared with the screening methods for sparse models, an outstanding feature of IAES is that it has no theoretical limit in reducing the problem size. That is, we can finally reduce the problem size to zero, leading to substantial savings in computational cost. The reason is that, as the optimization proceeds, our estimation becomes accurate enough to infer the affiliations of all the elements with the optimizer $A^*$. In sparse models, by contrast, screening methods can never reduce the problem size to zero, since the features (resp. samples) with nonzero coefficients in the primal (resp. dual) optimum can never be removed from the dataset. Experiments (see Section 4) on both synthetic and real datasets demonstrate the significant speedups gained by IAES. For convenience of presentation, we postpone the detailed proofs of the theoretical results in the main text to the supplementary materials.

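Notations: We consider the set $V = \{1, \dots, p\}$, and denote its power set by $2^V$, which is composed of the $2^p$ subsets of $V$. $|A|$ is the cardinality of a set $A$. $A \cup B$ and $A \cap B$ are the union and intersection of the sets $A$ and $B$, respectively. $A \subseteq B$ means that $A$ is a subset of $B$, potentially equal to $B$. Moreover, for $w \in \mathbb{R}^p$ and $\alpha \in \mathbb{R}$, we let $[w]_k$ be the $k$-th component of $w$ and $\{w \geq \alpha\}$ (resp. $\{w > \alpha\}$) be the weak (resp. strong) $\alpha$-sup-level set of $w$, which is defined as $\{j \in V : [w]_j \geq \alpha\}$ (resp. $\{j \in V : [w]_j > \alpha\}$). Finally, for $s \in \mathbb{R}^p$, we define a set function by $s(A) = \sum_{j \in A} [s]_j$.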

## 2 Basics and Motivations

This section is composed of two parts: a) a brief review of some basics of submodular functions, SFM and their relations with convex optimization; b) the motivation of our screening method IAES.

The following are the definitions of submodular functions, submodular polyhedra and base polyhedra, which play an important role in submodular analysis.

###### Definition 1.

(Submodular Function) [16]. A set function $F: 2^V \to \mathbb{R}$ is submodular if and only if, for all subsets $A, B \subseteq V$, we have:

$$F(A) + F(B) \geq F(A \cup B) + F(A \cap B).$$
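
For small ground sets, the inequality in Definition 1 can be checked exhaustively. The following is a minimal sketch, not part of the original method; the helper names `subsets` and `is_submodular` and the two toy functions are illustrative assumptions, and the check is exponential in $|V|$, so it is only meant for sanity tests.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(F, ground, tol=1e-12):
    """Brute-force check of F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    all_sets = list(subsets(ground))
    return all(
        F(A) + F(B) + tol >= F(A | B) + F(A & B)
        for A in all_sets
        for B in all_sets
    )


V = frozenset(range(4))
concave_card = lambda A: len(A) ** 0.5  # concave in |A|: submodular
convex_card = lambda A: len(A) ** 2     # convex in |A|: violates the inequality
```

Here `is_submodular(concave_card, V)` holds, while `convex_card` fails already for two disjoint singletons, since $1 + 1 < 4 + 0$.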
###### Definition 2.

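(Submodular and Base Polyhedra) [9]. Let $F$ be a submodular function such that $F(\emptyset) = 0$. The submodular polyhedron $P(F)$ and the base polyhedron $B(F)$ are defined as:

$$P(F) = \{s \in \mathbb{R}^p : \forall A \subseteq V,\ s(A) \leq F(A)\},$$

$$B(F) = \{s \in \mathbb{R}^p : s(V) = F(V),\ \forall A \subseteq V,\ s(A) \leq F(A)\}.$$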

Below we give the definition of Lovász extension, which works as the bridge that connects submodular functions and convex functions.

###### Definition 3.

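(Lovász Extension) [9]. Given a set function $F$ such that $F(\emptyset) = 0$, the Lovász extension $f: \mathbb{R}^p \to \mathbb{R}$ is defined as follows: for $w \in \mathbb{R}^p$, order the components in decreasing order $[w]_{j_1} \geq \dots \geq [w]_{j_p}$, and define $f(w)$ through the equation below,

$$f(w) = \sum_{k=1}^{p} [w]_{j_k} \big( F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\}) \big).$$

The Lovász extension $f$ is convex if and only if $F$ is submodular (see [9]).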
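
The definition above translates directly into a sort followed by a pass over prefix sets. The sketch below is an illustration under assumptions: the function names are hypothetical, elements are indexed from 0, and $F$ is passed as a Python callable on frozensets. The vector of marginal gains it builds is the output of the standard greedy algorithm and sums to $F(V)$ by telescoping.

```python
import math


def greedy_base_vertex(F, w):
    """Return s with [s]_{j_k} = F({j_1..j_k}) - F({j_1..j_{k-1}}),
    where j_1, ..., j_p sorts the components of w in decreasing order."""
    order = sorted(range(len(w)), key=lambda j: -w[j])
    s = [0.0] * len(w)
    prefix, prev = set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        s[j] = cur - prev  # marginal gain of adding j to the current prefix
        prev = cur
    return s


def lovasz_extension(F, w):
    """Evaluate f(w) as the inner product of w with the greedy vector."""
    s = greedy_base_vertex(F, w)
    return sum(wj * sj for wj, sj in zip(w, s))


# Toy submodular function (an assumption for illustration): F(A) = sqrt(|A|).
F = lambda A: math.sqrt(len(A))
```

On the indicator vector of a set $A$, the extension reproduces $F(A)$, which is a quick way to test an implementation.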

We focus on the generic submodular function minimization problem SFM defined in Section 1 and denote its minimizer as $A^*$. To reveal the relationship between SFM and convex optimization and finally motivate our method, we need the following theorems.

###### Theorem 1.

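Let $\psi_1, \dots, \psi_p$ be $p$ convex functions on $\mathbb{R}$, $\psi_1^*, \dots, \psi_p^*$ be their Fenchel conjugates ([3]), and $f$ be the Lovász extension of a submodular function $F$. Denote the subgradient of $\psi_k(\cdot)$ by $\partial \psi_k(\cdot)$. Then, the following hold:

(i) The problems below are dual to each other:

$$\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^{p} \psi_j([w]_j), \tag{P}$$

$$\max_{s \in B(F)} -\sum_{j=1}^{p} \psi_j^*(-[s]_j). \tag{D}$$

(ii) The pair $(w^*, s^*)$ is optimal for problems (P) and (D) if and only if

$$\begin{cases} \text{(a)}: [s]^*_k \in -\partial \psi_k([w]^*_k),\ \forall k \in V, \\ \text{(b)}: w^* \in N_{B(F)}(s^*), \end{cases} \tag{Opt}$$

where $N_{B(F)}(s^*)$ is the normal cone (see Chapter 2 of [3]) of $B(F)$ at $s^*$.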

When $\psi_j$ is differentiable, we consider a sequence of set optimization problems parameterized by $\alpha \in \mathbb{R}$:

$$\min_{A \subseteq V} F(A) + \sum_{j \in A} \nabla \psi_j(\alpha), \tag{SFM'}$$

where $\nabla \psi_j$ is the gradient of $\psi_j$. The problem SFM’ has tight connections with the convex optimization problem P (see the theorem below).

###### Theorem 2.

(Submodular function minimization from proximal problem) [Proposition 8.4 in [1]]. Under the same assumptions as in Theorem 1, if $\psi_j$ is differentiable for all $j \in V$ and $w^*$ is the unique minimizer of problem P, then for all $\alpha \in \mathbb{R}$, the minimal minimizer of problem SFM’ is $\{w^* > \alpha\}$ and the maximal minimizer is $\{w^* \geq \alpha\}$, that is, for any minimizer $A^*_\alpha$, we have:

$$\{w^* > \alpha\} \subseteq A^*_\alpha \subseteq \{w^* \geq \alpha\}. \tag{1}$$

By choosing $\psi_j(x) = \frac{1}{2} x^2$ and $\alpha = 0$ in SFM’, and combining Theorems 1 and 2, we can see that SFM can be reduced to the following primal and dual problems. One is a quadratic optimization problem and the other is equivalent to finding the minimum norm point in the base polytope $B(F)$:

$$\min_{w \in \mathbb{R}^p} P(w) = f(w) + \frac{1}{2} \|w\|_2^2, \tag{Q-P}$$

$$\max_{s \in B(F)} D(s) = -\frac{1}{2} \|s\|_2^2. \tag{Q-D}$$
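
To see how the pair Q-P and Q-D follows from Theorem 1, it may help to work out the conjugate explicitly. The short derivation below assumes the quadratic choice $\psi_j(x) = \frac{1}{2}x^2$, which is the one consistent with the form of Q-P:

$$\psi_j^*(y) = \sup_{x \in \mathbb{R}} \Big\{ xy - \tfrac{1}{2} x^2 \Big\} = \tfrac{1}{2} y^2 \quad\Longrightarrow\quad -\sum_{j=1}^{p} \psi_j^*(-[s]_j) = -\tfrac{1}{2} \|s\|_2^2 .$$

Moreover, $\nabla \psi_j(0) = 0$, so the modular term in SFM’ vanishes at $\alpha = 0$ and SFM’ coincides with SFM. Condition (a) of (Opt) then reads $[s]^*_k = -[w]^*_k$ for all $k \in V$, i.e., $w^* = -s^*$, which is why the primal optimum can be read off from the minimum norm point.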

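According to Eq. (1), we can define two index sets:

$$E = \{j \in V : [w]^*_j > 0\}, \quad \text{and} \quad G = \{j \in V : [w]^*_j < 0\},$$

which imply that

$$\text{(i)}: j \in E \Rightarrow j \in A^*, \tag{R1}$$

$$\text{(ii)}: j \in G \Rightarrow j \notin A^*. \tag{R2}$$

We call the $j$-th element active if $j \in E$ and the ones in $G$ inactive.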
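
As a sanity check of rules R1 and R2, consider the simplest possible case, a modular function. This worked example is an illustration added here, with the weight vector $c \in \mathbb{R}^p$ introduced as an assumption:

$$F(A) = \sum_{j \in A} c_j \;\Longrightarrow\; f(w) = \langle c, w \rangle, \qquad P(w) = \langle c, w \rangle + \tfrac{1}{2}\|w\|_2^2, \qquad w^* = -c .$$

Hence $E = \{j : c_j < 0\}$ and $G = \{j : c_j > 0\}$. This agrees with direct minimization of $F$: every element with negative weight must be included, every element with positive weight must be excluded, and elements with $c_j = 0$ can go either way, which is exactly the slack allowed by Eq. (1).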

Suppose that we are given two subsets $\hat{E} \subseteq E$ and $\hat{G} \subseteq G$. By rules R1 and R2, many affiliations between $A^*$ and the elements of $V$ can be deduced. Thus, we have fewer unknowns to solve in SFM and its size can be dramatically reduced. We formalize this idea in Lemma 1.

###### Lemma 1.

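Given two subsets $\hat{E} \subseteq E$ and $\hat{G} \subseteq G$, the following hold:

(i): $\hat{E} \subseteq A^*$, and for all $j \in \hat{G}$, we have $j \notin A^*$.

(ii): The problem SFM can be reduced to the following scaled problem:

$$\min_{C \subseteq V/(\hat{E} \cup \hat{G})} \hat{F}(C) := F(\hat{E} \cup C) - F(\hat{E}), \tag{scaled-SFM}$$

which is also an SFM problem.

(iii): $A^*$ can be recovered by $A^* = \hat{E} \cup C^*$, where $C^*$ is the minimizer of scaled-SFM.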

Lemma 1 indicates that, if we can identify the active set $\hat{E}$ and inactive set $\hat{G}$, we only need to solve the scaled problem scaled-SFM, which may be much smaller than the original problem SFM, to exactly recover the optimal solution $A^*$ without sacrificing any accuracy.
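
The reduction in Lemma 1 can be illustrated on a toy instance by brute force. The sketch below is only a check under assumptions: the function names, the toy objective and the chosen sets `E_hat` and `G_hat` are hypothetical, and the screened sets are picked by hand so that `E_hat` lies inside the minimizer and `G_hat` is disjoint from it.

```python
import math
from itertools import combinations


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def brute_force_min(F, ground):
    """Return a minimizer of F over all subsets (exponential; toy sizes only)."""
    return min(subsets(ground), key=F)


def scaled_problem(F, V, E_hat, G_hat):
    """Build F_hat(C) = F(E_hat | C) - F(E_hat) on the remaining elements."""
    remaining = frozenset(V) - E_hat - G_hat
    base = F(E_hat)
    return (lambda C: F(E_hat | C) - base), remaining


# Toy objective: concave-of-cardinality term plus a modular term.
c = [-3.0, -2.5, 1.0, 0.2]
F = lambda A: 2.0 * math.sqrt(len(A)) + sum(c[j] for j in A)
V = frozenset(range(4))

A_star = brute_force_min(F, V)
F_hat, remaining = scaled_problem(F, V, frozenset({0}), frozenset({2}))
C_star = brute_force_min(F_hat, remaining)
recovered = frozenset({0}) | C_star
```

On this instance the scaled problem has only two free elements, and `recovered` coincides with `A_star`, as part (iii) of the lemma states.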

However, since $w^*$ is unknown, we cannot directly apply rules R1 and R2 to identify the active set $E$ and inactive set $G$. Inspired by the ideas in the existing safe screening methods ([22, 18, 21]) for convex problems, we can first estimate a region $\mathcal{W}$ that contains $w^*$ and then relax the rules R1 and R2 to practicable versions. Specifically, we first denote

$$\hat{E} := \{j \in V : \min_{w \in \mathcal{W}} [w]_j > 0\}, \tag{2}$$

$$\hat{G} := \{j \in V : \max_{w \in \mathcal{W}} [w]_j < 0\}. \tag{3}$$

It is obvious that $\hat{E} \subseteq E$ and $\hat{G} \subseteq G$. Hence, the rules R1 and R2 can be relaxed as follows:

$$\text{(i)}: j \in \hat{E} \Rightarrow j \in A^*, \tag{R1'}$$

$$\text{(ii)}: j \in \hat{G} \Rightarrow j \notin A^*. \tag{R2'}$$

In view of the rules R1’ and R2’, we sketch the development of IAES as follows:

Step 1: Derive the estimation $\mathcal{W}$ such that $w^* \in \mathcal{W}$.

Step 2: Develop IAES by deriving the detailed screening rules R1’ and R2’.

## 3 The Proposed Element Screening Method

In this section, we first present an accurate estimation of the optimum, obtained by carefully studying the strong convexity of the functions $P(w)$ and $D(s)$, the optimality conditions of SFM and its relationships with the convex problem pair (see Section 3.1). Then, in Section 3.2, we develop our inactive and active element screening rules IES and AES step by step. Finally, in Section 3.3, we develop the screening framework IAES by an alternating application of IES and AES.

### 3.1 Optimum Estimation

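Let $\hat{E}$ and $\hat{G}$ be the active and inactive sets identified by the previous IAES steps (before applying IAES for the first time, they are $\emptyset$). From Lemma 1, we know that the problem SFM can then be reduced to the following scaled problem:

$$\min_{C \subseteq \hat{V}} \hat{F}(C) := F(\hat{E} \cup C) - F(\hat{E}),$$

where $\hat{V} = V/(\hat{E} \cup \hat{G})$. The second term on the right side of the equation above is added to make $\hat{F}(\emptyset) = 0$. Thus, the corresponding problems Q-P and Q-D become:

$$\min_{\hat{w} \in \mathbb{R}^{\hat{p}}} \hat{P}(\hat{w}) = \hat{f}(\hat{w}) + \frac{1}{2} \|\hat{w}\|_2^2, \tag{Q-P'}$$

$$\max_{\hat{s} \in B(\hat{F})} \hat{D}(\hat{s}) = -\frac{1}{2} \|\hat{s}\|_2^2, \tag{Q-D'}$$

where $\hat{f}$ is the Lovász extension of $\hat{F}$ and $\hat{p} = |\hat{V}|$. Now, we turn to estimating the minimizer $\hat{w}^*$ of the problem Q-P’. The result is presented in the theorem below.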

###### Theorem 3.

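For any $\hat{w} \in \operatorname{dom} \hat{P}$, $\hat{s} \in B(\hat{F})$ and $C \subseteq \hat{V}$, we denote the dual gap as $G(\hat{w}, \hat{s}) = \hat{P}(\hat{w}) - \hat{D}(\hat{s})$, then we have

$$\hat{w}^* \in \mathcal{W} = \mathcal{B} \cap \Omega \cap \mathcal{P},$$

where $\mathcal{B} = \{w : \|w - \hat{w}\|_2 \leq \sqrt{2 G(\hat{w}, \hat{s})}\}$, $\Omega = \{w : \hat{F}(\hat{V}) - 2\hat{F}(C) \leq \|w\|_1 \leq \|\hat{s}\|_1\}$, and $\mathcal{P} = \{w : \langle w, \mathbf{1} \rangle = -\hat{F}(\hat{V})\}$.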

From the theorem above, we can see that the estimation $\mathcal{W}$ is the intersection of three sets: the ball $\mathcal{B}$, the $\ell_1$-norm spherical shell $\Omega$ and the plane $\mathcal{P}$. As the optimization proceeds, the dual gap $G(\hat{w}, \hat{s})$ becomes smaller, and $\hat{F}(\hat{V}) - 2\hat{F}(C)$ and $\|\hat{s}\|_1$ converge to $\|\hat{w}^*\|_1$ (see Chapter 7 of [1]). Thus, the volumes of $\mathcal{B}$ and $\Omega$ become smaller and smaller during the optimization process, and the estimation $\mathcal{W}$ becomes more and more accurate.

### 3.2 Inactive and Active Element Screening

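We now turn to developing the screening rules IES and AES based on the estimation of the optimum $\hat{w}^*$.

From (2) and (3), we can see that, to develop the screening rules, we need to solve two problems: $\min_{w \in \mathcal{W}} [w]_j$ and $\max_{w \in \mathcal{W}} [w]_j$. However, since $\mathcal{W}$ is highly non-convex and has a complex structure, it is very hard to solve these two problems efficiently. Hence, we rewrite the estimation as $\mathcal{W} = (\mathcal{B} \cap \mathcal{P}) \cap (\mathcal{B} \cap \Omega)$, and develop two different screening rules on $\mathcal{B} \cap \mathcal{P}$ and $\mathcal{B} \cap \Omega$, respectively.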

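#### 3.2.1 Inactive and Active Element Screening based on $\mathcal{B} \cap \mathcal{P}$

Given the estimation $\mathcal{B} \cap \mathcal{P}$, we derive the screening rules by solving the following problems:

$$\min_{w \in \mathcal{B} \cap \mathcal{P}} [w]_j \quad \text{and} \quad \max_{w \in \mathcal{B} \cap \mathcal{P}} [w]_j.$$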

We show that both of the two problems above admit closed form solutions.

###### Lemma 2.

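Given the estimation ball $\mathcal{B}$, the plane $\mathcal{P}$ and the active and inactive sets $\hat{E}$ and $\hat{G}$, which are identified in the previous IAES steps, for all $j \in \hat{V}$, we denote

$$b_j = 2\Big( \sum_{i \neq j} [\hat{w}]_i + \hat{F}(\hat{V}) - (\hat{p} - 1) [\hat{w}]_j \Big), \qquad c_j = \Big( \sum_{i \neq j} [\hat{w}]_i + \hat{F}(\hat{V}) \Big)^2 - (\hat{p} - 1) \big( 2 G(\hat{w}, \hat{s}) - [\hat{w}]_j^2 \big),$$

then the following hold:

$$\text{(i)}: \min_{w \in \mathcal{B} \cap \mathcal{P}} [w]_j = [w]^{\min}_j := \frac{-b_j - \sqrt{b_j^2 - 4 \hat{p} c_j}}{2 \hat{p}},$$

$$\text{(ii)}: \max_{w \in \mathcal{B} \cap \mathcal{P}} [w]_j = [w]^{\max}_j := \frac{-b_j + \sqrt{b_j^2 - 4 \hat{p} c_j}}{2 \hat{p}}.$$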
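
The closed forms in Lemma 2 are the two roots of a quadratic in $[w]_j$, so they cost $O(\hat{p})$ in total once the sum of the components of $\hat{w}$ is cached. A minimal sketch follows; the function name and the arguments `F_hat_V` (the value $\hat{F}(\hat{V})$) and `gap` (the dual gap $G(\hat{w}, \hat{s})$) are illustrative assumptions, and clipping a slightly negative discriminant to zero is a numerical safeguard added here, not something stated in the lemma.

```python
import math


def coordinate_bounds(w_hat, F_hat_V, gap):
    """Closed-form min/max of [w]_j over the ball-plane intersection (Lemma 2)."""
    p = len(w_hat)
    total = sum(w_hat)
    lo, hi = [], []
    for wj in w_hat:
        rest = total - wj + F_hat_V          # sum_{i != j} [w_hat]_i + F_hat(V_hat)
        b = 2.0 * (rest - (p - 1) * wj)
        c = rest ** 2 - (p - 1) * (2.0 * gap - wj ** 2)
        disc = max(b * b - 4.0 * p * c, 0.0)  # guard against rounding below zero
        root = math.sqrt(disc)
        lo.append((-b - root) / (2.0 * p))
        hi.append((-b + root) / (2.0 * p))
    return lo, hi
```

When $\hat{w}$ already lies on the plane, the formulas simplify to $[\hat{w}]_j \pm \sqrt{2G(\hat{p}-1)/\hat{p}}$, which gives an easy consistency check for an implementation.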

We are now ready to present the active and inactive screening rules AES-1 and IES-1.

###### Theorem 4.

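Given the active and inactive sets $\hat{E}$ and $\hat{G}$, which are identified in the previous IAES steps, then,

(i): The active element screening rule takes the form of

$$[w]^{\min}_j > 0 \Rightarrow j \in A^*, \quad \forall j \in V/(\hat{E} \cup \hat{G}). \tag{AES-1}$$

(ii): The inactive element screening rule takes the form of

$$[w]^{\max}_j < 0 \Rightarrow j \notin A^*, \quad \forall j \in V/(\hat{E} \cup \hat{G}). \tag{IES-1}$$

(iii): The active and inactive sets $\hat{E}$ and $\hat{G}$ can be updated by

$$\hat{E} \leftarrow \hat{E} \cup \Delta\hat{E}, \tag{4}$$

$$\hat{G} \leftarrow \hat{G} \cup \Delta\hat{G}, \tag{5}$$

where $\Delta\hat{E}$ and $\Delta\hat{G}$ are the newly identified active and inactive sets defined as

$$\Delta\hat{E} = \{j \in V/(\hat{E} \cup \hat{G}) : [w]^{\min}_j > 0\}, \qquad \Delta\hat{G} = \{j \in V/(\hat{E} \cup \hat{G}) : [w]^{\max}_j < 0\}.$$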

From the theorem above, we can see that our rules AES-1 and IES-1 are safe in the sense that the detected elements are guaranteed to be included in or excluded from $A^*$.
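
Once the per-element lower and upper bounds are available, applying AES-1 and IES-1 is a pair of threshold tests. The sketch below assumes the bounds are given as dictionaries keyed by element; the function name and this data layout are illustrative choices, not the authors' implementation.

```python
def screen_by_bounds(candidates, lo, hi, E_hat, G_hat):
    """Apply AES-1 / IES-1 to the unscreened elements.

    lo[j] and hi[j] are lower and upper bounds on the j-th component of the
    primal optimum. An element is fixed as active if lo[j] > 0 and removed
    as inactive if hi[j] < 0; all other elements stay undecided.
    """
    new_E = {j for j in candidates if lo[j] > 0}
    new_G = {j for j in candidates if hi[j] < 0}
    return E_hat | new_E, G_hat | new_G, new_E, new_G


E_hat, G_hat, dE, dG = screen_by_bounds(
    candidates={1, 2, 3},
    lo={1: 0.2, 2: -0.5, 3: -1.4},
    hi={1: 1.0, 2: 0.3, 3: -0.1},
    E_hat={0},
    G_hat=set(),
)
```

Element 2 remains undecided because its interval still contains zero; it can only be screened after the dual gap, and hence the interval, shrinks further.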

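#### 3.2.2 Inactive and Active Element Screening based on $\mathcal{B} \cap \Omega$

We now derive the second screening rule pair based on the estimation $\mathcal{B} \cap \Omega$.

Due to the high non-convexity and complex structure of $\mathcal{B} \cap \Omega$, directly solving the problems $\min_{w \in \mathcal{B} \cap \Omega} [w]_j$ and $\max_{w \in \mathcal{B} \cap \Omega} [w]_j$ is time consuming. Notice that, to derive AES and IES, we only need to judge whether the inequalities $\min_{w \in \mathcal{B} \cap \Omega} [w]_j > 0$ and $\max_{w \in \mathcal{B} \cap \Omega} [w]_j < 0$ are satisfied or not, instead of calculating $\min_{w \in \mathcal{B} \cap \Omega} [w]_j$ and $\max_{w \in \mathcal{B} \cap \Omega} [w]_j$. Hence, we only need to infer whether the hypotheses $\{w : w \in \mathcal{B}, [w]_j \leq 0\} \cap \Omega = \emptyset$ and $\{w : w \in \mathcal{B}, [w]_j \geq 0\} \cap \Omega = \emptyset$ are true or false. Thus, from the formulation of $\Omega$ (see Theorem 3), the problems come down to calculating the minimum and the maximum of $\|w\|_1$ over $\{w : w \in \mathcal{B}, [w]_j \leq 0\}$ or $\{w : w \in \mathcal{B}, [w]_j \geq 0\}$, which admit closed form solutions. The results are presented in the lemma below.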

###### Lemma 3.

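Given the estimation ball $\mathcal{B}$ and the active and inactive sets $\hat{E}$ and $\hat{G}$, which are identified in the previous IAES steps, then the following hold:

(i): For all $j \in \hat{V}$, if $|[\hat{w}]_j| > \sqrt{2 G(\hat{w}, \hat{s})}$, then the element $j$ can be identified by rule AES-1 or IES-1 to be active or inactive.

(ii): For all $j \in \hat{V}$, if $0 < [\hat{w}]_j \leq \sqrt{2 G(\hat{w}, \hat{s})}$, we have

$$\min_{w \in \mathcal{B},\, [w]_j \leq 0} \|w\|_1 < \|\hat{w}\|_1,$$

$$\max_{w \in \mathcal{B},\, [w]_j \leq 0} \|w\|_1 = \begin{cases} \|\hat{w}\|_1 - 2[\hat{w}]_j + \sqrt{2 \hat{p}\, G(\hat{w}, \hat{s})}, & \text{if } [\hat{w}]_j - \sqrt{\frac{2 G(\hat{w}, \hat{s})}{\hat{p}}} < 0, \\ \|\hat{w}\|_1 - [\hat{w}]_j + \sqrt{\hat{p} - 1} \sqrt{2 G(\hat{w}, \hat{s}) - [\hat{w}]_j^2}, & \text{otherwise.} \end{cases}$$

(iii): For all $j \in \hat{V}$, if $-\sqrt{2 G(\hat{w}, \hat{s})} \leq [\hat{w}]_j < 0$, we have

$$\min_{w \in \mathcal{B},\, [w]_j \geq 0} \|w\|_1 < \|\hat{w}\|_1,$$

$$\max_{w \in \mathcal{B},\, [w]_j \geq 0} \|w\|_1 = \begin{cases} \|\hat{w}\|_1 + 2[\hat{w}]_j + \sqrt{2 \hat{p}\, G(\hat{w}, \hat{s})}, & \text{if } [\hat{w}]_j + \sqrt{\frac{2 G(\hat{w}, \hat{s})}{\hat{p}}} > 0, \\ \|\hat{w}\|_1 + [\hat{w}]_j + \sqrt{\hat{p} - 1} \sqrt{2 G(\hat{w}, \hat{s}) - [\hat{w}]_j^2}, & \text{otherwise.} \end{cases}$$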
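
Cases (ii) and (iii) of Lemma 3 are mirror images of each other, so they can be evaluated by one routine that works with $|[\hat{w}]_j|$. The sketch below is an illustration under assumptions: the function name and the argument `gap` (the dual gap $G(\hat{w}, \hat{s})$) are hypothetical, and the caller is expected to ensure $0 < |[\hat{w}]_j| \leq \sqrt{2G}$.

```python
import math


def max_l1_opposite_sign(w_hat, j, gap):
    """Max of ||w||_1 over the ball B with [w]_j forced to have the sign
    opposite to [w_hat]_j (or to be zero).

    Assumes 0 < |w_hat[j]| <= sqrt(2 * gap), as in Lemma 3 (ii) and (iii).
    """
    p = len(w_hat)
    l1 = sum(abs(x) for x in w_hat)
    a = abs(w_hat[j])
    if a < math.sqrt(2.0 * gap / p):
        # The evenly spread perturbation is feasible: first branch of the lemma.
        return l1 - 2.0 * a + math.sqrt(2.0 * p * gap)
    # Otherwise [w]_j is pushed exactly to zero: second branch of the lemma.
    return l1 - a + math.sqrt(p - 1) * math.sqrt(2.0 * gap - a * a)
```

For $\hat{w} = (0.5, -2)$ and $G = 0.5$, the first branch applies and gives $2.5 - 1 + \sqrt{2}$; for $\hat{w} = (0.9, -2)$ the second branch gives $2 + \sqrt{0.19}$.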

We are now ready to present the second active and inactive screening rule pair AES-2 and IES-2. From the lemma above, we can see that any element $j$ with $|[\hat{w}]_j| > \sqrt{2 G(\hat{w}, \hat{s})}$ can be screened by rules AES-1 and IES-1. So we now only need to consider the cases when $|[\hat{w}]_j| \leq \sqrt{2 G(\hat{w}, \hat{s})}$.

###### Theorem 5.

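Given a set $C \subseteq \hat{V}$ and the active and inactive sets $\hat{E}$ and $\hat{G}$ identified in the previous IAES steps, then,

(i): The active element screening rule takes the form of

$$\begin{cases} 0 < [\hat{w}]_j \leq \sqrt{2 G(\hat{w}, \hat{s})} \\ \max_{w \in \mathcal{B},\, [w]_j \leq 0} \|w\|_1 < \hat{F}(\hat{V}) - 2\hat{F}(C) \end{cases} \Rightarrow j \in A^*, \quad \forall j \in V/(\hat{E} \cup \hat{G}). \tag{AES-2}$$

(ii): The inactive element screening rule takes the form of

$$\begin{cases} -\sqrt{2 G(\hat{w}, \hat{s})} \leq [\hat{w}]_j < 0 \\ \max_{w \in \mathcal{B},\, [w]_j \geq 0} \|w\|_1 < \hat{F}(\hat{V}) - 2\hat{F}(C) \end{cases} \Rightarrow j \notin A^*, \quad \forall j \in V/(\hat{E} \cup \hat{G}). \tag{IES-2}$$

(iii): The active and inactive sets $\hat{E}$ and $\hat{G}$ can be updated by

$$\hat{E} \leftarrow \hat{E} \cup \Delta\hat{E}, \tag{6}$$

$$\hat{G} \leftarrow \hat{G} \cup \Delta\hat{G}, \tag{7}$$

where $\Delta\hat{E}$ and $\Delta\hat{G}$ are the newly identified active and inactive sets defined as

$$\Delta\hat{E} = \Big\{ j \in V/(\hat{E} \cup \hat{G}) : 0 < [\hat{w}]_j \leq \sqrt{2 G(\hat{w}, \hat{s})},\ \max_{w \in \mathcal{B},\, [w]_j \leq 0} \|w\|_1 < \hat{F}(\hat{V}) - 2\hat{F}(C) \Big\},$$

$$\Delta\hat{G} = \Big\{ j \in V/(\hat{E} \cup \hat{G}) : -\sqrt{2 G(\hat{w}, \hat{s})} \leq [\hat{w}]_j < 0,\ \max_{w \in \mathcal{B},\, [w]_j \geq 0} \|w\|_1 < \hat{F}(\hat{V}) - 2\hat{F}(C) \Big\}.$$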

Theorem 5 verifies the safety of AES-2 and IES-2.
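
The reasoning behind AES-2 can be spelled out as a short chain of implications. The sketch below assumes, as the form of the rule suggests, that every point of $\Omega$ has $\ell_1$ norm at least $\hat{F}(\hat{V}) - 2\hat{F}(C)$ and that $\hat{w}^* \in \mathcal{B} \cap \Omega$:

$$\max_{w \in \mathcal{B},\, [w]_j \leq 0} \|w\|_1 < \hat{F}(\hat{V}) - 2\hat{F}(C) \;\Longrightarrow\; \{w \in \mathcal{B} : [w]_j \leq 0\} \cap \Omega = \emptyset \;\Longrightarrow\; [\hat{w}^*]_j > 0 \;\Longrightarrow\; j \in A^*.$$

The first implication holds because no point of the ball with a non-positive $j$-th component reaches the lower $\ell_1$ bound of $\Omega$; the second because $\hat{w}^*$ lies in both sets; the third is rule R1. The argument for IES-2 is symmetric, with $[w]_j \geq 0$ in place of $[w]_j \leq 0$ and rule R2 in place of R1.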

### 3.3 The Proposed IAES Framework by An Alternating Application of AES and IES

To reinforce the capability of the proposed screening rules, we develop a novel framework IAES in Algorithm 2, which applies the active element screening rules (AES-1 and AES-2) and the inactive element screening rules (IES-1 and IES-2) in an alternating manner during the optimization process. Specifically, we integrate our screening rules AES-1, AES-2, IES-1 and IES-2 with the optimization algorithm for the problems Q-P’ and Q-D’. During the optimization process, we trigger the screening rules every time the dual gap has shrunk by a prescribed factor relative to its value at the last triggering of IAES. As the solver proceeds, the volumes of $\mathcal{B}$ and $\Omega$ decrease to zero quickly, and IAES can thus identify more and more inactive and active elements.
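
The triggering policy described above can be sketched independently of the solver. In the snippet below, the name `screening_schedule`, the parameter name `rho`, and the reading "trigger when the current gap is below `rho` times the gap at the last trigger" are assumptions made for illustration; Algorithm 2 itself is not reproduced here, and the initial reference gap is taken to be the first recorded gap.

```python
def screening_schedule(gaps, rho):
    """Return the iteration indices at which screening would be triggered.

    `gaps` is the sequence of dual gaps produced by the solver. Screening
    fires whenever the current gap drops below `rho` times the gap recorded
    at the previous trigger (initially, the first gap in the sequence).
    """
    triggers = []
    last = gaps[0]
    for it, gap in enumerate(gaps):
        if gap < rho * last:
            triggers.append(it)
            last = gap  # the next trigger is measured against this gap
    return triggers


schedule = screening_schedule([1.0, 0.8, 0.45, 0.3, 0.2, 0.05], rho=0.5)
```

Under this reading, a larger `rho` makes the condition easier to satisfy and so triggers screening more often, at the price of more time spent inside the screening rules.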

Compared with the existing screening methods for convex sparse models, an appealing feature of IAES is that it has no theoretical limit in identifying the inactive and active elements and reducing the problem size. The reason is that, in convex sparse models, screening methods can never rule out the features and samples whose corresponding coefficients in the optimal solution are nonzero. In our case, by contrast, as the optimization proceeds, our estimation becomes accurate enough for us to infer the affiliation of each element with $A^*$. Hence, we can finally identify all the inactive and active elements and the problem size can be reduced to zero. This feature can lead to significant speedups in computation time.

###### Remark 1.

The set $C$ in Algorithm 2 is updated by choosing the super-level set of $\hat{w}$ with the smallest value $\hat{F}(C)$. Obtaining it costs nothing extra. The reason is that most of the existing methods for the problems Q-P’ and Q-D’ need to calculate $\hat{f}(\hat{w})$ in each iteration, for which they need to evaluate $\hat{F}$ at all of the super-level sets of $\hat{w}$ (see the greedy algorithm in [1] for details).
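
The selection described in Remark 1 amounts to a scan over the prefix sets of the sorted order. The sketch below is a standalone illustration; in an actual solver the values would be reused from the greedy pass rather than recomputed, and the function name, the toy objective and the inclusion of the empty set as a candidate are assumptions made here. With ties among components, some prefixes are not true super-level sets, which this sketch ignores.

```python
import math


def best_level_set(F_hat, w_hat):
    """Among the prefix sets of w_hat sorted in decreasing order
    (its super-level sets when there are no ties), return the set
    with the smallest F_hat value together with that value."""
    order = sorted(range(len(w_hat)), key=lambda j: -w_hat[j])
    best_set, best_val = frozenset(), F_hat(frozenset())
    prefix = set()
    for j in order:
        prefix.add(j)
        val = F_hat(frozenset(prefix))
        if val < best_val:
            best_set, best_val = frozenset(prefix), val
    return best_set, best_val


# Toy objective (an assumption for illustration).
c = [-3.0, -2.5, 1.0, 0.2]
F_hat = lambda A: 2.0 * math.sqrt(len(A)) + sum(c[j] for j in A)
C, value = best_level_set(F_hat, [2.0, 1.0, -1.5, -0.3])
```

A smaller $\hat{F}(C)$ raises the threshold $\hat{F}(\hat{V}) - 2\hat{F}(C)$ used in AES-2 and IES-2, which makes those rules easier to fire.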

###### Remark 2.

The solver in Algorithm 2 can be any method for the problems Q-P’ and Q-D’, such as the minimum-norm point algorithm [26] or conditional gradient descent [5]. Although some algorithms only update $\hat{s}$, in IAES we can update $\hat{w}$ in each iteration by letting $\hat{w} = -\hat{s}$ and refining it with the pool adjacent violators algorithm [2].

###### Remark 3.

Due to Lemma 1 and the safety of AES-1, AES-2, IES-1 and IES-2, we can see that IAES would never sacrifice any accuracy.

###### Remark 4.

Although step 14 in Algorithm 2 may increase the dual gap slightly, it is worth it because of the reduced problem size. This is verified by the speedups gained by IAES in the experiments.

###### Remark 5.

The parameter in Algorithm 2 controls how frequently IAES is triggered. The larger its value, the more frequently IAES is triggered, but the more computational time IAES consumes. In our experiment, we set and it achieves a good performance.

## 4 Experiments

We evaluate IAES through numerical experiments on both synthetic and real datasets using two measurements. The first one is the rejection ratio of IAES over iterations: $\frac{m_i + n_i}{m^* + n^*}$, where $m_i$ and $n_i$ are the numbers of the active and inactive elements identified by IAES after the $i$-th iteration, and $m^*$ and $n^*$ are the numbers of the active and inactive elements in $V$. We notice that $m^* + n^* = p$ in our experiments, so the rejection ratio represents the fraction of the problem size removed by IAES. The second measurement is the speedup, i.e., the ratio of the running time of the solver without IAES to that with IAES. We set the accuracy to be .
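
Both measurements are simple ratios; the sketch below only fixes the arithmetic, with all function and argument names chosen here for illustration.

```python
def rejection_ratio(active_found, inactive_found, active_total, inactive_total):
    """Fraction of elements already identified by screening after an iteration."""
    return (active_found + inactive_found) / (active_total + inactive_total)


def speedup(time_without_screening, time_with_screening):
    """Ratio of solver running time without screening to that with screening."""
    return time_without_screening / time_with_screening


ratio = rejection_ratio(3, 5, 4, 12)
gain = speedup(30.0, 3.0)
```

A rejection ratio of 1 means every element has been classified, so the remaining problem has size zero.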

Recall that IAES can be integrated with any solver for the problems Q-P and Q-D. In this experiment, we use one of the most widely used algorithms, the minimum-norm point algorithm (MinNorm) [26], as the solver. The function $F$ varies according to the dataset; its detailed definitions are given in the subsequent sections.

We write the code in Matlab and perform all the computations on a single core of Intel(R) Core(TM) i7-5930K 3.50GHz, 32GB MEM.

### 4.1 Experiments on Synthetic Datasets

We perform experiments on the synthetic dataset named two-moons with different sample size (see Figure 1 for an example). All the data points are sampled from two different semicircles. Specifically, each point can be presented as , where stands for the two semicircles, , is generated from a normal distribution , and are sampled from two uniform distributions and , respectively. We first sample data points from these two semicircles with equal probability. Then, we randomly choose samples and label each of them as positive if it is from the first semicircle and otherwise label it as negative. We generate five datasets by varying the sample size in . We perform semi-supervised clustering on each dataset and the objective function are defined as:

$$F(A) = I(f_A, f_{V/A}) - \sum_{j \in A} \log \eta_j - \sum_{j \in V/A} \log(1 - \eta_j),$$

where is the mutual information between two Gaussian processes with a Gaussian kernel , if is labeled and otherwise (see Chapter 3 of [1] for more details). The kernel matrix here is dense with the size , leading to a big computational cost when is large.

Figure 2 presents the rejection ratios of IAES on two-moons. We can see that IAES can find the active and inactive elements incrementally during the optimization process. It can finally identify almost all of the elements and reduce the problem size to nearly zero in no more than 400 iterations, which is consistent with our theoretical analysis in Section 3.3.

Figure 3 visualizes the screening process of IAES on two-moons when . It shows that, during the optimization process, IAES identifies the elements those are easy to be classified first and then identify the rest.

Table 1 reports the running time of MinNorm without and with AES (AES-1 + AES-2), IES (IES-1 + IES-2) and IAES for solving the problem SFM on two-moons. We can see that the speedup of IAES can be up to 10 times. On all the datasets, IAES is significantly faster than MinNorm alone and than MinNorm with AES or IES. Finally, we can see that the time cost of AES, IES and IAES themselves is negligible.

### 4.2 Experiments on Real Datasets

In this experiment, we evaluate the performance of IAES on the image segmentation problem. We use five image segmentation instances (included in the supplemental material) in [20] to evaluate IAES. The objective function is the sum of unary potential for each pixel and pairwise potential of -neighbor grid graph:

where presents all the pixels, is the unary potential derived from the Gaussian Mixture model [20], ( and are the values of two pixels) if are neighbors, otherwise . Table 2 provides the statistics of the resulting image segmentation problems, including the numbers of the pixels and the edges in the -neighbor grid graph.

The rejection ratios in Figure 4 show that IAES can identify the active and inactive elements incrementally during the optimization process until all of them are identified. This implies that IAES can lead to a significant speedup.

Table 3 reports the detailed time cost of MinNorm without and with AES, IES and IAES for solving the image segmentation problems. We can see that IAES leads to significant speedups of up to 30.7 times. In addition, we notice that the speedup gained by AES is small. The reason is that AES is used to identify the pixels of the foreground, which is a small region in the image, and thus the problem size cannot be reduced dramatically even if all the active elements are identified.

Finally, from Table 3, we can also see that the speedup we achieve is super-additive (speedup of AES + speedup of IES < speedup of IAES). This can usually be expected: it comes from the super-linear computational complexity of each iteration in MinNorm, which leads to a super-additive saving in computational cost. We notice that the speedup we achieve on some of the two-moons datasets is not super-additive. The reason is that we cannot identify many elements in the early stage (Figure 2), so the early stage takes up too much of the total time.

## 5 Conclusion

In this paper, we proposed a novel safe element screening method IAES for SFM to accelerate its optimization process by simultaneously identifying the active and inactive elements. Our major contribution is a novel framework for accurately estimating the optimum of the primal problem corresponding to SFM, developed by carefully studying the strong convexity of the primal and dual problems, the structure of the base polyhedra and the optimality conditions of SFM. To the best of our knowledge, IAES is the first screening method in the field of SFM and even in combinatorial optimization. Experimental results demonstrate that IAES can achieve significant speedups in solving SFM problems.

## Appendix A

In this appendix, we present the detailed proofs of all the theorems in the main text.

### A.1 Proof of Theorem 1

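###### Proof of Theorem 1.

(i) Since $f(w) = \max_{s \in B(F)} \langle w, s \rangle$, we have

$$\begin{aligned}
&\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^{p} \psi_j([w]_j) && (8) \\
=\ &\min_{w \in \mathbb{R}^p} \max_{s \in B(F)} \langle w, s \rangle + \sum_{j=1}^{p} \psi_j([w]_j) \\
=\ &\max_{s \in B(F)} \min_{w \in \mathbb{R}^p} \langle w, s \rangle + \sum_{j=1}^{p} \psi_j([w]_j) && (9) \\
=\ &\max_{s \in B(F)} -\sum_{j=1}^{p} \psi_j^*(-[s]_j), && (10)
\end{aligned}$$

where Eq. (9) holds by the strong duality theorem [3], and Eq. (10) follows from the definition of the Fenchel conjugate of $\psi_j$.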

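(ii) From Eq. (8), we have that

$$s^* \in \arg\max_{s \in B(F)} \langle w^*, s \rangle \;\Leftrightarrow\; \langle w^*, s^* \rangle \geq \langle w^*, s \rangle,\ \forall s \in B(F) \;\Leftrightarrow\; w^* \in N_{B(F)}(s^*).$$

From Eq. (10), we have that

$$w^* \in \arg\min_{w \in \mathbb{R}^p} \langle w, s^* \rangle + \sum_{j=1}^{p} \psi_j([w]_j) \;\Leftrightarrow\; [s]^*_k \in -\partial \psi_k([w]^*_k),\ \forall k \in V.$$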

The proof is complete. ∎

### A.2 Proof of Lemma 1

###### Proof of Lemma 1.

(i) It is an immediate consequence of Theorem 2.

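(ii) Since $\hat{E} \subseteq A^*$ and $\hat{G} \cap A^* = \emptyset$, we can solve the problem SFM by fixing the set $\hat{E}$ and optimizing over $V/(\hat{E} \cup \hat{G})$. The objective function becomes $F(\hat{E} \cup C)$ with $C \subseteq V/(\hat{E} \cup \hat{G})$. Thus, SFM can be reduced to

$$\min_{C \subseteq V/(\hat{E} \cup \hat{G})} \hat{F}(C) := F(\hat{E} \cup C) - F(\hat{E}).$$

The second term of the new objective function is added to make $\hat{F}(\emptyset) = 0$, which is essential in submodular function analysis, e.g., for the Lovász extension and the submodular and base polyhedra.

Below, we argue that $\hat{F}$ is a submodular function.

For all $S \subseteq V/(\hat{E} \cup \hat{G})$ and $T \subseteq V/(\hat{E} \cup \hat{G})$, we have

$$\begin{aligned}
\hat{F}(S) + \hat{F}(T) &= \big( F(\hat{E} \cup S) - F(\hat{E}) \big) + \big( F(\hat{E} \cup T) - F(\hat{E}) \big) \\
&= F(\hat{E} \cup S) + F(\hat{E} \cup T) - 2F(\hat{E}) \\
&\geq F\big( (\hat{E} \cup S) \cup (\hat{E} \cup T) \big) + F\big( (\hat{E} \cup S) \cap (\hat{E} \cup T) \big) - 2F(\hat{E}) && (11) \\
&= F\big( \hat{E} \cup (S \cup T) \big) + F\big( \hat{E} \cup (S \cap T) \big) - 2F(\hat{E}) \\
&= \big( F(\hat{E} \cup (S \cup T)) - F(\hat{E}) \big) + \big( F(\hat{E} \cup (S \cap T)) - F(\hat{E}) \big) \\
&= \hat{F}(S \cup T) + \hat{F}(S \cap T).
\end{aligned}$$

The inequality (11) comes from the submodularity of $F$.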

(iii) It is the immediate conclusion of (ii).

The proof is complete. ∎

### A.3 Proof of Theorem 3

To prove Theorem 3, we need the following lemma.

###### Lemma 4.

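(Dual of minimization of submodular functions) [Proposition 10.3 in [1]]. Let $F$ be a submodular function such that $F(\emptyset) = 0$. We have:

$$\min_{A \subseteq V} F(A) = \max_{s \in B(F)} s_-(V) = \frac{1}{2} \Big( F(V) - \min_{s \in B(F)} \|s\|_1 \Big), \tag{12}$$

where $[s_-]_k = \min\{[s]_k, 0\}$ for $k = 1, \dots, p$.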

We now turn to prove Theorem 3.

###### Proof of Theorem 3.

Since is -strongly convex, for any and , we can have

$$\hat{P}(\hat{w}) \geq \hat{P}(\hat{w}^*) + \langle \hat{g}, \hat{w} - \hat{w}^* \rangle + \frac{1}{2} \|\hat{w} - \hat{w}^*\|_2^2.$$

where .

Since , it holds . Hence, we can get

 12∥^w−^w∗∥22≤^P(^