# Quadratic Decomposable Submodular Function Minimization

Pan Li
UIUC
panli2@illinois.edu
&Niao He
UIUC
niaohe@illinois.edu
&Olgica Milenkovic
UIUC
milenkov@illinois.edu
###### Abstract

We introduce a new convex optimization problem, termed quadratic decomposable submodular function minimization. The problem arises in many learning on graphs and hypergraphs settings and is closely related to decomposable submodular function minimization. We approach the problem via a new dual strategy and describe an objective that may be optimized via random coordinate descent (RCD) methods and projections onto cones. We also establish the linear convergence rate of the RCD algorithm and develop efficient projection algorithms with provable performance guarantees. Numerical experiments in transductive learning on hypergraphs confirm the efficiency of the proposed algorithm and demonstrate the significant improvements in prediction accuracy with respect to state-of-the-art methods.

## 1 Introduction

Given , a submodular function is a set function that for any satisfies . Submodular functions are ubiquitous in machine learning as they capture rich combinatorial properties of set functions and provide useful regularization functions for supervised and unsupervised learning bach2013learning (). Submodular functions also have continuous Lovász extensions lovasz1983submodular (), which establish solid connections between combinatorial and continuous optimization problems.

Due to their versatility, submodular functions and their Lovász extensions are frequently used in applications such as learning on directed/undirected graphs and hypergraphs zhu2003semi (); zhou2004learning (), image denoising via total variation regularization osher2005iterative (); chambolle2009total () and MAP inference in high-order Markov random fields kumar2003discriminative (). In many optimization settings involving submodular functions, one encounters the convex program

 minx∑i∈[N](xi−ai)2+∑r∈[R][fr(x)]p,

where , , and where for all in some index set , stands for the Lovász extension of a submodular function that describes a combinatorial structure over the set . For example, in image denoising, each parameter may correspond to the observed value of a pixel , while the functions may be used to impose smoothness constraints on pixel neighborhoods. One of the main difficulties in solving this optimization problem comes from the nondifferentiability of the second term: a direct application of subgradient methods leads to convergence rates as slow as , where denotes the number of iterations shor2012minimization ().

In recent years, the above described optimization problem with has received significant interest in the context of decomposable submodular function minimization (DSFM) stobbe2010efficient (). The motivation for studying this particular setup is two-fold: first, solving the convex optimization problem directly recovers the combinatorial solution to the submodular min-cut problem , where  fujishige2005submodular (); second, minimizing a submodular function decomposed into a sum of simpler components , is much easier than minimizing an unrestricted submodular function over a large set . There are several milestone results for the DSFM problem: Jegelka et al. jegelka2013reflection () first tackled the problem by considering its dual and proposed a solver based on Douglas-Rachford splitting. Nishihara et al. nishihara2014convergence () established the linear convergence rate of alternating projection methods for solving the dual problem. Ene et al. ene2015random (); ene2017decomposable () presented linear convergence rates of coordinate descent methods and subsequently tightened the results via submodular flows. Pan et al. li2018revisiting () improved those methods by leveraging incidence relations of the arguments of submodular function components.

Here, we focus on the other important case when ; we refer to the underlying optimization problem as quadratic DSFM (QDSFM). QDSFM appears naturally in a wide spectrum of applications, including learning on graphs and hypergraphs, and in particular, transductive learning and PageRank. It has also been demonstrated both theoretically johnson2007effectiveness () and empirically zhou2004learning (); hein2013total () that employing regularization with quadratic terms offers significantly improved predictive performance when compared to the case when . Despite the importance of the QDSFM problem, its theoretical and algorithmic developments have not reached the same level of maturity as those for the DSFM problem. To the best of our knowledge, only a few reported works hein2013total (); zhang2017re () have provided solutions for specific instances of QDSFMs with sublinear convergence guarantees.

This work takes a substantial step towards solving the QDSFM problem in its most general form by developing a family of algorithms with linear convergence rate and small iteration cost, including the randomized coordinate descent (RCD) and alternative projection (AP) algorithms. Our contributions are as follows. First, we derive a new dual formulation for the QDSFM problem since an analogue of the dual transformation for the DSFM problem is not applicable. Interestingly, the dual QDSFM problem requires one to find the best approximation of a hyperplane via a product cone as opposed to a product polytope, encountered in the dual DSFM problem. Second, we develop a linearly convergent RCD (and AP) algorithm for solving the dual QDSFM. Because of the special underlying conic structure, new analytic approaches are needed to prove the weak strong convexity of the dual QDSFM, which essentially guarantees linear convergence. Third, we develop generalized Frank-Wolfe and min-norm-point methods for efficiently computing the conic projection required in each step of RCD (and AP) and provide a -rate convergence analysis. Finally, we evaluate our methods on transductive learning over hypergraphs using synthetic and real datasets, and demonstrate superior performance both in convergence rate and prediction accuracy compared to existing methods.

## 2 Notation and Problem Formulation

For a submodular function defined over the ground set , the Lovász extension is a convex function defined for all according to

 f(x)=N−1∑k=1F({i1,...,ik})(xik−xik+1)+F([N])xiN, (1)

where . The base polytope of , denoted by , is defined as

 B={y∈RN|y(S)≤F(S),∀S⊂[N],y([N])=F([N])}. (2)

Using the base polytope, the Lovász extension can also be written as

We say that an element is incident to if there exists a such that . Furthermore, we use to denote the function . Given a positive diagonal matrix and a vector , we define the -norm according to and simply use when , the identity matrix. For an index set , we denote the -product of -dimensional Euclidean spaces by . A vector is written as where for all . The -norm induced on equals . We reserve the symbol for .

Next, we formally state the QDSFM problem. Consider a collection of submodular functions defined over the ground set , and denote their Lovász extensions and base polytopes by and , respectively. We use to denote the set of variables incident to and make the further assumption that the functions are normalized and nonnegative, i.e., that and . These two mild constraints are satisfied by almost all submodular functions that arise in practical applications. We consider the following minimization problem:

 QDSFM:minx∈RN∥x−a∥2W+∑r∈[R][fr(x)]2, (3)

where is a given vector and is a positive diagonal matrix. As an immediate observation, the problem has a unique solution, denoted by , due to the strong convexity of (3).

## 3 Applications

We start by reviewing some important machine learning problems that give rise to QDSFM.

Transductive Learning (TL) is a learning paradigm that allows one to utilize the underlying structure or distribution of unlabeled samples, whenever the information provided by labeled samples does not suffice for learning an inductive predictor gammerman1998learning (); joachims2003transductive (). A standard setting for a -class transductive learner is as follows: given data points and labels for the first () samples , the learner is asked to infer the labels for all the remaining data points . The widely-used TL problem with least squared loss requires one to solve regularization problems: for each class , set the scores of data points within the class to

 ^x(k)=argminx(k)β∥x(k)−a(k)∥2+Ω(x(k)),

where represents the information provided by the known labels, i.e., if , and otherwise, denotes a hyperparameter and stands for a smoothness regularizer. The labels of the data points are inferred according to . For typical graph and hypergraph learning problems, is often chosen to be a Laplacian regularizer constructed using (see Table 1). In Laplacian regularization, each edge/hyperedge corresponds to one functional component in the QDSFM problem. Note that the variables may also be normalized with respect to their degrees, in which case the normalized Laplacian is used instead. For example, in graph learning, one of the terms in assumes the form where and correspond to the degrees of the vertex variables and , respectively. It can be shown using some simple algebra that the normalization term reduces to the matrix used in the definition of the QDSFM problem (3).

PageRank (PR) is a well-known method used for ranking Web pages page1999pagerank (). Web pages are linked and they naturally give rise to a graph , where, without loss of generality, one may assume that . Let and be the adjacency matrix and diagonal degree matrix of , respectively. PR essentially finds a fixed point via the iterative procedure , where is a fixed vector and . It is easy to verify that is a solution of the problem

 minp(1−α)α∥p−s∥2D−1+(D−1p)T(D−A)(D−1p)=∥x−a∥2W+∑ij∈E(xi−xj)2, (4)

where , and . Obviously, (4) may be viewed as a special instance of the QDSFM problem. Note that the PR iterations on graphs take the form where is the normalized Laplacian of the graph. The PR problem for hypergraphs is significantly more involved, and may be formulated using diffusion processes (DP) based on a normalized hypergraph Laplacian operator  chan2018spectral (). The underlying PR procedure reads as where is the potential vector at time . Tracking this DP precisely for every time point is a difficult task which requires solving a densest subset problem chan2018spectral (). However, the stationary point of this problem, i.e., a point that satisfies may be easily found by solving the optimization problem The term matches the normalized regularization term for hypergraphs listed in Table 1, i.e., . Clearly, once again this leads to the QDSFM problem. The PR equation for directed or submodular hypergraphs can be stated similarly using the Laplacian operators described in chan2017diffusion (); li2018submodular (). The PR algorithm defined in this manner has many advantages over the multilinear PR method based on higher-order Markov chains gleich2015multilinear (), since it allows for arbitrarily large orders and is guaranteed to converge for any . In a companion paper, we provide a more detailed analysis of the above described PR method.

## 4 Algorithms for Solving the QDSFM Problem

We describe next the first known linearly convergent algorithms for solving the QDSFM problem. To start with, observe that the QDSFM problem is convex since the Lovász extensions are convex and nonnegative. But the objective is in general nondifferentiable. To address this issue, we consider the dual of the QDSFM problem. A natural idea is to try to mimic the approach used for DSFM by invoking the characterization of the Lovász extension, , . However, this leads to a semidefinite programing problem for the dual variables , which is complex to solve for large problems. Instead, we establish a new dual formulation that overcomes this obstacle. The dual formulation hinges upon on the following key observation:

 [fr(x)]2=maxϕr≥0ϕrfr(x)−ϕ2r4=maxϕr≥0maxyr∈ϕrBr⟨yr,x⟩−ϕ2r4. (5)

Let and . Using equation (5), we arrive at

###### Lemma 4.1.

The following optimization problem is dual to (3):

 miny,ϕg(y,ϕ):=∥∑r∈[R]yr−2Wa∥2W−1+∑r∈[R]ϕ2r,s.t. y∈⊗r∈[R]ϕrBr, ϕ∈⊗r∈[R]R≥0. (6)

By introducing , the previous optimization problem can be rewritten as

 miny,ϕ,Λ∑r∈[R][∥yr−λr√R∥2W−1+ϕ2r],s.t. y∈⊗r∈[R]ϕrBr, ϕ∈⊗r∈[R]R≥0,∑r∈[R]λr=2Wa. (7)

The primal variables in both cases are recovered via .

Counterparts of the above results for the DSFM problem were discussed in Lemma 2 of jegelka2013reflection (). However, there is a significant difference between jegelka2013reflection () and the QDSFM problem, since in the latter setting we use a conic set constructed from base polytopes of submodular functions. More precisely, for each , we define a convex cone which gives the feasible set of the dual variables . The optimization problem (7) essentially asks one to find the best approximation of an affine space in terms of a product cone , as opposed to a product polytope encountered in DSFM. Several algorithms have been developed for solving the DSFM problem, including the Douglas-Rachford splitting method (DR) jegelka2013reflection (), the alternative projection method (AP) nishihara2014convergence () and the random coordinate descent method (RCD) ene2015random (). Similarly, for QDSFM, we propose to solve the dual problem (6) using the RCD method exploiting the separable structure of the feasible set, and to solve (7) using the AP method. The analysis of the latter is deferred to the Supplement (Section C). It is worth mentioning that results of this work can be easily extended for the DR method, as well as accelerated and parallel variants of the RCD method ene2015random (); li2018revisiting ().

RCD Algorithm. Define the projection onto a convex cone as follows: for a given point in , let . For each coordinate , optimizing over the dual variables is equivalent to computing a projection onto the cone . This gives rise to the RCD method summarized in Algorithm 1.

In Section 5, we describe efficient methods to compute the projections. But throughout the remainder of this section, we treat the projections as provided by an oracle. Note that each iteration of the RCD method only requires the computation of one projection onto a single cone. In contrast, methods such as DR, AP and the primal-dual hybrid gradient descent (PDHG) proposed in chambolle2011first () used for TL on hypergraphs hein2013total (), require performing a complete gradient descent and computing a total of projections at each iteration. Thus, from the perspective of iteration cost, RCD is significantly more efficient, especially when is large and computing is costly.

The objective described in (6) is not strongly convex in general. Inspired by the work for DSFM ene2015random (), in what follows, we show that this objective indeed satisfies a weak strong convexity condition, which guarantees linear convergence of the RCD algorithm. We start by providing a general result that characterizes relevant geometric properties of the cone .

###### Lemma 4.2.

Consider a feasible solution and a nonnegative vector . Let be an arbitrary point in the base polytope of , and let be two positive diagonal matrices. Then, there exists a such that and

 ∥y−y′∥2I(W(1))+∥ϕ−ϕ′∥2≤μ(W(1),W(2))⎡⎣∥∑r∈[R]yr−s∥2W(2)+∥ϕ−ϕ′∥2⎤⎦,

where

 μ(W(1),W(2))=max⎧⎨⎩∑i∈[N]W(1)ii∑j∈[N]1/W(2)jj,94ρ2∑i∈[N]W(1)ii+1⎫⎬⎭. (8)

As a corollary of Lemma 4.2, the next result establishes the weak strong convexity of . To proceed, we introduce some additional notation. Denote the set of solutions of problem (6) by

 Ξ={(y,ϕ)|∑r∈[R]yr=2W(a−x∗),ϕr=infyr∈θBrθ,∀r}.

Note that this representation arises from the relationship between the optimal primal and dual solution as stated in Lemma 4.1. We denote the optimal value of the objective over by , and define a distance function .

###### Lemma 4.3.

Suppose that and that minimizes . Then

 ∥∑r∈[R](yr−y∗r)∥2W−1+∥ϕ−ϕ∗∥2≥d2((y,ϕ),Ξ)μ(W−1,W−1).

Based on Lemma 4.3, we can establish the linear convergence rate of the RCD algorithm.

###### Theorem 4.4.

After iterations of Algorithm 1, we obtain a pair that satisfies

 E[g(y(k),ϕ(k))−g∗+d2((y(k),ϕ(k)),Ξ)] ≤[1−2R[1+μ(W−1,W−1)]]k[g(y(0),ϕ(0))−g∗+d2((y(0),r(0)),Ξ)].

Theorem 4.4 implies that iterations are required to obtain an -optimal solution. Below we give the explicit characterization of the complexity for the TL and PR problems with normalized Laplacian regularization as discussed in Section 3.

###### Corollary 4.5.

Suppose that , where is a hyper-parameter, and is a diagonal degree matrix such that . Algorithm 1 requires an expected number of iterations to return an -optimal solution.

The term also appears in the expression for the complexity of the RCD method for solving the DSFM problem ene2017decomposable (). The term implies that whenever is small, the convergence rate is slow. This makes sense: for example, in the PR problem (4), a small corresponds to a large , which typically implies longer mixing times of the underlying Markov process. The term arises due to the degree-based normalization.

## 5 Computing the Projections ΠCr(⋅)

In this section, we provide efficient routines for computing the projection onto the conic set . As the procedure works for all values of , we drop the subscript for simplicity of notation. First, recall that

 ΠC(a)=argmin(y,ϕ)h(y,ϕ)≜∥y−a∥2~W+ϕ2s.t. y∈ϕB, ϕ≥0, (9)

where , and where denotes the base polytope of the submodular function . Let and be the optimal value of the objective function and the argument that optimizes it, respectively. When performing projections, one only needs to consider the variables incident to , and set all other variables to zero. For ease of exposition, we assume that all variables in are incident to

Unlike QDSFM, the DSFM involves the computation of projections onto the base polytopes of submodular functions. Two algorithms, the Frank-Wolfe (FW) method frank1956algorithm () and the Fujishige-Wolfe minimum norm algorithm (MNP) fujishige2011submodular (), are used for this purpose. Both methods assume cheap linear minimization oracles on polytopes and attain a -convergence rate. The MNP algorithm is more sophisticated and empirically more efficient. Nonetheless, neither of these methods can be applied directly to cones. To this end, we modify these two methods by adjusting them to the conic structure in 9 and show that a convergence rate still holds. We refer to the procedures as the conic MNP method and the conic FW method, respectively. Here we focus mainly on the conic MNP method described in Algorithm 2, as it is more sophisticated. A detailed discussion of the conic FW method and its convergence guarantees can be found in the Supplement (see Section E).

The conic MNP algorithm keeps track of an active set and searches for the best solution in its conic hull. Let us denote the cone of an active set as and its linear set as . Similar to the original MNP algorithm, Algorithm 2 also contains two level-loops: MAJOR and MINOR. In the MAJOR loop, we greedily add a new active point to the set obtained from the linear minimization oracle w.r.t. the base polytope (Step 2), and by the end of the MAJOR loop, we obtain a that minimizes over (Step 3-8). The MINOR loop is activated when contains some point that guarantees a smaller value of the objective function than that of the optimal point in , provided that some active points from may be removed. Compared to the original MNP method, Steps 2 and 5 as well as the termination Step 3 are specialized for the conic structure.

The following convergence result implies that the conic MNP algorithm also has a convergence rate of order ; the proof is essentially independent on the submodularity assumption and represents a careful modification of the arguments in chakrabarty2014provable () for conic structures.

###### Theorem 5.1.

Let be an arbitrary polytope in and let be the cone induced by the polytope. For some positive diagonal matrix , define . Algorithm 2 yields a sequence of such that decreases monotonically. Algorithm 2 terminates when with .

Both the (conic) FW and MNP are approximate algorithms for computing the projections for generic polytopes and their induced cones. We also devised an algorithm of complexity that exactly computes the projection for polytopes arising in learning on (un)directed hypergraphs. A detailed description of the algorithm for exact projections is described in Section G of the Supplement.

## 6 Experiments

Our dataset experiments focus on TL learning for hypergraphs on both real and synthetic datasets. For the particular problem at hand, the QDSFM problem can be formulated as follows

 minx∈RNβ∥x−a∥2+∑r∈[R]maxi,j∈Sr(xi√Wii−xj√Wjj)2, (10)

where indicates if the corresponding variable has a negative, missing, or positive label, respectively, is a parameter that balances out the influence of observations and the regularization term, defines a positive measure over the variables and may be chosen to be the degree matrix with . Each part in the decomposition corresponds to one hyperedge. We compare eight different solvers falling into three categories: (a) our proposed general QDSFM solvers, QRCD-SPE, QRCD-MNP, QRCD-FW and QAP-SPE; (b) alternative solvers for the specific problem (10), including PDHG hein2013total () and SGD zhang2017re (); (c) TL solvers that do not use QDSFM as the objective, including DRCD ene2015random () and InvLap zhou2007learning (). The first three methods all have outer-loops that execute RCD, but with different inner-loop projections computed via the exact projection algorithm for undirected hyperedges (see Section G in the Supplement), or the generic MNP and FW. The QAP-SPE method uses AP in the outer-loop and exact inner-loop projections. PDHG and SGD are the only known solvers for the specific objective (10). DRCD is a state-of-the-art solver for DSFM and also uses a combination of outer-loop RCD and inner-loop projections. InvLap first transforms hyperedges into cliques and then solves a Laplacian-based linear problem. All the aforementioned methods, except InvLap, are implemented via C++ in a nonparallel fashion. InvLap is executed via matrix inversion operations in Matlab which may be parallelized. The CPU times of all methods are recorded on a 3.2GHz Intel Core i5. The results are summarized for independent tests. When reporting the results, we use the primal gap (“gap”) to characterize the convergence properties of different solvers. Additional descriptions of the settings and experimental results for the QRCD-MNP and QRCD-FW methods for general submodular functions may be found in the Supplement.
Synthetic data. We generated a hypergraph with vertices that belong to two equal-sized clusters. We uniformly at random generated hyperedges within each cluster and hyperedges across the two clusters. Note that in higher-order clustering, we do not need to have many hyperedges within each cluster to obtain good clustering results. Each hyperedge includes vertices. We also uniformly at random picked vertices from each cluster to represent labeled datapoints. With the vector obtained by solving (10), we classified the variables based on the Cheeger cut rule hein2013total (): suppose that and define . We partition into two sets where

 j∗=argminj∈[N]c(Sj)≜|Sr∩Sj≠∅,Sr∩¯Sj≠∅}|max{∑r∈[R]|Sr∩Sj|,∑r∈[R]|Sr∩¯Sj|}.

The classification error is defined as (# of incorrectly classified vertices). In the experiment, we used , and tuned to be nearly optimal for different objectives with respect to the classification error rates.

The top-left figure in Figure 1 shows that QRCD-SPE converges much faster than all other methods when solving the problem (10) according to the gap metric (we only plotted the curve for as all other values of produce similar patterns). To avoid clutter, we postpone the results for QRCD-MNP and QRCD-FW to the Supplement (see Section H.3), as these methods are typically to times slower than QRCD-SPE. In the table that follows, we describe the performance of different methods with similar CPU-times. Note that when QRCD-SPE converges (with duality gap achieved after s), the obtained leads to a much smaller classification error than other methods. QAP-SPE, PDHG and SGD all have large classification errors as they do not converge within short CPU time-frames. QAP-SPE and PDHG perform only a small number of iterations, but each iteration computes the projections for all the hyperedges, which is more time-consuming. The poor performance of DRCD implies that the DFSM is not a good objective for TL. InvLap achieves moderate classification errors, but still does not match the performance of QRCD-SPE. Note that InvLap uses Matlab, which is optimized for matrix operations, and is hence fairly efficient. However, for experiments on real datasets, where one encounters fewer but significantly larger hyperedges, InvLap does not offer as good a performance as the one for synthetic data. The column “Average ” also illustrates that the QDSFM objective is a good choice for finding approximate balanced cuts of hypergraphs.

We also evaluated the influence of parameter choices on the convergence of QRCD methods. According to Theorem 4.4, the required number of RCD iterations for achieving an -optimal solution for (10) is roughly (see Section H.2 in the Supplement). We mainly focus on testing the dependence on the parameters and , as the term is also included in the iteration complexity of DSFM and was shown to be necessary given certain submodular structures li2018revisiting (). To test the effect of , we fix for all , and vary . To test the effect of , we fix and randomly choose half of the vertices and set their values to lie in , and set the remaining ones to . The two top-right plots of Figure. 1 show our results. The logarithm of gap ratios is proportional to for small and , which is not as sensitive as predicted by Theorem 4.4. Moreover, when is relatively large (), the dependence on levels out.

Real data. We also evaluated the proposed algorithms on three UCI datasets: Muchroom, Covertype45, Covertype67, used as standard datasets for TL on hypergraphs zhou2007learning (); hein2013total (); zhang2017re (). Each dataset corresponds to a hypergraph model as described in hein2013total (): entries correspond to vertices while each categorical feature is modeled as one hyperedge; numerical features are first quantized into bins of equal size, and then mapped to hyperedges. Compared to synthetic data, in this datasets, the size of most hyperedges is much larger ( 1000) while the number of hyperedges is small (). Previous works have been shown that fewer classification errors can be achieved by using QDSFM as an objective instead of DSFM or InvLap hein2013total (). In our experiment, we focused on comparing the convergence of different solvers for QDSFM. We set and for all and set the number of observed labels to , which is a proper setting as described in hein2013total (). Figure. 2 shows the results. Again, the proposed QRCD-SPE and QAP-SPE methods both converge faster than PDHG and SGD, while QRCD-SPE performs the best. Note that we did not plot the results for QRCD-MNP and QRCD-FW as the methods converge extremely slowly due to the large sizes of the hyperedges. InvLap requires , and seconds to run on the Mushroom, Covertype45 and Covertype67 datasets, respectively. Hence, the methods do not scale well.

## References

• (1) F. Bach, “Learning with submodular functions: A convex optimization perspective,” Foundations and Trends® in Machine Learning, vol. 6, no. 2-3, pp. 145–373, 2013.
• (2) L. Lovász, “Submodular functions and convexity,” in Mathematical Programming The State of the Art.   Springer, 1983, pp. 235–257.
• (3) X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International conference on Machine learning, 2003, pp. 912–919.
• (4) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems (NIPS), 2004, pp. 321–328.
• (5) S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regularization method for total variation-based image restoration,” Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 460–489, 2005.
• (6) A. Chambolle and J. Darbon, “On total variation minimization and surface evolution using parametric maximum flows,” International Journal of Computer Vision, vol. 84, no. 3, p. 288, 2009.
• (7) S. Kumar and M. Hebert, “Discriminative random fields: A discriminative framework for contextual interaction in classification,” in Proceedings of IEEE International Conference on Computer Vision.   IEEE, 2003, pp. 1150–1157.
• (8) N. Z. Shor, Minimization Methods for Non-differentiable Functions.   Springer Science & Business Media, 2012, vol. 3.
• (9) P. Stobbe and A. Krause, “Efficient minimization of decomposable submodular functions,” in Advances in Neural Information Processing Systems, 2010, pp. 2208–2216.
• (10) S. Fujishige, Submodular functions and optimization.   Elsevier, 2005, vol. 58.
• (11) S. Jegelka, F. Bach, and S. Sra, “Reflection methods for user-friendly submodular optimization,” in Advances in Neural Information Processing Systems, 2013, pp. 1313–1321.
• (12) R. Nishihara, S. Jegelka, and M. I. Jordan, “On the convergence rate of decomposable submodular function minimization,” in Advances in Neural Information Processing Systems, 2014, pp. 640–648.
• (13) A. Ene and H. Nguyen, “Random coordinate descent methods for minimizing decomposable submodular functions,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 787–795.
• (14) A. Ene, H. Nguyen, and L. A. Végh, “Decomposable submodular function minimization: discrete and continuous,” in Advances in Neural Information Processing Systems, 2017, pp. 2874–2884.
• (15) P. Li and O. Milenkovic, “Revisiting decomposable submodular function minimization with incidence relations,” arXiv preprint arXiv:1803.03851, 2018.
• (16) R. Johnson and T. Zhang, “On the effectiveness of laplacian normalization for graph semi-supervised learning,” Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1489–1517, 2007.
• (17) M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram, “The total variation on hypergraphs-learning on hypergraphs revisited,” in Advances in Neural Information Processing Systems, 2013, pp. 2427–2435.
• (18) C. Zhang, S. Hu, Z. G. Tang, and T. H. Chan, “Re-revisiting learning on hypergraphs: confidence interval and subgradient method,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 4026–4034.
• (19) A. Gammerman, V. Vovk, and V. Vapnik, “Learning by transduction,” in Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence.   Morgan Kaufmann Publishers Inc., 1998, pp. 148–155.
• (20) T. Joachims, “Transductive learning via spectral graph partitioning,” in Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 290–297.
• (21) X. Zhu, J. Lafferty, and Z. Ghahramani, “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions,” in ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, vol. 3, 2003.
• (22) P. Li and O. Milenkovic, “Inhomogeneous hypergraph clustering with applications,” in Advances in Neural Information Processing Systems, 2017, pp. 2305–2315.
• (23) L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
• (24) T.-H. H. Chan, A. Louis, Z. G. Tang, and C. Zhang, “Spectral properties of hypergraph laplacian and approximation algorithms,” Journal of the ACM (JACM), vol. 65, no. 3, p. 15, 2018.
• (25) T. Chan, Z. G. Tang, X. Wu, and C. Zhang, “Diffusion operator and spectral analysis for directed hypergraph laplacian,” arXiv preprint arXiv:1711.01560, 2017.
• (26) P. Li and O. Milenkovic, “Submodular hypergraphs: p-laplacians, cheeger inequalities and spectral clustering,” arXiv preprint arXiv:1803.03833, 2018.
• (27) D. F. Gleich, L.-H. Lim, and Y. Yu, “Multilinear pagerank,” SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 4, pp. 1507–1541, 2015.
• (28) A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.
• (29) M. Frank and P. Wolfe, “An algorithm for quadratic programming,” Naval Research Logistics, vol. 3, no. 1-2, pp. 95–110, 1956.
• (30) S. Fujishige and S. Isotani, “A submodular function minimization algorithm based on the minimum-norm base,” Pacific Journal of Optimization, vol. 7, no. 1, pp. 3–17, 2011.
• (31) D. Chakrabarty, P. Jain, and P. Kothari, “Provable submodular minimization using Wolfe’s algorithm,” in Advances in Neural Information Processing Systems, 2014, pp. 802–809.
• (32) D. Zhou, J. Huang, and B. Schölkopf, “Learning with hypergraphs: Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems, 2007, pp. 1601–1608.
• (33) I. Ekeland and R. Temam, Convex Analysis and Variational Problems.   Siam, 1999, vol. 28.

Supplement

Here, we provide detailed proofs of the main lemmas and theorems, and described implementation details for the numerical experiments. We also present the convergence analysis for the Alternating Projection algorithm for QDSFM and the conic Frank-Wolfe method for computing the projection onto the induced cone, as well as an exact projection algorithm when learning on directed/undirected hypergraphs.

## Appendix A Proofs for Dual Formulation

### a.1 Proof of Lemma 4.1

In all following derivation, we may exchange the order of minimization and maximization (i.e., ) due to Proposition 2.2 ekeland1999convex (). Plugging equation (5) into (3), we obtain

 minx∑r∈[R][fr(x)]2+∥x−a∥2W = minxmaxϕr≥0,yr∈ϕrBr∑r∈[R][⟨yr,x⟩−ϕ2r4]+∥x−a∥2W = maxϕr≥0,yr∈ϕrBrminx∑r∈[R][⟨yr,x⟩−ϕ2r4]+∥x−a∥2W = maxϕr≥0,yr∈ϕrBr−14∥∑r∈[R]yr−2Ws∥2W−1−14∑rϕ2r+∥s∥2W.

By eliminating some constants, one obtains the dual (6). Next, we prove that the problem (7) is equivalent to (6), as follows from removing . Note that (7) is equivalent to

 minϕr≥0,yr∈ϕrBr,λrmaxλ∑r∈[R][∥yr−λr√R∥2W−1+ϕ2r]+⟨λ,∑r∈[R]λr−2Ws⟩ = minϕr≥0,yr∈ϕrBrmaxλminλr∑r∈[R][∥yr−λr√R∥2W−1+ϕ2r]+⟨λ,∑r∈[R]λr−2Ws⟩ = minϕr≥0,yr∈ϕrBrmaxλ∑r∈[R][14∥λ∥2W+ϕ2r]+⟨λ,√R∑r∈[R](yr−12Wλ)−2Ws⟩ = minϕr≥0,yr∈ϕrBrmaxλ−R4∥λ∥2W+√R⟨λ,∑r∈[R]yr−2Ws⟩+∑r∈[R]ϕ2r = minϕr≥0,yr∈ϕrBr∥∑r∈[R]yr−2Ws∥2W−1+∑r∈[R]ϕ2r,

which is equivalent to (6).

## Appendix B Proofs for Linear Convergence of the RCD Algorithm

### b.1 Proof of Lemma 4.2

We start by recalling the following lemma from li2018revisiting () that characterizes the geometric structure of the product base polytope.

###### Lemma B.1 (li2018revisiting ()).

Assume that is a positive diagonal matrix. Let and let be in the base polytope of the submodular function . Then, there exists a point such that and .

To prove Lemma 4.2, Lemma B.1 cannot be used directly since are in different product base polytopes, and , respectively. However, the following lemma shows one can transform to lie in the same base polytopes that contains .

###### Lemma B.2.

For a given feasible point , and a nonnegative vector , one has

 ∥∑r∈[R]yr−s∥1+ρ2∥ϕ′−ϕ∥≥∥∑r∈[R]ϕ′rϕryr−s∥1.
###### Proof.

For all , let , and define a function that depends on ,

 h(ϕ)=∥∑r∈[R]yr−s∥1=∥∑r∈[R]ϕr~yr−s∥1.

For all and ,