Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint

# Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint

Han Zhao
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
han.zhao@cs.cmu.edu
&
Geoff Gordon
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
ggordon@cs.cmu.edu
###### Abstract

Symmetric nonnegative matrix factorization has found abundant applications in various domains by providing a symmetric low-rank decomposition of nonnegative matrices. In this paper we propose a Frank-Wolfe (FW) solver to optimize the symmetric nonnegative matrix factorization problem under a simplicial constraint, which has recently been proposed for probabilistic clustering. Compared with existing solutions, this algorithm is simple to implement, and has no hyperparameters to be tuned. Building on the recent advances of FW algorithms in nonconvex optimization, we prove an convergence rate to -approximate KKT points, via a tight bound on the curvature constant, which matches the best known result in unconstrained nonconvex setting using gradient methods. Numerical results demonstrate the effectiveness of our algorithm. As a side contribution, we construct a simple nonsmooth convex problem where the FW algorithm fails to converge to the optimum. This result raises an interesting question about necessary conditions of the success of the FW algorithm on convex problems.

Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint

Han Zhao Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213 han.zhao@cs.cmu.edu                        Geoff Gordon Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213 ggordon@cs.cmu.edu

## 1 Introduction

Nonnegative matrix factorization (NMF) has found various applications in data mining, natural language processing, and computer vision (Xu et al., 2003; Liu et al., 2003; Kuang et al., 2012), due to its ability to provide low rank approximations and interpretable decompositions. The decision version of NMF is known to be NP-hard (Vavasis, 2009), which also implies that its optimization problem is NP-hard. Recently, a variant of NMF where the input matrix is constrained to be symmetric has become popular for clustering (He et al., 2011; Kuang et al., 2012, 2015). The problem is known as symmetric NMF (SymNMF), and its goal is to minimize under the constraint that elementwise, where is a symmetric matrix of cluster affinities. Compared with NMF, SymNMF is applicable even when the algorithm does not have direct access to the data instances, but only their pairwise similarity scores. Note that in general the input matrix does not need to be nonnegative (Kuang et al., 2012). SymNMF has been successfully applied in many different settings and was shown to be competitive with standard clustering algorithms; see (Kuang et al., 2012, 2015) and the references therein for more details.

In this paper we investigate a constrained version of SymNMF where the input matrix is required to be both nonnegative and positive semidefinite. Furthermore, we require that is normalized such that each row of sums to 1. This problem has an interesting application in probabilistic clustering (Zhao et al., 2015) where the th row of can be interpreted as the probability that the th data point lies in each clusters. Formally, we are interested in the following optimization problem, which we name as simplicial SymNMF (SSymNMF):

 minimizeW 14||P−WWT||2F (1) subject to W∈Rn×k+,W1k=1n

where and is positive semidefinite. (1) was proposed as a formulation for probabilistic clustering (Zhao et al., 2015). The input matrix is interpreted as the co-cluster affinity matrix, i.e., entry corresponds to the degree to which we encourage data instances and to be in the same cluster. Each row of then corresponds to the probability distribution of instance being in different clusters. A similar simplicial constraint has been considered in NMF as well (Nguyen et al., 2013), where the goal is to seek a probabilistic part-based decomposition for clustering.

Previous approaches to solve SymNMF or SSymNMF use the penalty method to convert it into an unconstrained optimization problem and then solve it iteratively, using either first-order or second-order methods (Kuang et al., 2012; Zhao et al., 2015; Kuang et al., 2015). Such methods usually include two loops where the outer loop gradually increases the penalty coefficients and the inner loop finds a stationary point of each fixed penalized objective, hence they are often computationally expensive and slow to converge. In this paper we first give an equivalent geometric description of (1) and then propose a variant of the classic Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956), a.k.a. the conditional gradient method (Levitin and Polyak, 1966), to solve it. We also provide a non-asymptotic convergence guarantee of our algorithm under an affine invariant stationarity measure (defined in Sec. 2). More specifically, for a given approximation parameter , we show that the algorithm converges to an -approximate KKT point of (1) in iterations. This rate is analogous to the one derived by Nesterov (2013) for general unconstrained problems (potentially nonconvex) using the gradient descent method, where the measure of stationarity is given by the norm of the gradient. The rate has recently been shown to be optimal (Cartis et al., 2010) in the unconstrained setting for gradient methods, and it also matches the best known rate to a stationary point with (accelerated) projected gradient descent (Ghadimi and Lan, 2016; Ghadimi et al., 2016) in the constrained smooth nonconvex setting.

Contributions. We first give a generalized definition of the curvature constant (Jaggi, 2013; Lacoste-Julien et al., 2013; Lacoste-Julien, 2016) that works for both convex and nonconvex functions, and we prove a tight bound of it in (1). We then propose a convergence measure in terms of the duality gap and show that the gap is 0 iff KKT conditions are satisfied. Using these two tools, we propose a FW algorithm to solve (1) and show that it has a non-asymptotic convergence rate to KKT points. On the algorithmic side, we give a procedure that has the optimal linear time complexity and constant space complexity to implement the linear minimization oracle (LMO) in the FW algorithm. As a side contribution, we construct a piecewise linear example where the FW algorithm fails to converge to the optimum. Surprisingly, we can also show that the FW algorithm works if we slightly change the objective function, despite that the new function remains piecewise linear and has an unbounded curvature constant. These two examples then raise an interesting question w.r.t. the necessary condition of the success of the FW algorithm. At the end, we conduct several numerical experiments to demonstrate the efficiency of the proposed algorithm by comparing it with the penalty method and projected gradient descent.

## 2 Preliminary

### 2.1 SymNMF UNDER SIMPLICIAL CONSTRAINT

One way to understand (1) is through its clustering based explanation: the goal is to find a probabilistic clustering of all the instances such that the given co-cluster affinity for a pair of instances is close to the true probability that and reside in the same cluster:

 Pr(xi∼xj) =k∑h=1Pr(ci=cj=h)

where we use the notation to mean “ and reside in the same cluster”; is the cluster assignment of and denotes the th row vector of . Note that the second equation holds because of the assumption of i.i.d. generation process of instances and their cluster assignments.

As a first note, the optimal solution to (1) is not unique: for any permutation over , an equivalent solution can be constructed by , where is a permutation matrix specified by . This corresponds to an equivalence class of by label switching. Hence for any fixed , there are at least optimal solutions to (1). The uniqueness of the solution to (1) up to permutation is still an open problem. Huang et al. (2014) studied sufficient and necessary conditions for the uniqueness of SymNMF, but they are NP-hard to check in general.

Zhao et al. (2015) proposed a penalty method to transform (1) into an unconstrained problem and solve it via sequential minimization. Roughly speaking, the penalty method repeatedly solves an unconstrained problem, and enforces the constraints in (1) by gradually increasing the coefficients of the penalty terms. This process iterates until a solution is both feasible and a stopping criterion w.r.t. the objective function is met; see (Zhao et al., 2015, Algo. 1). The penalty method contains 6 different hyperparameters to be tuned, and it is not even clear whether it will converge to a KKT point of (1). To the best of our knowledge, no other methods has been proposed to solve (1). To solve SymNMF, Kuang et al. (2012) proposed a projected Newton method and Vandaele et al. (2016) developed a block coordinate descent method. However, due to the coupling of columns of introduced by the simplicial constraint, it is not clear how to extend these two algorithms to solve (1). On the other hand, the simplicial constraint in (1) restricts the feasible set to be compact, which makes it possible for us to apply the FW algorithm to solve it.

### 2.2 Frank-Wolfe ALGORITHM

The FW algorithm (Frank and Wolfe, 1956; Levitin and Polyak, 1966) is a popular first-order method to solve constrained convex optimization problems of the following form:

 minimizex∈Df(x) (2)

where is a convex and continuously differentiable function over the convex and compact domain . The FW method has recently attracted a surge of interest in machine learning due to the fact that it never requires us to project onto the constraint set and its ability to cheaply exploit structure in  (Clarkson, 2010; Jaggi, 2011, 2013; Lacoste-Julien et al., 2013; Lan, 2013). Compared with projected gradient descent, it provides arguably wider applicability since projection can often be computationally expensive (e.g., for the general ball), and in some case even computationally intractable (Collins et al., 2008). At each iteration, the FW algorithm finds a feasible search corner by minimizing a linear approximation at the current iterate over :

 w(x)\vcentcolon=min^s∈Df(x)+∇f(x)T(^s−x) (3)

The linear minimization oracle (LMO) at is defined as:

 s=LMO(x)\vcentcolon=argmin^s∈Df(x)+∇f(x)T(^s−x) (4)

Given LMO, the next iterate is then updated as a convex combination of and the current iterate . If the FW algorithm starts with a corner as the initial point, then this property implies that at the -th iteration, the current iterate is a convex combination of at most corners. The difference between and is known as the Frank-Wolfe gap (FW-gap):

 g(x)\vcentcolon=f(x)−w(x)=max^s∈D∇f(x)T(x−^s) (5)

which turns out to be a special case of the general Fenchel duality gap when we transform (2) into an unconstrained problem by adding an indicator function over into the objective function (Lacoste-Julien et al., 2013, Appendix D). Due to the convexity of , we have the following inequality: , where is the globally optimal value of . Specifically, , and as a result can be used as an upper bound for the optimality gap: , so that is a certificate for the approximation quality of the current solution. Furthermore, the duality gap can be computed essentially for free in each iteration: .

It is well known that for smooth convex optimization problems the FW algorithm converges to an -approximate solution in iterations (Frank and Wolfe, 1956; Dunn and Harshbarger, 1978). This result has recently been generalized to the setting where the LMO is solved only approximately (Clarkson, 2010; Jaggi, 2013). The analysis of convergence depends on a crucial concept known as the curvature constant defined for a convex function :

 Cf\vcentcolon=supx,s∈D,γ∈(0,1],y=x+γ(s−x)2γ2(f(y)−f(x)−∇f(x)T(y−x)) (6)

The curvature constant measures the relative deviation of a convex function from its linear approximation. It is clear that . Furthermore, the definition of only depends on the inner product of the underlying space, which makes it affine invariant. In fact, the curvature constant, the FW algorithm, and its convergence analysis are all affine invariant (Lacoste-Julien et al., 2013). Later we shall generalize the curvature constant to a function that is not necessarily convex and state our result in terms of the generalized definition. The above definition of the curvature constant still works for nonconvex functions, but for a concave function , , which loses its geometric interpretation as measuring the curvature of .

To proceed our discussion, we first establish some standard terminologies used in this paper. A continuously differentiable function is called -smooth w.r.t. the norm if is -Lipschitz continuous w.r.t. the norm : . Throughout this paper, we will use to mean the Euclidean norm, i.e., the norm for vectors and the Frobenius norm for matrices if not explicitly specified. It is standard to assume to be -smooth in the convergence analysis of the FW algorithm (Clarkson, 2010; Jaggi, 2013). As we will see below, the smoothness of also implies the boundedness of on a compact set.

## 3 Frank-Wolfe FOR SymNMF UNDER SIMPLICIAL CONSTRAINT

### 3.1 A Geometric Perspective

In this section we complement our discussion of (1) in Sec. 2 with a geometric interpretation, which allows us to make a connection between the decision version of (1) to the well known problem of completely positive matrix factorization (Berman and Shaked-Monderer, 2003; Dickinson, 2013; Dickinson and Gijben, 2014). The decision version of the optimization problem in (1) is formulated as follows:

###### Definition 1 (SSymNMF).

Given a matrix , can it be factorized as for some integer such that and ?

Clearly, for , we can decompose it as , where , . Let be the th row vector of ; then is the Gram matrix of a set of dimensional vectors . Similarly, can be understood as the Gram matrix of a set of dimensional vectors , where is the th row vector of . The simplex constraint in (1) further restricts each , i.e., resides in the dimensional probability simplex. Hence equivalently, SSymNMF asks the following question:

###### Definition 2.

Given a set of instances , does there exist an integer and an embedding , such that inner product is preserved under , i.e., ?

An affirmative answer to SSymNMF will give a certificate of the existence of such embedding : . The goal of (1) can thus be understood as follows: find an embedding into the probability simplex such that the discrepancy of inner products between the image space and the original space is minimized. The fact that inner product is preserved immediately implies that distances between every pair of instances are also preserved. If , such an embedding is also known as an isometry. In this case is unitary. Note in the above definition of SSymNMF we do not restrict that , and does not have to be linear.

SSymNMF is closely connected to the strong membership problem for completely positive matrices (Berman and Plemmons, 1994; Berman and Shaked-Monderer, 2003). A completely positive matrix is a matrix that can be factorized as where for some . The set of completely positive matrices forms a convex cone, and is known as the completely positive cone (CP cone). From this definition we can see that the decision version of SymNMF corresponds to the strong membership problem of the CP cone, which has recently been shown by Dickinson and Gijben (2014) to be NP-hard. Geometrically, the strong membership problem for CP matrices asks the following question:

###### Definition 3 (Smemcp).

Given a set of instances , does there exist an integer and an embedding , such that inner product is preserved under , i.e., ?

Note that the only difference between SSymNMF and SMEMCP lies in the range of the image space: the former asks for an embedding in while the latter only asks for embedding to reside in . SSymNMF is therefore conjectured to be NP-hard as well, but this assertion has not been formally proved yet. A simple reduction from SMEMCP to SSymNMF does not work: a “yes” answer to the former does not imply a “yes” answer to the latter.

### 3.2 Algorithm

We list the pseudocode of the FW method in Alg. 1 and discuss how Alg. 1 can be efficiently applied and implemented to solve SSymNMF. We start by deriving the gradient of :

 ∇f(W)=(WWT−P)W∈Rn×k (7)

The normalization constraint keeps the feasible set decomposable (even though the objective function is not decomposable): it can be equivalently represented as the product of probability simplices . Hence at the -th iteration of the algorithm we solve the following linear optimization problem over :

 minimizeS tr(ST∇f(W(t))) (8) subject to S∈ΠnΔk−1

Because of the special structure of the constraint set in (8), given , we can efficiently compute the two key quantities and in Alg. 1 in time and space. The pseudocode is listed in Alg. 2. The key observation that allows us to achieve this efficient implementation is that at each iteration , LMO is guaranteed to be a sparse matrix that contains exactly one 1 in each row. The time complexity of Alg. 2 is , which can be further reduced to by additional space to store the index of the nonzero element at each row of LMO. Alg. 1, together with Alg. 2 as a sub-procedure, is very efficient while at the same time being simple to implement. Furthermore, it does not have any hyperparameter to be tuned: this is in sharp contrast with the penalty method.

### 3.3 Convergence Analysis

In this section we provide a non-asymptotic convergence rate for the FW algorithm for solving (1) and derive a tight bound for its curvature constant. Our analysis is based on the recent work (Lacoste-Julien, 2016), where we redefine the curvature constant so that the new definition works in both convex and nonconvex settings. Due to space limit, we only provide partial proofs for theorems and lemmas derived in this paper, and refer readers to supplementary material for all detailed proofs.

When applied to smooth convex constrained optimization problems, the FW algorithm is known to converge to an -approximate solution in iterations. However the convergence of global optimality is usually unrealistic to hope for in nonconvex optimization, where even checking local optimality itself can be computationally intractable (Murty and Kabadi, 1987). Clearly, to talk about convergence, we need to first establish a convergence criterion. For unconstrained nonconvex problems, the norm of the gradient has been used to measure convergence (Ghadimi et al., 2016; Ghadimi and Lan, 2016), since means every limit point of the sequence is a stationary point. But such a convergence criterion is not appropriate for constrained problems because a stationary point can lie on the boundary of the feasible region while not having a zero gradient. To address this issue, we will use the FW-gap as a measure of convergence. Note that the gap works as an optimality gap only if the original problem is convex. To see why this is also a good measure of convergence for constrained nonconvex problems, we first prove the following theorem: \thmt@toks\thmt@toks Let be a differentiable function and be a convex compact domain. Define . Then and iff is a Karush–Kuhn–Tucker (KKT) point.

###### Proof.

To see , we have:

Reformulate the original constrained problem in the following way:

 minimizexf(x) subject toID(x)≤0 (9)

where is the indicator function of which takes value 0 iff otherwise . Define as the normal cone at in a convex set :

 ND(x)\vcentcolon={z∣zTx≥zTy,∀y∈D}

and realize the fact that the subdifferential of the indicator function when is precisely the normal cone at , i.e., , we have:

 g(x)=0 ⇔ maxy∈D∇f(x)T(x−y)=0 ⇔ ∀y∈D,∇f(x)T(y−x)≥0 ⇔ −∇f(x)∈ND(x)

Note that the normal cone for a convex set is a convex cone, which implies that , s.t.,

 ∇f(x)+λ∂ ID(x)=0 ⇔ ∇xL(x,λ)=0

where is the Lagrangian of (3.3). By construction, satisfies the dual feasibility condition and satisfies the primal feasibility condition, which also means the complementary slackness is satisfied, i.e., . It follows that iff is a KKT point of (3.3). ∎

Remark. Note that in Thm. 3.3 we do not assume to be convex. In fact we can also relax the differentiability of , as long as the subgradient exists at every point of . The proof relies on the fact that implies is in the normal cone at . Thm. 3.3 justifies the use of as a convergence measure: if , then by the continuity of , every limit point of is a KKT point of . Since being a KKT point is also a sufficient condition for optimality when is convex, Thm. 3.3 also recovers the case where is used as convergence measure for convex problems.

For a continuously differentiable function , we now extend the definition of curvature constant as follows:

 ¯Cf\vcentcolon=supx,s∈D,γ∈(0,1],y=x+γ(s−x)2γ2∣∣f(y)−f(x)−∇f(x)T(y−x)∣∣ (10)

Clearly and the new definition reduces to when is a convex function. In general we have for any function . Again, measures the relative deviation of from its linear approximation, and is still affine invariant. The difference of from the original becomes clear when is concave: in this case , but and . On the downside, the fact that means our asymptotic bound is of constant times larger than the original one proved by Lacoste-Julien (2016), but they share the same asymptotic rate in terms of the given precision parameter . Finally, iff is affine. As in (Jaggi, 2013; Lacoste-Julien et al., 2013), for a smooth function with Lipschitz constant over compact set , we can bound in terms of : \thmt@toks\thmt@toks Let be a -smooth function over a convex compact domain , and define . Then .

###### Lemma 1.

The proof of Lemma 3.3 does not require to be convex. Furthermore, does not need to be second-order differentiable — being smooth is sufficient. We proceed to derive a convergence bound for Alg. 1 using our new , which better reflects the geometric nonlinearity of nonconvex functions. The main idea of the proof is to bound the decrease of the gap function by minimizing a quadratic function iteratively. \thmt@toks\thmt@toks Consider the problem (2) where is a continuously differentiable function that is potentially nonconvex, but has a finite curvature constant as defined by (10) over the compact convex domain . Consider running Frank-Wolfe (Algo. 1), then the minimal FW gap encountered by the iterates during the algorithm after iterations satisfies:

 ~gT≤max{2h0¯Cf,√2h0¯Cf}√T+1,∀T≥0 (11)
\thmt@toks\thmt@toks

where is the initial global suboptimality. It thus takes at most iterations to find an approximate KKT point with gap smaller than .

###### Theorem 2.

We comment that the convergence rate is the same as the one provided by Lacoste-Julien (2016) using the original curvature constant for smooth nonconvex functions, and it is also analogous to the ones derived for gradient descent for unconstrained smooth problems and (accelerated) projected gradient descent for constrained smooth problems. The convergence rate of Alg. 1 depends on the curvature constant of over . We now bound the smoothness constant and the diameter of the feasible set in (1).

Bound on smoothness constant. The objective function in (1) is second-order differentiable, hence we can bound the smoothness constant by bounding the spectral norm of the Hessian instead:

###### Lemma 2 (Nesterov (2013)).

Let be twice differentiable. If for all in the domain, then is -smooth.

Note that (7) is a matrix function of a matrix variable, whose Hessian is a matrix of order . Although one can compute all the elements of the Hessian by computing all the partial derivatives separately, this approach is tedious and may hide the structure of the Hessian matrix. Instead we apply matrix differential calculus (Magnus and Neudecker, 1985; Magnus, 2010) to derive the Hessian: \thmt@toks\thmt@toks Let and define . Then:

###### Lemma 3.
(12)

where is a commutation matrix such that . The derivation of the above lemma uses the matrix differential calculus with basic properties of tensors and commutation matrices. Before we proceed, we first present a lemma that will be useful to bound the spectral norm of the above Hessian matrix: \thmt@toks\thmt@toks .

###### Lemma 4.

We are now ready to bound the spectral norm of the Hessian and use it to bound the smoothness constant of . \thmt@toks\thmt@toks Let . is -smooth on .

###### Lemma 5.

The above lemma follows from Lemma 2 where we use Lemma 3.3 to help bound the spectral norm of the Hessian matrix in (12).

Bound on diameter of . The following lemma can be easily shown: \thmt@toks\thmt@toks Let . Then with respect to the Frobenius norm.

###### Lemma 6.

Combining Lemma 3.3 with Lemma 3.3 and assuming is a constant that does not depend on , we immediately have by Lemma 3.3. A natural question to ask is: can we get better dependency on in the upper bound for given the special structure that ? The answer is negative, as we can prove the following lower bound on the Hessian: \thmt@toks\thmt@toks .

###### Lemma 7.

Using Lemma 3.3, we can prove a tight bound on our curvature constant : \thmt@toks\thmt@toks The curvature constant for on satisfies:

 2n(n/k2−c)≤¯Cf≤2n(3n+c)

where . Specifically, we have .

###### Proof.

The upper bound part is clear by combining Lemma 3.3 and Lemma 3.3. We only need to show the lower bound. Since is twice-differentiable, we have:

 ¯Cf \vcentcolon=supx,s∈D,γ∈(0,1],y=x+γ(s−x)2γ2∣∣f(y)−f(x)−∇f(x)T(y−x)∣∣ =supx,s∈D,γ∈(0,1],y=x+γ(s−x)2γ212|(y−x)T∇2f(ξ)(y−x)| ≥supx,s∈D,γ∈(0,1],y=x+γ(s−x)1γ2||y−x||22 ⋅infξ∈D||∇2f(ξ)||2 ≥1γ2γ2diam2(D)(nk2−c)=2n(nk2−c)

The second equality is due to the mean-value theorem, and the third inequality holds by the definition of . The last inequality follows from Lemma 3.3 and 3.3. ∎

Combining all the analysis above and using Alg. 2 to implement the linear minimization oracle in Alg. 1, we can bound the time complexity of the FW algorithm to solve (1):

###### Corollary 1.

The FW algorithm (Alg. 1) achieves an -approximate KKT point of (1) in time.

###### Proof.

In each iteration Alg. 1 takes time to compute the gap function as well as the next iterate. Based on Thm. 3.3, the iteration complexity to achieve an -approximate KKT point is . The result follows from Thm. 3.3 showing that . ∎

## 4 A FAILURE CASE OF Frank-Wolfe ON A NONSMOOTH CONVEX PROBLEM

For convex problems, the theoretical convergence guarantee of Frank-Wolfe algorithm and its variants depends crucially on the smoothness of the objective function: (Freund and Grigas, 2016; Garber and Hazan, 2015; Harchaoui et al., 2015; Nesterov et al., 2015) require the gradient of the objective function to be Lipschitz continuous or Hölder continuous; the analysis in (Jaggi, 2013) requires a finite curvature constant . The Lipschitz continuous gradient condition is sufficient for the convergence analysis, but not necessary: Odor et al. (2016) shows that Frank-Wolfe works for a Poisson phase retrieval problem, where the gradient of the objective is not Lipschitz continuous. As we discuss in the last section, Lipschitz continuous gradient implies a finite curvature constant. Hence an interesting question to ask is: does there exist a constrained convex problem with unbounded curvature constant such that the Frank-Wolfe algorithm can find its optimal solution? Surprisingly, the answer to the above question is affirmative. But before we give the example, we first construct an example where the FW algorithm fails when the objective function is convex but nonsmooth.

We first construct a very simple example where the objective function is a piecewise linear and the constraint set is a polytope. We will show that when applied to this simple problem, the FW algorithm, along with its line search variant and its fully corrective variant, do not even converge to the optimal solution. In fact, as we will see shortly, the limit point of the sequence can be arbitrarily far away from the global optimum. Consider the following convex, constrained optimization problem:

 minimizex1,x2 max{5x1+x2,−5x1+x2} (13) subject to x2≤3,3x1+x2≥0,−3x1+x2≥0

We plot the objective function of this example in the left figure of Fig. 1. The unique global optimum for this problem is given by with . The feasible set contains three vertices: , and . If we apply the FW algorithm to this problem, it is straightforward to verify that the is given by:

 LMO(x)=⎧⎨⎩(−1,3)x1>0(1,3)x1<0{(−1,3),(1,3),(0,0)}x1=0

Note that when , the function is not differentiable, and the subdifferential is given by . In this case the FW algorithm chooses arbitrary subgradient from the subdifferential and computes the corresponding LMO. From the fundamental theorem of linear programming, it is easy to see that when , can be any of the three corners depending on the choice of subgradient at . Now suppose the FW algorithm stops in iterations. For any initial point where , the final point output by the FW algorithm will be a convex combination of , , and . Let be the sequence of step sizes chosen by the FW algorithm. Then we can easily check that as . Note that we can readily change this example by extending so that the distance between and the optimum becomes arbitrarily large. Furthermore, both the line search and the fully-corrective variants fail since the vertices picked by the algorithm remains the same: . Finally, for any regular probability distribution that is absolutely continuous w.r.t. the Lebesgue measure, with probability 1 the initial points sampled from the distribution will converge to suboptimal solutions. As a comparison, it can be shown that the subgradient method works for this problem since the function itself is Lipschitz continuous (Nesterov, 2013).

Pick , and , where , and plug them in the definition of the curvature constant . We have:

 Cf ≥limε→0,γ>ε2γ2(f(y)−f(x)−∇f(x)T(y−x)) =limγ→0+20γ=∞

i.e., the curvature constant of this piecewise linear function is unbounded. The problem for this failure case of FW lies in the fact that the curvature constant is infinity.

On the other hand, we can also show that FW works even when by slightly changing the objective function while keeping the constraint set:

 minimizex1,x2 max{12x1+x2,−12x1+x2} (14) subject to x2≤3,3x1+x2≥0,−3x1+x2≥0

The objective function of the second example is shown in the right figure of Fig. 1. Still, the unique global optimum for the new problem is given by with , but now the is:

 LMO(x)=(0,0),∀x∈D

It is not hard to see that FW converges to the global optimum, and the curvature constant as well for this new problem. Combining with example (4), we can see that is not a necessary condition for the success of FW algorithm on convex problems, either. Piecewise linear functions form a rich class of objectives that are frequently encountered in practice, while depending on the structure of the problem, the FW algorithm may or may not work for them. This thus raises an interesting problem: can we develop a necessary condition for the success of the FW algorithm on convex problems? Another interesting question is, can we develop sufficient conditions for piecewise linear functions under which the FW algorithm converges to global optimum?

## 5 Numerical Results

We evaluate the effectiveness of the FW algorithm (Alg. 1) in solving (1) by comparing it with the penalty method (Zhao et al., 2015) and the projected gradient descent method (PGD) on 4 datasets (Table 1). These datasets are standard for clustering analysis: two of them are used in (Zhao et al., 2015), and we add two more datasets with various sizes to make a more comprehensive evaluation. The instances in each dataset are associated with true class labels. For each data set, we use the number of classes as the true number of clusters.

Given a data matrix , we use a Gaussian kernel with fixed bandwidth 1.0 to construct the co-cluster matrix as . For each dataset, we use its number of clusters as the rank of the decomposition, i.e., . We implement the penalty method based on (Zhao et al., 2015), where we set the maximum number of inner loops to be 50, and choose the step factor for the coefficients of the two penalty terms to be 2. In each iteration of PGD, we use backtracking line search to choose the step size. For all three algorithms, the stop conditions are specified as follows: if the difference of function values in two consecutive iterations is smaller than the fixed gap , or the number of (outer) iterations exceed 50, then the algorithms will stop. Also, for each dataset, all three algorithms share the same initial point.

We plot the convergence speed to local optimum of these three algorithms in Fig. 2. Clearly, the penalty method is orders of magnitude slower than both PGD and the FW algorithm, and it usually converges to a worse solution. On the other hand, although the FW algorithm tends to have more iterations before it converges, due to its cheap computation in each iteration, it consistently takes less time than PGD. Another distinction between PGD and FW is that PGD, when implemented with backtracking line search, is a monotone descent method, while FW is not. For a better visualization to compare between the PGD and the FW algorithms, we omit the penalty method in Fig. 3, and draw the log-gap plot of the four datasets. We can confirm from Fig. 3 that both PGD and the FW algorithm have roughly the same order of convergence speed for solving (1), which is consistent with the theoretical result proved in the previous section (Thm. 3.3). However, the FW algorithm often converges faster than the PGD method, in terms of the gap between the objective function value and local optimum.

## 6 Conclusion

We propose a FW algorithm to solve the SymNMF problem under a simplicial constraint. Compared with existing solutions, the proposed algorithm enjoys a non-asymptotic convergence guarantee to KKT points, is simple to implement, contains no hyperparameter to be tuned, and is also demonstrated to be much more efficient in practice. Theoretically, we establish a close connection of this problem to the famous completely positive matrix factorization by providing an equivalent geometric description. We also derive a tight bound on the curvature constant of this problem. As a side contribution, we give a pair of nonsmooth convex examples where the FW algorithm converges or fails to converge to its optimum. This result raises an interesting question w.r.t. the necessary condition of the success of the FW algorithm.

#### Acknowledgements

HZ would like to thank Simon Lacoste-Julien for his insightful comments and helpful discussions. HZ and GG gratefully acknowledge support from ONR, award number N000141512365.

## References

• Berman and Plemmons [1994] A. Berman and R. J. Plemmons. Nonnegative matrices in the mathematical sciences. SIAM, 1994.
• Berman and Shaked-Monderer [2003] A. Berman and N. Shaked-Monderer. Completely positive matrices. World Scientific, 2003.
• Cartis et al. [2010] C. Cartis, N. I. Gould, and P. L. Toint. On the complexity of steepest descent, newton’s and regularized newton’s methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
• Clarkson [2010] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
• Collins et al. [2008] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin markov networks. Journal of Machine Learning Research, 9(Aug):1775–1822, 2008.
• Dickinson and Gijben [2014] P. J. Dickinson and L. Gijben. On the computational complexity of membership problems for the completely positive cone and its dual. Computational optimization and applications, 57(2):403–415, 2014.
• Dickinson [2013] P. J. C. Dickinson. The copositive cone, the completely positive cone and their generalisations. Citeseer, 2013.
• Dunn and Harshbarger [1978] J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
• Frank and Wolfe [1956] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
• Freund and Grigas [2016] R. M. Freund and P. Grigas. New analysis and results for the frank–wolfe method. Mathematical Programming, 155(1-2):199–230, 2016.
• Garber and Hazan [2015] D. Garber and E. Hazan. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. In ICML, pages 541–549, 2015.
• Ghadimi and Lan [2016] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
• Ghadimi et al. [2016] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
• Harchaoui et al. [2015] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.
• He et al. [2011] Z. He, S. Xie, R. Zdunek, G. Zhou, and A. Cichocki. Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering. IEEE Transactions on Neural Networks, 22(12):2117–2131, 2011.
• Huang et al. [2014] K. Huang, N. D. Sidiropoulos, and A. Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1):211–224, 2014.
• Jaggi [2011] M. Jaggi. Sparse convex optimization methods for machine learning. PhD thesis, ETH Zürich, 2011.
• Jaggi [2013] M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML, pages 427–435, 2013.
• Kuang et al. [2012] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In SDM, pages 106–117. SIAM, 2012.
• Kuang et al. [2015] D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3):545–574, 2015.
• Lacoste-Julien [2016] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
• Lacoste-Julien et al. [2013] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. In ICML, pages 53–61, 2013.
• Lan [2013] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550, 2013.
• Levitin and Polyak [1966] E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1–50, 1966.
• Liu et al. [2003] W. Liu, N. Zheng, and X. Lu. Non-negative matrix factorization for visual coding. In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on, volume 3, pages III–293. IEEE, 2003.
• Magnus [2010] J. R. Magnus. On the concept of matrix derivative. Journal of Multivariate Analysis, 101(9):2200–2206, 2010.
• Magnus and Neudecker [1985] J. R. Magnus and H. Neudecker. Matrix differential calculus with applications to simple, hadamard, and kronecker products. Journal of Mathematical Psychology, 29(4):474–492, 1985.
• Murty and Kabadi [1987] K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.
• Nesterov [2013] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• Nesterov et al. [2015] Y. Nesterov et al. Complexity bounds for primal-dual methods minimizing the model of objective function. Center for Operations Research and Econometrics, CORE Discussion Paper, 2015.
• Nguyen et al. [2013] D. K. Nguyen, K. Than, and T. B. Ho. Simplicial nonnegative matrix factorization. In Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013 IEEE RIVF International Conference on, pages 47–52. IEEE, 2013.
• Odor et al. [2016] G. Odor, Y.-H. Li, A. Yurtsever, Y.-P. Hsieh, Q. Tran-Dinh, M. El Halabi, and V. Cevher. Frank-Wolfe works for non-Lipschitz continuous gradient objectives: Scalable poisson phase retrieval. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 6230–6234. IEEE, 2016.
• Vandaele et al. [2016] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon. Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.
• Vavasis [2009] S. A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.
• Xu et al. [2003] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267–273. ACM, 2003.
• Zhao et al. [2015] H. Zhao, P. Poupart, Y. Zhang, and M. Lysy. Sof: Soft-cluster matrix factorization for probabilistic clustering. In AAAI, pages 3188–3195, 2015.

## Appendix A Missing Proofs

###### Proof.

Let and . The smoothness of implies that is continuously differentiable, hence we have:

 =∣∣f(y)−f(x)−∇f(x)T(y−x)∣∣ =∣∣∫10(∇f(x+t(y−x))−∇f(x))T(y−x) dt∣∣ (Mean-value theorem) (Triangle inequality) ≤∫10||∇f(x+t(y−x))−∇f(x)||⋅||y−x|| dt (Cauchy-Schwarz inequality) ≤∫10tLγ2||s−x||2 dt≤Lγ22diam2(D) (Smoothness assumption of f)

It immediately follows that

 ¯Cf≤2γ2Lγ22diam2(D)=diam2(D)L

###### Proof.

Let , where is the update direction found by the LMO in Alg. 1. Using the definition of , we have:

 f(y) =f(y)−f(x)−γ∇f(x)Td+f(x)+γ∇f(x)Td ≤f(x)+γ∇f(x)Td+γ22¯Cf

Now using the definition of the FW gap and for , we get:

 f(y)≤f(x)−γg(x)+γ22¯Cf,∀γ∈(0,1] (15)

Depending on whether or , the R.H.S. of (15) is a either a quadratic function with positive second order coefficient or an affine function. In the first case, the optimal that minimizes the R.H.S. is . In the second case, . Combining the constraint that , we have . Thus we obtain:

 f(y)≤f(x)−min{g2(x)2C,(g(x)−C2)Ig(x)>C} (16)

(16) holds for each iteration in Alg. 1. A cascading sum of (16) through iteration step to shows that:

 f(x(T+1))≤f(x(0))−T∑t=0min{g2(x(t))2C,(g(x(t))−C2)Ig(x(t))>C} (17)

Define be the minimal FW gap in iterations. Then we can further bound inequality (17) as:

 f(x(T+1))≤f(x(0))−(T+1)min{~g2T2C,(~gT−C2)I~gT>C} (18)

We discuss two subcases depending on whether or not. The main idea is to get an upper bound on by showing that cannot be too large, otherwise the R.H.S. of (18) can be smaller than the global minimum of , which is a contradiction. For the ease of notation, define , i.e., the initial gap to the global minimum of .

Case I. If and , from (18), then:

 0≤f(x(T+1))−minx∈Df(x)≤f(x(0))−minx∈Df(x)−(T+1)(~gT−C2)=h0−(T+1)(~gT−C2)

which implies

 C<~gT≤h0T+1+C2⇒~gT≤2h0CT+1=O(1/T)

On the other hand, solving the following inequality:

 C−C2≤~gT−C2≤~g2T2C≤4h20C2(T+1)212C

we get

 T+1≤2h0

This means that decreases in rate only for at most the first iterations.

Case II. If or . Similarly, from (18), we have:

 0≤f(x(T+1))−minx∈Df(x)≤f(x(0))−minx∈Df(x)−(T+1)~g2T2C=h0−(T+1)~g2T2C

which yields

 ~gT≤√2h0CT+1

Combining the two cases together, we get if ; otherwise . Note that for , , thus we can further simplify the upper bound of as:

 ~gT≤max{2h0C,√2h0C}√T+1

###### Proof.

Using the theory of matrix differential calculus, the Hessian of a matrix-valued matrix function is defined as:

 ∇2f(W)\vcentcolon=∂vec∇f(W)∂vecW

Using the differential notation, we can compute the differential of as:

 d∇f(W)=d(WWT−P)W=(dW)WTW+W(dW)TW+WWTdW−PdW

Vectorize both sides of the above equation and make use of the identity that for with appropriate shapes, we get:

 vecd ∇f(