Inexact Block Coordinate Descent Methods For Symmetric Nonnegative Matrix Factorization
Abstract
Symmetric nonnegative matrix factorization (SNMF) is equivalent to computing a symmetric nonnegative low-rank approximation of a data similarity matrix. It inherits the good data interpretability of the well-known nonnegative matrix factorization technique and has a better ability to cluster nonlinearly separable data. In this paper, we focus on the algorithmic aspect of the SNMF problem and propose simple inexact block coordinate descent methods to address the problem, leading to both serial and parallel algorithms. The proposed algorithms have guaranteed convergence to stationary solutions and can efficiently handle large-scale and/or sparse SNMF problems. Extensive simulations verify the effectiveness of the proposed algorithms compared to recent state-of-the-art algorithms.
Symmetric nonnegative matrix factorization, block coordinate descent, block successive upper-bounding minimization, parallel algorithm, stationary convergence.
I Introduction
Clustering is the task of grouping a set of data points into different clusters according to some measure of data similarity. It is a common technique for statistical data analysis and has been widely used in fields such as machine learning, pattern recognition, and data compression. In some applications, clustering is performed on data that are inherently nonnegative. This has motivated great interest in applying nonnegative matrix factorization (NMF) to clustering. NMF has been shown to be very effective for clustering linearly separable data because of its ability to automatically extract sparse and easily interpretable factors[1]. As a close relative (or a variant) of NMF, symmetric NMF (SNMF) is more directly related to clustering problems: it can even be viewed as a relaxed version of two classical clustering methods, k-means clustering and spectral clustering[2]. It inherits the good interpretability of NMF and has received considerable interest due to its better ability to cluster nonlinearly separable data given a data similarity matrix[2, 12, 3, 5, 4, 13, 6]. This paper focuses on the algorithmic aspect of SNMF.
The basic SNMF problem is described as follows. Given $n$ data points, a similarity matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be generated by evaluating the similarity between any two data points in terms of an appropriate similarity metric. The SNMF problem of determining $r$ clusters is to find a low-rank nonnegative factorization matrix $\mathbf{X} \in \mathbb{R}^{n \times r}$ (with $r \ll n$ generally) such that $\mathbf{A} \approx \mathbf{X}\mathbf{X}^T$. Using the Frobenius norm, the basic SNMF problem is often formulated as follows
$\min_{\mathbf{X} \ge \mathbf{0}} \; \|\mathbf{A} - \mathbf{X}\mathbf{X}^T\|_F^2 \qquad (1)$
where the nonnegativity constraint $\mathbf{X} \ge \mathbf{0}$ is a componentwise inequality. Such a model often leads to sparse factors, which are better interpretable for problems such as image or document analysis.^{1} (^{1}With the requirement of nonnegativity on $\mathbf{X}$, the index of the largest entry of the $i$th row of $\mathbf{X}$ can simply be taken as the cluster label of the $i$th data point. This greatly facilitates clustering once the symmetric factorization of $\mathbf{A}$ is obtained.) Problem (1) is equivalent to the so-called completely positive matrix problem, which asks whether $\mathbf{A}$ can be exactly factorized as $\mathbf{A} = \mathbf{X}\mathbf{X}^T$ with $\mathbf{X} \ge \mathbf{0}$. That problem originated in the study of inequality theory and quadratic forms[7, 8] and plays an important role in discrete optimization[9]. Note that deciding whether a matrix is completely positive has recently been shown to be NP-hard[10]. As such, problem (1) is generally NP-hard, and efficient numerical algorithms are thus desired to obtain high-quality, though not necessarily globally optimal, solutions.
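To make the setup concrete, the short Python sketch below builds a similarity matrix from raw data points with a Gaussian kernel (one common similarity metric, which also reappears in footnote 3 of Section II-B), evaluates the objective of problem (1), and extracts cluster labels as described in footnote 1. The function names, the kernel width sigma, and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_similarity(P, sigma=1.0):
    """Similarity matrix A with A_ij = exp(-||p_i - p_j||^2 / sigma^2),
    built from the rows of the data matrix P (one data point per row)."""
    sq_dists = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def snmf_objective(A, X):
    """Objective of problem (1): ||A - X X^T||_F^2."""
    return np.sum((A - X @ X.T) ** 2)

def cluster_labels(X):
    """Footnote 1: the index of the largest entry in row i of the
    nonnegative factor X serves as the cluster label of data point i."""
    return np.argmax(X, axis=1)

# toy usage: n = 6 points in 2-D forming two well-separated groups, r = 2
rng = np.random.default_rng(0)
P = np.vstack([rng.random((3, 2)), rng.random((3, 2)) + 5.0])
A = gaussian_similarity(P, sigma=2.0)
X = rng.random((6, 2))  # a (random) candidate nonnegative factor
print(snmf_objective(A, X), cluster_labels(X))
```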
So far, there has been limited algorithmic work on SNMF compared to general NMF. Following the popular multiplicative update rule proposed for the NMF problem[1], some pioneering works[3, 5, 4] on SNMF proposed various modified multiplicative update rules for SNMF. Multiplicative update rules are usually simple to implement but require the similarity matrix $\mathbf{A}$ to be at least nonnegative to ensure the nonnegativity of $\mathbf{X}$ at each iteration. For the case of a positive definite and completely positive similarity matrix $\mathbf{A}$, the authors in [6] developed three faster parallel multiplicative update algorithms, namely the basic multiplicative update algorithm and two accelerated variants (the $\alpha$-SNMF and $\beta$-SNMF algorithms), all converging to stationary solutions of the SNMF problem. It was numerically shown that the latter two outperform the first one as well as other previous multiplicative-update-based SNMF algorithms. It should be stressed that all the above multiplicative-update-based SNMF algorithms implicitly assume a positive $\mathbf{X}$ at each iteration to derive the corresponding auxiliary functions[3, 5, 4, 6] and require arithmetic division in the updates. Consequently, they may not be numerically stable when some entries of $\mathbf{X}$ reach zero during the iterations, and in some cases they cannot even guarantee convergence to stationary solutions.
While the above algorithms require the similarity matrix to be nonnegative, the similarity matrix in graph-based clustering may be neither nonnegative nor positive definite (e.g., the matrix could have negative entries under some definitions of the kernel matrices used in semisupervised clustering [11]). Hence, the aforementioned algorithms do not apply to the SNMF problem with a general similarity matrix $\mathbf{A}$. In [2], the authors proposed two numerical optimization methods to find stationary solutions of the general SNMF problem. The first is a Newton-like algorithm that uses partial second-order information (e.g., a reduced Hessian or its variants) to reduce the computational complexity of determining the search direction, while the second is the alternating nonnegative least squares (ANLS) algorithm, in which the SNMF problem is split into two classes of nonnegative least-squares subproblems by a penalty method coupled with a two-block coordinate descent method. Technically, the ANLS algorithm is a penalty method whose stationary convergence is heavily affected by the choice of the penalty parameter. As a result, the ANLS algorithm often produces an approximate solution that is generally less accurate than the solution provided by the Newton-like algorithm.
If we view each entry of $\mathbf{X}$, or a set of entries of $\mathbf{X}$, as one block variable, the SNMF problem is a multiblock optimization problem. A popular approach to solving multiblock optimization problems is the block coordinate descent (BCD) approach[17, 16], where only one of the block variables is optimized at each iteration while the remaining blocks are kept fixed. When the one-block subproblem can be solved easily, BCD algorithms often perform very efficiently in practice. Furthermore, since only one block is updated at each iteration, the per-iteration storage and computational burden of a BCD algorithm can be very low, which is desirable for solving large-scale problems. Due to the above merits, the BCD method has recently been used to solve the SNMF problem[12], where each entry of $\mathbf{X}$ is viewed as one block variable and the corresponding subproblem is equivalent to finding the best root of a cubic equation. The BCD algorithm proposed in [12] was designed in an efficient way by exploiting the structure of the SNMF problem. However, since the subproblem of updating each entry of $\mathbf{X}$ may have multiple solutions, the existing convergence results cannot be applied to this algorithm. Though much effort was made in [12] to prove the convergence of the proposed algorithm to stationary solutions, we find that there exists a gap in the convergence proof of Theorem 1 in [12] (see Sec. II-A.2).
To overcome the weakness of the BCD algorithm and make it more flexible, an inexact BCD algorithm, named the block successive upper-bounding minimization (BSUM) algorithm, was proposed for multiblock minimization problems[18]; there, the block variables are updated by successively minimizing a sequence of approximations of the objective function that are either locally tight upper bounds of the objective function or strictly convex local approximations of it. Unlike the classical BCD method, which requires the uniqueness of the minimizer of every one-block subproblem for convergence, the BSUM algorithm and its variants or generalizations are guaranteed to converge to stationary solutions under very mild conditions[18]. Considering these merits of the BSUM algorithm, this paper proposes BSUM-based algorithms to address (regularized) SNMF problems.
The contribution of this paper is threefold:

First, two cyclic BSUM algorithms are proposed for the SNMF problem, including an entrywise BSUM algorithm and a vectorwise BSUM algorithm. The key to both algorithms is to find proper upper bounds for the objective functions of the subproblems involved in the two algorithms.

Then, we study the convergence of the permutation-based randomized BSUM (PRBSUM) algorithm. For the first time, we prove that this algorithm monotonically converges to the set of stationary solutions. As a result, this paper settles the convergence issue of the exact BCD algorithm[12] under the permuted block selection rule.

Lastly, we propose parallel BSUM algorithms to address the SNMF problem. By distributing the computational load over multiple cores of a high-performance processor, the parallel BSUM algorithms can efficiently handle large-scale SNMF problems.
The remainder of this paper is organized as follows. We present two cyclic inexact BCD methods, termed sBSUM and vBSUM, for the SNMF problem in Section II. In Section III, we study the permutation-based randomized BSUM algorithm in a general framework along with its application to the SNMF problem. In Section IV, we discuss how to implement the sBSUM and vBSUM algorithms in parallel on a multicore/multiprocessor architecture. Finally, Section V presents some simulation results, followed by the conclusion in Section VI.
Notations: Throughout this paper, we use uppercase bold letters for matrices, lowercase bold letters for column vectors, and regular letters for scalars. For a matrix $\mathbf{X}$, $\mathbf{X}^T$ and $\mathbf{X}^\dagger$ denote the transpose and pseudoinverse of $\mathbf{X}$, respectively, and $\mathbf{X} \ge \mathbf{0}$ means entrywise inequality. $\|\mathbf{X}\|_F$ and $\operatorname{Tr}(\mathbf{X})$ denote the Frobenius norm and trace of $\mathbf{X}$, respectively. We use the notation $X_{ij}$ (or $[\mathbf{X}]_{ij}$), $\mathbf{X}_{i,:}$, and $\mathbf{X}_{:,j}$ to refer to the $(i,j)$th entry, the $i$th row, and the $j$th column of $\mathbf{X}$, respectively, while $\mathbf{X}_{i:j,:}$ denotes the submatrix of $\mathbf{X}$ consisting of the elements from the $i$th row through the $j$th row. $\mathbf{I}$ denotes the identity matrix, whose size will be clear from the context. We define the elementwise operator $[x]_+ \triangleq \max(x, 0)$. For a function $f$, $f'$ (or $\nabla f$) denotes its derivative (or gradient) with respect to the variable.
II Cyclic BSUM Algorithms For SNMF
This paper aims to provide algorithmic designs, in a broad view, for the following generalized formulation of the SNMF problem
(2) 
where the problem data consist of two given matrices and a scalar weight, and the regularizer is a matrix polynomial of $\mathbf{X}$ of up to fourth order. Problem (2) is in essence a class of multivariate quartic polynomial optimization problems with nonnegativity constraints. Besides the basic SNMF problem, this formulation also covers, for example, various regularized SNMF problems.
It is not difficult to see that problem (2) is as hard as the basic SNMF problem when the polynomial is of fourth order; the main difficulty lies in the fourth-order matrix polynomial, which makes the problem hard to solve. Hence, we focus on the basic SNMF problem throughout the rest of the paper and develop various BSUM-type algorithms for it. Without loss of generality, we assume that the matrix $\mathbf{A}$ is symmetric.^{2} (^{2}When $\mathbf{A}$ is not symmetric, the corresponding problem is equivalent to (2) with $\mathbf{A}$ therein replaced by $(\mathbf{A} + \mathbf{A}^T)/2$, since $\langle \mathbf{A}, \mathbf{X}\mathbf{X}^T \rangle = \langle (\mathbf{A} + \mathbf{A}^T)/2, \mathbf{X}\mathbf{X}^T \rangle$ for the symmetric matrix $\mathbf{X}\mathbf{X}^T$.) We stress that all the techniques developed below can be easily generalized to the regularized SNMF problem (2).
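The symmetrization step in footnote 2 can be checked numerically. The sketch below (sizes and variable names are arbitrary assumptions) verifies that replacing a nonsymmetric matrix by its symmetric part changes the SNMF objective only by a constant independent of the factor, which is why the symmetry assumption is without loss of generality.

```python
import numpy as np

# Since <A, X X^T> = <(A + A^T)/2, X X^T> for the symmetric matrix X X^T,
# the objectives with A and with (A + A^T)/2 differ by a constant in X.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))          # a nonsymmetric "similarity" matrix
As = (A + A.T) / 2                       # its symmetric part
f = lambda M, X: np.sum((M - X @ X.T) ** 2)
X1, X2 = rng.random((6, 2)), rng.random((6, 2))
gap1, gap2 = f(A, X1) - f(As, X1), f(A, X2) - f(As, X2)
print(np.isclose(gap1, gap2))            # True: the gap does not depend on X
```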
II-A Scalarwise BSUM algorithm
The basic SNMF problem is given by
$\min_{\mathbf{X} \ge \mathbf{0}} \; f(\mathbf{X}) \triangleq \|\mathbf{A} - \mathbf{X}\mathbf{X}^T\|_F^2 \qquad (3)$
It is readily seen that, when all the entries of $\mathbf{X}$ but one are fixed, problem (3) reduces to minimizing a univariate quartic polynomial (UQP), which can be easily handled. Based on this fact, exact BCD algorithms, which cyclically update each entry of $\mathbf{X}$ by solving a UQP problem, have been developed for the SNMF problem[12, 14]. However, since the solution to the UQP problem may not be unique, the existing convergence theory of BCD does not apply to these exact BCD algorithms, though much effort has been made to resolve the convergence issue in [12, 14]. In the following, we develop a BSUM-based algorithm for the SNMF problem, in which each time we update one entry of $\mathbf{X}$ by minimizing a locally tight upper bound of $f$; the algorithm is hence termed scalar-BSUM (sBSUM).
II-A.1 Algorithm design and complexity analysis
In the sBSUM algorithm, each time we optimize the $(i,k)$th entry of $\mathbf{X}$ while fixing the others. For notational convenience, let $x$ denote the $(i,k)$th entry of $\mathbf{X}$ to be optimized. Given the current $\mathbf{X}$, denoted as $\bar{\mathbf{X}}$, we can write the new $\mathbf{X}$ after updating the $(i,k)$th entry as
$\mathbf{X} = \bar{\mathbf{X}} + (x - \bar{X}_{ik})\mathbf{E}_{ik}$, where $\mathbf{E}_{ik}$ is a matrix with $1$ in the $(i,k)$th entry and $0$ elsewhere. Accordingly, we can express the objective function as follows
$f\big(\bar{\mathbf{X}} + (x - \bar{X}_{ik})\mathbf{E}_{ik}\big) = g(x) \qquad (4)$
where
$g(x) \triangleq a x^4 + b x^3 + c x^2 + d x + \mathrm{const} \qquad (5)$
with the tuple $(a, b, c, d)$ given by
(6)  
(7)  
(8)  
(9) 
The derivation of the tuple $(a, b, c, d)$ is relegated to Appendix A. To obtain an upper bound of $g$, we define a convex surrogate function $u(x|\bar{x})$, constructed from the tuple $(a, b, c, d)$, that coincides with $g$ at the current point $\bar{x} = \bar{X}_{ik}$ and dominates $g$ elsewhere. Clearly, $u(\cdot|\bar{x})$ is a locally tight upper bound of $g$ at $\bar{x}$. Using this upper bound function, the sBSUM algorithm updates $x$ (i.e., the $(i,k)$th entry of $\mathbf{X}$) by solving
$\min_{x \ge 0} \; u(x|\bar{x}) \qquad (10)$
The following lemma states that $u(\cdot|\bar{x})$ is convex and, moreover, that problem (10) has a unique solution.
Lemma II.1
$u(\cdot|\bar{x})$ is a convex function and a locally tight upper bound of $g$. Moreover, problem (10) has a unique solution.
Clearly, $u(\cdot|\bar{x})$ is a locally tight upper bound of $g$, since $u(\bar{x}|\bar{x}) = g(\bar{x})$ and $u(x|\bar{x}) \ge g(x)$ for all $x \ge 0$. Hence, it suffices to show that $u(\cdot|\bar{x})$ is a convex function. By noting that its second derivative is nonnegative everywhere, we infer that $u(\cdot|\bar{x})$ is a convex function. In the generic case, $u(\cdot|\bar{x})$ is strictly convex, and thus problem (10) has a unique solution. In the remaining degenerate case, equating the derivative of $u$ to zero and taking the nonnegativity constraint into consideration, we obtain the unique solution to problem (10) in closed form. This completes the proof.
As a result of Lemma II.1, the unique solution to problem (10) can be found by examining the critical point of $u(\cdot|\bar{x})$ (i.e., the point where $u'(x|\bar{x}) = 0$). That is, if the critical point of $u(\cdot|\bar{x})$ is nonnegative, then it is the unique optimal solution to problem (10); otherwise, the optimal solution is $x = 0$. Hence, the key to solving problem (10) is to find the root of a cubic equation whose coefficients are specified by the tuple $(a, b, c, d)$. Fortunately, we can easily solve this cubic equation and obtain the unique closed-form solution to problem (10), as stated in the following lemma.
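As an illustration of this recipe, the sketch below minimizes a convex quartic over the nonnegative reals by locating the real critical point of the derivative cubic numerically (via np.roots) and projecting it onto the nonnegative orthant. It is a numerical stand-in for, not a reproduction of, the closed-form Cardano solution (11)-(12); the convexity assumption mirrors Lemma II.1.

```python
import numpy as np

def min_convex_quartic_nonneg(a, b, c, d):
    """Minimize u(x) = a x^4 + b x^3 + c x^2 + d x over x >= 0, assuming
    a > 0 and u convex (as Lemma II.1 guarantees for the surrogate).
    Convexity makes u' increasing, so u' has a unique real root; if that
    root is negative, the constrained minimizer is x = 0."""
    assert a > 0
    roots = np.roots([4 * a, 3 * b, 2 * c, d])       # critical points of u
    real = roots[np.abs(roots.imag) < 1e-9].real     # the unique real root
    return max(float(real[0]), 0.0)

# example: u(x) = x^4 + 2x^2 - 4x is strictly convex; its critical point
# solves 4x^3 + 4x - 4 = 0 and is positive, so no projection is needed
print(min_convex_quartic_nonneg(1.0, 0.0, 2.0, -4.0))   # ~0.6823
```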
Lemma II.2
Let the tuple $(a, b, c, d)$ be given in Eqs. (6)-(9) and define
The unique solution to problem (10) can be expressed as
(11) 
where the auxiliary quantity is given by
(12) 
As shown in Lemma II.1, the unique solution to problem (10) is obtained trivially in the degenerate case, so it suffices to consider the generic case. In this case, the equation $u'(x|\bar{x}) = 0$ can be transformed, through a standard change of variable and the definitions of the quantities above, into a depressed cubic equation. Since the corresponding discriminant is positive, it follows from Cardano's method [15] that the equation has a unique real solution. Furthermore, accounting for the nonnegativity constraint, we can derive the closed-form solution as shown in (11). This completes the proof.

To summarize, the basic sBSUM algorithm proceeds as follows: each time we pick an entry of $\mathbf{X}$ to be optimized, first calculate the tuple $(a, b, c, d)$, and then update the entry according to (11). It is readily seen that computing the coefficients in (8) and (9) is the most computationally costly step of the sBSUM algorithm. In order to save memory and computational overhead, we can use the following expression
$\bar{\mathbf{X}}\bar{\mathbf{X}}^T\bar{\mathbf{X}} = \bar{\mathbf{X}}\big(\bar{\mathbf{X}}^T\bar{\mathbf{X}}\big) \qquad (13)$
to compute $\bar{\mathbf{X}}\bar{\mathbf{X}}^T\bar{\mathbf{X}}$, as the $r \times r$ matrix $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ has a smaller size than the $n \times n$ matrix $\bar{\mathbf{X}}\bar{\mathbf{X}}^T$. Moreover, we have
$\mathbf{X}^T\mathbf{X} = \bar{\mathbf{X}}^T\bar{\mathbf{X}} + (x - \bar{X}_{ik})\big(\mathbf{E}_{ik}^T\bar{\mathbf{X}} + \bar{\mathbf{X}}^T\mathbf{E}_{ik}\big) + (x - \bar{X}_{ik})^2\,\mathbf{E}_{ik}^T\mathbf{E}_{ik} \qquad (14)$
where $\mathbf{E}_{ik}^T\mathbf{E}_{ik}$ is a matrix with $1$ in the $(k,k)$th entry and $0$ elsewhere. Note that $\bar{\mathbf{X}}^T\mathbf{E}_{ik}$ is a null matrix except for its $k$th column, which equals $(\bar{\mathbf{X}}_{i,:})^T$, while $\mathbf{E}_{ik}^T\bar{\mathbf{X}}$ is a null matrix except for its $k$th row, which equals $\bar{\mathbf{X}}_{i,:}$. Hence, we only need to update the $k$th row, the $k$th column, and the $(k,k)$th entry of the matrix $\bar{\mathbf{X}}^T\bar{\mathbf{X}}$ once the $(i,k)$th entry of $\bar{\mathbf{X}}$ is updated, which can be done recursively.
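A minimal sketch of this recursive update follows (variable names assumed; the paper's TABLE I is the authoritative listing). When entry (i, k) of X changes by delta, the Gram matrix X^T X changes only in its kth row, kth column, and (k, k) entry, exactly as (14) indicates, and the sanity check at the end confirms this against direct recomputation.

```python
import numpy as np

def update_gram(G, X, i, k, delta):
    """Recursive update of G = X^T X per (14): only the kth row, kth column,
    and (k, k) entry of G change when X[i, k] is increased by delta."""
    row_i = X[i, :].copy()        # ith row of X BEFORE the entry update
    G[k, :] += delta * row_i      # kth row of G
    G[:, k] += delta * row_i      # kth column of G (together these add
                                  # 2*delta*X[i, k] at the (k, k) entry)
    G[k, k] += delta ** 2         # quadratic term at the (k, k) entry
    return G

# sanity check against direct recomputation of the Gram matrix
rng = np.random.default_rng(1)
X = rng.random((5, 3))
G = X.T @ X
G = update_gram(G, X, i=1, k=2, delta=0.7)
X[1, 2] += 0.7
print(np.allclose(G, X.T @ X))    # True
```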
Computing the coefficient in (8) requires the knowledge of $(\bar{\mathbf{X}}\bar{\mathbf{X}}^T)_{ii}$. Similarly, we have
$\mathbf{X}\mathbf{X}^T = \bar{\mathbf{X}}\bar{\mathbf{X}}^T + (x - \bar{X}_{ik})\big(\mathbf{E}_{ik}\bar{\mathbf{X}}^T + \bar{\mathbf{X}}\mathbf{E}_{ik}^T\big) + (x - \bar{X}_{ik})^2\,\mathbf{E}_{ik}\mathbf{E}_{ik}^T,$
where $\mathbf{E}_{ik}\mathbf{E}_{ik}^T$ is a matrix with $1$ in the $(i,i)$th entry and $0$ elsewhere. It follows that
$(\mathbf{X}\mathbf{X}^T)_{ii} = (\bar{\mathbf{X}}\bar{\mathbf{X}}^T)_{ii} + 2(x - \bar{X}_{ik})\bar{X}_{ik} + (x - \bar{X}_{ik})^2 \qquad (15)$
Hence, we only need to recursively update the diagonal entries of $\bar{\mathbf{X}}\bar{\mathbf{X}}^T$ once the $(i,k)$th entry of $\bar{\mathbf{X}}$ is updated.
Using (13) for computing the coefficients, and (14) and (15) for the recursive updates, we formally present the proposed sBSUM algorithm in TABLE I, where the function in Step 6 finds the unique solution to a cubic equation whose coefficients are specified by the tuple $(a, b, c, d)$, followed by a projection onto the nonnegative orthant (i.e., Eq. (11) for Step 6); Steps 7-10 are recursive updates that reduce the computational burden. According to the table and the previous analysis, the sBSUM algorithm only needs to store the matrices $\mathbf{A}$, $\mathbf{X}$, and $\mathbf{X}^T\mathbf{X}$, along with the diagonal entries of $\mathbf{X}\mathbf{X}^T$, which require $O(\mathrm{nnz}(\mathbf{A}))$, $O(nr)$, $O(r^2)$, and $O(n)$ space in memory, respectively. Here, $\mathrm{nnz}(\mathbf{A})$ denotes the number of nonzero entries of $\mathbf{A}$, which is generally $O(n^2)$ in the dense matrix case but much smaller in the sparse matrix case. Hence, the sBSUM algorithm requires $O(\mathrm{nnz}(\mathbf{A}) + nr)$ space in memory. In addition, it is seen that the most costly step of the sBSUM is computing the entries of $\mathbf{A}\mathbf{X}$; that is, this computation dominates the per-iteration computational complexity of the sBSUM algorithm (we refer to updating all entries of $\mathbf{X}$ once as one iteration). Thus, it can be shown that the per-iteration computational complexity of the sBSUM algorithm is $O(\mathrm{nnz}(\mathbf{A})\,r)$ in the sparse case and $O(n^2 r)$ in the dense case.
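For concreteness, the following compact reference loop performs entrywise coordinate updates on the basic SNMF objective. Two deliberate simplifications should be noted: it recomputes the residual from scratch instead of using the recursions (13)-(15), so it does not attain the complexity stated above, and it minimizes the exact univariate quartic by root-finding (as in the exact BCD of [12]) rather than the convex surrogate of Lemma II.1. The coefficient expressions come from a direct expansion of the objective around the current entry and are our own derivation, not Eqs. (6)-(9).

```python
import numpy as np

def entrywise_update(A, X, i, k):
    """One coordinate step on f(X) = ||A - X X^T||_F^2 with X >= 0 and A
    symmetric. Writing X[i, k] = xbar + delta, a direct expansion gives
    f = const + d*delta + c*delta^2 + b*delta^3 + delta^4 with the
    coefficients below; we minimize exactly over delta >= -xbar."""
    R = A - X @ X.T                       # residual at the current iterate
    v = X[:, k]                           # kth column of X
    xbar = X[i, k]
    b = 4.0 * xbar
    c = 2.0 * (v @ v + xbar ** 2 - R[i, i])
    d = -4.0 * (R @ v)[i]
    # candidates: real critical points plus the boundary delta = -xbar
    crit = np.roots([4.0, 3.0 * b, 2.0 * c, d])
    cands = [r.real for r in crit if abs(r.imag) < 1e-9 and r.real >= -xbar]
    cands.append(-xbar)
    quartic = lambda t: t ** 4 + b * t ** 3 + c * t ** 2 + d * t
    X[i, k] = xbar + min(cands, key=quartic)
    return X

def full_sweep(A, X):
    """One 'iteration' in the sense of the text: update every entry once."""
    n, r = X.shape
    for i in range(n):
        for k in range(r):
            X = entrywise_update(A, X, i, k)
    return X
```

Repeated calls to full_sweep decrease the objective monotonically, since each coordinate step is an exact minimization over its entry.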
II-A.2 Comparison of exact (cyclic) BCD[12] and (cyclic) sBSUM
The exact BCD algorithm and the sBSUM have similar iterations (cf. Algorithm 4 in [12] and our Algorithm 1) and the same complexity. The main difference between them lies in Step 6: if we replace Step 6 with the exact minimization of $g$ and let the 'solve' function denote finding the best nonnegative root (of the corresponding cubic equation) among up to three real roots, the sBSUM reduces to the exact BCD algorithm. Thus, whenever the surrogate coincides with $g$ at all iterations, the sBSUM is exactly the same as the exact BCD algorithm. Clearly, a notable advantage of the sBSUM is that a closed-form unique real root is guaranteed in Step 6, but this is not the case for the BCD algorithm (Algorithm 1 in [12] is invoked to find the best root). Thanks to this uniqueness of the subproblems' solutions, the sBSUM is guaranteed to converge to stationary solutions; see the argument in [18]. On the other hand, convergence is not guaranteed for a multiblock cyclic BCD algorithm if each of its subproblems may have multiple solutions; see the well-known example due to Powell [26]. The latter result implies that the cyclic BCD algorithm, although having similar iterations and the same complexity as the sBSUM, may not be a good choice for the SNMF problem.
It is worth mentioning that the work [12] attempted to provide a convergence proof for its cyclic-BCD-based SNMF algorithm. However, there is a gap in the proof, explained as follows. Let $\{\mathbf{X}^t\}$ be the sequence generated by the BCD algorithm and let $\{\mathbf{X}^{t_j}\}$ be one of its subsequences, converging to a limit point $\mathbf{X}^\star$. In the proof of Theorem 1 in [12], assuming to the contrary that $\mathbf{X}^\star$ is not a stationary solution, the authors start by arguing that there exists a scalar $\epsilon > 0$ such that the corresponding block update decreases the objective by at least $\epsilon$, i.e.,
(16)
holds for some $\epsilon > 0$, and finally obtain
(17)
where $j$ is a sufficiently large integer. Eq. (17) expresses a sufficient decrease in the objective function values, implying that the sequence of objective values is unbounded below, i.e., a contradiction (because the objective is lower bounded), so that $\mathbf{X}^\star$ must be a stationary solution. However, there is a gap in this derivation: the scalar $\epsilon$ in (16) should in fact depend on $\mathbf{X}^{t_j}$, i.e., on the iteration. As a result, the scalar in Eq. (17) should be iteration-dependent, say $\epsilon_j$. Then we do not necessarily arrive at a contradiction (to the lower-boundedness of the objective), since the sum $\sum_j \epsilon_j$ could be finite.
To summarize, since the subproblems of the cyclic-BCD-based SNMF algorithm[12] may have multiple solutions, there is no theoretical guarantee of convergence for the algorithm in [12]. In contrast, the proposed sBSUM is guaranteed to reach the stationary solutions of the SNMF problem [18]. Moreover, it is interesting to note that the proposed sBSUM (as well as the other BSUM-based SNMF algorithms to be discussed later) cannot get stuck at a local maximum, as shown in Appendix B.
II-B Vectorwise BSUM
Generally speaking, for multiblock BCD/BSUM algorithms, the smaller the number of block variables, the faster the convergence. In the following, we develop a new algorithm, termed vector-BSUM (vBSUM), in which each time we update a single row of $\mathbf{X}$ by minimizing an upper bound of $f$.
We first study the subproblem of the vBSUM which optimizes the $i$th row of $\mathbf{X}$ given the current iterate, denoted as $\bar{\mathbf{X}}$. To this end, we separate the terms of the objective that involve the $i$th row from those that do not. Since we have
(18) 
the subproblem for updating the $i$th row of $\mathbf{X}$ is given by
(19) 
With a further change of notation that collects the fixed quantities, we can rewrite problem (19) equivalently (equivalent in the sense that the two problems have the same optimal solution) as follows
(20) 
Clearly, problem (20) is not necessarily convex. But even if it is convex^{3}, it is still unlikely that a closed-form optimal solution to problem (20) can be obtained, due to the presence of the nonnegativity constraints and the coupling of the entries of the row variable. (^{3}When the matrix $\mathbf{A}$ is a similarity matrix generated by the widely used Gaussian kernel, e.g., $A_{ij} = \exp(-\|\mathbf{p}_i - \mathbf{p}_j\|^2/\sigma^2)$ (where $\mathbf{p}_i$ and $\mathbf{p}_j$ are two data points and $\sigma$ is a parameter), or a normalized similarity matrix, all its entries lie in $[0,1]$. On the other hand, because $\mathbf{A}$ is nonnegative, the matrix defining the quadratic term is almost always a positive matrix whose eigenvalues all lie in an interval characterized in [23, Chap. 8]. Moreover, when the number of data points is very large, this matrix is very likely positive semidefinite, implying that problem (20) is convex with very high probability in practice.) In what follows, we solve (20) by using the BSUM algorithm.
Thanks to the Lipschitz continuity of the gradient of the quadratic part of the objective in (20), we can find an upper bound for the objective function of problem (20). Specifically, denoting the quadratic part by $q(\cdot)$ and the row variable by $\mathbf{h}$, we have the following bound for the quadratic term:
$q(\mathbf{h}) \le q(\bar{\mathbf{h}}) + \nabla q(\bar{\mathbf{h}})^T(\mathbf{h} - \bar{\mathbf{h}}) + \frac{L}{2}\|\mathbf{h} - \bar{\mathbf{h}}\|^2,$
where $L$ is the Lipschitz constant of the gradient of $q$, which can simply be chosen as the maximum eigenvalue of the matrix defining the quadratic term (see footnote 3 for the reason). By replacing the quadratic term in (20) with the above upper bound evaluated at the current iterate (i.e., letting $\bar{\mathbf{h}}$ be the current row) and rearranging the terms, we obtain
(21) 
where the linear coefficient vector collects the terms arising from the bound above. Observing that the linear term should be maximized for any fixed norm of the row variable, we show in Lemma II.3 that problem (21) admits a unique closed-form solution.
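The bound used above is the standard descent lemma for functions with Lipschitz-continuous gradients. The snippet below checks it numerically for a generic quadratic q(h) = h^T Q h - 2 b^T h (the matrix Q, the vector b, and the sizes are assumptions for illustration, not the paper's exact quantities), with the Lipschitz constant taken from the maximum eigenvalue as suggested in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M @ M.T                                   # symmetric PSD matrix
b = rng.standard_normal(5)
q = lambda h: h @ Q @ h - 2 * b @ h           # a generic quadratic part
grad_q = lambda h: 2 * Q @ h - 2 * b
L = 2 * np.linalg.eigvalsh(Q).max()           # Lipschitz constant of grad q
h, hb = rng.random(5), rng.random(5)          # trial point and current point
bound = q(hb) + grad_q(hb) @ (h - hb) + 0.5 * L * np.sum((h - hb) ** 2)
print(q(h) <= bound + 1e-9)                   # True: the upper bound holds
```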
Lemma II.3
It is easily seen that when an entry of the linear coefficient vector in (21) is nonpositive, the corresponding component of the optimal solution equals zero. As a result, problem (21) is equivalent to the following
(23) 
Trivially, the optimal solution is zero if the linear coefficient vector has no positive entry. Hence, we only need to consider the case when it has at least one positive entry. First, it is readily seen that problem (23) is further equivalent to
(24)
Second, note that the linear term (for any fixed norm of the variable) should be maximized subject to the nonnegativity and norm constraints. By the Cauchy-Schwarz inequality, the optimal solution is proportional to the (normalized) positive part of the linear coefficient vector. Hence, problem (24) reduces to the following convex problem in the scalar norm variable
(25)
By the first-order optimality condition, we know that the optimal norm is the unique real root of a cubic equation, which can be obtained in closed form as shown above. This completes the proof.

Our proposed vBSUM algorithm, summarized in TABLE II, requires an approximate solution of problem (20) at each iteration. Such a solution is obtained by iteratively solving (21) a fixed number of times $T$; see Steps 5-8. We remark that, regardless of the number of inner iterations performed (i.e., the choice of $T$), the vBSUM is guaranteed to converge to stationary solutions of the basic SNMF problem, by applying the same analysis as that of the BSUM algorithm [18, Theorem 2]. Furthermore, from TABLE II, one can see that the most costly step of the vBSUM lies in Step 4, which requires $O(n^2 r)$ operations per iteration in the dense case and $O(\mathrm{nnz}(\mathbf{A})\,r)$ operations per iteration in the sparse case. Hence, the per-iteration computational complexity of the vBSUM is the same as that of the sBSUM. In addition, it is seen that we only need to store $\mathbf{A}$, $\mathbf{X}$, and a few intermediate vectors and matrices in the algorithm. Hence, the vBSUM requires less space in memory in practice than the sBSUM algorithm, though both have the same order of memory complexity, i.e., $O(\mathrm{nnz}(\mathbf{A}) + nr)$.
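A much-simplified row update in the spirit of vBSUM is sketched below: it takes T projected-gradient steps on the objective restricted to one row, using a crude spectral-norm-based stepsize. This is an illustrative substitute only; the paper's vBSUM minimizes the surrogate (21) exactly via the closed form of Lemma II.3, and the stepsize heuristic here is an assumption, not the paper's Lipschitz constant.

```python
import numpy as np

def row_update(A, X, i, T=3):
    """T inexact inner steps on row i of X for f(X) = ||A - X X^T||_F^2,
    with A symmetric. Uses projected gradient steps instead of the
    closed-form surrogate minimization of Lemma II.3."""
    for _ in range(T):
        R = A - X @ X.T
        grad = -4.0 * (R @ X)[i, :]          # partial gradient w.r.t. row i
        # heuristic stepsize (illustrative; not a rigorous Lipschitz bound
        # for the quartic part of the row subproblem)
        L = 4.0 * (np.linalg.norm(X.T @ X, 2) + np.linalg.norm(A, 2)) + 1e-12
        X[i, :] = np.maximum(X[i, :] - grad / L, 0.0)   # project onto >= 0
    return X
```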
Remark II.1
Certainly, we could use the elementwise BSUM algorithm (i.e., view each entry of a row of $\mathbf{X}$ as one block) to update each row of $\mathbf{X}$ and obtain an alternative to the vBSUM algorithm; this is a simple variant of the sBSUM algorithm obtained by (for each row) updating the entries of that row multiple times in each iteration. For convenience, we refer to this alternative as the vsBSUM algorithm and assume the same number of repeats $T$ as in the vBSUM algorithm for updating the rows of $\mathbf{X}$. Note that the vsBSUM algorithm requires more operations in each iteration than the vBSUM algorithm, due to the frequent computation of the coefficient tuple (cf. Step 5 of Algorithm 1), with the additional cost in the vBSUM coming only from Step 6 of Algorithm 2. Moreover, the vsBSUM algorithm requires many times as many root-finding operations as the vBSUM algorithm does. Therefore, we prefer the proposed vBSUM algorithm over this simple variant of the sBSUM algorithm for better efficiency.
Remark II.2
The vBSUM algorithm is a rowwise BSUM algorithm. Similarly, we can develop a columnwise BSUM algorithm for the basic SNMF problem. However, unlike the subproblem (20) in the vBSUM algorithm, the subproblem of updating each column of $\mathbf{X}$ has no good structure, and the corresponding Lipschitz constant is not easily available when the number of data points is very large. More importantly, the vBSUM algorithm is more amenable to both randomized and parallel implementation, which will become clear in the following sections.
III Permutation-based Randomized BSUM Algorithm And Its Application in SNMF
The sBSUM and vBSUM algorithms both fall into the category of deterministic cyclic BSUM algorithms. As a variant of the cyclic BSUM algorithm, the randomized BCD/BSUM algorithm has been proposed[20], where each time one block variable is chosen to be optimized with a certain probability. Differently from the basic randomized BCD/BSUM algorithm, the permutation-based randomized BCD/BSUM algorithm (termed PRBCD/BSUM) updates the block variables according to a random permutation rule, in which the blocks are randomly selected without replacement. However, the convergence of the PRBCD/BSUM algorithm has not been well understood, as pointed out in a recent survey [16]. In particular, it is not known, at least in theory, whether random permutation provides any added benefit compared with the more classical cyclic block selection rules. In this section, we first study the convergence of the PRBSUM algorithm (including PRBCD as a special case) in a general framework and then propose the randomized sBSUM and vBSUM algorithms.
III-A The PRBSUM algorithm and the convergence results
We start with a general description of the PRBSUM algorithm. Consider the following multiblock minimization problem
(26) 
where each block feasible set is closed and convex. Let the objective function and the surrogate functions used in the block updates satisfy the following assumption.
Assumption III.1
(27a)  
(27b)  
(27c)  
(27d) 
where the notation refers to the $k$th block component of the corresponding variable, the remaining symbols represent the block components with indices less than or larger than $k$, respectively, and the two derivative expressions denote the directional derivatives of the surrogate function and of the objective, each taken with respect to the $k$th block along the same direction. The assumption (27c) guarantees that the first-order behavior of the surrogate function is locally the same as that of the objective[18]; hence, it is referred to as the gradient consistency assumption.
The PRBSUM algorithm is described in TABLE III, where the 'randperm' function in Step 3 generates an index set containing a random permutation of the block indices and thereby specifies the order in which the block variables are updated. The PR selection rule takes advantage of both the randomized rule and the cyclic rule: it guarantees timely updates of all block variables, while avoiding getting stuck with a bad update sequence. In the following, we prove that, with probability one (w.p.1), the sequence generated by the PRBSUM algorithm converges to the set of stationary solutions of problem (26).
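A minimal sketch of the PR selection rule is given below, reusing the hypothetical entrywise_update from the sBSUM sketch in Section II-A as the per-block step. The epoch count and seed are illustrative; the essential point is that each epoch visits every block exactly once, in a fresh uniformly random order (sampling without replacement), mirroring the 'randperm' function of Step 3 in TABLE III.

```python
import numpy as np

def pr_bsum(A, X, epochs=50, seed=0):
    """Permutation-based randomized block updates: in each epoch, draw a
    random permutation of all blocks (entries of X) and update them in that
    order, so every block is updated exactly once per epoch."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    blocks = [(i, k) for i in range(n) for k in range(r)]
    for _ in range(epochs):
        for idx in rng.permutation(len(blocks)):   # 'randperm' selection rule
            i, k = blocks[idx]
            X = entrywise_update(A, X, i, k)       # per-block (B)SUM step
    return X
```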
