Inexact Block Coordinate Descent Methods For Symmetric Nonnegative Matrix Factorization


Qingjiang Shi, Haoran Sun, Songtao Lu, Mingyi Hong, and Meisam Razaviyayn

Q. Shi is now with the Dept. of Industrial and Manufacturing Systems Engineering, Iowa State University, IA 50011, USA. He is also with the School of Info. Sci. & Tech., Zhejiang Sci-Tech University, Hangzhou 310018, China. Email: qing.j.shi@gmail.com. H. Sun, S. Lu, and M. Hong are with the Dept. of Industrial and Manufacturing Systems Engineering, Iowa State University, IA 50011, USA. Email: {mingyi, hrsun, songtao}@iastate.edu. M. Razaviyayn is with the Dept. of Electrical and Computer Engineering, Stanford University, Stanford, CA 94305, USA. Email: meisamr@stanford.edu.
Abstract

Symmetric nonnegative matrix factorization (SNMF) is equivalent to computing a symmetric nonnegative low-rank approximation of a data similarity matrix. It inherits the good data interpretability of the well-known nonnegative matrix factorization technique and has a better ability to cluster nonlinearly separable data. In this paper, we focus on the algorithmic aspect of the SNMF problem and propose simple inexact block coordinate descent methods to address the problem, leading to both serial and parallel algorithms. The proposed algorithms have guaranteed stationary convergence and can efficiently handle large-scale and/or sparse SNMF problems. Extensive simulations verify the effectiveness of the proposed algorithms compared to recent state-of-the-art algorithms.

Keywords

Symmetric nonnegative matrix factorization, block coordinate descent, block successive upper-bound minimization, parallel algorithm, stationary convergence.

I. Introduction

Clustering is the task of grouping a set of data points into different clusters according to some measure of data similarity. It is a common technique for statistical data analysis and has been widely used in fields such as machine learning, pattern recognition, and data compression. In some applications, clustering is performed on data that has inherent nonnegativity. This motivates great interest in the application of nonnegative matrix factorization (NMF) to clustering. NMF has been shown to be very effective for clustering linearly separable data because of its ability to automatically extract sparse and easily interpretable factors[1]. As a close relative (or a variant) of NMF, symmetric NMF (SNMF) is more directly related to clustering problems: it can even be viewed as a relaxed version of two classical clustering methods, k-means clustering and spectral clustering[2]. It inherits the good interpretability of NMF and has received considerable interest due to its better ability to cluster nonlinearly separable data given a data similarity matrix[2, 12, 3, 5, 4, 13, 6]. This paper focuses on the algorithmic aspect of SNMF.

The basic SNMF problem is described as follows. Given $n$ data points, a similarity matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be generated by evaluating the similarity between any two data points in terms of an appropriate similarity metric. The SNMF problem of determining $k$ clusters is to find a low-rank nonnegative factorization matrix $\mathbf{X} \in \mathbb{R}^{n \times k}$ (with $k \ll n$ generally) such that $\mathbf{A} \approx \mathbf{X}\mathbf{X}^T$. Using the Frobenius norm, the basic SNMF problem is often formulated as follows

$$\min_{\mathbf{X} \geq \mathbf{0}} \ \big\| \mathbf{A} - \mathbf{X}\mathbf{X}^T \big\|_F^2 \qquad\qquad (1)$$

where the nonnegativity constraint $\mathbf{X} \geq \mathbf{0}$ is a componentwise inequality. Such a model often leads to sparse factors which are more interpretable¹ for problems such as image or document analysis. Problem (1) is closely related to the so-called completely positive matrix problem, which asks whether $\mathbf{A}$ can be exactly factorized as $\mathbf{A} = \mathbf{X}\mathbf{X}^T$ with $\mathbf{X} \geq \mathbf{0}$. This problem originated in the theory of inequalities and quadratic forms[7, 8] and plays an important role in discrete optimization[9]. Note that the problem of deciding whether a matrix is completely positive has recently been shown to be NP-hard[10]. As such, problem (1) is generally NP-hard and thus efficient numerical algorithms are desired to obtain high-quality, but not necessarily globally optimal, solutions.

¹With the nonnegativity requirement on $\mathbf{X}$, the index of the largest entry of the $i$-th row of $\mathbf{X}$ can simply be taken as the cluster label of the $i$-th data point. This greatly facilitates clustering once the symmetric factorization of $\mathbf{A}$ is obtained.
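To make the objective in (1) concrete, the following minimal NumPy sketch evaluates $\|\mathbf{A}-\mathbf{X}\mathbf{X}^T\|_F^2$ and its gradient $4(\mathbf{X}\mathbf{X}^T-\mathbf{A})\mathbf{X}$ for a symmetric $\mathbf{A}$; the function names, the Gaussian-kernel construction, and the random test data are illustrative assumptions only and are not part of the algorithms developed in this paper.

```python
import numpy as np

def snmf_objective(A, X):
    """Frobenius-norm SNMF objective ||A - X X^T||_F^2."""
    R = A - X @ X.T
    return np.sum(R * R)

def snmf_gradient(A, X):
    """Gradient of the objective w.r.t. X (A assumed symmetric)."""
    return 4.0 * (X @ X.T - A) @ X

# Illustrative usage on a small, symmetric, Gaussian-kernel similarity matrix.
rng = np.random.default_rng(0)
pts = rng.standard_normal((20, 2))                       # 20 toy data points
D2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
A = np.exp(-D2)                                          # similarity matrix
X = np.abs(rng.standard_normal((20, 3)))                 # nonnegative initial factor
print(snmf_objective(A, X), np.linalg.norm(snmf_gradient(A, X)))
```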

So far, there has been limited algorithmic work on SNMF as compared to the general NMF. Following the popular multiplicative update rule proposed for the NMF problem[1], some pioneering works[3, 5, 4] on SNMF proposed various modified multiplicative update rules for SNMF. The multiplicative update rules are usually simple to implement but require the similarity matrix to be at least nonnegative to ensure the nonnegativity of $\mathbf{X}$ at each iteration. For the case of a positive definite and completely positive similarity matrix, the authors in [6] developed three faster parallel multiplicative update algorithms, including the basic multiplicative update algorithm and two accelerated SNMF variants, all converging to stationary solutions of the SNMF problem. It was numerically shown that the latter two outperform the first one and other previous multiplicative-update-based SNMF algorithms. It should be stressed that all the above multiplicative-update-based SNMF algorithms implicitly assume that $\mathbf{X}$ is positive at each iteration in order to derive the corresponding auxiliary functions[3, 5, 4, 6] and require arithmetic division in the updates. Consequently, they may not be numerically stable when some entries of $\mathbf{X}$ reach zero during the iterations and cannot even guarantee stationary convergence in some cases.

While the above algorithms require the similarity matrix to be nonnegative, the similarity matrix in graph-based clustering may be neither nonnegative nor positive definite (e.g., the matrix could have negative entries under some definitions of the kernel matrices used in semi-supervised clustering [11]). Hence, the aforementioned algorithms do not apply to the SNMF problem with a general similarity matrix. In [2], the authors proposed two numerical optimization methods to find stationary solutions to the general SNMF problem. The first one is a Newton-like algorithm which uses partial second-order information (e.g., a reduced Hessian or its variants) to reduce the computational complexity of determining the search direction, while the second one, called the alternating nonnegative least squares (ANLS) algorithm, splits the SNMF problem into two classes of nonnegative least-squares subproblems by using a penalty method coupled with a two-block coordinate descent method. Technically, the ANLS algorithm is a penalty method whose stationary convergence is heavily affected by the choice of the penalty parameter. As a result, the ANLS algorithm often yields an approximate solution which is generally less accurate than the solution provided by the Newton-like algorithm.

If we view each entry of $\mathbf{X}$, or a set of entries of $\mathbf{X}$, as one block variable, the SNMF problem is a multi-block optimization problem. A popular approach to solving multi-block optimization problems is the block coordinate descent (BCD) approach[17, 16], where only one of the block variables is optimized at each iteration while the remaining blocks are kept fixed. When the one-block subproblem can be easily solved, BCD algorithms often perform very efficiently in practice. Furthermore, since only one block is updated at each iteration, the per-iteration storage and computational burden of the BCD algorithm can be very low, which is desirable for solving large-scale problems. Due to the above merits, the BCD method has recently been used to solve the SNMF problem[12], where each entry of $\mathbf{X}$ is viewed as one block variable and the corresponding subproblem is equivalent to finding the best root of a cubic equation. The BCD algorithm proposed in [12] was designed in an efficient way by exploiting the structure of the SNMF problem. However, since the subproblem of updating each entry of $\mathbf{X}$ may have multiple solutions, the existing convergence results cannot be applied to the BCD algorithm proposed in [12]. Though much effort was made in [12] to prove convergence of the proposed algorithm to stationary solutions, we find that there is a gap in the convergence proof of Theorem 1 in [12] (see Sec. II-A2).

To overcome this weakness of the BCD algorithm and make it more flexible, an inexact BCD algorithm, named the block successive upper-bound minimization (BSUM) algorithm, was proposed for multi-block minimization problems[18], where the block variables are updated by successively minimizing a sequence of approximations of the objective function which are either locally tight upper bounds of the objective function or strictly convex local approximations of it. Unlike the classical BCD method, which requires the uniqueness of the minimizer of every one-block subproblem for convergence, the BSUM algorithm and its variants or generalizations are guaranteed to converge to stationary solutions under very mild conditions[18]. Considering these merits of the BSUM algorithm, this paper proposes BSUM-based algorithms to address (regularized) SNMF problems.
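To fix ideas, one common choice of a locally tight upper bound for a block with a Lipschitz-continuous block gradient is the quadratic majorizer, whose minimizer over a nonnegativity constraint is a projected-gradient-type step. The sketch below is a generic illustration of this BSUM building block (our own example, not the specific bounds constructed in Sections II and III).

```python
import numpy as np

def bsum_block_step(x_block, grad_block, L, lower=0.0):
    """One BSUM-style block update: minimize the locally tight quadratic
    upper bound  f(x0) + grad^T (x - x0) + (L/2)||x - x0||^2  over the
    block's feasible set (here a simple nonnegativity constraint), which
    reduces to a projected gradient step with step size 1/L."""
    return np.maximum(lower, x_block - grad_block / L)
```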

The contribution of this paper is threefold:

  • First, two cyclic BSUM algorithms are proposed for the SNMF problem, including an entry-wise BSUM algorithm and a vector-wise BSUM algorithm. The key to both algorithms is to find proper upper bounds for the objective functions of the subproblems involved.

  • Then, we study the convergence of the permutation-based randomized BSUM (PR-BSUM) algorithm. For the first time, we prove that this algorithm monotonically converges to the set of stationary solutions. As a result, this paper settles the convergence issue of the exact BCD algorithm[12] under a permuted block selection rule.

  • Lastly, we propose parallel BSUM algorithms to address the SNMF problem. By distributing the computational load to multiple cores of a high-performance processor, the parallel BSUM algorithms can efficiently handle large-scale SNMF problems.

The remainder of this paper is organized as follows. We present two cyclic inexact BCD methods, termed sBSUM and vBSUM, for the SNMF problem in Section II. In Section III, we study the permutation-based randomized BSUM algorithm in a general framework and its application to the SNMF problem. In Section IV, we discuss how to implement the sBSUM and vBSUM algorithms in parallel on a multi-core/processor architecture. Finally, Section V presents some simulation results, followed by the conclusion drawn in Section VI.

Notations: Throughout this paper, we use uppercase bold letters for matrices, lowercase bold letters for column vectors, and regular letters for scalars. For a matrix $\mathbf{X}$, $\mathbf{X}^T$ and $\mathbf{X}^{\dagger}$ denote the transpose and pseudo-inverse of $\mathbf{X}$, respectively, and $\mathbf{X} \geq \mathbf{0}$ means entry-wise inequality. $\|\mathbf{X}\|_F$ and $\mathrm{Tr}(\mathbf{X})$ denote the Frobenius norm and trace of $\mathbf{X}$, respectively. We use $x_{ij}$ (or $[\mathbf{X}]_{ij}$), $\mathbf{x}_{i,:}$, and $\mathbf{x}_{:,j}$ to refer to the $(i,j)$-th entry, the $i$-th row, and the $j$-th column of $\mathbf{X}$, respectively, while $\mathbf{X}_{i:j,:}$ denotes the submatrix of $\mathbf{X}$ consisting of the elements from the $i$-th row through the $j$-th row. $\mathbf{I}$ denotes the identity matrix, whose size will be clear from the context. We define the element-wise operator $[x]^{+} \triangleq \max(x, 0)$. For a function $f$, $f'$ (or $\nabla f$) denotes its derivative (or gradient) with respect to the variable.

II. Cyclic BSUM Algorithms for SNMF

This paper aims to provide algorithmic designs, from a broad viewpoint, for the following generalized formulation of the SNMF problem:

(2)

where and are respectively -by- and by matrices, is a scalar, and is a matrix polynomial of up to fourth order. Problem (2) is in essence a class of multivariate quartic polynomial optimization problems with nonnegativity constraints. Besides the basic SNMF problem, this formulation also arises, for example, in the following settings:

  • when is a similarity matrix and the regularizer is chosen appropriately, problem (2) is a regularized SNMF problem which can introduce some degree of sparsity to $\mathbf{X}$ or to the rows of $\mathbf{X}$ [2].

  • when is the covariance matrix of sensor observations of some distributed sources in a sensor network and , problem (2) models an informative-sensor identification problem with sparsity-awareness and nonnegative source signal signature basis[14].

It is not difficult to see that problem (2) is at least as hard as the basic SNMF problem when the matrix polynomial is up to fourth order, and the main difficulty lies in the fourth-order matrix polynomial. Hence, we focus on the basic SNMF problem throughout the rest of the paper and develop various BSUM-type algorithms for it. Without loss of generality, we assume that the matrix $\mathbf{A}$ is symmetric². We stress that all the techniques developed below can be easily generalized to the regularized SNMF problem (2).

²When $\mathbf{A}$ is not symmetric, the corresponding problem is equivalent to (2) with $\mathbf{A}$ therein replaced by its symmetric part $(\mathbf{A}+\mathbf{A}^T)/2$.

II-A. Scalar-wise BSUM algorithm

The basic SNMF problem is given by

$$\min_{\mathbf{X} \geq \mathbf{0}} \ f(\mathbf{X}) := \big\| \mathbf{A} - \mathbf{X}\mathbf{X}^T \big\|_F^2 \qquad\qquad (3)$$

It is readily seen that, when all the entries of $\mathbf{X}$ but one are fixed, problem (3) reduces to minimizing a univariate quartic polynomial (UQP), which can be easily handled. Based on this fact, exact BCD algorithms, which cyclically update each entry of $\mathbf{X}$ by solving a UQP problem, have been developed for the SNMF problem[12, 14]. However, since the solution to the UQP problem may not be unique, the existing convergence theory of BCD does not apply to the exact BCD algorithms developed for the SNMF problem, though much effort has been made to resolve the convergence issue in [12, 14]. In the following, we develop a BSUM-based algorithm for the SNMF problem, where each time we update one entry of $\mathbf{X}$ by minimizing a locally tight upper bound of $f(\mathbf{X})$; we therefore call it the scalar-BSUM (sBSUM) algorithm.
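For intuition, the snippet below spells out the entry-wise subproblem: restricting the objective to a single entry yields a degree-4 polynomial whose stationary points are roots of a cubic. It mimics the exact BCD update (pick the best nonnegative candidate), not the sBSUM upper-bound update derived next; the coefficient extraction by polynomial interpolation is purely illustrative, since closed-form coefficients (as in (6)-(9)) are far cheaper.

```python
import numpy as np

def entrywise_quartic_coeffs(A, X, i, j):
    """Coefficients c[0..4] of q(t) = ||A - (X + t*E_ij)(X + t*E_ij)^T||_F^2,
    recovered by exact interpolation at 5 sample points (illustration only)."""
    ts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    vals = []
    for t in ts:
        Xt = X.copy()
        Xt[i, j] += t
        R = A - Xt @ Xt.T
        vals.append(np.sum(R * R))
    # q is a degree-4 polynomial in t, so 5 samples determine it exactly.
    return np.polynomial.polynomial.polyfit(ts, np.array(vals), 4)

def exact_bcd_entry_update(A, X, i, j):
    """Exact coordinate minimization over t >= -X[i,j] (keeps X[i,j]+t >= 0)."""
    c = entrywise_quartic_coeffs(A, X, i, j)
    dq = np.polynomial.polynomial.polyder(c)            # cubic derivative
    cands = [r.real for r in np.polynomial.polynomial.polyroots(dq)
             if abs(r.imag) < 1e-10 and X[i, j] + r.real >= 0.0]
    cands.append(-X[i, j])                              # boundary candidate
    q = lambda t: np.polynomial.polynomial.polyval(t, c)
    t_best = min(cands, key=q)                          # best feasible candidate
    X[i, j] += t_best                                   # update X in place
    return X
```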

II-A1. Algorithm design and complexity analysis

In the sBSUM algorithm, each time we optimize the -th entry of while fixing the others. For notational convenience, let denote the -th entry of to be optimized. Given the current , denoted as , we can write the new after updating the -th entry as

where is a matrix with in the -th entry and elsewhere. Accordingly, we can express the objective function as follows

(4)

where

(5)

with the tuple given by

(6)
(7)
(8)
(9)

The derivation of is relegated to Appendix A. To get an upper bound of , let us define

where . Clearly, is a locally tight upper bound of at . Using this upper bound function, the sBSUM algorithm updates (i.e., the -th entry of ) by solving

(10)

The following lemma states that is convex and moreover problem (10) has a unique solution.

Lemma II.1

is a convex function and a locally tight upper bound of . Moreover, problem (10) has a unique solution.

{proof}

Clearly, is a locally tight upper bound of since and , . Hence, it suffices to show that is a convex function. By noting that

we infer that is a convex function. Particularly, is strictly convex when . Thus, problem (10) has a unique solution when . For the case when , we have

where we have used the identity in the second equality. Equating the derivative to zero and taking the nonnegative constraint into consideration, we obtain the unique solution to problem (10) as . This completes the proof.

As a result of Lemma II.1, the unique solution to problem (10) can be found by checking the critical point of (i.e., the point where ). That is, if the critical point of is nonnegative, then it is the unique optimal solution to problem (10); otherwise the optimal solution is . Hence, the key to solving problem (10) is to find the root of the cubic equation whose coefficients are specified by the tuple . Fortunately, we can easily solve the cubic equation and obtain the unique solution to problem (10) in closed form in the following lemma.

Lemma II.2

Let be given in Eqs. (6-9) and define

The unique solution to problem (10) can be expressed as

(11)

where is given by

(12)
{proof}

As shown in Lemma II.1 and using the fact , it is trivial to obtain the unique solution to problem (10) when . Thus, it suffices to consider the case when . The equation is equivalent to

where we have used the identity in the second equality, and the fact and the definitions of in the last equality. Clearly, we have when . Thus, it follows from Cardano’s method [15] that the equation has a unique solution as follows

Furthermore, accounting for the nonnegativity constraint, we can derive the closed-form solution for the case when as shown in (11). This completes the proof.
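For completeness, here is a generic sketch of the Cardano step underlying Lemma II.2: the unique real root of a depressed cubic $t^3 + pt + q = 0$ (assuming $4p^3 + 27q^2 > 0$, so exactly one real root exists), followed by the projection onto the nonnegative orthant used in (11). The coefficients $p$ and $q$ stand in for the quantities defined in Eqs. (6)-(9) and (12); they are our placeholders, not the paper's exact notation.

```python
import numpy as np

def unique_real_root_depressed_cubic(p, q):
    """Cardano's formula for t^3 + p t + q = 0 when 4p^3 + 27q^2 > 0,
    i.e., the cubic has exactly one real root."""
    disc = (q / 2.0) ** 2 + (p / 3.0) ** 3
    assert disc > 0, "assumes a single real root"
    s = np.sqrt(disc)
    # np.cbrt returns the real cube root, also for negative arguments.
    return np.cbrt(-q / 2.0 + s) + np.cbrt(-q / 2.0 - s)

def nonneg_cubic_minimizer(p, q):
    """Projection of the unique critical point onto the nonnegative orthant,
    mirroring the structure of the sBSUM update (11)."""
    return max(0.0, unique_real_root_depressed_cubic(p, q))
```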

  • initialize , calculate and ,

  • repeat

  • for each $i$ and $j$

  • solve

  • end

  • until some termination criterion is met

TABLE I: Algorithm 1: sBSUM algorithm for SNMF

To summarize, the basic sBSUM algorithm proceeds as follows. Each time we pick one entry of $\mathbf{X}$ to be optimized: we first calculate the coefficient tuple and then update the entry according to (11). It is readily seen that computing the quantities in (8) and (9) are the most computationally costly steps of the sBSUM algorithm. In order to save memory and computational overhead, we can use the following expression

(13)

to compute as the matrix has a smaller size than the matrix . Moreover, we have

(14)

where is a matrix with in the -th entry and elsewhere. Note that is a null matrix with its -th column being while is a null matrix with its -th row being . Hence, we only need to update the matrix ’s -th row, -th column, and -th entry once the -th entry of is updated, which can be done recursively.

Computing in (8) requires the knowledge of . Similarly, we have

where is a matrix with in the -th entry and elsewhere. It follows that

(15)

Hence, we only need to recursively update once the -th entry of is updated.

Using (13) for computing the initial quantities, and (14) and (15) for the recursive updates, we formally present the proposed sBSUM algorithm in TABLE I, where the function in Step 6 finds the unique real solution to a cubic equation whose coefficients are specified by the tuple, followed by a projection onto the nonnegative orthant (i.e., Eq. (11) for Step 6); Steps 7-10 are recursive updates that reduce the computational burden. According to the table and the previous analysis, we only need to store the matrix , , , and the matrix ’s diagonal entries in the sBSUM algorithm, which require , , and space in memory, respectively. Here, denotes the number of non-zero entries of , which is generally in the dense matrix case while in the sparse matrix case. Hence, the sBSUM algorithm requires space in memory. In addition, it is seen that the most costly step of the sBSUM is computing , that is, this computation dominates the per-iteration computational complexity of the sBSUM algorithm (we refer to updating all entries of $\mathbf{X}$ once as one iteration). Thus, it can be shown that the per-iteration computational complexity of the sBSUM algorithm is in the sparse case and in the dense case.
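The recursive updates in (14) and (15) amount to low-rank corrections of cached matrix products after a single entry of $\mathbf{X}$ changes. The sketch below illustrates this bookkeeping for the caches $\mathbf{G}=\mathbf{X}^T\mathbf{X}$ and $\mathbf{W}=\mathbf{A}\mathbf{X}$; these cache names and the dense-matrix indexing are our assumptions for illustration, whereas Algorithm 1 maintains the corresponding quantities in Steps 7-10 (and would use sparse indexing for a sparse $\mathbf{A}$).

```python
import numpy as np

def update_entry_with_caches(A, X, G, W, i, j, new_value):
    """Set X[i, j] = new_value and refresh the cached products
    G = X^T X (k x k) and W = A X (n x k) with O(n + k) work
    instead of recomputing them from scratch."""
    delta = new_value - X[i, j]
    x_old_row = X[i, :].copy()          # i-th row of X before the update
    # G = X^T X: only its j-th row, j-th column and (j, j) entry change.
    G[j, :] += delta * x_old_row
    G[:, j] += delta * x_old_row
    G[j, j] += delta ** 2
    # W = A X: only its j-th column changes.
    W[:, j] += delta * A[:, i]
    X[i, j] = new_value
    return X, G, W
```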

II-A2. Comparison of exact (cyclic) BCD[12] and (cyclic) sBSUM

The exact BCD algorithm and the sBSUM have similar iterations (cf. Algorithm 4 in [12] and our Algorithm 1) and the same complexity. The main difference between them lies in Step 6: if Step 6 is replaced by the exact entry-wise subproblem, with the ‘solve’ function now finding the best nonnegative root (of the corresponding cubic equation) among up to three real roots, the sBSUM reduces to the exact BCD algorithm. Thus, when the two subproblems coincide for all iterations, the sBSUM is exactly the same as the exact BCD algorithm. Clearly, a notable advantage of the sBSUM is that a closed-form unique real root is guaranteed in Step 6, whereas this is not the case for the BCD algorithm (Algorithm 1 in [12] is invoked to find the best root). Thanks to this uniqueness of the subproblems’ solutions, the sBSUM is guaranteed to converge to stationary solutions; see the argument in [18]. On the other hand, convergence is not guaranteed for a multi-block cyclic BCD algorithm if its subproblems possibly have multiple solutions; see the well-known example due to Powell [26]. The latter result implies that the cyclic BCD algorithm, although having similar iterations and the same complexity as the sBSUM, may not be a good choice for the SNMF problem.

It is worth mentioning that the work [12] attempted to provide a convergence proof for their cyclic-BCD-based SNMF algorithm. However, there is a gap in their proof, as explained next. Let be the sequence generated by the BCD algorithm and be one of its subsequences that converges to . In the proof of Theorem 1 in [12], the authors start by arguing that there exist and such that and

(16)

for some , assuming for the sake of contradiction that is not a stationary solution, and finally obtain

(17)

where is a sufficiently large integer. Eq. (17) implies a sufficient decrease in the objective function values, resulting in an unbounded accumulated decrease, i.e., a contradiction (because the objective is lower bounded), implying that is a stationary solution. However, there is a gap in their derivation: in fact, the scalar in (16) should be related to the iteration index. As a result, the scalar in Eq. (17) should be an iteration-dependent one, say . Then, we do not necessarily arrive at a contradiction (to the lower-boundedness of the objective) since the accumulated decrease could be finite.

To summarize, since the subproblems of the cyclic-BCD-based SNMF algorithm[12] may have multiple solutions, there is a lack of theoretical guarantee for the convergence of the cyclic-BCD-based SNMF algorithm in [12]. In contrast, the proposed sBSUM is guaranteed to reach stationary solutions of the SNMF problem [18]. Moreover, it is interesting to note that the proposed sBSUM (as well as the other BSUM-based SNMF algorithms to be discussed later) cannot get stuck at a local maximum, as shown in Appendix B.

II-B. Vector-wise BSUM

Generally speaking, for multi-block BCD/BSUM algorithms, the smaller the number of block variables, the faster the convergence that the BCD/BSUM algorithm achieves. In the following, we develop a new algorithm, termed vector-BSUM (vBSUM), in which each time we update a single row of $\mathbf{X}$ by minimizing an upper bound of $f(\mathbf{X})$.

We first study the subproblem of the vBSUM which optimizes the -th row of (denoted as for convenience) given the current denoted as . To do so, let us define , , , and . Since we have

(18)

the subproblem for updating the -th row of is given by

(19)

Define , , and . Then we can rewrite problem (19) equivalently (equivalent in the sense that they have the same optimal solution) as follows

(20)

Clearly, problem (20) is not necessarily convex. But even if it is convex³, it is still unlikely that a closed-form optimal solution to problem (20) can be obtained, due to the presence of the nonnegativity constraints and the coupling of the entries introduced by the quadratic term. In what follows, we solve (20) by using the BSUM algorithm.

³When the matrix is a similarity matrix generated by the widely used Gaussian kernel, e.g., $a_{pq} = \exp(-\|\mathbf{y}_p - \mathbf{y}_q\|^2/(2\sigma^2))$ (where $\mathbf{y}_p$ and $\mathbf{y}_q$ are two data points, and $\sigma$ is a parameter), or a normalized similarity matrix, we always have for all . On the other hand, because is nonnegative, the matrix is almost always a positive matrix whose eigenvalues are all in the interval given by [23, Chap. 8]. Moreover, when the number of data points is very large, it is very likely that and thus the matrix is positive semidefinite, implying that problem (20) is convex with very high probability in practice.

Thanks to the Lipschitz continuity of the gradient of the quadratic part of the objective function, we can find an upper bound for the objective function of problem (20). Specifically, we have the following bound for the quadratic term

where is the Lipschitz constant of the gradient of the quadratic term, which can simply be chosen as the maximum eigenvalue of or (see footnote 3 for the reason). By replacing the quadratic term in (20) with the above upper bound evaluated at the current iterate (i.e., letting in the above inequality) and rearranging the terms, we obtain

(21)

where . Observing that should be maximized for any fixed , we show in Lemma II.3 that problem (21) admits a unique closed-form solution.

Lemma II.3

The optimal solution to problem (21) can be expressed as follows

(22)

where

with
{proof}

It is easily seen that when an entry of is non-positive, the corresponding component of the optimal should equal zero. As a result, problem (21) is equivalent to the following

(23)

Trivially, we have the optimal if . Hence, we only need to consider the case when . First, it is readily seen that problem (23) is further equivalent to

(24)

Second, note that (for any fixed ) should be maximized subject to and . By the Cauchy-Schwarz inequality, the optimal takes the form . Hence, problem (24) reduces to the following convex problem

(25)

By the first-order optimality condition, we know that the optimal is the unique real root of the cubic equation , which can be obtained in closed form as shown above. This completes the proof.
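To illustrate the two-stage structure of Lemma II.3 (a direction fixed by Cauchy-Schwarz, then a scalar cubic for the norm), the sketch below solves a row subproblem of the assumed form $\min_{\mathbf{z}\geq \mathbf{0}} \|\mathbf{z}\|^4 + \beta\|\mathbf{z}\|^2 - 2\mathbf{b}^T\mathbf{z}$ with $\beta \geq 0$, which is one way the upper bound in (21) can be arranged after dropping constants; the symbols $\beta$ and $\mathbf{b}$ are our placeholders rather than the paper's notation.

```python
import numpy as np

def row_update(b, beta):
    """min_{z >= 0} ||z||^4 + beta*||z||^2 - 2*b^T z  (assumed form, beta >= 0).
    By the Cauchy-Schwarz argument, the optimal z is aligned with [b]^+, and
    its norm r >= 0 is a nonnegative root of 4r^3 + 2*beta*r - 2*||[b]^+||."""
    bp = np.maximum(b, 0.0)                 # non-positive entries force zeros
    nb = np.linalg.norm(bp)
    if nb == 0.0:
        return np.zeros_like(b)
    roots = np.roots([4.0, 0.0, 2.0 * beta, -2.0 * nb])
    cands = [t.real for t in roots if abs(t.imag) < 1e-10 and t.real >= 0.0]
    g = lambda r: r ** 4 + beta * r ** 2 - 2.0 * nb * r
    r = min(cands, key=g)                   # best nonnegative stationary norm
    return (r / nb) * bp
```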

  • initialize and calculate

  • repeat

  • for each

  • for   // repeat Steps 6-7 a fixed number of times

  • update according to (22)

  • end

  • end

  • until some termination criterion is met

TABLE II: Algorithm 2: vBSUM algorithm for SNMF

Our proposed vBSUM algorithm, summarized in Table II, requires an approximate solution of problem (20) at each iteration. Such a solution is obtained by iteratively solving (21) a fixed number of times; see Steps 5-8. We note that, regardless of the number of inner iterations performed, the vBSUM is guaranteed to converge to stationary solutions of the basic SNMF problem, by applying the same analysis as that of the BSUM algorithm [18, Theorem 2]. Furthermore, from Table II, one can see that the most costly step in the vBSUM lies in Step 4 for computing the required quantities in each iteration, which requires operations per iteration in the dense case and operations per iteration in the sparse case. Hence, the per-iteration computational complexity of the vBSUM is the same as that of the sBSUM. In addition, it is seen that we only need to store , , , and in the algorithm, which require , , and space in memory, respectively. Hence, the vBSUM requires less memory in practice than the sBSUM algorithm, though both have the same order of memory complexity.

Remark II.1

Certainly, we can use the element-wise BSUM algorithm (i.e., view each entry of a row as one block) to update each row of $\mathbf{X}$ and obtain an alternative vBSUM algorithm, which is a simple variant of the sBSUM algorithm obtained by updating the entries of each row multiple times in each iteration. For convenience, we refer to this alternative as the v-sBSUM algorithm and assume the same number of repeats as in the vBSUM algorithm for updating the rows of $\mathbf{X}$. Note that the v-sBSUM algorithm requires operations in each iteration due to the frequent computation of (cf. Step 5 of Algorithm 1), while the vBSUM algorithm requires operations, where the second term is due to Step 6 of Algorithm 2. Moreover, the v-sBSUM algorithm requires many times as many root-finding operations as the vBSUM algorithm does. Therefore, we prefer the proposed vBSUM algorithm over this simple variant of the sBSUM algorithm for better efficiency.

Remark II.2

The vBSUM algorithm is a row-wise BSUM algorithm. Similarly, we can develop a column-wise BSUM algorithm for the basic SNMF problem. However, unlike the subproblem (20) in the vBSUM algorithm, the subproblem of updating each column of $\mathbf{X}$ has no good structure, and the corresponding Lipschitz constant is not easily available when the number of data points is very large. More importantly, the vBSUM algorithm is more amenable to both randomized and parallel implementation, as will become clear in the following sections.

III. Permutation-Based Randomized BSUM Algorithm and Its Application in SNMF

The sBSUM and vBSUM algorithms both fall into the category of deterministic cyclic BSUM algorithms. As a variant of the cyclic BSUM algorithm, the randomized BCD/BSUM algorithm has been proposed[20], where each time one block variable is chosen to be optimized with a certain probability. Unlike the basic randomized BCD/BSUM algorithm, the permutation-based randomized BCD/BSUM algorithm (termed PR-BCD/BSUM) updates the block variables according to a random permutation rule, in which the blocks are randomly selected without replacement. However, the convergence of the PR-BCD/BSUM algorithm has not been well understood, as pointed out in a recent survey [16]. In particular, it is not known, at least in theory, whether random permutation provides any added benefit compared with the more classical cyclic block selection rules. In this section, we first study the convergence of the PR-BSUM algorithm (including PR-BCD as a special case) in a general framework and then propose the randomized sBSUM and vBSUM algorithms.

III-A. The PR-BSUM algorithm and the convergence results

We start with a general description of the PR-BSUM algorithm. Consider the following multi-block minimization problem

(26)

where each is a closed convex set. Define and . Let satisfy the following assumption.

Assumption III.1
(27a)
(27b)
(27c)
(27d)

where is the -th block component of (similarly for and ), and represent the block components of with their indices less than or larger than , respectively, denotes the directional derivative of with respect to along the direction , and denotes the directional derivative of with respect to along the direction . The assumption (27c) guarantees that the first-order behavior of is the same as that of locally[18]; hence, it is referred to as the gradient consistency assumption.

The PR-BSUM algorithm is described in Table III, where the ‘randperm’ function in Step 3 generates an index set containing a random permutation of the block indices and specifies the order in which the block variables are updated. The PR selection rule takes advantage of both the randomized rule and the cyclic rule: it guarantees timely updates of all block variables while avoiding getting stuck with a bad update sequence. In the following, we prove that, with probability one (w.p.1), the sequence generated by the PR-BSUM algorithm converges to the set of stationary solutions of problem (26).
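As a concrete illustration of the permuted block selection rule described above, the following generic driver draws a fresh random permutation of the block indices at every pass and applies a user-supplied per-block BSUM update; the function and argument names are illustrative assumptions and do not reproduce Table III verbatim.

```python
import numpy as np

def pr_bsum(x_blocks, block_update, num_passes=100, seed=0):
    """Generic permutation-based randomized BSUM loop.

    x_blocks     : list of per-block variables (e.g., rows of X).
    block_update : callable (x_blocks, i) -> new value of block i, obtained by
                   minimizing an upper bound of the objective in block i.
    """
    rng = np.random.default_rng(seed)
    n_blocks = len(x_blocks)
    for _ in range(num_passes):
        # 'randperm': sample a permutation of the block indices each pass,
        # i.e., blocks are selected randomly without replacement.
        for i in rng.permutation(n_blocks):
            x_blocks[i] = block_update(x_blocks, i)
    return x_blocks
```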

  • initialize and set

  • repeat

  • for each