Nonnegative Factorization and The Maximum Edge Biclique Problem


Nicolas Gillis    François Glineur Center for Operations Research and Econometrics, Université catholique de Louvain, Voie du Roman Pays, 34, B-1348 Louvain-La-Neuve, Belgium ; nicolas.gillis@uclouvain.be and francois.glineur@uclouvain.be. The first author is a research fellow of the Fonds de la Recherche Scientifique (F.R.S.-FNRS). This text presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister’s Office, Science Policy Programming. The scientific responsibility is assumed by the authors.
October 2008
Abstract

Nonnegative Matrix Factorization (NMF) is a data analysis technique which allows compression and interpretation of nonnegative data. NMF became widely studied after the publication of the seminal paper by Lee and Seung (Learning the Parts of Objects by Nonnegative Matrix Factorization, Nature, 1999, vol. 401, pp. 788–791), which introduced an algorithm based on Multiplicative Updates (MU). More recently, another class of methods called Hierarchical Alternating Least Squares (HALS) was introduced that seems to be much more efficient in practice.

In this paper, we consider the problem of approximating a not necessarily nonnegative matrix with the product of two nonnegative matrices, which we refer to as Nonnegative Factorization (NF); this is the subproblem that HALS methods implicitly try to solve at each iteration. We prove that NF is NP-hard for any fixed factorization rank, using a reduction from the maximum edge biclique problem.

We also generalize the multiplicative updates to NF, which allows us to shed some light on the differences between the MU and HALS algorithms for NMF and give an explanation for the better performance of HALS. Finally, we link stationary points of NF with feasible solutions of the biclique problem to obtain a new type of biclique finding algorithm (based on MU) whose iterations have an algorithmic complexity proportional to the number of edges in the graph, and show that it performs better than comparable existing methods.

Keywords: Nonnegative Matrix Factorization, Nonnegative Factorization, Complexity, Multiplicative Updates, Hierarchical Alternating Least Squares, Maximum Edge Biclique.

1 Introduction

(Approximate) Nonnegative Matrix Factorization (NMF) is the problem of approximating a given nonnegative matrix by the product of two low-rank nonnegative matrices: given a matrix $M \in \mathbb{R}_+^{m \times n}$ and an integer $r < \min(m, n)$, one has to compute two low-rank nonnegative matrices $V \in \mathbb{R}_+^{m \times r}$ and $W \in \mathbb{R}_+^{r \times n}$ such that

$$M \approx VW. \qquad (1.1)$$

This problem was first introduced in 1994 by Paatero and Tapper [25], and more recently received considerable interest after the publication of two papers by Lee and Seung [21, 22]. It is now well established that NMF is useful in the framework of compression and interpretation of nonnegative data; it has for example been applied to the analysis of image databases, text mining, interpretation of spectra, computational biology and many other applications (see e.g. [2, 9, 11] and references therein).

How can one interpret the outcome of an NMF? Assume each column $M_{:j}$ of matrix $M$ represents an element of a data set: expression (1.1) can be equivalently written as

$$M_{:j} \approx \sum_{k=1}^{r} V_{:k} W_{kj}, \quad 1 \leq j \leq n, \qquad (1.2)$$

where each element $M_{:j}$ is decomposed into a nonnegative linear combination (with weights $W_{kj}$) of nonnegative basis elements ($V_{:k}$, the columns of $V$). Nonnegativity of $V$ allows interpretation of the basis elements in the same way as the original nonnegative elements in $M$, which is crucial in applications where the nonnegativity property is a requirement (e.g. where elements are images described by pixel intensities or texts represented by vectors of word counts). Moreover, nonnegativity of the weight matrix $W$ corresponds to an essentially additive reconstruction which leads to a part-based representation: basis elements will represent similar parts of the columns of $M$. Sparsity is another important consideration: finding sparse factors improves compression and leads to a better part-based representation of the data [19].
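As a small numerical illustration of decomposition (1.2), the toy data below (not taken from the paper) recombine one column of $M$ from the basis columns of $V$ and the weights in the corresponding column of $W$:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 2))     # 2 nonnegative basis elements (columns of V)
W = rng.random((2, 4))     # nonnegative weights, one column per data element
M = V @ W                  # an exactly factorizable nonnegative data matrix

# Column j of M is the nonnegative combination sum_k V[:, k] * W[k, j].
j = 1
recombined = sum(V[:, k] * W[k, j] for k in range(V.shape[1]))
assert np.allclose(recombined, M[:, j])
```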

We start this paper with a brief introduction to the NMF problem: Section 2 recalls existing complexity results, introduces two well-known classes of methods, multiplicative updates [22] and hierarchical alternating least squares [6], and proposes a simple modification to guarantee their convergence. The central problem studied in this paper, Nonnegative Factorization (NF), is a generalization of NMF where the matrix to be approximated with the product of two low-rank nonnegative matrices is not necessarily nonnegative. NF is introduced in Section 3, where it is shown to be NP-hard for any given factorization rank, using a reduction from the problem of finding a maximum edge biclique. Stationary points of the NF problem used in that reduction are also studied. This section ends with a generalization of the NMF multiplicative update rules to the NF problem and a proof of their convergence. This allows us to shed new light on the standard multiplicative updates for NMF: a new interpretation is given in Section 4, which explains the relatively poor performance of these methods and hints at possible improvements. Finally, Section 5 introduces a new type of biclique finding algorithm that relies on the application of multiplicative updates to the equivalent NF problem considered earlier. This algorithm only requires a number of operations proportional to the number of edges of the graph per iteration, and is shown to perform well when compared to existing methods.

2 Nonnegative Matrix Factorization (NMF)

Given a matrix $M \in \mathbb{R}_+^{m \times n}$ and an integer $r < \min(m, n)$, the NMF optimization problem using the Frobenius norm is defined as

$$\min_{V \in \mathbb{R}^{m \times r},\ W \in \mathbb{R}^{r \times n}} \|M - VW\|_F^2 \quad \text{such that } V \geq 0 \text{ and } W \geq 0. \qquad \text{(NMF)}$$

$\mathbb{R}^{m \times n}$ denotes the set of real matrices of dimension $m \times n$; $\mathbb{R}_+^{m \times n}$ the set of nonnegative matrices, i.e. with every entry nonnegative, and $0$ the zero matrix of appropriate dimensions.
A wide range of algorithms have been proposed to find approximate solutions for this problem (see e.g. [2, 4, 6, 7, 10, 11, 13, 24]). Most of them use the fact that although problem (NMF) is not convex, its objective function is convex separately in each of the two factors $V$ and $W$ (which implies that finding the optimal factor $W$ corresponding to a fixed factor $V$ reduces to a convex optimization problem, and vice-versa), and try to find good approximate solutions by using alternating minimization schemes. For instance, Nonnegative Least Squares (NNLS) algorithms can be used to minimize (exactly) the cost function alternately over factors $V$ and $W$ (see e.g. [5, 20]).
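The alternating NNLS scheme described above can be sketched as follows; the dimensions, data and iteration count are illustrative, and SciPy's `nnls` solver stands in for the NNLS algorithms of [5, 20]:

```python
import numpy as np
from scipy.optimize import nnls

def nnls_matrix(A, B):
    """Solve min_{X >= 0} ||A X - B||_F via one NNLS problem per column of B."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

rng = np.random.default_rng(1)
M = rng.random((8, 6))                       # illustrative nonnegative data
r = 3
V, W = rng.random((8, r)), rng.random((r, 6))
err0 = np.linalg.norm(M - V @ W)
for _ in range(20):
    W = nnls_matrix(V, M)                    # exact minimization over W, V fixed
    V = nnls_matrix(W.T, M.T).T              # exact minimization over V (transposed problem)
err = np.linalg.norm(M - V @ W)
assert err <= err0                           # each exact block step is nonincreasing
```

Since each block update is an exact minimization of a convex subproblem, the objective can never increase from one iteration to the next.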

Actually, there exist other partitions of the variables that preserve convexity of the alternating minimization subproblems: since the cost function can be rewritten as $\|M - \sum_{k=1}^{r} V_{:k}W_{k:}\|_F^2$, it is clearly convex as long as the variables do not simultaneously include an element of a column $V_{:k}$ of $V$ and an element of the corresponding row $W_{k:}$ of $W$ (i.e. $V_{ik}$ and $W_{kj}$ for the same index $k$). Therefore, given a subset of indexes $K \subseteq \{1, \dots, r\}$, (NMF) is clearly convex for both the following subset of variables

$$\{V_{:k} \mid k \in K\} \cup \{W_{k:} \mid k \notin K\}$$

and its complement

$$\{V_{:k} \mid k \notin K\} \cup \{W_{k:} \mid k \in K\}.$$

However, convexity is lost as soon as one column $V_{:k}$ of $V$ and the corresponding row $W_{k:}$ of $W$ are optimized simultaneously, so that the corresponding minimization subproblem can no longer be efficiently solved up to global optimality.

2.1 Complexity

Vavasis studies in [30] the algorithmic complexity of the NMF optimization problem; more specifically, he proves that the following problem, called Exact Nonnegative Matrix Factorization (closely related to the nonnegative rank of matrix $M$, which is the minimum value of $r$ for which there exist $V \geq 0$ and $W \geq 0$ such that $M = VW$, see [1]), is NP-hard:

(Exact NMF) Given a nonnegative matrix $M$ of rank $r$, find, if possible, two nonnegative factors $V$ and $W$ of rank $r$ such that $M = VW$.

The NMF optimization problem is therefore also NP-hard, since when the factorization rank $r$ is equal to the rank of the matrix $M$, any optimal solution to the NMF optimization problem can be used to answer the Exact NMF problem (the answer being positive if and only if the optimal objective value of the NMF optimization problem is equal to zero).

The NP-hardness proof for Exact NMF relies on its equivalence with an NP-hard problem in polyhedral combinatorics, and requires both the dimensions of matrix $M$ and its rank $r$ to increase to obtain NP-hardness. In contrast, in the special cases when the rank $r$ is equal to $1$ or $2$, the Exact NMF problem can always be answered in the affirmative:

  1. When $r = 1$, it is obvious that for any nonnegative rank-one matrix $M$ there are nonnegative factors $v \geq 0$ and $w \geq 0$ such that $M = vw^T$.

    Moreover, the NMF optimization problem with $r = 1$ can be solved in polynomial time: the Perron-Frobenius theorem implies that the dominant left and right singular vectors of a nonnegative matrix $M$ are nonnegative, while the Eckart-Young theorem states that the outer product of these dominant singular vectors is the best rank-one approximation of $M$; these vectors can be computed in polynomial time using for example the singular value decomposition [15].

  2. When nonnegative matrix $M$ has rank $2$, Thomas has shown [29] that exact NMF is also always possible (see also [8]). The fact that any rank-two nonnegative matrix can be exactly factorized as the product of two rank-two nonnegative matrices can be explained geometrically as follows: viewing columns of $M$ as points in $\mathbb{R}^m$, the fact that $M$ has rank $2$ implies that the set of its columns belongs to a two-dimensional subspace. Furthermore, because these columns are nonnegative, they belong to a two-dimensional pointed cone, see Figure 1. Since such a cone is always spanned by two extreme vectors, this implies that all columns of $M$ can be represented exactly as nonnegative linear combinations of two nonnegative vectors, and therefore the exact NMF is always possible (the reason why this property no longer holds for higher values of the rank is that a $k$-dimensional cone is not necessarily spanned by a set of $k$ vectors when $k > 2$).

    Figure 1: Rank-two exact NMF.

    Moreover, these two extreme columns can easily be computed in polynomial time (using for example the fact that they define an angle of maximum amplitude among all pairs of columns). Hence, when the optimal rank-two approximation of matrix $M$ is nonnegative, the NMF optimization problem with $r = 2$ can be solved in polynomial time. However, this optimal rank-two approximation is not always nonnegative, so that the complexity of the NMF optimization problem in the case $r = 2$ is not known. Furthermore, to the best of our knowledge, the complexity of the exact NMF problem and the NMF optimization problem is still unknown for any fixed rank $r$ greater than $2$.
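The rank-one case of item 1 can be sketched numerically; the matrix below is illustrative, and the sign flip only resolves the sign ambiguity of the SVD so that the Perron-Frobenius pair comes out nonnegative:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((5, 7))                 # illustrative nonnegative matrix

U, s, Vt = np.linalg.svd(M)
u, v = U[:, 0], Vt[0, :]
if u.sum() < 0:                        # SVD sign ambiguity: pick the nonnegative pair
    u, v = -u, -v
v1, w1 = s[0] * u, v                   # nonnegative rank-one NMF factors

assert np.all(v1 >= -1e-12) and np.all(w1 >= -1e-12)
# Eckart-Young: no rank-one matrix approximates M better than the SVD pair.
best = np.linalg.norm(M - np.outer(v1, w1))
a, b = rng.random(5), rng.random(7)    # arbitrary competing rank-one factors
assert best <= np.linalg.norm(M - np.outer(a, b))
```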

2.2 Multiplicative Updates (MU)

In their seminal paper [22], Lee and Seung propose multiplicative update rules that aim at minimizing the Frobenius norm between $M$ and $VW$. To understand the origin of these rules, consider the Karush-Kuhn-Tucker first-order optimality conditions for (NMF):

$$V \geq 0, \quad W \geq 0, \qquad (2.1)$$
$$\nabla_V F \geq 0, \quad \nabla_W F \geq 0, \qquad (2.2)$$
$$V \circ \nabla_V F = 0, \quad W \circ \nabla_W F = 0, \qquad (2.3)$$

where $\circ$ is the Hadamard (component-wise) product between two matrices, $F = \|M - VW\|_F^2$ and

$$\nabla_V F = 2(VWW^T - MW^T), \quad \nabla_W F = 2(V^TVW - V^TM). \qquad (2.4)$$

Injecting (2.4) in (2.3), we obtain

$$V \circ (VWW^T) = V \circ (MW^T), \qquad (2.5)$$
$$W \circ (V^TVW) = W \circ (V^TM). \qquad (2.6)$$

From these equalities, Lee and Seung derive the following simple multiplicative update rules (where $\frac{[\,\cdot\,]}{[\,\cdot\,]}$ is Hadamard (component-wise) division)

$$V \leftarrow V \circ \frac{[MW^T]}{[VWW^T]}, \quad W \leftarrow W \circ \frac{[V^TM]}{[V^TVW]}, \qquad (2.7)$$
for which they are able to prove a monotonicity property:

Theorem 1 ([22]).

The Frobenius norm $\|M - VW\|_F$ is nonincreasing under the multiplicative update rules (2.7).
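A minimal sketch of the updates (2.7) on illustrative random data; the small constant added to the denominators is only a numerical safeguard against division by zero and is not part of the original rules:

```python
import numpy as np

def mu_step(M, V, W, tiny=1e-12):
    """One pass of the Lee-Seung multiplicative updates (2.7)."""
    V = V * (M @ W.T) / (V @ W @ W.T + tiny)
    W = W * (V.T @ M) / (V.T @ V @ W + tiny)
    return V, W

rng = np.random.default_rng(3)
M = rng.random((10, 8))
V, W = rng.random((10, 3)), rng.random((3, 8))
errs = [np.linalg.norm(M - V @ W)]
for _ in range(50):
    V, W = mu_step(M, V, W)
    errs.append(np.linalg.norm(M - V @ W))
# Theorem 1: the approximation error is nonincreasing along the iterates.
assert all(e2 <= e1 + 1e-9 for e1, e2 in zip(errs, errs[1:]))
```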

The algorithm based on the alternated application of these rules is not guaranteed to converge to a first-order stationary point (see e.g. [2], and references therein), but a slight modification proposed in [23] achieves this property (roughly speaking, MU is recast as a variable metric steepest descent method and the step length is modified accordingly). We propose another possibility to overcome this problem by replacing the above updates by the following:

Theorem 2.

For every constant $\epsilon > 0$, $\|M - VW\|_F$ is nonincreasing under

$$V \leftarrow \max\Big(\epsilon,\ V \circ \frac{[MW^T]}{[VWW^T]}\Big), \quad W \leftarrow \max\Big(\epsilon,\ W \circ \frac{[V^TM]}{[V^TVW]}\Big), \qquad (2.8)$$

for any $(V, W) \geq \epsilon$. Moreover, every limit point of this algorithm is a stationary point of the following optimization problem

$$\min_{V \geq \epsilon,\ W \geq \epsilon} \|M - VW\|_F^2. \qquad (2.9)$$
Proof.

See Section 3.3 of this paper, where the more general Theorem 9 is proved. ∎

2.3 Hierarchical Alternating Least Squares (HALS)

Cichocki et al. [6] and independently several other authors [14, 18] have proposed to solve the problem of Nonnegative Matrix Factorization by considering successively each rank-one factor $V_{:k}W_{k:}$ while keeping the rest of the variables fixed, which can be expressed as

$$\|M - VW\|_F^2 = \Big\|M - \sum_{l \neq k} V_{:l}W_{l:} - V_{:k}W_{k:}\Big\|_F^2 = \|R_k - V_{:k}W_{k:}\|_F^2, \qquad (2.10)$$

where the matrix $R_k = M - \sum_{l \neq k} V_{:l}W_{l:}$ is called the residual matrix.

Ideally, one would like to find an optimal rank-one factor according to the Frobenius norm, i.e. solve the following problem

$$\min_{V_{:k} \geq 0,\ W_{k:} \geq 0} \|R_k - V_{:k}W_{k:}\|_F^2, \qquad (2.11)$$

but, instead of solving this problem directly, these authors propose to optimize column $V_{:k}$ and row $W_{k:}$ separately in an alternating scheme, because the optimal solution to these two (convex) subproblems can be easily computed in closed form, see e.g. [16]:

$$V_{:k}^* = \max\Big(0,\ \frac{R_k W_{k:}^T}{\|W_{k:}\|_2^2}\Big), \qquad (2.12)$$
$$W_{k:}^* = \max\Big(0,\ \frac{V_{:k}^T R_k}{\|V_{:k}\|_2^2}\Big). \qquad (2.13)$$

This scheme, which amounts to a block coordinate descent method (for which any cyclic order on the columns of $V$ and the rows of $W$ is admissible), is called Hierarchical Alternating Least Squares (HALS) (in [16, 18] it is called the Rank-one Residue Iteration (RRI) method, and in [14] Alternating NMF (ANMF)), and it has been observed to work remarkably well in practice: it outperforms, in most cases, the other algorithms for NMF [7, 14, 16]. Indeed, it combines a low computational cost per iteration (the same as the multiplicative updates) with relatively fast convergence (significantly faster than the multiplicative updates), see Figure 3 for an example. We will explain later in Section 4 why this algorithm performs much better than the one of Lee and Seung.
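One HALS sweep, i.e. the closed-form updates (2.12)-(2.13) applied cyclically to each rank-one factor, can be sketched as follows on illustrative data; the small denominator guard is a numerical safeguard, not part of the method:

```python
import numpy as np

def hals_sweep(M, V, W, tiny=1e-12):
    """One cyclic pass of the closed-form updates (2.12)-(2.13)."""
    for k in range(V.shape[1]):
        # Residual R_k excludes the k-th rank-one factor.
        R = M - V @ W + np.outer(V[:, k], W[k, :])
        V[:, k] = np.maximum(0, R @ W[k, :]) / (W[k, :] @ W[k, :] + tiny)
        W[k, :] = np.maximum(0, V[:, k] @ R) / (V[:, k] @ V[:, k] + tiny)
    return V, W

rng = np.random.default_rng(4)
M = rng.random((10, 8))
V, W = rng.random((10, 3)), rng.random((3, 8))
e0 = np.linalg.norm(M - V @ W)
for _ in range(30):
    V, W = hals_sweep(M, V, W)
assert np.linalg.norm(M - V @ W) <= e0 + 1e-9
```

Each column and row update is an (essentially exact) minimization of a convex subproblem, so the objective is nonincreasing over sweeps.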

A potential issue with this method is that, in the course of the optimization process, one of the vectors $V_{:k}$ (or $W_{k:}$) and the corresponding rank-one factor $V_{:k}W_{k:}$ may become equal to zero (this happens for example if one of the residuals $R_k$ is nonpositive). This then leads to numerical instabilities (the next update is not well-defined) and a rank-deficient approximation (with a rank lower than $r$). A possible way to overcome this problem is to replace the zero lower bounds on $V_{:k}$ and $W_{k:}$ in (2.12) and (2.13) by a small positive constant $\epsilon$ (as for the MU), and consider the following subproblems

$$\min_{V_{:k} \geq \epsilon,\ W_{k:} \geq \epsilon} \|R_k - V_{:k}W_{k:}\|_F^2, \qquad (2.14)$$

which lead to the modified closed-form update rules:

$$V_{:k}^* = \max\Big(\epsilon,\ \frac{R_k W_{k:}^T}{\|W_{k:}\|_2^2}\Big), \quad W_{k:}^* = \max\Big(\epsilon,\ \frac{V_{:k}^T R_k}{\|V_{:k}\|_2^2}\Big). \qquad (2.15)$$

This idea was already suggested in [6] in order to avoid numerical instabilities. In fact, this variant of the algorithm is now well-defined in all situations because (2.14) guarantees $V_{:k} \geq \epsilon$ and $W_{k:} \geq \epsilon$ at each iteration. Furthermore, one can now easily prove that it converges to a stationary point.

Theorem 3.

For every constant $\epsilon > 0$, the limit points of the block coordinate descent algorithm initialized with positive matrices and applied to the optimization problem (2.9) are stationary points.

Proof.

We use the following result of Powell [27] (see also [3, p.268]): the limit points of the iterates of a block coordinate descent algorithm are stationary points provided that the following two conditions hold:

  • each block of variables is required to belong to a closed convex set,

  • the minimum computed at each iteration for a given block of variables is uniquely attained.

The first condition is clearly satisfied here, since the blocks $V_{:k}$ and $W_{k:}$ are required to belong to sets of the form $\{x \mid x \geq \epsilon\}$, which are closed convex sets. The second condition holds because subproblems (2.14) can be shown to be strictly convex, so that their optimal value is uniquely attained by the solutions provided by rules (2.15). Strict convexity is due to the fact that the objective functions of these problems are sums of quadratic terms, each involving a single variable and having a strictly positive coefficient. ∎

3 Nonnegative Factorization (NF)

Looking back at subproblem (2.11), i.e. approximating the residual $R_k$ with a rank-one term $V_{:k}W_{k:}$, we have seen that the optimal solution separately for both $V_{:k}$ and $W_{k:}$ can be written in closed form. In the previous section, subproblem (2.11) was then solved by a block coordinate descent algorithm.

A question arises: Is it possible to do better? I.e., is it possible to efficiently solve the problem for both vectors simultaneously? In order to answer this question, we introduce the problem of Nonnegative Factorization (this terminology has already been used for the problem of finding a symmetric nonnegative factorization, i.e. one where V=W, but we assign it a different meaning in this paper), which is exactly the same as Nonnegative Matrix Factorization except that the matrix to factorize can be any real matrix, i.e. is not necessarily nonnegative. Given $M \in \mathbb{R}^{m \times n}$ and an integer $r$, the Nonnegative Factorization optimization problem using the Frobenius norm is:

$$\min_{V \in \mathbb{R}^{m \times r},\ W \in \mathbb{R}^{r \times n}} \|M - VW\|_F^2 \quad \text{such that } V \geq 0 \text{ and } W \geq 0. \qquad \text{(NF)}$$

Of course, this problem is a generalization of (NMF) and is NP-hard as well. However, Nonnegative Factorization will be shown below to be NP-hard for any fixed factorization rank (even $r = 1$), which is not the case for (NMF) (cf. Section 2.1). The proof is based on a reduction from the maximum edge biclique problem.

3.1 Complexity

The main result of this section is the NP-hardness result for Nonnegative Factorization for any fixed factorization rank. We first show how the optimization version of the maximum edge biclique problem (MBP) can be formulated as a rank-one Nonnegative Factorization problem (NF-1d). Since the decision version of (MBP) is NP-complete [26], this implies that (NF-1d) is NP-hard. We then prove that (NF) is NP-hard as well using a simple construction.

The Maximum Edge Biclique Problem in Bipartite Graphs

A bipartite graph $G_b$ is a graph whose vertices can be divided into two disjoint sets $V_1$ and $V_2$ such that there is no edge between two vertices in the same set.

A biclique $K_b$ is a complete bipartite graph, i.e. a bipartite graph where all the vertices of the two sets are connected to each other.

Finally, the so-called maximum edge biclique problem in a bipartite graph $G_b = (V_1 \cup V_2, E)$ is the problem of finding a biclique $K_b = (V_1' \cup V_2', E')$ in $G_b$ (i.e. $V_1' \subseteq V_1$ and $V_2' \subseteq V_2$) maximizing the number of edges. The decision problem "Given $K$, does $G_b$ contain a biclique with at least $K$ edges?" has been shown to be NP-complete [26]. Therefore the corresponding optimization problem is at least NP-hard.

Let $M \in \{0,1\}^{m \times n}$ be the adjacency matrix of the unweighted bipartite graph $G_b$, i.e. $M_{ij} = 1$ if and only if $(i, j) \in E$. In order to avoid trivialities, we will suppose that each vertex of the graph is connected to at least one other vertex. We denote by $|E|$ the cardinality of $E$, i.e. the number of edges in $G_b$; note that $|E| \leq mn$. The set of zero entries of $M$ will be called $Z$, and its cardinality satisfies $|Z| = mn - |E|$.

With this notation, the maximum biclique problem in $G_b$ can be formulated as

$$\min_{v, w} \|M - vw^T\|_F^2 \quad \text{such that } v \in \{0,1\}^m,\ w \in \{0,1\}^n,\ v_i w_j = 0 \ \forall (i,j) \in Z. \qquad \text{(MBP)}$$

In fact, one can easily check that this objective is equivalent to $|E| - \sum_{i,j} v_i w_j$ since $M$, $v$ and $w$ are binary: instead of maximizing the number of edges inside the biclique, one minimizes the number of edges outside.
Feasible solutions of (MBP) correspond to bicliques of $G_b$. We will be particularly interested in maximal bicliques. A maximal biclique is a biclique which is not contained in a larger biclique: it is a locally optimal solution of (MBP).
The corresponding rank-one Nonnegative Factorization problem is defined as

$$\min_{v \in \mathbb{R}^m,\ w \in \mathbb{R}^n} \|M_d - vw^T\|_F^2 \quad \text{such that } v \geq 0 \text{ and } w \geq 0, \qquad \text{(NF-1d)}$$

with the matrix $M_d$ defined as

$$M_d = M - d\,(1_{m \times n} - M), \quad d > 0, \qquad (3.1)$$

where $1_{m \times n}$ is the matrix of all ones with dimension $m \times n$. $M_d$ is the matrix $M$ where the zero values have been replaced by $-d$. Clearly $M_d$ is not necessarily a nonnegative matrix.
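A small illustration of the construction of $M_d$, assuming entries equal to $1$ on the edges of the graph and $-d$ elsewhere; the graph below is made up for the example:

```python
import numpy as np

# Adjacency-style matrix of a small bipartite graph: M[i, j] = 1 iff (i, j) is an edge.
M = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
d = 4.0
M_d = M - d * (1 - M)        # zero entries of M replaced by -d, as in (3.1)

assert np.all(M_d[M == 1] == 1)
assert np.all(M_d[M == 0] == -d)
```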

To prove NP-hardness of (NF-1d), we are going to show that, if $d$ is sufficiently large, optimal solutions of (NF-1d) coincide with optimal solutions of the corresponding biclique problem (MBP). From now on, we say that a solution $(v, w)$ coincides with another solution $(v', w')$ if and only if $vw^T = v'w'^T$ (i.e. if and only if $v = \lambda v'$ and $w = \lambda^{-1} w'$ for some $\lambda > 0$). We also let and .

Lemma 1.

Any optimal rank-one approximation with respect to the Frobenius norm of a matrix for which contains at least one nonpositive entry.

Proof.

If , the result is trivial. If not, since . Suppose now is a best rank-one approximation of . Therefore, since the negative values of are approximated by positive ones and since has at least one negative entry, we have

(3.2)

By the Eckart-Young theorem,

where is the maximum singular value of . Clearly,

So

which is in contradiction with (3.2). ∎

We restate here a well-known result concerning low-rank approximations (see e.g. [16, p. 29]).

Lemma 2.

The local minima of the best rank-one approximation problem with respect to the Frobenius norm are global minima.

We can now state the main result about the equivalence of (NF-1d) and (MBP).

Theorem 4.

For $d$ sufficiently large, any optimal solution $(v, w)$ of (NF-1d) coincides with an optimal solution of (MBP), i.e. $vw^T$ is binary and corresponds to a maximum biclique of $G_b$.

Proof.

We focus on the entries of which are positive and define

(3.3)

, and are the submatrices with indexes in . Since is optimal for , must be optimal for . Suppose there is an entry in , then

so that Lemma 1 holds for . Since is positive and is an optimal solution of (NF-1d) for , is a local minimum of the unconstrained problem i.e. the problem of best rank-one approximation. By Lemma 2, this must be a global minimum. This is a contradiction with Lemma 1: should contain at least one nonpositive entry. Therefore, which implies by optimality and then is binary and . ∎

Corollary 1.

Rank-one Nonnegative Factorization is NP-hard.

Intuitively, the reason (NF-1d) is NP-hard is that if one of the negative entries $-d$ of $M_d$ is approximated by a positive value, say $p > 0$, the corresponding error is $(p + d)^2$. Therefore, the larger $d$, the more expensive it is to approximate $-d$ by a positive number. Because of that, when $d$ increases, negative values of $M_d$ will be approximated by smaller positive values and eventually by zeros.

Therefore, for each negative entry of $M_d$, one has to decide whether to approximate it with zero or with a positive value. Moreover, when a value is approximated by zero, one has to choose which entries of $v$ and $w$ will be equal to zero, as in the biclique problem. We suspect that the hardness of Nonnegative Factorization lies in these combinatorial choices.

We can now answer our initial question: Would it be possible to solve efficiently the problem

$$\min_{V_{:k} \geq 0,\ W_{k:} \geq 0} \|R_k - V_{:k}W_{k:}\|_F^2$$

simultaneously for both vectors? Obviously, unless P = NP, we won't be able to find a polynomial-time algorithm to solve this problem. Therefore, it seems hopeless to improve the HALS algorithm using this approach.

Remark 1.

Corollary 1 suggests that NMF is a difficult problem for any fixed factorization rank $r$. Indeed, even if one was given the optimal solution of an NMF problem except for one rank-one factor, it is not guaranteed that one would be able to find this last factor in polynomial time, since the corresponding residual is not necessarily nonnegative.

We now generalize Corollary 1 to factorizations of arbitrary rank.

Theorem 5.

Nonnegative Factorization (NF) is NP-hard.

Proof.

Let $M$ be the adjacency matrix of a bipartite graph $G_b$ and let $r$ be the factorization rank of (NF). We define the matrix $B$ as

which is the adjacency matrix of another bipartite graph which is nothing but the graph $G_b$ repeated $r$ times. $B_d$ is defined in the same way as $M_d$, i.e.

with . Let be the optimal rank- nonnegative factorization of and consider each rank-one factor : each of them must clearly be an optimal rank-one nonnegative factorization of . Since ,

and Lemma 1 holds. Using exactly the same arguments as in Theorem 4, one can show that, ,

Therefore, the positive entries of each rank-one factor will correspond to a biclique of the repeated graph. By optimality, each rank-one factor must correspond to a maximum biclique of $G_b$, since the graph is $G_b$ repeated $r$ times. Thus, since (NF-1d) is NP-hard, (NF) is NP-hard as well. ∎

3.2 Stationary points of (NF-1d)

We have shown that optimal solutions of (NF-1d) coincide with optimal solutions of (MBP) for , which are NP-hard to find. In this section, we focus on stationary points of (NF-1d) instead: we show how they are related to the feasible solutions of (MBP). This result will be used in Section 5 to design a new type of biclique finding algorithm.

3.2.1 Definitions and Notation

The KKT conditions of (NF-1d), which define the stationary points, are exactly the same as for (NMF): is a stationary point of (NF-1d) if and only if

(3.4)
(3.5)

Of course, we are especially interested in nontrivial solutions, and we then assume $v \neq 0$ and $w \neq 0$, so that one can check that (3.4)-(3.5) are equivalent to

(3.6)

Given , we define three sets of rank-one matrices: , corresponding to the nontrivial stationary points of (NF-1d), with

, corresponding to the feasible solutions of (MBP), with

and , corresponding to the maximal bicliques of (MBP), i.e. if and only if and coincides with a maximal biclique.

3.2.2 Stationarity of Maximal Bicliques

The next theorem states that, for sufficiently large, the only nontrivial feasible solutions of (MBP) that are stationary points of (NF-1d) are the maximal bicliques.

Theorem 6.

For ,  .

Proof.

if and only if and is maximal i.e.

Since is binary and , the nonzero entries of and must be equal to each other. Moreover, so that (1) is equivalent to

.

This is equivalent to the stationarity conditions for , cf. (3.6). By symmetry, (2) is equivalent to the stationarity conditions for . ∎

Theorem 6 implies that, for $d$ sufficiently large, every maximal biclique corresponds to a stationary point of (NF-1d). It would be interesting to have the opposite affirmation: for $d$ sufficiently large, any stationary point of (NF-1d) corresponds to a maximal biclique of (MBP). Unfortunately, we will see later that this property does not hold.

3.2.3 Limit points of

However, as $d$ goes to infinity, we are going to show that the stationary points of (NF-1d) get closer to feasible solutions of (MBP).

Lemma 3.

The set is bounded i.e. , :

Proof.

For , by (3.6),

Lemma 4.

For , if and if , then

Proof.

By (3.6),

The same can be shown for by symmetry. ∎

Theorem 7.

As goes to infinity, stationary points of (NF-1d) get closer to feasible solutions of (MBP) i.e. , s.t. :

(3.7)
Proof.

Let and suppose . W.l.o.g. ; in fact, if , . Note that Lemma 3 implies . By (3.6),

Therefore, is a pair of singular vectors of associated with the singular value . If , the only pair of positive singular vectors of is so that coincides with a feasible solution of (MBP).
Otherwise, we define

(3.8)

and their complements , ; hence,

Using Lemma 4 and the fact that , we get

(3.9)

Therefore, since and , we obtain

It remains to show that coincide with a biclique of the (complete) graph generated by . We distinguish three cases:
(1) . (3.9) implies so that

(3.10)

which is absurd if since .
(2) . Using (3.9), we have and then

(3) . Noting , Equation (3.10) gives . Therefore,

(3.11)

Moreover, so that

(3.12)

Finally, combining (3.11) and (3.12) and noting that since ,

We can conclude that, for sufficiently large, is arbitrarily close to a feasible solution of (MBP) which corresponds to the biclique .

Recall we supposed . If , let be the indexes defined in (3.3). The above result holds for with the matrix . For sufficiently large, is then close to a feasible solution of (MBP) for . Adding zero to this feasible solution gives a feasible solution for . ∎

Example 1.

Let

Clearly, this solution belongs to the set B, i.e. it corresponds to a maximal biclique of the graph generated by $M$. By Theorem 6, for $d$ sufficiently large, it is a stationary point of (NF-1d).
For $d$ large enough, one can also check that the singular values of $M_d$ are distinct and that the second pair of singular vectors is positive. Since it is a positive stationary point of the unconstrained problem, it is also a stationary point of (NF-1d). As $d$ goes to infinity, it must get closer to a biclique of (MBP) (Theorem 7). Moreover, $M_d$ is symmetric, so that the right and left singular vectors are equal to each other. Figure 2 shows the evolution of this positive singular vector of $M_d$ with respect to $d$ (by Wedin's theorem from matrix perturbation theory, see e.g. [28], singular subspaces of $M_d$ associated with a positive singular value are continuously deformed with respect to $d$). It converges to and then the product of the left and right singular vector converges to .

Figure 2: Evolution of the positive singular vector of $M_d$ with respect to $d$.

3.3 Multiplicative Updates for Nonnegative Factorization

In this section, the MU of Lee and Seung presented in Section 2.2 to find approximate solutions of (NMF) are generalized to (NF). Other than providing a way of computing approximate solutions of (NF), this result will also help us to understand why the updates of Lee and Seung are not very efficient in practice.
The Karush-Kuhn-Tucker optimality conditions of the (NF) problem are the same as for (NMF) (see Section 2.2). Of course, any real matrix can be written as the difference of two nonnegative matrices: with . This can be used to generalize the algorithm of Lee and Seung. In fact, (2.5) and (2.6) become

$$V \circ (VWW^T + NW^T) = V \circ (PW^T), \qquad (3.13)$$
$$W \circ (V^TVW + V^TN) = W \circ (V^TP), \qquad (3.14)$$

and using the same idea as in Section 2.2, we get the following multiplicative update rules:

Theorem 8.

For $M = P - N$ with $P \geq 0$ and $N \geq 0$, the cost function $\|M - VW\|_F^2$ is nonincreasing under the following update rules:

$$V \leftarrow V \circ \frac{[PW^T]}{[VWW^T + NW^T]}, \quad W \leftarrow W \circ \frac{[V^TP]}{[V^TVW + V^TN]}. \qquad (3.15)$$
Proof.

We only treat the proof for since the problem is perfectly symmetric. The cost function can be split into independent components related to each row of the error matrix, each depending on a specific row of , and , which we call respectively , and . Hence, we can treat each row of V separately, and we only have to show that the function

is nonincreasing under the following update

(3.16)

is a quadratic function so that

with and . Let be a quadratic model of around :