Using Underapproximations for Sparse Nonnegative Matrix Factorization

# Using Underapproximations for Sparse Nonnegative Matrix Factorization

Nicolas Gillis    François Glineur Center for Operations Research and Econometrics, Université catholique de Louvain, Voie du Roman Pays, 34, B-1348 Louvain-La-Neuve, Belgium ; nicolas.gillis@uclouvain.be and francois.glineur@uclouvain.be. The first author is a research fellow of the Fonds de la Recherche Scientifique (F.R.S.-FNRS). This text presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister’s Office, Science Policy Programming. The scientific responsibility is assumed by the authors.
October 2009
###### Abstract

Nonnegative Matrix Factorization consists in (approximately) factorizing a nonnegative data matrix by the product of two low-rank nonnegative matrices. It has been successfully applied as a data analysis technique in numerous domains, e.g., text mining, image processing, microarray data analysis, collaborative filtering, etc.

We introduce a novel approach to solve NMF problems, based on the use of an underapproximation technique, and show its effectiveness to obtain sparse solutions. This approach, based on Lagrangian relaxation, allows the resolution of NMF problems in a recursive fashion. We also prove that the underapproximation problem is NP-hard for any fixed factorization rank, using a reduction of the maximum edge biclique problem in bipartite graphs.

We test two variants of our underapproximation approach on several standard image datasets and show that they provide sparse part-based representations with low reconstruction error. Our results are comparable and sometimes superior to those obtained by two standard Sparse Nonnegative Matrix Factorization techniques.

Keywords: Nonnegative Matrix Factorization, Underapproximation, Maximum Edge Biclique Problem, Sparsity, Image Processing.

## 1 Introduction

Nonnegative Matrix Factorization (NMF) is a recent data analysis technique with applications in image processing, text mining, spectral unmixing, air emission control, computational biology, clustering, etc. (see [1, 2, 3, 4] and references therein). NMF can be described as follows: given a nonnegative input matrix and an integer , find two nonnegative matrices and whose product approximates the input matrix as closely as possible:

 M≈VW, (1)

so that is a low-rank approximation of . Matrix factorization can be interpreted as a linear factor model: assuming that each column of the input matrix represents an element of a data set, decomposition (1) can be written as111In this text, stands for the -entry of a matrix , for its column and for its row.

 M:j≈∑kV:kWkj,∀j,

i.e., each input column is a linear combination of a set of basis elements with corresponding weights .

In contrast with standard linear factor model techniques such as Principal Component Analysis, NMF considers nonnegativity of the input columns to be an important feature and consequently requires the basis elements to be also nonnegative, so that they can be interpreted in the same way (e.g., these columns can correspond to images described by nonnegative pixel intensities or to texts represented by vectors of nonnegative word counts). Furthermore, NMF imposes nonnegativity of the weights, leading to an essentially additive reconstruction of the input columns by the basis elements. This representation is then part-based: basis elements will represent common parts of the columns of . For example, if each column of represents a face using pixel intensities, the basis elements generated by NMF can be facial features, such as eyes, noses and lips, as shown in Figure 1.

This low-rank approximation technique with nonnegativity constraints was introduced in 1994 by Paatero and Tapper [5] and started to be extensively studied after the publication of an article by Lee and Seung [6] in 1999. Since an exact representation of the input matrix cannot be obtained in general, the quality of the approximation is measured by some criterion, typically the sum of the squares of the errors on the entries, which leads to the following minimization problem222 denotes the Frobenius norm of matrix .:

 minV∈Rm×r,W∈Rr×n||M−VW||2F such that V≥0 and W≥0. (NMF)

An important feature of NMF is that its nonnegativity constraints typically induce sparse factors, i.e., factors with relatively many zero entries. Intuitively, decomposition into parts requires the basis elements to be sparse, cf. Figure 1. More formally, the reason for this behavior is that stationary points of NMF will be typically located at the boundary of the feasible domain , hence will feature zero components. This can be explained with the first-order optimality conditions: because the set of stationary points of a problem of the type

 minx∈Rnf(x) such that x≥0

is given by the following expression (where is the gradient of )

 Sf={x∈Rn|x≥0,∇f(x)≥0 and xi[∇f(x)]i=0∀i},

some components of the solution can be expected to be equal to zero.

Sparsity of the factors is an important consideration in practice: in addition to reducing memory requirements to store the basis elements and their weights, sparsity improves interpretation of the factors, especially when dealing with classification/clustering problems, e.g., in text mining [7] and computational biology [8, 9]. By contrast, unconstrained low-rank approximations such as Principal Component Analysis (PCA) do not naturally generate sparse factors (for that reason, low-rank approximations techniques with additional sparsity constraints have been recently introduced; this is referred to as Sparse Principal Component Analysis, Sparse PCA or SPCA, see, e.g., [10] and references therein).

Although solutions of NMF typically display some level of sparsity, some applications require even sparser solutions, leading to variants of NMF called Sparse Nonnegative Matrix Factorization. They are in general developed in two different ways: some authors define a priori a desired sparsity level and adapt the main iteration of their method in order to guarantee that the factors satisfy that level of sparsity throughout the application of the algorithm, see, e.g., [11, 12]. Alternatively, a penalty term can be added to the objective function to prevent the algorithm from considering dense solutions, see [13]. In particular, it is well-known that -norm penalty terms induce sparser solutions (see, e.g., [14, 9, 15]). More details about these techniques are given at the beginning of Section 4.

Unfortunately the advantages of NMF (part-based representation and sparsity) over PCA come at a certain price. First, because of the additional nonnegativity constraints, the approximation error of the input data for a given factorization rank will always be higher for NMF than in the unconstrained case. Second, optimization problem (NMF) is more difficult to solve than its unconstrained counterpart: while PCA problems can be solved in polynomial time (e.g., using a Singular Value Decomposition technique [16]), NMF problems belong to the class of NP-hard problems, as recently showed by Vavasis [17]. However, it should also be pointed out that these drawbacks (higher error, NP-hardness) are also present for competing techniques emphasizing sparsity, such as SPCA.

Because of its NP-hardness, practical algorithms cannot be expected to find provably optimal global solutions for (NMF) in a reasonable amount of time and aim instead at finding locally optimal solutions. Most methods start from some initial guess factors and improve them iteratively using nonlinear optimization schemes such as projected gradient methods [18], Newton-like methods [19, 20], (block-)coordinate descent (also called alternating nonnegative least squares – NNLS) [21, 22, 14], multiplicative updates [23], etc. (see also [1, 2, 24, 25] and references therein).

In this paper, we introduce a novel approach to solve NMF problem based on the use of an underapproximation technique and show its effectiveness to obtain sparse solutions. Section 2 introduces our underapproximation problem, motivated by a recursive technique to solve NMF, studies the sparsity of its solutions and proves that it is NP-hard for any fixed factorization rank. Nevertheless, Section 3 describes an algorithm to solve it approximately using a technique based on Lagrangian relaxation. Finally, in the last section, we test this approach on several standard image datasets, and show both qualitatively and quantitatively that it provides part-based and sparse representations that are comparable and sometimes superior to those obtained with standard Sparse Nonnegative Matrix Factorization techniques.

## 2 Nonnegative Matrix Underapproximation

### 2.1 A recursive approach

Finding a rank-one nonnegative matrix factorization, i.e., solving (NMF) with is notably easier than for higher factorization ranks: while the general problem is NP-hard, computing a globally optimal rank-one approximation can be done in polynomial time. More specifically, the first rank-one factor of the Singular Value Decomposition (SVD) of the input matrix is an optimal solution: indeed, the Perron-Frobenius theorem implies that the dominant left and right singular vectors of a nonnegative matrix are nonnegative, while the Eckart-Young theorem states that the outer product of these dominant singular vectors is the best possible rank-one approximation in the Frobenius norm.

In principle, we might try exploit this result to find factorizations of higher ranks by applying it recursively: after identification of an optimal rank-one NMF solution , one could subtract the factor from and apply the same technique to to recover the next rank-one factor. Unfortunately, this idea cannot work: the difference between and its rank-one approximation may contain negative values (typically roughly half of them), so that the next SVD factor will no longer provide a nonnegative solution. Moreover, there is no hope of replacing SVD by another efficient technique for this step since [26] shows that it is NP-hard to find the optimal nonnegative rank-one approximation to a matrix which is not nonnegative.

If we wish to keep the principle of a recursive algorithm finding one rank-one factor at a time, we have to add a constraint ensuring that the factor, when subtracted from , gives a nonnegative remainder, i.e., we need to have . Therefore we introduce a similar upper bound constraint to the general (NMF) problem and obtain a new problem we call Nonnegative Matrix Underapproximation (NMU): given and , the NMU optimization problem is defined as

 minV∈Rm×r,W∈Rr×n||M−VW||2F such that V≥0, W≥0 and VW≤M. (NMU)

Assuming we are able to solve it for , an underapproximation of any rank can then be built by following the recursive procedure outlined above. More precisely, if is a rank-one underapproximation for , i.e., and , we have that is nonnegative. can then be underapproximated , leading to , and so on. After steps, we get an underapproximation of rank

 M ≥ V:1W1:+V:2W2:+⋯+V:rWr: = [V:1V:2…V:r][W1:;W2:;…;Wr:] = VW.

Besides enabling this recursive procedure, we notice that NMU leads to a more localized part-based decomposition, in the sense that different basis elements tend to describe disjoint parts of the input data (i.e., involving different nonzero entries). This is a consequence of the underapproximation constraints which impose the extracted parts (the basis elements ) to really be common features of the columns of since

 M:j⪆∑kV:kWkj,∀j.

Basis elements can only be combined to approximate a column of if each of them represents a part of this column, i.e., none of the parts selected with a positive weight can involve a nonzero entry corresponding to a zero entry in the input column . The following example demonstrates this behavior.

###### Example 1 (Swimmer Database).

The swimmer image dataset consists of 256 binary images of a body with 4 limbs which can be each in 4 different positions. NMF is expected to find a part-based decomposition of these images, i.e., isolate different constitutive parts of the images (the body and the limbs) in each of its basis elements.

Figure 2 displays a sample of such images along with the basis elements obtained with NMF and NMU. While NMF elements are rather sparse, they are mixtures of several limbs. By contrast, NMU returns a even sparser solution and is able to extract a single body part for each of its elements.

### 2.2 Sparsity

The fact that NMU decompositions naturally generate sparser solutions than NMF can be explained as follows: since the zero entries of can only be underapproximated by zeros, we have

 Mij=0⇒(VW)ij=0⇒Vik=0 or Wkj=0,∀k

which shows that when the input matrix is sparse, many components of the NMU factors will have to be equal to zero. This observation can be made more formal: defining the sparsity of a by matrix as the proportion of its zero entries, i.e.,

 s(M)=#zeros(M)mn∈[0,1],

we have the following theorem relating sparsity of and its NMU factors.

###### Theorem 1.

For any nonnegative rank-one underapproximation of we have

 s(v)+s(w)≥s(M).
###### Proof.

For a rank-one matrix , the number of nonzeros is exactly equal to the product of the number of nonzeros in vectors and . Therefore we have that which implies . Since underapproximation satisfies , it must have more zeros than and we have

 s(M)≤s(sw)≤s(v)+s(w),

proving our claim. ∎

Recall the recursive definition of the residuals and . The following corollary relates their sparsity and the sparsity of the whole rank- approximation with that of the NMU factors.

###### Corollary 1.

For any nonnegative underapproximation of we have for each factor

 s(V:k)+s(Wk:)≥s(Rk)≥s(M),1≤k≤r,

and .

###### Proof.

We have , which implies by the previous theorem the first set of inequalities. Observing that and is sufficient to prove the second inequality. ∎

Sparsity of the residuals is monotonically nondecreasing at each step, since . Moreover, the following theorem can guarantee an increase in sparsity at each step.

###### Theorem 2.

For any locally optimal nonnegative rank-one underapproximation of , define sets and (supports of vectors and ) by

 I={i|vi>0},J={j|wj>0},

and define matrix to be the submatrix of residual whose row and column indices belong respectively to and (corresponding to the submatrix of that is not approximated by zeros). Then there is at least one zero in each row and each column of submatrix .

###### Proof.

Simply observe that if (resp. ) for some (resp. ), (resp. ) can be increased to obtain a strictly better solution, which contradicts the local optimality assumption. ∎

This ability of NMU to generate sparse part-based decomposition will be experimentally confirmed in Section 4.

### 2.3 Related work

The problem of rank-one underapproximation has been first introduced by Levin in [27] in the case of positive stochastic matrices. He introduced a specific objective function different from the Frobenius norm and used a logarithmic change of variables in order to design an iterative method based on the corresponding optimality conditions.

In [28], the rank-one underapproximation problem is cast as a convex problem (hence efficiently solvable) using again different objective functions. Solutions are then used to initialize standard NMF algorithms in order to accelerate their convergence and, in general, find better final solutions as compared to those obtained with random initializations. Similar behavior was observed for other judicious initializations in [29, 30, 31].

More recently, Dong et al. [32] studied the same problem with the additional constraint that the rank of the residual must be strictly smaller than the rank of the factorized matrix. Using the Wedderburn rank reduction formula, they proposed a numerical procedure which is able to compute the maximum rank splitting of a nonnegative matrix. However, the underlying optimization problem is NP-hard [17] and their algorithm is not guaranteed to find a solution in all cases.

Biggs et al. [33] also introduced a recursive procedure to solve NMF problems: their idea is to locate and then approximate nearly rank-one submatrices of . However, the problem of locating maximum rank-one submatrices is also shown to be NP-hard, and their algorithm is not globally optimal.

### 2.4 Complexity

We now prove that (NMU) is NP-hard, even in the rank-one case (unlike (NMF), which is polynomially solvable in the rank-one case). In order to do this, the rank-one version of the problem is proved to be equivalent to the biclique problem, which is NP-hard. The result is then generalized to (NMU) with arbitrary factorization rank using a simple construction.

A bipartite graph is a graph whose vertices can be divided into two disjoint sets such that there is no edge between two vertices in the same set. A biclique is a complete bipartite graph, i.e., a bipartite graph where all the vertices from different sets are connected by an edge. Finally, the so-called maximum edge biclique problem (the biclique problem for short) in a bipartite graph

 Gb=(V=V1∪V2,E⊆(V1×V2)),

is the problem of finding a biclique in (i.e., and ) with a maximum number of edges .

Letting be the adjacency matrix of with and , i.e.,

 Mij=1⇔(si,tj)∈E,

and introducing indicator binary variables (resp. ) to denote whether (resp. ) belongs to the biclique , the Maximum edge Biclique Problem (MBP) in a bipartite graph can be formulated as follows

 minv,w ∑i,j(Mij−viwj)2 viwj≤Mij,∀i,j, (MBP) v∈{0,1}m,w∈{0,1}n.

One can check that this objective is equivalent to . In fact, since , and are binary and .
The corresponding decision problem “Given , does contain a biclique with at least edges?” has been shown to be NP-complete [34]. Therefore, the corresponding optimization problem (MBP) is at least NP-hard.

For , (NMU) can be written as

 minv∈Rm,w∈Rn ∑i,j(Mij−viwj)2 viwj≤Mij,∀i,j, (NMU1) v≥0,w≥0,

which is very close to (MBP): the difference is that vectors and are required to be binary for (MBP) and nonnegative for (NMU1). The next lemma proves that the two problems are actually equivalent.

###### Lemma 1.

For , every optimal solution of (NMU1) is such that is binary, i.e., , and can then be trivially transformed into a binary optimal solution of (MBP).

###### Proof.

For , this is trivial. Otherwise, suppose is an optimal solution of (NMU1). Let define as

 v′i={1if vi≠00otherwise and w′j={1if wj≠00otherwise,

and analyze the different possibilities: as , we have either

• and ;

• and ;

• .

Therefore, which implies

 ||M−v′w′||F≤||M−vw||F.

By optimality of , we must have . Therefore, is an optimal binary solution of (NMU1) which is then also an optimal solution of (MBP) (note that we must have ). ∎

###### Corollary 2.

(NMU1) is NP-hard.

We now generalize Corollary 2 to the more general case of (NMU) with .

###### Theorem 3.

(NMU) is NP-hard.

###### Proof.

Let be the adjacency matrix of a bipartite graph . We define the matrix as

 A=diag(M,r)=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝M0…00M0⋮⋱⋮0…M⎞⎟ ⎟ ⎟ ⎟ ⎟⎠,

which is the adjacency matrix of another bipartite graph which is the graph repeated times. Let be an optimal solution of (NMU). Since , we have . Therefore is a feasible solution of (NMU1) for the matrix , i.e., for the graph . Hence, each corresponds to a biclique of with

 Vik≠0⇔si∈Vk1andWkj≠0⇔tj∈Vk2.

By optimality of and since there are at least independent maximum biclique in , each must coincide with a maximum biclique of which corresponds to a maximum biclique of . This is due to the fact that, because is the graph repeated times, a biclique clearly cannot span two disjoint subgraphs of . Therefore, (NMU) is NP-hard since any instance of (MBP) can be polynomially reduced to an instance of (NMU). ∎

## 3 An algorithm for NMU based on Lagrangian relaxation

Since (NMU), like (NMF), is a NP-hard problem, we can not expect to solve it up to guaranteed global optimality in a reasonable (e.g., polynomial) computational time (unless ). In this section, we propose a nonlinear optimization scheme based on Lagrangian relaxation in order to compute approximate solutions of (NMU).

Drop the underapproximation constraints of (NMU) and add them into the objective function with the corresponding Lagrange multipliers (dual variables, forming a matrix) , to obtain the Lagrangian function

 L(V,W,Λ)=12||M−VW||2F+m∑i=1n∑j=1Λij(VW−M)ij,

where a factor of was introduced to make the presentation nicer. The Lagrangian relaxation subproblem LR consists in minimizing for a fixed value of the multipliers, leading to the corresponding Lagrangian dual function

 f(Λ)=minV,W≥0L(V,W,Λ) (LRΛ)

where is well-defined because the minimum of is always attained, due to the fact that is bounded below and the search space can be restricted to a compact set. Indeed, considering each rank-one factor individually and imposing w.l.o.g. , we have

 ||V:k||2F=||Wk:||2F ≤ ||VW||F ≤ ||M−Λ||F+||M−Λ−VW||F, ≤ 2||M−Λ||F∀k

where we have used the trivial solution to bound (cf. derivations of Section 3.1).

Standard application of Lagrangian duality tells us that

 (NMU)≡minV,W≥0supΛ≥0L(V,W,Λ)≥supΛ≥0minV,W≥0L(V,W,Λ) = supΛ≥0f(Λ),

where the problem on the left of the inequality is equivalent to our original NMU formulation and the problem on the right is its Lagrangian dual, whose solution will provide a (hopefully tight) lower bound on the optimal (NMU). This new problem is a nondifferentiable optimization problem with the nice property that its objective is concave and its maximization (over a convex set) is then a convex problem (see [35] and references therein).

We describe in the next section a general solution technique, which consists in repeatedly applying the following two steps:

1.

Given multipliers , compute to (approximately) minimize , i.e., solve (LR); this is discussed in Section 3.1;

2.

Given solution , update multipliers ; this is described in Section 3.2.

### 3.1 Solving the Lagrangian relaxation problem

The following derivations

 L(V,W,Λ) =∑i,j12(M−VW)2ij+∑i,jΛij(VW−M)ij =12∑i,jM2ij−∑i,jMij(VW)ij+12∑i,j(VW)2ij +∑i,jΛij(VW)ij−∑i,jΛijMij =12||(M−Λ)−VW||2F−12||Λ||2F,

show that minimizing for a fixed is equivalent to minimizing . Matrix is not necessarily nonnegative, therefore finding and such that is a more general problem than NMF. It is actually studied in detail in [26] (see also [36]) where it is called Nonnegative Factorization (NF), is formulated as

 minV∈Rm×r,W∈Rr×n||N−VW||2F such that V≥0 and W≥0, (NF)

with and and is shown to be NP-hard for any factorization rank (including ).

Some standard algorithms for NMF can easily adapted to handle an input matrix that is not nonnegative, i.e., solve a NF problem. For this work, we decided to use a recent technique called Hierarchical Alternating Least Squares (HALS), proposed in [22], which alternatively updates each column of and each row of with the following optimal closed-form solutions:

 V∗:k = argminV:k≥0||(M−Λ)−VW||2F (2) = max(0,A:k−∑rl=1,l≠kV:lBlkBkk),

with and , and

 W∗k: = argminWk:≥0||(M−Λ)−VW||2F (3) = max(0,Ck:−∑rl=1,l≠kDklWl:Dkk),

with and . This can be viewed as a simple method of (block-)coordinate descent (also called alternating variables), which has been shown to perform strikingly well in practice, and much better than the popular multiplicative updates of Lee and Seung (see [4, 15, 26]). Under some mild assumptions, every limit point of the above alternating scheme is a stationary point [4, 26].

The main computational cost of one HALS iteration is the evaluation of and : they each require (floating point) operations. One can check that the resulting total number of operations is .

###### Remark 1.

HALS is sensitive to the scaling of the initial matrices. For example, if the initial matrices and are chosen such that , optimal columns of and optimal rows of computed by formulas (2) and (3) at the first step will most likely be equal to zero. This will lead to rank deficient approximations ( for some ) and numerical problems (for , update of is not well defined and vice versa). If the initial matrices are scaled [26], i.e., by ensuring that

 1=argminα||M−αVW||F=⟨M,VW⟩⟨VW,VW⟩ (4)

where , this behavior is in general avoided. All initial matrices used in the following have been scaled.

### 3.2 Update of the multipliers Λ

The second step of our algorithm consists in updating in order to find better (i.e., higher) solutions to the Lagrangian dual problem. Using the knowledge that any optimal solution of the Lagrangian dual must satisfy the following complementarity slackness conditions

 Λ∗ij(M−V∗W∗)ij=0 ∀i,j, as well as % feasibility conditions Λ∗ij≥0 and (M−V∗W∗)ij≥0,

we see that the update rule for the multipliers should satisfy the following:

• if , should be decreased and eventually reach zero if ,

• if , should be increased to give more importance to in the cost function, hopefully in order to get a feasible solution such that .

In the sequel, we use the following rule to update , which satisfies the above requirements:

 Λ←max(0,Λ−μk(M−VW)),μk→0,

where is a predefined sequence of step lengths decreasing to zero; can be initialized to zero. This update is inspired from the concept of subgradient methods [37]; in fact, one can easily check that the quantity is a subgradient of

 f(Λ)=minV,W≥0L(V,W,Λ)

with respect to , i.e., if , we have

 f(Λ)≤f(¯Λ)+⟨¯V¯W−M,Λ−¯Λ⟩,∀Λ.

Two questions now arise

• Since an iterative algorithm is used to solve (approximately) the Lagrangian relaxation problem (cf. section 3.1), after how many of these HALS iterations do we stop and proceed to update the multipliers ?

• How do we choose the sequence of step lengths ?

Subgradient methods usually assume that the Lagrangian relaxation problem (LR) can be solved exactly and can guarantee their convergence to an optimal solution provided an appropriate sequence of step sizes is selected (see, e.g., [35]), for example satisfying the conditions

 0≤μk→0 such that ∞∑k=0μ2k<+∞ while ∞∑k=0μk=+∞.

In the sequel, we choose to use , which is such a suitable sequence. However, in our case, we cannot expect to solve (LR) in a reasonable amount of time since the problem is NP-hard. It would even probably be too expensive to wait for the stabilization of (e.g., getting close to a stationary but not necessarily optimal point). We therefore suggest to update only a constant number of times between each update of , which leads to Algorithm L-NMU. Note that because we do not solve (LR) exactly, Algorithm L-NMU is not guaranteed to converge to an optimal solution of the Lagrangian dual but, as we will see, it produces satisfactory solutions in practice.

The additional computational cost of one iteration of algorithm L-NMU when compared with one iteration of HALS for NMF consists in the computation of (needed in step 3) and the update of (at step 4), which require operations (and, in the special case , operations).

###### Remark 2.

Because convergence is not theoretically guaranteed, Algorithm L-NMU may end up with solutions that do not completely satisfy the underapproximation constraint. Although our numerical experiments show that this has no detrimental influence on the quality of the obtained sparse part-based representations (see Section 4), we give here a simple technique to transform such a solution into a feasible solution. Indeed, it is enough to consider the following QP problem (convex quadratic objective function, linear inequality constraints) which only involves the factor

 V∗=argminV≥0,VW≤M||M−VW||2F. (5)

Because of its convexity, this problem can be solved up to global optimality in a very efficient manner, and replacing the original factor by the optimal solution leads to a feasible solution to (NMU).

###### Remark 3.

Because update rule (5) is exact and computable in practice, it would be natural to consider a simpler algorithm based on its alternative application to the and factors, without using the Lagrangian relaxation technique, hoping to converge to a solution of (NMU). Unfortunately, we observed that this is quite inefficient in practice. In fact,

• it is relatively computationally expensive to solve these linearly constrained quadratic programs (with and inequalities), at least compared to the HALS closed-form update rules (2)-(3);

• since the underapproximation constraint is imposed at each step, this algorithm has much less freedom to converge to good solutions: iterates rapidly get stuck on the boundary of the feasible domain, typically with (too) many zeros and a lower rank. For example, assuming has one zero in each column, we have that for any positive matrix the corresponding optimal is equal to 0:

 ∀j,∃i s.t. Mij=0⇒∀j,∃i s.t. ∑kVikWkj=0⇒Wkj=0,∀k,j.

Therefore, such an algorithm can only work if we decide a priori which values in and should be equal to zero, i.e. if we find a good sparsity pattern for the solution, which is precisely where the difficulty of the problem lies. Note that the same behavior is observed if a HALS-type algorithm is used instead of (5) (i.e., updating columns of and rows of alternatively): after the update of one column of , the residual will have one zero in each row (cf. Theorem 2) which will prevent the other columns of to be nonzero (except if the sparsity pattern is chosen a priori).

###### Remark 4.

The L-NMU algorithm described above is not particularly well-suited to deal with very sparse input matrices. In fact, one has to store a potentially dense matrix with the Lagrangian variables . Nevertheless, Berry et al. [38] have obtained encouraging results when applying NMU to sparse anomaly detection problems in text mining. Moreover, it is possible to take advantage of the input sparsity pattern and design a computationally cheaper method. First note that the Lagrangian variables associated with a zero of will be nondecreasing in the course of the algorithm, since

 0≤(V(k)W(k))ij and (M)ij=0⇒Λ(k)ij≤Λ(k+1)ij,

where superscript denotes the solution at step . Therefore one can significantly reduce the computational cost by defining

 Λ(k)ij=−g(k) for all i,j s.t. Mij=0,

where is an arbitrary positive nondecreasing function, e.g., with , as is implicitly done in [26].

## 4 Numerical tests on image datasets

We have argued in Section 2 that NMU is potentially able to extract a better part-based representation of the data and that its factors should be sparser than those of the standard NMF, at the detriment of the approximation error. In this section, we support these claims by reporting results of computational experiments involving two variants of Algorithm L-NMU on several image datasets.

A direct comparison between NMU and NMF is not very informative in itself: while the former will provide a sparser part-based representation, the latter will feature a lower approximation error. This does not really tell us whether the improvements in the part-based representation and sparsity are worth the increase in approximation error. For that reason, we chose to compare NMU with two other sparse nonnegative matrix factorizations techniques, described below, in order to better assess whether the increase in sparsity achieved by NMU is worth the loss in reconstruction accuracy.

### 4.1 Sparse NMF

We selected and tested the following two sparse nonnegative matrix factorization techniques that are frequently used in the literature.

• Hoyer describes in [11] an algorithm relying on additional explicit sparsity constraints on the factors, enforced at each iteration by means of a projection. The approximation error is reduced via a combination of projected gradient and multiplicative updates. For our experiments, we use the MATLAB  code provided by the author333This code was downloaded from http://www.cs.helsinki.fi/u/phoyer/software.html..

It should be pointed out that Hoyer is using a different definition of sparsity: for any nonzero dimensional vector , his measure of sparsity is defined as

 sh(x)=√n−||x||1/||x||2√n−1∈[0,1]. (6)

Hence, a vector with a single nonzero entry is perfectly sparse

 sh([0…0k0…0])=1,∀k≠0,

while a vector with all entries equal to each other is completely dense

 sh([k…k])=0,∀k≠0.

In our experiments, we report sparsity using both the standard indicator and Hoyer’s measure.

• Instead of enforcing sparsity at every iteration, a sparsity-inducing penalty term can be introduced in the objective function [13]. In particular, it is well-known that adding -norm penalty terms induce sparser solutions (see, e.g., [14, 9, 15]), and we therefore solve the following problem:

 minV,W≥0||M−VW||2F+μV||V||1+μW||W||1, (sNMF)

where and and are two positive parameters controlling the sparsity of and . In order to solve (sNMF), we use the HALS algorithm which can easily be adapted to handle the additional -norm penalty terms (see, e.g., [4, 15]). This algorithm will be referred to as sNMF.

Technical details for the first technique are more complicated, but it allows the sparsity of the factors to be chosen a priori. The second technique is conceptually simpler but requires the determination of appropriate penalizing parameters by other means.

### 4.2 Tested algorithms

Algorithm L-NMU proposed in Section 3 can be used to compute underapproximations for any given factorization rank . This opens the possibility of building a rank- underapproximation in several different ways: one simple option consists in applying algorithm L-NMU directly to the rank- problem – we call this method global NMU (G-NMU). Another option consists in applying the recursive technique outlined in the introduction, used to motivate the introduction of underapproximations. More specifically, this means running algorithm L-NMU successively times to compute rank-one approximations, subtracting each approximation from the input matrix before computing the next one – we call this method recursive NMU (R-NMU). Note that many other variants are possible (e.g., computing two rank- approximations, computing successive rank-two approximations, etc.) but we only tested the two above-mentioned variants, which represent two extreme cases (no recursion and maximum recursion).

In both cases, our implementation of algorithm L-NMU computes two HALS steps between each update of the multipliers (i.e., we fixed ). Most of the computational work done in one iteration of L-NMU consists in computing , performing the two HALS steps and updating ; more specifically, one can estimate the computational cost of one iteration of G-NMU to operations, while an R-NMU iteration takes operations (repeated times in the recursive procedure).

For each dataset, we test five nonnegative factorization algorithms: NMF based on HALS updates (NMF), global NMU (G-NMU), recursive NMU (R-NMU), sparse NMF with -penalty terms (sNMF) and the algorithm of Hoyer. We also report the results of a standard Principal Component Analysis (PCA) to serve as a reference (recall that the approximation error of this unconstrained low-rank approximation, computed here with a singular value decomposition, is globally minimal for the given rank, but that its factors are neither nonnegative, nor sparse).

### 4.3 Iteration limits and CPU time

Each of the five iterative algorithms described above requires a limit on the number of its iterations; these limits were chosen in order to roughly allocate the same CPU time to each algorithm. More specifically, the standard NMF was given a 600-iterations limit, which corresponds to the computation of 600 HALS updates. The sparse sNMF, based on a slightly modified HALS update, was also allowed 600 iterations. Because a HALS update involves operations, we can deduce the following iteration budgets for G-NMU and R-NMU from the leading terms in their corresponding operation counts: G-NMU is allowed L-NMU iterations while R-NMU can take iterations.

An exception to the equal CPU time rule was made for the algorithm of Hoyer. Results obtained after an amount of CPU time similar to that of the other algorithms were too poor to be compared in a meaningful way. Indeed, because this method is based on a projected gradient method and multiplicative updates (both operations per iteration), which are known to converge at a typically much slower rate, a relatively high limit of 1000 iterations had to be fixed, although the resulting CPU time is then much larger than for the other methods (for example, on the CBCL dataset, 600 iterations of HALS took while 1000 iterations of the algorithm of Hoyer needed ).

### 4.4 Testing methodology

Recall we decided to test algorithms sNMF and Hoyer to assess the quality of the sparsity-accuracy compromise proposed by our NMU approaches. To achieve this, we decided to pit each NMU variant against a solution of sNMF/Hoyer featuring the same level sparsity, and compare the resulting approximation errors. We therefore report results for algorithms on each dataset: PCA, NMF, G-NMU, sNMF with the same sparsity as G-NMU, which we denote by sNMF{G-NMU}, Hoyer{G-NMU}, R-NMU, sNMF{R-NMU} and Hoyer{R-NMU}.

In order to enforce a sparsity similar to the NMU solution in Hoyer’s code, we compute the measure of the NMU factors and input it as a parameter of the method (see description in subsection 4.1); note however that we could only enforce this for the sparsest of the two NMU factors444Ideally, we would have imposed sparsity for both factors, but the implementation we used seemed to return poor results in that situation.. In the case of sNMF, sparsity cannot be directly controlled, and penalty parameters are found using the following adaptive procedure, which proved to work well in practice: and are initialized to 0.1 and, after each iteration, (resp. ) is increased by 5 percent if (resp. ) is below the target sparsity, and is decreased by 5 percent otherwise.

All algorithms were run times with the same initial random matrices and only the best solution with respect to the Frobenius norm of the error is reported. When testing with gray-level images, the input matrices where normalized to have their entries varying between 0 and 1, with 0 representing white and 1 representing black (when trying to decompose as a sum of parts, this make more sense than the opposite convention, since the dark regions are the constitutive parts of the objects in the image datasets we analyze). Finally, before computing reported sparsity measures of the factors, any sufficiently small555We declare an entry of a factor to be sufficiently small if it is less than of the largest entry in its column. entry is rounded to zero (indeed, because algorithms are stopped by the iteration limit before convergence, true zeros are typically not all reached). All tests were run within the MATLAB  7.1 (R14) version, on a 3GHz Intel Core™2 Dual CPU PC.

### 4.5 CBCL Face Database

The CBCL face image dataset was used for the illustrative example of Figure 1 and is made of 2429 gray-level images of faces represented with pixels. We look for an approximation of rank .

Figure 3 displays the basis elements for NMF, G-NMU, R-NMU and sNMF{G-NMU} (which was the best solution obtained in term of sparsity vs. error among all four sNMF and Hoyer variants). Both G-NMU, R-NMU and sNMF achieve a better part-based representation than NMF, generating sparser solutions. An interesting feature of R-NMU is that it extracts parts successively in order of “importance”: the first basis element is a ’mean’ face (which is dense) while the next ones describe different complementary parts (which become sparser as the recursion moves on, cf. Corollary 1 and Theorem 2).

A more quantitative assessment is provided in Table 1, reporting for the algorithms tested the relative error (in percent) of their solutions

 relative error=||M−VW||F||M||F

in the second column (“Plain”) and the corresponding sparsity measures (in percent) of factors and in the last four columns.

As expected, PCA returns the smallest error, albeit with very dense factors. NMF already features much sparser factors (slightly half of the entries in are equal to zero), at the cost of a relatively modest increase in the approximation error (). G-NMU provides an even sparser solution (three quarters of zero entries), increasing again the approximation error (). The factors recursively computed by R-NMU are in comparison not as sparse: as explained above, this is because R-NMU focuses on obtained representative parts, including relatively dense ones for the first few steps of the recursion. However, it features a much sparse weight vectors, giving again more credit to the hypothesis that better parts are extracted. The corresponding approximation error is higher than for other methods, because the intrinsically greedy approach taken by R-NMU is not as efficient as a method that optimizes all the factors simultaneously.

Is the increased sparsity provided by G-NMU worth the increase in approximation error ? Looking at the corresponding results for sNMU{G-NMU} and Hoyer{G-NMU}, i.e., for sparse NMF and Hoyer’s algorithms with a similar target sparsity, it might seem at first that the answer is negative: the other methods return solutions with similar number of nonzeros (slightly higher for Hoyer) and a lower approximation error ( and instead of ). Actually, this was expected: because it tries to return an underapproximation, i.e., factors such that , the entries in the error term are mostly nonnegative, while the other techniques, with no underapproximation constraint, obtain a smaller norm of the error by choosing the entries of to be roughly half negative, half positive. It is therefore not completely fair to compare directly the error of the NMU approach to the other techniques.

In order to compensate for this, a simple rescaling could be used, i.e., multiplying by a scalar since (cf. Equation (4)). However, we chose a different procedure that has the advantage of benefiting all algorithms, including those whose error was not suffering from the underapproximation constraint. Once a solution is computed by one of the eight algorithms, we fix the zero entries of and and optimize the approximation error, i.e., , on the remaining (nonzero) entries (again, HALS can easily be adapted to handle this situation). In essence, this allows us to compare the sparsity patterns of the different solutions. We perform 100 additional HALS steps on each solution, and report the new relative approximation error in the third column of Table 1 (“Improved”). Note that sNMF and Hoyer’s errors are also improved by this procedure; this can be explained by the fact that they were also not directly trying to minimize the approximation error (Hoyer had to take into account its sparsity constraint, and sNMF was influenced by the penalty terms added to the approximation error).

Looking now at the NMU solutions in a fairer comparison, we observe that their approximation error becomes very close to that of sNMF and Hoyer, in particular for G-NMU, and not very far from the denser NMF, so that we can conclude that the sparsity-approximation error compromise it offers is worthwhile.

### 4.6 Swimmer Database

For the swimmer image dataset described in Example 1 (256 images with pixels), the basis elements obtained with the different algorithms are displayed on Figure 4 and the corresponding approximation errors and sparsity measures are reported in Table 2.

As mentioned earlier, our two NMU algorithms are the only methods able to extract truly independent parts, while NMF and sNMF generate a combination of them. Note however that the solution generated by sNMF bears some similarity to the one of G-NMU.

### 4.7 Hubble Space Telescope Spectral Images

The next image dataset consists of 100 spectral images ( pixels) of the Hubble telescope at different frequencies [39, 40], see Figure 5. With the choice , NMF generates a nearly exact factorization (relative error ), because the spectral reflectance of the Hubble telescope results from the additive linear combination of the reflectance of eight constitutive materials. Figure 6 and Table 3 provide the visual and computational results for this dataset.

Because NMF is already a nearly exact reconstruction (Table 3), the NMU constraints are somehow redundant: NMF and G-NMU are basically equivalent and return solutions with very similar sparsity measures (albeit with a slightly lower error for G-NMU). For that reason, sNMF{G-NMU} and Hoyer{G-NMU} return results nearly identical to NMF and are omitted from the table.

Recursive R-NMU extracts parts in order of importance: first, a global picture of the telescope and then its different constitutive parts. This allows it to generate the sparsest solution, with several basis elements representing well-delimited constitutive parts of the telescope not identified by the other methods.

### 4.8 Kuls Illuminated Faces

A static scene was illuminated from many directions with a moving light source to produce the Kuls image dataset666Available at http://www.robots.ox.ac.uk/~amb/.. It consists of 20 images ( pixels) of a face. Because the images are very similar, most of the information (more than 70 percent) can be expressed with only one factor. The remaining information resides in the different orientations of the lighting. Computational and visual results for a rank-5 factorization are given by Table 4 and Figure 7. We observe that NMF and G-NMU obtain similar results: even though they are both able to extract several faces with different lighting orientations, they do not extract a sparse and part-based representation.

R-NMU first extracts a face illuminated from all directions, and then complementary parts representing different orientations of the lighting (successively on the fourth row of Figure 7: global then light from the right, left, bottom and top). This nice recursive extraction of the information is a direct consequence of the underapproximation constraints. Although sNMF (with the same sparsity requirement as R-NMU) is also able to extract a part-based representation with a slightly better approximation error, only two components are well-identified (left and right lighting mixed with top and bottom lighting).

## 5 Conclusion

In order to solve the NMF problem in a recursive way, we have introduced a new problem, namely Nonnegative Matrix Underapproximation (NMU), which was shown to be NP-hard using its equivalence with the maximum-edge biclique problem. The additional constraints of NMU are shown to induce sparser factors and to lead naturally to a better part-based representation of the data, while keeping a fairly good reconstruction. We proposed an algorithm based on Lagrangian relaxation to find approximate solutions to NMU.

We tested two factorization methods based on this algorithm, one with full recursion (R-NMU), the other without recursion (G-NMU), on several standard image datasets. After suitable post-processing, we observed that the factors computed by these methods indeed offer a good compromise between their achieved sparsity and the resulting approximation error, comparable or sometimes superior to that of two standard sparse matrix factorization techniques.

These two variants can be contrasted in the following way: where G-NMU mainly focuses on finding sparse factors with small reconstruction error, in the same spirit as sNMF and Hoyer, R-NMU typically computes an even sparser factorization corresponding to a better part-based representation, albeit with a moderate increase in the reconstruction error (due to the greedy approach). Moreover, this second variant is useful in situations where the factorization rank is not fixed a priori: the fact that it is recursive allows the user to stop the procedure as soon as the reconstruction error becomes satisfactory, without having to recompute a completely different solution from scratch every time a higher-rank factorization needs to be considered.

## Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments which helped improve the paper.

## References

• [1] M. Berry, M. Browne, A. Langville, P. Pauca, R. Plemmons, Algorithms and Applications for Approximate Nonnegative Matrix Factorization, Computational Statistics and Data Analysis 52 (2007) 155–173.
• [2] I. Dhillon, S. Sra, Nonnegative Matrix Approximations: Algorithms and Applications, Tech. rep., University of Texas (Austin), dept. of Computer Sciences (2006).
• [3] K. Devarajan, Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology, PLoS Computational Biology 4(7), e1000029.
• [4] N.-D. Ho, Nonnegative matrix factorization - algorithms and applications, Ph.D. thesis, Université catholique de Louvain (2008).
• [5] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126.
• [6] D. Lee, H. Seung, Learning the Parts of Objects by Nonnegative Matrix Factorization, Nature 401 (1999) 788–791.
• [7] F. Shahnaz, M. Berry, A., V. Pauca, R. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing and Management 42 (2006) 373–386.
• [8] Y. Gao, G. Church, Improving molecular cancer class discovery through sparse non-negative matrix factorization, Bioinformatics 21(21) (2005) 3970–3975.
• [9] H. Kim, H. Park, Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis, Bioinformatics 23(12) (2007) 1495–1502.
• [10] A. d’Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet, A Direct Formulation for Sparse PCA Using Semidefinite Programming, SIAM Rev. 49(3) (2007) 434–448.
• [11] P. Hoyer, Nonnegative Matrix Factorization with Sparseness Constraints, J. Machine Learning Research 5 (2004) 1457–1469.
• [12] M. Heiler, C. Schnörr, Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming, Journal of Machine Learning Research 7 (2006) 1385–1407.
• [13] S. Li, X. Hou, H. Zhang, Q. Cheng, Learning spatially localized parts-based representation, in: Proceedings of IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2001, pp. 207–212.
• [14] H. Kim, H. Park, Non-negative Matrix Factorization Based on Alternating Non-negativity Constrained Least Squares and Active Set Method, SIAM J. Matrix Anal. Appl. 30(2) (2008) 713–730.
• [15] C. Cichocki, A.-H. Phan, Fast local algorithms for large scale Nonnegative Matrix and Tensor Factorizations, IEICE Transactions on Fundamentals of Electronics Vol. E92-A No.3 (2009) 708–721.
• [16] G. Golub, C. Van Loan, Matrix Computation, 3rd Edition, The Johns Hopkins University Press Baltimore, 1996.
• [17] S. A. Vavasis, On the complexity of nonnegative matrix factorization, SIAM Journal on Optimization 20 (3) (2009) 1364–1377.
• [18] C.-J. Lin, Projected Gradient Methods for Nonnegative Matrix Factorization, Neural Computation 19 (2007) 2756–2779, MIT press.
• [19] I. Dhillon, D. Kim, S. Sra, Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation problem, in: Proceedings of SIAM Conference on Data Mining, 2007.
• [20] C. Cichocki, R. Zdunek, S. Amari, Non-negative Matrix Factorization with Quasi-Newton Optimization, Lecture Notes in Artificial Intelligence, Springer 4029 (2006) 870–879.
• [21] D. Chen, R. Plemmons, Nonnegativity Constraints in Numerical Analysis, 2007, paper presented at the Symposium on the Birth of Numerical Analysis, Leuven Belgium. To appear in the Conference Proceedings, to be published by World Scientific Press, A. Bultheel and R. Cools, Eds.
• [22] C. Cichocki, R. Zdunek, S. Amari, Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization, Lecture Notes in Computer Science, Springer 4666 (2007) 169–176.
• [23] D. Lee, H. Seung, Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing 13 (2001) 556–562.
• [24] C. Cichocki, R. Zdunek, S. Amari, Nonnegative Matrix and Tensor Factorization, IEEE Signal Processing Magazine (2008) 142–145.
• [25] N.-D. Ho, P. Van Dooren, V. Blondel, Descent methods for nonnegative matrix factorization, To appear in Numerical Linear Algebra in Signals, Systems and Control, Springer Verlag.
• [26] N. Gillis, F. Glineur, Nonnegative Factorization and The Maximum Edge Biclique Problem, CORE Discussion paper 2008/64 (2008).
• [27] B. Levin, On Calculating Maximum Rank One Underapproximations for Positive Arrays, Tech. rep., Columbia University, div. of Biostatistics (1985).
• [28] N. Gillis, Approximation et sous-approximation de matrices par factorisation positive: algorithmes, complexité et applications, Master’s thesis, Université catholique de Louvain, in French (2007).
• [29] R. Albright, J. Cox, D. Duling, A. Langville, C. Meyer, Initializations and Convergence for the Nonnegative Matrix Factorization, in: 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006.
• [30] C. Boutsidis, E. Gallopoulos, SVD based initialization: A head start for nonnegative matrix factorization, Journal of Pattern Recognition 41 (2008) 1350–1362.
• [31] J. Curry, A. Dougherty, S. Wild, Improving non-negative matrix factorizations through structured initialization, Journal of Pattern Recognition 37(11) (2004) 2217–2232.
• [32] B. Dong, M. Lin, M. Chu, Nonnegative rank factorization via rank reduction, preprint (2008).
• [33] M. Biggs, A. Ghodsi, S. Vavasis, Nonnegative Matrix Factorization via Rank-One Downdate, in: 25th international conference on machine learning (ICML), 2008.
• [34] R. Peeters, The maximum edge biclique problem is NP-complete, Discrete Applied Mathematics 131(3) (2003) 651–654.
• [35] K. M. Anstreicher, L. A. Wolsey, Two ”well-known” properties of subgradient optimization, Mathematical Programming 120(1) (2009) 213–220.
• [36] C. Ding, T. Li, M. Jordan, Convex and Semi-Nonnegative Matrix Factorizations, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (2009).
• [37] N. Shor, Minimization Methods for Non-differentiable Functions, Springer Series in Computational Mathematics, 1985.
• [38] M. Berry, N. Gillis, F. Glineur, Document Classification Using Nonnegative Matrix Factorization and Underapproximation, in: Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS), 2009, pp. 2782–2785, ISBN: 978-1-4244-3828-0.
• [39] P. Pauca, J. Piper, R. Plemmons, Nonnegative matrix factorization for spectral data analysis, Linear Algebra and its Applications 406(1) (2006) 29–47.
• [40] Q. Zhang, H. Wang, R. Plemmons, P. Pauca, Tensor methods for hyperspectral data analysis: a space object material identification study, J. Optical Soc. Amer. A 25(12) (2008) 3001–3012.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters