Using Underapproximations
for Sparse Nonnegative Matrix Factorization
Abstract
Nonnegative Matrix Factorization consists in (approximately) factorizing a nonnegative data matrix by the product of two lowrank nonnegative matrices. It has been successfully applied as a data analysis technique in numerous domains, e.g., text mining, image processing, microarray data analysis, collaborative filtering, etc.
We introduce a novel approach to solve NMF problems, based on the use of an underapproximation technique, and show its effectiveness to obtain sparse solutions. This approach, based on Lagrangian relaxation, allows the resolution of NMF problems in a recursive fashion. We also prove that the underapproximation problem is NPhard for any fixed factorization rank, using a reduction of the maximum edge biclique problem in bipartite graphs.
We test two variants of our underapproximation approach on several standard image datasets and show that they provide sparse partbased representations with low reconstruction error. Our results are comparable and sometimes superior to those obtained by two standard Sparse Nonnegative Matrix Factorization techniques.
Keywords: Nonnegative Matrix Factorization, Underapproximation, Maximum Edge Biclique Problem, Sparsity, Image Processing.
1 Introduction
Nonnegative Matrix Factorization (NMF) is a recent data analysis technique with applications in image processing, text mining, spectral unmixing, air emission control, computational biology, clustering, etc. (see [1, 2, 3, 4] and references therein). NMF can be described as follows: given a nonnegative input matrix and an integer , find two nonnegative matrices and whose product approximates the input matrix as closely as possible:
(1) 
so that is a lowrank approximation of . Matrix factorization can be interpreted as a linear factor model: assuming that each column of the input matrix represents an element of a data set, decomposition (1) can be written as^{1}^{1}1In this text, stands for the entry of a matrix , for its column and for its row.
i.e., each input column is a linear combination of a set of basis elements with corresponding weights .
In contrast with standard linear factor model techniques such as Principal Component Analysis, NMF considers nonnegativity of the input columns to be an important feature and consequently requires the basis elements to be also nonnegative, so that they can be interpreted in the same way (e.g., these columns can correspond to images described by nonnegative pixel intensities or to texts represented by vectors of nonnegative word counts). Furthermore, NMF imposes nonnegativity of the weights, leading to an essentially additive reconstruction of the input columns by the basis elements. This representation is then partbased: basis elements will represent common parts of the columns of . For example, if each column of represents a face using pixel intensities, the basis elements generated by NMF can be facial features, such as eyes, noses and lips, as shown in Figure 1.
This lowrank approximation technique with nonnegativity constraints was introduced in 1994 by Paatero and Tapper [5] and started to be extensively studied after the publication of an article by Lee and Seung [6] in 1999. Since an exact representation of the input matrix cannot be obtained in general, the quality of the approximation is measured by some criterion, typically the sum of the squares of the errors on the entries, which leads to the following minimization problem^{2}^{2}2 denotes the Frobenius norm of matrix .:
(NMF) 
An important feature of NMF is that its nonnegativity constraints typically induce sparse factors, i.e., factors with relatively many zero entries. Intuitively, decomposition into parts requires the basis elements to be sparse, cf. Figure 1. More formally, the reason for this behavior is that stationary points of NMF will be typically located at the boundary of the feasible domain , hence will feature zero components. This can be explained with the firstorder optimality conditions: because the set of stationary points of a problem of the type
is given by the following expression (where is the gradient of )
some components of the solution can be expected to be equal to zero.
Sparsity of the factors is an important consideration in practice: in addition to reducing memory requirements to store the basis elements and their weights, sparsity improves interpretation of the factors, especially when dealing with classification/clustering problems, e.g., in text mining [7] and computational biology [8, 9]. By contrast, unconstrained lowrank approximations such as Principal Component Analysis (PCA) do not naturally generate sparse factors (for that reason, lowrank approximations techniques with additional sparsity constraints have been recently introduced; this is referred to as Sparse Principal Component Analysis, Sparse PCA or SPCA, see, e.g., [10] and references therein).
Although solutions of NMF typically display some level of sparsity, some applications require even sparser solutions, leading to variants of NMF called Sparse Nonnegative Matrix Factorization. They are in general developed in two different ways: some authors define a priori a desired sparsity level and adapt the main iteration of their method in order to guarantee that the factors satisfy that level of sparsity throughout the application of the algorithm, see, e.g., [11, 12]. Alternatively, a penalty term can be added to the objective function to prevent the algorithm from considering dense solutions, see [13]. In particular, it is wellknown that norm penalty terms induce sparser solutions (see, e.g., [14, 9, 15]). More details about these techniques are given at the beginning of Section 4.
Unfortunately the advantages of NMF (partbased representation and sparsity) over PCA come at a certain price. First, because of the additional nonnegativity constraints, the approximation error of the input data for a given factorization rank will always be higher for NMF than in the unconstrained case. Second, optimization problem (NMF) is more difficult to solve than its unconstrained counterpart: while PCA problems can be solved in polynomial time (e.g., using a Singular Value Decomposition technique [16]), NMF problems belong to the class of NPhard problems, as recently showed by Vavasis [17]. However, it should also be pointed out that these drawbacks (higher error, NPhardness) are also present for competing techniques emphasizing sparsity, such as SPCA.
Because of its NPhardness, practical algorithms cannot be expected to find provably optimal global solutions for (NMF) in a reasonable amount of time and aim instead at finding locally optimal solutions. Most methods start from some initial guess factors and improve them iteratively using nonlinear optimization schemes such as projected gradient methods [18], Newtonlike methods [19, 20], (block)coordinate descent (also called alternating nonnegative least squares – NNLS) [21, 22, 14], multiplicative updates [23], etc. (see also [1, 2, 24, 25] and references therein).
In this paper, we introduce a novel approach to solve NMF problem based on the use of an underapproximation technique and show its effectiveness to obtain sparse solutions. Section 2 introduces our underapproximation problem, motivated by a recursive technique to solve NMF, studies the sparsity of its solutions and proves that it is NPhard for any fixed factorization rank. Nevertheless, Section 3 describes an algorithm to solve it approximately using a technique based on Lagrangian relaxation. Finally, in the last section, we test this approach on several standard image datasets, and show both qualitatively and quantitatively that it provides partbased and sparse representations that are comparable and sometimes superior to those obtained with standard Sparse Nonnegative Matrix Factorization techniques.
2 Nonnegative Matrix Underapproximation
2.1 A recursive approach
Finding a rankone nonnegative matrix factorization, i.e., solving (NMF) with is notably easier than for higher factorization ranks: while the general problem is NPhard, computing a globally optimal rankone approximation can be done in polynomial time. More specifically, the first rankone factor of the Singular Value Decomposition (SVD) of the input matrix is an optimal solution: indeed, the PerronFrobenius theorem implies that the dominant left and right singular vectors of a nonnegative matrix are nonnegative, while the EckartYoung theorem states that the outer product of these dominant singular vectors is the best possible rankone approximation in the Frobenius norm.
In principle, we might try exploit this result to find factorizations of higher ranks by applying it recursively: after identification of an optimal rankone NMF solution , one could subtract the factor from and apply the same technique to to recover the next rankone factor. Unfortunately, this idea cannot work: the difference between and its rankone approximation may contain negative values (typically roughly half of them), so that the next SVD factor will no longer provide a nonnegative solution. Moreover, there is no hope of replacing SVD by another efficient technique for this step since [26] shows that it is NPhard to find the optimal nonnegative rankone approximation to a matrix which is not nonnegative.
If we wish to keep the principle of a recursive algorithm finding one rankone factor at a time, we have to add a constraint ensuring that the factor, when subtracted from , gives a nonnegative remainder, i.e., we need to have . Therefore we introduce a similar upper bound constraint to the general (NMF) problem and obtain a new problem we call Nonnegative Matrix Underapproximation (NMU): given and , the NMU optimization problem is defined as
(NMU) 
Assuming we are able to solve it for , an underapproximation of any rank can then be built by following the recursive procedure outlined above. More precisely, if is a rankone underapproximation for , i.e., and , we have that is nonnegative. can then be underapproximated , leading to , and so on. After steps, we get an underapproximation of rank
Besides enabling this recursive procedure, we notice that NMU leads to a more localized partbased decomposition, in the sense that different basis elements tend to describe disjoint parts of the input data (i.e., involving different nonzero entries). This is a consequence of the underapproximation constraints which impose the extracted parts (the basis elements ) to really be common features of the columns of since
Basis elements can only be combined to approximate a column of if each of them represents a part of this column, i.e., none of the parts selected with a positive weight can involve a nonzero entry corresponding to a zero entry in the input column . The following example demonstrates this behavior.
Example 1 (Swimmer Database).
The swimmer image dataset consists of 256 binary images of a body with 4 limbs which can be each in 4 different positions. NMF is expected to find a partbased decomposition of these images, i.e., isolate different constitutive parts of the images (the body and the limbs) in each of its basis elements.
Figure 2 displays a sample of such images along with the basis elements obtained with NMF and NMU. While NMF elements are rather sparse, they are mixtures of several limbs. By contrast, NMU returns a even sparser solution and is able to extract a single body part for each of its elements.
2.2 Sparsity
The fact that NMU decompositions naturally generate sparser solutions than NMF can be explained as follows: since the zero entries of can only be underapproximated by zeros, we have
which shows that when the input matrix is sparse, many components of the NMU factors will have to be equal to zero. This observation can be made more formal: defining the sparsity of a by matrix as the proportion of its zero entries, i.e.,
we have the following theorem relating sparsity of and its NMU factors.
Theorem 1.
For any nonnegative rankone underapproximation of we have
Proof.
For a rankone matrix , the number of nonzeros is exactly equal to the product of the number of nonzeros in vectors and . Therefore we have that which implies . Since underapproximation satisfies , it must have more zeros than and we have
proving our claim. ∎
Recall the recursive definition of the residuals and . The following corollary relates their sparsity and the sparsity of the whole rank approximation with that of the NMU factors.
Corollary 1.
For any nonnegative underapproximation of we have for each factor
and .
Proof.
We have , which implies by the previous theorem the first set of inequalities. Observing that and is sufficient to prove the second inequality. ∎
Sparsity of the residuals is monotonically nondecreasing at each step, since . Moreover, the following theorem can guarantee an increase in sparsity at each step.
Theorem 2.
For any locally optimal nonnegative rankone underapproximation of , define sets and (supports of vectors and ) by
and define matrix to be the submatrix of residual whose row and column indices belong respectively to and (corresponding to the submatrix of that is not approximated by zeros). Then there is at least one zero in each row and each column of submatrix .
Proof.
Simply observe that if (resp. ) for some (resp. ), (resp. ) can be increased to obtain a strictly better solution, which contradicts the local optimality assumption. ∎
This ability of NMU to generate sparse partbased decomposition will be experimentally confirmed in Section 4.
2.3 Related work
The problem of rankone underapproximation has been first introduced by Levin in [27] in the case of positive stochastic matrices. He introduced a specific objective function different from the Frobenius norm and used a logarithmic change of variables in order to design an iterative method based on the corresponding optimality conditions.
In [28], the rankone underapproximation problem is cast as a convex problem (hence efficiently solvable) using again different objective functions. Solutions are then used to initialize standard NMF algorithms in order to accelerate their convergence and, in general, find better final solutions as compared to those obtained with random initializations. Similar behavior was observed for other judicious initializations in [29, 30, 31].
More recently, Dong et al. [32] studied the same problem with the additional constraint that the rank of the residual must be strictly smaller than the rank of the factorized matrix. Using the Wedderburn rank reduction formula, they proposed a numerical procedure which is able to compute the maximum rank splitting of a nonnegative matrix. However, the underlying optimization problem is NPhard [17] and their algorithm is not guaranteed to find a solution in all cases.
Biggs et al. [33] also introduced a recursive procedure to solve NMF problems: their idea is to locate and then approximate nearly rankone submatrices of . However, the problem of locating maximum rankone submatrices is also shown to be NPhard, and their algorithm is not globally optimal.
2.4 Complexity
We now prove that (NMU) is NPhard, even in the rankone case (unlike (NMF), which is polynomially solvable in the rankone case). In order to do this, the rankone version of the problem is proved to be equivalent to the biclique problem, which is NPhard. The result is then generalized to (NMU) with arbitrary factorization rank using a simple construction.
A bipartite graph is a graph whose vertices can be divided into two disjoint sets such that there is no edge between two vertices in the same set. A biclique is a complete bipartite graph, i.e., a bipartite graph where all the vertices from different sets are connected by an edge. Finally, the socalled maximum edge biclique problem (the biclique problem for short) in a bipartite graph
is the problem of finding a biclique in (i.e., and ) with a maximum number of edges .
Letting be the adjacency matrix of with and , i.e.,
and introducing indicator binary variables (resp. ) to denote whether (resp. ) belongs to the biclique , the Maximum edge Biclique Problem (MBP) in a bipartite graph can be formulated as follows
(MBP)  
One can check that this objective is equivalent to . In fact, since , and are binary and .
The corresponding decision problem “Given , does contain a biclique with at least edges?” has been shown to be NPcomplete [34]. Therefore, the corresponding optimization problem (MBP) is at least NPhard.
For , (NMU) can be written as
(NMU1)  
which is very close to (MBP): the difference is that vectors and are required to be binary for (MBP) and nonnegative for (NMU1). The next lemma proves that the two problems are actually equivalent.
Lemma 1.
Proof.
For , this is trivial. Otherwise, suppose is an optimal solution of (NMU1). Let define as
and analyze the different possibilities: as , we have either

and ;

and ;

.
Therefore, which implies
By optimality of , we must have . Therefore, is an optimal binary solution of (NMU1) which is then also an optimal solution of (MBP) (note that we must have ). ∎
Corollary 2.
(NMU1) is NPhard.
Theorem 3.
(NMU) is NPhard.
Proof.
Let be the adjacency matrix of a bipartite graph . We define the matrix as
which is the adjacency matrix of another bipartite graph which is the graph repeated times. Let be an optimal solution of (NMU). Since , we have . Therefore is a feasible solution of (NMU1) for the matrix , i.e., for the graph . Hence, each corresponds to a biclique of with
By optimality of and since there are at least independent maximum biclique in , each must coincide with a maximum biclique of which corresponds to a maximum biclique of . This is due to the fact that, because is the graph repeated times, a biclique clearly cannot span two disjoint subgraphs of . Therefore, (NMU) is NPhard since any instance of (MBP) can be polynomially reduced to an instance of (NMU). ∎
3 An algorithm for NMU based on Lagrangian relaxation
Since (NMU), like (NMF), is a NPhard problem, we can not expect to solve it up to guaranteed global optimality in a reasonable (e.g., polynomial) computational time (unless ). In this section, we propose a nonlinear optimization scheme based on Lagrangian relaxation in order to compute approximate solutions of (NMU).
Drop the underapproximation constraints of (NMU) and add them into the objective function with the corresponding Lagrange multipliers (dual variables, forming a matrix) , to obtain the Lagrangian function
where a factor of was introduced to make the presentation nicer. The Lagrangian relaxation subproblem LR consists in minimizing for a fixed value of the multipliers, leading to the corresponding Lagrangian dual function
(LR) 
where is welldefined because the minimum of is always attained, due to the fact that is bounded below and the search space can be restricted to a compact set. Indeed, considering each rankone factor individually and imposing w.l.o.g. , we have
where we have used the trivial solution to bound (cf. derivations of Section 3.1).
Standard application of Lagrangian duality tells us that
where the problem on the left of the inequality is equivalent to our original NMU formulation and the problem on the right is its Lagrangian dual, whose solution will provide a (hopefully tight) lower bound on the optimal (NMU). This new problem is a nondifferentiable optimization problem with the nice property that its objective is concave and its maximization (over a convex set) is then a convex problem (see [35] and references therein).
We describe in the next section a general solution technique, which consists in repeatedly applying the following two steps:
 1.
 2.

Given solution , update multipliers ; this is described in Section 3.2.
3.1 Solving the Lagrangian relaxation problem
The following derivations
show that minimizing for a fixed is equivalent to minimizing . Matrix is not necessarily nonnegative, therefore finding and such that is a more general problem than NMF. It is actually studied in detail in [26] (see also [36]) where it is called Nonnegative Factorization (NF), is formulated as
(NF) 
with and and is shown to be NPhard for any factorization rank (including ).
Some standard algorithms for NMF can easily adapted to handle an input matrix that is not nonnegative, i.e., solve a NF problem. For this work, we decided to use a recent technique called Hierarchical Alternating Least Squares (HALS), proposed in [22], which alternatively updates each column of and each row of with the following optimal closedform solutions:
(2)  
with and , and
(3)  
with and . This can be viewed as a simple method of (block)coordinate descent (also called alternating variables), which has been shown to perform strikingly well in practice, and much better than the popular multiplicative updates of Lee and Seung (see [4, 15, 26]). Under some mild assumptions, every limit point of the above alternating scheme is a stationary point [4, 26].
The main computational cost of one HALS iteration is the evaluation of and : they each require (floating point) operations. One can check that the resulting total number of operations is .
Remark 1.
HALS is sensitive to the scaling of the initial matrices. For example, if the initial matrices and are chosen such that , optimal columns of and optimal rows of computed by formulas (2) and (3) at the first step will most likely be equal to zero. This will lead to rank deficient approximations ( for some ) and numerical problems (for , update of is not well defined and vice versa). If the initial matrices are scaled [26], i.e., by ensuring that
(4) 
where , this behavior is in general avoided. All initial matrices used in the following have been scaled.
3.2 Update of the multipliers
The second step of our algorithm consists in updating in order to find better (i.e., higher) solutions to the Lagrangian dual problem. Using the knowledge that any optimal solution of the Lagrangian dual must satisfy the following complementarity slackness conditions
we see that the update rule for the multipliers should satisfy the following:

if , should be decreased and eventually reach zero if ,

if , should be increased to give more importance to in the cost function, hopefully in order to get a feasible solution such that .
In the sequel, we use the following rule to update , which satisfies the above requirements:
where is a predefined sequence of step lengths decreasing to zero; can be initialized to zero. This update is inspired from the concept of subgradient methods [37]; in fact, one can easily check that the quantity is a subgradient of
with respect to , i.e., if , we have
Two questions now arise

Since an iterative algorithm is used to solve (approximately) the Lagrangian relaxation problem (cf. section 3.1), after how many of these HALS iterations do we stop and proceed to update the multipliers ?

How do we choose the sequence of step lengths ?
Subgradient methods usually assume that the Lagrangian relaxation problem (LR) can be solved exactly and can guarantee their convergence to an optimal solution provided an appropriate sequence of step sizes is selected (see, e.g., [35]), for example satisfying the conditions
In the sequel, we choose to use , which is such a suitable sequence. However, in our case, we cannot expect to solve (LR) in a reasonable amount of time since the problem is NPhard. It would even probably be too expensive to wait for the stabilization of (e.g., getting close to a stationary but not necessarily optimal point). We therefore suggest to update only a constant number of times between each update of , which leads to Algorithm LNMU. Note that because we do not solve (LR) exactly, Algorithm LNMU is not guaranteed to converge to an optimal solution of the Lagrangian dual but, as we will see, it produces satisfactory solutions in practice.
The additional computational cost of one iteration of algorithm LNMU when compared with one iteration of HALS for NMF consists in the computation of (needed in step 3) and the update of (at step 4), which require operations (and, in the special case , operations).
Remark 2.
Because convergence is not theoretically guaranteed, Algorithm LNMU may end up with solutions that do not completely satisfy the underapproximation constraint. Although our numerical experiments show that this has no detrimental influence on the quality of the obtained sparse partbased representations (see Section 4), we give here a simple technique to transform such a solution into a feasible solution. Indeed, it is enough to consider the following QP problem (convex quadratic objective function, linear inequality constraints) which only involves the factor
(5) 
Because of its convexity, this problem can be solved up to global optimality in a very efficient manner, and replacing the original factor by the optimal solution leads to a feasible solution to (NMU).
Remark 3.
Because update rule (5) is exact and computable in practice, it would be natural to consider a simpler algorithm based on its alternative application to the and factors, without using the Lagrangian relaxation technique, hoping to converge to a solution of (NMU). Unfortunately, we observed that this is quite inefficient in practice. In fact,

since the underapproximation constraint is imposed at each step, this algorithm has much less freedom to converge to good solutions: iterates rapidly get stuck on the boundary of the feasible domain, typically with (too) many zeros and a lower rank. For example, assuming has one zero in each column, we have that for any positive matrix the corresponding optimal is equal to 0:
Therefore, such an algorithm can only work if we decide a priori which values in and should be equal to zero, i.e. if we find a good sparsity pattern for the solution, which is precisely where the difficulty of the problem lies. Note that the same behavior is observed if a HALStype algorithm is used instead of (5) (i.e., updating columns of and rows of alternatively): after the update of one column of , the residual will have one zero in each row (cf. Theorem 2) which will prevent the other columns of to be nonzero (except if the sparsity pattern is chosen a priori).
Remark 4.
The LNMU algorithm described above is not particularly wellsuited to deal with very sparse input matrices. In fact, one has to store a potentially dense matrix with the Lagrangian variables . Nevertheless, Berry et al. [38] have obtained encouraging results when applying NMU to sparse anomaly detection problems in text mining. Moreover, it is possible to take advantage of the input sparsity pattern and design a computationally cheaper method. First note that the Lagrangian variables associated with a zero of will be nondecreasing in the course of the algorithm, since
where superscript denotes the solution at step . Therefore one can significantly reduce the computational cost by defining
where is an arbitrary positive nondecreasing function, e.g., with , as is implicitly done in [26].
4 Numerical tests on image datasets
We have argued in Section 2 that NMU is potentially able to extract a better partbased representation of the data and that its factors should be sparser than those of the standard NMF, at the detriment of the approximation error. In this section, we support these claims by reporting results of computational experiments involving two variants of Algorithm LNMU on several image datasets.
A direct comparison between NMU and NMF is not very informative in itself: while the former will provide a sparser partbased representation, the latter will feature a lower approximation error. This does not really tell us whether the improvements in the partbased representation and sparsity are worth the increase in approximation error. For that reason, we chose to compare NMU with two other sparse nonnegative matrix factorizations techniques, described below, in order to better assess whether the increase in sparsity achieved by NMU is worth the loss in reconstruction accuracy.
4.1 Sparse NMF
We selected and tested the following two sparse nonnegative matrix factorization techniques that are frequently used in the literature.

Hoyer describes in [11] an algorithm relying on additional explicit sparsity constraints on the factors, enforced at each iteration by means of a projection. The approximation error is reduced via a combination of projected gradient and multiplicative updates. For our experiments, we use the MATLAB code provided by the author^{3}^{3}3This code was downloaded from http://www.cs.helsinki.fi/u/phoyer/software.html..
It should be pointed out that Hoyer is using a different definition of sparsity: for any nonzero dimensional vector , his measure of sparsity is defined as
(6) Hence, a vector with a single nonzero entry is perfectly sparse
while a vector with all entries equal to each other is completely dense
In our experiments, we report sparsity using both the standard indicator and Hoyer’s measure.

Instead of enforcing sparsity at every iteration, a sparsityinducing penalty term can be introduced in the objective function [13]. In particular, it is wellknown that adding norm penalty terms induce sparser solutions (see, e.g., [14, 9, 15]), and we therefore solve the following problem:
(sNMF) where and and are two positive parameters controlling the sparsity of and . In order to solve (sNMF), we use the HALS algorithm which can easily be adapted to handle the additional norm penalty terms (see, e.g., [4, 15]). This algorithm will be referred to as sNMF.
Technical details for the first technique are more complicated, but it allows the sparsity of the factors to be chosen a priori. The second technique is conceptually simpler but requires the determination of appropriate penalizing parameters by other means.
4.2 Tested algorithms
Algorithm LNMU proposed in Section 3 can be used to compute underapproximations for any given factorization rank . This opens the possibility of building a rank underapproximation in several different ways: one simple option consists in applying algorithm LNMU directly to the rank problem – we call this method global NMU (GNMU). Another option consists in applying the recursive technique outlined in the introduction, used to motivate the introduction of underapproximations. More specifically, this means running algorithm LNMU successively times to compute rankone approximations, subtracting each approximation from the input matrix before computing the next one – we call this method recursive NMU (RNMU). Note that many other variants are possible (e.g., computing two rank approximations, computing successive ranktwo approximations, etc.) but we only tested the two abovementioned variants, which represent two extreme cases (no recursion and maximum recursion).
In both cases, our implementation of algorithm LNMU computes two HALS steps between each update of the multipliers (i.e., we fixed ). Most of the computational work done in one iteration of LNMU consists in computing , performing the two HALS steps and updating ; more specifically, one can estimate the computational cost of one iteration of GNMU to operations, while an RNMU iteration takes operations (repeated times in the recursive procedure).
For each dataset, we test five nonnegative factorization algorithms: NMF based on HALS updates (NMF), global NMU (GNMU), recursive NMU (RNMU), sparse NMF with penalty terms (sNMF) and the algorithm of Hoyer. We also report the results of a standard Principal Component Analysis (PCA) to serve as a reference (recall that the approximation error of this unconstrained lowrank approximation, computed here with a singular value decomposition, is globally minimal for the given rank, but that its factors are neither nonnegative, nor sparse).
4.3 Iteration limits and CPU time
Each of the five iterative algorithms described above requires a limit on the number of its iterations; these limits were chosen in order to roughly allocate the same CPU time to each algorithm. More specifically, the standard NMF was given a 600iterations limit, which corresponds to the computation of 600 HALS updates. The sparse sNMF, based on a slightly modified HALS update, was also allowed 600 iterations. Because a HALS update involves operations, we can deduce the following iteration budgets for GNMU and RNMU from the leading terms in their corresponding operation counts: GNMU is allowed LNMU iterations while RNMU can take iterations.
An exception to the equal CPU time rule was made for the algorithm of Hoyer. Results obtained after an amount of CPU time similar to that of the other algorithms were too poor to be compared in a meaningful way. Indeed, because this method is based on a projected gradient method and multiplicative updates (both operations per iteration), which are known to converge at a typically much slower rate, a relatively high limit of 1000 iterations had to be fixed, although the resulting CPU time is then much larger than for the other methods (for example, on the CBCL dataset, 600 iterations of HALS took while 1000 iterations of the algorithm of Hoyer needed ).
4.4 Testing methodology
Recall we decided to test algorithms sNMF and Hoyer to assess the quality of the sparsityaccuracy compromise proposed by our NMU approaches. To achieve this, we decided to pit each NMU variant against a solution of sNMF/Hoyer featuring the same level sparsity, and compare the resulting approximation errors. We therefore report results for algorithms on each dataset: PCA, NMF, GNMU, sNMF with the same sparsity as GNMU, which we denote by sNMF{GNMU}, Hoyer{GNMU}, RNMU, sNMF{RNMU} and Hoyer{RNMU}.
In order to enforce a sparsity similar to the NMU solution in Hoyer’s code, we compute the measure of the NMU factors and input it as a parameter of the method (see description in subsection 4.1); note however that we could only enforce this for the sparsest of the two NMU factors^{4}^{4}4Ideally, we would have imposed sparsity for both factors, but the implementation we used seemed to return poor results in that situation.. In the case of sNMF, sparsity cannot be directly controlled, and penalty parameters are found using the following adaptive procedure, which proved to work well in practice: and are initialized to 0.1 and, after each iteration, (resp. ) is increased by 5 percent if (resp. ) is below the target sparsity, and is decreased by 5 percent otherwise.
All algorithms were run times with the same initial random matrices and only the best solution with respect to the Frobenius norm of the error is reported. When testing with graylevel images, the input matrices where normalized to have their entries varying between 0 and 1, with 0 representing white and 1 representing black (when trying to decompose as a sum of parts, this make more sense than the opposite convention, since the dark regions are the constitutive parts of the objects in the image datasets we analyze). Finally, before computing reported sparsity measures of the factors, any sufficiently small^{5}^{5}5We declare an entry of a factor to be sufficiently small if it is less than of the largest entry in its column. entry is rounded to zero (indeed, because algorithms are stopped by the iteration limit before convergence, true zeros are typically not all reached). All tests were run within the MATLAB 7.1 (R14) version, on a 3GHz Intel Core™2 Dual CPU PC.
4.5 CBCL Face Database
The CBCL face image dataset was used for the illustrative example of Figure 1 and is made of 2429 graylevel images of faces represented with pixels. We look for an approximation of rank .
Figure 3 displays the basis elements for NMF, GNMU, RNMU and sNMF{GNMU} (which was the best solution obtained in term of sparsity vs. error among all four sNMF and Hoyer variants). Both GNMU, RNMU and sNMF achieve a better partbased representation than NMF, generating sparser solutions. An interesting feature of RNMU is that it extracts parts successively in order of “importance”: the first basis element is a ’mean’ face (which is dense) while the next ones describe different complementary parts (which become sparser as the recursion moves on, cf. Corollary 1 and Theorem 2).
A more quantitative assessment is provided in Table 1, reporting for the algorithms tested the relative error (in percent) of their solutions
in the second column (“Plain”) and the corresponding sparsity measures (in percent) of factors and in the last four columns.
Plain  Improved  

PCA  7.43  7.43  0  0  22  22 
NMF  8.12  8.11  56  11  66  22 
GNMU  12.45  8.76  74  14  74  21 
sNMF{GNMU}  8.68  8.44  74  14  74  30 
Hoyer{GNMU}  9.33  8.78  69  6  73  16 
RNMU  16.42  10.89  53  52  63  64 
sNMF{RNMU}  10.23  9.49  50  50  56  57 
Hoyer{RNMU}  8.83  8.56  54  12  64  22 
As expected, PCA returns the smallest error, albeit with very dense factors. NMF already features much sparser factors (slightly half of the entries in are equal to zero), at the cost of a relatively modest increase in the approximation error (). GNMU provides an even sparser solution (three quarters of zero entries), increasing again the approximation error (). The factors recursively computed by RNMU are in comparison not as sparse: as explained above, this is because RNMU focuses on obtained representative parts, including relatively dense ones for the first few steps of the recursion. However, it features a much sparse weight vectors, giving again more credit to the hypothesis that better parts are extracted. The corresponding approximation error is higher than for other methods, because the intrinsically greedy approach taken by RNMU is not as efficient as a method that optimizes all the factors simultaneously.
Is the increased sparsity provided by GNMU worth the increase in approximation error ? Looking at the corresponding results for sNMU{GNMU} and Hoyer{GNMU}, i.e., for sparse NMF and Hoyer’s algorithms with a similar target sparsity, it might seem at first that the answer is negative: the other methods return solutions with similar number of nonzeros (slightly higher for Hoyer) and a lower approximation error ( and instead of ). Actually, this was expected: because it tries to return an underapproximation, i.e., factors such that , the entries in the error term are mostly nonnegative, while the other techniques, with no underapproximation constraint, obtain a smaller norm of the error by choosing the entries of to be roughly half negative, half positive. It is therefore not completely fair to compare directly the error of the NMU approach to the other techniques.
In order to compensate for this, a simple rescaling could be used, i.e., multiplying by a scalar since (cf. Equation (4)). However, we chose a different procedure that has the advantage of benefiting all algorithms, including those whose error was not suffering from the underapproximation constraint. Once a solution is computed by one of the eight algorithms, we fix the zero entries of and and optimize the approximation error, i.e., , on the remaining (nonzero) entries (again, HALS can easily be adapted to handle this situation). In essence, this allows us to compare the sparsity patterns of the different solutions. We perform 100 additional HALS steps on each solution, and report the new relative approximation error in the third column of Table 1 (“Improved”). Note that sNMF and Hoyer’s errors are also improved by this procedure; this can be explained by the fact that they were also not directly trying to minimize the approximation error (Hoyer had to take into account its sparsity constraint, and sNMF was influenced by the penalty terms added to the approximation error).
Looking now at the NMU solutions in a fairer comparison, we observe that their approximation error becomes very close to that of sNMF and Hoyer, in particular for GNMU, and not very far from the denser NMF, so that we can conclude that the sparsityapproximation error compromise it offers is worthwhile.
4.6 Swimmer Database
For the swimmer image dataset described in Example 1 (256 images with pixels), the basis elements obtained with the different algorithms are displayed on Figure 4 and the corresponding approximation errors and sparsity measures are reported in Table 2.
Error  Plain  Improved  

PCA  37.98  37.98  77*  0  67  17 
NMF  40.41  40.41  84  45  73  67 
GNMU  47.70  46.85  94  75  85  78 
sNMF{GNMU}  50.52  42.04  89  66  84  73 
Hoyer{GNMU}  42.04  41.91  90  45  80  63 
RNMU  50.92  50.71  98  66  93  65 
sNMF{RNMU}  41.66  41.17  85  66  80  79 
Hoyer{RNMU}**  /  /  /  /  /  / 
As mentioned earlier, our two NMU algorithms are the only methods able to extract truly independent parts, while NMF and sNMF generate a combination of them. Note however that the solution generated by sNMF bears some similarity to the one of GNMU.
4.7 Hubble Space Telescope Spectral Images
The next image dataset consists of 100 spectral images ( pixels) of the Hubble telescope at different frequencies [39, 40], see Figure 5. With the choice , NMF generates a nearly exact factorization (relative error ), because the spectral reflectance of the Hubble telescope results from the additive linear combination of the reflectance of eight constitutive materials. Figure 6 and Table 3 provide the visual and computational results for this dataset.
Error  Plain  Improved  

PCA  0.01  0.01  57  0  62  25 
NMF  0.29  0.29  64  5  57  35 
GNMU  0.52  0.11  64  4  60  31 
RNMU  3.74  1.37  79  30  71  62 
sNMF{RNMU}  0.48  0.37  73  28  66  64 
Hoyer{RNMU}  0.77  0.68  75  0  71  12 
Because NMF is already a nearly exact reconstruction (Table 3), the NMU constraints are somehow redundant: NMF and GNMU are basically equivalent and return solutions with very similar sparsity measures (albeit with a slightly lower error for GNMU). For that reason, sNMF{GNMU} and Hoyer{GNMU} return results nearly identical to NMF and are omitted from the table.
Recursive RNMU extracts parts in order of importance: first, a global picture of the telescope and then its different constitutive parts. This allows it to generate the sparsest solution, with several basis elements representing welldelimited constitutive parts of the telescope not identified by the other methods.
4.8 Kuls Illuminated Faces
A static scene was illuminated from many directions with a moving light source to produce the Kuls image dataset^{6}^{6}6Available at http://www.robots.ox.ac.uk/~amb/.. It consists of 20 images ( pixels) of a face. Because the images are very similar, most of the information (more than 70 percent) can be expressed with only one factor. The remaining information resides in the different orientations of the lighting. Computational and visual results for a rank5 factorization are given by Table 4 and Figure 7. We observe that NMF and GNMU obtain similar results: even though they are both able to extract several faces with different lighting orientations, they do not extract a sparse and partbased representation.
RNMU first extracts a face illuminated from all directions, and then complementary parts representing different orientations of the lighting (successively on the fourth row of Figure 7: global then light from the right, left, bottom and top). This nice recursive extraction of the information is a direct consequence of the underapproximation constraints. Although sNMF (with the same sparsity requirement as RNMU) is also able to extract a partbased representation with a slightly better approximation error, only two components are wellidentified (left and right lighting mixed with top and bottom lighting).
Error  Plain  Scaled  Improved  

PCA  4.36  4.36  4.36  0  0  23  15 
NMF  4.38  4.38  4.38  1  7  9  38 
GNMU  6.27  5.77  4.49  3  20  8  48 
sNMF{GNMU}  4.42  4.42  4.41  2  20  8  47 
Hoyer{GNMU}  4.60  4.60  4.71  2  25  8  53 
RNMU  8.13  7.84  5.73  29  31  38  67 
sNMF{RNMU}  5.24  5.24  5.01  29  31  32  59 
Hoyer{RNMU}  6.82  6.82  6.54  0  71  6  92 
5 Conclusion
In order to solve the NMF problem in a recursive way, we have introduced a new problem, namely Nonnegative Matrix Underapproximation (NMU), which was shown to be NPhard using its equivalence with the maximumedge biclique problem. The additional constraints of NMU are shown to induce sparser factors and to lead naturally to a better partbased representation of the data, while keeping a fairly good reconstruction. We proposed an algorithm based on Lagrangian relaxation to find approximate solutions to NMU.
We tested two factorization methods based on this algorithm, one with full recursion (RNMU), the other without recursion (GNMU), on several standard image datasets. After suitable postprocessing, we observed that the factors computed by these methods indeed offer a good compromise between their achieved sparsity and the resulting approximation error, comparable or sometimes superior to that of two standard sparse matrix factorization techniques.
These two variants can be contrasted in the following way: where GNMU mainly focuses on finding sparse factors with small reconstruction error, in the same spirit as sNMF and Hoyer, RNMU typically computes an even sparser factorization corresponding to a better partbased representation, albeit with a moderate increase in the reconstruction error (due to the greedy approach). Moreover, this second variant is useful in situations where the factorization rank is not fixed a priori: the fact that it is recursive allows the user to stop the procedure as soon as the reconstruction error becomes satisfactory, without having to recompute a completely different solution from scratch every time a higherrank factorization needs to be considered.
Acknowledgments
The authors would like to thank the anonymous reviewers for their insightful comments which helped improve the paper.
References
 [1] M. Berry, M. Browne, A. Langville, P. Pauca, R. Plemmons, Algorithms and Applications for Approximate Nonnegative Matrix Factorization, Computational Statistics and Data Analysis 52 (2007) 155–173.
 [2] I. Dhillon, S. Sra, Nonnegative Matrix Approximations: Algorithms and Applications, Tech. rep., University of Texas (Austin), dept. of Computer Sciences (2006).
 [3] K. Devarajan, Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology, PLoS Computational Biology 4(7), e1000029.
 [4] N.D. Ho, Nonnegative matrix factorization  algorithms and applications, Ph.D. thesis, Université catholique de Louvain (2008).
 [5] P. Paatero, U. Tapper, Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126.
 [6] D. Lee, H. Seung, Learning the Parts of Objects by Nonnegative Matrix Factorization, Nature 401 (1999) 788–791.
 [7] F. Shahnaz, M. Berry, A., V. Pauca, R. Plemmons, Document clustering using nonnegative matrix factorization, Information Processing and Management 42 (2006) 373–386.
 [8] Y. Gao, G. Church, Improving molecular cancer class discovery through sparse nonnegative matrix factorization, Bioinformatics 21(21) (2005) 3970–3975.
 [9] H. Kim, H. Park, Sparse nonnegative matrix factorizations via alternating nonnegativityconstrained least squares for microarray data analysis, Bioinformatics 23(12) (2007) 1495–1502.
 [10] A. d’Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet, A Direct Formulation for Sparse PCA Using Semidefinite Programming, SIAM Rev. 49(3) (2007) 434–448.
 [11] P. Hoyer, Nonnegative Matrix Factorization with Sparseness Constraints, J. Machine Learning Research 5 (2004) 1457–1469.
 [12] M. Heiler, C. Schnörr, Learning Sparse Representations by NonNegative Matrix Factorization and Sequential Cone Programming, Journal of Machine Learning Research 7 (2006) 1385–1407.
 [13] S. Li, X. Hou, H. Zhang, Q. Cheng, Learning spatially localized partsbased representation, in: Proceedings of IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2001, pp. 207–212.
 [14] H. Kim, H. Park, Nonnegative Matrix Factorization Based on Alternating Nonnegativity Constrained Least Squares and Active Set Method, SIAM J. Matrix Anal. Appl. 30(2) (2008) 713–730.
 [15] C. Cichocki, A.H. Phan, Fast local algorithms for large scale Nonnegative Matrix and Tensor Factorizations, IEICE Transactions on Fundamentals of Electronics Vol. E92A No.3 (2009) 708–721.
 [16] G. Golub, C. Van Loan, Matrix Computation, 3rd Edition, The Johns Hopkins University Press Baltimore, 1996.
 [17] S. A. Vavasis, On the complexity of nonnegative matrix factorization, SIAM Journal on Optimization 20 (3) (2009) 1364–1377.
 [18] C.J. Lin, Projected Gradient Methods for Nonnegative Matrix Factorization, Neural Computation 19 (2007) 2756–2779, MIT press.
 [19] I. Dhillon, D. Kim, S. Sra, Fast Newtontype Methods for the Least Squares Nonnegative Matrix Approximation problem, in: Proceedings of SIAM Conference on Data Mining, 2007.
 [20] C. Cichocki, R. Zdunek, S. Amari, Nonnegative Matrix Factorization with QuasiNewton Optimization, Lecture Notes in Artificial Intelligence, Springer 4029 (2006) 870–879.
 [21] D. Chen, R. Plemmons, Nonnegativity Constraints in Numerical Analysis, 2007, paper presented at the Symposium on the Birth of Numerical Analysis, Leuven Belgium. To appear in the Conference Proceedings, to be published by World Scientific Press, A. Bultheel and R. Cools, Eds.
 [22] C. Cichocki, R. Zdunek, S. Amari, Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization, Lecture Notes in Computer Science, Springer 4666 (2007) 169–176.
 [23] D. Lee, H. Seung, Algorithms for Nonnegative Matrix Factorization, Advances in Neural Information Processing 13 (2001) 556–562.
 [24] C. Cichocki, R. Zdunek, S. Amari, Nonnegative Matrix and Tensor Factorization, IEEE Signal Processing Magazine (2008) 142–145.
 [25] N.D. Ho, P. Van Dooren, V. Blondel, Descent methods for nonnegative matrix factorization, To appear in Numerical Linear Algebra in Signals, Systems and Control, Springer Verlag.
 [26] N. Gillis, F. Glineur, Nonnegative Factorization and The Maximum Edge Biclique Problem, CORE Discussion paper 2008/64 (2008).
 [27] B. Levin, On Calculating Maximum Rank One Underapproximations for Positive Arrays, Tech. rep., Columbia University, div. of Biostatistics (1985).
 [28] N. Gillis, Approximation et sousapproximation de matrices par factorisation positive: algorithmes, complexité et applications, Master’s thesis, Université catholique de Louvain, in French (2007).
 [29] R. Albright, J. Cox, D. Duling, A. Langville, C. Meyer, Initializations and Convergence for the Nonnegative Matrix Factorization, in: 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006.
 [30] C. Boutsidis, E. Gallopoulos, SVD based initialization: A head start for nonnegative matrix factorization, Journal of Pattern Recognition 41 (2008) 1350–1362.
 [31] J. Curry, A. Dougherty, S. Wild, Improving nonnegative matrix factorizations through structured initialization, Journal of Pattern Recognition 37(11) (2004) 2217–2232.
 [32] B. Dong, M. Lin, M. Chu, Nonnegative rank factorization via rank reduction, preprint (2008).
 [33] M. Biggs, A. Ghodsi, S. Vavasis, Nonnegative Matrix Factorization via RankOne Downdate, in: 25th international conference on machine learning (ICML), 2008.
 [34] R. Peeters, The maximum edge biclique problem is NPcomplete, Discrete Applied Mathematics 131(3) (2003) 651–654.
 [35] K. M. Anstreicher, L. A. Wolsey, Two ”wellknown” properties of subgradient optimization, Mathematical Programming 120(1) (2009) 213–220.
 [36] C. Ding, T. Li, M. Jordan, Convex and SemiNonnegative Matrix Factorizations, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (2009).
 [37] N. Shor, Minimization Methods for Nondifferentiable Functions, Springer Series in Computational Mathematics, 1985.
 [38] M. Berry, N. Gillis, F. Glineur, Document Classification Using Nonnegative Matrix Factorization and Underapproximation, in: Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS), 2009, pp. 2782–2785, ISBN: 9781424438280.
 [39] P. Pauca, J. Piper, R. Plemmons, Nonnegative matrix factorization for spectral data analysis, Linear Algebra and its Applications 406(1) (2006) 29–47.
 [40] Q. Zhang, H. Wang, R. Plemmons, P. Pauca, Tensor methods for hyperspectral data analysis: a space object material identification study, J. Optical Soc. Amer. A 25(12) (2008) 3001–3012.