Block Coordinate Descent for Sparse NMF
Nonnegative matrix factorization (NMF) has become a ubiquitous tool for data analysis. An important variant is the sparse NMF problem which arises when we explicitly require the learnt features to be sparse. A natural measure of sparsity is the L norm, however its optimization is NP-hard. Mixed norms, such as L/L measure, have been shown to model sparsity robustly, based on intuitive attributes that such measures need to satisfy. This is in contrast to computationally cheaper alternatives such as the plain L norm. However, present algorithms designed for optimizing the mixed norm L/L are slow and other formulations for sparse NMF have been proposed such as those based on L and L norms. Our proposed algorithm allows us to solve the mixed norm sparsity constraints while not sacrificing computation time. We present experimental evidence on real-world datasets that shows our new algorithm performs an order of magnitude faster compared to the current state-of-the-art solvers optimizing the mixed norm and is suitable for large-scale datasets.
Block Coordinate Descent for Sparse NMF
Vamsi K. Potluru Department of Computer Science, University of New Mexico email@example.com Sergey M. Plis Mind Research Network, firstname.lastname@example.org Jonathan Le Roux Mitsubishi Electric Research Labs email@example.com Barak A. Pearlmutter Department of Computer Science, National University of Ireland Maynooth firstname.lastname@example.org Vince D. Calhoun Electrical and Computer Engineering, UNM and Mind Research Network email@example.com Thomas P. Hayes Department of Computer Science, University of New Mexico firstname.lastname@example.org
Matrix factorization arises in a wide range of application domains and is useful for extracting the latent features in the dataset (Figure 2). In particular, we are interested in matrix factorizations which impose the following requirements:
Nonnegativity is a natural constraint when modeling data with physical constraints such as chemical concentrations in solutions, pixel intensities in images and radiation dosages for cancer treatment. Low-rankedness is useful for learning a lower dimensionality representation. Sparsity is useful for modeling the conciseness of the representation or that of the latent features. Imposing all these requirements on our matrix factorization leads to the sparse nonnegative matrix factorization (SNMF) problem.
However, algorithms [14, 11] for solving SNMF which utilize the mixed norm of L/L as their sparsity measure are slow and do not scale well to large datasets. Thus, we develop an efficient algorithm to solve this problem and has the following ingredients:
A theoretically efficient projection operator () to enforce the user-defined sparsity where is the dimensionality of the feature vector as opposed to the previous approach .
2 Preliminaries and Previous Work
In this section, we give an introduction to the nonnegative matrix factorization (NMF) and SNMF problems. Also, we discuss some widely used algorithms from the literature to solve them.
Both these problems share the following problem and solution structure. At a high-level, given a nonnegative matrix of size , we want to approximate it with a product of two nonnegative matrices of sizes and , respectively:
The nonnegative constraint on matrix makes the representation a conical combination of features given by the columns of matrix . In particular, NMF can result in sparse representations, or a parts-based representation, unlike other factorization techniques such as principal component analysis (PCA) and vector quantization (VQ). A common theme in the algorithms proposed for solving these problems is the use of alternating updates to the matrix factors, which is natural because the objective function to be minimized is convex in and in , separately, but not in both together. Much effort has been focused on optimizing the efficiency of the core step of updating one of while the other stays fixed.
2.1 Nonnegative Matrix Factorization
Factoring a matrix, all of whose entries are nonnegative, as a product of two low-rank nonnegative factors is a fundamental algorithmic challenge. This has arisen naturally in diverse areas such as image analysis , micro-array data analysis , document clustering , chemometrics , information retrieval  and biology applications . For further applications, see the references in the following papers [1, 7].
We will consider the following version of the NMF problem, which measures the reconstruction error using the Frobenius norm :
where is element-wise. We use subscripts to denote column elements. Simple multiplicative updates were proposed by Lee and Seung to solve the NMF problem. This is attractive for the following reasons:
Unlike additive gradient descent methods, there is no arbitrary learning rate parameter that needs to be set.
The nonnegativity constraint is satisfied automatically, without any additional projection step.
Algorithm 1 is an example of the kind of multiplicative update procedure used, for instance, by Lee and Seung . The algorithm alternates between updating the matrices and (we have only shown the updates for —those for are analogous).
Here, indicates element-wise (Hadamard) product and matrix division is also element-wise. To remove the scaling ambiguity, the norm of columns of matrix are set to unity. Also, a small constant, say , is added to the denominator in the updates to avoid division by zero.
2.2 Sparse Nonnegative Matrix Factorization
The nonnegative decomposition is in general not unique . Furthermore, the features may not be parts-based if the data resides well inside the positive orthant. To address these issues, sparseness constraints have been imposed on the NMF problem.
Sparse NMF can be formulated in many different ways. From a user point of view, we can split them into two classes of formulations: explicit and implicit. In explicit versions of SNMF [14, 11], one can set the sparsities of the matrix factors directly. On the other hand, in implicit versions of SNMF [17, 25], the sparsity is controlled via a regularization parameter and is often hard to tune to specified sparsity values a priori. However, the algorithms for implicit versions tend to be faster compared to the explicit versions of SNMF.
In this paper, we consider the explicit sparse NMF formulation proposed by Hoyer . To make the presentation easier to follow, we first consider the case where the sparsity is imposed on one of the matrix factors, namely the feature matrix —the analysis for the symmetric case where the sparsity is instead set on the other matrix factor is analogous. The case where sparsity requirements are imposed on both the matrix factors is dealt with in the Appendix. The sparse NMF problem formulated by Hoyer  with sparsity on matrix is as follows:
Sparsity measure for a -dimensional vector is given by:
The sparsity measure (4) defined above has many appealing qualities. Some of which are as follows:
The measure closely models the intuitive notion of sparsity as captured by the norm. So, it easy for the user to specify sparsity constraints from prior knowledge of the application domain.
Simultaneously, it is able to avoid the pitfalls associated with directly optimizing the norm. Desirable properties for sparsity measures have been previously explored  and it satisfies all of these properties for our problem formulation. The properties can be briefly summarized as: (a) Robin Hood — Spreading the energy from larger coordinates to smaller ones decreases sparsity, (b) Scaling — Sparsity is invariant to scaling, (c) Rising tide — Adding a constant to the coordinates decreases sparsity, (d) Cloning — Sparsity is invariant to cloning, (e) Bill Gates — One big coordinate can increase sparsity, (f) Babies — coordinates with zeros increase sparsity.
The above sparsity measure enables one to limit the sparsity for each feature to lie in a given range by changing the equality constraints in the SNMF formulation (2.2) to inequality constraints . This could be useful in scenarios like fMRI brain analysis, where one would like to model the prior knowledge such as sizes of artifacts are different from that of the brain signals. A sample illustration on a face dataset is shown in Figure 2 (Right). The features are now evenly split into two groups of local and global features by choosing two different intervals of sparsity.
A gradient descent-based algorithm called Nonnegative Matrix Factorization with Sparseness Constraints (NMFSC) to solve SNMF was proposed . Multiplicative updates were used for optimizing the matrix factor which did not have sparsity constraints specified. Heiler and Schnörr proposed two new algorithms which also solved this problem by sequential cone programming and utilized general purpose solvers like MOSEK (http://www.mosek.com). We will consider the faster one of these called tangent-plane constraint (TPC) algorithm. However, both these algorithms, namely NMFSC and TPC, solve for the whole matrix of coefficients at once. In contrast, we propose a block coordinate-descent strategy which considers a sequence of vector problems where each one can be solved in closed form efficiently.
3 The Sequential Sparse NMF Algorithm
We present our algorithm which we call Sequential Sparse NMF (SSNMF) to solve the SNMF problem as follows:
First, we consider a problem of special form which is the building block (Algorithm 2) of our SSNMF algorithm and give an efficient, as well as exact, algorithm to solve it. Second, we describe our sequential approach (Algorithm 3) to solve the subproblem of SNMF. This uses the routine we developed in the previous step. Finally, we combine our routines developed in the previous two steps along with standard solvers (for instance Algorithm 1) to complete the SSNMF Algorithm (Algorithm 4).
Sparse-opt routine solves the following subproblem which arises when solving problem (2.2):
where vector is of size . This problem has been previously considered , and an algorithm to solve it was proposed which we will henceforth refer to as the Projection-Hoyer. Similar projection problems have been recently considered in the literature and solved efficiently [10, 5].
For any , we have that if , then .
Let us first consider the case when the vector is sorted. Then by the previous observation, we have a transition point that separates the zeros of the solution vector from the rest.
By applying the Cauchy-Schwarz inequality on and the all ones vector, we get .
The Lagrangian of the problem (5) is :
Setting the partial derivatives of the Lagrangian to zero, we get by observation 1:
where we account for the dependence of the Lagrange parameters on the transition point . We compute the objective value of problem (5) for all transition points in the range from to and select the one with the highest value. In the case, where the vector is not sorted, we just simply sort it and note down the sorting permutation vector. The complete algorithm is given in Algorithm 2. The dominant contribution to the running time of Algorithm 2 is the sorting of vector and therefore can be implemented in time333This can be further reduced to linear time by noting that we do not need to fully sort the input in order to find .. Contrast this with the running time of Projection-Hoyer whose worst case is [14, 28].
3.2 Sequential Approach —Block Coordinate Descent
Previous approaches for solving SNMF [14, 11] use batch methods to solve for sparsity constraints. That is, the whole matrix is updated at once and projected to satisfy the constraints. We take a different approach of updating a column vector at a time. This gives us the benefit of being able to solve the subproblem (column) efficiently and exactly. Subsequent updates can benefit from the newly updated columns resulting in faster convergence as seen in the experiments.
In particular, consider the optimization problem (2.2) for a column of the matrix while fixing the rest of the elements of matrices :
where and . This reduces to the problem (5) for which we have proposed an exact algorithm (Algorithm 2). We update the columns of the matrix factor sequentially as shown in Algorithm 3. We call it sequential for we update the columns one at a time. Note that this approach can be seen as an instance of block coordinate descent methods by mapping features to blocks and the Sparse-opt projection operator to a descent step.
3.3 SSNMF Algorithm for Sparse NMF
4 Implementation Issues
For clarity of exposition, we presented the plain vanilla version of our SSNMF Algorithm 4. We now describe some of the actual implementation details.
Initialization: Generate a positive random vector of size and obtain where (from equation (4)). Use the solution and its random permutations to initialize matrix . Initialize the matrix to uniform random entries in .
Termination: In our experiments, we fix the number of alternate updates or equivalently the number of times we update matrix . Other approaches include specifying total running time, relative change in objective value between iterations or approximate satisfaction of KKT conditions.
Sparsity constraints: We have primarly considered the sparse NMF model as formulated by Hoyer . This has been generalized by Heiler and Schnörr  by relaxing the sparsity constraints to lie in user-defined intervals. Note that, we can handle this formulation  by making a trivial change to Algorithm 3.
5 Experiments and Discussion
In this section, we compare the performance of our algorithm with the state-of-the-art NMFSC and TPC algorithms [14, 11]. Running times for the algorithms are presented when applied to one synthetic and three real-world datasets. Experiments report reconstruction error () instead of objective value for convenience of display. For all experiments on the datasets, we ensure that our final reconstruction error is always better than that of the other two algorithms. Our algorithm was implemented in MATLAB (http://www.mathworks.com) similar to NMFSC and TPC. All of our experiments were run on a Ghz Intel machine with GB of RAM and the number of threads set to one.
For comparing the performance of SSNMF with NMFSC and TPC, we consider the following synthetic and three real-world datasets :
CBCL face dataset consists of images of size and can be obtained at http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html.
ORL: Face dataset that consists of images of size and can be obtained at http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html. http://www.kyb.tuebingen.mpg.de/bethge/vanhateren/iml/. We use of these images in our experiments.
sMRI: Structural MRI scans of subjects taken at the John Hopkins University were obtained. The scans were taken on a single 1.5T scanner with the imaging parameters set to mm TR, ms TE, matrix size of . We segment these images into gray matter, white matter and cerebral spinal fluid images, using the software program SPM5 (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/), followed by spatial smoothing with a Gaussian kernel of mm. This results in images which are of size
5.2 Comparing Performances of Core Updates
We compare our Sparse-opt (Algorithm 2) routine with the competing Projection-Hoyer . In particular, we generate random problems for each sparsity constraint in and a fixed problem size. The problems are of size where takes integer values from to . Input coefficients are generated by drawing samples uniformly at random from . The mean values of the running times for Sparse-opt and the Projection-Hoyer for each dimension and corresponding sparsity value are plotted in Figure 2.
We compare SSNMF with SSNMF+Proj on the CBCL dataset. The algorithms were run with rank set to . The running times are shown in Figure 6.
We see that in low-dimensional datasets, the difference in running times are very small.
5.3 Comparing Overall Performances
SSNMF versus NMFSC and TPC:
We plot the performance of SSNMF against NMFSC and TPC on the synthetic dataset provided by Heiler and Schnörr  in Figure 3. We used the default settings for both NMFSC and TPC using the software provided by the authors. Our experience with TPC was not encouraging on bigger datasets and hence we show its performance only on the synthetic dataset. It is possible that the performance of TPC can be improved by changing the default settings but we found it non-trivial to do so.
SSNMF versus NMFSC:
To ensure fairness, we removed logging information from NMFSC code  and only computed the objective for equivalent number of matrix updates as SSNMF. We do not plot the objective values at the first iteration for convenience of display. However, they are the same for both algorithms because of the shared initialization . We ran the SSNMF and NMFSC on the ORL face dataset. The rank was fixed at in both the algorithms. Also, the plots of running times versus objective values are shown in Figure 4 corresponding to sparsity values ranging from to . Additionally, we ran our SSNMF algorithm and NMFSC algorithm on a large-scale dataset consisting of the structural MRI images by setting the rank to . The running times are shown in Figure 5.
5.4 Main Results
We compared the running times of our Sparse-opt routine versus the Projection-Hoyer and found that on the synthetically generated datasets we are faster on average.
Our results on switching the Sparse-opt routine with the Projection-Hoyer did not slow down our SSNMF solver significantly for the datasets we considered. So, we conclude that the speedup is mainly due to the sequential nature of the updates (Algorithm 3).
Also, we converge faster than NMFSC for fewer number of matrix updates. This can be seen by noting that the plotted points in Figures 4 and 5 are such that the number of matrix updates are the same for both SSNMF and NMFSC. For some datasets, we noted a speedup of an order of magnitude making our approach attractive for computation purposes.
6 Connections to Related Work
Other SNMF formulations have been considered by Hoyer , Mørup et al.  , Kim and Park , Pascual-Montano et al.  (nsNMF) and Peharz and Pernkopf . SNMF formulations using similar sparsity measures as used in this paper have been considered for applications in speech and audio recordings [30, 29].
We note that our sparsity measure has all the desirable properties, extensively discussed by Hurley and Rickard , except for one (“cloning”). Cloning property is satisfied when two vectors of same sparsity when concatenated maintain their sparsity value. Dimensions in our optimization problem are fixed and thus violating the cloning property is not an issue. Compare this with the L norm that satisfies only one of these properties (namely “rising tide”). Rising tide is the property where adding a constant to the elements of a vector decreases the sparsity of the vector. Nevertheless, the measure used in Kim and Park is based on the L norm. The properties satisfied by the measure in Pascual-Montano et al. are unclear because of the implicit nature of the sparsity formulation.
Pascual-Montano et al.  claim that the SNMF formulation of Hoyer, as given by problem (2.2) does not capture the variance in the data. However, some transformation of the sparsity values is required to properly compare the two formulations [14, 25]. Preliminary results show that the formulation given by Hoyer  is able to capture the variance in the data if the sparsity parameters are set appropriately. Peharz and Pernkopf  propose to tackle the L norm constrained NMF directly by projecting from intermediate unconstrained solutions to the required L constraint. This leads to the well-known problem of getting stuck in local minima. Indeed, the authors re-initialize their feature matrix with an NNLS solver to recover from the local suboptimum. Our formulation avoids the local minima associated with L norm by using a smooth surrogate.
We have proposed a new efficient algorithm to solve the sparse NMF problem. Experiments demonstrate the effectiveness of our approach on real datasets of practical interest. Our algorithm is faster over a range of sparsity values and generally performs better when the sparsity is higher. The speed up is mainly because of the sequential nature of the updates in contrast to the previously employed batch updates of Hoyer. Also, we presented an exact and efficient algorithm to solve the problem of maximizing a linear objective with a sparsity constraint, which is an improvement over the heuristic approach in Hoyer.
Our approach can be extended to other NMF variants . Another possible application is the sparse version of nonnegative tensor factorization. A different research direction would be to scale our algorithm to handle large datasets by chunking  and/or take advantage of distributed/parallel computational settings .
The first author would like to acknowledge the support from NIBIB grants 1 R01 EB 000840 and 1 R01 EB 005846. The second author was supported by NIMH grant 1 R01 MH076282-01. The latter two grants were funded as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.
- Arora et al.  Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the 44th symposium on Theory of Computing, STOC ’12, pages 145–162, New York, NY, USA, 2012. ACM.
- Berry et al.  Michael W Berry, Murray Browne, Amy N Langville, V Paul Pauca, and Robert J Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, 2007.
- Bradley et al.  Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In ICML, pages 321–328, 2011.
- Buchsbaum and Bloch  G. Buchsbaum and O. Bloch. Color categories revealed by non-negative matrix factorization of munsell color spectra. Vision research, 42(5):559–563, 2002.
- Chen and Ye  Yunmei Chen and Xiaojing Ye. Projection onto a simplex. arXiv preprint arXiv:1101.6081, 2011.
- Cichocki and Phan  A. Cichocki and A. H. Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, 92:708–721, 2009.
- Cohen and Rothblum  J. E. Cohen and U. G. Rothblum. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149–168, 1993.
- Ding et al.  Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 126–135, New York, NY, USA, 2006. ACM.
- Donoho and Stodden  David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
- Duchi et al.  John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pages 272–279, 2008.
- Heiler and Schnörr  Matthias Heiler and Christoph Schnörr. Learning sparse representations by non-negative matrix factorization and sequential cone programming. The Journal of Machine Learning Research, 7:2006, 2006.
- Hofmann  T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.
- Hoyer  P. O. Hoyer. Non-negative sparse coding. In Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on, pages 557–565, 2002.
- Hoyer  Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, December 2004.
- Hsieh and Dhillon  C. J. Hsieh and I. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. ACM SIGKDD Internation Conference on Knowledge Discovery and Data Mining, page xx, 2011.
- Hurley and Rickard  Niall Hurley and Scott Rickard. Comparing measures of sparsity. IEEE Trans. Inf. Theor., 55:4723–4741, October 2009.
- Kim and Park  Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
- Kim and Park  Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. Data Mining, IEEE International Conference on, 0:353–362, 2008.
- Lawton and Sylvestre  W. H. Lawton and E. A. Sylvestre. Self modeling curve resolution. Technometrics, pages 617–633, 1971.
- Lee and Seung  D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.
- Lee and Seung  Daniel D. Lee and Sebastian H. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2000.
- Lin  Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Comp., 19(10):2756–2779, October 2007.
- Mairal et al.  Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
- Mørup et al.  Morten Mørup, Kristoffer Hougaard Madsen, and Lars Kai Hansen. Approximate L0 constrained non-negative matrix and tensor factorization. In ISCAS, pages 1328–1331, 2008.
- Pascual-Montano et al.  A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsNMF). Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(3):403–415, March 2006.
- Peharz and Pernkopf  R. Peharz and F. Pernkopf. Sparse nonnegative matrix factorization with -constraints. Neurocomputing, 2011.
- Schmidt and Olsson  M. N. Schmidt and R. K. Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing (INTERSPEECH), volume 2, page 1. Citeseer, 2006.
- Theis et al.  Fabian J Theis, Kurt Stadlthanner, and Toshihisa Tanaka. First results on uniqueness of sparse non-negative matrix factorization. In Proceedings of the 13th European Signal Processing Conference (EUSIPCOâ05), 2005.
- Virtanen  Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. Audio, Speech, and Language Processing, IEEE Transactions on, 15(3):1066–1074, 2007.
- Weninger et al.  Felix Weninger, Jordi Feliu, and Bjorn Schuller. Supervised and semi-supervised suppression of background music in monaural speech recordings. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 61–64. IEEE, 2012.
- Xu et al.  W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 267–273. ACM, 2003.
In some applications, it is desirable to set the sparsity on both matrix factors. However, this can lead to the situation where the variance in the data is poorly captured . To ameliorate this condition, we formulate it as the following optimization problem and call it as Bi-Sparse NMF:
where is a matrix. In the above formulation, we constrain the L norms of the columns of matrix to unity. Similarly, we constrain the L norms of rows of matrix to be unity. This scaling is absorbed by the matrix . Note that this formulation with the matrix constrained to be diagonal is equivalent to the one proposed in Hoyer when both the matrix factors have their sparsity specified.
We can solve for the matrix with any NNLS solver. A concrete algorithm is the one presented in Ding et al. and is reproduced here for convenience (Algorithm 5). If is a diagonal matrix, we only update the diagonal terms and maintain the rest at zero. Algorithms 1 and 5 can be sped up by pre-computing the matrix products which are unchanged during the iterations.
Also, the matrix captures the variance of the dataset when we have sparsity set on both the matrices .