Robust Large-Scale Non-Negative Matrix Factorization Using the Proximal Point Algorithm
Abstract
A robust algorithm for nonnegative matrix factorization (NMF) is presented in this paper with the purpose of dealing with large-scale data, where the separability assumption is satisfied. In particular, we modify the Linear Programming (LP) algorithm of [9] by introducing a reduced set of constraints for exact NMF. In contrast to previous approaches, the proposed algorithm does not require knowledge of the factorization rank (the number of extreme rays [3] or topics [7]). Furthermore, motivated by a similar problem arising in the context of metabolic network analysis [13], we consider an entirely different regime, in which the number of extreme rays or topics can be much larger than the dimension of the data vectors. The performance of the algorithm is evaluated on several synthetic data sets.
I Introduction
Matrix factorization has numerous applications to real-world problems, where the data matrices representing the numerical observations are often huge and hard to analyze. Factorizing them into lower-rank forms can reveal the inherent structure and features, which helps in the meaningful interpretation of the data. In a wide range of natural signals, such as pixel intensities, amplitude spectra, and occurrence counts, negative values are usually physically meaningless. In order to deal with this nonnegativity constraint, Non-Negative Matrix Factorization (NMF) was introduced.
NMF was first proposed in [1] and used by Lee and Seung for parts-based data representation [2]. It is well known that NMF may not be unique. In this context, a sufficient condition for the uniqueness of NMF was pointed out in [3]. A geometrical interpretation [4] of this condition amounts to the fact that the extreme rays (topics) generating the cone (in the nonnegative orthant) are contained in the data. Thus, for NMF, one only needs to identify these extreme rays. Additionally, it was demonstrated in [5] that under such a separability assumption, one can use Linear Programming (LP) to separate the extreme rays from the non-extreme rays. In this paper we will focus only on such cases.
Bittorf et al. [9] presented an LP-based NMF algorithm named Hottopixx. Kumar et al. [4], instead, presented a fast conical hull algorithm for large-scale NMF based on its polyhedral structure, which was shown to perform much faster than Hottopixx. However, both of these algorithms require the factorization rank, i.e. the number of extreme rays, as a necessary input. Some applications grant this as prior knowledge, but from the viewpoint of robustness, the preference would go to an NMF algorithm that does not need it. Gillis and Luce [10] reformulated the algorithm in [9] to detect the number of extreme rays automatically. Nevertheless, a limitation remains: the number of constraints in the LP is enormous for large-scale data.
Alternatively, motivated by Theorem 5.4 in [5], we propose a simpler modification of the constraints to alleviate these issues. In particular, we reduce the number of constraints and do not require the number of extreme rays to be known. This allows us to use a proximal point-based algorithm [12] to solve the LP problem efficiently for large data sets.
In addition, we consider a regime entirely different from the NMF applications in the literature so far, where the data is high-dimensional with a much smaller factorization rank (the number of extreme rays) in comparison. Specifically, we look at the case where the number of extreme rays is much larger than the dimensionality of the space. This regime is motivated by the computational issues that arise in the Double Description (DD) method [6] for the analysis of metabolic networks, where the Elementary Flux Modes (EFMs) are found as the set of extreme rays of a polyhedral cone [13]. An NMF-like problem arises as an intermediate step in the DD method. In this context, we believe that computational advances in NMF can help address the computational issues in the DD method [6].
The organization of the paper is as follows. Section II provides a brief review of NMF from the geometric perspective as well as the Hottopixx algorithm. Section III explains the proposed proximal point algorithm with the reformulated LP constraints. The experimental results are presented in Section IV, and the paper concludes in Section V.
Notation: Matrices will be denoted by boldface capital letters and vectors by boldface small letters. In addition, we use the MATLAB notation diag both to transform a vector into a diagonal matrix and to extract the diagonal from the matrix in its argument. We also use the MATLAB operators for horizontal and vertical matrix concatenation.
II Non-Negative Matrix Factorization
For the non-noisy case, we are given a data matrix $\mathbf{X} \in \mathbb{R}_+^{m \times n}$. NMF aims to find two nonnegative matrices $\mathbf{W} \in \mathbb{R}_+^{m \times r}$ and $\mathbf{H} \in \mathbb{R}_+^{r \times n}$ such that $\mathbf{X} = \mathbf{W}\mathbf{H}$. For an approximate NMF, instead, the aim is to solve the following optimization problem.
$\min_{\mathbf{W} \ge 0, \, \mathbf{H} \ge 0} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2$   (1)
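While this paper pursues an LP-based exact approach, the approximate problem (1) is classically solved with the multiplicative updates of Lee and Seung [2]. The sketch below (not the algorithm proposed in this paper; the function name and iteration count are our choices) illustrates that baseline:

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=500, eps=1e-12, seed=0):
    """Approximate NMF X ~ W @ H via Lee-Seung multiplicative updates.
    The updates preserve nonnegativity because they only ever multiply
    nonnegative factors by nonnegative ratios."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H, keeping H >= 0
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W, keeping W >= 0
    return W, H
```

Each update monotonically decreases the Frobenius objective in (1); the small `eps` guards against division by zero.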
II-A Geometry of the NMF Problem
The factorization implies that all the columns of $\mathbf{X}$ can be represented as nonnegative combinations of the columns of the matrix $\mathbf{W}$. The algebraic characterization can be described as below.
Definition 1.
The simplicial cone generated by the columns of $\mathbf{W}$ is given by,
$\mathrm{cone}(\mathbf{W}) = \{\mathbf{x} \in \mathbb{R}^m : \mathbf{x} = \mathbf{W}\boldsymbol{\alpha}, \; \boldsymbol{\alpha} \ge \mathbf{0}\}$   (2)
Geometrically, the factorization means that all the columns of $\mathbf{X}$ lie in or on the surface of the simplicial cone generated by the columns of $\mathbf{W}$, as depicted in Fig. 1.
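Membership in such a simplicial cone can be tested numerically with a nonnegative least-squares solve. The following sketch is illustrative only (the function name `in_cone` and the tolerance are our choices, not part of the paper's algorithm):

```python
import numpy as np
from scipy.optimize import nnls

def in_cone(F, x, tol=1e-8):
    """Check whether x lies in the simplicial cone generated by the
    columns of F, i.e. whether x = F @ a for some a >= 0.  This is a
    nonnegative least-squares problem: if the residual of the best
    nonnegative fit is (numerically) zero, x is in the cone."""
    a, resid = nnls(F, x)
    return resid <= tol
```

For example, any nonnegative combination of the columns of `F` is reported as inside the cone, while a vector with a negative entry is not.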
With this viewpoint in mind, we define three assumptions as follows.

Assumption 1: Extreme rays are by definition simplicial: no extreme ray is a convex combination of the other extreme rays. This condition is also shown to be necessary and sufficient for exact recovery in topic modeling [8].

Assumption 2: The data set, consisting of all columns of $\mathbf{X}$, resides in or on the surface of the cone generated by these extreme rays [3].

Assumption 3: Assuming that the columns of $\mathbf{X}$ are normalized to unit norm, there are no duplicate columns in $\mathbf{X}$.
The above three assumptions will collectively be referred to as the separability assumption in the following: the entire data set, i.e. all columns of $\mathbf{X}$, resides in or on the surface of a cone generated by a small subset of the columns of $\mathbf{X}$; the vectors in this subset are simplicial; and there are no duplicate columns in $\mathbf{X}$ after column normalization.
In algebraic terms, $\mathbf{X} = \mathbf{X}_K \mathbf{H}$ for some subset $K$ of columns (extreme rays) of $\mathbf{X}$, where $\mathbf{X}_K$ denotes the matrix built from the columns of $\mathbf{X}$ indexed by $K$. This means that the extreme rays are hidden among the columns of $\mathbf{X}$ ($K$ is unknown) [5]. Equivalently, it implies that the corresponding subset of rows of $\mathbf{H}$ constitutes the weight matrix. Therefore, the computational challenge is to identify the extreme rays efficiently. In this context, we first outline the LP-based Hottopixx algorithm from [9].
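Under the separability assumption, a brute-force way to identify the extreme rays (ignoring efficiency, and not the algorithm of [9] or of this paper) is to test each column for representability as a nonnegative combination of the remaining columns:

```python
import numpy as np
from scipy.optimize import nnls

def find_extreme_rays(X, tol=1e-6):
    """Return indices of columns of X that are NOT nonnegative
    combinations of the remaining columns -- under separability these
    are exactly the extreme rays.  Assumes normalized, duplicate-free
    columns (Assumption 3).  Cost: one NNLS solve per column, so this
    naive sketch does not scale to large data."""
    n = X.shape[1]
    extreme = []
    for j in range(n):
        others = np.delete(X, j, axis=1)
        _, resid = nnls(others, X[:, j])
        if resid > tol:          # column j cannot be represented: extreme ray
            extreme.append(j)
    return extreme
```

On a toy separable instance whose first columns are the generators, the function recovers exactly those indices.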
II-B Hottopixx
Bittorf et al. [9] proposed an algorithm for NMF under the separability assumption based on the following LP problem:
$\min_{\mathbf{C} \in \Phi} \; \mathbf{d}^T \mathrm{diag}(\mathbf{C})$   (3)
where $\mathbf{d}$ is a random vector with distinct positive entries and $\mathbf{C}$ is referred to as a factorization localizing matrix [9], which belongs to the following polyhedral set.
$\Phi = \{\mathbf{C} \in \mathbb{R}^{n \times n} : \mathbf{C} \ge 0, \; \mathrm{tr}(\mathbf{C}) = r, \; C_{jj} \le 1 \; \forall j, \; C_{ij} \le C_{jj} \; \forall i,j, \; \mathbf{C}\mathbf{X} = \mathbf{X}\}$   (4)
For a large-scale setup, they proposed an incremental gradient descent algorithm [15] to solve the LP.
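For a small noiseless instance, the Hottopixx LP can be assembled directly for an off-the-shelf solver. The sketch below is an illustrative assumption-laden reading of (3)-(4): it takes the exact (noise-free) constraint $\mathbf{C}\mathbf{X} = \mathbf{X}$, treats the rows of $\mathbf{X}$ as the data points (the transposed convention relative to Section II), and uses SciPy's `linprog`; it is not the large-scale solver of [9]:

```python
import numpy as np
from scipy.optimize import linprog

def hottopixx_exact(X, r, seed=0):
    """Toy noiseless Hottopixx LP: find a factorization localizing C
    with C @ X = X, tr(C) = r, 0 <= C_ij <= C_jj <= 1, minimizing
    d^T diag(C) for a random cost vector d with distinct positive
    entries.  Rows with large C_jj are reported as extreme rays."""
    n, f = X.shape
    rng = np.random.default_rng(seed)
    d = rng.permutation(np.arange(1, n + 1)).astype(float)  # distinct positive costs

    idx = lambda i, j: i * n + j          # flatten C row-major into z
    c = np.zeros(n * n)
    for j in range(n):
        c[idx(j, j)] = d[j]               # objective touches only diag(C)

    # equalities: (C X)_{jk} = X_{jk} for all j,k, plus tr(C) = r
    A_eq, b_eq = [], []
    for j in range(n):
        for k in range(f):
            row = np.zeros(n * n)
            for l in range(n):
                row[idx(j, l)] = X[l, k]
            A_eq.append(row); b_eq.append(X[j, k])
    row = np.zeros(n * n)
    for j in range(n):
        row[idx(j, j)] = 1.0
    A_eq.append(row); b_eq.append(float(r))

    # inequalities: C_ij - C_jj <= 0 for i != j
    A_ub, b_ub = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                row = np.zeros(n * n)
                row[idx(i, j)] = 1.0
                row[idx(j, j)] = -1.0
                A_ub.append(row); b_ub.append(0.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0.0, 1.0)] * (n * n))
    C = res.x.reshape(n, n)
    return np.argsort(np.diag(C))[-r:]    # indices of the r largest diagonal entries
```

Note the quadratic number of variables and constraints in the number of data points, which is exactly the scalability issue discussed above.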
III Robust NMF Using the Proximal Point Algorithm
As explained before, two of the prominent shortcomings of existing algorithms for the NMF problem are (i) dependence on knowledge of the number of extreme rays, and (ii) an enormous number of constraints when dealing with large data sets. An approach in this direction was taken in [10]; however, the number of constraints in their reformulation is still immense for large data. In this paper we present a reformulation which drastically reduces the set of constraints.
III-A LP Reformulation
Assuming that the columns of $\mathbf{X}$ are normalized to unit norm, our LP reformulation for NMF is given as
(5) 
where,
(6) 
where $\mathbf{1}$ is the vector of all $1$s and $\mathbf{d}$ is the same as the vector in (3).
Proposition 1.
Suppose $\mathbf{X}$ admits a separable factorization $\mathbf{X} = \mathbf{X}_K \mathbf{H}$. Let $\mathbf{C}^*$ be a minimizer of (5) and $K^* = \{j : C^*_{jj} > 0\}$; then $K^* = K$.
In order to prove the above proposition, we consider the Lagrangian of the optimization problem in (5), which is,
(7)  
where and are the Lagrange multipliers. Then the dual form of (5) is
(8)  
s.t.
The proof of the proposition follows from Lemma 1 and Lemma 2 below.
Lemma 1.
If .
Proof.
For , consider the LP problem
(9) 
where denotes the vector whose $i$-th entry is $1$ and whose remaining entries are $0$. Assign and using the constraint , we can claim that . Under the separability assumption, there exists a collection of vectors such that,
(10)  
for and some . A feasible solution to (8) is
(11)  
With such selection, the dual cost function is equal to . From weak duality [11] it follows that . Combined with , it implies and . ∎
Lemma 2.
If .
Proof.
Proof of Proposition 1.
Let $\mathbf{C}^*$ denote the factorization localizing matrix which identifies the factorization with lowest cost and has only ones or zeros on the diagonal. Then $\mathbf{C}^*$ is the unique optimal solution of (5). To see this, let $K$ denote the set of simplicial columns of minimum cost. Since each column of $\mathbf{X}$ either belongs to $K$ or not, once $K$ is given, $\mathbf{C}^*$ is determined and hence the lowest cost is determined, which is unique.
Remark.
Note that can be readily obtained from as (in MATLAB notation).
III-B Proximal Point Algorithm
Based on the above reformulation, a proximal point-based (Proximal Point) algorithm is employed in this paper. A necessary preprocessing step is column normalization, which ensures that each column of $\mathbf{X}$ sums to one.
We solve this LP using the proximal point algorithm [12] in Algorithm 1; in our implementation, the parameters are set to a large constant. A discussion of the convergence of the algorithm can be found in [12].
Input: A column normalized matrix , stopping threshold .
Output: A matrix and , and .
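To illustrate the proximal point idea of [12] in its simplest form: applied to a linear objective over a box, each proximal subproblem has a closed-form solution, so the iteration reduces to a projected step. The toy sketch below is a stand-in under that simplifying assumption, not the full algorithm over the constraint set (6):

```python
import numpy as np

def proximal_point_lp_box(c, lam=0.5, n_iter=200, x0=None):
    """Proximal point iteration for min c^T x over the box [0, 1]^n.
    Each subproblem
        x_{k+1} = argmin_{x in [0,1]^n} c^T x + (1/(2*lam)) ||x - x_k||^2
    has the closed form clip(x_k - lam * c, 0, 1), so the iteration is
    a clipped gradient-like step; lam plays the role of the proximal
    regularization weight."""
    c = np.asarray(c, dtype=float)
    x = np.full(len(c), 0.5) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = np.clip(x - lam * c, 0.0, 1.0)
    return x
```

On a linear cost, the iterates converge to the vertex of the box that minimizes the objective: coordinates with positive cost go to 0 and those with negative cost go to 1.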
IV Experimental Results
All of the experiments were run on an identical configuration: a dual Xeon W3505 (2.53 GHz) machine with 6 GB RAM. The Proximal Point algorithm is implemented in MATLAB R2013a.
IV-A Random Data Generation
To generate our instances, independent extreme rays are first created, with element values drawn uniformly at random. The remaining columns are then generated as random nonnegative combinations of these extreme rays, with the combination weights randomly selected for each point. Column normalization is then carried out. Three regimes of NMF problems, (C1)-(C3), are analyzed here.
Furthermore, since the algorithm is insensitive to the order of the columns, the extreme rays are placed at the beginning of each data set.
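The data generation procedure of this subsection can be sketched as follows; the l1 column normalization and the uniform weight distribution are our assumptions where the text leaves them unspecified:

```python
import numpy as np

def make_separable_data(m, r, n, seed=0):
    """Generate a synthetic separable data set as in Section IV-A:
    r random extreme rays in R^m_+, followed by n - r random
    nonnegative combinations of them; the extreme rays are placed
    first, and all columns are then normalized (here to unit l1 norm,
    so each column sums to one)."""
    rng = np.random.default_rng(seed)
    F = rng.uniform(0.0, 1.0, size=(m, r))       # extreme rays, entries in (0, 1)
    A = rng.uniform(0.0, 1.0, size=(r, n - r))   # random nonnegative weights
    X = np.hstack([F, F @ A])                    # extreme rays allocated first
    return X / X.sum(axis=0, keepdims=True)      # column normalization
```

Since normalization only rescales columns, the generated matrix remains separable: every column still lies in the cone of the first r columns.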
Data sets of different sizes, from small to large-scale, are generated to check the effectiveness of the Proximal Point algorithm. In Tab. I, the last column indicates the highest level of the iteration stopping criterion needed to achieve the listed accuracy. The experiments show that our algorithm can handle the three regimes of data at different sizes, and that the identification accuracy is satisfactory.
Data Set | # of Extreme Rays | Accuracy
(C1) | 25 |
(C1) | 25 |
(C1) | 300 |
(C2) | 15 |
(C2) | 75 |
(C2) | 225 |
(C3) | 45 |
(C3) | 150 |
(C3) | 625 |
IV-B Application to Image Processing
In this section, we apply the Proximal Point algorithm to a face image data set, namely the CBCL dataset [14]. The CBCL face dataset consists of 2429 gray-level face images of 19 x 19 pixels. We randomly choose a subset of images from the dataset, vectorize them, and use them as generators, so the number of extreme rays equals the number of chosen images. A facial data matrix is then created through random nonnegative combinations of these extreme rays. Applying the Proximal Point algorithm to this dataset, the results of the extreme-ray identification are shown in Fig. 2, which recovers the initial images used as generators.
V Acknowledgements
The second author would like to acknowledge several useful discussions with Prof. Prakash Ishwar at Boston University.
References
 [1] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, pp. 111-126, 1994.
 [2] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2001, pp. 556-562.
 [3] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in NIPS, 2004, MIT Press.
 [4] A. Kumar, V. Sindhwani, and P. Kambadur, "Fast conical hull algorithms for near-separable non-negative matrix factorization," arXiv:1210.1190, 2012.
 [5] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization - provably," in STOC, 2012, pp. 145-162.
 [6] K. Fukuda and A. Prodon, "Double description method revisited," in Combinatorics and Computer Science, vol. 1120, pp. 91-111, 1996.
 [7] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama, "A new geometric approach to latent topic modeling and discovery," arXiv:1301.0858 [stat.ML], 2013.
 [8] W. Ding, P. Ishwar, M. H. Rohban, and V. Saligrama, "Necessary and sufficient conditions for novel word detection in separable topic models," arXiv:1310.7994 [cs.LG], 2013.
 [9] V. Bittorf, B. Recht, C. Re, and J. Tropp, "Factoring nonnegative matrices with linear programs," in NIPS, 2012, pp. 1223-1231.
 [10] N. Gillis and R. Luce, "Robust near-separable nonnegative matrix factorization using linear optimization," arXiv:1302.4385, 2013.
 [11] D. G. Luenberger, Optimization by Vector Space Methods. New York: Wiley, 1969.
 [12] J. Eckstein, "Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming," Math. Oper. Res., vol. 18, pp. 203-226, 1993.
 [13] M. Terzer and J. Stelling, "Accelerating the computation of elementary modes using pattern trees," in WABI, 2006, pp. 333-343.
 [14] http://cbcl.mit.edu/software-datasets/FaceData2.html
 [15] D. P. Bertsekas, "A new class of incremental gradient methods for least squares problems," SIAM J. Optim., vol. 7, no. 4, pp. 913-926, 1997.