Computing a Nonnegative Matrix Factorization – Provably

# Computing a Nonnegative Matrix Factorization – Provably

## Abstract

In the Nonnegative Matrix Factorization (NMF) problem we are given an nonnegative matrix and an integer . Our goal is to express as where and are nonnegative matrices of size and respectively. In some applications, it makes sense to ask instead for the product to approximate – i.e. (approximately) minimize where denotes the Frobenius norm; we refer to this as Approximate NMF.

This problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where and are computed using a variety of local search heuristics. Vavasis recently proved that this problem is NP-complete. (Without the restriction that and be nonnegative, both the exact and approximate problems can be solved optimally via the singular value decomposition.)

We initiate a study of when this problem is solvable in polynomial time. Our results are the following:

1. We give a polynomial-time algorithm for exact and approximate NMF for every constant . Indeed NMF is most interesting in applications precisely when is small.

2. We complement this with a hardness result, that if exact can be solved in time , -SAT has a sub-exponential time algorithm. This rules out substantial improvements to the above algorithm.

3. We give an algorithm that runs in time polynomial in , and under the separablity condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings.

To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input and we believe that this will be an interesting and important direction for future work.

## 1Introduction

In the Nonnegative Matrix Factorization (NMF) problem we are given an matrix with nonnegative real entries (such a matrix will be henceforth called “nonnegative”) and an integer . Our goal is to express as where and are nonnegative matrices of size and respectively. We refer to as the inner-dimension of the factorization and the smallest value of for which there is such a factorization as the nonnegative rank of . An equivalent formulation is that our goal is to write as the sum of nonnegative rank-one matrices.We note that must be at least the rank of in order for such a factorization to exist. In some applications, it makes sense to instead ask for to be a good approximation to in some suitable matrix norm. We refer to the problem of finding a nonnegative and of inner-dimension that (approximately) minimizes as Approximate NMF, where denotes the Frobenius norm. Without the restriction that and be nonnegative, the problem can be solved exactly via singular value decomposition [12].

NMF is a fundamental problem that has been independently introduced in a number of different contexts and applications. Many interesting heuristics and local search algorithms (including the familiar Expectation Maximization or EM) have been proposed to find such factorizations. One compelling family of applications is data analysis, where a nonnegative factorization is computed in order to extract certain latent relationships in the data and has been applied to image segmentation [24], [25] information retrieval [16] and document clustering [35]. NMF also has applications in fields such as chemometrics [23] (where the problem has a long history of study under the name self modeling curve resolution) and biology (e.g. in vision research [7]): in some cases, the underlying physical model for a system has natural restrictions that force a corresponding matrix factorization to be nonnegative. In demography (see e.g., [15]), NMF is used to model the dynamics of marriage through a mechanism similar to the chemical laws of mass action. In combinatorial optimization, Yannakakis [37] characterized the number of extra variables needed to succinctly describe a given polytope as the nonnegative rank of an appropriate matrix (called the “slack matrix”). In communication complexity, Aho et al [1] showed that the log of the nonnegative rank of a Boolean matrix is polynomially related to its deterministic communication complexity - and hence the famous Log-Rank Conjecture of Lovasz and Saks [26] is equivalent to showing a quasi-polynomial relationship between real rank and nonnegative rank for Boolean matrices. In complexity theory, Nisan used nonnegative rank to prove lower bounds for non-commutative models of computation [28]. Additionally, the 1993 paper of Cohen and Rothblum [8] gives a long list of other applications in statistics and quantum mechanics. That paper also gives an exact algorithm that runs in exponential time.

Vavasis recently proved that the NMF problem is -hard when is large [36], but this only rules out an algorithm whose running time is polynomial in , and . Arguably, in most significant applications, is small. Usually the algorithm designer posits a two-level generative model for the data and uses NMF to compute “hidden” variables that explain the data. This explanation is only interesting when the number of hidden variables () is much smaller than the number of examples () or the number of observations per example (). In information retrieval, we often take to be a “term-by-document” matrix where the entry in is the frequency of occurrence of the term in the document in the database. In this context, a NMF computes “topics” which are each a distribution on words (corresponding to the columns of ) and each document (a column in ) can be expressed as a distribution on topics given by the corresponding column of [16]. This example will be a useful metaphor for thinking about nonnegative factorization. In particular it justifies the assertion should be small – the number of topics should be much smaller than the total number of documents in order for this representation to be meaningful. See Section Appendix A for more details.

Focusing on applications, and the overwhelming empirical evidence that heuristic algorithms do find good-enough factorizations in practice, motivates our next question.

### 1.1Our Results

Here we largely resolve Question ?. We give both an algorithm for accomplishing this algorithmic task that runs in polynomial time for any constant value of and we complement this with an intractability result which states that assuming the Exponential Time Hypothesis [20] no algorithm can solve the exact NMF problem in time .

This result is based on algorithms for deciding the first order theory of the reals - roughly the goal is to express the decision question of whether or not the matrix has nonnegative rank at most as a system of polynomial equations and then to apply algorithms in algebraic geometry to determine if this semi-algebraic set is non-empty. The complexity of these procedures is dominated by the number of distinct variables occurring in the system of polynomial equations. In fact, the number of distinct variables plays an analogous role to VC-dimension, in a sense and the running time of algorithms for determining if a semi-algebraic set is non-empty depend exponentially on this quantity. Additionally these algorithms can compute successive approximations to a point in the set at the cost of an additional factor in the run time that is polynomial in the number of bits in the input and output. The naive formulation of the NMF decision problem as a non-emptiness problem is to use variables, one for each entry in or [8]. This would be unacceptable, since even for constant values of , the associated algorithm would run in time exponential in and .

At the heart of our algorithm is a structure theorem – based on a novel method for reducing the number of variables needed to define the associated semi-algebraic set. We are able to express the decision problem for nonnegative matrix factorization using distinct variables (and we make use of tools in geometry, such as the notion of a separable partition, to accomplish this [14], [2], [18]). Thus we obtain the algorithm quoted in the above theorem. All that was known prior to our work (for constant values for ) was an exponential time algorithm, and local search heuristics akin to the Expectation-Maximization (EM) Algorithm with unproved correctness or running time.

A natural requirement on is that its columns be linearly independent. In most applications, NMF is used to express a large number of observed variables using a small number of hidden variables. If the columns of are not linearly independent then Radon’s Lemma implies that this expression can be far from unique. In the example from information retrieval, this translates to: there are candidate documents that can be expressed as a convex combination of one set of topics, or could alternatively be expressed as a convex combination of an entirely disjoint set of topics (see Section 2.1). When we add the requirement that the columns of be linearly independent, we refer to the associated problem as the Simplicial Factorization (SF) problem. In this case the doubly-exponential dependence on in the previous theorem can be improved to singly-exponential. Our algorithm is again based on the first order theory of the reals, but here the system of equations is much smaller so in practice one may be able to use heuristic approaches to solve this system (in which case, the validity solution can be easily checked).

We complement these algorithms with a fixed parameter intractability result. We make use of a recent result of Patrascu and Williams [30] (and engineer low-dimensional gadgets inspired by the gadgets of Vavasis [36]) to show that under the Exponential Time Hypothesis [20], there is no exact algorithm for NMF that runs in time . This intractability result holds also for the SF problem.

Now we turn to Question ?. We consider the nonnegative matrix factorization problem under the “separability” assumption introduced by Donoho and Stodden [10] in the context of image segmentation. Roughly, this assumption asserts that there are rows of that can be permuted to form the identity matrix. If we knew the names of these rows, then computing a nonnegative factorization would be easy. The challenge in this context, is to avoid brute-force search (which runs in time ) and to find these rows in time polynomial in , and . To the best of our knowledge the following is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input.

Donoho and Stodden [10] argue that the separability condition is naturally met in the context of image segmentation. Additionally, Donoho and Stodden prove that separability in conjunction with some other conditions guarantees that the solution to the NMF problem is unique. Our theorem above is an algorithmic counterpart to their results, but requires only separability. Our algorithm can also be made noise tolerant, and hence works even when the separability condition only holds in an approximate sense. Indeed, an approximate separability condition is regarded as a fairly benign assumption and is believed to hold in many practical contexts in machine learning. For instance it is usually satisfied by model parameters fitted to various generative models (e.g. LDA [5] in information retrieval). (We thank David Blei for this information.)

Lastly, we consider the case in which the given matrix does not have an exact low-rank NMF but rather can be approximated by a nonnegative factorization with small inner-dimension.

The rest of the paper is organized as follows: In Section 2 we give an exact algorithm for the SF problem and in Section 3 we give an exact algorithm for the general NMF problem. In Section 4 we prove a fixed parameter intractability result for the SF problem. And in Section 5 and Section 6 we give algorithms for the separable and adversarial nonnegative fatorization problems. Throughout this paper, we will use the notation that and are the column and row of respectively.

## 2Simplicial Factorization

Here we consider the simplicial factorization problem, in which the target inner-dimension is and the matrix itself has rank . Hence in any factorization (where is the inner-dimension), must have full column rank and must have full row rank.

### 2.1Justification for Simplicial Factorization

We first argue that the extra restriction imposed in simplicial factorization is natural in many contexts: Through a re-scaling (see Section ? for more details), we can assume that the columns of , and all have unit norm. The factorization can be interpreted probabilistically: each column of can be expressed as a convex combination (given by the corresponding column of ) of columns in . In the example in the introduction, columns of represent documents and the columns of represent “topics”. Hence a nonnegative factorization is an “explanation” : each document can be expressed as a convex combination of the topics.

But if does not have full column rank then this explanation is seriously deficient. This follows from a restatement of Radon’s Lemma. Let be the convex hull of the columns for .

The observation implies that there is some candidate document that can be expressed as a convex combination of topics (in ), or instead can be expressed as a convex combination of an entirely disjoint set () of topics. The end goal of NMF is often to use the representation of documents as distributions on topics to perform various tasks, such as clustering or information retrieval. But if (even given the set of topics in a database) it is this ambiguous to determine how we should represent a given document as a convex combination of topics, then the topics we have extracted cannot be very useful for clustering! In fact, it seems unnatural to not require the columns of to be linearly independent!

Next, one should consider the process (probabilistic, presumably) that generates the datapoints, namley, columns of . Any reasonable process for generating columns of from the columns of would almost surely result in a matrix whose rank equals the rank of . But then has the same rank as .

### 2.2Algorithm for Simplicial Factorization

In this Section we give an algorithm that solves the simplicial factorization problem in time. Let be the maximum bit complexity of any coefficient in the input.

The above theorem is proved by using Lemma ? below to reduce the problem of finding a simplicial factorization to finding a point inside a semi-algebraic set with constraints and real-valued variables (or deciding that this set is empty). The decision problem can be solved using the well-known algorithm of Basu et. al.[3] solves this problem in time. We can instead use the algorithm of Renegar [32] (a bound of on the bit complexity of the coefficients in the solution due to Grigor’ev and Vorobjov [13]) to compute a rational approximation to the solution up to accuracy in time .

This reduction uses the fact that since have full rank they have “pseudo-inverses” , which are and matrices respectively such that . Thus and similarly .

(“if”) Suppose the conditions in the theorem are met. Then set and . These matrices are nonnegative and have size and respectively, and furthermore are a factorization for . Since , and are a simplicial factorization.

(“only if”) Conversely suppose that there is a simplicial factorization . Let and be arbitrary bases for the columns and rows of respectively. Let and be the corresponding and matrices. Let and be and representations in this basis for the columns and rows of - i.e. and .

Define matrices and where and are the respective pseudoinverses of . Let us check that this choice of and satisfies the conditions in the theorem.

We can re-write and hence the first condition in the theorem is satisfied. Similarly and hence the second and third condition are also satisfied.

, our structure theorem becomes much more complex – both to state and to prove. In what follows, we will not make any assumptions about and will give a structure theorem and corresponding polynomial time algorithm (for constant ) for the full nonnegative matrix factorization problem.

## 3General NMF

Now we consider the NMF problem where the factor matrices need not have full rank.

As in the Simplicial case the main idea will again be a reduction to an existence question for a semi-algebraic set, but this reduction is significantly more complicated than Lemma ?.

### 3.1General Structure Theorem: Minimality

Our goal is to re-cast nonnegative matrix factorization (for constant ) as a system of polynomial inequalities where the number of variables is constant, the maximum degree is constant and the number of constraints is polynomially bounded in and . The main obstacle is that and are large - we cannot afford to introduce a new variable to represent each entry in these matrices. We will demonstrate there is always a “minimal” choice for and so that:

1. there is a collection of linear transformations from the column-span of to and a choice function

2. and a collection of linear transformations from the row-span of to and a choice function

And these linear transformations and choice functions satisfy the conditions:

1. for each , and

2. for each , .

Furthermore, the number of possible choice functions is at most and the number of possible choice functions for is at most . These choice functions are based on the notion of a simplicial partition, which we introduce later. We then give an algorithm for enumerating all simplicial partitions (this is the primary bottleneck in the algorithm). Fixing the choice functions and , the question of finding linear transformations and that satisfy the above constraints (and the constraint that , and and are nonnegative) is exactly a system of polynomial inequalities with a variables (each matrix or is ), degree at most four and furthermore there are at most polynomial constraints.

In this subsection, we will give a procedure (which given and ) generates a “minimal” choice for and (call this minimal choice and ), and we will later establish that this “minimal” choice satisfies the structural property stated informally above.

A basic fact from linear algebra is that all maximal independent sets of columns of have exactly elements and all maximal independent sets of rows of similarly have exactly elements.

Note that the extra conditions on (i.e. the minimal basis constraint) is with respect to and the extra conditions on are with respect to . This simplifies the proof that there is always some proper chain, since we can compute a that satisfies the above conditions with respect to and then find an that satisfies the conditions with respect to .

The condition that there is some nonnegative for which is just the condition that for all , . Hence, for each vector , we can choose a minimal basis using Claim ?. Then so there is some nonnegative vector supported on such and we can set . Repeating this procedure for each column , results in a nonnegative matrix that satisfies the condition and for each , by design and is a minimal basis with respect to for .

We can re-use this argument above, setting and this interchanges the role of and . Hence we obtain a nonnegative matrix which satisfies and for each , again by design we have that and is a minimal basis with respect to for .

Notice that in the above lemma, the linear transformation that recovers the columns of is based on column subsets of , while the linear transformation to recover the rows of is based on the row subsets of (not ).

Since and and form a proper chain we have that . Also . Consider the quantity . For any , . So consider

where the last equality follows from the condition . Since we have that is the identity matrix. Hence . An identical argument with replaced with and with replaced by (and and replaced with and ) respectively implies that too.

Note that there are at most linear trasformations of the form and hence the columns of can be recovered by a constant number of linear transformations of the column span of , and similarly the rows of can also be recovered.

The remaining technical issue is we need to demonstrate that there are not too many (only polynomially many, for constant ) choice functions and and that we can enumerate over this set efficiently. In principle, even if say is just two sets, there are exponentially many choices of which (of the two) linear transformation to use for each column of . However, when we use lexicographic ordering to tie break (as in the definition of a minimal basis), the number of choice functions is polynomially bounded. We will demonstrate that the choice function arising in the definition of a proper chain can be embedded in a restricted type of geometric partitioning of which we call a simplicial partition.

### 3.2General Structure Theorem: Simplicial Partitions

Here, we establish that the choice functions and in a proper chain are combinatorially simple. The choice function can be regarded as a partition of the columns of into sets, and similarly the choice function is a partition of the rows of into sets. Here we define a geometric type of partitioning scheme which we call a simplicial partition, which has the property that there are not too many simplicial partitions (by virtue of this class having small VC-dimension), and we show that the partition functions and arising in the definition of a proper chain are realizable as (small) simplicial partitions.

If , we will be interested in a -simplicial partition.

Order the sets in according to the lexicographic ordering , so that for . Then for each , let be the rows of the matrix . Note that there are exactly rows, hence this defines a -simplicial partition.

Since and and forms a proper chain, we have that . Consider a column and the corresponding set . Recall that is the set in according to the lexicographic ordering . Also from the definition of a proper chain is a minimal basis for with respect to . Consider any set with . Then from the definition of a minimal basis we must have that . Since , we have that the transformation is a projection onto which contains . Hence , but so cannot be a nonnegative vector. Hence is not in for any . Furthermore, is in : using Lemma ? we have and so .

We can repeat the above replacing with and with , and this implies the lemma.

### 3.3Enumerating Simplicial Partitions

Here we give an algorithm for enumerating all -simplicial partitions (of, say, the columns of ) that runs in time . An important observation is that the problem of enumerating all simplicial partitions can be reduced to enumerating all partitions that arise from a single hyperplane. Indeed, we can over-specify a simplicial partition by specifying the partition (of the columns of ) that results from each hyperplane in the set of total hyperplanes that generates the simplicial partition. From this set of partitions, we can recover exactly the simplicial partition.

A number of results are known in this domain, but surprisingly we are not aware of any algorithm that enumerates all partitions of the columns of (by a single hyperplane) that runs in polynomial time (for and is constant) without some assumption on . For example, the VC-dimension of a hyperplane in dimensions is and hence the Sauer-Shelah lemma implies that there are at most distinct partitions of the columns of by a hyperplane. In fact, a classic result of Harding (1967) gives a tight upper bound of . Yet these bounds do not yield an algorithm for efficiently enumerating this structured set of partitions without checking all partitions of the data.

A recent result of Hwang and Rothblum [18] comes close to our intended application. A separable partition into parts is a partition of the columns of into sets so that the convex hulls of these sets are disjoint. Setting , the number of separable partitions is exactly the number of distinct hyperplane partitions. Under the condition that is in general position (i.e. there are no columns of lying on a dimension subspace where ), Hwang and Rothblum give an algorithm for efficiently enumerating all distinct hyperplane partitions [18].

Here we give an improvement on this line of work, by removing any conditions on (although our algorithm will be slightly slower). The idea is to encode each hyperplane partition by a choice of not too many data points. To do this, we will define a slight generalization of a hyperplane partition that we will call a hyperplane separation:

A hyperplane partition can be regarded as a mapping from columns of to where we adopt the convention that such that is mapped to .

After an appropriate linear transformation (of the columns of and the hyperplanes), we can assume that is full rank. If the already contains affinely independent columns of , then we can choose . If not we can perturb in some direction so that for any column with , we maintain the invariant that is contained on the perturbed hyperplane . Since this perturbation has non-zero inner product with some column in and so this hyperplane will eventually contain a new column from (without changing the sign of for any other column). We can continue this argument until the hyperplane contains affinely independent columns of and by design on all remaining columns agrees in sign with .

We can apply Lemma ? repeatedly. When we initially apply the lemma, we obtain a hyperplane that can be extended (as a separation) to the partition corresponding to . In the above function (defined implicitly in the lemma) this fixes the partition of the columns except those contained in . So we can then choose to be the columns of that are contained in , and recurse. If is the largest set of columns output from the recursive call, we can add columns of contained in to this set until we obtain a set of affinely independent columns contained in , and we can output this set (as ).

We can apply Lemma ? and instead enumerate the sets of points . Since these sets are nested, we can enumerate all choices as follows:

• choose at most columns corresponding to the set

• initialize an active set

• until is empty either

• choose a column to be removed from the active set

• or indicate that the current active set represents the next set and choose the sign of the corresponding hyperplane

There are at most such choices, and for each choice we can then run a linear program to determine if there is a corresponding hyperplane partition. (In fact, all partitions that result from the above procedure will indeed correspond to a hyperplane partition). The correctness of this algorithm follows from Lemma ?.

This immediately implies:

### 3.4Solving Systems of Polynomial Inequalities

The results of Basu et al [3] give an algorithm for finding a point in a semi-algebraic set defined by constraints on polynomials of total degree at most , and variables in time . Using our structure theorem for nonnegative matrix factorization, we will re-cast the decision problem of whether a nonnegative matrix has nonnegative rank as an existence question for a semi-algebraic set.

We first prove the first part of this theorem using the algorithm of Basu et al [3], and we instead use the algorithm of Renegar [32] to compute a rational approximation to the solution up to accuracy in time .

Suppose there is such a factorization. Using Lemma ?, there is also a proper chain. We can apply Lemma ? and using the algorithm in Theorem ? we can enumerate over a superset of simplicial partitions. Hence, at least one of those partitions will result in the choice functions and in the proper chain decomposition for .

Using Lemma ? there is a set of at most linear transformations which recover columns of given columns of , and similarly there is a set of at most linear transformations which recover the rows of given rows of . Note that these linear transformations are from the column-span and row-span of respectively, and hence are from subspaces of dimension at most . So apply a linear transformation to columns of and one to rows of to to recover matrices and respectively (which are no longer necessarily nonnegative) but which are dimension and respectively. There will still be a collection of at most linear transformations from columns of to columns of , and similarly for and .

We will choose variables for each linear transformation, so there are variables in total. Then we can write a set of linear constraints to enforce that for each column of , the transformation corresponding to recovers a nonnegative vector. Similarly we can define a set of constraints based on rows in .

Lastly we can define a set of constraints that enforce that we do recover a factorization for : For all , let and . Then we write the constraint . This constraint has degree at two in the variables corresponding to the linear transformations. Lemma ? implies that there is some choice of these transformations that will satisfy these constraints (when we formulate these constraints using the correct choice functions in the proper chain decomposition). Furthermore, any set of transformations that satisfies these constraints does define a nonnegative matrix factorization of inner dimension for .

And of course, if there is no inner dimension nonnegative factorization, then all calls to the algorithm of Basu et al [3] will fail and we can return that there is no such factorization.

The result in Basu et. al. [3] is a quantifier elimination algorithm in the Blum, Shub and Smale (BSS) model of computation [6]. The BSS model is a model for real number computation and it is natural to ask what is the bit complexity of finding a rational approximation of the solutions. There has been a long line of research on the decision problem for first order theory of reals: given a quantified predicate over polynomial inequalities of reals, determine whether it is true or false. What we need for our algorithm is actually a special case of this problem: given a set of polynomial inequalities over real variables, determine whether there exists a set of values for the variables so that all polynomial inequalities are satisfied. In particular, all variables in our problem are quantified by existential quantifier and there are no alternations. For this kind of problem Grigor’ev and Vorobjov [13] first gave a singly-exponential time algorithm that runs in where is the number of polynomial inequalities, is the maximum degree of the polynomials and is the number of variables. The bit complexity of the algorithm is where is the maximum length of the coefficients in the input. Moreover, their algorithm also gives an upperbound of on the number of bits required to represent the solutions. Renegar[32] gave a better algorithm that for the special case we are interested in takes time . Using his algorithm with binary search (with search range bounded by Grigor’ev et.al.[13]), we can find rational approximations to the solutions with accuracy up to in time .

We note that our results on the SF problem are actually a special case of the theorem above (because our structural lemma for simplicial factorization is a special case of our general structure theorem):

If , then we know that both and must be full rank. Hence and are both just the set . Hence we can circumvent the simplicial partition machinery, and set up a system of polynomial constraints in at most variables.

## 4Strong Intractability of Simplicial Factorization

Here we give evidence that finding a simplicial factorization of dimension probably cannot be solved in time, unless -SAT can be solved in time (in other words, if the Exponential Time Hypothesis of [20] is true). Surprisingly, even the -hardness of the problem for general was only proved quite recently by Vavasis [36]. That reduction is the inspiration for our result, though unfortunately we were unable to use it directly to get low-dimensional instances. Instead we give a new reduction using the -SUM Problem.

This definition for the -SUM Problem is slightly unconventional in that here we allow repetition (i.e. the choice of numbers need not be distinct). Patrascu and Williams [30] recently proved that if -SUM can be solved in time then -SAT has a sub-exponential time algorithm. In fact, in the instances constructed in [30] we can allow repetition of numbers without affecting the reduction since in these instances choosing any number more than once will never result in a sum that is exactly . Hence we can re-state the results in [30] for our (slightly unconventional definition for) -SUM.

Given an instance of the -SUM, we will reduce to an instance of the Intermediate Simplex problem defined in [36].

Vavasis [36] proved that Intermediate Simplex is equivalent to the Simplicial Factorization problem.

Interestingly, an immediate consequence of this theorem is that Simplicial Factorization is easy in the case in which because mapping these instances to instances of intermediate simplex results in a one dimensional problem - i.e. the polyhedron is an interval.

Given the universe for the -SUM problem, we construct a two dimensional Intermediate Simplex instance as shown in Figure 1. We will show that the Intermediate Simplex instance has exactly solutions, each representing a choice of . Later in the reduction we use such gadgets to represent the choice of numbers in the set .

Recall for a two dimensional Intermediate Simplex problem, the input consists of a polygon (which is the hexagon in Figure 1) and a set of points inside (which are the dots, except for ). A solution to this two dimensional Intermediate Simplex instance will be a triangle inside such that all the points in are contained in the triangle (in Figure 1 is a valid solution).

We first specify the polygon for the Intermediate Simplex instance. The polygon is just the hexagon inscribed in a circle with center . All angles in the hexagon are , the edges where is a small constant depending on , that we determine later. The other 3 edges also have equal lengths .

We use and to denote the and coordinates for the point (and similarly for all other points in the gadget). The hexagon is placed so that , .

Now we specify the set of points for the Intermediate Simplex instance. To get these points first take points in each of the 3 segements , , . On these points are called , , ..., , and . Similarly we have points ’s on and ’s on , . Now we have triangles (the thin lines in Figure 1). We claim (see Lemma ? below) that the intersection of these triangles is a polygon with vertices. The points in are just the vertices of this intersection.

Since the intersection of triangles is the intersection of halfplanes, it has at most vertices. Therefore we only need to prove every edge in the triangles has a segment remaining in the intersection. Notice that the gadget is symmetric with respect to rotations of around the center . By symmetry we only need to look at edges . The situation here is illustrated in Figure 2.

Since all the halfplanes that come from triangles contain the center , later when talking about halfplanes we will only specify the boundary line. For example, the halfplane with boundary and contains (as well as ) is called halfplane .

The two thick lines in Figure 2 are extensions of and , now they are rotated so that they are . The two thin lines are two possible lines and . The differences between coordinates of and are the same for all (here normalized to 1) by the construction of the points ’s and ’s. Assume the coordinates for , are and respectively. Then the coordinates for the intersection is . This means if we have segments with , segment will be the highest one when is in range (indeed, the lines with have higher slope and will win when ; the lines with have lower slope and will win when ).

We also want to make sure that all these intersection points are inside the halfplanes ’s and ’s. Since , all the ’s are within . Hence the intersection point is always close to the point , the distance is at most . At the same time, since is small, the distances of this point to all the ’s and ’s are all larger than . Therefore all the intersection points are inside the other halfplanes and the segments will indeed remain in the intersection. The intersection has edges and vertices.

The Intermediate Simplex instance has obvious solutions: the triangles , each one corresponds to a value for the -SUM problem. In the following Lemma we show that these are the only possible solutions.

Suppose is a solution of the Intermediate Simplex problem, since is in the convex hull of , it must be in . Thus one of the angles , , must be at least (their sum is ). Without loss of generality we assume this angle is and by symmetry assume is either on or . We shall show in either of the two cases, when is not one of the ’s, there will be some that is not in the halfplane (recall the halfplanes we are interested in always contain so we don’t specify the direction).

When is on , since , we have (by symmetry when the angle is exactly ). This means we can move to such that . The intersection of halfplane and the hexagon is at least as large as the intersection of halfplane and the hexagon. However, if is not any of the points (that is, ), then can be viewed as if we add to the set . By Lemma ? introducing must increase the number of vertices. One of the original vertices is not in the hyperplane , and hence not in . Therefore when is on it must coincide with one of the ’s, by symmetry must be one of ’s.

When is on , there are two cases as shown in Figure 3.

First observe that if we take , and generate the set according to , then the gadget is further symmetric with respect to flipping along the perpendicular bisector of . Now without loss of generality . Since every is now in the intersection of triangles, in particular they are also in the intersection of the original triangles, it suffices to show one of () is outside halfplane .

The first case (left part of Figure 3) is when . In this case we extend to get intersection on () and intersection on (). Again since , we have . At the same time we know , so . Similar to the previous case, we take so that . The intersection of hyperplane and the hexagon is at least as large as the intersection of halfplane and the hexagon. When , we can check , therefore we can still view as some for . Now Lemma ? shows there is some vertex not in halfplane (and hence not in halfplane ).

The final case (right part of Figure 3) is when . In this case we notice the triangle with 3 edges , , (the shaded triangle in the figure) is contained in every , thus it must also be in . However, since , we know and . In this case does not even contain the center .

### 4.2The Reduction

Suppose we are given an instance of the -SUM Problem with values . We will give a reduction to an instance of Intermediate Simplex in dimension .

To encode the choice of numbers in the set , we use gadgets defined in Section 4.1. The final solution of the Intermediate Simplex instance we constructed will include solutions to each gadget. As the solution of a gadget always corresponds to a number in (Lemma ?) we can decode the solution and get numbers, and we use an extra dimension that “computes” the sum of these numbers and ensures the sum is equal to .

We use three variables for the gadget.

is a tilted-cone that has a hexagonal base and has an apex at the origin.

We will use these gadgets to define (some of the) constraints on the polyhedron in an instance of intermediate simplex:

Hence when restricted to dimensions , , the gadget is on the plane .

We hope that in a gadget, if we choose three points corresponding to the triangle for some value , that of these three points only the point on the line will have a non-zero value for and that this value will be . The points on the lines or will hopefully have a value close to zero. We add constraints to enforce these conditions:

These constraints make sure that points on or cannot have large value.

Recall that we use to denote the coordinate of in the gadget in Section 4.1.

Theses constraints make sure that points on have values in .

The and constraints all have the property that when (i.e. the corresponding point is off of the gadget on the plane ) then these constraints gradually become relaxed.

To make sure the gadget still works, we don’t want the extra constraints on to rule out some possible values for , , ’s. Indeed we show the following claim.

The proof is by observing that Constraints have almost no effect when and Constraints have no effect when .

Constraints 1 to 4 define a polyhedron in -dimensional space and furthermore the set of constraints that define have full rank (in fact even the inequalities in the Box Constraints have full rank). Thus this polyhedron is a valid polyhedron for the Intermediate Simplex problem.

Next we specify the points in for the Intermediate Simplex problem(each of which will be contained in the polyhedron ). Let (for ) be the set in the gadget in Section 4.1. As before, let and be the and coordinates of respectively.

This completes the reduction of -SUM to intermediate simplex, and next we establish the COMPLETENESS and SOUNDNESS of this reduction.

### 4.3Completeness and Soundness

The completeness part is straight forward: for gadget we just select the triangle that corresponds to .

We will choose a set of points : We will include the and points, and for each , we will choose the triangle corresponding to the value in the gadget. Recall the triangle is in the gadget defined in Section 4.1. The points we choose have and , equal to the corresponding point in the gadget. We will set to be for the point on the line and we will set to be zero for the other two points not contained in the line . The rest of the dimensions are all set to 0.

Next we prove that the convex hull of this set of points contains all the points in : The points and are clearly contained in the convex hull of (and are in fact in !). Next consider some point in corresponding to some intersection point in the gadget . Since is in the convex hull of the triangle corresponding to in the gadget , there is a convex combination of the these three points in (which we call ) so that matches on all coordinates except possibly the -coordinate. Furthermore the point has some value in the coordinate corresponding to and this must be at most the corresponding value in (because we chose the -value in to be -). Hence we can distribute the remaining weight among the and points to recover exactly on all coordinates.

Lastly, we observe that if we equally weight all points in (except and ) we recover the point . In particular, the coordinate of should be .

Next we prove SOUNDNESS for our reduction. Suppose the solution is , which is a set of points in the polyhedron and the convex hull of points in contains all the , , , points (in Definition ?).

The points and are vertices of the polyhedron and hence cannot be expressed as a convex combination of any other set of points in .

Now we want to prove the rest of the points in set is partitioned into triples, each triple belongs to one gadget. Set .

The sets are disjoint, and additionally each set must contain at least nodes (otherwise the convex hull of even restricted to cannot contain the points ). This implies the Claim.

Recall the gadget in Section 4.1 is a two dimensional object, but it is represented as a three dimensional cone in our construction. We would like to apply Lemma ? to points on the plane (in this plane the coordinates , act the same as , in the gadget).

Since the points are in the affine hull of when restricted to , we know must be in the convex hull of . Using Lemma ? in Section 4.1, we get:

Now we know how to decode the solution and get the numbers . We will abuse notation and call the 3 points in , , (they were used to denote the corresponding points in the 2-d gadget in Section 4.1).We still want to make sure the coordinate correctly “computes” the sum of these numbers. As a first step we want to show that the of all points in must be 1 (we need this because the Constraints AB and CE are only strict when ).

Suppose, for the sake of contradiction, that (for ). Then consider the point . Since , and for any point in , there is no convex combination of points in that places non-zero weight on and equals .

Let be , we observe that the points in are the only points in that have any contribution to when we want to represent (using a convex combination). For now we restrict our attention to these three dimensions. When trying to represent we must have weight in the set (because of the contribution in coordinate). The , coordinates of are , respectively. This means if we take projection to plane must be in the convex hull of . However that is impossible because no two points in contain in their convex hull. This contradiction implies the Lemma.

Using Lemma ?, we conclude that the total weight on points in is exactly , and there is a unique convex combination of the points (restricted to ) that recover the point which is the combination. This implies the Lemma.

Now we are ready to compute the value of the point and show the sum of is indeed .

As we showed in previous Lemmas, the solution to the Intermediate Simplex problem must contain , , and for each gadget the solution has 3 points that correspond to one of the solutions of the gadget. Suppose for gadget the triangle we choose is . By Constraints we know , by Constraints we know and .

By Lemma ? there is only one way to represent , and .

Since and ’s are small, we have . However the numbers only have bits and is so small, the only valid value in the range is . Hence the sum must be equal to .

## 5Fully-Efficient Factorization under Separability

Earlier, we gave algorithms for NMF, and presented evidence that no time algorithm exists for determining if a matrix has nonnegative rank at most . Here we consider conditions on the input that allow the factorization to be found in time polynomial in , and . (In Section 5.1, we give a noise-tolerant version of this algorithm). To the best of our knowledge this is the first example of an algorithm (that runs in time poly) and provably works under a non-trivial condition on the input. Donoho and Stodden [10] in a widely-cited paper identified sufficient conditions for the factorization to be unique (motivated by applications of NMF to a database of images) but gave no algorithm for this task. We give an algorithm that runs in time poly and assumes only one of their conditions is met (separability). We note that this separability condition is quite natural in its own right, since it is usually satisfied [4] by model parameters fitted to various generative models (e.g. LDA [5] in information retrieval).

Let us understand this condition at an intuitive level in context of clustering documents by topic, which was discussed in the introduction. Recall that there a column of corresponds to a document. Each column of represents a topic and its entries specify the probability that a word occurs in that topic. The NMF thus “explains” the document as where the column vector has (nonnegative) coordinates summing to one—in other words, represents a convex combination of topics. In practice, the total number of words may number in the thousands or tens of thousands, and the number of topics in the dozens. Thus it is not unusual to find factorizations in which each topic is flagged by a word that appears only in that topic and not in the other topics [4]. The separability condition asserts that this happens for every topic1.

For simplicity we assume without loss of generality that the rows of are normalized to have unit -norm. After normalizing , we can still normalize (while preserving the factorization) by re-writing the factorization as for some nonnegative matrix . By setting the rows of will all have norm 1. When rows of and are all normalized the rows of must also have unit -norm because

The third equality uses the nonnegativity of . Notice that after this normalization, if a row of has a unique nonzero entry (the rows in Separability), that particular entry must be one.

We also assume is a simplicial matrix defined as below.

The next lemma shows that without loss of generality we may assume is simplicial.

Suppose is not simplicial, and let the row be in the convex hull of the remaining rows. Then we can represent where is a nonnegative vector with and the coordinate is 0.

Now modify as follows. For each row in that has a non-zero coordinate, we zero out the coordinate and add to the row . At the end the matrix is still nonnegative but whose column is all zeros. So delete the column and let the resulting matrix be . Let be the matrix obtained by deleting the row of . Then by construction we have . Now we claim is separable.

Since was originally separable, for each column index there is some row, say the row, that has a non-zero entry in the column and zeros everywhere else. If then by definition the above operation does not change the row of . If the index is deleted at the end. In either case the final matrix satisfies the separability condition.

Repeating the above operation for all violations of the simplicial condition we end with a separable factorization of (again with inner-dimension at most ) where is simplicial.

We can apply Lemma ? and assume without loss of generality that there is a factorization where is separable and is simplicial. The separability condition implies that every row of appears among the rows of . Thus is hiding in plain sight in ; we now show how to find it.

Say a row is a loner if (ignoring other rows that are copies of ) it is not in the convex hull of the remaining rows. The simplicial condition implies that the rows of that correspond to rows of are loners.

Suppose (for contradiction) that a row in is not a loner and but it is equal to some row . Then there is a set of rows of so that is in their convex hull and furthermore for all , is not equal to . Thus there is a nonnegative vector