A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization
Abstract
Nonnegative matrix factorization (NMF) is the problem of determining two nonnegative low rank factors W and H, for the given input matrix A, such that A ≈ WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel software to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices.
We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating nonnegative least squares (NLS) subproblems for W and H. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.
Ramakrishnan Kannan, Georgia Tech, rkannan@gatech.edu
Grey Ballard, Sandia National Laboratories, gmballa@sandia.gov
Haesun Park, Georgia Tech, hpark@cc.gatech.edu
1 Introduction
Nonnegative Matrix Factorization (NMF) is the problem of finding two low rank factors W (of size m × k) and H (of size k × n), both nonnegative, for a given input matrix A (of size m × n), such that A ≈ WH, where k ≪ min(m, n).
Formally, the NMF problem Seung and Lee [2001] can be defined as
\[
\min_{\mathbf{W} \ge 0,\, \mathbf{H} \ge 0} f(\mathbf{W},\mathbf{H}) \equiv \|\mathbf{A} - \mathbf{W}\mathbf{H}\|_F^2.
\]
NMF is widely used in data mining and machine learning as a dimension reduction and factor analysis method. It is a natural fit for many real world problems, as nonnegativity is inherent in many representations of real-world data, and the resulting low rank factors are expected to have natural interpretations. The applications of NMF range from text mining Pauca et al. [2004], computer vision Hoyer [2004], and bioinformatics Kim and Park [2007], to blind source separation Cichocki et al. [2009], unsupervised clustering Kuang et al. [2012] and many other areas. For typical problems today, m and n can be on the order of thousands to millions or more, and k is typically less than 100.
There is a vast literature on algorithms for NMF and their convergence properties Kim et al. [2014]. The commonly adopted NMF algorithms are (i) Multiplicative Update (MU) Seung and Lee [2001], (ii) Hierarchical Alternating Least Squares (HALS) Cichocki et al. [2009], (iii) Block Principal Pivoting (BPP) Kim and Park [2011], and (iv) Stochastic Gradient Descent (SGD) updates Gemulla et al. [2011]. As described in Equation (4.1), most of the algorithms in the NMF literature are based on the Alternating Nonnegative Least Squares (ANLS) scheme, which iteratively optimizes each of the low rank factors W and H while keeping the other fixed. It is important to note that in such iterative alternating minimization techniques, each subproblem is a constrained convex optimization problem. Each of these subproblems is then solved using standard optimization techniques such as projected gradient, interior point, etc.; a detailed survey of methods for solving this constrained convex optimization problem can be found in Wang and Zhang [2013]; Kim et al. [2014]. In this paper, our implementation solves the subproblems using a fast active-set based method called Block Principal Pivoting (BPP) Kim and Park [2011], but the parallel algorithm proposed in this paper can be easily extended to other algorithms such as MU and HALS.
Recently, with the advent of large scale internet data and interest in Big Data, researchers have started studying the scalability of many foundational machine learning algorithms. To illustrate the dimensions of matrices commonly used in the machine learning community, we present a few examples. Nowadays the adjacency matrix of a billion-node social network is common. In the matrix representation of video data, every frame contains three matrices, one per RGB color, which are reshaped into a column. Thus in the case of 4K video, every frame will take approximately 27 million rows (4096 row pixels x 2160 column pixels x 3 colors). Similarly, the popular representation of documents in text mining is a bag-of-words matrix, where the rows are the dictionary and the columns are the documents (e.g., webpages). Each entry in the bag-of-words matrix is generally the frequency count of a word in a document. Typically, with the explosion of new terms in social media, the number of words runs to millions.
To handle such high dimensional matrices, it is important to study low rank approximation methods in a datadistributed environment. For example, in many large scale scenarios, data samples are collected and stored over many general purpose computers, as the set of samples is too large to store on a single machine. In this case, the computation must also be distributed across processors. Local computation is preferred as local access of data is much faster than remote access due to the costs of interprocessor communication. However, for low rank approximation algorithms that depend on global information (like MU, HALS, and BPP), some communication is necessary.
The simplest way to organize these distributed computations on large datasets is through a MapReduce framework like Hadoop, but this simplicity comes at the expense of performance. In particular, most MapReduce frameworks require data to be read from and written to disk at every iteration, and they involve communication-intensive global, input-data shuffles across machines.
In this work, we present a much more efficient algorithm and implementation using tools from the field of High-Performance Computing (HPC). We maintain data in memory (distributed across processors), take advantage of optimized libraries for local computational routines, and use the Message Passing Interface (MPI) standard to organize interprocessor communication. The current trend for high-performance computers is that available parallelism (and therefore aggregate computational rate) is increasing much more quickly than improvements in network bandwidth and latency. This trend implies that the relative cost of communication (compared to computation) is increasing.
To address this challenge, we analyze algorithms in terms of both their computation and communication costs. In particular, our proposed algorithm ensures that after the input data is initially read into memory, it is never communicated; we communicate only the factor matrices and other smaller temporary matrices. Furthermore, we prove that in the case of dense input and under the assumption p ≤ mn/k², our proposed algorithm minimizes bandwidth cost (the amount of data communicated between processors) to within a constant factor of the lower bound. We also reduce latency costs (the number of times processors communicate with each other) by utilizing MPI collective communication operations, along with temporary local memory space, performing O(log p) messages per iteration, the minimum achievable for aggregating global data.
Fairbanks et al. Fairbanks et al. [2015] discuss a parallel NMF algorithm designed for multicore machines. To demonstrate the importance of minimizing communication, we consider this approach to parallelizing an alternating NMF algorithm in distributed memory. While this naive algorithm exploits the natural parallelism available within the alternating iterations (the fact that rows of W and columns of H can be computed independently), it performs more communication than necessary to set up the independent problems. We compare the performance of this algorithm with our proposed approach to demonstrate the importance of designing algorithms to minimize communication; that is, simply parallelizing the computation is not sufficient for satisfactory performance and parallel scalability.
The main contribution of this work is a new, highperformance parallel algorithm for nonnegative matrix factorization. The algorithm is flexible, as it is designed for both sparse and dense input matrices and can leverage many different algorithms for solving local nonnegative least squares problems. The algorithm is also efficient, maintaining data in memory, using MPI collectives for interprocessor communication, and using efficient libraries for local computation. Furthermore, we provide a theoretical communication cost analysis to show that our algorithm reduces communication relative to the naive approach, and in the case of dense input, that it provably minimizes communication. We show with performance experiments that the algorithm outperforms the naive approach by significant factors, and that it scales well for up to 100s of processors on both synthetic and realworld data.
2 Preliminaries
2.1 Notation
Table 1 summarizes the notation we use throughout. We use upper case letters for matrices and lower case letters for vectors. For example, A is a matrix, a is a column vector, and a^T is a row vector. Subscripts are used for subblocks of matrices. We use m and n to denote the numbers of rows and columns of A, respectively, and we assume m ≥ n without loss of generality throughout.
A : Input matrix (of size m × n)
W : Left low rank factor (of size m × k)
H : Right low rank factor (of size k × n)
m : Number of rows of input matrix
n : Number of columns of input matrix
k : Low rank
A_i : ith row block of matrix A
A^j : jth column block of matrix A
A_i^j : (i, j)th subblock of A
p : Number of parallel processes
p_r : Number of rows in processor grid
p_c : Number of columns in processor grid
2.2 Communication model
To analyze our algorithms, we use the α-β-γ model of distributed-memory parallel computation. In this model, interprocessor communication occurs in the form of messages sent between two processors across a bidirectional link (we assume a fully connected network). We model the cost of a message of size n words as α + nβ, where α is the per-message latency cost and β is the per-word bandwidth cost. Each processor can compute floating point operations (flops) on data that resides in its local memory; γ is the per-flop computation cost. With this communication model, we can predict the performance of an algorithm in terms of the number of flops it performs as well as the number of words and messages it communicates. For simplicity, we will ignore the possibility of overlapping computation with communication in our analysis. For more details on the α-β-γ model, see Thakur et al. [2005]; Chan et al. [2007].
2.3 MPI collectives
Point-to-point messages can be organized into collective communication operations that involve more than two processors. MPI provides an interface to the most commonly used collectives like broadcast, reduce, and gather, as the algorithms for these collectives can be optimized for particular network topologies and processor characteristics. The algorithms we consider use the allgather, reducescatter, and allreduce collectives, so we review them here, along with their costs. Our analysis assumes optimal collective algorithms are used (see Thakur et al. [2005]; Chan et al. [2007]), though our implementation relies on the underlying MPI implementation.
At the start of an allgather collective, each of p processors owns data of size n/p. After the allgather, each processor owns a copy of the entire data of size n. The cost of an allgather is α log p + β ((p−1)/p) n. At the start of a reducescatter collective, each processor owns data of size n. After the reducescatter, each processor owns a subset of the sum over all data, which is of size n/p. (Note that the reduction can be computed with other associative operators besides addition.) The cost of a reducescatter is α log p + (β + γ) ((p−1)/p) n. At the start of an allreduce collective, each processor owns data of size n. After the allreduce, each processor owns a copy of the sum over all data, which is also of size n. The cost of an allreduce is 2α log p + (2β + γ) ((p−1)/p) n. Note that the costs of each of the collectives are zero when p = 1.
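As a rough illustration, the collective costs above can be encoded in a small cost model. The function names are ours, and real MPI libraries choose among several algorithms depending on message size and topology, so these formulas are estimates, not measurements.

```python
# A sketch of the alpha-beta-gamma cost model for the three collectives
# used in this paper. The log(p) latency terms and (p-1)/p volume factors
# follow the standard tree/butterfly implementations (Thakur et al.,
# Chan et al.); actual costs depend on the MPI library in use.
import math

def allgather_cost(p, n, alpha, beta):
    """Each of p processors starts with n/p words; all end with n words."""
    if p == 1:
        return 0.0
    return alpha * math.log2(p) + beta * ((p - 1) / p) * n

def reduce_scatter_cost(p, n, alpha, beta, gamma):
    """Each processor starts with n words; each ends with n/p of the sum."""
    if p == 1:
        return 0.0
    return alpha * math.log2(p) + (beta + gamma) * ((p - 1) / p) * n

def allreduce_cost(p, n, alpha, beta, gamma):
    """Modeled as a reduce-scatter followed by an all-gather."""
    return (reduce_scatter_cost(p, n, alpha, beta, gamma)
            + allgather_cost(p, n, alpha, beta))
```

For example, with α = β = γ = 1 and n = 100 words on p = 4 processors, the model predicts an allreduce cost of 229 time units, dominated by the bandwidth and reduction terms.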
3 Related work
In the data mining and machine learning literature there is an overlap between low rank approximations and matrix factorizations due to the nature of applications. Despite its name, Nonnegative “Matrix Factorization” is really a low rank approximation. In this section, we discuss the various distributed NMF algorithms.
The recent distributed NMF algorithms in the literature are Liao et al. [2014], Liu et al. [2010], Gemulla et al. [2011], Yin et al. [2014] and Faloutsos et al. [2014]. All of these works propose distributed NMF algorithms implemented using Hadoop. Liu, Yang, Fan, He and Wang Liu et al. [2010] propose running Multiplicative Update (MU) for KL divergence, squared loss, and “exponential” loss functions. Matrix multiplication, elementwise multiplication, and elementwise division are the building blocks of the MU algorithm. The authors discuss performing these matrix operations effectively in Hadoop for sparse matrices. In a similar direction, Liao, Zhang, Guan and Zhou Liao et al. [2014] implement an open source Hadoop-based MU algorithm and study its scalability on large-scale biological datasets. Also, Yin, Gao and Zhang Yin et al. [2014] present a scalable NMF that can perform frequent updates, which aim to use the most recently updated data. Gemulla, Nijkamp, Haas and Sismanis Gemulla et al. [2011] propose a generic algorithm that works on different loss functions, often involving the distributed computation of the gradient. According to the authors, this formulation can also be extended to handle nonnegativity constraints. Similarly, Faloutsos, Beutel, Xing and Papalexakis Faloutsos et al. [2014] propose a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. The authors also provide an implementation that can enforce nonnegativity constraints on the factor matrices.
4 Foundations
In this section, we will briefly introduce the Alternating Nonnegative Least Squares (ANLS) framework, multiple methods for solving nonnegative least squares problems (NLS) including Block Principal Pivoting (BPP), and a straightforward approach to parallelization of the framework.
4.1 Alternating Nonnegative Least Squares
According to the ANLS framework, we first partition the variables of the NMF problem into two blocks, W and H.
Then we solve the following equations iteratively until a stopping criterion is satisfied.
\[
\mathbf{W} \leftarrow \operatorname*{arg\,min}_{\tilde{\mathbf{W}} \ge 0} \|\mathbf{A} - \tilde{\mathbf{W}}\mathbf{H}\|_F^2,
\]
\[
\mathbf{H} \leftarrow \operatorname*{arg\,min}_{\tilde{\mathbf{H}} \ge 0} \|\mathbf{A} - \mathbf{W}\tilde{\mathbf{H}}\|_F^2.
\]
The optimization subproblems for W and H are NLS problems, which can be solved by a number of methods, from generic constrained convex optimization to active-set methods. Typical approaches form the normal equations of the least squares problem (and then solve them enforcing the nonnegativity constraint), which involves matrix multiplications of the factor matrices with the data matrix. Algorithm 1 shows this generic approach.
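To make the data flow of one ANLS pass concrete, here is a minimal dense NumPy sketch. It forms the normal-equations quantities and applies a crude projected solve (unconstrained least squares followed by clipping); this stand-in is our own simplification for illustration, not the BPP solver used in the paper, and `anls_step` is a hypothetical name.

```python
# One alternating pass of a dense ANLS-style iteration: update W with H
# fixed, then H with W fixed, via the normal equations of each subproblem.
# The "solve then clip" step is a crude heuristic standing in for a true
# NLS method such as BPP.
import numpy as np

def anls_step(A, W, H):
    # W-subproblem quantities: HH^T is k x k, AH^T is m x k.
    HHt = H @ H.T
    AHt = A @ H.T
    W = np.maximum(0.0, AHt @ np.linalg.pinv(HHt))  # unconstrained solve, then project
    # H-subproblem quantities: W^T W is k x k, W^T A is k x n.
    WtW = W.T @ W
    WtA = W.T @ A
    H = np.maximum(0.0, np.linalg.pinv(WtW) @ WtA)
    return W, H
```

Note that the large m × n matrix A only ever appears in the products A H^T and W^T A; this is the observation the parallel algorithm exploits, since everything else involves only k × k or skinny matrices.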
The specifics of lines 4 and 5 depend on the NLS method used. For example, the update equations for MU Seung and Lee [2001] are
\[
w_{ij} \leftarrow w_{ij} \frac{(\mathbf{A}\mathbf{H}^T)_{ij}}{(\mathbf{W}\mathbf{H}\mathbf{H}^T)_{ij}},
\]
\[
h_{ij} \leftarrow h_{ij} \frac{(\mathbf{W}^T\mathbf{A})_{ij}}{(\mathbf{W}^T\mathbf{W}\mathbf{H})_{ij}}.
\]
Note that after computing A H^T and H H^T, the cost of updating W is dominated by computing W(H H^T), which is O(mk²) flops.
Given A H^T and W(H H^T), each entry of W can be updated independently and cheaply.
Likewise, the extra computation cost of updating H is O(nk²) flops.
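The MU update rules above can be sketched in a few lines of NumPy; `mu_step` and the small `eps` guard against division by zero are our own naming and assumption, not the paper's implementation.

```python
# One dense multiplicative-update (MU) iteration. Each factor entry is
# rescaled by a nonnegative ratio, so nonnegativity is preserved and the
# Frobenius-norm objective is non-increasing.
import numpy as np

def mu_step(A, W, H, eps=1e-12):
    # W update: elementwise scale by (A H^T) / (W H H^T).
    HHt = H @ H.T                         # k x k
    W = W * (A @ H.T) / (W @ HHt + eps)   # O(mnk) + O(mk^2) flops
    # H update: elementwise scale by (W^T A) / (W^T W H).
    WtW = W.T @ W                         # k x k
    H = H * (W.T @ A) / (WtW @ H + eps)   # O(mnk) + O(nk^2) flops
    return W, H
```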
HALS updates W and H by applying block coordinate descent on the columns of W and rows of H Cichocki et al. [2009].
The update rules are
\[
\mathbf{w}^i \leftarrow \frac{\left[ (\mathbf{A}\mathbf{H}^T)^i - \sum_{\substack{l=1 \\ l \ne i}}^{k} (\mathbf{H}\mathbf{H}^T)_{li}\, \mathbf{w}^l \right]_+}{(\mathbf{H}\mathbf{H}^T)_{ii}},
\]
\[
\mathbf{h}_i \leftarrow \frac{\left[ (\mathbf{W}^T\mathbf{A})_i - \sum_{\substack{l=1 \\ l \ne i}}^{k} (\mathbf{W}^T\mathbf{W})_{li}\, \mathbf{h}_l \right]_+}{(\mathbf{W}^T\mathbf{W})_{ii}},
\]
where w^i is the ith column of W and h_i is the ith row of H.
Note that the columns of W and rows of H are updated in order, so that the most up-to-date values are used in the update.
The extra computation is again O((m+n)k²) flops for updating both W and H.
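The HALS sweep above can be sketched in NumPy as follows; `hals_step` is a hypothetical name, and the `eps` guard against a zero diagonal is an implementation assumption. The factors are updated in place so that each column/row update sees the newest values, as the text requires.

```python
# One HALS sweep: exact coordinate minimization over each column of W,
# then each row of H, with the nonnegativity projection [.]_+ applied.
import numpy as np

def hals_step(A, W, H, eps=1e-12):
    k = W.shape[1]
    AHt = A @ H.T          # m x k, computed once per sweep (H fixed here)
    HHt = H @ H.T          # k x k
    for i in range(k):
        # num = (AH^T)^i - sum_{l != i} (HH^T)_{li} w^l
        num = AHt[:, i] - W @ HHt[:, i] + W[:, i] * HHt[i, i]
        W[:, i] = np.maximum(0.0, num / (HHt[i, i] + eps))
    WtA = W.T @ A          # k x n
    WtW = W.T @ W          # k x k
    for i in range(k):
        num = WtA[i, :] - WtW[i, :] @ H + WtW[i, i] * H[i, :]
        H[i, :] = np.maximum(0.0, num / (WtW[i, i] + eps))
    return W, H
```

Because each column/row update is an exact minimization of a convex subproblem, the objective is non-increasing across sweeps.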
4.2 Block Principal Pivoting
In this paper, we focus on and use the block principal pivoting method Kim and Park [2011] to solve the nonnegative least squares problems, as it has the fastest convergence rate (in terms of the number of iterations). We note that any of the NLS methods can be used within our parallel frameworks (Algorithms 2 and 3).
BPP is the state-of-the-art method for solving the NLS subproblems in Eq. (4.1). The main subroutine of BPP is the single right-hand side NLS problem
\[
\min_{\mathbf{x} \ge 0} \|\mathbf{C}\mathbf{x} - \mathbf{b}\|_2^2.
\]
The Karush-Kuhn-Tucker (KKT) optimality conditions for Eq. (4.2) are as follows:
\[
\begin{aligned}
\mathbf{y} &= \mathbf{C}^T \mathbf{C}\mathbf{x} - \mathbf{C}^T \mathbf{b} \\
\mathbf{y} &\ge 0 \\
\mathbf{x} &\ge 0 \\
\mathbf{x}^T \mathbf{y} &= 0.
\end{aligned}
\]
The KKT conditions (4.2) state that at optimality, the support sets (i.e., the nonzero elements) of x and y are complementary to each other. Therefore, Eq. (4.2) is an instance of the Linear Complementarity Problem (LCP), which arises frequently in quadratic programming. When k ≪ min(m, n), active-set and active-set-like methods are very suitable because most computations involve matrices of sizes m × k, n × k, and k × k, which are small and easy to handle.
If we knew which indices correspond to nonzero values in the optimal solution, then computing it is an unconstrained least squares problem on these indices. In the optimal solution, call the set of indices i such that x_i = 0 the active set, and let the remaining indices be the passive set. The BPP algorithm works to find this active set and passive set. Since the above problem is convex, the correct partition of the optimal solution will satisfy the KKT conditions (Eq. (4.2)). The BPP algorithm greedily swaps indices between the active and passive sets until finding a partition that satisfies the KKT conditions. In the partition of the optimal solution, the values of the indices that belong to the active set will be zero. The values of the indices that belong to the passive set are determined by solving the least squares problem using the normal equations restricted to the passive set. Kim and Park Kim and Park [2011] discuss the BPP algorithm in further detail. Because the algorithm is iterative, we define the cost of solving c instances of Eq. (4.2) using BPP, given the k × k matrix C^T C and c columns of the form C^T b, as a single quantity used in the analysis below.
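BPP itself involves careful index-set bookkeeping; as a hedged stand-in, the sketch below solves the single right-hand-side problem by cyclic coordinate descent on the normal equations and then checks the KKT conditions of Eq. (4.2). The function names and the choice of coordinate descent are ours, not the paper's; the point is that any candidate solution can be certified through the complementarity conditions that BPP targets directly.

```python
# Solve min_{x >= 0} ||Cx - b||_2^2 by cyclic coordinate descent on the
# normal equations (G = C^T C, c = C^T b), then verify the KKT conditions
# y = Gx - c >= 0, x >= 0, x^T y = 0.
import numpy as np

def nnls_cd(C, b, sweeps=500):
    G = C.T @ C                 # k x k Gram matrix
    c = C.T @ b
    x = np.zeros(C.shape[1])
    for _ in range(sweeps):
        for i in range(len(x)):
            # Exact minimization over x_i with the other coordinates fixed,
            # projected onto the nonnegative orthant.
            x[i] = max(0.0, x[i] + (c[i] - G[i] @ x) / G[i, i])
    return x

def kkt_residual(C, b, x):
    """Maximum violation of the KKT conditions; near zero at an optimum."""
    y = C.T @ (C @ x) - C.T @ b
    return max(0.0, -float(y.min()), -float(x.min()), abs(float(x @ y)))
```

For C = [[1,0],[0,1],[1,1]] and b = (1, −1, 0), the unconstrained solution (1, −1) is infeasible; the constrained optimum is (0.5, 0), and at that point y = (0, 1.5) satisfies complementarity exactly.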
4.3 Naive Parallel NMF Algorithm
In this section we present a naive parallelization of any NMF algorithm Fairbanks et al. [2015] that follows the ANLS framework (Algorithm 1). Each NLS problem with multiple right-hand sides can be parallelized based on the observation that the problems for multiple right-hand sides are independent from each other. That is, we can solve several instances of Eq. (4.2) independently for different b where C is fixed, which implies that we can optimize row blocks of W and column blocks of H in parallel.
Algorithm 2 presents a straightforward approach to setting up the independent subproblems. Let us divide W into p row blocks and H into p column blocks. We then double-partition the data matrix A accordingly into p row blocks and p column blocks so that processor i owns both the row block A_i and the column block A^i (see Figure 1). With these partitions of the data and the variables, one can implement any ANLS algorithm in parallel, with only one communication step for each solve.
The computation cost of Algorithm 2 depends on the local NLS algorithm. For comparison with our proposed algorithm, we assume each processor uses BPP to solve the local NLS problems. Thus, the computation at line 7 consists of computing H H^T and A_i H^T, and solving NLS given the normal equations formulation of rank k for m/p columns. Likewise, the computation at line 10 consists of computing W^T W and W^T A^i, and solving NLS for n/p columns. In the dense case, this amounts to O(mnk/p + (m+n)k²) flops. In the sparse case, processor i performs O((nnz(A_i) + nnz(A^i))k) flops to compute A_i H^T and W^T A^i instead of O(mnk/p).
The communication cost of the allgathers at lines 6 and 9, based on the expression given in Section 2.3, is α · O(log p) + β · O((m+n)k). The local memory requirement includes storing each processor's part of matrices A, W, and H. In the case of dense A, this is O(mn/p + (m+n)k) words, as A is stored twice; in the sparse case, processor i requires O(nnz(A_i) + nnz(A^i)) words for the input matrix and O((m+n)k) words for the output factor matrices. Local memory is also required for storing the temporary matrices W^T W and H H^T of size O(k²) words.
We summarize the algorithmic costs of Algorithm 2 in Table 2. This naive algorithm Fairbanks et al. [2015] has three main drawbacks: (1) it requires storing two copies of the data matrix (one in row distribution and one in column distribution) and both full factor matrices locally, (2) it does not parallelize the computation of W^T W and H H^T (each processor computes them redundantly), and (3) as we will see in Section 5, it communicates more data than necessary.
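To see why drawback (3) matters at scale, a back-of-the-envelope comparison of per-processor communication volumes can be sketched as follows. The formulas drop constants and come from the analyses in this paper (the naive algorithm gathers both full factors; the 2D algorithm of Section 5 moves O(√(mnk²/p)) words); the helper names are ours.

```python
# Per-iteration words moved per processor, constants dropped.
import math

def naive_words(m, n, k, p):
    # Naive: all-gather full W (mk words) and full H (nk words).
    return (m + n) * k

def grid2d_words(m, n, k, p):
    # 2D grid: O(sqrt(m*n*k^2 / p)) words, from the Section 5 analysis.
    return math.sqrt(m * n * k * k / p)

# Example: the squarish synthetic matrix from Section 6 on p = 600
# processors; the 2D grid moves roughly 50x fewer words per processor.
m, n, k, p = 172_800, 115_200, 50, 600
ratio = naive_words(m, n, k, p) / grid2d_words(m, n, k, p)
```

Note that the naive volume is independent of p, so adding processors does not reduce each processor's communication, while the 2D volume shrinks like 1/√p.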
5 High Performance Parallel NMF
We present our proposed algorithm, HPCNMF, as Algorithm 3. The algorithm assumes a 2D distribution of the data matrix A across a p_r × p_c grid of processors (with p = p_r p_c), as shown in Figure 2. Algorithm 3 performs an alternating method in parallel with a per-iteration bandwidth cost of O(mk/p_r + nk/p_c) words, latency cost of O(log p) messages, and load-balanced computation (up to the sparsity pattern of A and convergence rates of local BPP computations). To minimize the communication cost and local memory requirements, p_r and p_c are chosen so that m/p_r ≈ n/p_c ≈ √(mn/p), in which case the bandwidth cost is O(√(mnk²/p)) words.
If the matrix is very tall and skinny, i.e., m/p ≥ n, then we choose p_r = p and p_c = 1. In this case, the distribution of the data matrix is 1D, and the bandwidth cost is O(nk) words.
The matrix distributions for Algorithm 3 are given in Figure 2; we use a 2D distribution of A and 1D distributions of W and H. Recall from Table 1 that A_i and A^j denote row and column blocks of A, respectively. Thus, the notation (W_i)_j denotes the jth row block within the ith row block of W. Lines 6–14 compute W for a fixed H, and lines 16–24 compute H for a fixed W; note that the computations and communication patterns for the two alternating iterations are analogous.
In the rest of this section, we derive the periteration computation and communication costs, as well as the local memory requirements. We also argue the communicationoptimality of the algorithm in the dense case. Table 2 summarizes the results of this section and compares them to Naive.
Algorithm | Flops | Words | Messages | Memory
Naive | O(mnk/p + (m+n)k²) | O((m+n)k) | O(log p) | O(mn/p + (m+n)k)
HPCNMF (m/p ≥ n) | O(mnk/p + (m+n)k²/p) | O(nk) | O(log p) | O(mn/p + nk)
HPCNMF (m/p < n) | O(mnk/p + (m+n)k²/p) | O(√(mnk²/p)) | O(log p) | O(mn/p + √(mnk²/p))
Lower Bound | mnk/p | Ω(min(√(mnk²/p), nk)) | Ω(log p) | mn/p
Computation Cost
Local matrix computations occur at lines 6, 10, 16, and 20. In the case that A is dense, each processor performs
O(mnk/p + (m+n)k²/p) flops. In the case that A is sparse, processor (i, j) performs O(nnz(A_i^j)k) flops in computing A_i^j H_j^T and W_i^T A_i^j, and O((m/p_r + n/p_c)k²) flops in computing the local Gram contributions. Local nonnegative least squares problems occur at lines 14 and 24. In each case, the symmetric positive semidefinite matrix is k × k, and the numbers of columns/rows of length k to be computed are m/p and n/p, respectively; solving these local NLS problems requires additional flops that depend on the BPP iteration counts. There are computation costs associated with the allreduce and reducescatter collectives, but these contribute only lower order terms.
Communication Cost
Communication occurs during six collective operations (lines 7, 9, 12, 17, 19, and 22). We use the cost expressions presented in Section 2.3 for these collectives. The communication cost of the allreduces (lines 7 and 17) is α · O(log p) + β · O(k²); the cost of the two allgathers (lines 9 and 19) is α · O(log p) + β · O(mk/p_r + nk/p_c); and the cost of the two reducescatters (lines 12 and 22) is α · O(log p) + β · O(mk/p_r + nk/p_c).
In the case that m/p < n, we choose p_r = √(mp/n) and p_c = √(np/m), and these communication costs simplify to α · O(log p) + β · O(√(mnk²/p)). In the case that m/p ≥ n, we choose p_r = p and p_c = 1, and the costs simplify to α · O(log p) + β · O(nk).
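The grid choice described above can be sketched as a small helper; `choose_grid` is a hypothetical name, and a production implementation would restrict p_r to divisors of p rather than rounding.

```python
# Pick a processor grid p_r x p_c so that m/p_r and n/p_c are balanced
# (both near sqrt(mn/p)), falling back to a 1D grid for tall-and-skinny
# matrices with m/p >= n.
import math

def choose_grid(m, n, p):
    if m / p >= n:                 # very tall-and-skinny: 1D grid
        return p, 1
    pr = max(1, min(p, round(math.sqrt(m * p / n))))
    pc = max(1, p // pr)           # approximate; assumes pr roughly divides p
    return pr, pc
```

For a square 1000 × 1000 matrix on 16 processors this yields a 4 × 4 grid, while a 1,000,000 × 100 matrix on 64 processors yields the 1D grid 64 × 1.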
Memory Requirements
The local memory requirement includes storing each processor's part of matrices A, W, and H. In the case of dense A, this is O(mn/p + (m+n)k/p) words; in the sparse case, processor (i, j) requires O(nnz(A_i^j)) words for the input matrix and O((m+n)k/p) words for the output factor matrices. Local memory is also required for storing temporary matrices, of size O(mk/p_r + nk/p_c) words.
In the dense case, assuming m/p_r ≥ k and n/p_c ≥ k, the local memory requirement is no more than a constant times the size of the original data. For the optimal choices of p_r and p_c, this assumption simplifies to p ≤ mn/k².
We note that if the temporary memory requirements become prohibitive, the computation of W^T A and A H^T via allgathers and reducescatters can be blocked, decreasing the local memory requirements at the expense of greater latency costs. While this case is plausible for sparse A, we did not encounter local memory issues in our experiments.
Communication Optimality
In the case that A is dense, Algorithm 3 provably minimizes communication costs. Theorem 5 establishes the bandwidth cost lower bound for any algorithm that computes W^T A or A H^T each iteration. A latency lower bound of Ω(log p) exists in our communication model for any algorithm that aggregates global information Chan et al. [2007]. For NMF, this global aggregation is necessary each iteration to compute the residual error in the case that A is distributed across all processors, for example. Based on the costs derived above, HPCNMF is communication optimal under the assumption p ≤ mn/k², matching these lower bounds to within constant factors.
[Demmel et al. [2013]] Let A, W, and H be dense matrices of dimensions m × n, m × k, and k × n, respectively, with k ≤ n ≤ m. If p ≤ mn/k², then any distributed-memory parallel algorithm on p processors that load balances the matrix distributions and computes W^T A and/or A H^T must communicate at least Ω(min(√(mnk²/p), nk)) words along its critical path. {proof} The proof follows directly from [Demmel et al., 2013, Section II.B]. Each matrix multiplication W^T A and A H^T has dimensions m, n, and k, so the assumption p ≤ mn/k² ensures that neither multiplication has “3 large dimensions.” Thus, the communication lower bound is either Ω(√(mnk²/p)) in the case of p > m/n (or “2 large dimensions”), or Ω(nk) in the case of p ≤ m/n (or “1 large dimension”). If p > m/n, then √(mnk²/p) < nk, so the lower bound can be written as Ω(min(√(mnk²/p), nk)).
We note that the communication costs of Algorithm 3 are the same for dense and sparse data matrices (the data matrix itself is never communicated). In the case that A is sparse, this communication lower bound does not necessarily apply, as the required data movement depends on the sparsity pattern of A. Thus, we cannot make claims of optimality in the sparse case (for general A). The communication lower bounds for W^T A and/or A H^T (where A is sparse) can be expressed in terms of hypergraphs that encode the sparsity structure of A Ballard et al. [2015]. Indeed, hypergraph partitioners have been used to reduce communication and achieve load balance for a similar problem: computing a low rank representation of a sparse tensor (without nonnegativity constraints on the factors) Kaya and Uçar [2015].
6 Experiments
In the data mining and machine learning community, there has been great interest in using Hadoop for large scale implementations. Hadoop performs extensive disk I/O and was designed for processing gigantic text files. Many of the real world data sets available for research are large scale sparse internet text data, such as bag-of-words matrices, recommender systems, and social networks, and accordingly there has been interest in Hadoop implementations of matrix factorization algorithms Gemulla et al. [2011]; Liu et al. [2010]; Liao et al. [2014]. However, the use of NMF extends beyond sparse internet data; it is also applicable to dense real world data such as video and images. Hence, in order to keep our implementation applicable to a wider audience, we chose MPI for the distributed implementation. Apart from this application point of view, we decided on an MPI C++ implementation for other technical advantages that are necessary for scientific applications: (1) it can leverage recent hardware improvements; (2) it communicates effectively between nodes; and (3) numerically stable BLAS and LAPACK routines are available. We identified a few synthetic and real world datasets to experiment with our MPI implementation and a few baselines against which to compare our performance.
6.1 Experimental Setup
6.1.1 Datasets
We used sparse and dense matrices, both synthetically generated and from the real world. We explain the datasets in this section.

Dense Synthetic Matrix (DSYN): We generate a uniform random matrix of size 172,800 × 115,200 and add random Gaussian noise. The dimensions of this matrix are chosen so that it is uniformly distributable across processes. Every process has its own prime seed, different from those of the other processes, for generating its portion of the input random matrix A.

Sparse Synthetic Matrix (SSYN): We generate a random sparse Erdős-Rényi matrix of the same dimensions, 172,800 × 115,200, as the dense matrix, with density of 0.001. That is, every entry is nonzero with probability 0.001.

Dense Real World Matrix (Video): Generally, NMF is performed on video data for background subtraction to detect moving objects: the low rank matrix represents the background, and the error matrix contains the moving objects. Detecting moving objects has many real-world applications such as traffic estimation, security monitoring, etc. In the case of detecting moving objects, only the last minute or two of video is taken from the live video camera. An algorithm to incrementally adjust the NMF based on new streaming video is presented in Kim et al. [2014]. To simulate this scenario, we collected a video at a busy intersection of the Georgia Tech campus at 20 frames per second for two minutes. We then reshaped the data so that every RGB frame is a column of our matrix, yielding a dense matrix with dimensions 1,013,400 × 2,400.

Sparse Real World Matrix (Webbase): We identified this dataset of a very large directed sparse graph with nearly 1 million nodes (1,000,005) and 3.1 million edges (3,105,536). The dataset was first reported by Williams et al. Williams et al. [2009]. The NMF output of this directed graph helps us understand clusters in graphs. The sizes of both real world datasets were adjusted to the nearest dimensions that allow uniform distribution of the matrix.
6.1.2 Machine
We conducted our experiments on “Edison” at the National Energy Research Scientific Computing Center. Edison is a Cray XC30 supercomputer with a total of 5,576 compute nodes, where each node has dualsocket 12core Intel Ivy Bridge processors. Each of the 24 cores has a clock rate of 2.4 GHz (translating to a peak floating point rate of 460.8 Gflops/node) and private 64KB L1 and 256KB L2 caches; each of the two sockets has a (shared) 30MB L3 cache; each node has 64 GB of memory. Edison uses a Cray “Aries” interconnect that has a dragonfly topology. Because our experiments use a relatively small number of nodes, we consider the local connectivity: a “blade” comprises 4 nodes and a router, and sets of 16 blades’ routers are fully connected via a circuit board backplane (within a “chassis”). Our experiments do not exceed 64 nodes, so we can assume a very efficient, fully connected network.
6.1.3 Initialization
To ensure a fair comparison among algorithms, the same random seed was used across the different methods appropriately. That is, the initial random matrix H was generated with the same random seed when testing with different algorithms (note that W need not be initialized). This ensures that all of the algorithms perform the same computations (up to roundoff errors), though the only computation with a running time that is sensitive to the matrix values is the local NLS solve via BPP.
6.2 Algorithms
For each of our data sets, we benchmark and compare three algorithms: (1) Algorithm 2, (2) Algorithm 3 with p_r = p and p_c = 1 (1D processor grid), and (3) Algorithm 3 with p_r ≈ p_c ≈ √p (2D processor grid). We choose these three algorithms to confirm the following conclusions from the analysis of Section 5: the performance of the naive parallelization (Algorithm 2) will be severely limited by communication overheads, and the correct choice of processor grid within Algorithm 3 is necessary to optimize performance. To demonstrate the latter conclusion, we choose the two extreme choices of processor grids and test some data sets where a 1D processor grid is optimal (e.g., the Video matrix) and some where a squarish 2D grid is optimal (e.g., the Webbase matrix).
While we would like to compare against other high-performance NMF algorithms in the literature, the only other distributed-memory implementations of which we're aware are implemented using Hadoop and are designed only for sparse matrices Liao et al. [2014], Liu et al. [2010], Gemulla et al. [2011], Yin et al. [2014] and Faloutsos et al. [2014]. We stress that Hadoop is not designed for high performance, requiring disk I/O between steps, so a run time comparison between a Hadoop implementation and a C++/MPI implementation is not a fair comparison of parallel algorithms. To give a qualitative example of the differences in run time: a Hadoop implementation of the MU algorithm on a large sparse matrix (with k = 8) takes on the order of 50 minutes per iteration Liu et al. [2010]; our implementation takes about a second per iteration for the sparse synthetic data set (which is an order of magnitude larger in terms of rows, columns, and nonzeros) running on only 24 nodes.
6.3 Time Breakdown
To separate the computation and communication costs of the algorithms, we present a breakdown of the time spent in the various tasks within each algorithm for both performance experiments. For Algorithm 3, computing each of the factor matrices involves three local computation tasks and three communication tasks:

MM, multiplying the local data matrix with one of the factor matrices;

NLS, solving the set of local NLS problems using BPP;

Gram, computing the local contribution to the Gram matrix;

AllGather, to collect the factor-matrix rows needed for the global matrix multiplication;

ReduceScatter, to sum the partial products of the global matrix multiplication;

AllReduce, to sum the global Gram matrix.

In our results, we do not report the costs of these tasks for and separately; we report their sum, though we note that the two contributions are not always balanced for each task. Algorithm 2 performs all of these tasks except ReduceScatter and AllReduce; all of its communication occurs in AllGathers.
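As a concrete (serial) illustration of the tasks above, the following sketch computes one H-update in numpy; the collectives appear only as comments marking where the distributed version would communicate. Note that the NLS step here is a simple project-after-solve stand-in, not the BPP algorithm itself, and `update_H` is a hypothetical helper name.

```python
import numpy as np

def update_H(A, W):
    """One half-iteration (serial sketch): update H given W for A ~= W H.

    Comments mark where the distributed algorithm would place collectives.
    The NLS step below is an unconstrained normal-equations solve followed
    by projection onto the nonnegative orthant -- a stand-in for BPP.
    """
    # Gram: local contribution W^T W (an AllReduce sums these across processors)
    G = W.T @ W                     # k x k
    # MM: multiply the local data matrix with the factor matrix
    # (AllGather assembles the needed rows of W; ReduceScatter sums the
    # partial products of A^T W)
    AtW = A.T @ W                   # n x k
    # NLS: approximately solve min_{H >= 0} ||W H - A||_F
    H = np.maximum(np.linalg.solve(G, AtW.T), 0.0)  # k x n
    return H
```

When A is exactly W H with nonnegative factors and W has full column rank, this sketch recovers H exactly, since the projection is a no-op.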
6.4 Algorithmic Comparison
Our first set of experiments is designed primarily to compare the three parallel implementations. For each data set, we fix the number of processors to be 600 and vary the rank of the desired factorization. Because most of the computation (except for NLS) and bandwidth costs are linear in (except for the AllReduce), we expect linear performance curves for each algorithm individually.
The left side of Figure 3 shows the results of this experiment for all four data sets. The first conclusion we draw is that HPC-NMF with a 2D processor grid performs significantly better than Naive; the largest speedup is , for the sparse synthetic data and (a particularly communication-bound problem). Also, as predicted, the 2D processor grid outperforms the 1D processor grid on the squarish matrices. While we expect the 1D processor grid to outperform the 2D grid for the tall-and-skinny Video matrix, their performance is comparable; this is because both algorithms are computation bound, as we see from Figure (g), so the difference in communication cost is negligible.
The second conclusion we can draw is that the scaling with tends to be close to linear, with the exception of the Webbase matrix. We see from Figure (e) that this problem spends much of its time in NLS, which does not scale linearly with .
We can also compare HPC-NMF with a 1D processor grid against Naive for the squarish matrices (SSYN, DSYN, and Webbase). Our analysis does not predict a significant difference in the communication costs of these two approaches (when ), and indeed Naive outperforms HPC-NMF-1D for two of the three matrices (the opposite holds for DSYN). The main differences appear in the AllGather versus ReduceScatter communication costs and in the local MM (all of which are involved in the computation). In all three cases, our proposed 2D processor grid (with the optimal choice of ) outperforms both alternatives.
Table 3. Per-iteration running times in seconds (a dash indicates the dense matrices do not fit on 1 or 4 nodes).

                   Naive                  |            HPC-NMF-1D                |            HPC-NMF-2D
Cores   DSYN    SSYN    Video   Webbase  |  DSYN    SSYN    Video   Webbase  |  DSYN    SSYN    Video   Webbase
24      --      6.5632  --      48.0256  |  --      5.0821  --      52.8549  |  --      4.8427  --      84.6286
96      --      1.5929  --      18.5507  |  --      1.4836  --      14.5873  |  --      1.1147  --      16.6966
216     2.1819  0.6027  2.7899  7.1274   |  2.1548  0.9488  4.7928  9.2730   |  1.5283  0.4816  1.6106  7.4799
384     1.2594  0.6466  2.2106  5.1431   |  1.2559  0.7695  3.8295  6.4740   |  0.8620  0.2661  0.8963  4.0630
600     1.1745  0.5592  1.7583  4.6825   |  0.9685  0.6666  0.5994  6.2751   |  0.5519  0.1683  0.5699  2.7376
6.5 Strong Scalability
The goal of our second set of experiments is to demonstrate the (strong) scalability of each algorithm. For each data set, we fix the rank at 50 and vary the number of processors (this is a strong-scaling experiment because the size of the data set is fixed). We run our experiments on cores, which translates to nodes. The dense matrices are too large for 1 or 4 nodes, so in those cases we benchmark only on cores.
The right side of Figure 3 shows the scaling results for all four data sets, and Table 3 gives the overall per-iteration time for each algorithm, processor count, and data set. We first consider HPC-NMF with a 2D processor grid: comparing the performance on 25 nodes (600 cores) with 1 node (24 cores), we see nearly perfect parallel speedups. The parallel speedups are for SSYN and for the Webbase matrix. We believe the superlinear speedup on the Webbase matrix results from the running time being dominated by NLS: with more processors, the local NLS problems are smaller and more likely to fit in the smaller levels of cache, yielding better performance. For the dense matrices, the speedup of HPC-NMF on 25 nodes over 9 nodes is for DSYN and for Video, which is also nearly linear.
In the case of the Naive algorithm, we do observe parallel speedups, but they are not linear. For the sparse data, we see parallel speedups of and with a increase in the number of processors; for the dense data, we observe speedups of and with a increase in the number of processors. The main reason Naive does not achieve perfect scaling is its unnecessary communication overhead.
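For readers unfamiliar with strong-scaling metrics, the speedup and parallel-efficiency figures quoted above are computed as follows; the timings in the example are illustrative placeholders, not measurements from Table 3.

```python
def strong_scaling(times):
    """Compute strong-scaling metrics from per-iteration times.

    times: dict mapping core count -> per-iteration time in seconds.
    Returns, for each core count p, (speedup, efficiency) relative to the
    smallest run p0: speedup = T(p0)/T(p), efficiency = speedup/(p/p0).
    Perfect strong scaling gives efficiency 1.0; superlinear speedup
    (e.g., from cache effects) gives efficiency above 1.0.
    """
    p0 = min(times)
    t0 = times[p0]
    return {p: (t0 / t, (t0 / t) / (p / p0)) for p, t in times.items()}
```

For example, a hypothetical drop from 48.0 s on 24 cores to 2.0 s on 600 cores is a 24x speedup over a 25x increase in cores, i.e., 96% parallel efficiency.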
7 Conclusion
In this paper, we propose a high-performance distributed-memory parallel algorithm that computes an NMF factorization by iteratively solving alternating nonnegative least squares (NLS) subproblems. We show that by carefully designing the parallel algorithm, we can avoid unnecessary communication overheads and scale well to modest numbers of cores.
For the data sets on which we experimented, we showed that an efficient implementation of a naive parallel algorithm spends much of its time in inter-processor communication. HPC-NMF, by contrast, remains computation bound on up to 600 processors, typically spending most of its time in local NLS solves.
We focus in this work on BPP, which is more expensive per iteration than alternative methods like MU and HALS, because it has been shown to reduce overall running time in the sequential case by requiring fewer iterations Kim and Park [2011]. Because most of the per-iteration time of HPC-NMF is spent on local NLS, we believe further empirical exploration is necessary to confirm the advantages of BPP in the parallel case. We note that if MU or HALS were used for the local NLS, the relative cost of inter-processor communication would grow, making the communication efficiency of our algorithm even more important.
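For contrast with BPP, the classical multiplicative-update rule Seung and Lee [2001] for the H-subproblem can be sketched in a few lines of numpy; the helper name `mu_update_H` is ours, and a symmetric rule updates W.

```python
import numpy as np

def mu_update_H(A, W, H, eps=1e-9):
    """One multiplicative-update (MU) step for H in A ~= W H (Seung and Lee).

    Each step costs only two small matrix products plus elementwise
    operations, but MU typically needs more iterations than BPP to reach
    the same error; eps guards against division by zero. Nonnegativity is
    preserved automatically because all factors in the update are nonnegative.
    """
    return H * (W.T @ A) / ((W.T @ W) @ H + eps)
```

Repeated application of this rule does not increase the Frobenius-norm error, which is why it remains a popular, if slow, baseline.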
In future work, we would like to extend this algorithm to dense and sparse tensors, computing the CANDECOMP/PARAFAC decomposition in parallel with nonnegativity constraints on the factor matrices. We would also like to explore more intelligent distributions of sparse matrices: while our 2D distribution evenly divides rows and columns, it does not necessarily balance the nonzeros of the matrix, which can lead to load imbalance in MM. We are interested in using graph and hypergraph partitioning techniques to balance the memory and computation while simultaneously reducing communication costs as much as possible. Finally, we have not yet reached the limits of the scalability of HPC-NMF; we would like to expand our benchmarks to larger numbers of nodes on the same data sets to study performance behavior when communication costs completely dominate the running time.
This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its U.S. Department of Energy Contract No. DE-AC04-94AL85000.
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
References
 Ballard et al. [2015] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pages 86–88, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3588-1. doi: 10.1145/2755573.2755613. URL http://doi.acm.org/10.1145/2755573.2755613.
 Chan et al. [2007] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007. ISSN 1532-0634. doi: 10.1002/cpe.1206. URL http://dx.doi.org/10.1002/cpe.1206.
 Cichocki et al. [2009] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation. Wiley, 2009.
 Demmel et al. [2013] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS ’13, pages 261–272, 2013. doi: 10.1109/IPDPS.2013.80.
 Fairbanks et al. [2015] J. P. Fairbanks, R. Kannan, H. Park, and D. A. Bader. Behavioral clusters in dynamic graphs. Parallel Computing, 47:38–50, 2015.
 Faloutsos et al. [2014] C. Faloutsos, A. Beutel, E. P. Xing, E. E. Papalexakis, A. Kumar, and P. P. Talukdar. FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop. In Proceedings of SDM, pages 109–117, 2014. doi: 10.1137/1.9781611973440.13. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973440.13.
 Gemulla et al. [2011] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of KDD, pages 69–77. ACM, 2011.
 Hoyer [2004] P. O. Hoyer. Nonnegative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.
 Kaya and Uçar [2015] O. Kaya and B. Uçar. Scalable sparse tensor decompositions in distributed memory systems. Technical Report RR-8722, INRIA, May 2015.
 Kim and Park [2007] H. Kim and H. Park. Sparse nonnegative matrix factorizations via alternating nonnegativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
 Kim and Park [2011] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
 Kim et al. [2014] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
 Kuang et al. [2012] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of SDM, pages 106–117, 2012.
 Liao et al. [2014] R. Liao, Y. Zhang, J. Guan, and S. Zhou. CloudNMF: A MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets. Genomics, Proteomics & Bioinformatics, 12(1):48–51, 2014.
 Liu et al. [2010] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of the WWW, pages 681–690. ACM, 2010.
 Pauca et al. [2004] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons. Text mining using nonnegative matrix factorizations. In Proceedings of SDM, 2004.
 Seung and Lee [2001] D. Seung and L. Lee. Algorithms for nonnegative matrix factorization. NIPS, 13:556–562, 2001.
 Thakur et al. [2005] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49–66, 2005. doi: 10.1177/1094342005051521. URL http://hpc.sagepub.com/content/19/1/49.abstract.
 Wang and Zhang [2013] Y.-X. Wang and Y.-J. Zhang. Nonnegative matrix factorization: A comprehensive review. TKDE, 25(6):1336–1353, June 2013. ISSN 1041-4347. doi: 10.1109/TKDE.2012.51.
 Williams et al. [2009] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
 Yin et al. [2014] J. Yin, L. Gao, and Z. Zhang. Scalable nonnegative matrix factorization with blockwise updates. In Machine Learning and Knowledge Discovery in Databases, volume 8726 of LNCS, pages 337–352, 2014. ISBN 978-3-662-44884-1.