A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization


Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors and , for the given input matrix , such that . NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel software to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices.

We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for and . To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.


Ramakrishnan KannanGeorgia Techrkannan@gatech.edu \authorinfoGrey BallardSandia National Laboratoriesgmballa@sandia.gov \authorinfoHaesun ParkGeorgia Techhpark@cc.gatech.edu

1 Introduction

Non-negative Matrix Factorization (NMF) is the problem of finding two low rank factors and for a given input matrix , such that , where . Formally, NMF problem Seung and Lee [2001] can be defined as \SplitN min_\WW≥0,\HH≥0 &  f(\WW,\HH) ≡∥Å-\WW\HH∥_F^2.

NMF is widely used in data mining and machine learning as a dimension reduction and factor analysis method. It is a natural fit for many real world problems as the non-negativity is inherent in many representations of real-world data and the resulting low rank factors are expected to have natural interpretation. The applications of NMF range from text mining Pauca et al. [2004], computer vision Hoyer [2004], bioinformatics Kim and Park [2007], to blind source separation Cichocki et al. [2009], unsupervised clustering Kuang et al. [2012] and many other areas. For typical problems today, and can be on the order of thousands to millions or more, and is typically less than 100.

There is a vast literature on algorithms for NMF and their convergence properties Kim et al. [2014]. The commonly adopted NMF algorithms are – (i) Multiplicative Update (MU) Seung and Lee [2001] (ii) Hierarchical Alternative Least Squares (HALS) Cichocki et al. [2009] (iii) Block Principal Pivoting (BPP) Kim and Park [2011], and (iv) Stochastic Gradient Descent (SGD) Updates Gemulla et al. [2011]. As described in Equation 4.1, most of the algorithms in NMF literature are based on Alternating Non-negative Least Squares (ANLS) scheme that iteratively optimizes each of the low rank factors and while keeping the other fixed. It is important to note that in such iterative alternating minimization techniques, each subproblem is a constrained convex optimization problem. Each of these subproblems is then solved using standard optimization techniques such as projected gradient, interior point, etc., and a detailed survey for solving this constrained convex optimization problem can be found in Wang and Zhang [2013]; Kim et al. [2014]. In this paper, for solving the subproblems, our implementation uses a fast active-set based method called Block Principal Pivoting (BPP) Kim and Park [2011], but the parallel algorithm proposed in this paper can be easily extended for other algorithms such as MU and HALS.

Recently with the advent of large scale internet data and interest in Big Data, researchers have started studying scalability of many foundational machine learning algorithms. To illustrate the dimension of matrices commonly used in the machine learning community, we present a few examples. Now-a-days the adjacency matrix of a billion-node social network is common. In the matrix representation of a video data, every frame contains three matrices for each RGB color, which is reshaped into a column. Thus in the case of a 4K video, every frame will take approximately 27 million rows (4096 row pixels x 2196 column pixels x 3 colors). Similarly, the popular representation of documents in text mining is a bag-of-words matrix, where the rows are the dictionary and the columns are the documents (e.g., webpages). Each entry in the bag-of-words matrix is generally the frequency count of the word in the document . Typically with the explosion of the new terms in social media, the number of words spans to millions.

To handle such high dimensional matrices, it is important to study low rank approximation methods in a data-distributed environment. For example, in many large scale scenarios, data samples are collected and stored over many general purpose computers, as the set of samples is too large to store on a single machine. In this case, the computation must also be distributed across processors. Local computation is preferred as local access of data is much faster than remote access due to the costs of interprocessor communication. However, for low rank approximation algorithms that depend on global information (like MU, HALS, and BPP), some communication is necessary.

The simplest way to organize these distributed computations on large datasets is through a MapReduce framework like Hadoop, but this simplicity comes at the expense of performance. In particular, most MapReduce frameworks require data to be read from and written to disk at every iteration, and they involve communication-intensive global, input-data shuffles across machines.

In this work, we present a much more efficient algorithm and implementation using tools from the field of High-Performance Computing (HPC). We maintain data in memory (distributed across processors), take advantage of optimized libraries for local computational routines, and use the Message Passing Interface (MPI) standard to organize interprocessor communication. The current trend for high-performance computers is that available parallelism (and therefore aggregate computational rate) is increasing much more quickly than improvements in network bandwidth and latency. This trend implies that the relative cost of communication (compared to computation) is increasing.

To address this challenge, we analyze algorithms in terms of both their computation and communication costs. In particular, our proposed algorithm ensures that after the input data is initially read into memory, it is never communicated; we communicate only the factor matrices and other smaller temporary matrices. Furthermore, we prove that in the case of dense input and under the assumption that , our proposed algorithm minimizes bandwidth cost (the amount of data communicated between processors) to within a constant factor of the lower bound. We also reduce latency costs (the number of times processors communicate with each other) by utilizing MPI collective communication operations, along with temporary local memory space, performing messages per iteration, the minimum achievable for aggregating global data.

Fairbanks et al. Fairbanks et al. [2015] discuss a parallel NMF algorithm designed for multicore machines. To demonstrate the importance of minimizing communication, we consider this approach to parallelizing an alternating NMF algorithm in distributed memory. While this naive algorithm exploits the natural parallelism available within the alternating iterations (the fact that rows of and columns of can be computed independently), it performs more communication than necessary to set up the independent problems. We compare the performance of this algorithm with our proposed approach to demonstrate the importance of designing algorithms to minimize communication; that is, simply parallelizing the computation is not sufficient for satisfactory performance and parallel scalability.

The main contribution of this work is a new, high-performance parallel algorithm for non-negative matrix factorization. The algorithm is flexible, as it is designed for both sparse and dense input matrices and can leverage many different algorithms for solving local non-negative least squares problems. The algorithm is also efficient, maintaining data in memory, using MPI collectives for interprocessor communication, and using efficient libraries for local computation. Furthermore, we provide a theoretical communication cost analysis to show that our algorithm reduces communication relative to the naive approach, and in the case of dense input, that it provably minimizes communication. We show with performance experiments that the algorithm outperforms the naive approach by significant factors, and that it scales well for up to 100s of processors on both synthetic and real-world data.

2 Preliminaries

2.1 Notation

Table 1 summarizes the notation we use throughout. We use upper case letters for matrices and lower case letters for vectors. For example, is a matrix and is a column vector and is a row vector. The subscripts are used for sub-blocks of matrices. We use and to denote the numbers of rows and columns of , respectively, and we assume without loss of generality throughout.

Input Matrix
Left Low Rank Factor
Right Low Rank Factor
Number of rows of input matrix
Number of columns of input matrix
Low Rank
th row block of matrix
th column block of matrix
th subblock of
Number of parallel processes
Number of rows in processor grid
Number of columns in processor grid
Table 1: Notation

2.2 Communication model

To analyze our algorithms, we use the -- model of distributed-memory parallel computation. In this model, interprocessor communication occurs in the form of messages sent between two processors across a bidirectional link (we assume a fully connected network). We model the cost of a message of size words as , where is the per-message latency cost and is the per-word bandwidth cost. Each processor can compute floating point operations (flops) on data that resides in its local memory; is the per-flop computation cost. With this communication model, we can predict the performance of an algorithm in terms of the number of flops it performs as well as the number of words and messages it communicates. For simplicity, we will ignore the possibilities of overlapping computation with communication in our analysis. For more details on the -- model, see Thakur et al. [2005]; Chan et al. [2007].

2.3 MPI collectives

Point-to-point messages can be organized into collective communication operations that involve more than two processors. MPI provides an interface to the most commonly used collectives like broadcast, reduce, and gather, as the algorithms for these collectives can be optimized for particular network topologies and processor characteristics. The algorithms we consider use the all-gather, reduce-scatter, and all-reduce collectives, so we review them here, along with their costs. Our analysis assumes optimal collective algorithms are used (see Thakur et al. [2005]; Chan et al. [2007]), though our implementation relies on the underlying MPI implementation.

At the start of an all-gather collective, each of processors owns data of size . After the all-gather, each processor owns a copy of the entire data of size . The cost of an all-gather is . At the start of a reduce-scatter collective, each processor owns data of size . After the reduce-scatter, each processor owns a subset of the sum over all data, which is of size . (Note that the reduction can be computed with other associative operators besides addition.) The cost of an reduce-scatter is . At the start of an all-reduce collective, each processor owns data of size . After the all-reduce, each processor owns a copy of the sum over all data, which is also of size . The cost of an all-reduce is . Note that the costs of each of the collectives are zero when .

3 Related work

In the data mining and machine learning literature there is an overlap between low rank approximations and matrix factorizations due to the nature of applications. Despite its name, Non-negative “Matrix Factorization” is really a low rank approximation. In this section, we discuss the various distributed NMF algorithms.

The recent distributed NMF algorithms in the literature are Liao et al. [2014], Liu et al. [2010], Gemulla et al. [2011], Yin et al. [2014] and Faloutsos et al. [2014]. All of these works propose distributed NMF algorithms implemented using Hadoop. Liu, Yang, Fan, He and Wang Liu et al. [2010] propose running Multiplicative Update (MU) for KL divergence, squared loss, and “exponential” loss functions. Matrix multiplication, element-wise multiplication, and element-wise division are the building blocks of the MU algorithm. The authors discuss performing these matrix operations effectively in Hadoop for sparse matrices. In similar directions, Liao, Zhang, Guan and Zhou Liao et al. [2014] implement an open source Hadoop based MU algorithm and study its scalability on large-scale biological datasets. Also, Yin, Ghao and Zhang Yin et al. [2014] present a scalable NMF that can perform frequent updates, which aim to use the most recently updated data. Gemmula, Nijkamp, Erik, Haas and Sismanis Gemulla et al. [2011] propose a Generic algorithm that works on different loss functions, often involving the distributed computation of the gradient. According to the authors, this formulation can also be extended to handle non-negative constraints. Similarly Faloutsos, Beutel, Xing and Papalexakis, Faloutsos et al. [2014] propose a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. The authors also provide an implementation that can enforce non-negative constraints on the factor matrices.

4 Foundations

In this section, we will briefly introduce the Alternating Non-negative Least Squares (ANLS) framework, multiple methods for solving non-negative least squares problems (NLS) including Block Principal Pivoting (BPP), and a straightforward approach to parallelization of the framework.

4.1 Alternating Non-negative Least Squares

According to the ANLS framework, we first partition the variables of the NMF problem into two blocks and . Then we solve the following equations iteratively until a stopping criteria is satisfied. \SplitN \WW&←\Argmin~\WW≥0\NormBrÅ-~\WW\HH_F^2,

The optimizations sub-problem for and are NLS problems which can be solved by a number of methods from generic constrained convex optimization to active-set methods. Typical approaches use form the normal equations of the least squares problem (and then solve them enforcing the non-negativity constraint), which involves matrix multiplications of the factor matrices with the data matrix. Algorithm 1 shows this generic approach.

1: is an matrix, is rank of approximation
2:Initialize with a non-negative matrix in .
3:while not converged do
4:     Update using and
5:     Update using and
6:end while
Algorithm 1

The specifics of lines 4 and 5 depend on the NLS method used. For example, the update equations for MU Seung and Lee [2001] are


w_ij &←w_ij (Å\HHT)ij(\WW\HH\HHT)ij
h_ij &←h_ij (\WWTÅ)ij(\WWT\WW\HH)ij. Note that after computing and , the cost of updating is dominated by computing , which is flops. Given and , each entry of can be updated independently and cheaply. Likewise, the extra computation cost of updating is flops.

HALS updates and by applying block coordinate descent on the columns of and rows of Cichocki et al. [2009]. The update rules are \SplitN \ww^i &←[(Å\HHT)i- ∑l=1 l ik(\HH\HHT)li\wwl]+(\HH\HHT)ii
\hh_i &←[(\WWTÅ)i- ∑l=1 l ik(\WWT\WW)li\hhl]+(\WWT\WW)ii, where is the th column of and is the row of . Note that the columns of and rows of are updated in order, so that the most up-to-date values are used in the update. The extra computation is again flops for updating both and .

4.2 Block Principal Pivoting

In this paper, we focus on and use the block principal pivoting Kim and Park [2011] method to solve the non-negative least squares problem, as it has the fastest convergence rate (in terms of number of iterations). We note that any of the NLS methods can be used within our parallel frameworks (Algorithms 2 and 3).

BPP is the state-of-the-art method for solving the NLS subproblems in Eq. (4.1). The main sub-routine of BPP is the single right-hand side NLS problem \SplitN min_\xx≥0 ∥\CC\xx-\bb∥_2^2.

The Karush-Kuhn-Tucker (KKT) optimality condition for Eq. (4.2) is as follows \SplitS \yy&= \CC^T \CC\xx- \CC^T \bb
\xx^T \yy& = 0. The KKT condition (4.2) states that at optimality, the support sets (i.e., the non-zero elements) of and are complementary to each other. Therefore, Eq. (4.2) is an instance of the Linear Complementarity Problem (LCP) which arises frequently in quadratic programming. When , active set and active-set like methods are very suitable because most computations involve matrices of sizes , and which are small and easy to handle.

If we knew which indices correspond to nonzero values in the optimal solution, then computing it is an unconstrained least squares problem on these indices. In the optimal solution, call the set of indices such that the active set, and let the remaining indices be the passive set. The BPP algorithm works to find this active set and passive set. Since the above problem is convex, the correct partition of the optimal solution will satisfy the KKT condition (Eq. (4.2)). The BPP algorithm greedily swaps indices between the active and passive set until finding a partition that satisfies the KKT condition. In the partition of the optimal solution, the values of the indices that belong to the active set will take zero. The values of the indices that belong to the passive set are determined by solving the least squares problem using normal equations restricted to the passive set. Kim, He and Park Kim and Park [2011], discuss the BPP algorithm in further detail. Because the algorithm is iterative, we define as the cost of solving instances of Eq. (4.2) using BPP, given the matrix and columns of the form .

4.3 Naive Parallel NMF Algorithm

In this section we present a naive parallelization of any NMF algorithm Fairbanks et al. [2015] that follows the ANLS framework (Algorithm 1). Each NLS problem with multiple right-hand sides can be parallelized on the observation that the problems for multiple right-hand sides are independent from each other. That is, we can solve several instances of Eq. (4.2) independently for different where is fixed, which implies that we can optimize row blocks of and column blocks of in parallel.

1: is an matrix distributed both row-wise and column-wise across processors, is rank of approximation
2:Local matrices: is , is , is , is
3: initializes
4:while not converged do
5:/* Compute given */
6:     collect on each processor using all-gather
7:      computes
8:/* Compute given */
9:     collect on each processor using all-gather
10:      computes
11:end while
13: is an matrix distributed row-wise across processors, is a matrix distributed column-wise across processors
Algorithm 2

Algorithm 2 presents a straightforward approach to setting up the independent subproblems. Let us divide into row blocks and into column blocks . We then double-partition the data matrix accordingly into row blocks and column blocks so that processor owns both and (see Figure 1). With these partitions of the data and the variables, one can implement any ANLS algorithm in parallel, with only one communication step for each solve.

The computation cost of Algorithm 2 depends on the local NLS algorithm. For comparison with our proposed algorithm, we assume each processor uses BPP to solve the local NLS problems. Thus, the computation at line 7 consists of computing , , and solving NLS given the normal equations formulation of rank for columns. Likewise, the computation at line 10 consists of computing , , and solving NLS for columns. In the dense case, this amounts to flops. In the sparse case, processor performs flops to compute and instead of .

The communication cost of the all-gathers at lines 6 and 9, based on the expression given in Section 2.3 is . The local memory requirement includes storing each processor’s part of matrices , , and . In the case of dense , this is words, as is stored twice; in the sparse case, processor requires words for the input matrix and words for the output factor matrices. Local memory is also required for storing temporary matrices and of size words.

We summarize the algorithmic costs of Algorithm 2 in Table 2. This naive algorithm Fairbanks et al. [2015] has three main drawbacks: (1) it requires storing two copies of the data matrix (one in row distribution and one in column distribution) and both full factor matrices locally, (2) it does not parallelize the computation of and (each processor computes it redundantly), and (3) as we will see in Section 5, it communicates more data than necessary.

Figure 1: Distribution of matrices for Naive (Algorithm 2), for . Note that is , is , is , and is .

5 High Performance Parallel NMF

We present our proposed algorithm, HPC-NMF, as Algorithm 3. The algorithm assumes a 2D distribution of the data matrix across a grid of processors (with ), as shown in Figure 2. Algorithm 3 performs an alternating method in parallel with a per-iteration bandwidth cost of words, latency cost of messages, and load-balanced computation (up to the sparsity pattern of and convergence rates of local BPP computations). To minimize the communication cost and local memory requirements, and are chosen so that , in which case the bandwidth cost is .

If the matrix is very tall and skinny, i.e., , then we choose and . In this case, the distribution of the data matrix is 1D, and the bandwidth cost is words.

The matrix distributions for Algorithm 3 are given in Figure 2; we use a 2D distribution of and 1D distributions of and . Recall from Table 1 that and denote row and column blocks of , respectively. Thus, the notation denotes the th row block within the th row block of . Lines 614 compute for a fixed , and lines 1624 compute for a fixed ; note that the computations and communication patterns for the two alternating iterations are analogous.

In the rest of this section, we derive the per-iteration computation and communication costs, as well as the local memory requirements. We also argue the communication-optimality of the algorithm in the dense case. Table 2 summarizes the results of this section and compares them to Naive.

Algorithm Flops Words Messages Memory
Lower Bound
Table 2: Algorithmic costs for Naive and HPC-NMF assuming data matrix is dense. Note that the communication costs (words and messages) also apply for sparse .
1: is an matrix distributed across a grid of processors, is rank of approximation
2:Local matrices: is , is , is , is , and is
3: initializes
4:while not converged do
5:/* Compute given */
6:      computes
7:     compute using all-reduce across all procs
8: is and symmetric
9:      collects using all-gather across proc columns
10:      computes
11: is
12:     compute using reduce-scatter across proc row to achieve row-wise distribution of
13: owns submatrix
14:      computes
15:/* Compute given */
16:      computes
17:     compute using all-reduce across all procs
18: is and symmetric
19:      collects using all-gather across proc rows
20:      computes
21: is
22:     compute using reduce-scatter across proc columns to achieve column-wise distribution of
23: owns submatrix
24:      computes
25:end while
27: is an matrix distributed row-wise across processors, is a matrix distributed column-wise across processors
Algorithm 3

Figure 2: Distribution of matrices for HPC-NMF (Algorithm 3), for and . Note that is , is , is , is , and is .
Computation Cost

Local matrix computations occur at lines 6, 10, 16, and 20. In the case that is dense, each processor performs

flops. In the case that is sparse, processor performs flops in computing and , and flops in computing and . Local non-negative least squares problems occur at lines 14 and 24. In each case, the symmetric positive semi-definite matrix is and the number of columns/rows of length to be computed are and , respectively. These costs together require flops. There are computation costs associated with the all-reduce and reduce-scatter collectives, both those contribute only to lower order terms.

Communication Cost

Communication occurs during six collective operations (lines 7, 9, 12, 17, 19, and 22). We use the cost expressions presented in Section 2.3 for these collectives. The communication cost of the all-reduces (lines 7 and 17) is ; the cost of the two all-gathers (lines 9 and 19) is ; and the cost of the two reduce-scatters (lines 12 and 22) is .

In the case that , we choose and , and these communication costs simplify to . In the case that , we choose , and the costs simplify to .

Memory Requirements

The local memory requirement includes storing each processor’s part of matrices , , and . In the case of dense , this is words; in the sparse case, processor requires words for the input matrix and words for the output factor matrices. Local memory is also required for storing temporary matrices , , , and , of size words.

In the dense case, assuming and , the local memory requirement is no more than a constant times the size of the original data. For the optimal choices of and , this assumption simplifies to .

We note that if the temporary memory requirements become prohibitive, the computation of and via all-gathers and reduce-scatters can be blocked, decreasing the local memory requirements at the expense of greater latency costs. While this case is plausible for sparse , we did not encounter local memory issues in our experiments.

Communication Optimality

In the case that is dense, Algorithm 3 provably minimizes communication costs. Theorem 5 establishes the bandwidth cost lower bound for any algorithm that computes or each iteration. A latency lower bound of exists in our communication model for any algorithm that aggregates global information Chan et al. [2007]. For NMF, this global aggregation is necessary each iteration to compute residual error in the case that is distributed across all processors, for example. Based on the costs derived above, HPC-NMF is communication optimal under the assumption , matching these lower bounds to within constant factors.


[Demmel et al. [2013]] Let , , and be dense matrices, with . If , then any distributed-memory parallel algorithm on processors that load balances the matrix distributions and computes and/or must communicate at least words along its critical path. {proof} The proof follows directly from [Demmel et al., 2013, Section II.B]. Each matrix multiplication and has dimensions , so the assumption ensures that neither multiplication has “3 large dimensions.” Thus, the communication lower bound is either in the case of (or “2 large dimensions”), or , in the case of (or “1 large dimension”). If , then , so the lower bound can be written as .

We note that the communication costs of Algorithm 3 are the same for dense and sparse data matrices (the data matrix itself is never communicated). In the case that is sparse, this communication lower bound does not necessarily apply, as the required data movement depends on the sparsity pattern of . Thus, we cannot make claims of optimality in the sparse case (for general ). The communication lower bounds for and/or (where is sparse) can be expressed in terms of hypergraphs that encode the sparsity structure of Ballard et al. [2015]. Indeed, hypergraph partitioners have been used to reduce communication and achieve load balance for a similar problem: computing a low-rank representation of a sparse tensor (without non-negativity constraints on the factors) Kaya and Uçar [2015].

6 Experiments

In the data mining and machine learning community, there had been a large interest in using Hadoop for large scale implementation. Hadoop does lots of disk I/O and was designed for processing gigantic text files. Many of the real world data sets that is available for research are large scale sparse internet text data such as bag of words, recommender systems, social networks etc. Towards this end, there had been interest towards Hadoop implementation of matrix factorization algorithm Gemulla et al. [2011]; Liu et al. [2010]; Liao et al. [2014]. However, the use of NMF is beyond the sparse internet data and also applicable for dense real world data such as video, image etc. Hence in order to keep our implementation applicable to wider audience, we chose MPI for distributed implementation. Apart from the application point of view, we decided MPI C++ implementation for other technical advantages that is necessary for scientific application such as (1) it can leverage the recent hardware improvements (2) effective communication between nodes (3) availability of numerically stable BLAS and LAPACK routines etc. We identified a few synthetic and real world datasets to experiment with our MPI implementation and a few baselines to compare our performance.

6.1 Experimental Setup

6.1.1 Datasets

We used sparse and dense matrices that are synthetically generated and from real world. We will explain the datasets in this section.

  • Dense Synthetic Matrix (DSYN): We generate a uniform random matrix of size 172,800 115,200 and add random Gaussian noise. The dimensions of this matrix is chosen such that it is uniformly distributable across processes. Every process will have its own prime seed that is different from other processes to generate the input random matrix .

  • Sparse Synthetic Matrix (SSYN): We generate a random sparse Erdős-Rényi matrix of the same dimensions 172,800 115,200 as the dense matrix, with density of 0.001. That is, every entry is nonzero with probability 0.001.

  • Dense Real World Matrix (Video): Generally, NMF is performed in the video data for back ground subtraction to detect the moving objects. The low rank matrix represents background and the error matrix has the moving objects. Detecting moving objects has many real-world applications such as traffic estimation, security monitoring, etc. In the case of detecting moving objects, only the last minute or two of video is taken from the live video camera. The algorithm to incrementally adjust the NMF based on the new streaming video is presented in Kim et al. [2014]. To simulate this scenario, we collected a video in a busy intersection of the Georgia Tech campus at 20 frames per second for two minutes. We then reshaped the matrix such that every RGB frame is a column of our matrix, so that the matrix is dense with dimensions 1,013,400 2400.

  • Sparse Real World Matrix Webbase : We identified this dataset of a very large directed sparse graph with nearly 1 million nodes (1,000,005) and 3.1 million edges (3,105,536). The dataset was first reported by Williams et al. Williams et al. [2009]. The NMF output of this directed graph will help us understand clusters in graphs. The size of both the real world datasets were adjusted to the nearest dimension for uniformly distributing the matrix.

6.1.2 Machine

We conducted our experiments on “Edison” at the National Energy Research Scientific Computing Center. Edison is a Cray XC30 supercomputer with a total of 5,576 compute nodes, where each node has dual-socket 12-core Intel Ivy Bridge processors. Each of the 24 cores has a clock rate of 2.4 GHz (translating to a peak floating point rate of 460.8 Gflops/node) and private 64KB L1 and 256KB L2 caches; each of the two sockets has a (shared) 30MB L3 cache; each node has 64 GB of memory. Edison uses a Cray “Aries” interconnect that has a dragonfly topology. Because our experiments use a relatively small number of nodes, we consider the local connectivity: a “blade” comprises 4 nodes and a router, and sets of 16 blades’ routers are fully connected via a circuit board backplane (within a “chassis”). Our experiments do not exceed 64 nodes, so we can assume a very efficient, fully connected network.

6.1.3 Initialization

To ensure fair comparison among algorithms, the same random seed was used across different methods appropriately. That is, the initial random matrix was generated with the same random seed when testing with different algorithms (note that need not be initialized). This ensures that all the algorithms perform the same computations (up to roundoff errors), though the only computation with a running time that is sensitive to matrix values is the local NNLS solve via BPP.

6.2 Algorithms

For each of our data sets, we benchmark and compare three algorithms: (1) Algorithm 2, (2) Algorithm 3 with and (1D processor grid), and (3) Algorithm 3 with (2D processor grid). We choose these three algorithms to confirm the following conclusions from the analysis of Section 5: the performance of a naive parallelization of Naive (Algorithm 2) will be severely limited by communication overheads, and the correct choice of processor grid within Algorithm 3 is necessary to optimize performance. To demonstrate the latter conclusion, we choose the two extreme choices of processor grids and test some data sets where a 1D processor grid is optimal (e.g., the Video matrix) and some where a squarish 2D grid is optimal (e.g., the Webbase matrix).

While we would like to compare against other high-performance NMF algorithms in the literature, the only other distributed-memory implementations of which we’re aware are implemented using Hadoop and are designed only for sparse matrices Liao et al. [2014], Liu et al. [2010], Gemulla et al. [2011], Yin et al. [2014] and Faloutsos et al. [2014]. We stress that Hadoop is not designed for high performance, requiring disk I/O between steps, so a run time comparison between a Hadoop implementation and a C++/MPI implementation is not a fair comparison of parallel algorithms. To give a qualitative example of differences in run time, the running time of a Hadoop implementation of the MU algorithm on a large sparse matrix of dimension with nonzeros (with k=8) takes on the order of 50 minutes per iteration Liu et al. [2010]; our implementation takes a second per iteration for the synthetic data set (which is an order of magnitude larger in terms of rows, columns, and nonzeros) running on only 24 nodes.

6.3 Time Breakdown

To differentiate the computation and communication costs among the algorithms, we present the time breakdown among the various tasks within the algorithms for both performance experiments. For Algorithm 3, there are three local computation tasks and three communication tasks to compute each of the factor matrices:

  • MM, computing a matrix multiplication with the local data matrix and one of the factor matrices;

  • NLS, solving the set of NLS problems using BPP;

  • Gram, computing the local contribution to the Gram matrix;

  • All-Gather, to compute the global matrix multiplication;

  • Reduce-Scatter, to compute the global matrix multiplication;

  • All-Reduce, to compute the global Gram matrix.

In our results, we do not distinguish the costs of these tasks for and separately; we report their sum, though we note that we do not always expect balance between the two contributions for each task. Algorithm 2 performs all of these tasks except the Reduce-Scatter and the All-Reduce; all of its communication is in the All-Gathers.

Figure 3: Experiments on Sparse and Dense Datasets

6.4 Algorithmic Comparison

Our first set of experiments is designed primarily to compare the three parallel implementations. For each data set, we fix the number of processors to be 600 and vary the rank of the desired factorization. Because most of the computation (except for NLS) and bandwidth costs are linear in (except for the All-Reduce), we expect linear performance curves for each algorithm individually.

The left side of Figure 3 shows the results of this experiment for all four data sets. The first conclusion we draw is that HPC-NMF with a 2D processor grid performs significantly better than the Naive; the largest speedup is , for the sparse synthetic data and (a particularly communication bound problem). Also, as predicted, the 2D processor grid outperforms the 1D processor grid on the squarish matrices. While we expect the 1D processor grid to outperform the 2D grid for the tall-and-skinny Video matrix, their performance is comparable; this is because both algorithms are computation bound, as we see from Figure (g)g, so the difference in communication is negligible.

The second conclusion we can draw is that the scaling with tends to be close to linear, with an exception in the case of the Webbase matrix. We see from Figure (e)e that this problem spends much of its time in NLS, which does not scale linearly with .

We can also compare HPC-NMF with a 1D processor grid with Naive for squarish matrices (SSYN, DSYN, and Webbase). Our analysis does not predict a significant difference in communication costs of these two approaches (when ), and we see from the data that Naive outperforms HPC-NMF for two of the three matrices (but the opposite is true for DSYN). The main differences appear in the All-Gather versus Reduce-Scatter communication costs and the local MM (all of which are involved in the computation). In all three cases, our proposed 2D processor grid (with optimal choice of ) performs better than both alternatives.

Cores DSYN SSYN Video Webbase DSYN SSYN Video Webbase DSYN SSYN Video Webbase
24 6.5632 48.0256 5.0821 52.8549 4.8427 84.6286
96 1.5929 18.5507 1.4836 14.5873 1.1147 16.6966
216 2.1819 0.6027 2.7899 7.1274 2.1548 0.9488 4.7928 9.2730 1.5283 0.4816 1.6106 7.4799
384 1.2594 0.6466 2.2106 5.1431 1.2559 0.7695 3.8295 6.4740 0.8620 0.2661 0.8963 4.0630
600 1.1745 0.5592 1.7583 4.6825 0.9685 0.6666 0.5994 6.2751 0.5519 0.1683 0.5699 2.7376
Table 3: Per-iteration running times of parallel NMF algorithms for .

6.5 Strong Scalability

The goal of our second set of experiments is to demonstrate the (strong) scalability of each of the algorithms. For each data set, we fix the rank to be 50 and vary the number of processors (this is a strong-scaling experiment because the size of the data set is fixed). We run our experiments on processors/cores, which translates to nodes. The dense matrices are too large for 1 or 4 nodes, so we benchmark only on cores in those cases.

The right side of Figure 3 shows the scaling results for all four data sets, and Table 3 gives the overall per-iteration time for each algorithm, number of processors, and data set. We first consider the HPC-NMF algorithm with a 2D processor grid: comparing the performance results on 25 nodes (300 cores) to the 1 node (24 cores), we see nearly perfect parallel speedups. The parallel speedups are for SSYN and for the Webbase matrix. We believe the superlinear speedup of the Webbase matrix is a result of the running time being dominated by NLS; with more processors, the local NLS problem is smaller and more likely to fit in smaller levels of cache, providing better performance. For the dense matrices, the speedup of HPC-NMF on 25 nodes over 9 nodes is for DSYN and for Video, which are also nearly linear.

In the case of the Naive algorithm, we do see parallel speedups, but they are not linear. For the sparse data, we see parallel speedups of and with a increase in number of processors. For the dense data, we observe speedups of and with a increase in the number of processors. The main reason for not achieving perfect scaling is the unnecessary communication overheads.

7 Conclusion

In this paper, we propose a high-performance distributed-memory parallel algorithm that computes an NMF factorization by iteratively solving alternating non-negative least squares (NLS) subproblems. We show that by carefully designing a parallel algorithm, we can avoid communication overheads and scale well to modest numbers of cores.

For the datasets on which we experimented, we showed that an efficient implementation of a naive parallel algorithm spends much of its time in interprocessor communication. In the case of HPC-NMF, the problems remain computation bound on up to 600 processors, typically spending most of the time in local NLS solves.

We focus in this work on BPP, which is more expensive per-iteration than alternative methods like MU and HALS, because it has been shown to reduce overall running time in the sequential case by requiring fewer iterations Kim and Park [2011]. Because most of the time per iteration of HPC-NMF is spent on local NLS, we believe further empirical exploration is necessary to confirm the advantages of BPP in the parallel case. We note that if we use MU or HALS for local NLS, the relative cost of interprocessor communication will grow, making the communication efficiency of our algorithm more important.

In future work, we would like to extend this algorithm to dense and sparse tensors, computing the CANDECOMP/PARAFAC decomposition in parallel with non-negativity constraints on the factor matrices. We would also like to explore more intelligent distributions of sparse matrices: while our 2D distribution is based on evenly dividing rows and columns, it does not necessarily load balance the nonzeros of the matrix, which can lead to load imbalance in MM. We are interested in using graph and hypergraph partitioning techniques to load balance the memory and computation while at the same time reducing communication costs as much as possible. Finally, we have not yet reached the limits of the scalability of HPC-NMF; we would like to expand our benchmarks to larger numbers of nodes on the same size datasets to study performance behavior when communication costs completely dominate the running time.


This research was supported in part by an appointment to the Sandia National Laboratories Truman Fellowship in National Security Science and Engineering, sponsored by Sandia Corporation (a wholly owned subsidiary of Lockheed Martin Corporation) as Operator of Sandia National Laboratories under its U.S. Department of Energy Contract No. DE-AC04-94AL85000.

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.


  • Ballard et al. [2015] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pages 86–88, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3588-1. doi: 10.1145/2755573.2755613. URL http://doi.acm.org/10.1145/2755573.2755613.
  • Chan et al. [2007] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007. ISSN 1532-0634. doi: 10.1002/cpe.1206. URL http://dx.doi.org/10.1002/cpe.1206.
  • Cichocki et al. [2009] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.
  • Demmel et al. [2013] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS ’13, pages 261–272, 2013. doi: 10.1109/IPDPS.2013.80.
  • Fairbanks et al. [2015] J. P. Fairbanks, R. Kannan, H. Park, and D. A. Bader. Behavioral clusters in dynamic graphs. Parallel Computing, 47:38–50, 2015.
  • Faloutsos et al. [2014] C. Faloutsos, A. Beutel, E. P. Xing, E. E. Papalexakis, A. Kumar, and P. P. Talukdar. Flexi-fact: Scalable flexible factorization of coupled tensors on hadoop. In Proceedings of the SDM, pages 109–117, 2014. doi: 10.1137/1.9781611973440.13. URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973440.13.
  • Gemulla et al. [2011] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the KDD, pages 69–77. ACM, 2011.
  • Hoyer [2004] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.
  • Kaya and Uçar [2015] O. Kaya and B. Uçar. Scalable sparse tensor decompositions in distributed memory systems. Technical Report RR-8722, INRIA, May 2015.
  • Kim and Park [2007] H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
  • Kim and Park [2011] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
  • Kim et al. [2014] J. Kim, Y. He, and H. Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
  • Kuang et al. [2012] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of SDM, pages 106–117, 2012.
  • Liao et al. [2014] R. Liao, Y. Zhang, J. Guan, and S. Zhou. Cloudnmf: A mapreduce implementation of nonnegative matrix factorization for large-scale biological datasets. Genomics, proteomics & bioinformatics, 12(1):48–51, 2014.
  • Liu et al. [2010] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. In Proceedings of the WWW, pages 681–690. ACM, 2010.
  • Pauca et al. [2004] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons. Text mining using nonnegative matrix factorizations. In Proceedings of SDM, 2004.
  • Seung and Lee [2001] D. Seung and L. Lee. Algorithms for non-negative matrix factorization. NIPS, 13:556–562, 2001.
  • Thakur et al. [2005] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19(1):49–66, 2005. doi: 10.1177/1094342005051521. URL http://hpc.sagepub.com/content/19/1/49.abstract.
  • Wang and Zhang [2013] Y.-X. Wang and Y.-J. Zhang. Nonnegative matrix factorization: A comprehensive review. TKDE, 25(6):1336–1353, June 2013. ISSN 1041-4347. doi: 10.1109/TKDE.2012.51.
  • Williams et al. [2009] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178 – 194, 2009.
  • Yin et al. [2014] J. Yin, L. Gao, and Z. Zhang. Scalable nonnegative matrix factorization with block-wise updates. In Machine Learning and Knowledge Discovery in Databases, volume 8726 of LNCS, pages 337–352, 2014. ISBN 978-3-662-44844-1.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description