Matrix Factorization on GPUs with Memory Optimization and Approximate Computing
Abstract
Matrix factorization (MF) discovers latent features from observations, which has shown great promises in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problemspecific optimization techniques have been proposed, alternating least square (ALS) remains popular due to its general applicability (e.g. easy to handle positiveunlabeled inputs), fast convergence and parallelization capability. Current MF implementations are either optimized for a single machine or with a need of a large computer cluster but still are insufficient. This is because a single machine provides limited compute power for largescale data while multiple machines suffer from the network communication bottleneck.
To address the aforementioned challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose the novel approach in enhancing the MF efficiency via both memory optimization and approximate computing. The former exploits GPU memory hierarchy to increase data reuse, while the later reduces unnecessary computing without hurting the convergence of learning algorithms. Extensive experiments on largescale datasets show that our solution not only outperforms the competing CPU solutions by a large margin but also has a 2x4x performance gain compared to the stateoftheart GPU solutions. Our implementations are opensourced and publicly available.
I Introduction
Matrix factorization (MF) is one of the most important data mining techniques due to its implementation simplicity and broad applicability. For instance, MF is the core of modern recommender systems [15, 33, 13]. MF has also been widely used in compressing large models (e.g. deep neural networks) for mobile usage [1], calculating word embedding [15, 28], etc. However, the big data processing, with massive data is generated at an unprecedented rate, demands further acceleration of MF. For example, the number of active users of Facebook exceed 1.860 billion in the fourth quarter of 2016^{1}^{1}1https://ibm.biz/BdstmU. Solving MF efficiently under such a large scale challenges many existing solutions.
Although many studies [30, 39, 3, 16, 37, 9, 36, 22, 27] have been conducted to accelerate MF, they are still insufficient to process large scale data set. These methods either use multiple threads on one machine or multiple processes on distributed systems. The former one uses shared memory which is efficient but hard to handle big data in realworld settings. On the other hand, the communication cost becomes the major bottleneck in distributed systems, which significantly reduces its efficiency in terms of aggregated floating point operations per second (FLOPS). Nevertheless, with recent successes of deep learning [4] using graphics processing units (GPUs), there comes a new venue for expediting other data mining algorithms [2, 31]. GPU has superior compute power and memory bandwidth compared to CPU [10]. Moreover, GPUs on one server can leverage interconnections such as NVLink [25] (40 GB/s per link with four links per GPU) which is much faster than any existing network. Therefore, we consider to solve the problem of MF using the alternating least square (ALS) method on GPUs.
In this paper, we propose a novel approach in solving MF on GPUs, termed cuMFals , with major contributions in twofold:

To fully utilize the GPU memory hierarchy, we identify hotspot variables to retain data as close to compute as possible. Afterward, the memory loading process is accelerated through an innovative and nonconventional scheme by exploiting GPU architectural features such as occupancy and cache.

We develop an iterative conjugate gradient (CG) solver on GPUs. This approximate solver reduced the compute complexity from to ( is the dimension of latent features) without hurting convergence. This optimization brings a speedup of 4x compared to calculating in exact. Moreover, CG works naturally with Nvidia’s newly developed half precision feature, which further doubles the speed.
A graphical illustration of our proposed approach is shown in Figure 1. By jointly optimizing the memory loading scheme and the approximate compute strategy, we are able to not only outperform all distributed CPU solutions by a large margin, but also make 2x4x improvements over the stateoftheart GPU implementation. Such performance gains have been validated through extensive experiments on various GPU architectures. Our implementation is opensourced on GitHub^{2}^{2}2https://github.com/cuMF/cumf_als, available as a library to accelerate many applications.
Ii Preliminaries
Compute (C)  Memory (M)  C/M  
ALS  get_hermitian  
solve  
SGD  1 
Matrix factorization (MF) factorizes a matrix (with nonzero elements) into two lowrank matrices and , such that . For any and , such tat and , is the entry of . Thus, , where are the column of and the column of , respectively. Then the optimization problem of MF is given as:
(1) 
where and are the number of nonzero elements of and , respectively; is the regularization parameter. Two important approaches ALS and SGD both minimize equation (1), yet using different approaches that we will discuss the next.
ALS: ALS is an iterative method that first optimizes while fixing , and then solves while fixing . In every iteration, all observations () are used to update the current variable. Moreover, both subproblems are convex and the update procedures for them are given below.
Update : The optimal solution of the column of is obtained by solving the following linear system:
(2) 
Update : Similarly, the optimal solution of column of is obtained by solving:
(3) 
Here, and are the row and column of , respectively. It is worth mentioning that the updates of each and are independent. In other words, every row of matrix can be updated in parallel while keeping fixed. The same procedure is applicable to update as well. To ease repeating, throughout the rest of this paper, we will only focus on solving .
The solution of equation (2) has a closed form as
(4) 
which involves calculating a matrix inverse. Nevertheless, matrix inverse is compute intensive and unnecessary in solving the linear system in (2). Instead, many literatures [18, 29] solve the problem in a twostep fashion:
(i) compute intermediate results of and , which are called get_hermitian and get_bias, respectively;
(ii) solve the linear system, which will be referred as solve.
Our method also follows the twostep solving scheme. However, for each step, we propose a novel technique to better utilize both compute and memory resources of GPUs. Comparing get_hermitian and get_bias, we note that the compute complexity is dominated by the former one. Thus, we firstly focus on the optimization of get_hermitian, not get_bias in ALS in this paper.
SGD: SGD is also an iterative algorithm. However, differ to ALS, under each iteration, SGD only work with a small subset of observations denoted as (a.k.a. minibatch), where is the number of iterations. Usually, samples in are randomly selected from all observations. Then, the updating equations for both and for the iteration are given as:
(5)  
where is known as the learning rate. The vanilla SGD algorithm requires passing over randomly sampled data multiple times (till converge). When multiple updates run in parallel, for example, two samples and are updating at the same time, their updates to may overwrite each other. To address this issue, previous studies either partition into blocks with no overlapping rows and columns [39, 37, 9, 32], or let multiple workers independently update ignoring conflicts [22].
Complexity: Following the roofline model [34], we calculate the computation and memory complexity for both ALS and SGD and summarize them in Table I. Comparatively, ALS has a higher computetomemory ratio than SGD, which means ALS is compute intensive while SGD is memory intensive. Although the compute complexity of ALS is heavier than SGD per iteration, the number of iterations to converge is significant fewer [15]. Moreover, parallelization of ALS is easier since no sophisticated locking scheme is needed [15, 35]. At last, ALS is more suitable for the case of MF with implicit inputs, which makes it more broadly applicable.
Based on this complexity analysis, we focus our study in accelerating ALS in this paper, while SGD would be used as comparing topic.
Approximate computing: This term applied to computation that returns approximated result as the tradeoff between accurracy and cost/performance. [7] explores some of the applications and hardware designs for approximate computing. Their work showed acceleration gain of 1.9X to 2.1X with only 2.5% quality loss. [19] provides a detailed survey on both software techniques and hardware design for a large variety of applications to leverage the power of approximate computing for cost or performance gain. We exploit approximate computing in two aspects. Firstly, our iterative process in linear equation solver would stop within some tolerable converaging values. Secondly, we use reduced precison hardware feature to maximize the utilization of memory bandwidth and memory capacity.
Iii Memory optimization for high flops
As mentioned in section II, an ALS update includes two steps, i.e. get_hermitian and solve. This section describes the memory optimization on get_hermitian and the next section introduces the approximate computing techniques on solve. As seen from Table I, get_hermitian has a compute complexity of . This is big in largescale problems where can be tens of billions, which leads to our first observation.
Observation 1. get_hermitian is compute intensive.
For a compute intensive function to achieve high FLOPS, it needs to retain data as close as possible to compute units [10, 34]. In other words, it needs effective caching to reduce read from external memory, a.k.a. DRAM, as DRAM cannot sustain high FLOPS of GPUs. Specifically to cuMFals , we need to identify the frequentlyused variables, exploit the GPU memory hierarchy and place hotter variables in faster memory. This leads to the following solution.
Solution 1. Utiilize register and shared memory.
To decide what to cache and to where, we analyze the memory usage in calculating :

is read and written once when adding each . Therefore, is read and written by times, that is on the average.

Each needs to be read times when calculating .
We can now allocate variables into different places in GPU memory hierarchy based on their reuse. Usually , and as a result is more frequently accessed than . Therefore, deserves the fastest cache, i.e. register, and is put into shared memory, the second fastest cache. Figure 2 illustrates the memory optimization to get_hermitian. For a given , its required features, i.e. s such that , are staged from in global memory (the matrix at the top) into a shared memory space of size (the thinner matrix in the middle), in batches. For each staged feature , we calculate in tiles of size and add to the corresponding subblocks of in registers (the symmetric matrix at the bottom). Each subblock in aggregates the outer product of two tiles in . Consider the symmetricity of , we only need to calculate the bottom half of it. stored in registers is flushed to global memory when all required s are added in.
Because we choose to excessively use registers, they become the constrained resources. Consequently, the occupancy of get_hermitian, i.e number of s that can be calculated concurrently is low.
Observation 2. Aggressive use of registers leads to low occupancy, which makes read from global memory latencybound instead of bandwidthbound.
Current Nvidia GPUs have 65536 float registers in each streammultiprocessor (SM). When , each thread of get_hermitian needs 168 registers and each block needs 64 threads. As a result, an SM can hold threadblocks, i.e. an SM can update 6 rows concurrently. Compared with the SM capacity to hold 32 threadblocks, this is a low occupancy. This indicates that there are relatively few concurrent threads loading from global memory, which leads to the next solution.
Solution 2. Use noncoalesced and cacheassisted read, which is nonconventional but proven faster.
In GPU programming, memory coalescing is considered as a best practice to achieve good performance [14]. Memory coalescing means that adjacent threads should access adjacent global memory addresses. It can consolidate memory load requests and avoid wasting the bandwidth. When using coalescing, read from global memory can bypass L1 cache, because loaded data are all used.
Given the low occupancy of get_hermitian, coalesced read, despite its efficiency, cannot saturate the memory bandwidth. On the other hand, when the occupancy is low, the working data can almost fit into L1 cache. For example, when and , the s being actively loaded per SM is (threadblock) (bytes per float) KB. This number is between Nvidia Maxwell’s L1 cache of 48 KB and L2 cache of 128 KB (3 MB shared by 24 SMs). Inspired by this observation, we use a parallel but noncoalesced read scheme as illustrated in Figure 3 (b). Without losing generality, we load 32 features using 32 threads. With the coalesced scheme in Figure 3 (a), 32 threads together read one column before moving to the next one. Alternatively, in the noncoalesced scheme in Figure 3 (b), 32 threads read 32 columns concurrently, with each thread reading one column. Because of the small working data set size, L1 and L2 cache can efficiently serve as the coalescing cache. That is, the noncoalesced load requests issued from are going to hit L1 and L2, which makes it even more efficient than coalesced read.
To showcase the effectiveness of solution 2, we measure the performance of coalesced and noncoalesced read in get_hermitian. We use the Netflix dataset (see Section VA for more details on datasets) and measure the time of three phases in get_hermitian: load from global memory to shared memory (load), compute (compute), and write to global memory (write). Figure 4 shows the performance of both updateX and update procedures, in three different settings: coal means coalesced read, the setting illustrated in Figure 3 (a); nonCoalL1 means noncoalesced read, the setting in Figure 3 (b) and with L1 cache; nonCoalnoL1 means noncoalesced read with L1 cache bypassed. Result shows:

For shared memory load, noncoalesced with L1 performs best; noncoalesced without L1 is worse and coalesced read is worst.

The compute time is almost constant in all settings. This is because they all need fused multipleadd (FMA) operations.

UpdateX and update need to write and floats to global memory, respectively. Since in Netflix data, update takes longer in write.
Iv Approximate computing in solver
Section III describes how to efficiently obtain . After that, we need to solve equations as illustrated in equation (2).
Iva Approximate solver with CG
The direct solver, e.g., the batch LU solver in cuBLAS [23], gives an exact solution to with compute complexity of , or for m rows. This cubic complexity leads to long solve time, especially when is big. As shown in Table I, when becomes big, ’s rows become sparse, and gets closer to . To demonstrate this, we measure the solver time of 10 ALS iterations on Netflix data. Column LU_FP32 in Figure 5 shows that, the time taken by the LU solver is almost twice as much as that by get_hermitian. This clearly indicates that after applying optimization on get_hermitian, solve executing time now becomes dominant. This leads to the following observation.
Observation 3. Solve is compute intensive and dominant.
This observation inspires us to seek an alternative to the direct solver. We notice that, as an iterative approach, ALS updates and based on estimations from the previous iteration. As errors exist in estimations, the solution of each step is inherently inaccurate. Therefore, solution accuracy may be sacrificed in exchange for compute speed, leading to our attempt for an approximate solver.
Solution 3. An approximate conjugate gradient solver.
The iterative CG solver is introduced in [11]. With iterations each of complexity , it yields the exact solution with complexity . Based on this, we seek to further reduce computation while maintaining convergence quality. The pseudo code of our approximate CG solver is summarized in Algorithm 1, where is given to control the number of iterations (see Line 3), and for tolerance control. Empirically this approximation does not impact ALS’s convergence, while effectively reducing the solver’s complexity from to when .
IvB Use reduced precision
Replacing LU solver with an approximate CG solver, the solver’s computetomemory ratio (recall Table I) now drops from to , converting the original compute intensive problem into a memory intensive one.
Observation 4. CG solver is memory intensive.
As seen in Algorithm 1, CG solver is dominated by dense matrixvector multiply (Line 4), which is in turn dominated by reading that is of memory complexity . This insight inspires us that further acceleration is possible by reducing the size of .
Solution 4. Use reduced precision in CG solver to double the effective memory bandwidth.
We choose the newly introduced 16bit floating point format (FP16, compared with the default 32bit floating point format FP32) in Nvidia GPUs to store . This optimization saves 50% memory bandwidth to load and consequently doubles the loading speed. To validate, we run 10 iterations of ALS using Netflix data on a Maxwell GPU. Figure 5 shows the total solver time. The time of CG solver with FP32 (CGFP32) is only of that of LU solver with FP 32 (LUFP32). When CG uses FP16, CGFP16 takes of the time compared with CGFP32. In total, CGFP16 can reduce the runtime to compared with LUFP32.
Does L1 cache benefit the CG solver?
Figure 5 also illustrates that, loading with L1 cache does not yield any performance benefit. This is coherent with the analysis in section III: L1 cache is only useful to coalesce the noncoalesced memory access when occupancy is low. With batch CG’s high occupancy and coalesced read, L1 cache is not useful at all. This also explains why L1 cache is disabled by default in Nvidia GPUs.
V Experiments
In this section, we show the advantages of the proposed cuMFals framework compared to a set of stateoftheart implementations for both CPU and GPU. Our experiments are designed to answer the following questions:

How fast cuMFals is compared to competing implementations?

How does cuMFals compare to SGD?

Can cuMFals extent to the setting of MF with implicit feedback (a.k.a oneclass or positiveunlabeled inputs)?
Va Datasets
We utilize three publicly available datasets as follows:

Netflix [38]: The Netflix dataset consists ratings on movies. Each rating is in the scale of one to five.

YahooMusic [6]: Similar to the Netflix, this dataset contains 250 million ratings in the range of 1 to 100 for music collected by the Yahoo! Music Radio service.

Hugewiki [39]: Hugewiki contains a snapshot of Wikipedia. The observation matrix describes the frequency of English terms appeared in different documents.
The detailed statistics are summarized in Table II. We choose 0.92, 22 and 0.52 as the convergence value for Netflix, YahooMusic and Hugewiki, respectively, because these values are used by many papers and considered wellaccepted.
Dataset  
Netflix  480,189  17,770  99M  100  0.05  0.92 
YahooMusic  1,000,990  624,961  252.8M  100  1.4  22 
Hugewiki  50,082,603  39,780  3.1B  100  0.05  0.52 
VB Experiment setting
For the purpose of comprehensive evaluations, experiments are conducted on three different generations of Nvidia GPUs: Kepler, Maxwell and Pascal. Table III illustrates the configurations of the three servers we use. CPUonly experiments are conducted on the most powerful Pascal server unless otherwise mentioned.
For quantitative comparison, we follow the standard experiment setting [39, 37] by reporting how fast the root mean square error (RMSE) on the testing test reduces. The stopping criteria for all algorithms is when the RSME on testing set reaches an “acceptable level”. Specifically, the acceptable RSME is 0.92, 22.0 and 0.52 for Netflix, YahooMusic and Hugewiki, respectively. Furthermore, we make use of the original training and testing files from the providers of Netflix and YahooMusic datasets while randomly extract 10% of the data as the testing set for Hugewiki. It is worth mentioning that, we focus on systemlevel efficiency in terms of running time instead of the recommendation accuracy. To achieve the goal, we use the same set of parameters ( and ) as reported from earlier works [31, 39, 37], which is also shown in Table II.
Kepler  

CPU  Two 8core Intel Xeon E52667, 256 GB RAM 
GPU  Two Kepler K40, each: 4 TFLOPS, 12 GB RAM, 288 GB/s 
Maxwell  
CPU  Two 12core Intel Xeon E52670, 512 GB RAM 
GPU  Four Titan X, each: 7 TFLOPS, 12 GB RAM, 340 GB/s 
Pascal  
CPU  Two 10core IBM Power8 with SMT 8, 512 GB RAM 
GPU  Four Tesla P100, each: 11 TFLOPS, 16 GB, 740 GB/s 
VC Convergence speed: is cuMFals fast?
There are many studies and systems on accelerating MF [30, 39, 3, 16, 37, 9, 36, 22, 27, 35, 20]. Among them, we compare with the representative works below because they are with stateofart performance (i.e., convergence speed) or scalability.

NOMAD [37]: NOMAD is a CPUbased solution using SGD. Different from LIBMF, it runs on multiple machines using message passing interface (MPI) to communicate.

BIDMach [2]: BIDMach is a single GPU library that contains a set of matrix functions on top of which machine learning algorithms can be built. It also implements ALS based on a general purpose sparse matrix function.

HPCALS [8]: It implements ALS on single GPU by exploiting registers and shared memory. However, it has no noncoalesced read, approximate solver or reduced precision.

GPUSGD [35]: Section II discussed the difference between ALS and SGD solvers for MF. We also compared with a CUDAbased SGD solution that solves MF problems with one or multiple GPUs, using matrix blocking and Hogwild!style algorithms [22] to parallelize the SGD updates. For individual SGD updates, it leverages GPU architectural features such as cache, warpshuffle instructions, and halfprecision floats. In Section VE we will discuss and compare these two methods in detail.
It is worth mentioning that, the performance of CPUbased algorithms not necessarily improves as the number of threads/machines increases due to two reasons: 1) synchronization on shared data structures and 2) communication overhead. Enlarging the number of compute resource may even hurt their performance [8]. Therefore, we use 40 threads for LIBMF, which achieves the best performance. For NOMAD, we use the best settings as reported in [37], which are 32 machines for Netflix and Yahoo, and 64 machines for Hugewiki. Moreover, BIDMach and HPCALS can only use one GPU, while GPUALS and cuMFals can adopt multiple GPU settings. We test all GPUbased algorithms using one GPU on both Netflix and YahooMusic. Furthermore, to show how well both GPUALS and our framework scale with the number of GPUs, we use four GPUs for both algorithms on the Hugewiki dataset.
Figure 6 shows the relation between test RMSE and training time while table IV summarizes the time when RMSE reaches an acceptable level. Clearly, cuMFals outperforms all CPU solutions with a large margin. Specifically, on both Netflix and YahooMusic datasets, cuMFals with single Pascal GPU (cuMFals@P) achieves 5.6x7x performance gain compared to LIBMF. As for Hugewiki, cuMFals with four Pascal GPUs only takes 68 seconds to converge, which is significantly faster compared to 459 seconds for NOMAD (6.7x) and 3021 seconds for LIBMF (44.4x). The reason BIDMach is not included in the table is that it does not converge to the acceptance level. Regardless the convergence, we can observe the ALS kernel of BIDMach runs at 40 GFLOPS, which is similar to the reported measurement in their original paper [2]. However, 40 GFLOPS is much lower than cuMFals (see section VD for our performance in terms of FLOPS). On the other hand, since HPCALS is not opensource, we only compare our performance of per iteration time on Netflix, which has been reported in their paper. Results show that cuMFals runs twice as fast as HPCALS on the same hardware (Kepler K40). Furthermore, compared with GPUALS, cuMFals has a significant performance advantage thanks to our memory optimization and approximate computing techniques. On Netflix with Maxwell GPU, cuMFals only needs 6.5 seconds to converge while GPUALS needs 28 seconds, i.e. a 4x speedup. As a summary, cuMFals also outperforms all stateofart GPU solutions.
VD Has cuMFals fully exploited GPU?
In this section, we validate whether the proposed cuMFals framework has fully exploited the potential of GPU hardware. The analytics are done by examining if the computeintensive kernel get_hermitian has achieved high FLOPS, and if the memoryintensive CG solver has achieved high memory bandwidth.
Has get_hermitian achieved high FLOPS?
To obtain get_hermitian: for , one needs to read the sparse matrix and perform matrix multiplications. The of size of each multiplication is , where . To our best knowledge, no existing GPU library including cuBLAS has implemented the get_hermitian function for us to compare. The closest baseline is the batchedmatrixmultiplication gemmBatched [24] in cuBLAS, which calculates matrix multiplications of the same dimension: . Although gemmBatched cannot parallelize matrix multiplications with different sizes, to compare, we set the dimension of each computation in our get_hermitian to be the same. Under this setting, two algorithms can be fairly compared and we measure the FLOPS achieved by both with Kepler, Maxwell and Pascal on Netflix.
Experimental results, as illustrated in Figure 7(a), shows that cuMFals achieves higher FLOPS in all three generations of GPUs. This is impressive because get_hermitian compared to gemmBatched in cuBLAS needs to perform extra work. Specifically, get_hermitian needs to read sparse to get references to , and batchmultiply matrices with variable sizes. However, cuBLAS only needs to read dense input and batchmultiply matrices with the same size, which has no time cost on finding references. Moreover, regarding FLOPS efficiency (i.e., the achievedFLOPS divided by the device’s peakFLOPS) Figure 7(a) indicates that cuMFals achieves better performance in Nvidia newly developed architectures. This can be explained by the fact that performance of get_hermitian is generally limited by the number of registers. Comparing Kepler, Maxwell and Pascal, the number of registers per core increases as technology evolves. At the same time, it reveals that our design discipline matches the development trend of GPU.
Has the CG solver achieved high memory bandwidth?
We measure the memory transfer rate (GB/s) between GPU SMs and its DRAM, and compare it to the bandwidth achieved by CUDA function cudaMemcpy. Since cudaMemcpy only copies memory and deals with no computation, the comparison with it can indicate how well the CG solver can saturate the device memory bandwidth. As seen from Figure 7(b), cuMFals achieves a higher bandwidth than cudaMemcpy on all three types of GPUs. This demonstrates that our proposed CG solver utilizes the memory bandwidth efficiently.
VE ALS vs. SGD on GPUs
As discussed in early Section II, ALS and SGD have their own attributes in solving the problem of MF. SGD runs faster per iteration but requires more iterations. When the rating matrix gets denser, ALS has more advantage because SGD’s complexity grows [15] and also becomes harder to parallelize [35].
We implemented both ALS and SGD^{3}^{3}3https://github.com/cuMF/ and compare their performance on GPUs. We follow the same setting as before, and report the results in Figure 8. Come as no surprise, ALS runs slower in each iteration, but requires fewer iterations to coverage. On one GPU, ALS converges slightly faster than SGD on Netflix, but slightly slower than SGD on YahooMusic and Hugewiki. However, with four GPUs (als@4), ALS converges faster than SGD on Hugewiki data.
Moreover, as seen in Table I, SGD’s computation complexity () grows linearly with . This makes it inefficient when the rating matrix becomes more dense. This issue becomes severe when dealing with implicit inputs, where the rating matrix is considered fully dense, i.e., [12]. ALS can easily adapt to the setting of MF with implicit inputs, which we will discuss separately in section VF.
VF Implicit matrix factorization
MF with implicit inputs has been widely used in reallife applications [33, 12] where explicit ratings are replaced by implicit ones such as purchase or number of clicks. To show ALS is able to handle implicit inputs, we follow the same setting as shown in [12], by considering a binary matrix . if the implicit observations , and otherwise. The original paper also adds confidence measures to predictions, which leads to the following cost function:
where and is a given scaling constant. In other words, any is no longer treated as a missing rating, but as a zerorating with low confidence . Under this assumption [12], is not sparse and therefore SGD will be costly. In such a way, SGD loses its competitiveness. Therefore, we compare cuMFals with two opensource libraries for implicit MF: implicit^{4}^{4}4http://github.com/benfred/implicit and QMF^{5}^{5}5http://github.com/quora/qmf. Experiments demonstrate that cuMFals converges under the implicit setting and the per iteration time of cuMFals , implicit and QMF are 2.2, 90, and 360 seconds, respectively.
Vi Related Work
CPU  GPU  

SGD  lockfree: workers independently sample & update  
singlenode: HogWild! [22]; multinodes: FactorBird [30], Petuum [5]  
blocking: workers pick nonoverlapping blocks  
blockDim=#workers: DSGD [9]  
blockDim#workers: LIBMF [39], NOMAD [37], DSGD++ [32]  
nested blocking: dcMF [21], MLGFMF [27]  single and multiple GPUs: GPUSGD – SGD with lockfree and blocking [35]  
ALS  replicate all features: PALS [38], DALS [32]  
partial replicate features: SparkALS [18], GraphLab [17], Sparkler [16]  
rotate features: Facebook [13]  
approximate ALS: [29]  single GPU: BIDMach [2], HPCALS [8]  
single and multiple GPUs: GPUALS [31] and cuMFals  
CCD  multicore and multi node: CCD++ [36]  single GPU: parallel CCD++ [20] 
This section reviews related work on parallel matrix factorization with SGD, ALS and cyclic coordinate descent (CCD) algorithms. Table V is a summary and details are in the following subsections.
Via Parallel SGD
SGD is inherently serial where each time one sample is selected to update. To accelerate this process, two samples can update in parallel if they are neither in same row nor same column. This observation has led to two ways to parallel SGD for MF: lockfree Hogwild! [22] and blocking [36, 39, 9, 27]. Hogwild! observes that when is very sparse and the number of parallel workers is much less than the dimension of , they can independently update samples with a low probability of conflict. Blocking divides into several subblocks, and subblocks that do not share rows or columns can update in parallel.
CPU approaches. SGD has been parallelized in multicore [39, 27], multinode MPI [32, 37], MapReduce [9] and parameterserver [30, 5] systems. These methods partition into blocks with no overlapping rows or columns, and work on these blocks in parallel. They further optimize the algorithm with asynchronous communication, overlapping communication and computation, and shared memory. For example, LIBMF [39] is very efficient on multicores. However, it stops scaling when using few dozens cores [35, 21], because of the locking in a shared data structure. Moreover, LIBMF is a singlemachine solution and therefore cannot deal with largescale problems. NOMAD [37] extends the idea of block partitioning, and alleviate the issue of global locking. It performs similarly to LIBMF on a single machine and can scale to a 64node HPC cluster. Parameter server [5] can be used to implement distributed SGD. For example, Petuum [5] can scale MF to hundreds of cores in a cluster, and Factorbird [30] is a parameter server specifically implemented for matrix factorization.
GPU approaches. Both Hogwild and blocking schemes are implemented in [35]. It has efficient kernels for SGD update, leveraging cache, warpshuffle instructions, and halfprecision.
ViB Parallel ALS and CCD
CPU approaches for ALS. PALS [38] and SparkALS [18] parallelize ALS by feature full replication and partial replication, respectively. These approaches are not feasible when feature matrices get extremely large. Facebook [13] tackles this issue by partitioning the feature matrix and rotate its parts among multiple nodes. GraphLab [17] distributes the feature matrix among multiple machines. When updating in a machine, needed features are fetched ondemand from other machines.
GPU approaches for ALS. BIDMach [2] provides generic matrix kernels for many machine learning algorithms including MF. However, its sparse kernel is not specifically optimized for ALS and slower than cuMFals . HPCALS [8] optimizes the get_hermitian kernel similar to us. However, they used neither noncoalesced read, nor approximate solver nor reduced precision.
Parallel CCD. CCD++ [36] performs sequential updates on one row of the decomposed matrix while fixing other variables. CCD++ has lower time complexity but makes less progress per iteration, compared with ALS. [20] further accelerates CCD++ on GPUs using loop fusion and tiling. The resulting algorithm is shown to be faster than CCD++ on CPUs [36] as well as GPUALS [31] that is without memory optimization and approximate computing.
Vii Conclusion
Due to the importance of MF in the field of data mining, in this paper, we accelerate ALS, one of the most important MF solving algorithms with GPUs. Specifically, we identify challenges that make ALS running slow such as constraints in the utilization of capacity and bandwidth of memory, and computation intensiveness. To alleviate these problems, we propose a novel framework named cuMFals. Our algorithm exploits the GPU memory architectures and shortens the time of reading data via an innovative scheme. At the same time, the proposed algorithm also eliminates unnecessary computing in solving MF without hurting convergence. We conduct extensive experiments under various settings. Our proposed method achieves the stateoftheart performance, suggesting that cuMFals can significantly advance the task of MF. We also integrated cuMFals into Spark MLlib, accelerating its ALS algorithm^{6}^{6}6https://github.com/IBMSparkGPU/CUDAMLlib.
In future work, we would like to further analyze different GPUaccelerated MF algorithms and investigate algorithm selection based on dataset characteristics such as dimensions and sparsity, and hardware resource constraints such as number of GPUs. We also plan to investigate a hybrid solution that combines SGD and ALS. A scenario could be using ALS for the initial batch training and SGD for incremental updates of the model. Last but not the least, we would like to exploit the new Nvidia Tensor Cores [26] hardware that natively supports halfprecision arithmetic, to further speed up cuMFals .
References
 [1] J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, pages 2654–2662, 2014.
 [2] J. Canny, H. Zhao, B. Jaros, Y. Chen, and J. Mao. Machine learning at the limit. In IEEE BigData, Big Data 2015, pages 233–242, 2015.
 [3] W.S. Chin, Y. Zhuang, Y.C. Juan, and C.J. Lin. A learningrate schedule for stochastic gradient methods to matrix factorization. In PAKDD. Springer, 2015.
 [4] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew. Deep learning with COTS HPC systems. In ICML, pages 1337–1345, 2013.
 [5] H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting bounded staleness to speed up big data analytics. In USENIX ATC, pages 37–48, 2014.
 [6] G. Dror, N. Koenigstein, Y. Koren, and M. Weimer. The Yahoo! Music Dataset and KDDCup ’11. In KDD Cup 2011 competition, 2012.
 [7] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for generalpurpose approximate programs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO45, pages 449–460, 2012.
 [8] M. Gates, H. Anzt, J. Kurzak, and J. Dongarra. Accelerating collaborative filtering using concepts from high performance computing. In IEEE BigData, pages 667–676, 2015.
 [9] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Largescale matrix factorization with distributed stochastic gradient descent. In KDD, pages 69–77, 2011.
 [10] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
 [11] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. NBS, 1952.
 [12] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pages 263–272. IEEE, 2008.
 [13] M. Kabiljo and A. Ilic. Recommending items to more than a billion people. https://code.facebook.com/posts/861999383875667, 2015. [Online; accessed 17Aug2015].
 [14] D. B. Kirk and W. H. Wenmei. Programming massively parallel processors: a handson approach. Morgan Kaufmann, 2012.
 [15] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
 [16] B. Li, S. Tata, and Y. Sismanis. Sparkler: Supporting largescale matrix factorization. In EDBT, pages 625–636, 2013.
 [17] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. In VLDB, pages 716–727, 2012.
 [18] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. Mllib: Machine learning in apache spark. Journal of Machine Learning Research, 17:34:1–34:7, 2016.
 [19] S. Mittal. A survey of techniques for approximate computing. ACM Comput. Surv., pages 62:1–62:33, Mar. 2016.
 [20] I. Nisa, A. SukumaranRajam, R. Kunchum, and P. Sadayappan. Parallel CCD++ on GPU for matrix factorization. In GPGPU@PPoPP, pages 73–83, 2017.
 [21] Y. Nishioka and K. Taura. Scalable taskparallel sgd on matrix factorization in multicore architectures. In ParLearning, 2015.
 [22] F. Niu, B. Recht, C. Re, and S. J. Wright. HOGWILD!: A lockfree approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
 [23] Nvidia. cuBLAS. http://docs.nvidia.com/cuda/cublas/, 2015. [Online; accessed 17Aug2015].
 [24] Nvidia. cuBLAS. http://docs.nvidia.com/cuda/cublas/#cublaslttgtgemmbatched, 2016. [Online; accessed 7Nov2016].
 [25] Nvidia. NVIDIA NVLink. http://www.nvidia.com/object/nvlink.html, 2016. [Online; accessed 26Nov2016].
 [26] Nvidia. Programming Tensor Cores in CUDA 9. https://devblogs.nvidia.com/programmingtensorcorescuda9/, 2017. [Online; accessed 21Jan2018].
 [27] J. Oh, W. Han, H. Yu, and X. Jiang. Fast and robust parallel SGD matrix factorization. In KDD, pages 865–874, 2015.
 [28] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
 [29] I. Pillaszy, D. Zibriczky, and D. Tikk. Fast ALSbased matrix factorization for explicit and implicit feedback datasets. In RecSys, pages 71–78, 2010.
 [30] S. Schelter, V. Satuluri, and R. B. Zadeh. Factorbirda parameter server approach to distributed matrix factorization. In NIPS Workshop on Distributed Matrix Computations, 2014.
 [31] W. Tan, L. Cao, and L. Fong. Faster and cheaper: Parallelizing largescale matrix factorization on gpus. In HPDC, pages 219–230, 2016.
 [32] C. Teflioudi, F. Makari, and R. Gemulla. Distributed matrix completion. In ICDM, pages 655–664, 2012.
 [33] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep contentbased music recommendation. In NIPS, pages 2643–2651, 2013.
 [34] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009.
 [35] X. Xie, W. Tan, L. L. Fong, and Y. Liang. CuMF_SGD: Fast and Scalable Matrix Factorization. In HPDC, 2017.
 [36] H.F. Yu, C.J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pages 765–774, 2012.
 [37] H. Yun, H.F. Yu, C.J. Hsieh, S. Vishwanathan, and I. S. Dhillon. NOMAD: Nonlocking, stochastic multimachine algorithm for asynchronous and decentralized matrix completion. In VLDB, pages 975–986, 2014.
 [38] Y. Zhou, D. M. Wilkinson, R. Schreiber, and R. Pan. Largescale parallel collaborative filtering for the netflix prize. In AAIM, pages 337–348, 2008.
 [39] Y. Zhuang, W. Chin, Y. Juan, and C. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In RecSys, pages 249–256, 2013.