CuMF˙SGD: Fast and Scalable Matrix Factorization
Abstract
Matrix factorization (MF) has been widely used in recommender systems, database systems, topic modeling, word embedding and others. Stochastic gradient descent (SGD) is popular in solving MF problems because it can deal with large data sets and is easy to do incremental learning. We observed that SGD for MF is memory bound. Meanwhile, singlenode CPU systems with caches perform well only for small data sets; distributed systems have higher aggregated memory bandwidth but suffer from relatively slow network connection. This observation inspires us to accelerate MF by utilizing GPUs’s high memory bandwidth and fast intranode connection.
We present cuMF_SGD, a CUDAbased SGD solution for largescale MF problems. On a single GPU, we design two workload scheduling schemes (batchHogwild! and wavefrontupdate) that fully exploit the massive amount of cores. Especially, batchHogwild!, a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. We develop highlyoptimized kernels for SGD update, leveraging cache, warpshuffle instructions, halfprecision floats, etc. We also design a partition scheme to utilize multiple GPUs while addressing the wellknown convergence issue when parallelizing SGD. Evaluations on three data sets with only one Maxwell or Pascal GPU show that cuMF_SGD runs 3.1X28.2X as fast compared with stateofart CPU solutions on 164 CPU nodes. Evaluations also show that cuMF_SGD scales well with multiple GPUs on large data sets. Finally, we believe that the lessons learned from building cuMF_SGD are applicable to other machine learning algorithms on, e.g., (1) embedding layers in deep learning and (2) bipartite graph.
CuMF˙SGD: Fast and Scalable Matrix Factorization
\@float
copyrightbox[b]
\end@floatMatrix factorization (MF) has been widely used in recommender systems [?] by many companies (i.e. Amazon, Netflix, Facebook [?] and Spotify). It can also be used in topic modeling, word embedding [?], database system [?], and has a natural connection to the embedding layers in deep neural network. Let us use the recommender system as example. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows a rating matrix of , with sparse ratings from users to items. We assume that can be factorized into the multiplication of two lowrank feature matrices () and (), such that . The derived feature matrices and can be used to predict the missing ratings in , or as features of corresponding users/items in downstream machine learning tasks. Matrix factorization often involves large data sets. For example, the number of users/items may range from thousands to hundredsofmillions, and the number of observed samples in may range from millions to tensofbillions [?]. Therefore, there is a need to scale and speed up largescale matrix factorization.
There are mainly three algorithms to solve matrix factorization, i.e., coordinate gradient descent(CGD), alternate least square (ALS), and stochastic gradient descent (SGD). As previous works show that CGD is prone to reach local optima [?], we do not focus on it in this paper. ALS is easy to parallelize, and able to deal with implicit feedback such as purchase history [?]. Meanwhile, SGD usually converges faster because it is not as computationintensive as ALS; SGD is also more applicable in incremental learning settings where new ratings are continuously fed in. With our previous work already tackled ALS [?], we focus on SGD in this paper.
The optimization work on matrix factorization contains two streams: algorithm and system. The algorithmic stream tries to optimize update schemes such as learning rate in gradient descent, in order to reduce the number of epochs (an epoch is a full pass through the training set) needed to converge [?]. The system stream tries to accelerate the computation, in order to run each epoch faster [?, ?, ?, ?, ?, ?]. We focus the system stream and the proposed techniques can be combined with other algorithmic optimizations. Our research is based on the following observations.
Observation 1. MF with SGD is memory bound.
When solving MF with SGD, in each step of an epoch, we randomly select one observed sample, read the corresponding features and , do an inner product of them, update and , and eventually write them back (details to be given in Section CuMF˙SGD: Fast and Scalable Matrix Factorization). Obviously, the compute operations per byte is very low. For example,, if we use singleprecision (4 byte float) and , one SGD update involves 2432 bytes memory access (384 bytes for + 2048 bytes for read/write) and 1536 floating point operations (256 ops for dot product and 1280 ops for feature update). That is, the flops/byte ratio is . To put this number into perspective, a modern processor can achieve 600 Gflops/s and memory bandwidth 60 GB/s. This means that the operation’s flops/byte ratio has to be as large as to saturate the computation units. The low flops/byte ratio of MF with SGD indicates that, its performance is bounded by memory bandwidth.
Stateofart SGDbased MF solutions are based on either sharedmemory multithreading [?] or distributed systems [?]. We observed that neither of them is capable of offering sustained high memory bandwidth to MF.
Observation 2. Singlenode CPU systems with caching achieve high memory bandwidth only for small data sets. Distributed systems have higher theoretical bandwidth, but handicapped by the relatively weak network connection.
Sharedmemory CPU systems [?, ?, ?] rely heavily on cache to achieve high memory throughput. To understand this, we evaluate a singlenode and sharedmemory MF library LIBMF [?] with three data sets (details shown in Section CuMF˙SGD: Fast and Scalable Matrix Factorization). As seen from Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a), on the small Netflix data set, LIBMF achieves an observed 194 GB/s bandwidth^{1}^{1}1Observed bandwidth means the aggregated bandwidth offered by DRAM and cache that is observed by the compute unit., much larger than the actual system DRAM bandwidth ( GB/s). However, on a much larger Hugewiki data set, the achieved memory bandwidth drops by 45%, to 106 GB/s. This is because that Hugewiki data set is large enough to vanish a lot of data locality. This evaluation demonstrates that singlenode CPU systems cannot achieve high memory bandwidth when solving largescale MF problems.
Distributed systems are frequently used to accelerate timeconsuming applications [?, ?]. Distributed systems can aggregate the memory bandwidth and caches on multiple nodes. However, SGD is inherently sequential and consequently different nodes need to reconcile the parameters at the end of each epoch. As a result, despite of the high aggregated memory bandwidth, the performance is bounded by the limited network bandwidth between computer nodes. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) evaluates NOMAD [?], a distributed MF system. We measure its memory efficiency which is defined as the ratio of achieved memory bandwidth to the aggregated memory bandwidth of all nodes. The memory efficiency of NOMAD rapidly decreases when scale to multiple nodes, due to the slow overnetwork communication.
Observation 3. GPUs are with much higher memory bandwidth and enjoy fast interdevice connection within a single node, making GPU an ideal candidate to accelerate MF.
GPUs are widely used to accelerate applications with large data sets, e.g., machine learning and data base systems [?, ?, ?]. Observations 12 inspire us to resort to GPUs for the following reasons. Firstly, GPUs are with high offchip memory bandwidth. For example, NVIDIA Maxwell GPUs has theoretical 360 GB/s offchip memory bandwidth [?] and the newer generation Pascal GPUs provide 780 GB/s [?]. These are several times to anorderofmagnitude higher than CPUs. Secondly, GPUs do not rely on cache to reduce latency; instead, they rely on thousands of concurrent threads running on hundreds of cores to achieve high throughput [?, ?]. Therefore, unlike CPUs, GPUs’ achieved memory bandwidth does not degrade when working data set exceeds cache capacity, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). Thirdly, multiple GPUs can enjoy very fast interconnect in a node. For example, the recent NVLink [?] can offer 160 GB/s perGPU bandwidth to CPU and peerGPUs. This is much faster than PCIe 3.0 (32GB/s for 16x) and InfiniBand (6.8 GB/s for 56Gb/s FDR).
Proposed solution: cuMF_SGD to accelerate MF by utilizing one or multiple GPUs’ high memory bandwidth and intranode connection.
Parallelizing SGD on GPUs is challenging. Due to the architecture distinct, simply mapping CPUs’ algorithms to GPUs will lead to extremely low performance and suboptimal resources usage [?, ?]. Moreoever, SGD is inherently serial, studies [?] have shown that existing MF solutions do not scale well using merely 30 threads. Hence, to accelerate SGD to MF on GPUs, comprehensive understanding of GPU architectural features and novel algorithms are needed.
We present cuMF_SGD^{2}^{2}2http://github.com/cumf/cumf_sgd/, a fast and scalable SGDbased MF solution on GPUs. Inspired by the lockfree [?] and the blockbased [?, ?] approaches, and given the separated CPU/GPU memory space, cuMF_SGD adopts a hybrid twolevel execution scheme. (1) At the top level, the rating and feature matrices are partitioned into blocks and distributed to multiple GPUs. (2) At the second level, each GPU does SGD with its own partition. One GPU only synchronizes with other GPUs through the shared CPU memory, when it completes the processing of its partition. Within each GPU, the GPUlocal data partition is further distributed to the hundreds of thread blocks. Each thread block processes SGD updates with highlyoptimized vector operations. By this means, cuMF_SGD is able to scale to massive threads on multiple GPUs and perform highlyvectorized operations.
The contributions of this paper are as follows:

Workload characterization. We identify that SGDbased MF is bounded by memory bandwidth and synchronization overhead, instead of computation. We also identify the two challenges in implementing MF on GPUs using SGD, i.e., workload partitioning to avoid conflict and efficient update to exploit GPU hardware.

Optimization on a single GPU. We design two ways to partition and schedule the work within a single GPU, i.e., (1) matrix blockingbased algorithm with lightweight, wavebased scheduling policy, and (2) batchHogwild! which can be seen as a minibatch version of the original Hogwild! algorithm. Besides the scheduling schemes, we also develop highly optimized GPU kernels for SGD update. We leverage the architectural features such as cache, warp shuffle instructions, and halfprecision floats with 16 bits.

Deal with big data on multiple GPUs. We design a scheme to partition large data sets and solve them on multiple GPUs. We overlap data transfer and computation to minimize the execution time. We also analyze the relation between the number of partitions and the number of workers, which impacts the randomness of the SGD update, and ultimately the convergence speed.

Evaluation. We implement cuMF_SGD in a sharedmemory system with multiple GPUs. Experimental results show that, cuMF_SGD with one GPU is 3.1X28.2X as fast compared with stateofart CPU solutions on 164 CPU nodes. CuMF_SGD is also able to scale to multiple GPUs and different generations GPUs.
The remaining of this paper is organized as follows. Section CuMF˙SGD: Fast and Scalable Matrix Factorization analyzes the targeted GPU architecture, the workload of SGD for MF, and its parallelization schemes. Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents the singleGPU cuMF_SGD, including the two scheduling schemes and the GPU kernel. Section \thefigure discusses how to solve largescale problems by matrix partitioning. Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents and discusses experiment results. Section \thefigure discusses the related work and Section CuMF˙SGD: Fast and Scalable Matrix Factorization concludes this paper.
This section first briefly introduces the GPU architecture, and the SGD algorithm for matrix factorization. Then, we discuss the parallelism schemes for MF with SGD. We argue that, the current lockfree and matrix blocking schemes are not scalable on massive GPU cores. This leads to the innovations to be presented in Sections 3 and 4.
To overcome the limited memory bandwidth of a single CPU node and limited network bandwidth of distributed systems, we use a heterogeneous platform with GPUs, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. GPUs have high intradevice memory bandwidth and are connected via PCIe or NVLink [?] with high interdevice bandwidth. The CPUs take care of data preprocessing, data movement, and toplevel workload scheduling, while the GPUs deal with feature update with massive parallelism.
GPUs are throughputoriented processors [?] with thousands of cores and high bandwidth memory (200800 GB/s). To fully exploit the performance potential, GPU applications need to be carefully designed to exploit the data and computation parallelism. In the next two subsections we show that is nontrivial for SGD as SGD is inherently serial.
Given users and items and a sparse rating matrix , where indicates the preference or rating of user on item. The goal of matrix factorization is to train a feature matrix and a feature matrix such that:
The training process of matrix factorization is to minimize the following cost function:
where and are regularization parameters to avoid overfitting and is the number of nonzero samples in matrix . The key idea of SGD is in every single step, randomly select one sample, e.g., from , to calculate the gradient w.r.t to the following cost function:
Then update feature vectors with learning rate :
SGD is inherently serial where each time one sample is selected to update. Given a data set with samples, an epoch (aka., iteration) involves executing updates one after other. Usually, it takes tens to hundreds of epochs to converge. To accelerate this process, it was observed that two samples can update in parallel if they are neither in same row nor same column. In this paper, we call these samples as ”independent samples”. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows an example. Sample A (row 0, column 1) and Sample B (row 1, column 0) are independent as they are associated with different rows in and different columns in . Meanwhile, Sample A and Sample C are dependent as they update the sample column 1 in . Ideally, SGD can be parallelized without losing accuracy, if we update independent samples in parallel, and update dependent samples sequentially [?].
To accelerate the SGD algorithm, how to partition the workloads to parallel workers becomes one major challenge. The efficiency of the workload scheduling scheme has a profound impact to the convergence speed. The workload scheduling policies in existing work [?, ?, ?, ?, ?] can be divided into two categories, Hogwild! and matrixblocking.
Hogwild!(Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a)) is a lockfree approach to parallelize SGD[?]. In this paper, we call that a conflict happens if two or more concurrent threads update samples in the same row or column at the same time. Intuitively, certain synchronization should be used to avoid conflicts. However, Hogwild! observes that such synchronization is not needed when the is very sparse and the number of concurrent threads is much less than the number of samples. The intuition is that, when the aforementioned requirements are met, the probability of conflicts is very low and the incurred accuracy loss can be ignored. Based on the low synchronization overhead, it is used by some MF solutions. However, we need to enhance Hogwild! in cuMF_SGD in two aspects: (1) Hogwild! assumes a global shared memory space, which is not feasible in our hybrid CPU/GPU setting where each GPU has its own memory space. We can only run Hogwild! in a single GPU and need another layer of partition among multiple GPUs. (2) The vanilla Hogwild! is not cache friendly because of random access. How to balance the randomness in SGD update and cache efficiency is an important issue at design time.
Matrixblocking (Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b)) divides the rating matrix into several subblocks, and subblocks that do not share rows or columns can update in parallel. Matrixblocking is used by many recent work [?, ?, ?, ?]. Matrixblocking has advantage in that, it totally avoids conflict. However, in matrixblocking parallel workers need to ask a global scheduler on which blocks to proceed next. This global scheduler has been shown not scalable to manycore architectures [?]. Hence, we need to enhance existing matrixblocking schemes to scale to the many cores on GPUs.
This section presents how cuMF_SGD solves MF with one GPU. We preassume that all required data resides in GPU device memory. We discuss multiGPU implementation in Section \thefigure. We need to tackle two issues in single GPU. Section CuMF˙SGD: Fast and Scalable Matrix Factorization discusses the issue of computation optimization, i.e., to optimize each individual SGD update by exploiting GPU hardware. Section CuMF˙SGD: Fast and Scalable Matrix Factorization discusses workload scheduling, i.e., to distribute the many SGD updates to thousands of concurrent GPU threads.
In MF, one SGD update consists of four steps: 1) read one sample () from the rating matrix, 2) read two feature vectors (, ), 3) compute prediction error , and 4) update the features. Except the first step, other three steps are all vector operations at length . is an input parameter and typically ranges from to . On a CPU, a parallel worker can be a thread or process, where vector instructions such as SSE and AVX can be used to accelerate the computation. GPUs are SIMD architectures [?], where a thread block is a vector group. Hence, in cuMF_SGD, we use a thread block as a parallel worker. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows a code snippet of the computational part of cuMF_SGD, where we use as an example. We highlight the major optimization techniques in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization and explain them in the following.
Cache. Since Fermi architectures, NVIDIA GPUs feature onchip L1 cache and allows the programmers to control the cache behavior of each memory instruction (cache or bypass). While many GPU applications do not benefit from cache due to cache contention [?], some memory instructions may benefit from cache as the accessed data may be frequently reused in the future (temporal reuse) or by other threads (spatial reuse). Following the model provided by [?], we observe that the memory load of the rating matrix benefits from cache and use the intrinsic instruction __ldg [?] to enable cacheassisted read.
Memory coalescing. On GPUs, when threads within one warp access the data within one cache line, the access is coalesced to minimize the bandwidth consumption [?]. This is called memory coalescing. In cuMF_SGD, the read/write of and are carefully coalesced to ensure that consecutive threads access consecutive memory addresses.
Warp shuffle. Warp shuffle instructions [?] are used to compute the dot product and broadcast the result. Compared with traditional sharedmemorybased approaches, this warp shufflebased approach performs better because: (1) warp shuffle instructions have extra hardware support, (2) register is faster than shared memory, and (3) no thread synchronization is involved. To exploit the warp shuffle feature, we fix the thread blocks size as warp size(32).
ILP. Modern GPUs support compileraided super scalar to exploit the instructionlevel parallelism (ILP). In cuMF_SGD, when , a thread is responsible to process independent scalars. Hence, with awareness of the lowlevel architecture information, we reorder the instructions to maximize the benefit of ILP.
Register usage. Register file is an important resource on GPUs [?]. As the total number of registers on GPUs are limited, while each thread uses too many registers, the register consumption may become the limitation to concurrency. In our case, we identify that the concurrency is only limited by the number of thread blocks of GPUs [?]. Hence, we allocate as many as possible registers to each thread such that every reusable variable is kept in the fastest register file.
Halfprecision. As addressed before, SGD is memory bound. Most of the memory bandwidth is spent on the read/write to the feature matrices. Recently, GPU architectures support the storage of halfprecision (2 bytes vs. 4 bytes of singleprecision) and fast transformation between floating point and halfprecision. In practice, after parameter scaling, halfprecision is precise enough to store the feature matrices and do not incur accuracy loss. CuMF_SGD uses halfprecision to store feature matrices, which halves the memory bandwidth need when accessing feature matrices.
The original SGD algorithm is serial, with samples in the rating matrix picked up randomly and updated in sequence. To exploit parallelism, a workload scheduling policy that assigns tasks to parallel workers becomes necessary. We start from investigating the existing CPUbased scheduling policies. Specifically, we select a representative system LIBMF [?], a shared memory SGD solution to MF. LIBMF proposes a novel workload scheduling policy which successfully solves the load imbalance problem and achieve high efficiency. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a), LIBMF divides the rating matrix to several blocks and uses a global scheduling table to manage the parallel workers. Whenever a worker is free, an idle independent block is scheduled to it. The process is repeated until convergence. However, we and others [?] observe that LIBMF faces scalability issues because of the global scheduling table it uses.
Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows a scalability study of LIBMF. LIBMFGPU is a GPU version of LIBMF that combines the workload scheduling policy of LIBMF and our GPU computation implementation described in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. We use SGD updates per second as the performance metric:
where , , indicate number of iterations, number of nonzero samples in the input matrix , and elapsed time in seconds, respectively.
Evaluations show that the performance of LIBMF saturates around 30 concurrent workers (CPU threads), which is consistent with the previous study [?]. We perform some GPUspecific optimization techniques when implementing LIBMFGPU, however, it still can only scale to 240 thread blocks, much lower than the hardware limit(768 thread blocks). The reason why LIBMF cannot scale to many parallel workers is that, it uses a global scheduling table to manage all parallel workers.
At each time, only one parallel worker can access the table and it is also time consuming to find a free block to assign the work to. Therefore, when the number of workers increase, the waiting time also increases. As the number of worker grows, the waiting time becomes dominating. This shows that, cuMF_SGD can not simply reuse existing scheduling policies. To overcome this scheduling overhead, we propose two GPUspecific scheduling schemes, batchHogwild! and Wavefrontupdate. BatchHogwild! avoids blockbased scheduling and improves the cache efficiency by process samples in batch. Wavefrontupdate is still blockbased, but only requires a local lookup instead of the expensive global lookup in LIBMF.
We propose batchHogwild!, a variant of Hogwild! [?] with better cache efficiency. Hogwild! is efficient as its lockfree scheme incurs low scheduling overhead. It is not efficient, however, in terms of data locality [?]. In vanilla Hogwild!, each parallel worker randomly selects one sample from the rating matrix at each step. After each update, Hogwild! may not access the consecutive samples in the rating matrix and corresponding rows and columns in the feature matrices during a long time interval, leading to low cache efficiency. As discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization, we carefully align the memory access to feature matrices to achieve perfect memory coalescing and the high memory bandwidth on GPUs makes accessing feature matrices no longer a performance bottleneck. To accelerate the access to rating matrix, we exploit the spatial data locality using L1 data cache. We let each parallel worker, instead of fetch one sample randomly at a time, fetches consecutive samples and update them serially. Note that these samples are consecutive in terms of their memory storage; because we shuffle samples, they are still random in terms of their coordinates in . By doing so, the data locality is fully exploited. Consider the L1 cache line size is 128 bytes and the size of each sample is 12 bytes (one floating point and two integers), is enough to exploit the locality. We evaluate different values of and find that they yield similar benefit. Therefore we choose without loss of generality.
As discussed, existing scheduling schemes [?, ?] impose a global synchronization, where all workers look up a global table to find both row and column coordinates to update. This is expensive and has been shown not scalable to the hundreds of workers on GPUs. To overcome this, we propose wavefrontupdate, a lightweight scheme that locks and look up columns only.
We explain the idea of wavefrontupdate using Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. We use four parallel works to process an which is partitioned into blocks. Each worker is assigned to a row in this grid, and each generates a permutation of as its column update sequence. By this means, an epoch is conducted in eight waves given this sequence. In each wave, one worker update one block, and workers do not update blocks in the same column. Assume Worker 1 has the sequence defined as and Workder 3 has sequence . With this sequence, Worker 1 updates Block 1 in wave 1 and Block 5 in wave 2. To avoid conflicts, we propose a lightweight synchronization scheme between waves using the column lock array. As shown the figure, we use an array to indicate the status of each column. Before a worker moves to next wave, it checks the status of the next column defined in its sequence. For example, after Worker 1 finishes Block 1, it needs to check the status of column 4 and does not need to care about other columns’ status. When Work 3 finishes Block 3 and releases column 4, Worker 1 is allowed to move to wave 2. There are two main benefits by doing so: (1) reduce the twodimension lookup table in [?, ?] to an onedimension array, (2) minimize the workload imbalance problem, as a worker can start the next block earlier compared to waiting for all other workers to finish.
We evaluate both techniques in terms of performance and convergence speed using the Netflix data set and the Maxwell platform^{3}^{3}3Details of the data set and platform are presented in Section CuMF˙SGD: Fast and Scalable Matrix Factorization.. We use metric to quantitative the performance. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a) shows the scalability of batchHogwild! and Wavefrontupdate with different number of parallel workers (i.e., thread blocks). When increasing the number of parallel workers, both techniques achieve nearlinear scalability. When the number of parallel workers hits the hardware limit (768) of the Maxwell GPU, both techniques achieve 0.27 billion updates per second, which is 2.5 times of LIBMF. Therefore, we conclude that our proposed solutions can perfectly solve the scalability problem of the scheduling policy and fully exploit the equipped hardware resources on GPUs. We also evaluate the convergence speed of both techniques. We use the root mean square root error on the standard test data set as the indication of convergence. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows the decrease of Test RMSE in iterations. Overall, batchHogwild! converges a little bit faster than Wavefrontupdate. The reason is, batchHogwild! enforces more randomness in update sequence, compared with the blockbased wavefrontupdate. Based on this observation, we use batchHogwild! as the default scheme on one GPU.
Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents how to solve MF in a single GPU, assuming the rating matrix and feature matrices fully reside in GPU memory. However, the limited GPU memory capacity [?] prevents cuMF_SGD from solving large scale problems. For example, NVIDIA TITAN X GPU has 12 GB device memory that can only store 1 billion samples (one sample needs one float and two integers). Nowadays, realworld problems may have samples [?]. Techniques such as Unified Virtual Memory [?] allow GPU to use CPU memory but with high overhead. Consider these factors, to solve largescale MF problems that can not fit into one GPU’s device memory, we need to partition the data sets and stage the partitions to GPUs in batches. Moreover, We should overlap data transfer with computation to alleviate the delay caused by slow CPUGPU memory transfer. Please note that, the partitions can be processed by one or multiple GPUs.
Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows our proposed multiple GPUs solution. The main idea is to divide the rating matrix into multiple blocks; each block is small enough to fit into a GPU’s device memory such that independent blocks can update concurrently on different GPUs. The multiple GPU solution works as follows,

Divide the rating matrix into blocks. Meanwhile, divide feature matrix into segments and feature matrix into segments accordingly.

When a GPU is idle, randomly select one matrix block from those independent blocks and dispatch it to the GPU.

Transfer the matrix block and corresponding feature submatrices and to the GPU. Then update the matrix block using the single GPU implementation discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. After the update, transfer and back to CPU.

Iterate from 2 until convergence or the given number of iterations is reached.
We further explain the proposed scheme using the example shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). In Step 1, we divide into blocks and use two GPUs. In Step 2, we send block to GPU 0 and to GPU 1. Again, consider the nature of MF, updating only touches submatrices & while updating only touches & . Hence, GPU 0 only needs to store , , and in its device memory while GPU 1 only needs to store , , and . By doing so, the problem is divided and conquered by multiple GPUs. After deciding the block scheduling order, cuMF_SGD transfers , , to GPU 0 and , , to GPU 1, then performs the computation on two GPUs in parallel. The GPU side computation follows the rules we discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. After finishing the computation, the updated , , , and are transferred back to CPU memory. Note that we don’t have to transfer or back to CPU memory as they are readonly.
Scalability problem. We mentioned LIBMF faces scalability issue, as the scheduling overhead increases quickly with the number of workers [?]. Our multipleGPU scheduling scheme has similar complexity with that of LIBMF. However, it does not face the same scalability issue as we only need to schedule to a few GPUs instead of hundreds of workers.
GPUs’ memory bandwidth are much larger than the CPUGPU memory transfer bandwidth. For example, NVIDIA TITAN X GPUs provide 360 GB/s device memory bandwidth while the CPUGPU memory bandwidth is only 16 GB/s (PCIe v3 16x). In single GPU implementation, CPUGPU memory transfer only happens at the start and end of MF, and therefore not dominant. However, when the data set can not fit into the GPU memory, memory transfer happens frequently and has higher impact on the overall performance.
Given the memory transfer overhead, we overlap the memory transfers and computation when solving large problems, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b). Due to space limitation , we only plot one GPU. The key idea is, at the block scheduling time, instead of randomly selecting one independent block for the GPU, the optimized technique randomly selects multiple blocks at a time. Those blocks are pipelined to overlap the memory transfer and computation: we schedule two blocks to GPU 0, and overlap the memory transfer of the second block () with the computation of the first block (). Note that the two blocks scheduled to one GPU do not need to be independent as they are updated in serial; meanwhile, blocks scheduled to different GPUs have to be independent with each other to avoid conflicts. By doing so, we can reduce the overhead of slow CPUGPU memory transfer and improve the overall performance.
Discussion. Allocating more blocks to one GPU would yield more performance benefit as more memory/computation overlapping can be achieved. However, the number of available blocks is limited by how do we divide the rating matrix . Consider we divide to and we have two GPUs running in parallel, the number of blocks per GPU cannot be more than . In practice, is determined by the size of the rating matrix and the available hardware resources on the GPU. We will discuss it in Section CuMF˙SGD: Fast and Scalable Matrix Factorization.
Multiple GPUs management. We implement it using multiple CPU threads within one process. Within the process, there is one host thread and multiple worker threads, where each GPU is bound to one worker thread. The host thread manages the workload scheduling and informs worker threads of the scheduling decision. Each worker thread then starts data transfer and launches compute kernels on a GPU.
Overlapping. Each worker thread is responsible to overlap the computation and CPUGPU memory transfers. We use CUDA streams to achieve this. A stream contains a list of GPU commands that are executed in serial, and commands in different streams are executed in parallel if hardware resources permit. Each worker thread uses three streams that manage CPUGPU memory transfer, GPUCPU memory transfer, and GPU kernel launch, respectively.
We implement cuMF_SGD using CUDA C (source code at http://github.com/cumf/cumf_sgd/), evaluate its performance on public data sets, and demonstrate its advantage in terms of performance and cost. Section CuMF˙SGD: Fast and Scalable Matrix Factorization introduces the experimental environment. The following experiments are designed to answer these questions:

Compared with stateoftheart SGDbased approaches on CPUs [?, ?], is cuMF_SGD better and why? (Section \thetable)

What is the implication of using different generations of GPUs? (Section \thefigure)

Compared with the ALSbased GPU library cuMF_ALS that we published earlier [?], what is the advantage of cuMF_SGD? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)

Parallelizing SGD is always tricky and may lead to converge problems. What are the factors impacting parallelizing SGD? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)

When scale up to multiple GPUs, is cuMF_SGD still efficient? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)
Platform. We evaluate cuMF_SGD on heterogeneous platforms with both CPU and GPUs. Table CuMF˙SGD: Fast and Scalable Matrix Factorization shows the configuration of the two servers used in experiments.
Maxwell Platform  

CPU  12core Intel Xeon CPU E52670*2 (up to 48 threads), 512 GB memory 
GPU  TITAN X GPU*4, per GPU: 24 SMs, up to 768 thread blocks, 12 GB device memory. 
Pascal Platform  
CPU  2*10 PowerNV 8 processors with SMT 8 and NVLink. 
GPU  Tesla P100 GPU*4, per GPU: 56 SMs, up to 1792 thread blocks, 16 GB device memory. 
Data sets. We use three public data sets: Netflix, Yahoo!Music, and Hugewiki. Details of them are shown in Table CuMF˙SGD: Fast and Scalable Matrix Factorization. Netflix and Yahoo!Music come with a test set but Hugewiki does not. We randomly sample and extract out 1% of the data set for testing purpose.
Dataset  Netflix  Yahoo!Music  Hugewiki 

480,190  1,000,990  50,082,604  
17,771  624,961  39,781  
128  128  128  
99,072,112  252,800,275  3,069,817,980  
1,408,395  4,003,960  31,327,899 
Parameter. As mentioned in the introduction, this paper focus on systemlevel but not algorithmiclevel optimization. Therefore, we did not spend much effort in parameter turning. Instead, we use the parameters adopted by earlier work [?, ?, ?, ?]. For learning rate, we adopt the learning rate scheduling techniques used by Yun et al. [?], where the learning rate at epoch is monotonically reduced in the following routine:
is the given initial learning rate and is another given parameter. The parameters used by cuMF_SGD are listed in Table CuMF˙SGD: Fast and Scalable Matrix Factorization.
Dataset  

Netflix  0.05  0.08  0.3 
Yahoo!Music  1.0  0.08  0.2 
Hugewiki  0.03  0.08  0.3 
We compare cuMF_SGD with the following stateoftheart approaches.

LIBMF [?]. LIBMF is a representative blockingbased solution on sharedmemory systems. Its main design purpose is to balance the workload across CPU threads and accelerate the memory access. It leverages SSE instructions and a novel learning rate schedule to speed up the convergence [?].
We exhaustively evaluate all possible parallel parameters on the Maxwell platform and select the optimal one. For example, we use 40 CPU threads and divide the input rating matrix into blocks; we set its initial learning rate as .

NOMAD [?]. NOMAD is a representative distributed matrix factorization solution. It uses a 64node HPC cluster to solve MF. It proposes a decentralized scheduling policy to reduce the synchronization overhead and discusses how to reduce the internode communication. We present the best results presented in the original paper, i.e., using 32 nodes for Netflix and Yahoo!Music data sets and using all 64 nodes for Hugewiki data set on the HPC cluster.

CuMF_SGD. We evaluate cuMF_SGD on both Maxwell and Pascal platforms, with all three data sets. We name the results on Maxwell as cuMF_SGDM and those on Pascal as cuMF_SGDP. We use one GPU in this subsection. The number of parallel workers (thread blocks) is set as the maximum of the corresponding GPU architecture (768 on Maxwell platform and 1792 on Pascal platform). As Hugewiki can not fit into one GPU’s memory, we divide it into blocks and at each scheduling, we schedule 8 blocks to overlap memory transfer and computation.
Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows the test RMSE w.r.t. the training time. Table CuMF˙SGD: Fast and Scalable Matrix Factorization summarizes the training time required to converge to a reasonable RMSE (0.92, 22.0, and 0.52 for Netflix, Yahoo!Music, and Hugewki, respectively). Results show that with only one GPU, cuMF_SGDP and cuMF_SGDM perform much better (3.1X to 28.2X) on all data sets compared to all competitors, including NOMAD on a 64node HPC cluster. In the following we analyze the reasons.
Data set  Netflix  Yahoo!Music  Hugewiki 

LIBMF  23.0s  37.9s  3020.7s 
NOMAD  9.6s(2.4X)  108.7s(0.35X)  459.1s(6.6X) 
CuMF_SGDM  7.5s(3.1X)  8.8s(4.3X)  442.3s(6.8X) 
CuMF_SGDP  3.3s(7.0X)  3.8s(10.0X)  107.0s(28.2X) 
Comparison with LIBMF. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization and Table CuMF˙SGD: Fast and Scalable Matrix Factorization, cuMF_SGD outperforms LIBMF on all data sets, on both Maxwell and Pascal. More precisely, cuMF_SGDM is 3.1X  6.8X as fast as LIBMF and cuMF_SGDP is 7.0X  28.2X as fast. CuMF_SGD outperforms LIBMF because it can do more updates per second, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). We already mentioned that that matrix factorization is memory bound; LIBMF is also aware of that and strives to keep all frequently used data in the CPU cache. However, the limited cache capacity on a single CPU makes LIMBF suboptimal in large data sets. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b), LIBMF achieves an effective memory bandwidth of 194 GB/s^{4}^{4}4The achieved memory bandwidth measures the data processed by the compute units per second, and can be higher than the theoretical offchip memory bandwidth thanks to the cache effect. on the Netflix data set (with 99M samples) – close to cuMF_SGDM. However its achieved bandwidth drops almost by half, to 106 GB/s on the larger Hugewiki data set (with 3.1B samples) – while cuMF_SGD achieves similar bandwidth in all data sets.
On the scheduling policy of LIBMF. Simply porting LIBMF to GPUs leads to resource underutilization due to the scalability of it scheduling policy (recall Figure CuMF˙SGD: Fast and Scalable Matrix Factorization). In contrast, the workload scheduling policy and memory/computation pattern of cuMF_SGD are specifically designed to fully exploit the computation and memory resources on GPUs. Hence, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization (b), cuMF_SGD achieves much higher bandwidth than LIBMF. Moreover, cuMF_SGD uses halfprecision (2 bytes for a float number) to store feature matrices. As a result, it can perform twice updates as LIBMF with the same bandwidth consumption.
Compared with NOMAD. As presented in [?], NOMAD uses 32 nodes for Netflix and Yahoo!Music and 64 HPC nodes for Hugewiki. Despite of the tremendous hardware resources, NOMAD is still outperformed by cuMF_SGD on all data sets. As observed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization, MF is a memory bounded and data communication happens frequently between parallel workers. When NOMAD distributes parallel workers to different nodes, the network bandwidth which is much less than intranode communication, becomes the bottleneck. Consequently, NOMAD achieves suboptimal scalability when scale from single node to multiple nodes, especially for small data sets. For example, on Yahoo!Music, NOMAD performs even worse than LIBMF that uses only one machine.
NOMAD (on a 64node HPC cluster) has similar performance with cuMF_SGDM on Hugewiki, while it is much slower than cuMF_SGDP. Obviously, cuMF_SGD is not only faster, using a single CPU card, it is also more costefficient.
We have evaluated cuMF_SGD on the two current generations of GPUs: Maxwell and Pascal. We believe that cuMF_SGD is able to scale to future GPU architectures with minor tuning effort. In this section, we explain the performance gap between Maxwell and Pascal in three aspects: computation resources, offchip memory bandwidth, and CPUGPU memory bandwidth.
Computation resources. We show the SGD updatespersecond metric of two platforms with different number of parallel workers using Netflix in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). Results show that the Pascal platform scales to more parallel workers and achieves much higher than Maxwell. This is because the Maxwell platform has 24 streaming multiprocessors (SMs) within each GPU, with each SM allowing up to 32 parallel workers (thread blocks). Hence, one Maxwell GPU allows up to 768 parallel workers. Meanwhile, the Pascal GPU used has 56 SMs and allows 32 thread blocks on each SM. Hence, a Pascal GPU allows up to 1792 parallel workers, which is 2.3 times of that of Maxwell GPU. Overall, a Pascal GPU is more powerful than a Maxwell GPU in term of the amount of computation resources.
Offchip memory bandwidth. As we discussed before, SGD is memory bound. Optimized for throughput, GPUs are able to overlap memory access and computation by fast context switch among parallel workers [?]. When there are enough parallel workers running on GPUs, long memory latencies can be hidden, which is exactly what happens with cuMF_SGD. In this scenario, memory bandwidth, instead of memory latency, becomes the limitation of the performance. Pascal platforms provides twice as much theoretical peak offchip memory bandwidth (780 GB/s) as Maxwell platforms(360 GB/s). Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows the achieved memory bandwidth on two platforms with different number of parallel workers. On Maxwell and Pascal, cuMF_SGD achieves up to 266 GB/s and 567 GB/s, respectively.
CPUGPU memory bandwidth. Netflix and Yahoo!Music data sets are small enough to fit into the GPU device memory. For Hugewiki, memory transfer occurs multiple times as the data cannot fit into GPU device memory. In Section CuMF˙SGD: Fast and Scalable Matrix Factorization, we propose to overlap data transfer with computation. Despite of this optimization, the CPUGPU memory bandwidth still has noticeable impact on the overall performance as the perfect overlapping cannot be achieved. On the Maxwell platform, the memory transfer between CPU and GPU is via PCIe v3 16x with 16 GB/s bandwidth (we observe that on average, the achieved bandwidth is 5.5 GB/s). The very recent Pascal platform is with NVLink [?] that can provide 40 GB/s in theory (we observe an average 29.1 GB/s CPUGPU memory transfer bandwidth, which is 5.3X as that on Maxwell). This also explains why cuMF_SGD achieves much more speedup on Hugewiki using Pascal platform (28.2X) than that on Maxwell platform (6.8X).
Our earlier work cuMF_ALS [?] represents the stateofart ALSbased matrix factorization solution on GPUs. We use one GPU for cuMF_SGD, and one and four GPUs for cuMF_ALS. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization compares their performance on three data sets on Maxwell. We observe that cuMF_SGD is faster than cuMF_ALS1 and achieves similar performance with cuMF_ALS4 with only one GPU.
It expected that cuMF_SGD is faster than cuMF_ALS, with the following reason. Each epoch of SGD needs memory access of and computation of . Each epoch of ALS needs memory access of and computation of . ALS’s epochs run slower due to its much more intensive computation. Although ALS needs fewer epochs to coverage, as a whole it converges slower. Despite the fact that cuMF_ALS is slower than cuMF_SGD, we still maintain both solutions at https://github.com/cuMF/ because they serve different purposes: SGD converges fast and easy to do incremental update, while ALS is easy to parallelize and is able to deal with nonsparse rating matrices [?].
The original SGD algorithm is serial. To speed it up, we discuss how to parallelize it on one GPU in Section CuMF˙SGD: Fast and Scalable Matrix Factorization and on multiple GPUs in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. It is wellknown that SGD parallelization may have subtle implications on convergence [?, ?]. In the context of matrix factorization, the implication varies on the two schemes proposed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization: Hogwild! and matrixblocking.
For one GPU, Section \thefigure proposes the batchHogwild! scheme to partition work . As a vectorized version of Hogwild!, batchHogwild! inherits the limitation of Hogwild!. Given a rating matrix of and parallel workers, convergence is ensured only when the following condition satisfied [?]:
For multiple GPUs, Section \thefigure proposes to first divide the rating matrix into blocks and process one block on one GPU in parallel if possible. In this case, the above condition needs to change to:
Our empirical study on Hugewiki data set shows that, needs to be to converge. We validated that, Hugewiki data set has and we choose ; convergence is achieved when , and fails when . We believe this is a limitation for all Hogwild!style solutions.
The purpose of matrixblocking is to avoid conflicts between parallel workers. However, we observe that matrixblocking can have negative impact on convergence. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization illustrates the the convergence speed of LIBMF on Netflix data set with different parameters. In this study, we fix the number of parallel workers ; without loss of generality, we divide into blocks and vary the value of . Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows that when is less than or close to , convergence speed is much slower or even cannot be achieved. We have similar observations on other data sets and using cuMF_SGD. We briefly explain the reason with a simple example shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization.
In Figure CuMF˙SGD: Fast and Scalable Matrix Factorization, we divide the rating matrix into blocks and use 2 parallel workers. In theory, 4 blocks can have possible update orders. We also show all update orders in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. However, only orders 18 out of the total 24 are feasible so as to avoid update conflicts. For example, when Block 1 is issued to one worker, only Block 4 can be issued to another worker. Hence, Blocks 2 and 3 cannot be updated between 1 and 4, which precludes order 912. This demonstrated that when , all independent blocks have to be updated concurrently to make all workers busy, which enforces certain update order constraints and hurts the randomness. As a consequence, convergence speed can deteriorate. In practice, when cuMF_SGD uses two GPUs, should at least be divided into blocks.
System wise, cuMF_SGD is designed to scale to multiple GPUs. However, algorithmic wise, the scaling is restricted by factors such as problem dimension and number of parallel workers, as discussed earlier in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. Among the three data sets used in this paper, Netflix and Hugewiki have very small (, , receptively), preventing cuMF_SGD from solving them on multiple GPUs. In comparison, Yahoo!Music can be solved on multiple GPUs as the dimension of it is . We divide its into blocks and run it with two Pascal GPUs. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows the convergence speed. With 2 Pascal GPUs, cuMF_SGD takes 2.5s to converge to RMSE 22, which is 1.5X as fast as 1 Pascal GPU (3.8s). The reason behind this sublinear scalability is that, the multiGPU cuMF_SGD needs to spend time on CPUGPU memory transfer so as to synchronize two GPUs.
Algorithms. SGD has been widely used to solve matrix factorization [?]. Serial SGD can be parallelized to achieve better performance. ALS is naturally easy to parallelize and it can also been used in dense matrix factorization. Coordinate descent is another algorithm to solve matrix factorization [?, ?]. It updates the feature matrix along one coordinate direction in each step. Our earlier work [?] focuses on ALS algorithm.
Parallel SGD solutions have been discussed in multicore [?, ?, ?], multinode [?, ?], MapReduce [?, ?] and parameterservers [?, ?] settings. Existing works are mostly inspired by Hogwild! [?] that allows lockfree update, or matrixblocking that partitions to avoid conflicts, or a combination of them. LIBMF [?, ?] is a representative sharedmemory multicore system. Evaluations have shown that it outperforms all previous approaches with one machine. Although it has been optimized for cache efficiency, it is still not efficient at processing large scale data sets. Moreover, the high complexity of its scheduling policy makes it infeasible to scale to many cores. NOMAD [?] partitions the data on HPC clusters to improve the cache performance. At the meantime, they propose to minimize the communication overhead. Compared with LIBMF, it has similar performance on one machine and is able to scale to 64 nodes.
Parallelization is also used in coordinate descent [?]. Compared with SGD, coordinate descent has lower overhead and runs faster at the first few epochs of training. However, due to the algorithmic limitation, coordinate descent is prone to reach local optima [?] in the later epochs of training.
Compared with CGD and SGD, ALS is inherently easy to parallel, ALS based parallel solutions are widely discussed [?, ?, ?, ?, ?]. Our earlier work, cuMF_ALS [?] focuses on optimizing ALS to matrix factorization on GPUs. As ALS algorithm is more compute intensive, it runs slower than cuMF_SGD.
GPU solutions. Prior to our work, [?] applies Restricted Boltzmann Machines on GPUs to solve MF. [?] implements both SGD and ALS on GPU to solve MF. In contrast, cuMF_SGD outperforms them because we optimize both memory access and workload scheduling.
Matrix factorization is widely used in recommender systems and other applications. SGDbased MF is limited by memory bandwidth which single and multiGPU systems cannot efficiently provision. We propose a GPUbased solution, by observing that GPUs offers abundant memory bandwidth and can enjoy fast intranode connection. We design workload partition and schedule schemes to dispatch tasks insides a GPU and across GPUs, without impacting the randomness required by SGD. We also develop highlyoptimized GPU kernels for individual SGD updates. With only one Maxwell or Pascal GPU, cuMF_SGD runs 3.1X28.2X as fast compared with stateofart CPU solutions on 164 CPU nodes. Evaluations also show that cuMF_SGD scales well on multiple GPUs in large data sets.
 1 Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer.

2
“Recommending items to more than a billion people..”
https://code.facebook.com/posts/861999383875667/recommendingitemstomorethanabillionpeople/.  3 J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.,” in EMNLP, 2014.
 4 M. Sarwat, Database Management System Support for Collaborative Filtering Recommender Systems. PhD thesis, UNIVERSITY OF MINNESOTA, 2014.
 5 W.S. Chin, Y. Zhuang, Y.C. Juan, and C.J. Lin, “A fast parallel stochastic gradient method for matrix factorization in shared memory systems,” ACM Transactions on Intelligent Systems and Technology (TIST), 2015.
 6 W. Tan, L. Cao, and L. Fong, “Faster and cheaper: Parallelizing largescale matrix factorization on gpus,” in Proceedings of the 25th ACM International Symposium on HighPerformance Parallel and Distributed Computing, HPDC ’16, 2016.
 7 W.S. Chin, Y. Zhuang, Y.C. Juan, and C.J. Lin, “A learningrate schedule for stochastic gradient methods to matrix factorization,” in PacificAsia Conference on Knowledge Discovery and Data Mining, Springer, 2015.
 8 C. Teflioudi, F. Makari, and R. Gemulla, “Distributed matrix completion,” in 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012.
 9 H.F. Yu, C.J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012.
 10 H. Yun, H.F. Yu, C.J. Hsieh, S. V. N. Vishwanathan, and I. Dhillon, “Nomad: Nonlocking, stochastic multimachine algorithm for asynchronous and decentralized matrix completion,” Proc. VLDB Endow., 2014.
 11 B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lockfree approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 693–701, 2011.
 12 Z. Liu, Y.X. Wang, and A. Smola, “Fast differentially private matrix factorization,” in Proceedings of the 9th ACM Conference on Recommender Systems, RecSys’15, 2015.
 13 S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main memory hash join algorithms for multicore cpus,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, 2011.
 14 F. Yang, J. Li, and J. Cheng, “Husky: Towards a more efficient and expressive distributed computing framework,” Proceedings of the VLDB Endowment, 2016.
 15 Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, 2012.
 16 K. S. Bøgh, S. Chester, and I. Assent, “Workefficient parallel skyline computation for the gpu,” Proc. VLDB Endow., 2015.
 17 K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang, “Megakv: a case for gpus to maximize the throughput of inmemory keyvalue stores,” Proceedings of the VLDB Endowment, 2015.
 18 K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang, “Concurrent analytical query processing with gpus,” Proceedings of the VLDB Endowment, 2014.

19
“NVIDIA Maxwell Architecture ..”
https://developer.nvidia.com/maxwellcomputearchitecture. 
20
“NVIDIA Pascal Architecture ..”
http://www.geforce.com/hardware/10series/architecture.  21 E. A. Sitaridi and K. A. Ross, “Gpuaccelerated string matching for database applications,” The VLDB Journal, 2016.
 22 J. Zhou, Q. Guo, H. Jagadish, W. Luan, A. K. Tung, Y. Yang, and Y. Zheng, “Generic inverted index on the gpu,” arXiv preprint arXiv:1603.08390, 2016.

23
“NVIDIA NVLink.”
http://www.nvidia.com/object/nvlink.html.  24 Y. Yang, P. Xiang, J. Kong, and H. Zhou, “A gpgpu compiler for memory optimization and parallelism management,” in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, (New York, NY, USA), pp. 86–97, ACM, 2010.
 25 B. Zhao, Q. Luo, and C. Wu, “Parallelizing astronomical source extraction on the gpu,” in eScience (eScience), 2013 IEEE 9th International Conference on, 2013.
 26 Y. Nishioka and K. Taura, “Scalable taskparallel sgd on matrix factorization in multicore architectures,” in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPSW ’15, (Washington, DC, USA), pp. 1178–1184, IEEE Computer Society, 2015.
 27 R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Largescale matrix factorization with distributed stochastic gradient descent,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011.

28
“NVIDIA CUDA programming guide..”
http://docs.nvidia.com/cuda/cudacprogrammingguide.  29 D. Song and S. Chen, “Exploiting simd for complex numerical predicates,” in 2016 IEEE 32nd International Conference on Data Engineering Workshops (ICDEW), 2016.
 30 X. Xie, Y. Liang, G. Sun, and D. Chen, “An efficient compiler framework for cache bypassing on gpus,” in IEEE/ACM International Conference on ComputerAided Design, 2013.
 31 Y. Yang, P. Xiang, J. Kong, and H. Zhou, “A GPGPU compiler for memory optimization and parallelism management,” in 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pp. 86–97, 2010.
 32 C. del Mundo and W.c. Feng, “Enabling efficient intrawarp communication for fourier transforms in a manycore architecture,” in Supercomputing, 2013. Proceedings of the 2013 ACM/IEEE International Conference on, 2013.
 33 J. Lai and A. Seznec, “Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus,” in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization(CGO), 2013.
 34 S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.m. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded gpu using cuda,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, 2008.
 35 C.J. Hsieh and I. S. Dhillon, “Fast coordinate descent methods with variable selection for nonnegative matrix factorization,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011.
 36 J. Oh, W.S. Han, H. Yu, and X. Jiang, “Fast and robust parallel sgd matrix factorization,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015.
 37 C. Teflioudi, F. Makari, and R. Gemulla, “Distributed matrix completion,” in IEEE 12th International Conference on Data Mining, IEEE, 2012.
 38 B. Li, S. Tata, and Y. Sismanis, “Sparkler: supporting largescale matrix factorization,” in Proceedings of the 16th International Conference on Extending Database Technology, ACM, 2013.
 39 S. Schelter, V. Satuluri, and R. Zadeh, “Factorbirda parameter server approach to distributed matrix factorization,” arXiv preprint arXiv:1411.0602, 2014.
 40 H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, et al., “Exploiting bounded staleness to speed up big data analytics,” in USENIX Annual Technical Conference (USENIX ATC), 2014.
 41 X. Meng, J. Bradley, B. Yuvaz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al., “Mllib: Machine learning in apache spark,” JMLR, 2016.
 42 Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Largescale parallel collaborative filtering for the netflix prize,” in International Conference on Algorithmic Applications in Management, Springer, 2008.
 43 M. Gates, H. Anzt, J. Kurzak, and J. Dongarra, “Accelerating collaborative filtering using concepts from high performance computing,” in Big Data, 2015 IEEE International Conference on, 2015.
 44 X. Cai, Z. Xu, G. Lai, C. Wu, and X. Lin, “Gpuaccelerated restricted boltzmann machine for collaborative filtering,” in International Conference on Algorithms and Architectures for Parallel Processing, Springer, 2012.
 45 D. Zastrau and S. Edelkamp, “Stochastic gradient descent with gpgpu,” in Annual Conference on Artificial Intelligence, Springer, 2012.