CuMF˙SGD: Fast and Scalable Matrix Factorization

CuMF˙SGD: Fast and Scalable Matrix Factorization

Xiaolong Xie   Work done when the author was with IBM. Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China           xiexl_pku, Wei Tan IBM Thomas J. Watson Research Center, New York, U.S.A wtan, Liana L. Fong IBM Thomas J. Watson Research Center, New York, U.S.A wtan, Yun Liang Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China           xiexl_pku,

Matrix factorization (MF) has been widely used in recommender systems, database systems, topic modeling, word embedding and others. Stochastic gradient descent (SGD) is popular in solving MF problems because it can deal with large data sets and is easy to do incremental learning. We observed that SGD for MF is memory bound. Meanwhile, single-node CPU systems with caches perform well only for small data sets; distributed systems have higher aggregated memory bandwidth but suffer from relatively slow network connection. This observation inspires us to accelerate MF by utilizing GPUs’s high memory bandwidth and fast intra-node connection.

We present cuMF_SGD, a CUDA-based SGD solution for large-scale MF problems. On a single GPU, we design two workload scheduling schemes (batch-Hogwild! and wavefront-update) that fully exploit the massive amount of cores. Especially, batch-Hogwild!, a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. We develop highly-optimized kernels for SGD update, leveraging cache, warp-shuffle instructions, half-precision floats, etc. We also design a partition scheme to utilize multiple GPUs while addressing the well-known convergence issue when parallelizing SGD. Evaluations on three data sets with only one Maxwell or Pascal GPU show that cuMF_SGD runs 3.1X-28.2X as fast compared with state-of-art CPU solutions on 1-64 CPU nodes. Evaluations also show that cuMF_SGD scales well with multiple GPUs on large data sets. Finally, we believe that the lessons learned from building cuMF_SGD are applicable to other machine learning algorithms on, e.g., (1) embedding layers in deep learning and (2) bipartite graph.

CuMF˙SGD: Fast and Scalable Matrix Factorization




Matrix factorization (MF) has been widely used in recommender systems [?] by many companies (i.e. Amazon, Netflix, Facebook [?] and Spotify). It can also be used in topic modeling, word embedding [?], database system [?], and has a natural connection to the embedding layers in deep neural network. Let us use the recommender system as example. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows a rating matrix of , with sparse ratings from users to items. We assume that can be factorized into the multiplication of two low-rank feature matrices () and (), such that . The derived feature matrices and can be used to predict the missing ratings in , or as features of corresponding users/items in downstream machine learning tasks. Matrix factorization often involves large data sets. For example, the number of users/items may range from thousands to hundreds-of-millions, and the number of observed samples in may range from millions to tens-of-billions [?]. Therefore, there is a need to scale and speed up large-scale matrix factorization.

Figure \thefigure: An example of matrix factorization where =4, =4, =2.

There are mainly three algorithms to solve matrix factorization, i.e., coordinate gradient descent(CGD), alternate least square (ALS), and stochastic gradient descent (SGD). As previous works show that CGD is prone to reach local optima [?], we do not focus on it in this paper. ALS is easy to parallelize, and able to deal with implicit feedback such as purchase history [?]. Meanwhile, SGD usually converges faster because it is not as computation-intensive as ALS; SGD is also more applicable in incremental learning settings where new ratings are continuously fed in. With our previous work already tackled ALS [?], we focus on SGD in this paper.

The optimization work on matrix factorization contains two streams: algorithm and system. The algorithmic stream tries to optimize update schemes such as learning rate in gradient descent, in order to reduce the number of epochs (an epoch is a full pass through the training set) needed to converge [?]. The system stream tries to accelerate the computation, in order to run each epoch faster  [?, ?, ?, ?, ?, ?]. We focus the system stream and the proposed techniques can be combined with other algorithmic optimizations. Our research is based on the following observations.

Observation 1. MF with SGD is memory bound.

When solving MF with SGD, in each step of an epoch, we randomly select one observed sample, read the corresponding features and , do an inner product of them, update and , and eventually write them back (details to be given in Section CuMF˙SGD: Fast and Scalable Matrix Factorization). Obviously, the compute operations per byte is very low. For example,, if we use single-precision (4 byte float) and , one SGD update involves 2432 bytes memory access (384 bytes for + 2048 bytes for read/write) and 1536 floating point operations (256 ops for dot product and 1280 ops for feature update). That is, the flops/byte ratio is . To put this number into perspective, a modern processor can achieve 600 Gflops/s and memory bandwidth 60 GB/s. This means that the operation’s flops/byte ratio has to be as large as to saturate the computation units. The low flops/byte ratio of MF with SGD indicates that, its performance is bounded by memory bandwidth.

State-of-art SGD-based MF solutions are based on either shared-memory multi-threading [?] or distributed systems [?]. We observed that neither of them is capable of offering sustained high memory bandwidth to MF.

Observation 2. Single-node CPU systems with caching achieve high memory bandwidth only for small data sets. Distributed systems have higher theoretical bandwidth, but handicapped by the relatively weak network connection.

Shared-memory CPU systems [?, ?, ?] rely heavily on cache to achieve high memory throughput. To understand this, we evaluate a single-node and shared-memory MF library LIBMF [?] with three data sets (details shown in Section CuMF˙SGD: Fast and Scalable Matrix Factorization). As seen from Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a), on the small Netflix data set, LIBMF achieves an observed 194 GB/s bandwidth111Observed bandwidth means the aggregated bandwidth offered by DRAM and cache that is observed by the compute unit., much larger than the actual system DRAM bandwidth ( GB/s). However, on a much larger Hugewiki data set, the achieved memory bandwidth drops by 45%, to 106 GB/s. This is because that Hugewiki data set is large enough to vanish a lot of data locality. This evaluation demonstrates that single-node CPU systems cannot achieve high memory bandwidth when solving large-scale MF problems.

Distributed systems are frequently used to accelerate time-consuming applications [?, ?]. Distributed systems can aggregate the memory bandwidth and caches on multiple nodes. However, SGD is inherently sequential and consequently different nodes need to reconcile the parameters at the end of each epoch. As a result, despite of the high aggregated memory bandwidth, the performance is bounded by the limited network bandwidth between computer nodes. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) evaluates NOMAD [?], a distributed MF system. We measure its memory efficiency which is defined as the ratio of achieved memory bandwidth to the aggregated memory bandwidth of all nodes. The memory efficiency of NOMAD rapidly decreases when scale to multiple nodes, due to the slow over-network communication.

Figure \thefigure: (1) The observed memory bandwidth of LIBMF drops when solving large data sets. (b) NOMAD achieves lower memory efficiency when scaling to multiple nodes.

Observation 3. GPUs are with much higher memory bandwidth and enjoy fast inter-device connection within a single node, making GPU an ideal candidate to accelerate MF.

GPUs are widely used to accelerate applications with large data sets, e.g., machine learning and data base systems [?, ?, ?]. Observations 1-2 inspire us to resort to GPUs for the following reasons. Firstly, GPUs are with high off-chip memory bandwidth. For example, NVIDIA Maxwell GPUs has theoretical 360 GB/s off-chip memory bandwidth [?] and the newer generation Pascal GPUs provide 780 GB/s [?]. These are several times to an-order-of-magnitude higher than CPUs. Secondly, GPUs do not rely on cache to reduce latency; instead, they rely on thousands of concurrent threads running on hundreds of cores to achieve high throughput [?, ?]. Therefore, unlike CPUs, GPUs’ achieved memory bandwidth does not degrade when working data set exceeds cache capacity, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). Thirdly, multiple GPUs can enjoy very fast interconnect in a node. For example, the recent NVLink [?] can offer 160 GB/s per-GPU bandwidth to CPU and peer-GPUs. This is much faster than PCIe 3.0 (32GB/s for 16x) and InfiniBand (6.8 GB/s for 56Gb/s FDR).

Proposed solution: cuMF_SGD to accelerate MF by utilizing one or multiple GPUs’ high memory bandwidth and intra-node connection.

Parallelizing SGD on GPUs is challenging. Due to the architecture distinct, simply mapping CPUs’ algorithms to GPUs will lead to extremely low performance and suboptimal resources usage [?, ?]. Moreoever, SGD is inherently serial, studies [?] have shown that existing MF solutions do not scale well using merely 30 threads. Hence, to accelerate SGD to MF on GPUs, comprehensive understanding of GPU architectural features and novel algorithms are needed.

We present cuMF_SGD222, a fast and scalable SGD-based MF solution on GPUs. Inspired by the lock-free [?] and the block-based [?, ?] approaches, and given the separated CPU/GPU memory space, cuMF_SGD adopts a hybrid two-level execution scheme. (1) At the top level, the rating and feature matrices are partitioned into blocks and distributed to multiple GPUs. (2) At the second level, each GPU does SGD with its own partition. One GPU only synchronizes with other GPUs through the shared CPU memory, when it completes the processing of its partition. Within each GPU, the GPU-local data partition is further distributed to the hundreds of thread blocks. Each thread block processes SGD updates with highly-optimized vector operations. By this means, cuMF_SGD is able to scale to massive threads on multiple GPUs and perform highly-vectorized operations.

The contributions of this paper are as follows:

  • Workload characterization. We identify that SGD-based MF is bounded by memory bandwidth and synchronization overhead, instead of computation. We also identify the two challenges in implementing MF on GPUs using SGD, i.e., workload partitioning to avoid conflict and efficient update to exploit GPU hardware.

  • Optimization on a single GPU. We design two ways to partition and schedule the work within a single GPU, i.e., (1) matrix blocking-based algorithm with lightweight, wave-based scheduling policy, and (2) batch-Hogwild! which can be seen as a mini-batch version of the original Hogwild! algorithm. Besides the scheduling schemes, we also develop highly optimized GPU kernels for SGD update. We leverage the architectural features such as cache, warp shuffle instructions, and half-precision floats with 16 bits.

  • Deal with big data on multiple GPUs. We design a scheme to partition large data sets and solve them on multiple GPUs. We overlap data transfer and computation to minimize the execution time. We also analyze the relation between the number of partitions and the number of workers, which impacts the randomness of the SGD update, and ultimately the convergence speed.

  • Evaluation. We implement cuMF_SGD in a shared-memory system with multiple GPUs. Experimental results show that, cuMF_SGD with one GPU is 3.1X-28.2X as fast compared with state-of-art CPU solutions on 1-64 CPU nodes. CuMF_SGD is also able to scale to multiple GPUs and different generations GPUs.

The remaining of this paper is organized as follows. Section CuMF˙SGD: Fast and Scalable Matrix Factorization analyzes the targeted GPU architecture, the workload of SGD for MF, and its parallelization schemes. Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents the single-GPU cuMF_SGD, including the two scheduling schemes and the GPU kernel. Section \thefigure discusses how to solve large-scale problems by matrix partitioning. Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents and discusses experiment results. Section \thefigure discusses the related work and Section CuMF˙SGD: Fast and Scalable Matrix Factorization concludes this paper.

This section first briefly introduces the GPU architecture, and the SGD algorithm for matrix factorization. Then, we discuss the parallelism schemes for MF with SGD. We argue that, the current lock-free and matrix blocking schemes are not scalable on massive GPU cores. This leads to the innovations to be presented in Sections 3 and 4.

To overcome the limited memory bandwidth of a single CPU node and limited network bandwidth of distributed systems, we use a heterogeneous platform with GPUs, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. GPUs have high intra-device memory bandwidth and are connected via PCIe or NVLink [?] with high inter-device bandwidth. The CPUs take care of data pre-processing, data movement, and top-level workload scheduling, while the GPUs deal with feature update with massive parallelism.

Figure \thefigure: The compute architecture for cuMF_SGD.

GPUs are throughput-oriented processors [?] with thousands of cores and high bandwidth memory (200-800 GB/s). To fully exploit the performance potential, GPU applications need to be carefully designed to exploit the data and computation parallelism. In the next two sub-sections we show that is non-trivial for SGD as SGD is inherently serial.

Given users and items and a sparse rating matrix , where indicates the preference or rating of user on item. The goal of matrix factorization is to train a feature matrix and a feature matrix such that:

The training process of matrix factorization is to minimize the following cost function:

where and are regularization parameters to avoid overfitting and is the number of non-zero samples in matrix . The key idea of SGD is in every single step, randomly select one sample, e.g., from , to calculate the gradient w.r.t to the following cost function:

Then update feature vectors with learning rate :

SGD is inherently serial where each time one sample is selected to update. Given a data set with samples, an epoch (aka., iteration) involves executing updates one after other. Usually, it takes tens to hundreds of epochs to converge. To accelerate this process, it was observed that two samples can update in parallel if they are neither in same row nor same column. In this paper, we call these samples as ”independent samples”. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows an example. Sample A (row 0, column 1) and Sample B (row 1, column 0) are independent as they are associated with different rows in and different columns in . Meanwhile, Sample A and Sample C are dependent as they update the sample column 1 in . Ideally, SGD can be parallelized without losing accuracy, if we update independent samples in parallel, and update dependent samples sequentially [?].

Figure \thefigure: Samples in different rows and columns are independent, e.g., samples A and B. Otherwise they are dependent, e.g. samples A and C.

To accelerate the SGD algorithm, how to partition the workloads to parallel workers becomes one major challenge. The efficiency of the workload scheduling scheme has a profound impact to the convergence speed. The workload scheduling policies in existing work  [?, ?, ?, ?, ?] can be divided into two categories, Hogwild! and matrix-blocking.

Figure \thefigure: Two SGD parallelization schemes for MF: Hogwild! and matrix-blocking.

Hogwild!(Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a)) is a lock-free approach to parallelize SGD[?]. In this paper, we call that a conflict happens if two or more concurrent threads update samples in the same row or column at the same time. Intuitively, certain synchronization should be used to avoid conflicts. However, Hogwild! observes that such synchronization is not needed when the is very sparse and the number of concurrent threads is much less than the number of samples. The intuition is that, when the aforementioned requirements are met, the probability of conflicts is very low and the incurred accuracy loss can be ignored. Based on the low synchronization overhead, it is used by some MF solutions. However, we need to enhance Hogwild! in cuMF_SGD in two aspects: (1) Hogwild! assumes a global shared memory space, which is not feasible in our hybrid CPU/GPU setting where each GPU has its own memory space. We can only run Hogwild! in a single GPU and need another layer of partition among multiple GPUs. (2) The vanilla Hogwild! is not cache friendly because of random access. How to balance the randomness in SGD update and cache efficiency is an important issue at design time.

Matrix-blocking (Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b)) divides the rating matrix into several sub-blocks, and sub-blocks that do not share rows or columns can update in parallel. Matrix-blocking is used by many recent work [?, ?, ?, ?]. Matrix-blocking has advantage in that, it totally avoids conflict. However, in matrix-blocking parallel workers need to ask a global scheduler on which blocks to proceed next. This global scheduler has been shown not scalable to many-core architectures [?]. Hence, we need to enhance existing matrix-blocking schemes to scale to the many cores on GPUs.

This section presents how cuMF_SGD solves MF with one GPU. We pre-assume that all required data resides in GPU device memory. We discuss multi-GPU implementation in Section \thefigure. We need to tackle two issues in single GPU. Section CuMF˙SGD: Fast and Scalable Matrix Factorization discusses the issue of computation optimization, i.e., to optimize each individual SGD update by exploiting GPU hardware. Section CuMF˙SGD: Fast and Scalable Matrix Factorization discusses workload scheduling, i.e., to distribute the many SGD updates to thousands of concurrent GPU threads.

In MF, one SGD update consists of four steps: 1) read one sample () from the rating matrix, 2) read two feature vectors (, ), 3) compute prediction error , and 4) update the features. Except the first step, other three steps are all vector operations at length . is an input parameter and typically ranges from to . On a CPU, a parallel worker can be a thread or process, where vector instructions such as SSE and AVX can be used to accelerate the computation. GPUs are SIMD architectures [?], where a thread block is a vector group. Hence, in cuMF_SGD, we use a thread block as a parallel worker. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows a code snippet of the computational part of cuMF_SGD, where we use as an example. We highlight the major optimization techniques in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization and explain them in the following.

Figure \thefigure: The exemplify code of computation part of cuMF_SGD, where . The used optimization techniques are highlighted.

Cache. Since Fermi architectures, NVIDIA GPUs feature on-chip L1 cache and allows the programmers to control the cache behavior of each memory instruction (cache or bypass). While many GPU applications do not benefit from cache due to cache contention [?], some memory instructions may benefit from cache as the accessed data may be frequently reused in the future (temporal reuse) or by other threads (spatial reuse). Following the model provided by [?], we observe that the memory load of the rating matrix benefits from cache and use the intrinsic instruction __ldg [?] to enable cache-assisted read.

Memory coalescing. On GPUs, when threads within one warp access the data within one cache line, the access is coalesced to minimize the bandwidth consumption [?]. This is called memory coalescing. In cuMF_SGD, the read/write of and are carefully coalesced to ensure that consecutive threads access consecutive memory addresses.

Warp shuffle. Warp shuffle instructions [?] are used to compute the dot product and broadcast the result. Compared with traditional shared-memory-based approaches, this warp shuffle-based approach performs better because: (1) warp shuffle instructions have extra hardware support, (2) register is faster than shared memory, and (3) no thread synchronization is involved. To exploit the warp shuffle feature, we fix the thread blocks size as warp size(32).

ILP. Modern GPUs support compiler-aided super scalar to exploit the instruction-level parallelism (ILP). In cuMF_SGD, when , a thread is responsible to process independent scalars. Hence, with awareness of the low-level architecture information, we reorder the instructions to maximize the benefit of ILP.

Register usage. Register file is an important resource on GPUs [?]. As the total number of registers on GPUs are limited, while each thread uses too many registers, the register consumption may become the limitation to concurrency. In our case, we identify that the concurrency is only limited by the number of thread blocks of GPUs [?]. Hence, we allocate as many as possible registers to each thread such that every reusable variable is kept in the fastest register file.

Half-precision. As addressed before, SGD is memory bound. Most of the memory bandwidth is spent on the read/write to the feature matrices. Recently, GPU architectures support the storage of half-precision (2 bytes vs. 4 bytes of single-precision) and fast transformation between floating point and half-precision. In practice, after parameter scaling, half-precision is precise enough to store the feature matrices and do not incur accuracy loss. CuMF_SGD uses half-precision to store feature matrices, which halves the memory bandwidth need when accessing feature matrices.

The original SGD algorithm is serial, with samples in the rating matrix picked up randomly and updated in sequence. To exploit parallelism, a workload scheduling policy that assigns tasks to parallel workers becomes necessary. We start from investigating the existing CPU-based scheduling policies. Specifically, we select a representative system LIBMF [?], a shared memory SGD solution to MF. LIBMF proposes a novel workload scheduling policy which successfully solves the load imbalance problem and achieve high efficiency. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a), LIBMF divides the rating matrix to several blocks and uses a global scheduling table to manage the parallel workers. Whenever a worker is free, an idle independent block is scheduled to it. The process is repeated until convergence. However, we and others [?] observe that LIBMF faces scalability issues because of the global scheduling table it uses.

Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows a scalability study of LIBMF. LIBMF-GPU is a GPU version of LIBMF that combines the workload scheduling policy of LIBMF and our GPU computation implementation described in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. We use SGD updates per second as the performance metric:

where , , indicate number of iterations, number of non-zero samples in the input matrix , and elapsed time in seconds, respectively.

Evaluations show that the performance of LIBMF saturates around 30 concurrent workers (CPU threads), which is consistent with the previous study [?]. We perform some GPU-specific optimization techniques when implementing LIBMF-GPU, however, it still can only scale to 240 thread blocks, much lower than the hardware limit(768 thread blocks). The reason why LIBMF cannot scale to many parallel workers is that, it uses a global scheduling table to manage all parallel workers.

At each time, only one parallel worker can access the table and it is also time consuming to find a free block to assign the work to. Therefore, when the number of workers increase, the waiting time also increases. As the number of worker grows, the waiting time becomes dominating. This shows that, cuMF_SGD can not simply re-use existing scheduling policies. To overcome this scheduling overhead, we propose two GPU-specific scheduling schemes, batch-Hogwild! and Wavefront-update. Batch-Hogwild! avoids block-based scheduling and improves the cache efficiency by process samples in batch. Wavefront-update is still block-based, but only requires a local look-up instead of the expensive global look-up in LIBMF.

Figure \thefigure: (a) LIBMF uses a centralized table to manage parallel workers. (b) LIBMF scales to only 30 CPU threads and 240 GPU thread blocks.

We propose batch-Hogwild!, a variant of Hogwild! [?] with better cache efficiency. Hogwild! is efficient as its lock-free scheme incurs low scheduling overhead. It is not efficient, however, in terms of data locality [?]. In vanilla Hogwild!, each parallel worker randomly selects one sample from the rating matrix at each step. After each update, Hogwild! may not access the consecutive samples in the rating matrix and corresponding rows and columns in the feature matrices during a long time interval, leading to low cache efficiency. As discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization, we carefully align the memory access to feature matrices to achieve perfect memory coalescing and the high memory bandwidth on GPUs makes accessing feature matrices no longer a performance bottleneck. To accelerate the access to rating matrix, we exploit the spatial data locality using L1 data cache. We let each parallel worker, instead of fetch one sample randomly at a time, fetches consecutive samples and update them serially. Note that these samples are consecutive in terms of their memory storage; because we shuffle samples, they are still random in terms of their coordinates in . By doing so, the data locality is fully exploited. Consider the L1 cache line size is 128 bytes and the size of each sample is 12 bytes (one floating point and two integers), is enough to exploit the locality. We evaluate different values of and find that they yield similar benefit. Therefore we choose without loss of generality.

Figure \thefigure: Wavefront-update. Each parallel worker is assigned to a row and pre-randomize its column update sequence. E.g., when Worker 3 completes Block 3 in Wave 1, it releases Column 4 such that Worker 1 can start Block 5 in Wave 2.

As discussed, existing scheduling schemes [?, ?] impose a global synchronization, where all workers look up a global table to find both row and column coordinates to update. This is expensive and has been shown not scalable to the hundreds of workers on GPUs. To overcome this, we propose wavefront-update, a light-weight scheme that locks and look up columns only.

We explain the idea of wavefront-update using Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. We use four parallel works to process an which is partitioned into blocks. Each worker is assigned to a row in this grid, and each generates a permutation of as its column update sequence. By this means, an epoch is conducted in eight waves given this sequence. In each wave, one worker update one block, and workers do not update blocks in the same column. Assume Worker 1 has the sequence defined as and Workder 3 has sequence . With this sequence, Worker 1 updates Block 1 in wave 1 and Block 5 in wave 2. To avoid conflicts, we propose a light-weight synchronization scheme between waves using the column lock array. As shown the figure, we use an array to indicate the status of each column. Before a worker moves to next wave, it checks the status of the next column defined in its sequence. For example, after Worker 1 finishes Block 1, it needs to check the status of column 4 and does not need to care about other columns’ status. When Work 3 finishes Block 3 and releases column 4, Worker 1 is allowed to move to wave 2. There are two main benefits by doing so: (1) reduce the two-dimension look-up table in [?, ?] to an one-dimension array, (2) minimize the workload imbalance problem, as a worker can start the next block earlier compared to waiting for all other workers to finish.

Figure \thefigure: Performance Comparison of batch-Hogwild! and wavefront-update on Netflix data set. Both techniques scale much better than LIBMF.

We evaluate both techniques in terms of performance and convergence speed using the Netflix data set and the Maxwell platform333Details of the data set and platform are presented in Section CuMF˙SGD: Fast and Scalable Matrix Factorization.. We use metric to quantitative the performance. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a) shows the scalability of batch-Hogwild! and Wavefront-update with different number of parallel workers (i.e., thread blocks). When increasing the number of parallel workers, both techniques achieve near-linear scalability. When the number of parallel workers hits the hardware limit (768) of the Maxwell GPU, both techniques achieve 0.27 billion updates per second, which is 2.5 times of LIBMF. Therefore, we conclude that our proposed solutions can perfectly solve the scalability problem of the scheduling policy and fully exploit the equipped hardware resources on GPUs. We also evaluate the convergence speed of both techniques. We use the root mean square root error on the standard test data set as the indication of convergence. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows the decrease of Test RMSE in iterations. Overall, batch-Hogwild! converges a little bit faster than Wavefront-update. The reason is, batch-Hogwild! enforces more randomness in update sequence, compared with the block-based wavefront-update. Based on this observation, we use batch-Hogwild! as the default scheme on one GPU.

Figure \thefigure: (a) Multiple GPUs solution of cuMF_SGD, where the rating matrix is partitioned and each partition can fit into a GPU’s device memory. (b) Optimizing the multi-GPU solution by overlapping memory transfer with computation.

Section CuMF˙SGD: Fast and Scalable Matrix Factorization presents how to solve MF in a single GPU, assuming the rating matrix and feature matrices fully reside in GPU memory. However, the limited GPU memory capacity [?] prevents cuMF_SGD from solving large scale problems. For example, NVIDIA TITAN X GPU has 12 GB device memory that can only store 1 billion samples (one sample needs one float and two integers). Nowadays, real-world problems may have samples [?]. Techniques such as Unified Virtual Memory [?] allow GPU to use CPU memory but with high overhead. Consider these factors, to solve large-scale MF problems that can not fit into one GPU’s device memory, we need to partition the data sets and stage the partitions to GPUs in batches. Moreover, We should overlap data transfer with computation to alleviate the delay caused by slow CPU-GPU memory transfer. Please note that, the partitions can be processed by one or multiple GPUs.

Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows our proposed multiple GPUs solution. The main idea is to divide the rating matrix into multiple blocks; each block is small enough to fit into a GPU’s device memory such that independent blocks can update concurrently on different GPUs. The multiple GPU solution works as follows,

  1. Divide the rating matrix into blocks. Meanwhile, divide feature matrix into segments and feature matrix into segments accordingly.

  2. When a GPU is idle, randomly select one matrix block from those independent blocks and dispatch it to the GPU.

  3. Transfer the matrix block and corresponding feature sub-matrices and to the GPU. Then update the matrix block using the single GPU implementation discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. After the update, transfer and back to CPU.

  4. Iterate from 2 until convergence or the given number of iterations is reached.

We further explain the proposed scheme using the example shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). In Step 1, we divide into blocks and use two GPUs. In Step 2, we send block to GPU 0 and to GPU 1. Again, consider the nature of MF, updating only touches sub-matrices & while updating only touches & . Hence, GPU 0 only needs to store , , and in its device memory while GPU 1 only needs to store , , and . By doing so, the problem is divided and conquered by multiple GPUs. After deciding the block scheduling order, cuMF_SGD transfers , , to GPU 0 and , , to GPU 1, then performs the computation on two GPUs in parallel. The GPU side computation follows the rules we discussed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. After finishing the computation, the updated , , , and are transferred back to CPU memory. Note that we don’t have to transfer or back to CPU memory as they are read-only.

Scalability problem. We mentioned LIBMF faces scalability issue, as the scheduling overhead increases quickly with the number of workers [?]. Our multiple-GPU scheduling scheme has similar complexity with that of LIBMF. However, it does not face the same scalability issue as we only need to schedule to a few GPUs instead of hundreds of workers.

GPUs’ memory bandwidth are much larger than the CPU-GPU memory transfer bandwidth. For example, NVIDIA TITAN X GPUs provide 360 GB/s device memory bandwidth while the CPU-GPU memory bandwidth is only 16 GB/s (PCIe v3 16x). In single GPU implementation, CPU-GPU memory transfer only happens at the start and end of MF, and therefore not dominant. However, when the data set can not fit into the GPU memory, memory transfer happens frequently and has higher impact on the overall performance.

Given the memory transfer overhead, we overlap the memory transfers and computation when solving large problems, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b). Due to space limitation , we only plot one GPU. The key idea is, at the block scheduling time, instead of randomly selecting one independent block for the GPU, the optimized technique randomly selects multiple blocks at a time. Those blocks are pipelined to overlap the memory transfer and computation: we schedule two blocks to GPU 0, and overlap the memory transfer of the second block () with the computation of the first block (). Note that the two blocks scheduled to one GPU do not need to be independent as they are updated in serial; meanwhile, blocks scheduled to different GPUs have to be independent with each other to avoid conflicts. By doing so, we can reduce the overhead of slow CPU-GPU memory transfer and improve the overall performance.

Discussion. Allocating more blocks to one GPU would yield more performance benefit as more memory/computation overlapping can be achieved. However, the number of available blocks is limited by how do we divide the rating matrix . Consider we divide to and we have two GPUs running in parallel, the number of blocks per GPU cannot be more than . In practice, is determined by the size of the rating matrix and the available hardware resources on the GPU. We will discuss it in Section CuMF˙SGD: Fast and Scalable Matrix Factorization.

Multiple GPUs management. We implement it using multiple CPU threads within one process. Within the process, there is one host thread and multiple worker threads, where each GPU is bound to one worker thread. The host thread manages the workload scheduling and informs worker threads of the scheduling decision. Each worker thread then starts data transfer and launches compute kernels on a GPU.

Overlapping. Each worker thread is responsible to overlap the computation and CPU-GPU memory transfers. We use CUDA streams to achieve this. A stream contains a list of GPU commands that are executed in serial, and commands in different streams are executed in parallel if hardware resources permit. Each worker thread uses three streams that manage CPU-GPU memory transfer, GPU-CPU memory transfer, and GPU kernel launch, respectively.

We implement cuMF_SGD using CUDA C (source code at, evaluate its performance on public data sets, and demonstrate its advantage in terms of performance and cost. Section CuMF˙SGD: Fast and Scalable Matrix Factorization introduces the experimental environment. The following experiments are designed to answer these questions:

  • Compared with state-of-the-art SGD-based approaches on CPUs [?, ?], is cuMF_SGD better and why? (Section \thetable)

  • What is the implication of using different generations of GPUs? (Section \thefigure)

  • Compared with the ALS-based GPU library cuMF_ALS that we published earlier [?], what is the advantage of cuMF_SGD? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)

  • Parallelizing SGD is always tricky and may lead to converge problems. What are the factors impacting parallelizing SGD? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)

  • When scale up to multiple GPUs, is cuMF_SGD still efficient? (Section CuMF˙SGD: Fast and Scalable Matrix Factorization)

Platform. We evaluate cuMF_SGD on heterogeneous platforms with both CPU and GPUs. Table CuMF˙SGD: Fast and Scalable Matrix Factorization shows the configuration of the two servers used in experiments.

Maxwell Platform
CPU 12-core Intel Xeon CPU E5-2670*2 (up to 48 threads), 512 GB memory
GPU TITAN X GPU*4, per GPU: 24 SMs, up to 768 thread blocks, 12 GB device memory.
Pascal Platform
CPU 2*10 PowerNV 8 processors with SMT 8 and NVLink.
GPU Tesla P100 GPU*4, per GPU: 56 SMs, up to 1792 thread blocks, 16 GB device memory.
Table \thetable: Configuration of the Maxwell [?] and Pascal [?] Platform.

Data sets. We use three public data sets: Netflix, Yahoo!Music, and Hugewiki. Details of them are shown in Table CuMF˙SGD: Fast and Scalable Matrix Factorization. Netflix and Yahoo!Music come with a test set but Hugewiki does not. We randomly sample and extract out 1% of the data set for testing purpose.

Dataset Netflix Yahoo!Music Hugewiki
480,190 1,000,990 50,082,604
17,771 624,961 39,781
128 128 128
99,072,112 252,800,275 3,069,817,980
1,408,395 4,003,960 31,327,899
Table \thetable: Details of data sets used.

Parameter. As mentioned in the introduction, this paper focus on system-level but not algorithmic-level optimization. Therefore, we did not spend much effort in parameter turning. Instead, we use the parameters adopted by earlier work [?, ?, ?, ?]. For learning rate, we adopt the learning rate scheduling techniques used by Yun et al. [?], where the learning rate at epoch is monotonically reduced in the following routine:

is the given initial learning rate and is another given parameter. The parameters used by cuMF_SGD are listed in Table CuMF˙SGD: Fast and Scalable Matrix Factorization.

Netflix 0.05 0.08 0.3
Yahoo!Music 1.0 0.08 0.2
Hugewiki 0.03 0.08 0.3
Table \thetable: Used parameters for each dataset.

We compare cuMF_SGD with the following state-of-the-art approaches.

  • LIBMF [?]. LIBMF is a representative blocking-based solution on shared-memory systems. Its main design purpose is to balance the workload across CPU threads and accelerate the memory access. It leverages SSE instructions and a novel learning rate schedule to speed up the convergence [?].

    We exhaustively evaluate all possible parallel parameters on the Maxwell platform and select the optimal one. For example, we use 40 CPU threads and divide the input rating matrix into blocks; we set its initial learning rate as .

  • NOMAD [?]. NOMAD is a representative distributed matrix factorization solution. It uses a 64-node HPC cluster to solve MF. It proposes a decentralized scheduling policy to reduce the synchronization overhead and discusses how to reduce the inter-node communication. We present the best results presented in the original paper, i.e., using 32 nodes for Netflix and Yahoo!Music data sets and using all 64 nodes for Hugewiki data set on the HPC cluster.

    Figure \thefigure: Test RMSE over training time on three data sets. CuMF_SGD converges faster than all other approaches with only one GPU card.
  • CuMF_SGD. We evaluate cuMF_SGD on both Maxwell and Pascal platforms, with all three data sets. We name the results on Maxwell as cuMF_SGD-M and those on Pascal as cuMF_SGD-P. We use one GPU in this subsection. The number of parallel workers (thread blocks) is set as the maximum of the corresponding GPU architecture (768 on Maxwell platform and 1792 on Pascal platform). As Hugewiki can not fit into one GPU’s memory, we divide it into blocks and at each scheduling, we schedule 8 blocks to overlap memory transfer and computation.

Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows the test RMSE w.r.t. the training time. Table CuMF˙SGD: Fast and Scalable Matrix Factorization summarizes the training time required to converge to a reasonable RMSE (0.92, 22.0, and 0.52 for Netflix, Yahoo!Music, and Hugewki, respectively). Results show that with only one GPU, cuMF_SGD-P and cuMF_SGD-M perform much better (3.1X to 28.2X) on all data sets compared to all competitors, including NOMAD on a 64-node HPC cluster. In the following we analyze the reasons.

Data set Netflix Yahoo!Music Hugewiki
LIBMF 23.0s 37.9s 3020.7s
NOMAD 9.6s(2.4X) 108.7s(0.35X) 459.1s(6.6X)
CuMF_SGD-M 7.5s(3.1X) 8.8s(4.3X) 442.3s(6.8X)
CuMF_SGD-P 3.3s(7.0X) 3.8s(10.0X) 107.0s(28.2X)
Table \thetable: Training time speedup normalized by LIBMF.
Figure \thefigure: Updates per second and achieved memory bandwidth of LIBMF, cuMF_SGD-M, and cuMF_SGD-P. The achieved memory bandwidth explains the advantage of cuMF_SGD.

Comparison with LIBMF. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization and Table CuMF˙SGD: Fast and Scalable Matrix Factorization, cuMF_SGD outperforms LIBMF on all data sets, on both Maxwell and Pascal. More precisely, cuMF_SGD-M is 3.1X - 6.8X as fast as LIBMF and cuMF_SGD-P is 7.0X - 28.2X as fast. CuMF_SGD outperforms LIBMF because it can do more updates per second, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). We already mentioned that that matrix factorization is memory bound; LIBMF is also aware of that and strives to keep all frequently used data in the CPU cache. However, the limited cache capacity on a single CPU makes LIMBF suboptimal in large data sets. As shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b), LIBMF achieves an effective memory bandwidth of 194 GB/s444The achieved memory bandwidth measures the data processed by the compute units per second, and can be higher than the theoretical off-chip memory bandwidth thanks to the cache effect. on the Netflix data set (with 99M samples) – close to cuMF_SGD-M. However its achieved bandwidth drops almost by half, to 106 GB/s on the larger Hugewiki data set (with 3.1B samples) – while cuMF_SGD achieves similar bandwidth in all data sets.

On the scheduling policy of LIBMF. Simply porting LIBMF to GPUs leads to resource under-utilization due to the scalability of it scheduling policy (recall Figure CuMF˙SGD: Fast and Scalable Matrix Factorization). In contrast, the workload scheduling policy and memory/computation pattern of cuMF_SGD are specifically designed to fully exploit the computation and memory resources on GPUs. Hence, as shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization (b), cuMF_SGD achieves much higher bandwidth than LIBMF. Moreover, cuMF_SGD uses half-precision (2 bytes for a float number) to store feature matrices. As a result, it can perform twice updates as LIBMF with the same bandwidth consumption.

Compared with NOMAD. As presented in [?], NOMAD uses 32 nodes for Netflix and Yahoo!Music and 64 HPC nodes for Hugewiki. Despite of the tremendous hardware resources, NOMAD is still outperformed by cuMF_SGD on all data sets. As observed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization, MF is a memory bounded and data communication happens frequently between parallel workers. When NOMAD distributes parallel workers to different nodes, the network bandwidth which is much less than intra-node communication, becomes the bottleneck. Consequently, NOMAD achieves suboptimal scalability when scale from single node to multiple nodes, especially for small data sets. For example, on Yahoo!Music, NOMAD performs even worse than LIBMF that uses only one machine.

NOMAD (on a 64-node HPC cluster) has similar performance with cuMF_SGD-M on Hugewiki, while it is much slower than cuMF_SGD-P. Obviously, cuMF_SGD is not only faster, using a single CPU card, it is also more cost-efficient.

Figure \thefigure: CuMF_SGD vs. cuMF_ALS. With one GPU, CuMF_SGD converges faster than cuMF_ALS-1 (one GPU) and similar to cuMF_ALS-4 (four GPUs).
Figure \thefigure: Updates-per-second and achieved memory bandwidth cuMF_SGD on Maxwell and Pascal, using the Netflix data set. CuMF_SGD performs better on the more recent Pascal platform.

We have evaluated cuMF_SGD on the two current generations of GPUs: Maxwell and Pascal. We believe that cuMF_SGD is able to scale to future GPU architectures with minor tuning effort. In this section, we explain the performance gap between Maxwell and Pascal in three aspects: computation resources, off-chip memory bandwidth, and CPU-GPU memory bandwidth.

Computation resources. We show the SGD updates-per-second metric of two platforms with different number of parallel workers using Netflix in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(a). Results show that the Pascal platform scales to more parallel workers and achieves much higher than Maxwell. This is because the Maxwell platform has 24 streaming multiprocessors (SMs) within each GPU, with each SM allowing up to 32 parallel workers (thread blocks). Hence, one Maxwell GPU allows up to 768 parallel workers. Meanwhile, the Pascal GPU used has 56 SMs and allows 32 thread blocks on each SM. Hence, a Pascal GPU allows up to 1792 parallel workers, which is 2.3 times of that of Maxwell GPU. Overall, a Pascal GPU is more powerful than a Maxwell GPU in term of the amount of computation resources.

Off-chip memory bandwidth. As we discussed before, SGD is memory bound. Optimized for throughput, GPUs are able to overlap memory access and computation by fast context switch among parallel workers [?]. When there are enough parallel workers running on GPUs, long memory latencies can be hidden, which is exactly what happens with cuMF_SGD. In this scenario, memory bandwidth, instead of memory latency, becomes the limitation of the performance. Pascal platforms provides twice as much theoretical peak off-chip memory bandwidth (780 GB/s) as Maxwell platforms(360 GB/s). Figure CuMF˙SGD: Fast and Scalable Matrix Factorization(b) shows the achieved memory bandwidth on two platforms with different number of parallel workers. On Maxwell and Pascal, cuMF_SGD achieves up to 266 GB/s and 567 GB/s, respectively.

CPU-GPU memory bandwidth. Netflix and Yahoo!Music data sets are small enough to fit into the GPU device memory. For Hugewiki, memory transfer occurs multiple times as the data cannot fit into GPU device memory. In Section CuMF˙SGD: Fast and Scalable Matrix Factorization, we propose to overlap data transfer with computation. Despite of this optimization, the CPU-GPU memory bandwidth still has noticeable impact on the overall performance as the perfect overlapping cannot be achieved. On the Maxwell platform, the memory transfer between CPU and GPU is via PCIe v3 16x with 16 GB/s bandwidth (we observe that on average, the achieved bandwidth is 5.5 GB/s). The very recent Pascal platform is with NVLink [?] that can provide 40 GB/s in theory (we observe an average 29.1 GB/s CPU-GPU memory transfer bandwidth, which is 5.3X as that on Maxwell). This also explains why cuMF_SGD achieves much more speedup on Hugewiki using Pascal platform (28.2X) than that on Maxwell platform (6.8X).

Our earlier work cuMF_ALS [?] represents the state-of-art ALS-based matrix factorization solution on GPUs. We use one GPU for cuMF_SGD, and one and four GPUs for cuMF_ALS. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization compares their performance on three data sets on Maxwell. We observe that cuMF_SGD is faster than cuMF_ALS-1 and achieves similar performance with cuMF_ALS-4 with only one GPU.

It expected that cuMF_SGD is faster than cuMF_ALS, with the following reason. Each epoch of SGD needs memory access of and computation of . Each epoch of ALS needs memory access of and computation of . ALS’s epochs run slower due to its much more intensive computation. Although ALS needs fewer epochs to coverage, as a whole it converges slower. Despite the fact that cuMF_ALS is slower than cuMF_SGD, we still maintain both solutions at because they serve different purposes: SGD converges fast and easy to do incremental update, while ALS is easy to parallelize and is able to deal with non-sparse rating matrices [?].

The original SGD algorithm is serial. To speed it up, we discuss how to parallelize it on one GPU in Section CuMF˙SGD: Fast and Scalable Matrix Factorization and on multiple GPUs in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. It is well-known that SGD parallelization may have subtle implications on convergence [?, ?]. In the context of matrix factorization, the implication varies on the two schemes proposed in Section CuMF˙SGD: Fast and Scalable Matrix Factorization: Hogwild! and matrix-blocking.

For one GPU, Section \thefigure proposes the batch-Hogwild! scheme to partition work . As a vectorized version of Hogwild!, batch-Hogwild! inherits the limitation of Hogwild!. Given a rating matrix of and parallel workers, convergence is ensured only when the following condition satisfied [?]:

For multiple GPUs, Section \thefigure proposes to first divide the rating matrix into blocks and process one block on one GPU in parallel if possible. In this case, the above condition needs to change to:

Our empirical study on Hugewiki data set shows that, needs to be to converge. We validated that, Hugewiki data set has and we choose ; convergence is achieved when , and fails when . We believe this is a limitation for all Hogwild!-style solutions.

The purpose of matrix-blocking is to avoid conflicts between parallel workers. However, we observe that matrix-blocking can have negative impact on convergence. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization illustrates the the convergence speed of LIBMF on Netflix data set with different parameters. In this study, we fix the number of parallel workers ; without loss of generality, we divide into blocks and vary the value of . Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows that when is less than or close to , convergence speed is much slower or even cannot be achieved. We have similar observations on other data sets and using cuMF_SGD. We briefly explain the reason with a simple example shown in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization.

Figure \thefigure: Convergence speed of LIBMF on Netflix. We fix #parallel-workers and vary value to partition to blocks.

In Figure CuMF˙SGD: Fast and Scalable Matrix Factorization, we divide the rating matrix into blocks and use 2 parallel workers. In theory, 4 blocks can have possible update orders. We also show all update orders in Figure CuMF˙SGD: Fast and Scalable Matrix Factorization. However, only orders 18 out of the total 24 are feasible so as to avoid update conflicts. For example, when Block 1 is issued to one worker, only Block 4 can be issued to another worker. Hence, Blocks 2 and 3 cannot be updated between 1 and 4, which precludes order 912. This demonstrated that when , all independent blocks have to be updated concurrently to make all workers busy, which enforces certain update order constraints and hurts the randomness. As a consequence, convergence speed can deteriorate. In practice, when cuMF_SGD uses two GPUs, should at least be divided into blocks.

Figure \thefigure: A simple example to demonstrate the limitation of matrix blocking. The rating matrix is divided into blocks and updated using two parallel workers.

System wise, cuMF_SGD is designed to scale to multiple GPUs. However, algorithmic wise, the scaling is restricted by factors such as problem dimension and number of parallel workers, as discussed earlier in Section CuMF˙SGD: Fast and Scalable Matrix Factorization. Among the three data sets used in this paper, Netflix and Hugewiki have very small (, , receptively), preventing cuMF_SGD from solving them on multiple GPUs. In comparison, Yahoo!Music can be solved on multiple GPUs as the dimension of it is . We divide its into blocks and run it with two Pascal GPUs. Figure CuMF˙SGD: Fast and Scalable Matrix Factorization shows the convergence speed. With 2 Pascal GPUs, cuMF_SGD takes 2.5s to converge to RMSE 22, which is 1.5X as fast as 1 Pascal GPU (3.8s). The reason behind this sub-linear scalability is that, the multi-GPU cuMF_SGD needs to spend time on CPU-GPU memory transfer so as to synchronize two GPUs.

Figure \thefigure: Convergence of cuMF_SGD on Yahoo!Music: two Pascal GPUs is 1.5X as fast as one.

Algorithms. SGD has been widely used to solve matrix factorization [?]. Serial SGD can be parallelized to achieve better performance. ALS is naturally easy to parallelize and it can also been used in dense matrix factorization. Coordinate descent is another algorithm to solve matrix factorization [?, ?]. It updates the feature matrix along one coordinate direction in each step. Our earlier work [?] focuses on ALS algorithm.

Parallel SGD solutions have been discussed in multi-core [?, ?, ?], multi-node [?, ?], MapReduce [?, ?] and parameter-servers [?, ?] settings. Existing works are mostly inspired by Hogwild! [?] that allows lock-free update, or matrix-blocking that partitions to avoid conflicts, or a combination of them. LIBMF [?, ?] is a representative shared-memory multi-core system. Evaluations have shown that it outperforms all previous approaches with one machine. Although it has been optimized for cache efficiency, it is still not efficient at processing large scale data sets. Moreover, the high complexity of its scheduling policy makes it infeasible to scale to many cores. NOMAD [?] partitions the data on HPC clusters to improve the cache performance. At the meantime, they propose to minimize the communication overhead. Compared with LIBMF, it has similar performance on one machine and is able to scale to 64 nodes.

Parallelization is also used in coordinate descent  [?]. Compared with SGD, coordinate descent has lower overhead and runs faster at the first few epochs of training. However, due to the algorithmic limitation, coordinate descent is prone to reach local optima [?] in the later epochs of training.

Compared with CGD and SGD, ALS is inherently easy to parallel, ALS based parallel solutions are widely discussed [?, ?, ?, ?, ?]. Our earlier work, cuMF_ALS [?] focuses on optimizing ALS to matrix factorization on GPUs. As ALS algorithm is more compute intensive, it runs slower than cuMF_SGD.

GPU solutions. Prior to our work,  [?] applies Restricted Boltzmann Machines on GPUs to solve MF.  [?] implements both SGD and ALS on GPU to solve MF. In contrast, cuMF_SGD outperforms them because we optimize both memory access and workload scheduling.

Matrix factorization is widely used in recommender systems and other applications. SGD-based MF is limited by memory bandwidth which single and multi-GPU systems cannot efficiently provision. We propose a GPU-based solution, by observing that GPUs offers abundant memory bandwidth and can enjoy fast intra-node connection. We design workload partition and schedule schemes to dispatch tasks insides a GPU and across GPUs, without impacting the randomness required by SGD. We also develop highly-optimized GPU kernels for individual SGD updates. With only one Maxwell or Pascal GPU, cuMF_SGD runs 3.1X-28.2X as fast compared with state-of-art CPU solutions on 1-64 CPU nodes. Evaluations also show that cuMF_SGD scales well on multiple GPUs in large data sets.

  • 1 Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer.
  • 2 “Recommending items to more than a billion people..”
  • 3 J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.,” in EMNLP, 2014.
  • 4 M. Sarwat, Database Management System Support for Collaborative Filtering Recommender Systems. PhD thesis, UNIVERSITY OF MINNESOTA, 2014.
  • 5 W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, “A fast parallel stochastic gradient method for matrix factorization in shared memory systems,” ACM Transactions on Intelligent Systems and Technology (TIST), 2015.
  • 6 W. Tan, L. Cao, and L. Fong, “Faster and cheaper: Parallelizing large-scale matrix factorization on gpus,” in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’16, 2016.
  • 7 W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, “A learning-rate schedule for stochastic gradient methods to matrix factorization,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2015.
  • 8 C. Teflioudi, F. Makari, and R. Gemulla, “Distributed matrix completion,” in 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012.
  • 9 H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012.
  • 10 H. Yun, H.-F. Yu, C.-J. Hsieh, S. V. N. Vishwanathan, and I. Dhillon, “Nomad: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion,” Proc. VLDB Endow., 2014.
  • 11 B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 693–701, 2011.
  • 12 Z. Liu, Y.-X. Wang, and A. Smola, “Fast differentially private matrix factorization,” in Proceedings of the 9th ACM Conference on Recommender Systems, RecSys’15, 2015.
  • 13 S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main memory hash join algorithms for multi-core cpus,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, 2011.
  • 14 F. Yang, J. Li, and J. Cheng, “Husky: Towards a more efficient and expressive distributed computing framework,” Proceedings of the VLDB Endowment, 2016.
  • 15 Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, 2012.
  • 16 K. S. Bøgh, S. Chester, and I. Assent, “Work-efficient parallel skyline computation for the gpu,” Proc. VLDB Endow., 2015.
  • 17 K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang, “Mega-kv: a case for gpus to maximize the throughput of in-memory key-value stores,” Proceedings of the VLDB Endowment, 2015.
  • 18 K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang, “Concurrent analytical query processing with gpus,” Proceedings of the VLDB Endowment, 2014.
  • 19 “NVIDIA Maxwell Architecture ..”
  • 20 “NVIDIA Pascal Architecture ..”
  • 21 E. A. Sitaridi and K. A. Ross, “Gpu-accelerated string matching for database applications,” The VLDB Journal, 2016.
  • 22 J. Zhou, Q. Guo, H. Jagadish, W. Luan, A. K. Tung, Y. Yang, and Y. Zheng, “Generic inverted index on the gpu,” arXiv preprint arXiv:1603.08390, 2016.
  • 23 “NVIDIA NVLink.”
  • 24 Y. Yang, P. Xiang, J. Kong, and H. Zhou, “A gpgpu compiler for memory optimization and parallelism management,” in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, (New York, NY, USA), pp. 86–97, ACM, 2010.
  • 25 B. Zhao, Q. Luo, and C. Wu, “Parallelizing astronomical source extraction on the gpu,” in eScience (eScience), 2013 IEEE 9th International Conference on, 2013.
  • 26 Y. Nishioka and K. Taura, “Scalable task-parallel sgd on matrix factorization in multicore architectures,” in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, IPDPSW ’15, (Washington, DC, USA), pp. 1178–1184, IEEE Computer Society, 2015.
  • 27 R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011.
  • 28 “NVIDIA CUDA programming guide..”
  • 29 D. Song and S. Chen, “Exploiting simd for complex numerical predicates,” in 2016 IEEE 32nd International Conference on Data Engineering Workshops (ICDEW), 2016.
  • 30 X. Xie, Y. Liang, G. Sun, and D. Chen, “An efficient compiler framework for cache bypassing on gpus,” in IEEE/ACM International Conference on Computer-Aided Design, 2013.
  • 31 Y. Yang, P. Xiang, J. Kong, and H. Zhou, “A GPGPU compiler for memory optimization and parallelism management,” in 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pp. 86–97, 2010.
  • 32 C. del Mundo and W.-c. Feng, “Enabling efficient intra-warp communication for fourier transforms in a many-core architecture,” in Supercomputing, 2013. Proceedings of the 2013 ACM/IEEE International Conference on, 2013.
  • 33 J. Lai and A. Seznec, “Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus,” in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization(CGO), 2013.
  • 34 S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded gpu using cuda,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, 2008.
  • 35 C.-J. Hsieh and I. S. Dhillon, “Fast coordinate descent methods with variable selection for non-negative matrix factorization,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011.
  • 36 J. Oh, W.-S. Han, H. Yu, and X. Jiang, “Fast and robust parallel sgd matrix factorization,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015.
  • 37 C. Teflioudi, F. Makari, and R. Gemulla, “Distributed matrix completion,” in IEEE 12th International Conference on Data Mining, IEEE, 2012.
  • 38 B. Li, S. Tata, and Y. Sismanis, “Sparkler: supporting large-scale matrix factorization,” in Proceedings of the 16th International Conference on Extending Database Technology, ACM, 2013.
  • 39 S. Schelter, V. Satuluri, and R. Zadeh, “Factorbird-a parameter server approach to distributed matrix factorization,” arXiv preprint arXiv:1411.0602, 2014.
  • 40 H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, et al., “Exploiting bounded staleness to speed up big data analytics,” in USENIX Annual Technical Conference (USENIX ATC), 2014.
  • 41 X. Meng, J. Bradley, B. Yuvaz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al., “Mllib: Machine learning in apache spark,” JMLR, 2016.
  • 42 Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale parallel collaborative filtering for the netflix prize,” in International Conference on Algorithmic Applications in Management, Springer, 2008.
  • 43 M. Gates, H. Anzt, J. Kurzak, and J. Dongarra, “Accelerating collaborative filtering using concepts from high performance computing,” in Big Data, 2015 IEEE International Conference on, 2015.
  • 44 X. Cai, Z. Xu, G. Lai, C. Wu, and X. Lin, “Gpu-accelerated restricted boltzmann machine for collaborative filtering,” in International Conference on Algorithms and Architectures for Parallel Processing, Springer, 2012.
  • 45 D. Zastrau and S. Edelkamp, “Stochastic gradient descent with gpgpu,” in Annual Conference on Artificial Intelligence, Springer, 2012.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description