Snap Machine Learning
We describe an efficient, scalable machine learning library that enables very fast training of generalized linear models. We demonstrate that our library can remove the training time as a bottleneck for machine learning workloads, opening the door to a range of new applications. For instance, it allows more agile development, faster and more fine-grained exploration of the hyper-parameter space, enables scaling to massive datasets and makes frequent re-training of models possible in order to adapt to events as they occur. Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance.
The widespread adoption of machine learning and artificial intelligence has been, in part, driven by the ever-increasing availability of data. Large datasets enable training of more expressive models, thus leading to higher quality insights. However, when the size of such datasets grows to billions of training examples and/or features, the training of even relatively simple models becomes prohibitively time consuming. This long turn-around time (from data preparation to inference) can be a severe hindrance to the research, development and deployment of large-scale machine learning models in critical applications.
However, it is not only large data applications in which training time can become a bottleneck. For example, real-time or close-to-real-time applications, in which models must react rapidly to changing events, are another important scenario where fast training times are vital. For instance, consider security applications in critical infrastructure when a new, previously unseen, phenomenon is currently evolving. In such situations, it may be beneficial to train, or incrementally re-train, the existing models with new data. One’s ability to respond to such events necessarily depends on the training time, which can become critical even when the data itself is relatively small.
A third area where fast training is highly desirable is the field of ensemble learning. It is well known that most data science competitions today are won by large ensembles of models [toscher2009bigchaos]. In order to design a winning ensemble, a data scientist typically spends a significant amount of time trying out different combinations of models and tuning the large number of hyper-parameters that arise. In such a scenario, the ability to train models orders of magnitude faster naturally results in a more agile development process. A library that provides such acceleration can give its user a valuable edge in the field of competitive data science or in any application where best-in-class accuracy is desired. One such application is click-through rate prediction in online advertising, where it has been estimated that even 0.1% better accuracy can lead to increased earnings of the order of hundreds of millions of dollars [model-ensemble-click-prediction-bing-search-ads].
A growing number of small and medium enterprises rely on machine learning as part of their everyday business. Such companies often lack the on-premises infrastructure required to perform the compute-intensive workloads that are characteristic of the field. As a result, they may turn to cloud providers in order to gain access to such resources. Since cloud resources are typically billed by the hour, the time required to train machine learning models is directly related to outgoing costs. For such an enterprise cloud user, the ability to train faster can have an immediate effect on their profit margin.
The above examples illustrate the demand for fast, scalable, and resource-savvy machine learning libraries. Today there is an abundance of general-purpose environments, offering a broad class of functions for machine learning model training, inference, and data manipulation. Some of the most prominent and broadly-used ones are listed below, along with certain advantages and limitations.
scikit-learn [scikit-learn] is an open-source module for machine learning in Python. It is widely used due to its user-friendly interface, comprehensive documentation and the wide range of functionality that it offers. While scikit-learn does not natively provide any GPU support, it can call lower-level native C++ libraries such as LIBLINEAR to achieve high performance. A key limitation of scikit-learn is that it does not scale to datasets that do not fit into the memory of a single machine.
Apache MLlib [spark-ml] is Apache Spark's scalable machine learning library. It provides distributed training of a variety of machine learning models and offers easy-to-use APIs in Java, Scala and Python. It does not natively support GPU acceleration, and while it can leverage underlying native libraries such as BLAS for certain operations, it tends to exhibit slower performance relative to the same distributed algorithms implemented natively in C++ using high performance computing frameworks such as MPI [CoCoAccel2017].
TensorFlow [tensorflow2015] is an open source software library for numerical computation using data flow graphs. While TensorFlow can be used to implement algorithms at a lower level as a series of mathematical operations, it also provides a number of high-level APIs that can be used to train generalized linear models without needing to implement them oneself. It transparently supports GPU acceleration and multi-threading, and can scale across multiple nodes. When it comes to training of large-scale linear models, a downside of TensorFlow is the relatively limited support for sparse data structures, which are frequently important in such applications.
In this work we describe a new library that exploits the hierarchical memory and compute structure of modern systems. We focus on the training of generalized linear models. We combine recent advances from algorithm and system design to optimally leverage all hardware resources available in modern computing environments. The three main features that distinguish our system are distributed training, GPU acceleration and full support of sparse data structures:
Distributed Training. We build our system as a data-parallel framework. This enables us to scale out and train on massive datasets that exceed the memory capacity of a single machine, which is crucial for large-scale applications. To guarantee good performance in a distributed setting we build on a state-of-the-art framework that achieves communication-efficient training, respects data locality and provides adaptivity to diverse system characteristics.
GPU Acceleration. To speed up training algorithms we support GPU accelerators in a distributed environment. Such specialized hardware devices are widely available in today’s cloud offerings and have attracted a lot of attention especially in the context of training deep neural networks. We build on specialized solvers designed to leverage the massively parallel architecture of GPUs. By offloading the entire solver to the GPUs we avoid large data transfer overheads and respect data locality in GPU memory. To make this approach scalable we take advantage of recent developments in heterogeneous learning in order to enable GPU acceleration even if only a small fraction of the data can indeed be stored in the accelerator memory.
Sparse Data Structures. Many machine learning datasets are sparse. Hence the support of sparse data structures is an essential pillar of our library. In this context, we detail some new optimizations for the algorithms used in our system when applied to sparse data structures.
2 System Description
We start with a high-level, conceptual description of the structure of Snap ML and detail individual components in the subsequent sections 3 and 4. Our system implements several hierarchical levels of parallelism in order to partition the workload among different nodes in a cluster, take full advantage of accelerator units and exploit multi-core parallelism on the individual compute units.
2.1 1st Level Parallelism
The first level of parallelism spans across individual worker nodes in a cluster. The data is distributed across multiple worker nodes that are connected over a network interface as shown in Figure 1. This data-parallel approach enables training on large-scale datasets that exceed the memory capacity of a single device.
2.2 2nd Level Parallelism
On the individual worker nodes we can leverage one or more accelerator units, such as GPUs, by systematically splitting the workload between the host and the accelerator units. The different workloads are then executed in parallel, enabling full utilization of all hardware resources on each worker and hence achieving the second level of parallelism across heterogeneous compute units illustrated in Figure 2.
2.3 3rd Level Parallelism
In order to efficiently execute the workloads assigned to the individual compute units we leverage the parallelism offered by the respective compute architecture. We use specially designed solvers to take full advantage of the massively parallel architecture of modern GPUs and implement multi-threaded code for processing the workload on the CPU. This results in the additional, third level of parallelism across cores as illustrated in Figure 3.
3 Algorithmic Core
In the following we describe the algorithmic backend of our system in more detail. The core innovation relates to how we nest multiple state-of-the-art algorithmic building blocks to reflect the hierarchical structure of a distributed architecture.
3.1 Distributed Framework
The first level of parallelism serves mainly to increase the overall memory capacity of our system in order to enable scaling to massive datasets. In such a data-parallel setting the data is partitioned across worker nodes. Hence, every worker node has access only to its local partition of the training data, and the main challenge of designing an efficient distributed algorithm relates to the high cost of exchanging information over the network between the nodes. Of particular interest are methods that allow flexibility to adjust to various system characteristics such as communication latency and computation efficiency as studied in [CoCoAccel2017]. This is typically achieved by introducing a tunable hyper-parameter to steer the amount of work that is performed between two consecutive rounds of communication.
One distributed algorithmic framework that offers such an adjustable hyper-parameter is CoCoA [cocoa]. We refer to it as a framework rather than an algorithm because it defines how to split the workload between the different nodes and which information to exchange between nodes in order to synchronize their work, but it does not dictate which solver should be used on the individual worker nodes to solve their local optimization problem. This allows us to take advantage of the available compute and memory resources when choosing a solver. Hence, CoCoA is well suited to reflect the first level of parallelism of our system, where the implementation of the local solver is handed over to a higher hierarchical level of the system. A known limitation of the CoCoA framework is its poor performance for non-quadratic loss functions such as logistic regression; this can be attributed to the quadratic approximation of the loss function used to construct the local subproblems. To improve the performance for these particular applications we modify the local subproblems in CoCoA to incorporate local second-order information. For implementation details of CoCoA we refer to Section 4.
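To make the division of labour in such a framework concrete, the following is a minimal single-process sketch of a CoCoA-style outer loop, not the Snap ML implementation. It assumes ridge regression as the objective, a partition of the coordinates (columns of `A`) across workers, additive aggregation with the subproblem parameter set to the number of workers, and plain randomized coordinate descent as the local solver. The names `local_scd`, `cocoa_ridge` and `local_steps` are hypothetical; `local_steps` plays the role of the tunable hyper-parameter that controls how much work is done between two communication rounds.

```python
import numpy as np


def local_scd(A_k, alpha_k, w, sigma, lam, n_steps, rng):
    """Approximately solve one worker's local subproblem by coordinate descent.

    A_k     : columns (coordinates) owned by this worker
    alpha_k : current values of those coordinates
    w       : gradient of the smooth part at the shared vector v = A @ alpha
    sigma   : subproblem parameter (assumed here: the number of workers)
    n_steps : amount of local work between two communication rounds
    """
    delta = np.zeros_like(alpha_k)      # local update to the owned coordinates
    r = np.zeros(A_k.shape[0])          # r = A_k @ delta, kept up to date
    col_sq = (A_k ** 2).sum(axis=0)
    for j in rng.integers(0, A_k.shape[1], size=n_steps):
        a_j = A_k[:, j]
        grad = a_j @ w + sigma * (a_j @ r) + lam * (alpha_k[j] + delta[j])
        step = -grad / (sigma * col_sq[j] + lam)   # exact 1-D minimizer
        delta[j] += step
        r += step * a_j
    return delta, r


def cocoa_ridge(A, y, lam, n_workers=4, n_rounds=200, local_steps=20, seed=0):
    """Minimize 0.5*||A @ alpha - y||^2 + 0.5*lam*||alpha||^2."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(np.arange(A.shape[1]), n_workers)
    alpha = np.zeros(A.shape[1])
    v = np.zeros(A.shape[0])            # shared vector v = A @ alpha
    for _ in range(n_rounds):
        w = v - y                       # gradient of 0.5*||v - y||^2
        # Simulated here in one process; conceptually one call per worker node.
        results = [local_scd(A[:, p], alpha[p], w, n_workers, lam, local_steps, rng)
                   for p in parts]
        for p, (delta, r) in zip(parts, results):
            alpha[p] += delta
            v += r                      # only r has to be exchanged between workers
    return alpha
```

Replacing `local_scd` with any other routine that returns an update and its image `A_k @ delta` leaves the outer loop unchanged, which is the sense in which the local solver is interchangeable.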
3.2 Heterogeneous Nodes
The second level of parallelism is implemented to take advantage of one, or potentially multiple, accelerator units available on the individual worker nodes to solve the local optimization task in the CoCoA framework. How we integrate these accelerator units in the optimization procedure depends on the locally available accelerator memory. Therefore we distinguish three application scenarios as illustrated in Figure a, i.e.,
(1) aggregated GPU memory ≥ local training data
(2) aggregated GPU memory < local training data
(3) no GPU available
If the aggregated GPU memory can fit the entire dataset (scenario (1)), we partition the local training data evenly across the available GPU accelerators. Then, the data resides locally in the accelerator memory, the individual GPUs are treated as independent workers, and a second level of CoCoA is implemented to solve the local subproblems in a distributed fashion. This nested structure reflects the hierarchical structure of the distributed system, and we can account for the fact that communication within a node is less expensive than communication across the network by enabling multiple communication rounds on the inner level within a single communication round across the network.
Scenario (2) is the most common setting encountered in distributed systems but also the most challenging one because the local training data cannot be stored entirely in GPU memory. To nevertheless take full advantage of the GPU accelerators we build the local solver on the recently developed DuHL scheme [DuHL2017]. DuHL provides a systematic way to split the workload between the host CPU, which stores the local data in memory, and the accelerator units attached to it, while respecting the tight memory constraint of the accelerator. This results in parallelizable tasks that can be executed simultaneously. More details about the implementation and the extension to sparse data structures are given in Section 4.
In scenario (3), if each worker has only a single CPU, there is no parallelism across compute units to exploit at this level. However, for worker machines that have a multi-socket architecture with multiple CPUs, there is typically a cost associated with inter-socket communication. Furthermore, the issue of false sharing can arise when threads on different sockets attempt to write to the same location in memory. These effects mean that the multi-threaded solvers detailed in the next section cannot be readily applied across sockets. To solve this problem, one can implement CoCoA across the sockets. In this manner, there is a separate local solver running on the cores of each socket, and the communication between them can be effectively controlled in a synchronous manner.
We note that for scenarios (1), (2) and (3) we have orchestrated the work between the individual compute units within a single worker but we have not specified the algorithm that is implemented on the GPU (and the CPU) to solve its local optimization task. This is implemented by the third hierarchical level of our system.
3.3 Multi-Core Parallelism
The third level of parallelism concerns the parallelism across different cores of the CPU and/or the GPU.
To efficiently solve the optimization problem assigned to the GPU accelerator we implement the twice parallel asynchronous stochastic coordinate descent solver (TPA-SCD) [TPASCD2017]. It maps a stochastic coordinate solver to the massively parallel architecture of modern GPUs by leveraging the different levels of parallelism, the shared memory available within multi-processors and the built-in atomic add operation. This solver can be used as a stand-alone solver, as a local solver within the CoCoA framework as in scenario (1) or as part of DuHL in a heterogeneous setting as in scenario (2). In Section 4 we discuss the implementation of this solver in more detail.
The CPU can take two roles, depending on which of the three scenarios, discussed in the previous section, applies.
If the entire local partition of the training data can be distributed among the GPUs, we ship all the work to the GPUs [TPASCD2017]. In this case the CPU takes the role of the master node in the second level of CoCoA and manages the communication and synchronization of the work between the GPU accelerators.
If the CPU is used as part of DuHL it is assigned the task of identifying the subset of the training data to be processed on the GPU for the next round. This consists of randomly sampling datapoints and computing the respective importance value to the optimization task. This computation is independent for individual datapoints and can thus be parallelized over the available cores of the CPU by assigning a subset of the datapoints to each core. This allows full utilization of the compute power of the CPU in parallel to the GPU as discussed in the previous section.
If no GPU is available the CPU will run the local optimization task. For this purpose we use PASSCoDe [passcode], a parallel solver designed for multi-core CPUs. For implementation details see Section 4.5.
4 System Implementation
The implementation of our system reflects the nested hierarchical structure of the algorithmic core illustrated in Figure 4. At the heart of our system is a C++/CUDA library that implements the computationally heavy and performance-critical parts of the system in level 2 and level 3. The data-parallelism on the outermost level can be implemented on top of any distributed computing framework where the individual workers call the underlying library; we will discuss Spark and MPI in more detail. To detail the implementation challenges we will again proceed level by level:
4.1 Level 1 Parallelism
The distributed framework implementing the first level of parallelism handles the data partitioning and the communication of updates over the network between different nodes in the CoCoA framework. In order to satisfy the needs of different users we have implemented our system on top of two different distributed frameworks: the widely used Apache Spark framework [spark] and the high-performance computing framework MPI [mpi]. In the following we will discuss these two implementations separately.
To seamlessly integrate our system in Spark-based applications we provide an implementation of our system on top of Apache Spark. This concerns the outermost level of CoCoA. To maintain good performance on top of Spark we have adopted the implementations and optimizations for CoCoA proposed in [CoCoAccel2017], i.e., we implemented
meta-RDDs to avoid large overheads of Spark related to data management and enable the use of GPUs within the Spark framework.
persistent local memory to avoid unnecessary communication overheads and keep the local model parameters in memory on the worker nodes.
As noted in [CoCoAccel2017] these optimizations are not fully compatible with the Spark programming model and break the data resiliency. However, for our performance-optimized system, training times are very short; even for large-scale data, training is in the order of minutes (see Section 5). Thus the overheads introduced by restarting the training procedure if a node fails are negligible. It would also be relatively straightforward to store a snapshot of the model on disk at regular intervals and then reload this model together with the required data partitions and use it as a warm start for a new run. Providing built-in data resiliency is left for future work.
We also built an implementation of our system entirely in C++ on top of the high performance MPI framework. To initially distribute the training data we use a load-balancing scheme that spreads the computational load evenly across worker nodes as in [CoCoAccel2017]. Regarding the communication of the update vectors in CoCoA, MPI offers more flexibility in designing the communication pattern, i.e., instead of centralized communication we can also use MPI's AllReduce primitives and achieve all-to-all communication among the nodes.
4.2 Level 2 Parallelism – Scenario (1)
In scenario (1), where the aggregated accelerator memory on a single node is large enough to store the local data partition, we use the CoCoA framework to orchestrate work between the different accelerator units. This second level of CoCoA is implemented inside the C++ core of our library using asynchronous functionality provided by the CUDA framework for dealing with multiple GPUs. This is in contrast to how CoCoA is implemented in Level 1, where Spark or MPI is used, since it is a common scenario that one may want to leverage multiple GPUs in one machine without incurring the complexity and overheads of distributed computing frameworks.
4.3 Level 2 Parallelism – Scenario (2)
For scenario (2) we implement DuHL to distribute the workload between the CPU and the attached GPUs. For a detailed description of the algorithmic procedure we refer to [DuHL2017]. In the following we will detail the implementation of DuHL, propose some novel extensions to address sparse data structures and discuss several performance optimizations. The core of the DuHL scheme is the logic that repeatedly determines the subset of the training data for the next round, manages the content of the GPU memory and coordinates the work between host and accelerator units to avoid idle times.
The DuHL scheme in [DuHL2017] is proposed for a single accelerator unit; however, modern servers often contain multiple GPU accelerators per node. Thus, we have extended the DuHL scheme to take advantage of multiple accelerator units by combining it with CoCoA. We use CoCoA to distribute the local workload among the available accelerator units and then assign the CPU cores evenly among them to run the DuHL scheme for every GPU accelerator on the dedicated local subproblem.
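The selection step at the core of this logic can be illustrated with a short sketch. It assumes that an importance score per data vector is already available (how DuHL computes it is described in [DuHL2017] and not reproduced here) and that the accelerator memory holds a fixed number of equally sized vectors. The function name `plan_gpu_swap` and its parameters are hypothetical.

```python
import numpy as np


def plan_gpu_swap(importance, resident, capacity):
    """Return (evict, load): vectors to overwrite and vectors to copy in.

    importance : one score per local data vector (higher = more useful)
    resident   : indices currently held in accelerator memory
    capacity   : number of data vectors that fit in accelerator memory
    """
    importance = np.asarray(importance, dtype=float)
    # Indices of the `capacity` highest-scoring vectors.
    wanted = np.argsort(-importance, kind="stable")[:capacity]
    wanted_set, resident_set = set(wanted.tolist()), set(resident)
    evict = sorted(resident_set - wanted_set)   # slots that may be over-written
    load = sorted(wanted_set - resident_set)    # vectors that must be transferred
    return evict, load
```

Only the vectors in `load` need to cross the host-to-device link, so the transfer volume per round is bounded by how much the set of most important vectors changes.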
GPU Memory Management
For dense data structures, data vectors are of fixed size and the management of the GPU memory is fairly easy: the CPU only needs to keep track of the indices of the data vectors currently residing in the GPU memory. Then, every time an optimization round on the GPU has finished, it sorts the importance values, finds the new index set, compares it to the old index set and swaps (i.e., over-writes) data vectors if necessary. However, for sparse data structures this is more challenging since we need to deal with the variable-size memory allocation problem. Therefore we use a custom memory pool allocator [Lattner05pldi] that partitions the memory into fixed-size blocks and uses offset-based linked lists to keep track of the data vector location and status (free, used). Then, to update the GPU memory in every round, we iteratively delete less important data vectors and fill the available memory space with more important data vectors until the data vectors on the GPU represent the set of most important ones.
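The idea of a fixed-size-block pool with offset-based linked lists can be sketched in a few lines. The class below is an illustration under simplifying assumptions, not the allocator used in the library: it only does the bookkeeping (no actual device memory is touched), it assumes at least one block, and the names `BlockPool`, `store`, `delete` and `blocks_of` are hypothetical.

```python
class BlockPool:
    """Fixed-size-block pool; each stored vector is a chain of block offsets."""

    def __init__(self, n_blocks, block_size):
        assert n_blocks >= 1 and block_size >= 1
        self.block_size = block_size
        # next_block[i] is the offset of the block that follows block i (-1 = end).
        self.next_block = list(range(1, n_blocks)) + [-1]
        self.free_head = 0                # first block of the free list
        self.n_free = n_blocks
        self.head_of = {}                 # vector id -> offset of its first block

    def blocks_needed(self, n_entries):
        return max(1, -(-n_entries // self.block_size))   # ceiling division

    def store(self, vec_id, n_entries):
        """Reserve blocks for a vector; return False if it does not fit."""
        need = self.blocks_needed(n_entries)
        if vec_id in self.head_of or need > self.n_free:
            return False
        head = tail = self.free_head
        for _ in range(need - 1):
            tail = self.next_block[tail]
        self.free_head = self.next_block[tail]   # detach the chain from the free list
        self.next_block[tail] = -1
        self.n_free -= need
        self.head_of[vec_id] = head
        return True

    def delete(self, vec_id):
        """Return a vector's blocks to the front of the free list."""
        head = tail = self.head_of.pop(vec_id)
        count = 1
        while self.next_block[tail] != -1:
            tail = self.next_block[tail]
            count += 1
        self.next_block[tail] = self.free_head
        self.free_head = head
        self.n_free += count

    def blocks_of(self, vec_id):
        """List the block offsets that hold a stored vector, in order."""
        out, b = [], self.head_of[vec_id]
        while b != -1:
            out.append(b)
            b = self.next_block[b]
        return out
```

Because a vector may occupy non-contiguous blocks, deleting a less important vector and storing a more important one of different length never requires compacting the pool.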
To avoid large data transfer and memory management overheads when updating the GPU memory after every round, we group multiple training data vectors into batches and treat them as a single data unit. The importance of such a batch is then determined by the aggregated importance value of the vectors composing the batch. This modification theoretically preserves the superior convergence properties of the DuHL scheme over its random counterpart where batches are sampled uniformly at random. We found this modification to be necessary for sparse data structures in order to achieve good performance.
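As a small illustration of the batching step, the sketch below aggregates per-vector importance values over consecutive batches. It assumes that "aggregated" means a plain sum and that batches are formed from consecutive vectors; both choices, as well as the name `batch_importance`, are assumptions made for the example.

```python
import numpy as np


def batch_importance(importance, batch_size):
    """Sum per-vector importance values over consecutive batches.

    The last batch may be shorter if the number of vectors is not a
    multiple of batch_size.
    """
    importance = np.asarray(importance, dtype=float)
    starts = np.arange(0, importance.size, batch_size)
    return np.add.reduceat(importance, starts)
```

The resulting per-batch scores can then be ranked in the same way as per-vector scores, with a batch as the unit that is moved to or evicted from accelerator memory.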
To avoid idle times on the GPU we stop the importance value computation on the CPU as soon as the GPU has finished. We achieve this using CUDA events as proposed by the authors in [DuHL2017].
DuHL works particularly well for sparse models such as SVM or Lasso; however, for dense models such as logistic regression, we were not always able to observe a significant convergence gain from using DuHL on certain datasets. While we need to further investigate the origin of this performance drop and adapt DuHL accordingly, we provide an optimized sequential alternative to DuHL. That is, we process data sequentially in batches; this can be viewed as DuHL with a degenerate importance function returning the time passed since the data vector was last processed. This sequential approach does not enjoy the favorable convergence behavior of DuHL; however, we still fully utilize all resources by assigning an alternative workload to the CPU: the CPU is used to perform permutations of future batches to be processed on the GPU. This split of workload has the additional advantage that data transfer overheads can be fully interleaved with the computational workload. To achieve this we make use of CUDA streams, where one stream is assigned to copying data to the GPU for the next iteration while the other stream executes the computational workload. We will analyze the performance of this approach in more detail in Section LABEL:sec:profiling.
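The view of sequential processing as DuHL with a degenerate importance function can be made explicit with a toy scheduler. The sketch below assumes one batch is processed per round and that ties are broken in favour of the lowest batch index; the function name `sequential_schedule` is hypothetical, and the overlap of copies and compute via CUDA streams is not modelled.

```python
def sequential_schedule(n_batches, n_rounds):
    """Pick, in every round, the batch that has waited longest since its last use."""
    last_used = [-1] * n_batches        # round in which each batch was last processed
    order = []
    for t in range(n_rounds):
        # Degenerate "importance": time passed since the batch was last processed.
        staleness = [t - last for last in last_used]
        b = max(range(n_batches), key=lambda i: staleness[i])  # ties -> lowest index
        order.append(b)
        last_used[b] = t
    return order
```

With this importance function the selection rule reduces to a round-robin sweep over the batches, for example `sequential_schedule(3, 7)` yields `[0, 1, 2, 0, 1, 2, 0]`.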
4.4 Level 2 Parallelism – Scenario (3)
Currently, inter-socket CoCoA is not explicitly implemented in Snap ML. We defer the problem to the top level of parallelism and simply spawn a separate MPI process or Spark executor that is pinned to each socket. While this approach is acceptable, we acknowledge that since inter-socket communication is typically much cheaper than communication over the network, there may be some potential benefit to communicating more frequently between the workers on each socket relative to the workers on different machines. Therefore, we plan to implement inter-socket CoCoA explicitly using multi-threaded C++ code in the near future.
4.5 Level 3 Parallelism
To benefit from the parallelism offered by GPUs when solving the local optimization problem we implement the TPA-SCD solver as detailed in [TPASCD2017] and [DuHL2017]. TPA-SCD defines a stochastic coordinate descent in which every coordinate update is assigned to a different thread block for asynchronous execution on the streaming multi-processors of the GPU. Within each thread block, the inner products that are essential to solve the coordinate-wise subproblems are computed using warps of tightly-coupled threads. In the previous literature, TPA-SCD was applied to objective functions such as ridge regression [TPASCD2017] and support vector machines [DuHL2017], both of which have the desirable property that the objective function can be minimized exactly with respect to a single model coordinate while keeping the other coordinates fixed. In Snap ML, we also support objective functions for which this is not the case, such as the dual form of logistic regression. To address this issue, instead of solving the coordinate-wise subproblem exactly, we make a single step of Newton's method, using the previous value of the model as the initial point. We find that the computations required for the Newton step (i.e., the first and second derivatives) can also be expressed in terms of a simple inner product and thus all of the same TPA-SCD machinery can be applied. In the broader context of our system, these GPU solvers can be used as a stand-alone solver, as a local solver in a distributed framework as in scenario (1), or as part of DuHL as in scenario (2).
For scenario (3), where the optimization problem is solved on the CPU, we use the PASSCoDe algorithm [passcode]. The authors propose three different versions of PASSCoDe that handle the synchronization differently; these implementations were studied and compared in [TPASCD2017]. Motivated by these results we have implemented the PASSCoDe-atomic version of the PASSCoDe algorithm as the default solver. It is not as fast as the PASSCoDe-wild alternative but its convergence behavior is better defined.
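The single-Newton-step idea can be illustrated with a sequential NumPy sketch. Note the substitutions: for simplicity it uses the primal L2-regularized logistic regression objective rather than the dual form, and it runs coordinate updates one after another instead of across GPU thread blocks. The function names and the choice of objective are assumptions made for the example. What it does show is that both derivatives needed for the Newton step are inner products between a data column and a vector derived from the shared vector `z`.

```python
import numpy as np


def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable logistic function


def objective(X, y, w, lam):
    return np.logaddexp(0.0, -y * (X @ w)).sum() + 0.5 * lam * (w @ w)


def newton_cd_logistic(X, y, lam, n_epochs=50, seed=0):
    """Coordinate descent with one Newton step per coordinate update.

    Minimizes sum_i log(1 + exp(-y_i * x_i.w)) + 0.5*lam*||w||^2, y_i in {-1,+1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    z = np.zeros(n)                       # shared vector z = X @ w
    for _ in range(n_epochs):
        for j in rng.permutation(d):
            x_j = X[:, j]
            p = sigmoid(y * z)            # probability assigned to the true label
            g = x_j @ (-y * (1.0 - p)) + lam * w[j]      # first derivative
            h = (x_j * x_j) @ (p * (1.0 - p)) + lam      # second derivative
            step = -g / h                 # single Newton step from the current w[j]
            w[j] += step
            z += step * x_j               # keep the shared vector consistent
    return w
```

A pure Newton step carries no decrease guarantee in general, so a production solver may need safeguards that this sketch omits.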
|dataset||#features||#examples (train)||#examples (test)||size train [binary]||size train [libsvm]||pre-processing|
|criteo-kaggle||1 million||34.4 million||11.5 million||11 GB||19 GB||data as provided on the LIBSVM webpage [libsvmdataset]|
|criteo-1b||10 million||1 billion||100 million||98 GB||274 GB||custom one-hot encoding of categorical variables|
|Terabyte Click Logs||1 million||4.2 billion||180 million||641 GB||2.3 TB||data as provided on the LIBSVM webpage [libsvmdataset]|
5 Experimental Results
In the following we will analyze the performance of Snap ML in different hardware scenarios and benchmark its performance against scikit-learn, TensorFlow and Apache Spark.
For the experimental section we focus on the machine learning application of click-through rate (CTR) prediction, which is one of the central uses of machine learning on the internet. CTR prediction is a massive-scale binary classification task where the goal is to predict whether or not a user will click on an advert based on a set of anonymized features. For our experiments we use the publicly available Terabyte Click Logs dataset recently released by Criteo Labs [criteopressrelease]. It consists of a portion of Criteo's traffic over a period of 24 days, where every day an average of 160 million examples were collected; an example corresponds to a displayed ad served by Criteo and is labeled by whether or not this advert has been clicked. We use the data collected during the first 23 days for the training of our models and the last day for testing. For single node experiments in Section LABEL:sec:single-node as well as the analysis of individual components of Snap ML in Section LABEL:sec:multi-node we will use smaller versions of this dataset such as i) the one released by Criteo Labs as part of their 2014 Kaggle competition, where we perform a random 75%/25% train/test split, and ii) a custom subsampled version of the Terabyte Click Logs data where we use the first 1 billion examples for training and the next 100 million examples for testing. The different datasets are specified in Table 1. Note that different pre-processing was used for the different versions of the dataset.