Parallax: Automatic Data-Parallel Training of Deep Neural Networks
Abstract
The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in machine learning (ML). ML frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to assist ML researchers in training their models in a distributed fashion. However, correctly and efficiently utilizing multiple machines and GPUs is still not a straightforward task for framework users due to the non-trivial correctness and performance challenges that arise in the distribution process.
This paper introduces Parallax, a tool for automatic parallelization of deep learning training in distributed environments. Parallax not only handles the subtle correctness issues, but also leverages various optimizations to minimize the communication overhead caused by scaling out. Experiments show that Parallax built atop TensorFlow achieves scalable training throughput on multiple CNN and RNN models, while requiring little effort from its users.
Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong,
Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, Byung-Gon Chun
Seoul National University 
1 Introduction
It is a common practice nowadays for deep learning (DL) practitioners to utilize a cluster of GPU resources for training deep neural networks. This is mainly motivated by the fact that recent deep neural network architectures involve very large computations [29, 16, 32] and are trained on large datasets [26, 8], typically requiring multiple GPUs in order to finish training within a reasonable time limit. There are a few parallelization strategies for accelerating training on multiple GPUs: running multiple model replicas that process disjoint datasets (data parallelism), partitioning a single model into several smaller models (model parallelism), and a mixture of the previous two strategies (hybrid parallelism). Among these techniques, data parallelism is the most widely used thanks to its simplicity [13, 29, 17], and is supported by most DL frameworks such as TensorFlow [6], Caffe2 [1], Horovod [2], MXNet [10], and PyTorch [23], to increase training throughput by processing data in parallel.
However, adapting a single-GPU DL model to work in a distributed environment for data parallelism is not a trivial task for users. Users must consider not only what data will be shared and synchronized between model replicas and what communication method will be used, but also how computation and data will be distributed across multiple GPUs and machines. A suboptimal decision can easily lead to unsatisfactory convergence or lower-than-expected training throughput gain from parallelization.
One of the key challenges of parallelization is reducing the communication overhead. Even though most DL frameworks support data communication between GPUs and machines through some framework API, improving the communication efficiency is still left as a task for the user. Different communication mechanisms are appropriate for different models depending on their computation characteristics, but frameworks do not point this out. The user must know the available communication mechanisms and understand how to avoid possible bottlenecks and apply optimizations to reap the performance benefits of parallelization, all by themselves. This problem discourages DL framework users and must be addressed to improve the usability of DL frameworks.
In this paper, we introduce Parallax, a tool for automatically performing data-parallel distributed training of deep neural networks. Parallax transforms a single-GPU model into a model for distributed execution, employing various optimizations to mitigate communication overhead such as hybrid communication, local aggregation, and smart operation placement. Parallax ensures correct computation and scalable performance with minimum user intervention.
We have implemented Parallax on top of TensorFlow [6] 1.6 with Horovod [2]. Experiments on two classification models [16, 29] and two language models [17, 32] show that Parallax can speed up DL training on a GPU cluster of 8 machines each equipped with 6 GPU cards. Parallax achieves up to 43.6x speedup for the classification models on 48 GPUs and 17.1x speedup for the language models on 36 GPUs. Most importantly, the performance gain is earned with absolutely no manual optimizations from the user – merely a few lines of code are needed to use the Parallax API.
The rest of the paper is organized as follows. Section 2 describes the DL background related to Parallax, and Section 3 examines the challenges of data-parallel distributed training. Section 4 gives an overview of how Parallax works, while Section 5 explains the details of model transformation. Sections 6 and 7 describe the implementation details and evaluation results. Section 8 presents related work and Section 9 concludes.
2 Background
In this section, we briefly discuss the categories of DL frameworks. We then discuss data-parallel distributed training and its communication methods to synchronize model parameters.
2.1 DL Frameworks
We can classify existing DL frameworks into two categories based on when a computation graph is executed after its declaration.
Define-and-run frameworks [6, 10, 1, 7, 20] delay the execution of computation graphs until the user explicitly requests computation results (i.e., deferred execution). The user first declares various operations that will be used in the computation graph. Meanwhile, the framework prepares graph execution by allocating memory appropriately and resolving dependencies between operations, but does not actually execute the graph immediately. By postponing computation, it is possible to apply various graph optimizations that require views of the whole computation graph, including parallel execution of mutually independent operations, efficient buffer management for reducing memory footprint, and operator fusion [5, 21].
Define-by-run frameworks [23, 30] execute graphs as soon as they are defined (i.e., immediate execution), similar to programming languages with interactive shells. While such frameworks are incapable of providing graph optimizations as define-and-run frameworks do, they are often preferred when users require better programmability or debuggability thanks to their interactive nature.
The deferred execution scheme of define-and-run frameworks provides an opportunity to transform computation graphs with global information on hand, such as all model parameters, parameter update methods, and gradients, which is necessary for correct and precise automatic parallelization. On the other hand, define-by-run frameworks give us no room to apply graph transformations involving the entire graph structure because all operations run as soon as they are declared. Therefore, Parallax focuses on define-and-run frameworks, inserting an automatic graph transformation step between the declaration of a computation graph and its execution.
2.2 Data-Parallel Distributed Training
A neural network is trained by applying a series of iterative computations, in which a set of model parameters are updated at each iteration. A model refers to a neural network architecture, while model parameters indicate the weights of a neural network. The training process is carried out via the gradient descent algorithm; the loss value of a model is calculated from forward computations, and the loss is passed back through the model according to the backpropagation algorithm to compute gradients. These gradients are then used to update model parameters. Batch gradient descent, which utilizes all available input data during one iteration, is preferred for computational efficiency, but the memory size limit of GPUs leads to the use of minibatches that contain only a subset of the whole data.
Data-parallel distributed training is utilized to process several minibatches simultaneously with multiple GPUs. GPUs are set to perform the same computation on different minibatches, each producing a unique set of gradients. In the case of asynchronous training, the gradients from one GPU are used to update model parameters without waiting for other GPUs. For synchronous training, on the other hand, all GPUs wait for one another to finish their gradient computation for model parameters. Then, the computed gradients are aggregated before being applied to model parameter updates. For both asynchronous and synchronous training, data communication between GPUs and machines is necessary to share the computed gradients.
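The synchronous scheme described above can be sketched in a few lines of NumPy (an illustrative toy, not Parallax or TensorFlow code; the quadratic grad_fn and the minibatch values are invented for the example):

```python
import numpy as np

def synchronous_step(params, minibatches, grad_fn, lr=0.1):
    # Each "GPU" computes gradients on its own disjoint minibatch.
    grads = [grad_fn(params, mb) for mb in minibatches]
    # Synchronization barrier: aggregate (here, average) all gradients.
    agg = np.mean(grads, axis=0)
    # A single shared update keeps every replica's parameters identical.
    return params - lr * agg

# Toy gradient: mean squared distance of params to the minibatch rows.
grad_fn = lambda p, mb: np.mean(2 * (p - mb), axis=0)
params = np.zeros(3)
minibatches = [np.ones((4, 3)), 3 * np.ones((4, 3))]    # two replicas
params = synchronous_step(params, minibatches, grad_fn)  # -> [0.4 0.4 0.4]
```

The aggregation step is exactly where the communication discussed in the next subsection happens.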
Processing multiple minibatches in data-parallel training involves tuning various model hyperparameters, such as the learning rate and batch size. In asynchronous training, the staleness of model parameter updates is known to negatively impact the model’s accuracy and produce relatively unpredictable tuning results [9, 34, 15]. Thus, many DL models are trained synchronously [13, 32, 27, 31], and this research also assumes synchronous training.
2.3 Communication Methods: Parameter Server and Message Passing Interface
Two widely used communication methods for distributed training are the Parameter Server (PS) [28] and the Message Passing Interface (MPI) [14]. The PS method, initially proposed for topic modeling [28], has been extensively used in many frameworks such as TensorFlow [6], MXNet [10], and Adam [11] because the PS abstraction cleanly separates model parameter synchronization from minibatch computation into dedicated processes. A typical PS architecture consists of server and worker processes as described in Figure 2. Server processes store subsets of model parameter values (ParamN) in memory, while worker processes pull parameters from servers to perform local computations on their respective minibatches (input data) and later push parameter updates back to servers. As a result, model parameter synchronization between workers is done indirectly via the server processes.
For MPI communication, there is no process dedicated solely to holding model parameters, as shown in Figure 2. Rather, all workers are given a replica of the model parameters (Params) and share locally computed gradients via MPI collective primitives such as AllReduce and AllGatherv. These primitives are used to collect gradients from each worker and broadcast the aggregated values back to all workers. The model parameters housed in workers are updated using the aggregated gradients, so all parameters remain synchronized across replicas.
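The two communication patterns can be contrasted with a small simulation (illustrative only; real PS and MPI implementations shard parameters, run in separate processes, and overlap communication with computation):

```python
import numpy as np

def allreduce(worker_grads):
    # Simulate MPI AllReduce: every worker ends up with the sum of all
    # workers' gradients; there is no dedicated server process.
    total = np.sum(worker_grads, axis=0)
    return [total.copy() for _ in worker_grads]

def ps_round(params, worker_grads, lr=0.1):
    # Simulate one PS round: workers push gradients, the server applies
    # the aggregate to its parameters, and workers pull fresh values.
    params = params - lr * np.sum(worker_grads, axis=0)
    return params, [params.copy() for _ in worker_grads]

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
reduced = allreduce(grads)                        # both workers see [4., 6.]
new_params, pulled = ps_round(np.zeros(2), grads)
```

Note the structural difference: AllReduce exchanges gradients directly between peers, while the PS round routes every gradient through server state.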
Existing DL frameworks that support data-parallel training tend to hide these details (to some extent) from users to avoid complicating the user experience. However, several challenges, which will be described in the next section, still exist when utilizing such frameworks.
3 Challenges
When training DL models, many users adopt data parallelism to increase training speed. To accomplish data parallelism, it is common to first build a primary model that is trainable on a single GPU before extending the model to be trained in distributed environments. Some DL frameworks [6, 1, 2] currently provide basic mechanisms for distributed training, such as gradient aggregation or model parameter synchronization.
However, modifying a single-GPU model to be trainable on multiple GPUs and machines is not a trivial task despite such mechanisms. To modify the single-GPU model, users must manually use APIs, such as SyncReplicasOptimizer of TensorFlow [6] or pull (or push) of MXNet [10], with an understanding of the distribution mechanism that the underlying framework provides. For instance, distributed TensorFlow assumes that users are aware of the PS architecture. Thus, the users have to add code for distributing model parameters across servers and synchronizing workers.
This manual modification may lead to incorrect and/or inefficient distributed training due to users’ mistakes. Incorrect distributed training may fail to make a model converge. Inefficient model distribution causes poor performance and resource underutilization. Debugging these problems is a time-consuming and painful task. Next, we describe the challenges of synchronous data-parallel training with regard to correctness and scalability in detail.
3.1 Correctness
Synchronous data-parallel training is equivalent to single-GPU training if the total minibatch size is the same. For example, data-parallel training that uses 8 GPUs and a minibatch of 32 per GPU is equivalent to single-GPU training with a minibatch of 256. In this sense, the correctness of synchronous data-parallel training can be defined as the ability to produce results mathematically identical to those of single-GPU training with the same total minibatch size.
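This equivalence can be checked numerically for a simple model: averaging the gradients of 8 equal-sized disjoint minibatches of 32 is mathematically identical to the gradient of one minibatch of 256 (a toy sketch of our own using a linear least-squares loss):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)
data = rng.normal(size=(256, 5))            # total minibatch of 256 examples
targets = data @ rng.normal(size=5)

def grad(w, x, y):
    # Gradient of the mean squared error of a linear model over a batch.
    return 2 * x.T @ (x @ w - y) / len(x)

# Single-GPU: one gradient over all 256 examples.
g_single = grad(w, data, targets)

# Data-parallel: 8 "GPUs", each with a disjoint minibatch of 32, then average.
shards = np.split(np.arange(256), 8)
g_parallel = np.mean([grad(w, data[s], targets[s]) for s in shards], axis=0)

assert np.allclose(g_single, g_parallel)    # mathematically identical
```

The identity holds because the mean of equal-sized per-shard means equals the global mean; it would break if shards had different sizes.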
To analyze differences between data-parallel training and single-GPU training regarding correctness in detail, data-parallel training can be divided into four steps: input data processing, computing gradients using forward computation and backpropagation, gradient aggregation, and parameter updates. The first two steps are almost identical in both single-GPU and data-parallel training; the only difference is that data-parallel training replicates these computations on multiple GPUs with disjointly partitioned input data for each GPU. However, the remaining two steps are tricky because single-GPU training does not need gradient aggregation or parameter synchronization between multiple GPUs during parameter updates. There are a few scenarios that affect correctness regarding gradient aggregation and parameter updates when users do not use DL frameworks cautiously.
Correct Gradient Aggregation. Current DL frameworks require users to modify single-GPU code to aggregate the gradients computed on each GPU before updating parameters. However, if any additional computation exists between the gradient computation and the parameter update, it is unclear to users whether they should aggregate the gradients before or after that additional computation. An example would be gradient norm clipping [22], which is widely used in RNN models to handle the exploding gradient problem. Gradient norm clipping limits the magnitude of the calculated gradients with a global norm, which is the square root of the sum of L2 norms of a subset of or all of the gradients. In single-GPU training, gradients are computed for a minibatch and then clipping is performed on the computed gradients. Because aggregated gradients in distributed training are the same as the gradients computed on a single GPU if the total minibatch size is the same for both, distributed training should execute clipping after aggregating the gradients to be correct. If a user aggregates the gradients after clipping them, the computation result differs from that of the single-GPU version, which complicates hyperparameter tuning.
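The ordering issue can be demonstrated concretely (a toy NumPy sketch of global-norm clipping; the two-GPU gradient values are invented for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Scale all gradients down so their global L2 norm is at most max_norm.
    gnorm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / gnorm)
    return [g * scale for g in grads]

# Per-GPU gradients for one parameter from two GPUs.
g_gpu0, g_gpu1 = np.array([3.0, 0.0]), np.array([0.0, 4.0])

# Correct order: aggregate (average) first, then clip, as on a single GPU.
correct = clip_by_global_norm([(g_gpu0 + g_gpu1) / 2], max_norm=1.0)[0]

# Incorrect order: clip each GPU's gradient first, then aggregate.
wrong = (clip_by_global_norm([g_gpu0], 1.0)[0]
         + clip_by_global_norm([g_gpu1], 1.0)[0]) / 2

assert not np.allclose(correct, wrong)   # the two orders diverge
```

Here the correct order yields [0.6, 0.8] while clip-then-aggregate yields [0.5, 0.5], so a model tuned for single-GPU training would behave differently.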
The high-level distribution APIs of widely used DL frameworks with the PS architecture do not provide a way to handle gradient clipping correctly. TensorFlow hides gradient aggregation from users, thus disallowing additional computation after aggregation. MXNet provides an independent update function for each set of parameters (e.g., a weight matrix for a convolutional layer or LSTM cell) and its gradients. Therefore, accessing the gradients of multiple sets of parameters is difficult even though it is necessary for computing a global norm. As a result, a user who wants to analyze and manipulate the aggregated gradients has to implement a new aggregation mechanism by understanding and modifying the underlying distribution API implementation.
Correct Parameter Update. DL frameworks with the PS architecture provide APIs focused on synchronization of model parameters, consisting of the weights and biases of the neural network. However, most DL models contain extra parameters that must be updated separately from the model parameters, so correctly updating all the parameters is a non-trivial task. For example, variants of the gradient descent algorithm such as momentum gradient descent [25] and adaptive gradient descent [11, 12] require an extra set of parameters, namely AccumParams, to accumulate the values of model parameters from past iterations. By definition, AccumParams must be updated exactly when the values of the corresponding model parameters change. Existing frameworks handle the momentum or adaptive gradient descent algorithms with APIs such as optimizer or updater, so AccumParams are correctly updated inside these APIs. However, if a user wants to experiment with a new gradient descent algorithm, there is no API for the new algorithm; maintaining AccumParams in the new algorithm remains the user’s responsibility.
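For instance, momentum SGD in a data-parallel setting might be sketched as follows (our own illustrative code, not a framework API): the accumulator must advance exactly once per iteration using the aggregated gradient, because advancing it once per worker would apply the momentum decay multiple times.

```python
import numpy as np

def momentum_update(param, accum, worker_grads, lr=0.1, mu=0.9):
    # Aggregate the per-worker gradients first.
    agg = np.mean(worker_grads, axis=0)
    # The extra AccumParams-style state advances once per iteration,
    # driven by the aggregated gradient, regardless of the worker count.
    accum = mu * accum + agg
    return param - lr * accum, accum

param, accum = np.ones(2), np.zeros(2)
grads = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]   # two workers
param, accum = momentum_update(param, accum, grads)
```

If each worker instead updated the accumulator itself, `mu * accum` would be applied once per worker and the trajectory would diverge from single-GPU momentum SGD.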
Another example of extra parameters is the moving averages of model parameters, which smooth out noise across multiple iterations [29, 31]. The moving averages should also be stored in the parameter servers and updated when the model parameters are updated. To handle moving averages correctly, the distribution APIs of current DL frameworks require extra user effort. For example, the distribution API of TensorFlow forces users to inform the framework that moving averages are being used in their model, by manually passing this information as an extra argument when invoking the distribution API. Furthermore, MXNet users must define a new parameter update function for moving averages, as MXNet does not support any API for them.
3.2 Scalability
Scalability in distributed training is the ability to improve performance proportionally to the number of GPUs used for training. However, distributed training incurs communication overhead across machines, which limits scalability. In data-parallel training, communication occurs in two cases. First, when aggregating gradients from each worker, the gradients are shared between GPUs and machines. The communication overhead of aggregation changes according to the communication method, MPI or PS. Second, communication is necessary to transfer data between operations in different GPUs or machines. For example, when an operation in a GPU needs values residing in another machine, the values must be transferred to the GPU. The amount of data transfer changes according to how operations are placed in the distributed environment. It is challenging to reduce the communication overhead by choosing an efficient communication method and finding an optimized operation placement, which we discuss next.
Communication Methods. As described in Section 2.3, there are two major communication styles available for distributed training: PS and MPI. Typically, one communication style is considerably more efficient than the other depending on the parameters in a DL model to be trained.
Parameters of a DL model can be categorized into dense and sparse parameters. We define the sparsity of a parameter as the ratio of elements actually updated to the total number of elements. Dense parameters have a sparsity of one, and sparse parameters have a sparsity of less than one. A representative example of dense parameters is the weights of a convolutional layer in a convolutional neural network (CNN). At each iteration with a minibatch, all the elements in the weight matrix of the convolutional layer are used to calculate the final output and are updated by the gradient values. On the other hand, word embedding parameters, which map words to vectors in language models, are an example of sparse parameters. Because input sentences in a minibatch include only a subset of the entire vocabulary, only a subset of the word embedding parameters is used and updated at each iteration.
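A small NumPy sketch of why embedding gradients are sparse (illustrative; real frameworks represent such gradients as (indices, values) pairs rather than dense matrices):

```python
import numpy as np

vocab, dim = 10, 4
embedding = np.zeros((vocab, dim))       # word-embedding parameter matrix
minibatch_ids = np.array([2, 7, 2])      # only words 2 and 7 appear

# Backpropagation touches only the rows gathered for this minibatch, so the
# gradient is naturally a sparse (indices, values) pair, not a vocab x dim
# dense matrix.
grad_values = np.ones((len(minibatch_ids), dim))

# Applying the sparse update is a scatter-add into the touched rows only.
np.add.at(embedding, minibatch_ids, grad_values)

# Fraction of rows actually updated: 2 of 10 here.
sparsity = len(np.unique(minibatch_ids)) / vocab
```

Only two of the ten rows carry any gradient, which is exactly why transferring the full dense matrix at every iteration would waste bandwidth.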
Experiments in Table 1 show that the type of parameters is an important factor when selecting a communication method. Among the selected applications, the CNN models (ResNet-50 and Inception-v3) contain only dense parameters, but the RNN models (LM and NMT) contain sparse parameters for word embeddings in addition to dense parameters for LSTM cell states. According to the results, MPI is preferable when training models with only dense parameters, but PS is preferable for models containing sparse parameters. However, to the best of our knowledge, no prior work considers the parameter type when selecting the communication method. This means users of current DL frameworks must choose an efficient communication method by themselves, considering the type of model parameters and the ratio of sparse and dense parameters in a DL model.
Operation Placement. Effective operation placement can significantly reduce data transfer between GPUs and between machines. Current DL frameworks with PS communication have simple operation placement rules. The model parameters, parameter update operations, and gradient aggregation operations are assigned to servers, while all the other remaining operations, mostly computation-intensive operations, are assigned to workers to utilize GPUs. This simple placement rule may seem reasonable, but it does not place operations optimally for high-performance training.
For example, under this rule, even operations that require a small amount of computation, such as slicing data (slice), gathering a subset of values with specified indices (gather), and casting data into a new type (cast), are assigned to the workers. However, in some cases, placing them on the servers improves training throughput, because it is possible to reduce communication without slowing down computation. A mixed precision [19] model fits this case. The model has 32-bit floating point parameters, but they are cast to 16-bit when calculating gradients. Thus, if casting operations for both parameters and gradients are placed on the same server, the server can send and receive 16-bit data rather than 32-bit data, which halves the amount of transferred data.
In addition, since there are multiple parameter servers, selecting an optimal server among them is another issue to consider. To prevent network bottlenecks, model parameters need to be distributed evenly among the servers. Moreover, it is better to co-locate each parameter with its related computations, including gradient aggregation, computations performed after combining gradients (e.g., gradient clipping), and calculations of moving averages for the parameter.
These challenges require system support for automatic dataparallel distributed training of DL models.
4 Overview of Parallax
Parallax is an automatic data parallelization tool built on TensorFlow [6], a defineandrun framework. Parallax enables users to utilize distributed multiGPU environments easily when they have a singleGPU computation graph (i.e., a machine learning model developed for training on a single GPU). To address the challenges described in Section 3, the tool guarantees the following three properties: transparency, correctness, and performance scalability.

Transparency: Parallax users do not need to write new code for parallelization that requires prior knowledge of the data communication logic of TensorFlow. Instead, the tool provides an API that receives a single-GPU computation graph as input and automatically transforms the graph into a multi-GPU, multi-machine computation graph.

Correctness: With the transformation rules we define (Section 5), Parallax does not affect computation correctness during transformation. Thus, users can train their models using multiple GPU devices without being concerned that the results will differ from those of single-GPU training.

Scalability: Parallax analyzes the single-GPU computation graph before transformation to generate a graph optimized for scalable performance. In particular, the analysis considers operation placement and communication methods.
The steps shown in Figure 3 outline the overall execution model of Parallax. Once users define a single-GPU computation graph, they pass the graph to Parallax along with the available cluster resource information. Parallax then analyzes the computation graph and parallelizes it according to its graph transformation mechanisms. During the transformation, Parallax utilizes both PS and MPI communication methods to achieve scalable performance. Finally, Parallax executes the transformed graph. We explain the graph transformation mechanisms in detail in Section 5.
Parallax provides simple programming interfaces: shard and get_runner (Table 2). Unlike single-GPU training, input data must be divided into disjoint subsets to be processed by different GPUs for data-parallel distributed training. Parallax helps automate this process with the shard API, which receives input data and splits it into multiple subsets so that each GPU reads a unique subset. Next, get_runner is the main interface; it accepts a computation graph as well as resource information, including the IP addresses of machines and GPU IDs, and an optional Parallax configuration object specifying communication options if needed.
We illustrate how to use the Parallax API with a code snippet example for training the ResNet-50 [16] model, a DL model widely used for image classification. Figure 4 shows code for training the ResNet-50 model on a single GPU, without using Parallax. First, a graph object, single_gpu_graph, is declared, followed by the logic for preprocessing input data, the loss function, the gradients from backpropagation, and the gradient descent method for updating the model parameters (lines 2–10). Afterwards, the graph runner object graph_runner is accessed to execute single_gpu_graph for many iterations (lines 12–14).
To parallelize the training script from Figure 4 with Parallax, the code can be modified as in Figure 5. Only two main modifications are required. First, the input data must be partitioned across GPUs for data parallelism. As mentioned above, this can be accomplished with the shard interface. The ds object in line 5 represents the whole input data, while the ds object returned by shard in line 6 is a partitioned dataset to be fed to the different GPUs.
Second, the computation graph is transformed to be executable on multiple GPUs through the get_runner interface. In lines 15–18 and line 21, the graph_runner object returned by the get_runner interface should be used in place of the original graph runner from Figure 4, as the graph runner returned by the original framework interface is not aware that the computation graph has been converted for a distributed multi-GPU environment.
Users are also free to adjust extra configurations. The get_runner interface optionally receives a ParallaxConfig object, which contains various fields for training options. For example, the average_dense (or average_sparse) flag indicates whether to compute the average of dense (or sparse) gradients over all GPUs or to compute the sum instead.
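Putting the pieces together, the modification described above might look roughly like the following pseudocode. The identifiers parallax.shard, parallax.get_runner, and ParallaxConfig come from the paper; load_dataset, build_model, resource_info, lr, and num_iterations are hypothetical placeholders, so this is a sketch rather than runnable code:

```python
# Pseudocode sketch of the Figure 4 -> Figure 5 change (illustrative only).
import parallax
import tensorflow as tf

single_gpu_graph = tf.Graph()
with single_gpu_graph.as_default():
    ds = load_dataset()              # whole input data (hypothetical helper)
    ds = parallax.shard(ds)          # NEW: split input across GPUs
    loss = build_model(ds)           # forward computation (hypothetical helper)
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)

# NEW: transform the single-GPU graph for the given cluster resources.
graph_runner = parallax.get_runner(
    single_gpu_graph,
    resource_info,                                    # machine IPs and GPU IDs
    config=parallax.ParallaxConfig(average_dense=True))

for step in range(num_iterations):
    graph_runner.run(train_op)
```

The only single-GPU lines that change are the shard call and the source of the graph runner; the model definition itself is untouched.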
5 Automatic Graph Transformation
Parallax executes graph transformation with two main goals in mind: correctness and performance scalability.
Correctness. As discussed in Section 3, it is difficult to manually transform a single-GPU computation graph into a distributed version without introducing errors. Parallax carries out the transformation process systematically, adhering to several specific rules and relieving users of the burden. Typically, when a user defines a model in an existing framework, the user first declares a forward computation graph and then calls the auto-differentiation function to add backpropagation operations to the graph. The gradients of the model are computed by this backpropagation portion of the graph. Parallax slightly modifies the auto-differentiation function to record the gradients of model parameters when constructing a single-GPU graph. Later, when the graph is replicated across GPUs, Parallax identifies the gradients from each GPU and adds gradient aggregation operations to update model parameters using aggregated gradient values. When updating a parameter, Parallax ensures that the parameter is updated only once during an iteration, by its corresponding aggregated gradient value, regardless of the number of workers. The same rule also applies to updating extra parameters such as the moving average parameters discussed in Section 3.
Scalability. Parallax performs automatic graph transformation that achieves not only correctness but also scalability by reducing communication overheads. Parallax considers factors that affect communication: the style of communication, the type of parameters, and operation placement across multiple GPUs and machines. Parallax optimizes communication to speed up training.
Next, we present graph transformation for PS and MPI in Sections 5.1 and 5.2, respectively, because they differ in gradient aggregation and parameter synchronization. We then explain a set of optimization techniques to improve performance in Section 5.3.
5.1 Transformation for PS
Parallax transforms a single-GPU graph for PS communication by creating a copy of the forward and backward graph operations for each worker and partitioning model parameters across servers. Parallax applies different replication and operation placement policies to parameters, parameter update operations, and main computation operations.
Figure 6 shows an example of the graph transformation. Parallax launches a (parameter) server on each machine and a worker on each GPU in the given resource specification. This co-location of workers and a server in a machine works well because workers are GPU-intensive while servers run lightweight, CPU-only computation.
Parallax partitions model parameters (Params={Params1, Params2}) across servers evenly based on their sizes. This partitioning avoids network transfer imbalance across servers. Parallax assigns update operations (Update={Update1, Update2}) to the same server as the parameters they update. Identifying model parameters and their update operations is feasible because DL frameworks [6, 7, 10, 1] treat them differently from mathematical operations such as addition or multiplication. Main computation operations that are used to compute gradients are replicated once per GPU (Model and Grads). Model and Grads represent operations for forward computation and backpropagation, respectively. Parallax detects gradients using the auto-differentiation results recorded during graph construction, and identifies main computation operations by searching all ancestor operations of the gradients in the graph. Gradients from each GPU are aggregated twice, using aggregation operations for GPUs within a machine (LocalAgg={LocalAgg1, LocalAgg2}) and between machines (GlobalAgg={GlobalAgg1, GlobalAgg2}). Local aggregation reduces the amount of data communication between workers and servers, which is more expensive than communication between GPUs in the same machine. The outputs of GlobalAgg are used to update model parameters. Parallax places a global aggregation operation (e.g., GlobalAgg1) on the same server as the corresponding model parameters (e.g., Params1) to minimize data transfer between machines.
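The two-level aggregation changes where data travels (cheap intra-machine links first, the network second) but not the result; a quick NumPy check (illustrative only):

```python
import numpy as np

def two_level_aggregate(grads_per_machine):
    # LocalAgg: sum gradients of the GPUs inside each machine first.
    local = [np.sum(g, axis=0) for g in grads_per_machine]
    # GlobalAgg: sum the per-machine partial sums across machines.
    return np.sum(local, axis=0)

# Two machines with two GPUs each; rows are per-GPU gradients.
machine0 = np.array([[1.0, 2.0], [3.0, 4.0]])
machine1 = np.array([[5.0, 6.0], [7.0, 8.0]])

# Flat aggregation over all four GPUs gives the identical result.
flat = np.sum(np.vstack([machine0, machine1]), axis=0)
assert np.allclose(two_level_aggregate([machine0, machine1]), flat)
```

With local aggregation, each machine sends one partial sum over the network instead of one gradient per GPU, which is the source of the bandwidth saving.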
5.2 Transformation for MPI
Figure 7 shows graph transformation for MPI communication. It is relatively straightforward compared to the transformation for PS communication because each device carries an individual copy of the global states (i.e., model parameters) and does not access states on other devices. Parallax replicates all operations in the original single-GPU graph and places a replica on each GPU in the resource specification. The transformation is simple because of the homogeneity of all the processes (workers) that participate in training, unlike the PS architecture. To aggregate gradients across devices, AllReduce (or AllGatherv) operations are inserted between the operations that produce gradients via backpropagation and their successors. Parallax uses AllReduce to aggregate the gradients of dense parameters, while it uses AllGatherv to aggregate the gradients of sparse parameters.
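The distinction can be illustrated with a toy simulation of AllGatherv (not actual MPI code): sparse gradients travel as variable-length (indices, values) pairs, so they are concatenated rather than reduced element-wise:

```python
import numpy as np

def allgatherv(per_worker_chunks):
    # Simulate MPI AllGatherv: concatenate variable-length chunks from all
    # workers and hand every worker the full concatenation.
    gathered = np.concatenate(per_worker_chunks)
    return [gathered.copy() for _ in per_worker_chunks]

# Each worker contributes the indices of the embedding rows it touched;
# chunk lengths differ, which rules out a plain fixed-size AllReduce.
idx_chunks = [np.array([2, 7]), np.array([4])]
all_indices = allgatherv(idx_chunks)     # every worker sees [2, 7, 4]
```

Each worker then applies the gathered (indices, values) pairs to its own parameter replica with a scatter-add, keeping the replicas synchronized.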
Table 3: Estimated amount of data transfer for a single GPU (P: parameter size in bytes, N: number of GPUs, α: sparsity).
Parameter Type  PS  MPI
Dense  2P  2P(N-1)/N
Sparse  2αP  2αP(N-1)
5.3 Optimization
5.3.1 Hybrid Communication
As shown in Table 1, PS and MPI communication mechanisms have their strengths. PS communication performs better for sparse models, while MPI communication shows better performance for dense models. This observation motivates the hybridization of the two mechanisms to achieve the best of both worlds. We introduce hybrid communication, a new communication mechanism that utilizes MPI communication for dense parameters and PS communication for sparse parameters.
Table 3 shows the estimated amount of data transfer for a single GPU when a specific communication method is used. For dense parameters, the amount is similar for both mechanisms: 2P bytes for PS and 2P(N-1)/N bytes for MPI, where P is the parameter size in bytes and N is the number of GPUs. In PS communication, each GPU sends and receives 2P bytes of data in a single iteration. This formula is derived from the fact that each GPU reads P bytes of parameters from servers for feedforward and backpropagation and produces P bytes of gradients. On the other hand, each GPU in MPI communication needs the gradients generated by all GPUs to update its parameter replica; thus, each GPU sends P bytes of data to each of the other N-1 GPUs (P(N-1) bytes in total) and receives P(N-1) bytes from the other GPUs. However, employing an efficient ring-based AllReduce algorithm [24] reduces the amount of data transfer to 2P(N-1)/N: each GPU sends and receives P/N bytes over 2(N-1) communication steps, where gradients are aggregated during the first N-1 steps and the aggregated values are broadcast during the next N-1 steps. Since PS and MPI transfer similar amounts of data (2P and 2P(N-1)/N bytes), communication speed depends highly on the optimization of their implementations.
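The dense-parameter accounting can be checked numerically; the helpers below use P for the parameter size in bytes and N for the number of GPUs, the symbols assumed in this section.

```python
def ps_dense_bytes(p):
    # PS: each GPU reads p bytes of parameters from the servers and
    # sends p bytes of gradients back, every iteration.
    return 2 * p

def ring_allreduce_bytes(p, n):
    # Ring AllReduce [24]: each GPU sends and receives p/n bytes over
    # 2(n-1) steps (n-1 aggregation steps, n-1 broadcast steps).
    return 2 * p * (n - 1) / n
```

As N grows, 2P(N-1)/N approaches 2P, which is why the two mechanisms transfer nearly the same amount of dense-parameter data.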
For example, NVIDIA’s library for collective communication (NCCL [3]) provides highly optimized AllReduce with the ring algorithm. NCCL is aware of the topology of hardware, such as PCI and network interfaces, to optimize communication between GPUs and between machines. As a result, it achieves high performance and scales well on dense models such as ResNet50 [16], Inceptionv3 [29], and VGG16 [27], according to a recent study [2]. We also confirmed it with our experiments in Section 7. Thus, Parallax adopts MPI using NCCL for dense parameter communication.
Exchanging sparse parameters using MPI communication requires much more data transfer than PS communication. In PS communication, each GPU sends and receives 2αP bytes of data in a single iteration, where α represents sparsity, the average ratio of activated parameters over all parameters. The calculation of the amount of data transfer in PS is the same as for dense parameters except for the factor α, which appears because only a subset of the sparse parameters (αP bytes) is necessary at each iteration. For MPI communication, AllGatherv is used for sparse parameters. Reducing the amount of data transfer as in AllReduce is not applicable to AllGatherv, because AllGatherv aggregates the gradients of each worker by appending them instead of adding values as AllReduce does. This results in a data transfer of 2αP(N-1) bytes per GPU. Therefore, the amount of data transfer in MPI communication is N-1 times larger than in PS communication. As N becomes large, the difference between the two communication methods becomes significant.
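The sparse-parameter estimates can likewise be checked numerically; α denotes sparsity, P the parameter size in bytes, and N the number of GPUs, so the AllGatherv cost exceeds the PS cost by a factor of N-1.

```python
def ps_sparse_bytes(p, alpha):
    # PS: only the activated subset (alpha * p bytes) is read and its
    # gradients sent back, giving 2 * alpha * p per iteration.
    return 2 * alpha * p

def allgatherv_bytes(p, alpha, n):
    # AllGatherv appends rather than reduces: each GPU sends alpha*p
    # bytes to each of the other n-1 GPUs and receives the same amount.
    return 2 * alpha * p * (n - 1)
```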
Thus, Parallax introduces a hybrid communication mechanism (Figure 8) as an optimization, applying MPI communication to dense parameters and PS communication to sparse parameters: each worker on a GPU holds a replica of the dense parameters, while a server process in each machine manages only the sparse parameters and communicates with workers. If a model contains only dense parameters, as with ResNet50 [16] and Inceptionv3 [29], hybrid communication works the same as MPI communication.
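In TensorFlow, sparse gradients surface as tf.IndexedSlices (produced, for instance, by embedding lookups), so the routing decision reduces to a type check. The sketch below uses a stand-in class to avoid a TensorFlow dependency; the names are illustrative.

```python
class IndexedSlices:
    """Stand-in for tf.IndexedSlices, the sparse-gradient representation
    TensorFlow produces for embedding lookups (illustrative only)."""
    def __init__(self, values, indices):
        self.values, self.indices = values, indices

def route_gradient(grad):
    # Hybrid communication: sparse gradients travel via PS accumulators,
    # dense gradients via NCCL AllReduce.
    return "PS" if isinstance(grad, IndexedSlices) else "AllReduce"
```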
5.3.2 Reducing Communication Overhead
Reducing the amount of data transfer between machines directly impacts communication overhead, thus improving scalability. Parallax employs several communication optimization techniques.
Local aggregation. As mentioned in Section 5.1, Parallax performs gradient aggregation in two phases (Figure 6): aggregation of gradients from multiple GPU devices within a machine, then aggregation of the partially aggregated gradients from multiple machines. This two-phase aggregation helps reduce the amount of data transfer between machines, which costs more than communication within a machine. Note that local aggregation is applied only to sparse parameter gradients, because dense parameter gradients are aggregated by MPI.
Placing operations that occur after aggregation but before parameter update. In the PS architecture, Parallax generally places a set of parameters together with its aggregation operation and update operation on the same server to minimize data transfer between machines. However, a more fine-grained approach is necessary to handle models that contain operations depending on the results of gradient aggregation. At first glance, such operations occur after gradient aggregation, so it may seem logical to place them all on the respective servers where the final global aggregation takes place. Unfortunately, this naive placement may lead to unnecessary data transfer between servers.
Figure 9 shows an example of an operation that occurs after gradient aggregation and how Parallax places the related operations. In this example, gradients are clipped by a global norm value, which is computed as the square root of the sum of the squared L2 norms of all parameter gradients. The clipping operation is used in language models such as LM and NMT to prevent gradient values from growing too large (i.e., exploding) and to avoid divergence.
The placement starts by finding the aggregated gradients (Grads). After finding them, Parallax traverses their descendant operations in a breadth-first manner until it locates the parameter update operations (Update). During this traversal, Parallax assigns each operation it encounters either to a server or to a worker process, depending on whether it takes inputs from multiple processes. Parallax regards operations that receive inputs from multiple processes as shared operations, and operations whose inputs come from a single process as local operations. Local operations are placed in the same processes as their inputs, while shared operations are placed on a single dedicated server, usually the first server.
For example, each L2Norm operation in Figure 9 is a local operation, having a Grads operation as its sole input. Thus, Parallax places the L2Norm operations on the same server processes as the gradients Grads. On the other hand, GlobalNorm depends on multiple L2Norm operations from different servers; it is therefore a shared operation and is placed on the first server, server1.
Meanwhile, the Clip operations each have two inputs: one is a shared operation (GlobalNorm) and the other is a local operation (Grads). In this case, a Clip operation is shared by definition, but we nevertheless assign it to the same process as its local input operation. The reasoning is that although some data transfer for shared operations is unavoidable, we can still minimize the amount of data communicated across processes by putting such operations together with their local inputs. This placement rule improves overall data locality by keeping operations on the same process whenever possible.
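The traversal described above can be sketched as follows. This is a simplified, hypothetical rendition: the graph is a dict from op name to input names, and `placement` initially maps the aggregated gradients to their server processes.

```python
from collections import deque

def place_post_aggregation_ops(graph, placement, grads, first_server="server1"):
    """BFS from the aggregated gradients toward the update ops. An op
    whose placed inputs all live on one process is local and colocated
    there; otherwise it is shared. A shared op is colocated with its
    non-shared inputs when they agree on a process, else it goes to a
    dedicated first server (illustrative sketch, not Parallax's code)."""
    consumers = {}
    for op, inputs in graph.items():
        for inp in inputs:
            consumers.setdefault(inp, []).append(op)
    shared = set()
    queue = deque(grads)
    while queue:
        for nxt in consumers.get(queue.popleft(), []):
            if nxt in placement or any(i not in placement for i in graph[nxt]):
                continue  # already placed, or revisited later via another input
            procs = {placement[i] for i in graph[nxt]}
            if len(procs) == 1:
                placement[nxt] = procs.pop()       # local op: colocate with inputs
            else:
                shared.add(nxt)
                local = {placement[i] for i in graph[nxt] if i not in shared}
                placement[nxt] = local.pop() if len(local) == 1 else first_server
            queue.append(nxt)
    return placement
```

Running this on the Figure 9 topology places each L2Norm with its gradients, GlobalNorm on server1, and each Clip with its local gradient input, matching the rules above.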
Operation Placement Between Servers and Workers. Parallax sets an appropriate graph boundary between servers and workers to minimize data transfer between machines. To start, Parallax considers whether an operation performs gradient aggregation or parameter update; if so, the operation is assigned to a server, while the remaining operations are placed on workers, as in Figure 10(a). After this initial placement, Parallax changes the device placement of some operations to achieve efficient communication. Parallax inspects operations placed at the boundary between workers and servers and relocates them if moving them from one process to another decreases the amount of data transfer between GPUs or between machines. Figure 10 shows one circumstance in which this optimization applies. Initially, the slice operation ([100] → [10]) is assigned to workers; moving it to a server significantly reduces the amount of data transferred to workers. On the other hand, moving the gather operation for computed gradients from workers to a server would increase data transfer, because its input is larger than its output; thus Parallax leaves it on each worker. This optimization rule moves only operations with light computation, such as slice, cast, and gather, because server processes utilize only CPUs; moving heavy computation from a GPU to a CPU can increase computation time, nullifying the optimization effect. Figure 10(b) presents the final result of operation placement in Parallax. With these placement rules, Parallax reduces communication overhead.
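The relocation rule reduces to a predicate on boundary operations, comparing the bytes that cross the worker/server boundary before and after a move. This is a deliberately simplified sketch; Parallax's real decision considers the full transfer paths.

```python
def should_move_to_server(boundary_bytes_now, boundary_bytes_if_moved, lightweight):
    """Move a boundary op from a worker to a server only if doing so
    shrinks the data crossing the worker/server boundary, and only for
    cheap ops, since server processes compute on CPUs."""
    return lightweight and boundary_bytes_if_moved < boundary_bytes_now
```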
6 Implementation
We implemented Parallax on TensorFlow [6] v1.6 with AllReduce and AllGatherv operations in Horovod [2] v0.11.2. We implemented the graph transformation and distributed execution in Python.
Graph transformation. Graph transformation in Parallax consists of inserting gradient aggregation operations and placing operations on specific resources. Placing operations can be done with the tf.device API. However, embedding gradient aggregation requires additional steps, as follows. We first place accumulators on servers to aggregate the gradients of sparse parameters; each accumulator handles the gradients for a set of parameters. When gradients are aggregated in an accumulator, a worker asks each server to read the aggregated gradients from it and update the corresponding parameters. To provide parameter updates as correct as in single-GPU code, Parallax ensures that only one worker, namely the chief worker, initiates the operations for reading aggregated gradients and updating parameters. The other workers wait for these operations initiated by the chief worker to finish; the chief's notification arrives through queues shared among workers.
If the other workers also need the aggregated gradients, to record them during training or to compute a global norm for clipping, Parallax changes the worker-side graphs to read the aggregated gradients from variables where the chief worker saves them temporarily. In the case of local aggregation, Parallax adds additional accumulators in each machine, and one worker in the machine becomes a local chief worker that collects gradients within the machine and sends them to the servers.
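The chief-worker protocol can be illustrated with a toy in-process version, where Python queues stand in for TensorFlow's server-side accumulators and the shared notification queues; all names and the learning-rate value are illustrative.

```python
import queue
import threading

NUM_WORKERS = 3
accumulator = queue.Queue()   # stands in for a server-side gradient accumulator
tokens = [queue.Queue() for _ in range(NUM_WORKERS)]  # chief -> worker signals
params = {"w": 0.0}

def worker(rank, grad):
    accumulator.put(grad)        # every worker contributes its gradient
    if rank == 0:
        # Chief: wait for all gradients, apply a single update, then notify.
        total = sum(accumulator.get() for _ in range(NUM_WORKERS))
        params["w"] -= 0.1 * (total / NUM_WORKERS)
        for q in tokens[1:]:
            q.put("updated")
    else:
        tokens[rank].get()       # block until the chief's update has finished

threads = [threading.Thread(target=worker, args=(r, 1.0))
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only the chief applies the update, and the per-worker token queues give the same "wait until the chief is done" semantics as the shared queues described above.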
In addition, we modified the TensorFlow core to store gradient information, the result of auto-differentiation with respect to model parameters, in the MetaGraphDef protobuf of TensorFlow. The modified MetaGraphDef makes it possible to associate model parameters exactly with their gradients; Parallax uses this information to insert gradient aggregation operations.
Distributed Execution. When a user starts training with Parallax, Parallax spawns server and worker processes via ssh according to the resource specification, which contains machine addresses and GPU identifiers for distributed execution. Parallax transforms the single-GPU graph using the graph transformation rules in each worker process and runs the transformed graph in the given distributed environment.
7 Evaluation
We evaluate Parallax with experiments to answer the following questions:
7.1 Experiment Setup
Cluster Configuration. All the experiments were conducted on a GPU cluster of 8 machines. Each machine is equipped with two 18core Intel Xeon E52695 @ 2.10 GHz processors with 256 GB RAM and 6 NVIDIA GeForce TITAN Xp GPU cards. The machines are connected via Mellanox ConnectX4 cards with 100Gbps InfiniBand. They run Ubuntu 16.04, CUDA 9.0, cuDNN 7, OpenMPI v3.0.0, and NCCL v2.1.
Frameworks. As baselines, we selected TensorFlow v1.6 as a representative DL framework for PS communication, and Horovod [2] v0.11.2 on TensorFlow for MPI communication. Throughout this section, TFPS denotes TensorFlow with PS communication. We let Horovod use NCCL for AllReduce because NCCL is highly optimized for communication between GPUs compared to OpenMPI. However, we necessarily use OpenMPI for AllGatherv, which is not supported in NCCL.
Models and Datasets. We trained two well-known CNN models and two RNN models in our experiments. The CNN models, ResNet50 [16] and Inceptionv3 [29], are trained on the ImageNet (ILSVRC 2012) [26] dataset, which has 1.28M training images and 50K validation images in 1000 categories. One RNN model is LM [17], a language model that learns a probability distribution over sequences of words in a language; it consists of a single layer of 2048 LSTM cells projected to a 512-dimensional embedding. We trained the LM model on the One Billion Word Benchmark [8], which includes one billion words with a vocabulary of 800K words. The NMT [32] model is a widely used RNN model for machine translation; we used 4-layer LSTMs of 1024 units with a bidirectional encoder and 1024-dimensional embeddings, trained on the WMT German-English dataset [4]. As described in Table 1, the CNN models are dense models, which consist of only dense parameters, while the RNN models are sparse models, which contain both dense and sparse parameters. The batch size per GPU is 64 for ResNet50 and Inceptionv3, and 128 for LM and NMT.
7.2 Model Convergence
Parallax performs automatic graph transformation correctly, making the transformed graph converge in a distributed environment. To check correctness, we trained the ResNet50 and LM models on Parallax, TFPS, and Horovod, and compared our accuracy results with those reported in prior work [13, 17].
Figure 11 shows the convergence graphs of the ResNet50 and LM models. Both models are trained synchronously, using 8 machines with 48 GPUs for ResNet50 and 6 machines with 36 GPUs for LM.
Figure 11(a) shows the top-1 validation error plot for ResNet50. We confirmed that the final validation error after the model is trained for 90 epochs is similar to the one reported by Goyal et al. [13]. The final top-1 error in Parallax is 24.03%, slightly worse than the reported 23.74% because of the different gamma initialization of batch normalization for the last convolution layer in a residual block. TFPS and Horovod achieve 23.85% and 23.88% top-1 errors, respectively.
Figure 11(b) shows the perplexity plot for LM. Perplexity is computed from the average per-word log-probability on the test dataset. Parallax successfully makes the LM model converge after 5 epochs with a perplexity of 47.39, which is better than the 47.5 reported by Jozefowicz et al. [17]. TFPS and Horovod reached perplexities of 47.43 and 47.05, respectively. The difference comes from the fact that we used a different learning rate and maximum gradient norm, and that we used synchronous SGD with 36 GPUs while Jozefowicz et al. used asynchronous SGD with 32 GPUs.
Figure 11 also shows the convergence time of Parallax, TFPS, and Horovod. Parallax is faster than TFPS and Horovod thanks to its optimization techniques, which we evaluate in depth next. To train ResNet50 for 90 epochs, TFPS is 1.4x slower than Parallax and Horovod is 1.1x slower than Parallax. To train LM for 5 epochs, TFPS is 2.8x slower than Parallax and Horovod is 6.3x slower than Parallax.
7.3 Performance and Scalability
Next, we show the performance of Parallax by comparing its training throughput against those of TFPS and Horovod. We also evaluate the scalability of Parallax as we increase the number of GPUs.
Training Throughput. Figure 12 shows the training throughput of Parallax, TFPS, and Horovod. According to Figures 12(a) and 12(b), for the dense models Horovod achieves slightly higher throughput than TFPS because NCCL in Horovod is optimized for AllReduce. Parallax achieves throughput similar to Horovod because Parallax also utilizes NCCL AllReduce through Horovod for the dense models. Even though Parallax gains little performance over Horovod for these models, Parallax reduces users' burden of creating a distributed version through its automatic graph transformation.
In contrast to the dense models, the three frameworks show significant performance differences for the sparse models. Figures 12(c) and 12(d) show the training throughput for LM and NMT. Parallax is up to 2.7 times faster than TFPS for LM and up to 2.2 times faster for NMT. Parallax shows the best performance because of its optimization techniques, including hybrid communication. We discuss the effects of the individual optimization techniques in Section 7.4.
Scalability of Parallax. Figure 13 presents the scalability of Parallax for the four models. We define normalized throughput as the ratio of the throughput to the throughput obtained with a single GPU. For ResNet50 and Inceptionv3, Parallax scales out well, achieving 39.8x and 43.6x speedups on 48 GPUs. LM and NMT scale worse: the normalized throughput with 48 GPUs is 9.4 for LM and 17.1 for NMT. These models stress communication more than ResNet50 and Inceptionv3 because their computation is lighter, so they spend less time computing relative to communicating. Furthermore, NMT has a large number of parameters: the number of parameters exchanged per GPU is 101 million for NMT, versus 26 million for ResNet50 and 24 million for Inceptionv3.
7.4 Optimization Effect
To analyze how each optimization technique discussed in Section 5.3 affects the training throughput of Parallax, we cumulatively enabled the optimization techniques and measured the training throughput of LM and NMT, as shown in Table 4. We used 8 machines with 48 GPUs for these experiments. We focus on LM and NMT since the optimization techniques are effective when PS communication is utilized. Specifically, we analyze the throughput in the following cases:

BASE: utilizes PS communication without any optimization.

+HYB: enables hybrid communication.

+LA: enables local aggregation optimization.

+OPAU: enables placing operations that occur after aggregation but before parameter update.

+OPSW: enables operation placement between servers and workers.
Model  BASE  +HYB  +LA  +OPAU  +OPSW
LM  101k  112k  251k  259k  276k
NMT  87k  117k  188k  188k  191k
After all the optimizations are enabled (+OPSW), the training throughput of the models is improved by 2.5 times on average compared to the baseline (BASE), which uses PS communication without any optimization. Since PS outperforms MPI for LM and NMT, we used TFPS as the baseline of this evaluation to show the effectiveness of hybrid communication.
+HYB improves the throughput of BASE by 10.8% for LM and 33.7% for NMT, since MPI communication handles the dense parameters. +LA is the most effective optimization technique: it enhances the performance of +HYB by 2.3x for LM and 1.6x for NMT. +OPAU is slightly effective (3% speedup) only for the LM model, because it moves an elementwise multiplication operation on aggregated gradients from workers to servers, removing the data transfer needed to send the aggregated gradients to all the workers. +OPSW changes the placement of slice operations in the LM model; the inputs of the slice operations are located on servers, so moving the operations to the same servers as their inputs results in a 6% speedup.
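The reported speedups follow from the throughput numbers in Table 4. The snippet below uses the rounded values from the table (in thousands), so the computed ratios can differ slightly from the percentages quoted in the text.

```python
# Throughput from Table 4, in thousands (units as reported there).
lm  = {"BASE": 101, "+HYB": 112, "+LA": 251, "+OPAU": 259, "+OPSW": 276}
nmt = {"BASE": 87,  "+HYB": 117, "+LA": 188, "+OPAU": 188, "+OPSW": 191}

def speedup(table, before, after):
    """Ratio of throughput after enabling an optimization to before."""
    return table[after] / table[before]
```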
8 Related Work
Existing DL frameworks, such as TensorFlow [6] and PyTorch [23], support data-parallel training with multiple GPUs and machines. However, they have limitations.
For example, TensorFlow provides data-parallelization APIs such as SyncReplicasOptimizer, replica_device_setter, MonitoredTrainingSession, and Server. These APIs require additional modifications when converting a single-GPU graph to a distributed one, and users are still responsible for debugging if distributed training does not work correctly or efficiently. To handle this issue, TensorFlow is currently adding the Estimator API, which automatically runs single-GPU code on multiple GPUs in a machine with few extra modifications beyond separating the input preprocessing code into a function; however, its functionality is still limited to GPUs within a single machine. MXNet [10] supports data-parallel training using a distributed key-value store for data synchronization between machines and GPUs; users modify single-GPU code to pull parameters from and push gradients to the store. The key-value store improves communication efficiency by offloading some computation from workers to servers, but the optimization is limited. Utilizing PyTorch [23] for distributed training is much more complicated than TensorFlow and MXNet: PyTorch users have to construct a communication group, average gradients, and add aggregation methods. Horovod [2] reduces users' effort in TensorFlow and PyTorch by adding MPI communication to both frameworks; Parallax also uses Horovod's MPI operators for TensorFlow due to their simplicity. However, Horovod works only with MPI and cannot take advantage of PS communication for sparse parameters as Parallax does.
Recent work has explored combining PS and MPI-style communication [18, 33]. MXNETMPI [18] divides GPUs into multiple groups; each group of GPUs communicates internally using MPI, independently of the other groups, while the groups are synchronized using PS. For this communication mechanism, the work introduces a new MPI Elastic SGD algorithm, which allows synchronous SGD within an MPI group and asynchronous SGD across groups, mitigating network contention in synchronous SGD and the staleness problem in asynchronous training. However, it introduces a new hyperparameter, the number of GPUs in an MPI group, and tuning the learning rate or batch size also depends on this new hyperparameter. As a result, tuning hyperparameters is much more challenging compared to the synchronous training used in Parallax. Poseidon [33] combines PS with sufficient factor broadcasting (SFB) communication, which uses peer-to-peer connections between workers, similar to MPI. SFB communicates sufficient factors of the gradient matrix for fully connected (FC) layers in neural network models to reduce communication overhead. Poseidon exchanges most gradients with PS communication and handles only the gradients of FC layers with SFB. While Poseidon's SFB optimization applies only to FC layers, Parallax's hybrid communication applies to any model that has both sparse and dense parameters.
9 Conclusion
We presented Parallax, a tool that automatically parallelizes deep neural network training with data parallelism in distributed environments. We discussed the correctness and performance challenges that arise when performing distributed training from a single-GPU model. Parallax tackles these problems by automatically transforming a single-GPU graph into a graph for distributed execution, and by optimizing the execution of the transformed graph via hybrid communication, smart operation placement, and local aggregation. The experiments show that Parallax achieves correct and efficient data-parallel training of deep neural networks. We open-sourced Parallax in the hope of helping users take advantage of data-parallel training; Parallax is publicly available at https://github.com/snuspl/parallax.
References
 [1] Caffe2. https://caffe2.ai.
 [2] Horovod. https://github.com/uber/horovod.
 [3] NCCL. https://developer.nvidia.com/nccl.
 [4] WMT. http://www.statmt.org/wmt14.
 [5] XLA. https://www.tensorflow.org/performance/xla/.
 [6] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for largescale machine learning. In OSDI, 2016.
 [7] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU math compiler in Python.
 [8] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google, 2013.
 [9] J. Chen, R. Monga, S. Bengio, and R. Józefowicz. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016.
 [10] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 [11] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, 2014.
 [12] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [14] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the messagepassing interface, volume 1. MIT press, 1999.
 [15] S. Gupta, W. Zhang, and F. Wang. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In ICDM, pages 171–180. IEEE, 2016.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [17] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410v2, 2016.
 [18] A. R. Mamidala, G. Kollias, C. Ward, and F. Artico. MXNETMPI: Embedding MPI parallelism in parameter server task model for scaling deep learning. arXiv preprint arXiv:1801.03855, 2018.
 [19] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed precision training. In ICLR, 2018.
 [20] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra, S. Swayamdipta, and P. Yin. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017.
 [21] G. Neubig, Y. Goldberg, and C. Dyer. Onthefly operation batching in dynamic computation graphs. In Advances in Neural Information Processing Systems, pages 3974–3984, 2017.
 [22] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. CoRR, abs/1211.5063, 2012.
 [23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [24] P. Patarasuk and X. Yuan. Bandwidth optimal allreduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117–124, 2009.
 [25] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [28] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(12):703–710, 2010.
 [29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [30] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a nextgeneration open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twentyninth annual conference on neural information processing systems (NIPS), volume 5, 2015.
 [31] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 [32] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 [33] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 181–193, Santa Clara, CA, 2017. USENIX Association.
 [34] W. Zhang, S. Gupta, X. Lian, and J. Liu. Stalenessaware asyncsgd for distributed deep learning. arXiv preprint arXiv:1511.05950, 2015.