HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
The enormous amount of data and computation required to train DNNs has led to the rise of various parallelization strategies. Broadly, there are two strategies: 1) Data-Parallelism: replicating the DNN on multiple processes and training on different training samples, and 2) Model-Parallelism: dividing elements of the DNN itself into partitions across different processes. While data-parallelism has been extensively studied and developed, model-parallelism has received less attention as it is non-trivial to split the model across processes. In this paper, we propose HyPar-Flow: a framework for scalable and user-transparent parallel training of very large DNNs (up to 5,000 layers). We exploit TensorFlow’s Eager Execution features and Keras APIs for model definition and distribution. HyPar-Flow exposes a simple API to offer data, model, and hybrid (model + data) parallel training for models defined using the Keras API. Under the hood, we introduce MPI communication primitives like send and recv on layer boundaries for data exchange between model-partitions and allreduce for gradient exchange across model-replicas. Our proposed designs in HyPar-Flow offer up to 3.1× speedup over sequential training for ResNet-110 and up to 1.6× speedup over Horovod-based data-parallel training for ResNet-1001, a model that has 1,001 layers and 30 million parameters. We provide an in-depth performance characterization of the HyPar-Flow framework on multiple HPC systems with diverse CPU architectures including Intel Xeon(s) and AMD EPYC. HyPar-Flow provides 110× speedup on 128 nodes of the Stampede2 cluster at TACC for hybrid-parallel training of ResNet-1001.
1. Introduction and Motivation
Recent advances in Machine/Deep Learning (ML/DL) techniques have triggered key success stories in many application domains like Computer Vision, Speech Comprehension and Recognition, and Natural Language Processing. Large-scale Deep Neural Networks (DNNs) are at the core of these state-of-the-art AI technologies, and have been the primary drivers of this success. In a very simplified manner, DNNs can be considered as complicated stacks of non-linear mathematical functions that map an input ‘x’ to an output ‘y’ such that y = f(x), where ‘f’ is the function (or rules) being learnt during the training phase and applied during the inference/prediction phase. However, the problem of training the DNN (learning the function ‘f’) for complicated DNN architectures and many training examples (data) is compute-intensive and can take weeks to months to achieve state-of-the-art prediction capabilities (accuracy). Designing deeper DNNs has emerged as a common strategy to achieve higher accuracy (He et al., 2015, 2016; ker, 2019).
These requirements have led researchers to resort to a simple but powerful approach called Data-Parallelism (cf. Section 2.2) to achieve shorter training times. This has resulted in various research studies (Awan et al., 2017; You et al., 2017; Goyal et al., 2017) and production-grade ML/DL software like TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017). Data-Parallel training replicates the DNN (model) on multiple processes (CPUs and/or GPUs) but uses different partitions of the training data (Awan et al., 2017; Sergeev and Del Balso, 2018; Goyal et al., 2017; Jia et al., 2018a). However, data-parallelism has three fundamental limitations: 1) Training the model to meet the accuracy of sequential training needs an extensive hyperparameter (batch size, learning rate, etc.) search, which itself is a compute-intensive task, 2) Data-Parallel training has a synchronization (allreduce) overhead that increases linearly with respect to the number of processes (Huang et al., 2018; Harlap et al., 2018; Dryden et al., 2019), and 3) All DNN related data has to fit inside the device (CPU/GPU) memory. If the DNN cannot fit inside the device’s memory, the DNN cannot be trained and is referred to as an Out-of-core model (Rhu et al., 2016; Awan et al., 2018; Markthub et al., 2018). Figure 1 highlights how memory consumption due to larger images and DNN depth limits the compute platforms that can be used for training; e.g. ResNet-1k (He et al., 2016) with the smallest possible batch-size of one (a single 224×224 image) needs 16.8 GB memory, and so cannot be trained on a 16 GB Pascal GPU. Similarly, ResNet-1k on image size 720×720 needs 153 GB of memory and hence is not trainable on any platform except the CPUs that have 192 GB memory (Skylake in Figure 1).
Unlike Data-Parallel and Out-of-core training, a different strategy called Model-Parallelism (Krizhevsky, 2014; Ben-Nun and Hoefler, 2018) splits the DNN architecture itself into multiple partitions across different processes. (Model-Parallelism and Layer-parallelism are equivalent terms when the smallest split of a model is a layer.) However, little exists in the literature about model-parallelism for state-of-the-art DNNs like ResNet(s). Significant challenges exist in exploiting model-parallelism because the burden of partitioning the model is on the DNN designer, who is most likely a domain expert dealing with mathematical intuitions to design a model with better prediction capabilities for their use-case. Such an approach would also lead to very low productivity for the DNN designer since they may not be systems/HPC experts. Thus, there is a need for a user-transparent model-parallelism system that can automatically partition the model across multiple processes without any changes to the model definition as well as to the training process itself. Such a system will enable high-performance and high-productivity for DNN designers, which is currently not supported by existing frameworks.
The key challenge that we address in this paper is: How can we design a scalable and easy-to-use infrastructure for model, data, and hybrid-parallel training of DNNs that enables designers to 1) develop new type of DNNs without any restriction on a DNN’s memory consumption and 2) train existing models with better performance even for small batch sizes? Along with the aforementioned broad challenge, we tackle the following concrete challenges in this paper:
What are the characteristics and features of a DNN that make them amenable to either model, data, or hybrid-parallelism?
Can model-parallelism be made as simple to use as the current set of data-parallelism approaches like Horovod (Sergeev and Del Balso, 2018)?
How can we propose simple and unified APIs for parallel training that support multiple parallelization strategies (data, model, and hybrid)?
How can we effectively deal with communication across multiple processes that operate on different parts of a DNN in forward and backward passes?
How can we design efficient communication schemes when model and data-parallelism are combined, especially for complex models like ResNet(s) (He et al., 2015) with non-consecutive layer connections?
1.2. Proposed Approach
To address these challenges, we propose HyPar-Flow: a unified system to perform model, data, and hybrid-parallel training using a simple interface that does not require any model-definition changes and/or manual partitioning of the workload. HyPar-Flow’s easy-to-use hybrid-parallelism support is illustrated in Figure 2.
The user provides only four inputs to HyPar-Flow: 1) A model defined using the Keras API, 2) Number of model partitions, 3) Number of model replicas, and 4) Strategy (data, model, or hybrid). In the example illustrated in Figure 2, the user is providing a 5-layer ResNet-like model, three partitions, three replicas, and hybrid as the parallelization strategy. HyPar-Flow automatically generates a hybrid-parallel version of this model split across three partitions and three model replicas. The communication between model partitions is realized using send() and recv() whereas allreduce will be used to aggregate gradients across model replicas. This design of HyPar-Flow will enable the DNN architect to focus only on the science and design of a DNN without spending time on system related challenges like model partitioning, placement of partitions and replicas on cores and nodes. Design details of HyPar-Flow are further discussed in Section 5.
Broadly, our proposed solution is both model-size as well as model-type agnostic. We achieve this by exploiting a) Keras model definitions, b) TensorFlow Eager Execution (cf. Section 2.3), c) decomposition of a DNN for model, data, and hybrid parallelism, and d) a custom distributed-training loop. To the best of our knowledge, there are very few studies that focus on hybrid-parallel training of large DNNs, especially using TensorFlow and Keras in a user-transparent manner for HPC environments where the Message Passing Interface (MPI) (Message Passing Interface Forum, [n. d.]) is a dominant programming model. The key value propositions of this work are: 1) Our proposed HyPar-Flow framework enables design and training of infinitely large (cf. Section 8) models and 2) allows training of DNNs that deal with larger/real-world image sizes in addition to the smaller image sizes that are commonly used. We make the following key contributions in this paper:
Analyze various TensorFlow-specific APIs and execution models, and highlight why Keras Model definition APIs and custom training loops using the GradientTape feature are well suited for realizing user-transparent hybrid-parallelism.
Propose and design HyPar-Flow to enable parallel training of any Keras model (with consecutive as well as non-consecutive layer connections (Ben-Nun and Hoefler, 2018)) on multiple processes under any parallelization strategy, i.e. data, model, and hybrid.
Thoroughly stress test the HyPar-Flow framework by training and verifying the accuracy of the models trained using HyPar-Flow.
Demonstrate HyPar-Flow’s performance benefits for models like VGG-16, ResNet-110, and ResNet-1001: 1) Up to 3.1× speedup over sequential training for ResNet-110, 2) up to 1.6× speedup for ResNet-1001 over data-parallel training, and 3) 110× speedup over single-node for hybrid-parallel training of ResNet-1001 on 128 nodes.
Provide initial performance trends for next-generation models like ResNet-5000 (cf. Section 8), and beyond.
We provide the necessary background in this section including a discussion on DNN training, parallelization schemes for parallel training, and TensorFlow’s Eager Execution and Keras. Expert readers can skip this section and directly go to Section 3.
2.1. DNN Training
A DNN consists of different types of layers such as convolutions (conv), fully-connected or dense (FC), pooling, etc. DNNs are usually trained using a labeled dataset. A full pass over this dataset is called an epoch of training. Training itself is an iterative process and each iteration happens in two broad phases: 1) Forward pass over all the layers and 2) Back-propagation of loss (or error) in the reverse order. The end goal of DNN training is to obtain a model that has good prediction capabilities (accuracy is the generic term to refer to these). In order to reach the desired/target accuracy in the fastest possible time, the training process itself needs to be efficient. In this context, the total training time is a product of two metrics: 1) number of epochs required to reach the target accuracy and 2) the time required for one epoch of training. We note that the HPC community uses the terms weak scaling and strong scaling that can create confusion if used for parallel training. Unlike scientific simulations, data-parallel training is performed by increasing the effective batch size yet keeping it constant per process. This means that the work done per process before synchronization remains the same as more nodes are used. This can loosely be called weak scaling. However, as the target accuracy is tied to a fixed number of epochs for synchronous parallel training, which defines the total work, it can be considered strong scaling as the overall work is still divided across nodes and remains constant. Because of these two peculiarities, the terms weak and strong scaling cannot be directly applied to parallel training.
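The relationship between per-process batch size, effective batch size, and total training time described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the example numbers (a 50,000-sample dataset, a per-process batch of 32) are assumptions, not values from this paper.

```python
def effective_batch_size(per_process_batch, num_processes):
    """Data-parallel training keeps the per-process batch fixed,
    so the effective (global) batch grows with the process count."""
    return per_process_batch * num_processes


def iterations_per_epoch(dataset_size, per_process_batch, num_processes):
    """Number of synchronized iterations needed for one full pass over the data."""
    global_batch = effective_batch_size(per_process_batch, num_processes)
    return -(-dataset_size // global_batch)  # ceiling division


def total_training_time(epochs_to_target, seconds_per_epoch):
    """Total time = (epochs to reach the target accuracy) x (time per epoch)."""
    return epochs_to_target * seconds_per_epoch


# Hypothetical example: the work per process per iteration stays the same
# ("weak-scaling-like"), while the work per epoch is divided across
# processes ("strong-scaling-like").
single = iterations_per_epoch(50_000, 32, 1)
eight = iterations_per_epoch(50_000, 32, 8)
print(single, eight)
```

With more processes, each epoch needs fewer synchronized iterations, yet each process still handles the same batch per iteration, which is why neither scaling term fits cleanly.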
|Features and Supported Platforms|
|AlexNet (Krizhevsky et al., 2012; Krizhevsky, 2014)||✕||✔||CUDA||✕||✕||✕|
|MXNet-MP (mxn, 2019)||✕||Unknown||MPI||✔||✔||✕|
|LBANN (Dryden et al., 2019)||✔||✔||MPI/Aluminum||✕||✕||✕|
|Mesh TensorFlow (Shazeer et al., 2018)||✕||✔||MPI||✔||✕||✕|
|Gpipe (Huang et al., 2018)||✕||✕||gRPC/TF||✔||✕||Unknown|
|PipeDream (Harlap et al., 2018)||✕||✔||ZeroMQ||Unknown||✕||✕|
|FlexFlow (Jia et al., 2018b)||✔||✔||Legion/GASNet||✔||✕||✕|
2.2. Parallelization Schemes for DNN Training
Data-Parallel training runs the complete DNN model over multiple GPUs participating in the training. The training dataset is partitioned across multiple processes. Since the model replicas on each of the processes train on different partitions of data, the weights (also called parameters) need to be synchronized among replicas by averaging gradients across processes. This synchronization is performed using either a collective communication primitive like allreduce or by using parameter servers. The synchronization of weights is done at the end of every batch, and is referred to as synchronous parallel in this paper. Most state-of-the-art papers and studies have achieved better training accuracy as well as shorter overall training time via the synchronous parallel approach. This kind of communication introduces stalls as all the replicas have to wait for the synchronization step to complete before moving to the next iteration. Asynchronous parallel training, on the other hand, appears to proceed very fast because there is little to no synchronization, but it does not converge (in terms of accuracy) as well as the synchronous version and needs several more epochs. Thus, most researchers in the community have shifted their focus to the synchronous parallel approach only.
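As a minimal sketch of the synchronous scheme described above, the following pure-Python example simulates two replicas that compute gradients on different data shards and then average them. The `allreduce_mean` function is a stand-in for a real allreduce collective, and the one-parameter model, the data, and all names are assumptions made for illustration.

```python
def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an allreduce: every replica receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def synchronous_step(weights, shards, lr):
    """One synchronous data-parallel iteration across all replicas."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = allreduce_mean(grads)  # replicas stall here until all arrive
    return [w - lr * g for w, g in zip(weights, synced)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]   # each replica sees a different partition
weights = [0.0, 0.0]            # replicas start from identical weights
weights = synchronous_step(weights, shards, lr=0.1)
print(weights)
```

Because the shards are equally sized, the averaged gradient equals the gradient over the combined batch, so every replica applies the same update and the replicas stay identical.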
2.2.2. Model and Hybrid-Parallelism
Data-Parallelism works for models that can fit completely inside the memory of a single GPU. But as model sizes have grown, the model designers have pursued aggressive strategies to make them fit inside a GPU’s memory, which is a precious resource even on the latest Volta GPU (32 GB). This problem is less pronounced for CPU-based training as the amount of CPU memory is significantly higher (192 GB) on the latest generation CPUs. Nevertheless, model-parallelism alleviates this memory bound and designers can come up with new models without being restricted to a single CPU or GPU’s physical memory. The entire model is partitioned and each process is responsible only for part (e.g. a layer or some layers) of the DNN. Model-parallelism can be combined with data-parallelism as well, which we refer to as hybrid-parallelism in this paper.
2.3. TensorFlow Eager, GradientTape, and Keras
TensorFlow’s original Graph execution model has now been deprecated (tf-, 2019e, d) in favor of a concept called Eager Execution (Agrawal et al., 2019). Other frameworks like PyTorch (Paszke et al., 2017) and Chainer (cha, 2019) are also eager execution frameworks. Eager execution is a very recent change to TensorFlow and is a fundamental shift in the way programmers express TF programs. The main motivation is the ease of debugging the entire pipeline from model definition, to training, and finally to saving the trained model to a persistent storage system. We exploit capabilities offered by eager execution to our advantage as we need to have fine-grained control over the gradient calculation and communication across model-partitions for our model-parallel design. The first benefit of eager execution that we use is the ability to write the forward pass of DNN training imperatively and to acquire gradients without running any sessions, which is crucial for debugging and control. The second important feature that we exploit is called tf.GradientTape, which provides the gradient of a computation with respect to its input variables, a technique also called automatic differentiation (tf-, 2019a). By using tf.GradientTape and the tape.gradient() function, we calculate partial errors (cf. Section 6.2–Equation 6) that need to be sent to a remote model-partition to correctly implement back-propagation for our model-parallel design. The third requirement for user-transparent model-parallel software is to exploit a simple yet robust model definition API. To this end, Keras (ker, 2019) provides a very easy-to-use high-level API for model definition and model training. The Keras API has been implemented by the TensorFlow team and is now integrated with TensorFlow APIs under tf.keras.
3. The Design Space for Parallel Training Frameworks
Studies on data, model, and hybrid-parallelism and their associated features are summarized in Table 1. Ben-Nun and Hoefler provide a comprehensive survey of distributed DL in (Ben-Nun and Hoefler, 2018). Alex Krizhevsky introduced model-parallelism on GPUs in (Krizhevsky, 2014) using a single-tower design that used data-parallelism in convolutional layers but model-parallelism (MP) in fully-connected layers. Simulation-based results about various parallelization strategies are presented in (Gholami et al., 2018). The LBANN team presented model-parallel solutions including support for distributed linear algebra operations as well as spatial convolutions split across nodes in (Dryden et al., 2019). However, model-parallelism in LBANN is not yet publicly available so we cannot evaluate or comment on its performance. GPipe (Huang et al., 2018) enables the training of extremely large models like AmoebaNet (Real et al., 2018), which has 557 million parameters. GPipe is publicly available as part of the Lingvo (lin, 2019) framework but has no examples and/or documentation to train models like ResNet(s) with MP support. Thus, we cannot offer any performance comparisons for GPipe. Unlike GPipe, we see benefits of model-parallelism over data-parallelism for VGG-19, ResNet-110, and ResNet-1001. FlexFlow (Jia et al., 2018b) is a recent system that searches parallelization strategies using simulation algorithms and proposes different dimensions of parallelism in DNNs. FlexFlow uses Legion (Bauer et al., 2012) for communication within the node and GASNet across nodes. Unfortunately, FlexFlow only works on GPUs and only provides ResNet-121 publicly so we could not compare its performance either. Mesh-TensorFlow (MTF) (tf-, 2019b; Shazeer et al., 2018) is a language for distributed DL with emphasis on tensors distributed across a processor mesh. MTF only works with the older TF APIs (sessions, graphs, etc.). Furthermore, the level at which MTF distributes work is much lower compared to HyPar-Flow, i.e., tensors vs. layers. Users of MTF need to re-write their entire model to be compatible with MTF APIs. Unlike MTF, HyPar-Flow works on existing Keras models without any changes.
4. Challenges in Realizing Model and Hybrid-Parallelism
The biggest challenge in designing a unified system like HyPar-Flow is the complexity of the overall DNN training process and how it is realized differently in software frameworks like TensorFlow, PyTorch, and several others. The fragmentation and quick evolution of APIs in such frameworks further complicate the research and development process. Specifically, TensorFlow is a prime example of rapid progress and innovation that has left several libraries and software packages outdated because they were built on top of its Graph-based design and Session-oriented execution model. To design a framework like HyPar-Flow, a thorough analysis of the trends in the ML community is needed. This will enable us to make appropriate design choices. Some specific open questions for HyPar-Flow and similar systems are discussed in the following sections.
4.1. How to design a system which unifies sequential, model-parallel, data-parallel, and hybrid-parallel training?
The primary challenge is to investigate TensorFlow-specific APIs which can be used to realize a unified DNN training system. In this context, the design analysis of APIs and Execution Models like Eager Execution vs. Graph Execution are fundamental. Similarly, model definition APIs like TensorFlow Estimators and TensorFlow’s Keras implementation will also influence the design of systems like HyPar-Flow. Furthermore, we also need to investigate the performance trends and reproducibility of different training strategies in a fair and easy-to-use manner. The main requirement from an API’s perspective is to find out the right granularity offered by the API that allows us to split the model across different processes with little to no user involvement. Unlike other APIs in TensorFlow, the Keras API provides us exactly this capability via the tf.keras.Model objects as shown in Listing 1.
4.2. How to realize back-propagation algorithm across multiple model-partitions?
The complexity of model-parallelism lies in the backward propagation of loss and implementing the back-propagation algorithm, which is the crucial stage of DNN training. Data-Parallelism, on the other hand, is easy to implement as no modification is required for the back-propagation of loss (error) in the backward pass. We need to investigate methods and framework-specific functionalities that enable us to implement the back-propagation algorithm efficiently.
4.3. How to design communication for hybrid-parallel training?
With the advent of the ResNet (He et al., 2015) architecture, DNNs have evolved from a linear representation to a more complex graph with several types of skip connections (shortcuts) like identity connections, convolution connections, etc. For hybrid-parallelism to work, we need to realize communication between processes in a transparent fashion. In essence, we need to design a distributed back-propagation system, which embeds communication primitives like send, recv, and allreduce for exchanging partial error terms, gradient, and/or activations during the forward and backward passes. For skip connections, maintaining layer as well as model-partition dependencies is required to ensure deadlock-free communication across processes.
4.4. How to achieve performance for model-parallel training?
Even though model-parallelism looks very promising and intuitive, it is unclear if it can offer performance comparable to the data-parallel approach. To achieve performance, we need to investigate if widely-used and important HPC techniques such as the efficient placement of processes on CPU cores, pipelining via batch splitting, and overlap of computation and communication can be exploited for model-parallel training. Naive model-parallelism will certainly suffer from under-utilization of resources due to stalls caused by the sequential nature of DNN training. It is thus non-trivial to overcome these stalls and design an efficient system.
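To see why naive model-parallelism stalls and how pipelining via batch splitting can help, consider an idealized timing model. The sketch below assumes equal compute time on every partition, zero communication cost, and a forward pass only; these assumptions, the function names, and the example numbers are illustrative and are not measurements from HyPar-Flow.

```python
def naive_forward_time(num_partitions, stage_time_full_batch):
    """Without batch splitting, only one partition is busy at a time,
    so the batch spends one full stage time on each partition in turn."""
    return num_partitions * stage_time_full_batch


def pipelined_forward_time(num_partitions, num_micro_batches, stage_time_full_batch):
    """With the batch split into micro-batches, partitions work concurrently.
    The pipeline needs (partitions + micro_batches - 1) steps, each lasting
    one micro-batch stage time."""
    per_micro = stage_time_full_batch / num_micro_batches
    return (num_partitions + num_micro_batches - 1) * per_micro


def utilization(num_partitions, num_micro_batches):
    """Fraction of partition-time slots doing useful work in the pipeline."""
    return num_micro_batches / (num_partitions + num_micro_batches - 1)


print(naive_forward_time(4, 1.0))         # 4 partitions, no splitting
print(pipelined_forward_time(4, 4, 1.0))  # same batch split into 4 pieces
print(utilization(4, 4))
```

Under this model, a single micro-batch reproduces the naive case, and more micro-batches shrink the idle "bubble" at the cost of smaller per-step work.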
5. Overview of HyPar-Flow (HF)
To tackle the challenges discussed in Section 4 and realize HyPar-Flow efficiently, we analyzed various design choices, implemented major components of HyPar-Flow, and characterized the performance of several state-of-the-art as well as possible future models across different compute platforms and communication libraries.
Figure 3 depicts the role of HyPar-Flow in the execution stack. HyPar-Flow sits between the higher level ML frameworks like TensorFlow and communication runtimes like MPI that work directly on top of HPC hardware.
To train a model, the designer (user) needs to provide only four input variables: 1) DNN (model) definition in Keras format, 2) Number of partitions, 3) Number of Replicas, and 4) Strategy (model, data, or hybrid). Inside the HF class, we instantiate the Model Generator to create a hybrid-parallel version of the model. We then utilize the Trainer and Communication Engine to train the model across multiple processes in an efficient manner. For expert users, we also allow an additional input we call LPP. LPP stands for Layers Per Partition, and is a simple array with one entry per partition (for MP), where each entry gives the number of layers assigned to that partition. This additional input is optional and is only for experts who already understand their system and model characteristics. It can also be a good knob for designers who want to experiment and benchmark their models as well as the HyPar-Flow system. The use of LPP is shown in Listing 2.
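The effect of an LPP array can be sketched independently of HyPar-Flow itself. The helper names below (`even_lpp`, `split_layers`) are hypothetical and not part of the HyPar-Flow API; the even split used as a default is an assumption, since the default load-balancing policy is not specified here.

```python
def even_lpp(num_layers, num_partitions):
    """Assumed default: spread layers as evenly as possible,
    giving the first partitions one extra layer when needed."""
    base, extra = divmod(num_layers, num_partitions)
    return [base + 1 if i < extra else base for i in range(num_partitions)]


def split_layers(layers, lpp):
    """Assign consecutive layers to partitions according to an LPP array."""
    if any(n <= 0 for n in lpp):
        raise ValueError("every partition needs at least one layer")
    if sum(lpp) != len(layers):
        raise ValueError("LPP entries must sum to the number of layers")
    partitions, start = [], 0
    for n in lpp:
        partitions.append(layers[start:start + n])
        start += n
    return partitions


layers = ["conv1", "conv2", "conv3", "pool", "dense"]
print(split_layers(layers, even_lpp(len(layers), 3)))  # automatic split
print(split_layers(layers, [1, 3, 1]))                 # expert-provided LPP
```

An expert might choose an uneven LPP such as `[1, 3, 1]` when some layers are known to be much cheaper than others.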
5.2. HyPar-Flow API for User-transparent Parallel Training
We propose and develop an easy-to-use API for HyPar-Flow that allows any Python programmer to import the library and use it for parallel training with no changes required to model definition as shown in Listing 2.
5.3. Realizing Hybrid-Parallelism
Model and data-parallelism can be combined in a myriad of ways to realize hybrid-parallel training. For example, model-parallelism can be used across the cores of a single node while data-parallelism is used across nodes. There are non-trivial and model-dependent trade-offs involved when designing hybrid schemes, which are beyond the scope of this paper. However, the key challenge that needs to be addressed is how to design communication for hybrid-parallel training. We need multiple MPI communicators to efficiently overlap computation and communication when Allreduce (for data-parallelism) is combined with Send/Recv (for model-parallelism).
Model-parallelism and data-parallelism have different use cases. We have seen that model-parallelism is beneficial when we have a large model or want to keep a small batch size for training. Data-parallelism, on the other hand, gives near-linear scale-up as we increase the number of nodes, but it also increases the batch size. We also observe that on a single node, model-parallelism gives better performance than data-parallelism by utilizing multiple model-partitions, but the number of model-partitions cannot be larger than the number of layers in the model. Therefore, we cannot increase the number of model-partitions beyond a certain point in model-parallelism. For example, we cannot have more than 101 partitions for the ResNet-101 model, and in practice, one layer per model-partition did not give the best performance. Therefore, we have included hybrid-parallelism in HyPar-Flow so that it can benefit from both model and data-parallelism. The performance of hybrid-parallelism depends on how well the combination of model-parallelism and data-parallelism performs and also on how it is implemented under the hood.
In order to achieve linear speed-up with data-parallelism, we have to overlap computation and communication. The allreduce operation (gradient synchronization) is the only communication overhead in data-parallelism. We create one MPI communicator per model partition, and the size of each communicator is equal to the number of model-replicas. In hybrid parallelism, we therefore use send and recv operations to communicate activations and gradients between partitions, and allreduce operations among the same partitions of different model replicas. For example, if we split the model across 48 model partitions, we use 48 allreduce operations (one for each model-partition) to get optimal performance. This design allows us to overlap the allreduce operation with the computation of other partitions on the same node. We use Horovod’s tensor fusion (Sergeev and Del Balso, 2018) to fuse the tensors at one process and further optimize the performance of data-parallel training.
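The communicator layout described above can be illustrated without MPI by computing which ranks would share each allreduce communicator. The rank numbering used here (`rank = replica * num_partitions + partition`) is an assumption made for this sketch; the text does not state how HyPar-Flow maps ranks to partitions and replicas.

```python
def rank_to_coords(rank, num_partitions):
    """Map a global rank to (replica index, partition index).
    Assumed layout: consecutive ranks hold consecutive partitions
    of the same model replica."""
    replica, partition = divmod(rank, num_partitions)
    return replica, partition


def allreduce_groups(num_partitions, num_replicas):
    """Group ranks the way a communicator split keyed on the partition
    index would: one group per partition, one member per replica."""
    groups = {p: [] for p in range(num_partitions)}
    for rank in range(num_partitions * num_replicas):
        _, partition = rank_to_coords(rank, num_partitions)
        groups[partition].append(rank)
    return groups


# Three partitions and three replicas, as in the Figure 2 example.
groups = allreduce_groups(num_partitions=3, num_replicas=3)
for partition, ranks in groups.items():
    print(f"partition {partition}: allreduce among ranks {ranks}")
```

In an MPI program, the same grouping would typically be obtained with a communicator split that uses the partition index as the color, for example `MPI_Comm_split`; each resulting communicator then has as many members as there are model replicas.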
6. HyPar-Flow (HF): Design Details
HyPar-Flow has four main components: 1) Model Generator, 2) Load Balancer, 3) Trainer, and 4) Communication Engine (CE) as shown in Figure 4.
6.1. Model Generator
The model generator is responsible for creating an internal representation of a DNN (e.g. a Keras model) suitable for distributed training (Figure 2). In the standard single-process (sequential) case, all trainable variables (or weights) of a model exist in the address space of a single process, so calling tape.gradient() to get gradients will suffice. However, this is not possible for model-parallel training as trainable variables (weights) are distributed among model-partitions. To deal with this, we propose grad layers (cf. Section 6.2).
We designed the model generator such that it guarantees to follow sequential semantics for the distributed model-parallel version it creates. This is achieved by keeping all hyper-parameters including batch size, learning rate, and training steps exactly the same as in sequential training. This is to make sure that there is no effect whatsoever on the accuracy of the training process. We note that this guarantee does not apply to data-parallel training as it averages the gradients across model-replicas so it is only semantically similar to serial training, in expectation (Maleki et al., 2018).
The internal representation of the model and dependency lists are generated and saved by the Model Generator. The Trainer and the Communication Engine (CE) will utilize this information to realize model-parallel training on multiple model-partitions. We also investigate the use of tape.gradient() to calculate partial errors that are needed to realize model-parallel back-propagation for TensorFlow.
The Trainer contains implementations of the Forward and the Backward pass for various parallelization strategies. A Keras model can be trained in two ways: 1) Single-process (sequential) training via model.fit() and model.train_on_batch(), and 2) Multi-process training via hf.fit() with model-parallel, data-parallel, and hybrid-parallel strategies. For data-parallel training, we simply use the DistributedGradientTape from the Horovod library to get the gradients and then call apply_gradients() on the tf.optimizer object. However, for model-parallel training, we need to design our own distributed back-propagation by using the generated model definition. We show a very simple DNN in Figure 5 to explain back-propagation and highlight what needs to be done for realizing parallel training using model-parallelism. In addition to Figure 5, we use Equations 1–6 to explain back-propagation in more detail. The symbols used are summarized in Table 2. There are three key data elements in DNN training: 1) the input, 2) the predicted output, and 3) the actual output (or label). The hidden layer additionally produces an intermediate output. The difference between the predicted output and the actual output is called the error or loss (Eq. 1).
|Output of hidden layer|
|Forward pass output|
|Input to the model|
|Weight for hidden layer|
|Weight for output layer|
To realize distributed back-propagation, we need: 1) the partial derivative (D1) of the loss with respect to the weight of the hidden layer, and 2) the partial derivative (D2) of the loss with respect to the weight of the output layer. The challenge in the multi-process case is that the term called “partial error” shown in Equations 5 and 6 can only be calculated on Partition-2, as the output layer exists only on Partition-2. To calculate D1, Partition-1 needs this “partial error” term in addition to the derivative of the hidden layer’s output with respect to its weight. This is what necessitates the grad layers, which we design to act as pseudo-layers inserted before the actual layer on each model-partition. We note that TensorFlow’s GradientTape cannot be directly used for this case. Grad layers ensure that we can call tape.gradient() on this grad layer to calculate the partial errors during back-propagation.
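The chain-rule structure behind D1 and D2 can be written out for a two-layer network split across two partitions. The symbols below ($x$, $a$, $\hat{y}$, $y$, $W_1$, $W_2$, $L$, and the layer functions $f_1$, $f_2$, $\ell$) are assumptions chosen for this sketch and may differ from the notation used in Equations 1–6.

```latex
\begin{align}
a &= f_1(x; W_1) && \text{(hidden layer, Partition-1)} \\
\hat{y} &= f_2(a; W_2) && \text{(output layer, Partition-2)} \\
L &= \ell(\hat{y}, y) && \text{(loss, Partition-2)} \\
D2 = \frac{\partial L}{\partial W_2}
  &= \frac{\partial L}{\partial \hat{y}} \,
     \frac{\partial \hat{y}}{\partial W_2} \\
D1 = \frac{\partial L}{\partial W_1}
  &= \underbrace{\frac{\partial L}{\partial \hat{y}} \,
     \frac{\partial \hat{y}}{\partial a}}_{\text{partial error}} \,
     \frac{\partial a}{\partial W_1}
\end{align}
```

Every factor of D2 and both factors of the partial error depend only on quantities held by Partition-2, whereas the last factor of D1 depends only on quantities held by Partition-1. Sending the partial error from Partition-2 to Partition-1 is therefore sufficient to complete D1.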
Specifically, the grad layer is required for each recv operation so that the partial error can be calculated for each preceding model-partition’s input. A call to tape.gradient() will return a list that contains gradients as well as partial errors. The list is used to update the model by calling optimizer.apply_gradients(). Listing 3 shows the pseudo-code for HyPar-Flow’s implementation of this distributed model-parallel back-propagation.
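The exchange of a partial error between two partitions can be sketched with a tiny hand-differentiated network. This is not HyPar-Flow's implementation: it hard-codes the chain rule instead of using grad layers and a gradient tape, the "send" and "recv" steps are plain function calls, and the scalar model (a tanh hidden unit followed by a linear output) and all names are assumptions for illustration.

```python
import math


class Partition1:
    """Holds the hidden layer: a = tanh(w1 * x)."""

    def __init__(self, w1):
        self.w1 = w1

    def forward(self, x):
        self.x = x
        self.a = math.tanh(self.w1 * x)
        return self.a  # activation "sent" to Partition-2

    def backward(self, partial_error):
        # D1 = (partial error received from Partition-2) * da/dw1
        da_dw1 = (1.0 - self.a ** 2) * self.x
        return partial_error * da_dw1


class Partition2:
    """Holds the output layer and the loss: y_hat = w2 * a, L = 0.5 * (y_hat - y)**2."""

    def __init__(self, w2):
        self.w2 = w2

    def forward(self, a):
        self.a = a
        self.y_hat = self.w2 * a
        return self.y_hat

    def backward(self, y):
        dl_dyhat = self.y_hat - y
        d2 = dl_dyhat * self.a               # gradient for the local weight
        partial_error = dl_dyhat * self.w2   # dL/da, "sent" back to Partition-1
        return d2, partial_error


def loss(w1, w2, x, y):
    """Sequential reference used only to check the distributed gradients."""
    return 0.5 * (w2 * math.tanh(w1 * x) - y) ** 2


p1, p2 = Partition1(0.5), Partition2(-1.5)
x, y = 0.8, 1.0
p2.forward(p1.forward(x))            # forward pass across both partitions
d2, partial_error = p2.backward(y)   # backward pass starts on Partition-2
d1 = p1.backward(partial_error)      # and finishes on Partition-1

eps = 1e-6
d1_check = (loss(0.5 + eps, -1.5, x, y) - loss(0.5 - eps, -1.5, x, y)) / (2 * eps)
d2_check = (loss(0.5, -1.5 + eps, x, y) - loss(0.5, -1.5 - eps, x, y)) / (2 * eps)
print(d1, d1_check)
print(d2, d2_check)
```

The finite-difference check on the sequential loss agrees with the gradients assembled from the two partitions, which is the sequential-semantics property that the model-parallel design aims to preserve.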
We note that there is no need to modify back-propagation for data-parallel training as each model-replica is independently performing the Forward and Backward pass. The gradients are synchronized at the end using allreduce to update the model weights in a single step.
6.3. Communication Engine
The Communication Engine (CE) is a light-weight abstraction that handles communication in HyPar-Flow. It provides four simple APIs: 1) send, 2) recv, 3) broadcast, and 4) allreduce. Using these primitives, data can be communicated among processes. The send/recv operations are used for model-parallel training, and broadcast/allreduce are used for data-parallel training, in a unified and runtime-agnostic manner. In addition to providing the communication primitives, the CE is also responsible for handling deadlocks that may arise for models with non-consecutive layer connections.
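The semantics of the four primitives can be sketched with an in-memory stand-in. The class name, method signatures, and single-process design below are assumptions made for illustration; they are not the HyPar-Flow CE interface, and a real implementation would call an MPI library instead.

```python
from collections import defaultdict, deque


class ToyCommEngine:
    """In-memory stand-in for a communication engine with four primitives.

    All "ranks" live in one Python process, so this only illustrates the
    semantics of the calls, not real MPI behaviour or performance.
    """

    def __init__(self, num_ranks):
        self.num_ranks = num_ranks
        # One FIFO mailbox per (source, destination) pair.
        self._mailboxes = defaultdict(deque)

    def send(self, data, src, dst):
        """Point-to-point send: enqueue data for rank dst."""
        self._mailboxes[(src, dst)].append(data)

    def recv(self, src, dst):
        """Point-to-point receive: dequeue the oldest message sent by rank src."""
        box = self._mailboxes[(src, dst)]
        if not box:
            raise RuntimeError(f"rank {dst} would block: nothing sent by rank {src}")
        return box.popleft()

    def broadcast(self, per_rank_values, root=0):
        """Every rank ends up with the root's value."""
        return [per_rank_values[root] for _ in range(self.num_ranks)]

    def allreduce(self, per_rank_values, average=True):
        """Element-wise sum (or mean) of equally sized lists; every rank gets the result."""
        reduced = [sum(column) for column in zip(*per_rank_values)]
        if average:
            reduced = [value / self.num_ranks for value in reduced]
        return [list(reduced) for _ in range(self.num_ranks)]


ce = ToyCommEngine(num_ranks=2)

# Model-parallel pattern: activations cross a partition boundary.
ce.send([0.1, 0.2], src=0, dst=1)
activations = ce.recv(src=0, dst=1)

# Data-parallel pattern: replicas start from the root's weights, then average gradients.
weights = ce.broadcast([[1.0, 2.0], [9.0, 9.0]], root=0)
gradients = ce.allreduce([[1.0, 3.0], [3.0, 5.0]])
```

The toy recv raises instead of blocking, which makes an out-of-order receive visible immediately rather than hanging.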
Figure 6 shows a model with skip connections that requires communication 1) between adjacent model-partitions for boundary layers and 2) between non-adjacent model-partitions for the skip connections. To handle communication dependencies among layers for each model-partition, we create two lists: 1) a Forward list and 2) a Backward list. Each list is actually a list of lists that stores dependencies between layers, as shown in Figure 6. “F” corresponds to the index of the layer to which the current layer sends its data, and “B” corresponds to the index of the layer from which the current layer receives data. An arbitrary sequence of sending and receiving messages may lead to a deadlock. For instance, if Partition-1 sends the partial predictions to Partition-3 while Partition-3 is waiting for predictions from Partition-2, a deadlock will occur because Partition-2 is itself blocked waiting for results from Partition-1. To deal with this, we sort the message sequence according to the ranks so that each partition sends its first message to the partition that holds the next layer.
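The Forward and Backward lists and the rank-sorted send order can be sketched as follows. The helper names, the edge-list input format, and the four-layer example are hypothetical; the example is only loosely modelled on the skip-connection pattern described for Figure 6 and uses 0-based layer and rank indices.

```python
def build_dependency_lists(num_layers, edges):
    """Return (forward, backward) lists of lists.

    forward[i] holds the layers that layer i sends to ("F"),
    backward[i] holds the layers that layer i receives from ("B").
    """
    forward = [[] for _ in range(num_layers)]
    backward = [[] for _ in range(num_layers)]
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
    return forward, backward


def ordered_sends(rank, forward, layer_to_rank):
    """List the cross-partition sends of one rank, sorted by destination rank.

    Each entry is (destination_rank, source_layer, destination_layer), so the
    partition holding the next layer is served before more distant ones.
    """
    sends = []
    for layer, targets in enumerate(forward):
        if layer_to_rank[layer] != rank:
            continue
        for target in targets:
            dst_rank = layer_to_rank[target]
            if dst_rank != rank:
                sends.append((dst_rank, layer, target))
    sends.sort()
    return sends


# Four layers on three partitions; layer 1 feeds layer 2 and, via a skip
# connection, layer 3.
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
layer_to_rank = [0, 0, 1, 2]

forward, backward = build_dependency_lists(4, edges)
sends_from_rank0 = ordered_sends(0, forward, layer_to_rank)
```

Here rank 0 first sends to rank 1 (which holds the next layer) and only then to rank 2, so rank 1 is never left waiting behind the skip-connection message.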
The Communication Engine also relies on the grad layers explained in Section 6.2. For example, two grad layers need to be inserted before Layer-4 on Partition-3 in Figure 6. This ensures that we can call tape.gradient() on these grad layers to calculate the partial errors during back-propagation.
7. Performance Characterization
We first provide the details of evaluation platforms and metrics used in characterizing the DNN training performance in our experiments. Next, the results are provided in the following order: 1) Single-node Multi-process training (Section 7.3), 2) Multi-node Multi-process training (Section 7.4), 3) Verification and validation of Training Accuracy (Section 7.5), and 4) Key Insights gained from the performance characterization (Section 7.6).
7.1. Evaluation Platforms
Our first evaluation platform is the Skylake partition of the Stampede2 (Stanzione et al., 2017) cluster situated at the Texas Advanced Computing Center (TACC). The nodes are equipped with the Intel Omni-Path interconnect. The default MPI library on Stampede2, i.e., Intel MPI 2018, was used for MPI communication. Unless stated otherwise, results are from this platform, so we do not explicitly mention it in the figure captions.
The second platform we have used is a small 8-node cluster equipped with the latest dual-socket AMD EPYC 7551 32-core processors. These nodes are equipped with Mellanox InfiniBand EDR interconnect. The MVAPICH2 2.3.1 (MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE, 2001) library was used on this cluster. Results provided for this platform are referred to as AMD-Platform in the figure captions.
The motivation to utilize AMD processors in addition to Intel processors is twofold: 1) it highlights the general applicability of the proposed HyPar-Flow designs, and 2) it relieves users of the need to rely on Intel-specific libraries like Intel MKL-DNN that do not offer performance benefits on non-Intel platforms.
GPU-based model and hybrid parallelism is beyond the scope of this paper. We plan to investigate GPU-based hybrid-parallelism in the future.
All experiments in the paper have been performed using TensorFlow v1.13.
7.2. Experimental Setup and Evaluation Metrics
Experiments are divided into two categories: 1) Performance Evaluation and 2) Correctness verification.
7.2.1. Performance Evaluation
Sequential: Default TensorFlow Eager training for the given model.
HF (MP): DNN training using HF Model-Parallelism interface.
HF (DP): DNN training using HF Data-Parallelism interface, which internally utilizes Horovod runtime.
Horovod (DP): DNN training using Horovod directly.
Images/second (img/sec) is the metric we use for performance evaluation of the different types of training experiments. The number of images processed by the DNN during training is affected by the depth (number of layers) of the model, batch size (BS), image size (W × H), and number of processes. Higher img/sec indicates better performance. Some terms that may confuse readers are clarified below:
Batch Size (BS): Number of images used for updating the parameters in each training step (per replica)
Effective Batch Size (EBS) = BS × num_replicas
Image Size: Dimension of the image (Width × Height).
Mapping Processes to Compute Elements: In this paper, we are using process to refer to a single process (MPI Process). The actual mapping of the process to the compute units (or cores) varies with how the MPI processes are used within and across nodes. E.g. if we run two MPI processes per node (2ppn), it means that each process has access to 24 cores on the 48-core Skylake.
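The definitions above reduce to simple arithmetic, sketched below. The function names and the sample inputs for the batch-size and throughput calls are hypothetical; only the 48-core, 2ppn case is taken from the text.

```python
def effective_batch_size(batch_size, num_replicas):
    """EBS = BS x num_replicas, where BS is the per-replica batch size."""
    return batch_size * num_replicas


def images_per_second(images_processed, elapsed_seconds):
    """Throughput metric used for the performance evaluation."""
    return images_processed / elapsed_seconds


def cores_per_process(cores_per_node, processes_per_node):
    """Cores available to each MPI process when cores are split evenly."""
    return cores_per_node // processes_per_node


ebs = effective_batch_size(batch_size=32, num_replicas=4)                    # hypothetical inputs
throughput = images_per_second(images_processed=1024, elapsed_seconds=4.0)   # hypothetical inputs
skylake_2ppn = cores_per_process(cores_per_node=48, processes_per_node=2)    # case from the text
```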
7.2.2. Correctness Verification
The correctness experiments that perform full training of models so that accuracy numbers can be reported are provided in Section 7.5.
7.3. Single-Node Training
We train various models on a single Intel Xeon Skylake node, which has a total of 48 cores (96 with hyper-threading) in a dual-socket configuration.
The default version of TensorFlow relies on underlying math libraries like OpenBLAS and Intel MKL. On Intel systems, we tried the Intel-optimized version of TensorFlow, but it failed with errors such as “function not implemented”. For the AMD system, we profiled the runs and observed that the OpenBLAS library available on the system is used. Both of these platforms offered slow sequential training.
We present single-node results for VGG-16, ResNet-110-v1, and ResNet-1001-v2.
VGG-16 has 16 layers. We performed different splits but we observed the best performance when the model was split across 8 partitions for pure model-parallel training. As shown in Figure 7, we see that HF (MP) offers better performance for small batch sizes and HF/Horovod (DP) offers better performance for large batch sizes. HF (MP) offers better performance compared to sequential (1.65× better at BS=1024) as well as to data-parallel training (1.25× better at BS=64) for VGG-16 on the Intel Skylake system.
ResNet-110-v1 has 110 layers so we were able to exploit up to 48 model-partitions within the node as shown in Figure 8. We observed the following: 1) HF (MP) is up to 2.1× better than sequential at BS=1024, 2) HF (MP) is up to 1.6× better than Horovod (DP) and HF (DP) at BS=128, and 3) HF (MP) is 15% slower than HF (DP) at BS=1024. The results highlight that model-parallelism is better at smaller batch sizes and data-parallelism is better only when a large batch size is used.
Figure 9 shows that HF (MP) is able to offer up to 3.2× better performance than sequential training for ResNet-110-v1 on the AMD platform (dual-socket AMD EPYC 7551 processor with a total of 64 cores). The performance gains for HF (MP) over sequential training are due to efficient utilization of all the cores by HyPar-Flow’s design.
ResNet-1001-v2: To push the envelope of model depth and stress test the proposed HyPar-Flow system, we also perform experiments for ResNet-1001-v2, which has approximately 30 million parameters. Figure 10 shows the performance for ResNet-1001-v2. It is interesting to note that data-parallel training performs poorly for this model. This is because the number of parameters increases the synchronization overhead for HF (DP) and Horovod (DP) significantly. Hence, even for large batch sizes, the computation is not enough to amortize the communication overhead. Thus, HF (MP) offers much better performance compared to sequential (2.4× better at BS=256) as well as to data-parallel training (1.75× better at BS=128).
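A back-of-the-envelope estimate shows why gradient synchronization becomes expensive for a model of this size. The sketch below assumes 32-bit (4-byte) gradients and a ring allreduce, in which each process sends roughly 2(p-1)/p times the message size; neither assumption is stated in the paper, the collective actually used depends on the MPI library, and the process count of 4 is purely illustrative.

```python
def gradient_bytes(num_parameters, bytes_per_parameter=4):
    """Size of one full gradient exchange, assuming float32 gradients."""
    return num_parameters * bytes_per_parameter


def ring_allreduce_bytes_sent_per_process(message_bytes, num_processes):
    """Bytes each process sends in a ring allreduce (reduce-scatter + allgather)."""
    if num_processes < 2:
        return 0.0
    return 2 * (num_processes - 1) / num_processes * message_bytes


# ResNet-1001-v2 has approximately 30 million parameters.
resnet1001_gradient_bytes = gradient_bytes(30_000_000)

# Hypothetical configuration with 4 data-parallel processes.
sent_per_process = ring_allreduce_bytes_sent_per_process(resnet1001_gradient_bytes, 4)
```

Under these assumptions every training step moves on the order of a hundred megabytes per process regardless of batch size, which is consistent with the observation that larger batches do not amortize the communication cost for this model.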
7.4. Multi-Node Performance
We perform multi-node experiments in two configurations: 1) Pure model-parallel configuration, and 2) Hybrid-parallel configuration. We present multi-node results for VGG-16 and ResNet-1001-v2.
VGG-16: Figure 11 shows the performance trends for VGG-16 training across two nodes. As mentioned earlier, we are only able to achieve good performance with model-parallelism for up to 8 model-partitions for the 16 layers of VGG-16. We also performed experiments for 16 model-partitions but observed performance degradation. This is expected because of the smaller amount of computation per partition and the greater communication overhead in this scenario.
Model-parallel ResNet-1k: We scale ResNet-1001-v2 on two nodes using 96 model-partitions in a model-parallelism-only configuration. The result is presented in Figure 12. We observed that model-parallel HF (MP) training provides a 1.6× speedup (at BS=256) over HF (DP) and Horovod (DP). On the other hand, a data-parallel-only configuration is not able to achieve good performance for ResNet-1001 due to significant communication (allreduce) overhead during gradient aggregation.
Hybrid-parallel ResNet-1k: First, we explore and discuss the importance of batch-size control in the context of hybrid-parallel training. From an accuracy (convergence) standpoint, the goal is to keep the batch-size small so that the network updates from more training data. However, a larger batch-size provides higher throughput (img/sec). HyPar-Flow enables batch-size control for pure data-parallel, pure model-parallel, and hybrid (data + model) parallel training. Hybrid batch-size control provides the user with the best possible management of the performance/accuracy trade-off during training. A demonstration of this control is presented in Figure 13, where we train ResNet-1001 on 128 nodes.
Figure 13 consists of three major dimensions: 1) number of nodes on the X-axis, 2) performance (img/sec) on the Y-axis, and 3) batch size, shown by the diameter of the circles. The key takeaway is that hybrid-parallelism can maintain high throughput while significantly reducing the largest batch size. For instance, the large blue circle with diagonal lines shows results for 128 nodes using 128 model-replicas and 48 model-partitions, leading to a batch size of just 32,768 instead of 65,536 for the pure data-parallel case. The performance of pure data-parallelism, even with a 2× larger batch size, would still be lower than that of the hybrid-parallel case, i.e., 793 img/sec (= 6.2 × 128, assuming ideal scaling of the DP number presented in Figure 10) vs. 940 img/sec (observed value, Figure 13). This is a significant benefit of hybrid-parallel training, which is impossible with pure model and/or data-parallelism.
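The comparison above can be reproduced with a few lines of arithmetic. All inputs are the values quoted in the text; the per-replica batch size is not stated in the text and is only inferred here by dividing the hybrid batch size by the number of model-replicas.

```python
single_node_dp_img_per_sec = 6.2     # DP number from Figure 10, as quoted in the text
num_nodes = 128
hybrid_observed_img_per_sec = 940.0  # observed value from Figure 13, as quoted in the text

# Pure data-parallelism under ideal (linear) scaling.
ideal_dp_img_per_sec = single_node_dp_img_per_sec * num_nodes

# Ratio of observed hybrid throughput to the ideal data-parallel projection.
hybrid_advantage = hybrid_observed_img_per_sec / ideal_dp_img_per_sec

hybrid_ebs = 32_768
pure_dp_ebs = 65_536
num_model_replicas = 128
inferred_per_replica_bs = hybrid_ebs // num_model_replicas  # inferred, not stated in the text
batch_size_reduction = pure_dp_ebs / hybrid_ebs
```

Even against a best-case projection for data-parallelism, the observed hybrid throughput is roughly 18% higher while using half the effective batch size.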
7.5. Verifying the Correctness of Model-Parallel Training with HyPar-Flow
Because we proposed and designed HyPar-Flow as a new system built from scratch, it is important to provide confidence to the user that HyPar-Flow not only offers superior performance, but also correctly trains the DNN with its hybrid-parallel multi-process training. To this end, we present the correctness results based on two types of accuracy-related metrics: 1) Train accuracy and 2) Test accuracy.
Train Accuracy (train_acc): Percentage of correct predictions for the training data during the training process.
Test Accuracy (test_acc): Percentage of correct predictions for the testing data on the trained model.
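Both metrics are the same computation applied to different data. A minimal sketch follows; the function name and the sample labels are hypothetical.

```python
def accuracy_percent(predicted_labels, true_labels):
    """Percentage of predictions that match the true labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


# train_acc uses predictions on training data, test_acc on held-out test data.
train_acc = accuracy_percent([3, 1, 0, 7], [3, 1, 2, 7])
test_acc = accuracy_percent([5, 5], [5, 4])
```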
VGG-16: Both metrics are covered for small scale training using VGG-16 on CIFAR-10 dataset. We trained for 10 epochs using 8 model-partitions on 2 nodes with a batch size of 128 and 16 pipeline stages as shown in Figure 14.
Next, we provide results for ResNet-110-v1 and ResNet-1001-v2. We used the batch size (BS) of 32 and a learning rate (LR) schedule available from (ker, 2019). We keep BS and LR schedule the same for sequential as well as for model-parallel training of ResNet-110-v1 and ResNet-1001-v2.
ResNet-110-v1: We train ResNet-110-v1 on CIFAR-10 for 150 epochs using multiple configurations as shown in Figure 15. The various configurations are:
1) SEQ (GT)– Sequential training using GradientTape (GT).
2) SEQ (MF)– Sequential training using model.fit (MF).
3) SEQ (MF-E)– Sequential training using model.fit (MF) and (E)ager Execution.
4) HF-MP (2)/(56)– Model-Parallel training using HyPar-Flow on 2 and 56 model-partitions, respectively.
ResNet-1001-v2 is a massive model and it takes a very long time to train. Thus, we used NVIDIA Pascal P100 GPUs to speed up the training process. For SEQ, we trained on a single GPU and for HF-MP (2), we trained using two model-partitions on two GPU nodes. The results are presented in Figure 16. The model was trained for 50 epochs using the CIFAR-10 dataset. The discussion about the performance of GPU-based training is beyond the scope of this paper. TensorFlow currently does not offer an API to get a low-level representation of a GPU tensor. This limits performance as each call to tensor.numpy() necessary for MPI-based communication returns a CPU-based buffer and incurs a device-to-host copy overhead for a GPU tensor (tf-, 2019c).
Discussion: Model-parallel training with HyPar-Flow matches the accuracy of the sequential model after 150 and 50 epochs of training for ResNet-110 and ResNet-1001, respectively. We note that training is a stochastic process and there are variations in earlier epochs whether we use the sequential version or the model-parallel version. What matters, however, is the end result, which in this case peaks at 92.5% for all the configurations presented. We ran multiple training jobs to ensure the trends presented are reproducible.
7.6. Key Insights
Models like ResNet-110 offer better performance for model-parallelism on smaller batch sizes (<128).
Newer and very-deep models like ResNet-1001 benefit from model-parallelism for any batch size (Figure 10).
HyPar-Flow’s model-parallel training provides up to 3.2× speedup over sequential training (on AMD-platform) and 1.6× speedup over data-parallel (Horovod) training (on Intel-platform).
HyPar-Flow’s hybrid-parallel training offers the best performance for ResNet-1001, i.e., 110× speedup over single-node on 128 Intel Xeon (Skylake) nodes.
Next-generation and ultra-deep models like ResNet-5000 can only be designed if model/hybrid-parallelism is used, because it removes the single-node constraint on memory consumption (cf. Section 8).
8. Next-generation DNN Designs via HyPar-Flow’s Scalable Infrastructure
DNN depth is a hyperparameter that has proven to be very effective for increasing the accuracy of DNNs (He et al., 2016). This relationship between the number of layers and accuracy is very clear (ker, 2019), at least for current datasets like CIFAR-10 and ImageNet. Adding more layers to the model increases the number of parameters as well as the computation and memory requirements. The depth of current-generation models is limited by a single node’s memory. Thus, the goal of this study is to investigate and develop arbitrarily large models (Figure 1) that are much deeper than current-generation models. We note that we are providing this as a vision of future DNN models. Today, DNN designers develop models while accounting for the restriction on memory consumption. However, with HyPar-Flow, this restriction no longer exists, and designers can come up with models with as many layers as needed to achieve the desired accuracy. We have examined the memory requirements of a next-generation ResNet-5000 model with five thousand layers designed based on the ResNet-1001-v2 model. We define a model configuration as Trainable if it can fit in device memory at each training step. Table 3 provides trainability data for different configurations. For example, ResNet-5000 can be trained on one node using default TensorFlow (Sequential) with a batch size of 1 but cannot be trained with batch sizes of 2 and 4. To train ResNet-5000, we utilize model-parallel training via the HyPar-Flow system. The main objective is to showcase the ability of HyPar-Flow to train such massive models.
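The notion of a Trainable configuration can be sketched as a simple memory check. This is a deliberately crude model: the function name, the even split of memory across partitions, and the omission of gradients, optimizer state, and communication buffers are all simplifying assumptions, and the numbers are in arbitrary units rather than measurements from Table 3.

```python
def is_trainable(weight_units, activation_units_per_sample, batch_size,
                 num_partitions, memory_units_per_process):
    """Return True if the per-partition footprint fits in one process's memory.

    Weights are fixed, activations grow with the batch size, and the total
    is assumed to be split evenly across model-partitions.
    """
    total = weight_units + activation_units_per_sample * batch_size
    return total / num_partitions <= memory_units_per_process


# Hypothetical model: 4 units of weights, 6 units of activations per sample,
# and 10 units of memory per process.
sequential_bs1 = is_trainable(4, 6, batch_size=1, num_partitions=1, memory_units_per_process=10)
sequential_bs2 = is_trainable(4, 6, batch_size=2, num_partitions=1, memory_units_per_process=10)
two_partitions_bs2 = is_trainable(4, 6, batch_size=2, num_partitions=2, memory_units_per_process=10)
four_partitions_bs4 = is_trainable(4, 6, batch_size=4, num_partitions=4, memory_units_per_process=10)
```

In this toy setting a configuration that does not fit sequentially becomes trainable once the model is split across more partitions, which is the qualitative behaviour the trainability study is meant to show.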
|Batch Size||Sequential||HF-MP (2)||HF-MP (4)|
Deep Learning workloads are going through a rapid change as newer models and larger, more diverse datasets are being developed. This has led to an explosion of software frameworks like TensorFlow and approaches like data and model-parallelism to deal with ever-increasing workloads. In this paper, we explored a new approach to train state-of-the-art DNNs and presented HyPar-Flow: a unified framework that enables user-transparent and parallel training of TensorFlow models using multiple parallelization strategies. HyPar-Flow does not enforce any specific paradigm. It allows the programmers to experiment with different parallelization strategies without requiring any changes to the model definition and without the need for any system-specific parallel training code. Instead, HyPar-Flow Trainer and Communication Engine take care of assigning the partitions to different processes and performing inter-partition and inter-replica communication efficiently. For ResNet-1001 training using HyPar-Flow, we were able to achieve up to 1.6× speedup over data-parallel training and up to 110× speedup over single-node training on 128 nodes. We also tested the ability of HyPar-Flow to train very large models like ResNet-5000, which consists of 5,000 layers. We believe that this study paves new ways to design next-generation DNNs and train them on large-scale HPC systems.
- tf- (2019a) 2019a. Automatic Differentiation in TensorFlow Eager. https://www.tensorflow.org/tutorials/eager/automatic_differentiation [Online; accessed November 14, 2019].
- cha (2019) 2019. Chainer. https://chainer.org/ [Online; accessed November 14, 2019].
- ker (2019) 2019. Keras (CIFAR-10 ResNet). https://keras.io/examples/cifar10_resnet/ [Online; accessed November 14, 2019].
- lin (2019) 2019. Lingvo: A TensorFlow Framework. https://medium.com/tensorflow/lingvo-a-tensorflow-framework-for-sequence-modeling-8b1d6ffba5bb [Online; accessed November 14, 2019].
- tf- (2019b) 2019b. Mesh TensorFlow: Model Parallelism Made Easier. https://github.com/tensorflow/mesh [Online; accessed November 14, 2019].
- tf- (2019c) 2019c. NumPy Compatibility. https://www.tensorflow.org/tutorials/eager/eager_basics#numpy_compatibility [Online; accessed November 14, 2019].
- tf- (2019d) 2019d. TensorFlow Roadmap. https://www.tensorflow.org/community/roadmap [Online; accessed November 14, 2019].
- mxn (2019) 2019. Training with Multiple GPUs Using Model Parallelism. https://mxnet.incubator.apache.org/versions/master/faq/model_parallel_lstm.html [Online; accessed November 14, 2019].
- tf- (2019e) 2019e. What’s coming in TensorFlow 2.0. https://medium.com/tensorflow/whats-coming-in-tensorflow-2-0-d3663832e9b8 [Online; accessed November 14, 2019].
- Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow.org (2016).
- Agrawal et al. (2019) Akshay Agrawal, Akshay Naresh Modi, Alexandre Passos, Allen Lavoie, Ashish Agarwal, Asim Shankar, Igor Ganichev, Josh Levenberg, Mingsheng Hong, Rajat Monga, and Shanqing Cai. 2019. TensorFlow Eager: A Multi-Stage, Python-Embedded DSL for Machine Learning. CoRR abs/1903.01855 (2019). arXiv:1903.01855 http://arxiv.org/abs/1903.01855
- Awan et al. (2018) Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). 143–152. https://doi.org/10.1109/HiPC.2018.00024
- Awan et al. (2017) Ammar Ahmad Awan, Khaled Hamidouche, Jahanzeb Maqbool Hashmi, and Dhabaleswar K. Panda. 2017. S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17). ACM, New York, NY, USA, 193–205. http://doi.acm.org/10.1145/3018743.3018769
- Bauer et al. (2012) Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 66, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389086
- Ben-Nun and Hoefler (2018) Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. CoRR abs/1802.09941 (2018). arXiv:1802.09941 http://arxiv.org/abs/1802.09941
- Dryden et al. (2019) Nikoli Dryden, Naoya Maruyama, Tom Benson, Tim Moon, Marc Snir, and Brian Van Essen. 2019. Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism. CoRR abs/1903.06681 (2019). arXiv:1903.06681 http://arxiv.org/abs/1903.06681
- Gholami et al. (2018) Amir Gholami, Ariful Azad, Peter Jin, Kurt Keutzer, and Aydin Buluc. 2018. Integrated Model, Batch, and Domain Parallelism in Training Neural Networks. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures (SPAA ’18). ACM, New York, NY, USA, 77–86. https://doi.org/10.1145/3210377.3210394
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR abs/1706.02677 (2017). arXiv:1706.02677 http://arxiv.org/abs/1706.02677
- Harlap et al. (2018) Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. CoRR abs/1806.03377 (2018). arXiv:1806.03377 http://arxiv.org/abs/1806.03377
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. CoRR abs/1603.05027 (2016). arXiv:1603.05027 http://arxiv.org/abs/1603.05027
- Huang et al. (2018) Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, and Zhifeng Chen. 2018. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. CoRR abs/1811.06965 (2018). arXiv:1811.06965 http://arxiv.org/abs/1811.06965
- Jia et al. (2018a) Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. 2018a. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. CoRR abs/1807.11205 (2018). arXiv:1807.11205 http://arxiv.org/abs/1807.11205
- Jia et al. (2018b) Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018b. Beyond Data and Model Parallelism for Deep Neural Networks. CoRR abs/1807.05358 (2018). arXiv:1807.05358 http://arxiv.org/abs/1807.05358
- Krizhevsky (2014) Alex Krizhevsky. 2014. One Weird Trick for Parallelizing Convolutional Neural Networks. CoRR abs/1404.5997 (2014). arXiv:1404.5997 http://arxiv.org/abs/1404.5997
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- Maleki et al. (2018) Saeed Maleki, Madanlal Musuvathi, and Todd Mytkowicz. 2018. Semantics-Preserving Parallelization of Stochastic Gradient Descent. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 224–233. https://doi.org/10.1109/IPDPS.2018.00032
- Markthub et al. (2018) Pak Markthub, Mehmet E. Belviranli, Seyong Lee, Jeffrey S. Vetter, and Satoshi Matsuoka. 2018. DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Piscataway, NJ, USA, Article 32, 13 pages. http://dl.acm.org/citation.cfm?id=3291656.3291699
- Message Passing Interface Forum ([n. d.]) Message Passing Interface Forum. [n. d.]. http://www.mpi-forum.org/ Accessed: November 14, 2019.
- MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE (2001) MVAPICH2: MPI over InfiniBand, 10GigE/iWARP and RoCE. 2001. https://mvapich.cse.ohio-state.edu/. [Online; accessed November 14, 2019].
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. (2017).
- Real et al. (2018) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2018. Regularized Evolution for Image Classifier Architecture Search. CoRR abs/1802.01548 (2018). arXiv:1802.01548 http://arxiv.org/abs/1802.01548
- Rhu et al. (2016) Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient Neural Network Design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-49). IEEE Press, Piscataway, NJ, USA, Article 18, 13 pages. http://dl.acm.org/citation.cfm?id=3195638.3195660
- Sergeev and Del Balso (2018) Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799
- Shazeer et al. (2018) Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 10414–10423. http://papers.nips.cc/paper/8242-mesh-tensorflow-deep-learning-for-supercomputers.pdf
- Stanzione et al. (2017) Dan Stanzione, Bill Barth, Niall Gaffney, Kelly Gaither, Chris Hempel, Tommy Minyard, S. Mehringer, Eric Wernert, H. Tufo, D. Panda, and P. Teller. 2017. Stampede 2: The Evolution of an XSEDE Supercomputer. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact (PEARC17). ACM, New York, NY, USA, Article 15, 8 pages. https://doi.org/10.1145/3093338.3093385
- You et al. (2017) Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling SGD Batch Size to 32K for ImageNet Training. CoRR abs/1708.03888 (2017). arXiv:1708.03888 http://arxiv.org/abs/1708.03888