Training Large Neural Networks with Constant Memory using a New Execution Algorithm
Transformer-based NLP models such as BERT and GPT have been widely successful owing to their enormous capacity, with state-of-the-art models growing in depth and trending towards billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high-speed interconnects for data parallelism to ensure stable convergence. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where, at any given moment, the device memory is primarily populated only with the footprint of the executing layer(s). The whole model resides in the DRAM attached to either a CPU or an FPGA as an entity we call the “eager param-server” (EPS). Unlike a traditional param-server, the EPS transmits the model piecemeal to the devices, thereby allowing it to perform other tasks in the background such as reduction and distributed optimization. To overcome the bandwidth issues of shuttling parameters to and from the EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over the whole model. In this paper, we explore a conservative version of L2L that is implemented on a modest Azure instance for BERT-Large running it with a batch size of on a single V100 GPU with less than memory. Our results show a more stable learning curve, faster convergence, better accuracy and reduction in memory compared to the state-of-the-art baseline that requires more than GB for even a small device batch size of . Our method reproduces BERT results on any mid-level GPU, which was hitherto not feasible. Moreover, L2L can scale to arbitrary depth without impacting memory. L2L also allows researchers to develop large models on more affordable devices. In addition, it enables heterogeneous optimization, parallel reduction, and dynamic approaches such as neural architecture search.
This work has been performed on GPUs first but is also targeted towards distributed training on high TFLOPS/Watt accelerators such as IPUs from Graphcore. The code will soon be available on GitHub.
The transformer architecture spawned the “ResNet” moment in natural language processing (NLP), where residual blocks of arbitrary depth can be stacked to create state-of-the-art models such as BERT  and GPT-2 . Although these models reduce design complexity, they have significant overhead in memory requirements. BERT-large can barely train on a high-end GPU such as the V100 with with a batch-size of .
Training large NLP models like BERT with billions of parameters has only been successfully carried out on high-bandwidth memory devices such as GPUs and TPUs with high memory capacities. The memory size is determined not only by the model parameters but also by the sufficiently large batch size required for convergence. The transformer class of models such as BERT can be characterized as having a high weight/activation ratio: they have a high number of parameters and yet relatively small output activations. For instance, BERT-large has encoder layers, parameters, but the layer output size is only per sample. This is the key observation behind a more efficient execution method for large NLP models.
The prevailing techniques to overcome memory limitations include model parallelism, which spreads the model across multiple devices as a worker group. Another technique emulates a larger batch size by splitting the minibatch into smaller “microbatches” that fit in the device’s memory, accumulating a gradient tensor, and updating the weights after accumulation. The most popular batch size technique is synchronous SGD data parallelism layered on top, which is not only required for parallelizing a huge dataset across many workers or worker groups, but also updates at larger batch sizes, minimizing wall clock time. There is no solution, however, for running a model on a device with low memory where all the techniques above run out of memory even for a batch size of . There is also no known solution where the scalability of a model is based not on the size of the model but on the size of the layer.
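The microbatch technique described above can be illustrated with a minimal sketch. It uses a toy one-parameter linear model rather than a transformer, and every name in it (`grad_mse`, `accumulated_step`, and so on) is hypothetical and chosen only for illustration. It shows that summing gradients over microbatches and updating once gives the same step as processing the whole minibatch at once.

```python
# Gradient accumulation on a toy model y = w * x with squared-error loss.
# Illustrative only: names and data are assumptions, not the paper's code.

def grad_mse(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def split(batch, micro_size):
    """Cut a minibatch into consecutive microbatches."""
    return [batch[i:i + micro_size] for i in range(0, len(batch), micro_size)]

def accumulated_step(w, minibatch, micro_size, lr):
    """Accumulate gradients microbatch by microbatch, then update once."""
    acc = 0.0
    for micro in split(minibatch, micro_size):
        acc += grad_mse(w, micro)         # only one microbatch is processed at a time
    return w - lr * acc / len(minibatch)  # single weight update per minibatch

def full_step(w, minibatch, lr):
    """Reference: one update computed from the whole minibatch at once."""
    return w - lr * grad_mse(w, minibatch) / len(minibatch)

minibatch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_acc = accumulated_step(0.5, minibatch, micro_size=2, lr=0.01)
w_full = full_step(0.5, minibatch, lr=0.01)
```

Only the accumulated gradient has to persist between microbatches, which is why the peak activation memory follows the microbatch size rather than the minibatch size.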
In this paper, we propose a new relay-style execution algorithm called L2L (layer-to-layer) that runs models of high weight/activation ratio on a single device by keeping only the executing layer (and transit buffers) on the device. The whole model and the optimizer state are in the host which relays the next layer through the host-to-device interface after each layer-level iteration on the device. We also extend the proposal for distributed training by running the optimizer and reduction entirely in parallel on the host (which we call L2L-p). L2L allows a researcher to run a very large model independent of depth on a single worker (a device) or worker group (a model parallel group of devices) with a sufficiently large batch size for convergence.
With L2L, we show that we can not only run BERT-large with a higher batch size, less memory, and performance comparable to the baseline, but also run a gigantic layer BERT on a single GPU with only . Every other technique results in out of memory even with layers. Furthermore, the L2L-p version is estimated to extend L2L across multiple workers with near linear scalability.
2 Related Works
There are two model parallelism approaches that are highly cited as solutions to the memory limitation problem. The first is PipeDream , which partitions a model across multiple devices and pipelines the execution of forward passes, interspersing them with backward passes to maximize hardware utilization. PipeDream updates on every minibatch and circumvents staleness by maintaining several versions of the model. A related model parallelism approach is GPipe , which also partitions the model across multiple devices. However, GPipe pipelines the execution of microbatches before applying a single synchronous gradient update for the entire minibatch. GPipe stacks the forward pass output activations and recomputes them during the backward pass as it pops each microbatch off the stack. GPipe and PipeDream both incur overheads at the start of the pipeline, and both approaches require the number of devices to scale with the model depth and not just the layer size. They are not constant memory approaches. Also, neither approach has made specific extensions for distributed data parallelism training over model parallelism that can overcome their overheads.
A third method is OpenAI’s gradient checkpointing [5, 6]. The idea here is to trade memory for more computation. A deep neural network can checkpoint a subset of nodes (where a node can be a layer, a sub-layer, or a super-layer) in the computational graph so that it does not need to retain the state of all the nodes. For a node’s backward pass, the required activations are recomputed from the nearest checkpoint. A constant memory implementation of gradient checkpointing is feasible, but results in a computational complexity that scales by . This method suffers from huge recomputation costs for computationally intensive models such as BERT.
The recently published DeepSpeed and ZeRO partition a single copy of the model across many GPUs while running them in data parallelism layer-by-layer. DeepSpeed is an effective method for large models, as its authors demonstrate with a parameters model over GPUs. However, DeepSpeed requires the model to fit across the combined memory of all the GPU devices.
In theory, L2L can run on top of any model parallelism (pipelined or just partitioned) or checkpointing, so it is complementary. In particular, it can be combined with DeepSpeed and ZeRO, as the same model memory partitioning can be applied in the eager param-server while each executing device only carries a much smaller part of the model.
In conventional methods, the whole model resides in the device and is executed in series. Figure 1(a) illustrates such a method, in this case a 4-worker data parallel setup. Within each worker, Figure 1(b) illustrates the execution of a minibatch of size through forward, backward, and ending in gradient reduce and weight update. The entire model, along with its optimizer state and all intermediate activations, is resident on each worker. This is one of the major limitations of running models such as GPT-2 on single GPU devices.
Figure 2 illustrates the L2L strategy. In the basic form of L2L, only one copy of the model exists in the host, which is a special form of param-server we call the Eager Param-Server (EPS). Note that a traditional synchronous param-server hosts a coherent space where devices keep their parameters as a state dictionary, through which they push all the gradients and update the model at every sync. The EPS, on the other hand, not only services the state space on every layer-level sync, but can also reduce and optimize in parallel, that is, as soon as the layer-level gradients arrive and while the device keeps executing. In the base L2L strategy, the reduction is eager but the optimization is not, while in the parallel version called L2L-p, the EPS both reduces and optimizes in the background of training.
In L2L, the workers only carry the currently executing layer and its activations. Figure 2(a) shows this setup. There are two execution paths for L2L: a serialized one shown in Figure 2(b) and the fully parallel L2L-p shown in Figure 2(c).
The main trick here is to run a long minibatch, if necessary divided into a number of microbatches u1, u2, u3, on just one layer at a time so that the overall communication overhead of transmitting the layers over a slow host-to-worker interface is insignificant. Note that increasing the number of microbatches per minibatch further is not necessary once the overhead is minimized.
However, even if the layer transmission overhead is minimized, there is a challenge for the backward pass. All the forward pass activations are lost when a new layer is loaded. Since the model has relatively small output activations compared to the model size, the best method is to recompute (or rematerialize) the activations during the backward pass. This requires the forward pass to stash away only the output activations of every microbatch for every layer in on-chip or off-chip device memory. (Recent advances in invertible networks such as the Reformer alleviate the necessity of storing the output activations.)
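The stash-and-recompute idea can be sketched on a toy network. The example below is an illustration under stated assumptions, not the paper's implementation: each "layer" is a scalar function `tanh(w * x)`, the loss is simply the final output, and all function names are hypothetical. The forward pass keeps only each layer's output, and the backward pass rebuilds the intermediate value from the stashed input of the layer it is currently processing.

```python
import math

def layer_forward(w, x):
    z = w * x                 # intermediate activation, discarded after the forward pass
    return math.tanh(z)

def layer_backward(w, x, grad_out):
    z = w * x                 # recompute the intermediate from the stashed layer input
    dz = grad_out * (1.0 - math.tanh(z) ** 2)
    return dz * x, dz * w     # (gradient w.r.t. w, gradient w.r.t. the layer input)

def forward_with_stash(weights, x):
    """Run all layers, stashing only the output of each layer (plus the input)."""
    stash = [x]
    for w in weights:
        x = layer_forward(w, x)
        stash.append(x)
    return stash

def backward_with_recompute(weights, stash, grad_loss):
    """Walk the layers in reverse; each step needs only one layer and its stashed input."""
    grads = [0.0] * len(weights)
    g = grad_loss
    for k in reversed(range(len(weights))):
        grads[k], g = layer_backward(weights[k], stash[k], g)
    return grads

weights = [0.5, -1.2, 0.8]
stash = forward_with_stash(weights, 0.3)
grads = backward_with_recompute(weights, stash, grad_loss=1.0)
```

The stash holds one value per layer boundary, so it grows with depth, while the recomputed intermediates exist only for the layer currently being processed.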
Recompute is a loss in effective throughput, but unlike other techniques, L2L can compensate in two ways.
First, the worker now runs a layer with higher effective TFLOPs by using some of the on-chip memory savings to run faster underlying kernels that demand more memory.
Second, the data parallelism overhead can be reduced to virtually zero. This can be accomplished in L2L-p (Figure 2(c)) as the host can reduce and update the model in parallel with execution, except for the last two layers (the trailing update, which cannot be hidden).
L2L-p is fully parallel and is projected to scale almost linearly on BERT-Large with virtually zero overhead on thousands of devices, provided batch sizes can be as large as . Most of the overhead is simply hidden behind the minibatch size. The only exposed overhead is on the last two layers of reduction and update. This is negligible as the neural network gets deeper for any given model size.
L2L-p does not necessarily require all the bandwidth of the high-speed links (NVLinks) for reduction. It uses a new form of reduce, a “parallel reduce”, where the reduction is performed wholly in parallel in the EPS for all layers except the last layer, which can go through the NVLinks. However, the NVLinks will be used for quicker loading of the next layer to offset the slow PCIe bandwidth across a number of devices. For example, if there are four devices, the EPS will feed each device one-fourth of the weights over PCIe. Then the devices gather the weights over the high-speed NVLinks at full throttle.
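The four-device example can be sketched as a shard-then-gather data movement. This is a minimal illustration in plain Python with lists standing in for weight tensors; `shard` and `all_gather` are hypothetical helper names, not calls into any communication library.

```python
def shard(weights, num_devices):
    """Split a flat list of layer weights into near-equal contiguous shards,
    one per device (the part each device would receive from the host)."""
    base, extra = divmod(len(weights), num_devices)
    shards, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)
        shards.append(weights[start:start + size])
        start += size
    return shards

def all_gather(shards):
    """Every device ends up with the concatenation of all shards
    (the exchange that would run over the fast device-to-device links)."""
    full = [w for s in shards for w in s]
    return [list(full) for _ in shards]

layer_weights = list(range(10))              # stand-in for one layer's weights
per_device = shard(layer_weights, num_devices=4)
host_elements_per_device = [len(s) for s in per_device]
gathered = all_gather(per_device)
```

Each device receives only about a quarter of the layer from the host, and the remainder arrives from its peers, which is the offloading of the slow link that the paragraph describes.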
In this paper, we run a basic version of L2L on an Azure instance with a single V100 (with a PCIe link and of HBM memory) where the EPS is not running any optimized libraries. We choose PyTorch as the framework due to ease of development, as the L2L method keeps its own data structures and requires a new forward and backward pass. We also discuss the improvements for the basic L2L, and the potential of L2L-p over a cluster of GPUs or future ASIC accelerators such as the Graphcore IPUs.
To give a better picture of the proposed L2L and L2L-p, we compare these algorithms with the conventional baseline and baseline with accumulated gradients. Algorithms 1 and 2 show the execution order in the baseline and baseline with accumulated gradients. Algorithms 3 and 4 show the execution of the L2L and L2L-p approaches, respectively. Note that the main trick here is that L2L inverts the minibatch loop and layer loop. That is the key principle for depth-independent memory sizing.
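The loop inversion can be made concrete with a small sketch that only looks at execution order. It assumes a device that can hold a single layer at a time and counts how often a different layer would have to be loaded under each ordering. The function names and the sizes (4 layers, 8 microbatches) are hypothetical.

```python
def microbatch_major(num_layers, num_micro):
    """Conventional order: each microbatch runs through every layer in turn."""
    return [(u, l) for u in range(num_micro) for l in range(num_layers)]

def layer_major(num_layers, num_micro):
    """L2L-style order: each layer processes every microbatch before moving on."""
    return [(u, l) for l in range(num_layers) for u in range(num_micro)]

def layer_loads(schedule):
    """Count layer loads when only one layer can be resident on the device."""
    loads, resident = 0, None
    for _, layer in schedule:
        if layer != resident:
            loads += 1
            resident = layer
    return loads

loads_conventional = layer_loads(microbatch_major(num_layers=4, num_micro=8))
loads_l2l = layer_loads(layer_major(num_layers=4, num_micro=8))
```

With the layer loop on the outside, the number of loads in a forward pass depends only on the number of layers, so adding microbatches amortizes each transfer over more compute.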
3.1 Memory and Computation Costs of L2L and L2L-p
The memory cost of baseline at the beginning of backward pass will be:
Where is the number of layers, is the layer size (assuming uniform layer size), is the minibatch size, is the intermediate activation size per sample, and is the output activation size per sample.
The first term is four times the model size because, in addition to the model parameters () and gradients, the ADAM optimizer keeps two more tensors of the same size (its first and second moment estimates).
For the basic version of L2L, the memory cost at the beginning of backward pass is:
Comparing the two equations when the model has a high ratio, the cost of L2L is a fraction of the baseline due to the gains on the first and second terms. The second term is only because of the recomputation trade-off (time estimates are covered in the next section). The third term, the output activations, is a function of the depth in basic L2L but is small in comparison. As an example, in BERT-large, is and is layers (excluding embedding and classification). So, baseline is over L2L considering only the first and third terms, even assuming that the second term is the same (i.e., the baseline also recomputes to save memory).
For L2L-p, there is an additional buffer for weights and gradients for transmission.
L2L-p makes it possible to have truly constant memory regardless of depth by transferring the stash () to the CPU during execution, at which point the L2L-p memory cost turns into:
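As a hedged sketch, the three memory costs discussed in this section might be written as follows. The symbols are assumed names for the quantities defined in the prose: $N$ for the number of layers, $L$ for the layer size, $B$ for the minibatch size, $b$ for the microbatch size, $X$ for the intermediate activation size per sample, and $Y$ for the output activation size per sample. The coefficients on the L2L terms are one reading of the text (weights plus gradients of the executing layer, and a transmission buffer of the same size for L2L-p), not necessarily the authors' exact expressions.

```latex
\begin{align}
M_{\text{baseline}} &\approx 4\,N L \;+\; B\,N\,X \;+\; B\,N\,Y \\
M_{\text{L2L}}      &\approx 2\,L \;+\; b\,X \;+\; B\,N\,Y \\
M_{\text{L2L-p}}    &\approx 4\,L \;+\; b\,X
\end{align}
```

Under this reading, only the third term of the basic L2L cost still depends on the depth $N$, and moving the stash to the host removes that last dependence.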
3.2 Computational Time Estimates
For a device with limitless memory to run baseline, the approximate computation time can be estimated as:
Where is the forward compute time per microbatch, is the backward compute time per microbatch, is the optimization time in the device, and is the number of microbatches per minibatch.
For instance, for normal baseline, but for baseline with accumulated gradients.
The computational cost of L2L has two components: (1) the overhead of transmission and (2) the recomputation time.
where is the host-to-device bandwidth (raw is ) and is the optimization time that is now in the CPU. So, indicates the time for loading a layer. A layer is loaded twice (once during forward and once during backward). is slower than as optimization runs in CPU.
The difference is . This difference becomes significantly smaller when u is large since the backward compute time () is greater than forward compute time () or . In L2L-p, the transfer and optimization times can be hidden by overlapping with execution.
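The time estimates above can be explored with a small calculator. The formulas below are a sketch of one plausible reading of the prose: baseline time is compute plus on-device optimization; L2L adds one extra forward pass for recomputation, two loads of every layer, and a host-side optimizer; L2L-p is assumed to hide transfer and optimization completely. All function names and the numeric inputs are hypothetical and are not measurements from the paper.

```python
def baseline_time(u, fwd, bwd, opt_device):
    """u microbatches of forward + backward, then one optimizer step on the device."""
    return u * (fwd + bwd) + opt_device

def l2l_time(u, fwd, bwd, opt_host, model_bytes, host_to_device_bw):
    """Adds recomputation (one extra forward per microbatch), two loads of
    every layer, and an optimizer step that runs on the host."""
    transfer = 2.0 * model_bytes / host_to_device_bw
    recompute = u * fwd
    return u * (fwd + bwd) + recompute + transfer + opt_host

def l2l_p_time(u, fwd, bwd):
    """Assumes transfer and optimization overlap fully with execution."""
    return u * (2.0 * fwd + bwd)

# Hypothetical inputs: seconds per microbatch, bytes, and bytes per second.
u, fwd, bwd = 8, 0.1, 0.2
t_base = baseline_time(u, fwd, bwd, opt_device=0.05)
t_l2l = l2l_time(u, fwd, bwd, opt_host=0.5,
                 model_bytes=1.2e9, host_to_device_bw=12e9)
t_l2l_p = l2l_p_time(u, fwd, bwd)
```

In this model the gap between L2L-p and the baseline is the recomputation term minus the on-device optimizer time, while basic L2L additionally pays for transfers and the slower host optimizer.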
Let us compare the three equations for BERT-Large on a V100 GPU with effective TFLOPs. Let us assume is and is i.e. a microbatch size of . Forward requires GFLOP per layer per sample, Backward requires GFLOP per layer per sample, and Optimization requires GFLOP. We also assume the EPS (eager param-server) performs computations at GFLOPs on the CPU. Then we have:
The majority of the difference in L2L-p is the seconds spent in recomputing the minibatch size of . This overhead can be offset in distributed training as the reduction and optimization are in parallel.
In actual results, the gap between baseline and L2L is much narrower than . This is because L2L and baseline have a different microbatch size ( is smaller for L2L) as L2L can trade off some of the memory savings for running a microbatch large enough to maximize the effective TFLOPs of the device. So, the L2L forward and backward compute times ( and ) are faster than corresponding baseline and . In fact, as the batchsize increases, L2L begins to outperform due to less frequent updates.
4 Experimental Results
In this section, we present the experimental results for the L2L approach compared with the baseline.
4.1 Experimental Data and Setups
We have used the GLUE dataset  in our experiments which includes 8 sequence classification tasks. Our experiments are performed on an Azure NC6-v3 single Nvidia V100 instance with memory and the HuggingFace library  as a baseline for development and experiments. The pretrained model provided by BERT  is used as initial weights for fine-tuning the sequence classification task in both baseline and L2L methods. Table 1 shows the BERT configuration for both baseline and L2L.
|BERT Configuration for baseline and L2L|
|Max sequence Length||512|
4.2 Performance Evaluation of the Single Micro Batch L2L
As described in the previous sections, the L2L algorithm enables training BERT using large batch sizes on memory-constrained devices. Using the HuggingFace library as a baseline, the maximum device batch size that can be used for fine-tuning BERT with a sequence length of 512 is . The L2L algorithm allows training the same model with batch sizes of up to with less memory required compared with the baseline. Table 2 shows the memory required for fine-tuning BERT using L2L and baseline.
|Method||Device Batch Size||#Layer||Memory(GB)|
Table 2 shows that our proposed L2L algorithm can fit BERT with layers using fractionally more memory than the layer BERT-large, whereas the baseline cannot fit more than layers.
Furthermore, L2L converges better than baseline with a less noisy and more stable learning curve from the larger batch size of , allowing us to reproduce the original BERT results on TPU which was not easily possible even on high-end GPUs.
Figure 3 shows the F1 score comparison of L2L and baseline for the MRPC task each trained for epochs.
|Method||Batch Size||Accuracy (%)|
|Baseline with AG||32||92.07||93.57||59.94||89.8||89.89||71.48|
We have also compared L2L’s convergence with the baseline with gradient accumulation. Figure 4 shows the F1 score of both methods when baseline’s device batch size is and gradient accumulation step is set to .
Results show that L2L can converge to a better accuracy after 3 epochs compared to baseline. However, baseline with batch size of runs slightly faster than L2L in this experiment. This is caused by running the optimizer on CPU sequentially which can be improved by introducing the multi-process L2L.
We have performed more experiments on other GLUE datasets by training each of them for epochs with learning rates ranging between to . The following table shows the best accuracy achieved by L2L and baseline on the dev dataset after epochs.
Results show that our method converges to a comparable or better accuracy than baseline on these tasks in epochs which allows scaling the BERT family of models to be trained on low memory devices in a reasonable amount of time.
4.3 Performance Comparison of Scaled L2L and Baseline using Gradient Accumulation
In another experiment, we used gradient accumulation to optimize the baseline and L2L with larger batch sizes. We have compared the computation time required per epoch to train the MRPC task using the maximum device batch sizes possible for baseline and L2L. Figure 5 shows computation time of both algorithms for different batch sizes.
Results show that as the batch size increases, L2L outperforms the baseline in computation time. The reason is that the optimizer in L2L runs on the CPU, so as the number of optimization steps decreases, L2L performs better. The performance of L2L is expected to improve further once the optimizer runs layer by layer in parallel with the backward pass, which can hide the optimization time in L2L.
This will increase the efficiency of using better optimizers that show significant improvements by training over large batch sizes such as LAMB optimizer .
4.4 Performance Profiling of Sequential L2L
The ultimate goal of L2L is to enable distributed training of Transformer-based models in constant memory, so that they can be scaled to arbitrary depths. Sufficiently large batch sizes are required for fast and stable convergence, which we achieve by rethinking the model execution paradigm.
In this paper, we introduced the sequential version of L2L which uses multiple micro-batches to train BERT one layer at a time. As explained earlier, the main reason to do this is to hide the transfer time between GPU and CPU to improve the computation time of the algorithm.
In this experiment, we first measured the memory required for training BERT with different batch sizes, which is presented in Table 4.
|Batch Size||uBatch Size||Memory(MB)|
In another experiment, we have analyzed the effect of using multiple micro batches in sequential L2L by profiling the memory required for training with the batch size of . Table 5 shows the memory analysis of this approach on the MRPC task.
|Batch Size||uBatch Size||Memory(MB)|
As the above results show, most of the memory in L2L is used to stash the activations on the GPU. Using newer models that do not require stashing, such as invertible transformers and reformers, we expect to see more memory savings by discarding the activations.
We have also analyzed the computation time required for each step of L2L including forward, backward, optimizer and transfer time. Figure 6 shows the Pie Chart of the computation time for L2L with batch size of .
As results show, of the time in L2L is used for optimization (gradient clipping and update) in CPU and about for transfers. This can also be reduced as the CPU optimizer is currently not using performant libraries such as Intel MKL and the transfers are not yet using pinned pages. These improvements will be available in the multi-process L2L-p version which will also enable data parallel training where the optimization overhead is mostly hidden.
Training BERT, and Transformer-based models in general, requires a huge amount of resources and device memory that are only available with high-end GPUs and TPUs. Until now, Google’s BERT results were not easily reproducible on any single GPU within a reasonable amount of time. Moreover, with the advent of new high TFLOPs-per-Watt chips, it is imperative to find a method to run on memory-constrained devices. This was the main motivation for us to present an algorithm called L2L that introduces a new execution paradigm by elastically using the CPU memory for storing the model and the optimizer. The device in L2L stores only the executing layer of the model while a process in the CPU called the eager param-server (EPS) prepares and transmits the next layer. The EPS running on the host also takes over the reduction and optimization tasks (using PyTorch multi-processing), with the potential to reduce overhead in large scale distributed training to virtually zero in a parallel version of L2L called L2L-p. An unanticipated benefit was that L2L outperforms the baseline as the batch size increases due to two factors: (a) more effective TFLOPs with relaxed memory constraints, and (b) infrequent updates, where L2L gains more because the CPU optimizer is slower.
We demonstrate a basic L2L method by running BERT-Large on a single GPU with less memory, a batch size of , and faster time to convergence than the baseline, which can only do a batch size of . We also demonstrate that L2L never runs out of memory even when the BERT model grows to 96 layers while all other approaches go out-of-memory. We hope this new execution paradigm will also influence the hardware industry, which is currently investing in single-tier devices with brute-force High Bandwidth Memory technologies and high-speed links, to also consider a two-tier approach to training where the top tier is responsible for the model and optimization (EPS) while the device tier is responsible for executing the layer.
In conclusion, the constant-memory nature of this approach allows scaling to arbitrary depth in the number of layers. We enable developers to run very large models on more affordable hardware. Lastly, each layer can be structurally agnostic to another, encouraging dynamic modeling approaches such as neural architecture search (NAS). The L2L version of the BERT-large model and the EPS will soon be available in open source.
This paper and the research behind it would not have been possible without the exceptional support of our manager and colleagues. We would especially like to thank Tiyasa Mitra, Mohit Mittal, Layali Rashid, Marc Tremblay and Rajiv Kapoor for their great advice and support during the development and publishing of this paper.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
- A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018.
- Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
- T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
- Y. Bulatov. Fitting larger networks into memory. Technical report, OpenAI, 2018.
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
- N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- Y. You, J. Li, J. Hseu, X. Song, J. Demmel, and C. Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations, 2019.
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.