TFLMS: Large Model Support in TensorFlow by Graph Rewriting
Abstract.
While accelerators such as GPUs have limited memory, deep neural networks are becoming larger and no longer fit within the memory limits of accelerators for training. We propose an approach that tackles this problem by rewriting the computational graph of a neural network, inserting swapout and swapin operations that temporarily store intermediate results in CPU memory. In particular, we first revise the concept of a computational graph by defining a concrete semantics for variables in a graph. We then formally show how to derive swapout and swapin operations from an existing graph and present rules to optimize the graph. To realize our approach, we developed a module in TensorFlow, named TFLMS. TFLMS is published as a pull request in the TensorFlow repository as a contribution to the TensorFlow community. With TFLMS, we were able to train ResNet-50 and 3DUnet with substantially larger batch sizes. In particular, we were able to train 3DUnet on full-size images for image segmentation, which, without TFLMS, had been possible only by dividing the images into smaller ones, which affects the accuracy.
1. Introduction
Deep neural networks together with deep learning are effective for solving complex signal-processing problems such as those in computer vision, speech recognition, and natural language processing. However, training a neural network is time-consuming, often taking days to weeks. The training is mainly based on matrix multiplications; therefore, it is often accelerated using accelerators such as GPUs. GPUs were first used to train a deep convolutional neural network, called AlexNet (Krizhevsky et al., 2012), which achieved outstanding image-classification results in the ILSVRC 2012 competition (http://www.imagenet.org/challenges/LSVRC/2012/) with a record top-5 test error rate. Since then, GPUs have been popular for deep learning.
After the success of AlexNet in the ILSVRC 2012 competition, deep learning has evolved quickly toward a broader spectrum of applications. Neural networks have become deeper (including more layers) and larger; e.g., ResNet-1001 (He et al., 2016) is much deeper than AlexNet. Thus, neural networks are sometimes too large to fit within the memory limits of GPUs for training.
From the hardware viewpoint, GPUs could be designed with larger physical memory, but increasing physical memory is expensive. From the software viewpoint, there are three main approaches to solving this problem. The first is reducing memory consumption by reusing memory regions for different computations (Shirahata et al., 2016), compressing the neural network (Choi et al., 2018), or using low precision (Faraone et al., 2017); the second is recomputing some of the computations from checkpoints (Chen et al., 2016); the third is using an external memory, such as CPU memory, to temporarily store intermediate results during training (Rhu et al., 2016; Meng et al., 2017).
We pursued the third approach of using an external memory because it often allows training a larger model than the other approaches and can be applied to any neural network. Unlike previous studies, which swap data between GPU memory and external memory in an ad-hoc manner, we propose an approach based on formal rules for graph rewriting, whose correctness is provable. Our contributions in this paper are as follows:

We revised the concept of a computational graph of a neural network. Our definition of a computational graph is inspired by that in TensorFlow (Abadi et al., 2017), a popular framework for deep learning. Unlike in a TensorFlow computational graph, variables in our computational graph are first-class citizens, consistent with the concept of operations in a computational graph.

We formally derived swapout and swapin operations from an existing graph; these operations are used to exchange intermediate results between GPUs and CPUs. The derivation is based on program-transformation rules with a correctness guarantee, which helps us understand the nature of swapping operations.

We presented two strategies for finding control operations, which control when data are swapped in from external memory to GPU memory; this helps improve performance.

To realize our approach, we developed a module in TensorFlow, called TFLMS. TFLMS is published as a pull request in the TensorFlow repository as a contribution to the TensorFlow community. With TFLMS, we were able to train ResNet-50 (He et al., 2015) and 3DUnet (Çiçek et al., 2016) with substantially larger batch sizes. In particular, we were able to train 3DUnet on full-size images for image segmentation, which, without TFLMS, had been possible only by dividing the images into smaller ones.
The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we discuss our proposed approach involving revising the concept of a computational graph and presenting the semantics of the graph. In Section 4, we discuss the rules to derive swapout and swapin operations and optimizations. In Section 5, we discuss our TFLMS module that implements our approach in TensorFlow. In Section 6, we present the experimental results. Section 7 summarizes the key points and discusses future work.
2. Related work
The most intuitive method for training large models is using Unified Memory (Sakharnykh, 2017), a single memory address space accessible from both CPUs and GPUs. Enabling Unified Memory is simple, but its performance is very poor compared with custom methods that manually offload and prefetch data. Shirahata et al. (Shirahata et al., 2016) proposed a reduction approach of reusing, during the backward phase, the memory regions allocated for the forward phase. Rhu et al. (Rhu et al., 2016) proposed a different approach of managing runtime memory by virtualizing the memory usage of neural networks across both GPU and CPU memory. During training, only the current layer is active and consumes GPU memory, while the other layers' data are swapped out to CPU memory. This approach performed better than using Unified Memory. Meng et al. (Meng et al., 2017) took the same approach as (Rhu et al., 2016) for TensorFlow by swapping tensors from GPU memory to CPU memory and vice versa. However, the authors did not discuss how to derive swapout and swapin operations (Meng et al., 2017), and we could not find their TensorFlow source code. We borrowed Meng et al.'s idea but formally defined transformation rules for graph rewriting so that the correctness of the transformed computational graph is provable. Apart from using CPU memory as temporary storage, Chen et al. (Chen et al., 2016) proposed gradient checkpointing, in which checkpointing vertices in a computational graph are automatically chosen using graph partitioning. Parts of the graph between checkpointing vertices are recomputed during the backward phase, so the forward phase is generally computed twice. Wang et al. (Wang et al., 2018) combined both swapping and recomputation in a single framework.
3. Computational graphs
Notation  Definition  Meaning

Vertices  f(x_1, …, x_n) = y  Normal operation f, taking a list of inputs x_1, …, x_n and returning an output y.
v  Parameterized operation or variable v.
Edges  (u, v, "read")  v reads the output of u as input; this edge represents the composition v ∘ u.
(u, v, "update")  u produces an output that is used to update variable v; the output of u and v must have the same type.
(u, v, "control")  v cannot be executed unless u has finished.
A computational graph is a core concept in TensorFlow. Neural networks defined by users are represented by a computational graph of operations. TensorFlow then executes optimizations over the graph before invoking operations in the graph. In this section, we revise the concept of a computational graph in TensorFlow (Abadi et al., 2017) to make its semantics more consistent.
3.1. Definition
Definition 1 (Computational graph).
Let G = (V, E) be a vertex- and edge-labeled directed graph, where V is the set of vertices in G, E is the set of edges in G, one labeling function maps each vertex to a tuple of an operation and a Boolean value indicating whether the operation is parameterized, and another labeling function maps each edge to a tuple of a data type and an action in the set {"read", "update", "control"}.
Computational graphs are a way to express mathematical expressions, in which each vertex is an operation whose inputs come from incoming edges and whose outputs go to outgoing edges. In deep learning, computational graphs express the computations in neural networks, which consist of operations whose inputs and outputs are often multi-dimensional arrays, called tensors. Tensors that store the internal states of a neural network, e.g., the learnable weights and biases in hidden layers, are updated regularly. Hence, we classify operations into normal operations and parameterized operations, where parameterized operations have internal states that can be updated. A variable is a special parameterized operation that updates its internal state using the identity operation (the identity function accepts a value and returns the same value). A constant is a special case of a variable whose value is set once and never updated. Each edge carries an action related to the tensor on the edge. There are three actions: "read", "update", and "control". Considering an edge from an operation u to an operation v, the actions "read" and "update" mean that v reads and updates the tensor, respectively, and the action "control" means that u triggers the execution of v; u is called a control dependency operation.
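To make Definition 1 concrete, the following minimal sketch models a labeled graph with plain Python dictionaries; the vertex and edge names are illustrative only and are not TensorFlow APIs.

```python
# A minimal model of Definition 1: vertices carry (operation, is_parameterized),
# edges carry (tensor_type, action) with action in {read, update, control}.
READ, UPDATE, CONTROL = "read", "update", "control"

vertices = {
    "x": ("identity", True),   # a variable: parameterized, updated via identity
    "f": ("multiply", False),  # a normal operation
    "g": ("add", False),       # a normal operation
}

edges = {
    ("x", "f"): ("float32", READ),    # f reads the tensor stored in x
    ("f", "g"): ("float32", READ),    # g reads the output of f
    ("g", "x"): ("float32", UPDATE),  # g's output updates variable x
}

def consumers(v):
    """Operations that read or are controlled by v's output."""
    return [dst for (src, dst), (_, act) in edges.items()
            if src == v and act in (READ, CONTROL)]

print(consumers("x"))  # ['f']
```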
Figure 1 shows a computational graph for an example expression with three variables. An outgoing edge emanating from a variable means reading the variable's value, and an incoming edge to a variable means updating the variable with a tensor (denoted with a dotted arrow).
3.2. Notations and semantics
Table 1 lists the notations used to represent different vertices and edges in a graph. Function composition is denoted by "∘"; from its definition, we have (g ∘ f)(x) = g(f(x)). The projection function π_i takes the i-th element of a tuple; e.g., π_1(x, y) returns x.
An operation in a computational graph is generally triggered to execute when all of its incoming edges have data. The operation then generates data on its outgoing edges, and other operations are triggered in the same manner. This procedure ends when all reachable operations have been executed and all reachable edges are filled with data. In other words, each reachable operation, except variables, is executed once.
However, as defined so far, there is no way to trigger the execution of a graph: at the beginning of computation, there is no way to set a value for an edge. Furthermore, computational graphs are acyclic, so there are some operations with no incoming edges; these operations cannot be triggered. This problem is resolved using variables.
Variables in a computational graph are used to store learnable parameters, input and output data, and are used to trigger computation of the graph. Variables are special and make a computational graph for deep learning different from a general dependency graph. Because a variable has an internal state, defining its semantics is nontrivial in the context of the graph. At the beginning, variables are initialized with values input by users or random values generated by a distribution. During training, they are updated by a learning optimizer. This leads to a variable being visited more than once, and may introduce cycles if its semantics is ambiguous. The remainder of this section introduces a clear semantics for variables.
To describe the semantics of a computational graph containing variables, we first define a topological ordering over a computational graph.
Definition 2 (Topological ordering).
Given a computational graph G = (V, E), let N be the number of vertices in the graph. A topological ordering is a mapping of vertices to integers, ord: V -> {0, 1, …, N - 1}, satisfying

ord(u) < ord(v) for every "read" edge (u, v) ∈ E, and

ord(u) < ord(v) for every "control" edge (u, v) ∈ E.
In general, a topological ordering represents the order of execution of operations in a graph. Given two operations u and v, if ord(u) < ord(v), u is executed before v; if ord(u) = ord(v), u and v are executed in parallel. In this paper, variables always have order 0, which means variables are executed first, and incoming ("update") edges to them do not change their order. Later executions of a variable depend on its incoming operations and are independent of the variable's order; these executions alone do not trigger the variable's outgoing operations.
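As an illustration, a topological ordering obeying these conventions can be computed as follows (a sketch; variables are pinned to order 0 and "update" edges into them are ignored):

```python
# A sketch of computing a topological ordering consistent with the rules
# above: variables get order 0, and "update" edges into variables are
# ignored so they cannot create cycles or reorder the variables.
def topological_order(vertices, edges, variables):
    preds = {v: [] for v in vertices}
    for src, dst, action in edges:
        if action == "update" and dst in variables:
            continue  # update edges do not change a variable's order
        preds[dst].append(src)

    order = {}
    def visit(v):
        if v not in order:
            if v in variables or not preds[v]:
                order[v] = 0  # variables (and sources) execute first
            else:
                order[v] = 1 + max(visit(p) for p in preds[v])
        return order[v]

    for v in vertices:
        visit(v)
    return order

# x is a variable feeding f, whose output feeds g; g updates x.
ord_ = topological_order(
    vertices=["x", "f", "g"],
    edges=[("x", "f", "read"), ("f", "g", "read"), ("g", "x", "update")],
    variables={"x"})
print(ord_)  # {'x': 0, 'f': 1, 'g': 2}
```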
Let us consider the example graph in Figure 1(a). First, the variable is initialized by the user and then triggers the operation reading it; that operation executes and triggers the next one. Finally, the variable is updated with the output of the last operation, and the computation finishes. The updating operation depends only on its input, and the update by itself cannot trigger the variable's readers again.
The example graph in Figure 1(b) may have two possible execution orderings. The final operation is triggered based on the availability of its two input tensors, and it is easy to see that each operation must be executed after the operations producing its inputs. However, one of the operations is executed multiple times, so it is important to know which of its outputs is used as the input to its consumer.
To avoid ambiguity, we present the following convention regarding variables:

An operation always uses the latest value of a variable.

Variables always have the highest priority of execution among operations consuming the same tensor.
This convention ensures that an operation consuming a variable is executed after the variable has been updated.
The execution order of an operation depends not only on data availability on incoming edges but also on control dependency edges. "Control" edges do not carry data; in other words, they are not inputs to the operation. They are used to control the execution order of an operation, and adding a "control" edge to a graph alters the graph's topological ordering. If (u, v) is a "control" edge, v must be executed after u, and ord(u) < ord(v). By this definition, there is no control edge to a variable.
The example graph in Figure 1(c) has a new operation that consumes the output of an earlier operation, executes its computation, and updates a variable. Without a control edge, the two operations downstream of that output can be executed in parallel because they do not depend on each other. Because both access the same variable, i.e., one reads it while the other writes to it, a control edge is necessary to ensure that they access the variable in order: it states that the reading operation will be executed only after the writing operation has finished and updated the variable.
3.3. Training using backpropagation
Training a neural network involves minimizing an objective function measuring the distance between ground-truth and predicted values. The objective function is a composition of multiple functions with learnable parameters, and the gradient-descent algorithm is often used to minimize it. Optimization is an iterative procedure updating the learnable parameters so that the objective function is minimized, in which each training iteration consists of three phases: a forward phase to compute the objective function, a backward phase to compute gradients of the objective function with respect to the learnable parameters, and an update phase to update the learnable parameters using the gradients. The backward phase is done via backpropagation for efficiency, starting from the objective function and propagating gradients back through the functions. At the beginning of an iteration, tensors are cleaned up, except for the variables holding learnable parameters. Variables for input tensors are fed with new data and trigger the iteration. Because a training dataset is often very large, each iteration takes only a subset (batch) of examples from the training dataset as its input tensor. The number of examples in a batch (the batch size) affects the size of the input tensor and of other tensors in the computational graph; in general, increasing the batch size makes a model larger.
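The three phases can be illustrated with a toy scalar model (plain Python, no TensorFlow; the objective and learning rate are arbitrary):

```python
# A toy training iteration for the objective L(w) = (w*x - y)^2, showing the
# three phases described above: forward, backward (backpropagation), update.
def train_step(w, x, y, lr=0.1):
    # forward phase: compute the objective
    pred = w * x
    loss = (pred - y) ** 2
    # backward phase: gradient of the loss w.r.t. the learnable parameter w
    grad_w = 2 * (pred - y) * x
    # update phase: gradient-descent step
    return w - lr * grad_w, loss

w = 0.0
for _ in range(50):  # each iteration would normally consume one batch
    w, loss = train_step(w, x=2.0, y=6.0)
print(round(w, 3))  # converges to 3.0, where the loss is minimized
```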
Figure 3 shows how learnable parameters (represented by variables) are updated during training. In the forward phase, a variable is an input to the first function, the outputs of each function are used by later functions, and finally a loss value is produced by the objective function. In the backward phase, we compute the gradients of the loss with respect to the learnable parameters; the gradient function for a layer requires that layer's forward output as one of its inputs. Finally, each variable is updated by an update function during the update phase.
3.4. Device placement
In TensorFlow, each operation in the computational graph is placed on a device such as a GPU, CPU, or FPGA. Communication between two devices occurs automatically if an operation on one device consumes a tensor produced by an operation on another device. In fact, TensorFlow adds a pair of operations, "send" and "receive", to the graph for exchanging a tensor. In this paper, we do not show these communication operations when drawing graphs.
3.5. Garbage collection
If a tensor is no longer used, it is released by TensorFlow's garbage collection. Every tensor is assigned a reference count, which is the number of operations consuming it. Each time the tensor is consumed by an operation, its reference count is decreased by one. When the reference count reaches zero, the tensor becomes available to be released. In other words, the lifetime of a tensor spans from the operation generating it to the last operation consuming it. Let t be a tensor produced by an operation u, and let v_1, …, v_k be the operations consuming t. The lifetime of t is the interval [ord(u), max_i(ord(v_i))].
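As a small illustration of this lifetime rule (the function and graph below are ours, not TensorFlow's internals):

```python
# Lifetime of a tensor following the rule above: a tensor t produced by u and
# consumed by v1..vk is alive from ord(u) until max_i ord(v_i).
def lifetime(order, producer, consumers):
    return (order[producer], max(order[c] for c in consumers))

order = {"u": 1, "v1": 2, "v2": 7}  # a toy topological ordering
span = lifetime(order, "u", ["v1", "v2"])
print(span)  # (1, 7): the tensor stays in memory long after v1 ran
```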
4. Graph rewriting
A computational graph (or a neural network model) is said to be too large to train within the memory limits of a GPU if, at some point, the tensors kept in GPU memory consume more memory than is available; hence, an out-of-memory error often occurs when training such a graph. This is essentially because the computational graph contains many tensors with long lifetimes. In this section, we show how to rewrite a large graph so that training it is possible with limited GPU memory. In general, our idea is to temporarily send "long-lifetime" tensors from the GPU to the CPU and send them back to the GPU when necessary.
4.1. Swapping out tensors to CPU memory
To place a tensor residing in GPU memory into CPU memory, we derive operations that automatically send the tensor to the CPU and send it back to the GPU. Let us consider an edge (u, v), where u and v are executed using a GPU. The computation for this edge is
(1)    v^g ∘ u^g,
where the superscript g stands for GPU. This computation can be rewritten into:
(2)    v^g ∘ ι^c ∘ u^g,
where the superscript c stands for CPU, and ι is the identity function, that is, ι(x) = x.
Since ι^c is executed using a CPU, the output tensor of u will be swapped out to CPU memory immediately after u finishes, and its GPU memory is released. The output tensor of ι^c will be swapped in to the GPU for v when v is triggered. We call the function ι^c in Equation 2 a swapout operation.
Using Equation 2, we are able to rewrite a graph so that GPU memory consumption is reduced. However, not all edges need to be rewritten. For edges (u, v) where ord(v) − ord(u) = 1, v is executed immediately after u; hence, there is no need to swap the tensor on such edges. We can define a threshold α, and graph rewriting for an edge (u, v) is triggered only if ord(v) − ord(u) > α.
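This selection rule can be sketched as follows, where `alpha` denotes the rewriting threshold:

```python
# Selecting edges to rewrite: only edges (u, v) whose consumer runs much later
# than its producer in the topological ordering are worth swapping.
def edges_to_rewrite(edges, order, alpha=1):
    return [(u, v) for (u, v) in edges if order[v] - order[u] > alpha]

order = {"a": 0, "b": 1, "c": 2, "d": 9}
edges = [("a", "b"), ("b", "c"), ("b", "d")]
print(edges_to_rewrite(edges, order))  # [('b', 'd')]: only the long edge
```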
4.2. Optimization
Equation 2 is not optimal for two reasons: the output tensor of u is swapped in too late, so v must wait for the tensor to be sent from CPU memory to GPU memory; and the tensor may be swapped out and in multiple times, since multiple operations may read it. In this section, we present three rules to optimize Equation 2. Figure 4 shows the computational graphs obtained by each of the optimization rules.
4.2.1. Introduce swapin operations
To swap a tensor in early, we need an additional operation. The identity function can be rewritten as the composition of a function and its inverse, that is,
(3)    ι = f^{-1} ∘ f.
Equation 2 then becomes:
(4)    v^g ∘ (f^{-1})^g ∘ f^c ∘ u^g.
Since the identity function is its own inverse, i.e., ι^{-1} = ι, we choose ι for f (if one would like to reduce memory consumption on the CPU, a pair of encoding and decoding functions can be used for f and f^{-1} instead of ι):
(5)    v^g ∘ ι^g ∘ ι^c ∘ u^g.
In Equation 5, ι^g is used to swap a tensor in to a device, and we call the function ι^g a swapin operation. It is worth noting that we must trigger ι^g at a good time; otherwise, ι^g is executed immediately after ι^c. To do this, a control edge from some operation to ι^g must be added. We present two strategies for choosing a control operation in Section 4.3.
4.2.2. Fuse swapout operations
A tensor produced by an operation is often used by multiple operations, and it is redundant to swap the tensor out to CPU memory multiple times. Hence, it is recommended to always fuse the swapout operations of the same tensor into a single swapout operation.
4.2.3. Fuse swapin operations
Consider a situation in which multiple swapin operations swap the same tensor in multiple times for multiple consuming operations. If the tensor is large and the consuming operations are close to each other, swapping the tensor multiple times introduces extra overhead. In this case, it is better to fuse the swapin operations into one: the tensor is swapped in only once and resides in GPU memory to be reused by the consuming operations. For example, in the rightmost graph in Figure 4, if the two consuming operations are close and the tensor is large, we fuse their swapin operations into a single swapin operation. To determine how close two operations are, we may define a threshold for the distance between them.
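A sketch of this fusing rule, grouping consumers of a swapped tensor whose topological orders lie within a distance threshold (the function name and threshold value are illustrative):

```python
# Fuse swapin operations: consumers of the same swapped tensor that are
# "close" in the topological ordering share a single swapin.
def fuse_swapins(consumers, order, threshold=2):
    groups, current = [], []
    for c in sorted(consumers, key=lambda v: order[v]):
        if current and order[c] - order[current[-1]] <= threshold:
            current.append(c)          # close enough: reuse the same swapin
        else:
            current = [c]              # too far apart: start a new swapin
            groups.append(current)
    return groups

order = {"v1": 4, "v2": 5, "v3": 20}
print(fuse_swapins(["v1", "v2", "v3"], order))  # [['v1', 'v2'], ['v3']]
```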
4.3. Strategies to add control edges
Control edges to swapin operations are added to a computational graph to control when the swapin operations are triggered. They are important for reducing the communication overhead of swapping tensors in. Considering Equation 5, a control operation for the swapin operation ι^g must be chosen from the set of operations whose topological order lies strictly between those of u and v, to guarantee the correctness of the computational graph. Let d be the distance between the chosen control operation and v. If d is too small, the tensor is swapped in too late, and v has to wait for it. If d is too large, the tensor is swapped in too early, and it is kept on the device for a long time before actually being used by v.
An ideal solution for choosing a control operation is having a cost model for computational graphs and using the model to prioritize operations. However, in TensorFlow, the shapes of the input and output tensors of an operation are generally unknown until data are fed into the graph and trigger the operation. This means that, at the time a graph is rewritten, there is no information about the actual size of tensors, so operation costs cannot be computed statically.
In the context of statically modifying a computational graph, we introduce two parameters, lowerbound and upperbound, to handle choosing control operations. Let us assume that an edge (u, v) is rewritten using a swapout operation so and a swapin operation si:
(6)    v^g ∘ si^g ∘ so^c ∘ u^g.
We present two strategies to find a control operation for si.
4.3.1. Direct-order strategy
The direct-order strategy directly uses the topological ordering to obtain a set of candidate control operations, starting from the target operation v and going back toward the source operation u. Lowerbound and upperbound are relative to v.
Algorithm 1 shows this strategy. Candidates are operations whose distance to v lies between lowerbound and upperbound and from which there exists a path to v. The algorithm stops once it has found one operation satisfying these conditions.
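Algorithm 1 itself is not reproduced here; the following is a sketch of the direct-order search as described above, under the assumption that candidates are scanned backward from v by topological order:

```python
# Direct-order strategy (a sketch of Algorithm 1): walk back from the
# consuming operation v and return the first operation c such that
# lb <= ord(v) - ord(c) <= ub and a path c -> v exists.
def has_path(edges, src, dst):
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(d for s, d in edges if s == n)
    return False

def direct_order(v, order, edges, lb=1, ub=5):
    for c in sorted(order, key=order.get, reverse=True):
        d = order[v] - order[c]
        if lb <= d <= ub and has_path(edges, c, v):
            return c  # closest valid control operation
    return None

order = {"a": 0, "b": 3, "c": 6, "v": 8}
edges = [("a", "b"), ("b", "c"), ("c", "v")]
print(direct_order("v", order, edges, lb=1, ub=4))  # 'c'
```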
4.3.2. Chain-rule strategy
The chain-rule strategy starts from the source operation u and goes down along the forward phase to find corresponding backward operations as candidates for control operations. Breadth-first search is used to traverse operations in the forward phase, with lowerbound and upperbound limiting the search space of forward operations. In other words, lowerbound and upperbound are relative to the source operation u.
Algorithm 2 shows this strategy. For the breadth-first search, we maintain two open sets and one closed set. The first open set contains the current forward operations, the second contains the forward operations for the next level (including all outgoing operations of the operations in the first), and the closed set contains visited operations. Starting from u, once the search is within the range of lowerbound to upperbound, the algorithm obtains the outgoing backward operations of the current operation and checks their validity. If there is a valid operation, it is a candidate and the algorithm returns it. Otherwise, the algorithm proceeds to the next level.
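A sketch of the chain-rule search as described, with `outgoing` and `is_backward` supplied by the caller (both are assumptions of this sketch, not TFLMS APIs):

```python
from collections import deque

# Chain-rule strategy (a sketch of Algorithm 2): breadth-first search from the
# source operation u through forward operations; once the current level is in
# [lb, ub], any outgoing backward operation that runs before v is a candidate.
def chain_rule(u, v, order, outgoing, is_backward, lb=1, ub=5):
    open_set, next_set = deque([u]), deque()
    closed = set()
    level = 0
    while open_set:
        op = open_set.popleft()
        closed.add(op)
        if lb <= level <= ub:
            for cand in outgoing(op):
                if is_backward(cand) and order[cand] < order[v]:
                    return cand  # a valid control operation
        for nxt in outgoing(op):
            if not is_backward(nxt) and nxt not in closed:
                next_set.append(nxt)
        if not open_set:  # current level exhausted: move to the next one
            open_set, next_set = next_set, deque()
            level += 1
    return None

# Toy graph: forward chain f1 -> f2 -> f3, each feeding its gradient op b_i;
# the swapin consumed by b1 gets b2 as its control operation.
outs = {"f1": ["f2", "b1"], "f2": ["f3", "b2"], "f3": ["b3"],
        "b3": ["b2"], "b2": ["b1"], "b1": []}
order = {"f1": 0, "f2": 1, "f3": 2, "b3": 3, "b2": 4, "b1": 5}
ctrl = chain_rule("f1", "b1", order,
                  lambda op: outs[op], lambda op: op.startswith("b"))
print(ctrl)  # 'b2'
```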
5. TFLMS module in TensorFlow
Parameter  Meaning  Default value 
graph  The graph we will modify for LMS. This should be the graph of the user-defined neural network.  required 
optimizer_scopes  A set of scopes for the optimizers/solvers.  required 
starting_scope  Tensors that are reachable from the operations in this scope will be swapped for LMS. Set this to the scope of the first layer if we would like to modify the whole graph.  None 
starting_op_names  Tensors that are reachable from the operations with these names will be swapped for LMS.  None 
excl_scopes  A set of scopes. Output tensors of operations in the scopes will not be swapped out to CPU memory.  empty 
incl_scopes  A set of scopes. Output tensors of operations in the scopes will be swapped out to CPU memory.  empty 
excl_types  A set of types. Output tensors of operations with these types will not be swapped out to CPU memory.  empty 
incl_types  A set of types. Output tensors of operations with these types will be swapped out to CPU memory.  empty 
n_tensors  The number of tensors for LMS, counting from the starting_scope.  -1 (all tensors) 
lb  Lowerbound value for LMS.  1 
ub  Upperbound value for LMS.  10000 
ctrld_strategy  Two strategies to find control dependency operations for swapin operations: chain_rule and direct_order.  chain_rule 
fuse_swapins  Fuse "close" swapin operations into one operation.  False 
swap_branches  If True, LMS will swap tensors in branches in the forward phase.  False 
branch_threshold  A threshold for swapping branches in the forward phase.  0 
We developed a TensorFlow module, named TFLMS, based on our proposed approach. The module allows users to quickly turn their large model into one that can be trained with limited GPU memory. In TensorFlow, users first define a neural network model. TensorFlow then automatically generates a computational graph from the model. Finally, users define a TensorFlow session to execute operations in the computational graph. Once a session is invoked, users cannot modify the computational graph. Hence, we implement TFLMS to statically modify the graph before a session starts.
Figure 5 shows how TFLMS is positioned in TensorFlow. TFLMS takes a computational graph and automatically modifies it using the transformation rules presented in Section 4. TFLMS uses the APIs in the TensorFlow module "graph editor" (https://www.tensorflow.org/api_guides/python/contrib.graph_editor) to modify the graph. The modified graph is then executed by a TensorFlow session as normal. TFLMS's source code is publicly available as a pull request in the TensorFlow repository (https://github.com/tensorflow/tensorflow/pull/19845).
Listing 1 shows a brief example of using TFLMS in TensorFlow. While defining a neural network, users must define a scope for their optimizer. Users then define an LMS instance for that scope and run the instance to modify the computational graph of the neural network. After that, users create a TensorFlow session and train the network as usual.
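Listing 1 is not reproduced here; the following sketch reconstructs its shape from the description above and Table 2. The import path, scope name, and model fragment are assumptions rather than the module's confirmed API, so consult the pull request for the exact interface.

```python
import tensorflow as tf
from lms import LMS  # hypothetical import path for the TFLMS module

# ... define the neural network model and its loss here ...

# Define a scope for the optimizer, as TFLMS requires.
with tf.name_scope('adam_optimizer'):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

# Create an LMS instance for that scope and rewrite the graph
# *before* any session is created.
lms_obj = LMS(graph=tf.get_default_graph(),
              optimizer_scopes={'adam_optimizer'})
lms_obj.run()

# Train as usual; the session executes the modified graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... feed batches and run train_step ...
```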
5.1. Implementation
An important part of TFLMS is building a topological ordering. Given a graph, TFLMS uses the Python package "toposort" (https://pypi.org/project/toposort/) to build a topological ordering, which decides which tensors are swapped out and when they are swapped in, as shown in Section 4. To rewrite edges, TFLMS traverses the graph using breadth-first search, starting from the input variables. We do not rewrite incoming and outgoing edges of variables; in other words, learnable parameters are kept in GPU memory. Apart from the input computational graph, TFLMS allows users to pass other parameters to flexibly control how the graph is modified. Table 2 lists the parameters in TFLMS.
By default, TFLMS always rewrites edges between a forward operation and a backward operation. To determine the operations in the backward phase, users should pass the scope (in TensorFlow, a scope defines a name for a set of operations, similar to a folder in a file system) of the solvers or optimizers used to train the model, via the TFLMS parameter optimizer_scopes. Note that it is possible to automatically rewrite the whole graph without optimizer_scopes; using optimizer_scopes avoids rewriting operations that are not helpful for large model support, e.g., operations in the update phase. If a model has many branches in the forward phase, users may want to use the parameters swap_branches and branch_threshold to enable rewriting edges whose ordering distance exceeds the threshold defined in Section 4.1. Swapping tensors in the forward phase may affect the inferencing performance of a neural network because it introduces the overhead of swapping the tensors out and in; however, if the neural network is too large even for inferencing, swapping those tensors is necessary. Without swap_branches enabled, our modification does not affect inferencing performance, because the added swapout and swapin operations between the forward and backward phases are not executed during inferencing. Inclusion or exclusion of an operation can be done via the operation's type or scope. Users can define a starting point for the breadth-first search via the scope or names of operations, using the parameters starting_scope and starting_op_names. By default, TFLMS rewrites all reachable edges, but users can limit the number of tensors swapped via the parameter n_tensors. Parameters lb and ub are the lowerbound and upperbound, respectively, defined in Section 4.3. The strategy for choosing control operations is set by the parameter ctrld_strategy, and fuse_swapins enables the optimization of fusing swapin operations.
5.2. Performance tuning
To get maximum performance from TFLMS, we need to find the combination of tuning parameters that provides the fastest training time for the model. The goal of performance tuning is to swap out enough tensors to allow training to run without out-of-memory errors, while not swapping so many that the extra communication overhead degrades performance.
The two tuning parameters to focus on are n_tensors and lb. Since n_tensors controls the number of tensors that will be swapped, the higher it is set, the lower the peak GPU memory usage will be. The lb parameter controls how soon a tensor is swapped back in before use. A low lb value can make the training on the GPU pause and wait while the swapin finishes, which degrades performance. A higher lb value allows the swapin to finish before the tensor is needed, so training runs without pausing. The downside of swapping in too early is that more tensors reside in GPU memory at any point in time, resulting in higher peak GPU memory usage.
Tuning thus becomes finding the balance between n_tensors and lb that provides the best performance for a given model. To start the performance tuning, it is suggested that n_tensors be set to -1, which will swap all reachable tensors, and that lb be kept at its default of 1, which is the latest possible swapin. It is useful to run with these values first and then adjust them. If the model has branches, similar to the 3DUnet model, it is likely useful to set swap_branches to True and tune the branch threshold.
6. Experiments
6.1. Experimental environment
Experiments were run on an IBM POWER8 NUMA-based machine (IBM, 2016) using one GPU. The machine has two 4 GHz 10-core POWER8 processors, eight simultaneous multithreads (SMTs) per core, and 256 GB of RAM per processor. There are four NVIDIA Tesla P100 GPUs (each with 16 GB of memory). NVLinks are used for connections among GPUs and CPUs: one 80 GB/s duplex link between GPUs 0 and 1, one 80 GB/s duplex link between GPUs 2 and 3, two 80 GB/s duplex links from CPU 0 to GPUs 0 and 1, and two 80 GB/s duplex links from CPU 1 to GPUs 2 and 3. On the machine, we installed TensorFlow 1.8, CUDA Toolkit 9.0, and cuDNN 7.0.5.
We evaluated TFLMS using two popular neural networks: ResNet50 for image recognition and 3DUNet for image segmentation. To make a model larger, we increased the batch size of each iteration. By default, we always fused swapout operations.
6.2. Maximum batch size
Model      Image size    Without TFLMS    With TFLMS    Ratio
ResNet50
3DUNet
3DUNet                   OOM
Table 3 shows the maximum batch sizes we were able to train using TFLMS. We let TFLMS swap all reachable tensors to reduce GPU memory consumption as much as possible; note that the 3DUNet architecture changes according to the image size. With TFLMS, we were able to train ResNet50 and 3DUNet with several times larger batch sizes, shown as the ratios in Table 3. For 3DUNet, we were able to train on the whole images without resizing or splitting them, which was impossible without TFLMS.
6.3. Training performance
Figure 6 shows the effect of the parameters n_tensors and lb on the training performance of ResNet50. We measured the number of images per second (images/sec) for each batch size. Without TFLMS, the maximum trainable batch size was limited, and performance for smaller batch sizes was poor because GPU utilization was low. With TFLMS, when we first swapped out all reachable tensors and set lb to 1 to swap in each tensor as late as possible, the maximum batch size we were able to train was several times larger than without TFLMS; however, performance was not good. We then increased lb to swap in tensors earlier so that there was more overlap between computation and communication. It is clear that the higher lb is, the better the training performance, but the maximum batch size decreases because more tensors reside in GPU memory at a time. Similarly, when we decreased the number of tensors being swapped out from all reachable tensors to fewer, we also obtained better performance. n_tensors was more effective than lb on training performance, and lb was less effective than n_tensors on the maximum batch size. Hence, there is a trade-off between n_tensors and lb.
Figure 8 shows the effect of fusing swapin operations. In both cases, we swapped out the same tensors in total, but fewer swapping operations were added to the graph when fuse_swapins was enabled. Fusing swapin operations leads to better performance but a smaller maximum batch size. This is because some tensors are kept in GPU memory for reuse, as mentioned in Section 4.2.3.
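The fusing optimization can be sketched as follows (an illustration under our own naming, not the TFLMS implementation): swapin operations that read the same swapped-out tensor are grouped so that the tensor is swapped back in once and shared by all of its consumers, instead of once per consumer.

```python
def fuse_swapins(swapin_ops):
    """Group swapin operations by the swapped-out tensor they read,
    so each tensor needs only one swapin shared by all consumers.

    swapin_ops: list of (tensor, consumer) pairs, one per unfused
    swapin. Returns a dict: tensor -> list of consumers."""
    fused = {}
    for tensor, consumer in swapin_ops:
        fused.setdefault(tensor, []).append(consumer)
    return fused
```

With three unfused swapins over two tensors, fusing leaves two swapin operations; the shared tensor then stays in GPU memory until its last consumer runs, which is exactly why fusing trades maximum batch size for performance.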
Figure 7 shows a comparison between the two strategies, “chain_rule” and “direct_order”, for finding control dependency operations. Though the strategy “direct_order” is simpler than “chain_rule”, it sometimes had poorer performance for training ResNet50; in particular, “direct_order” was much slower at some batch sizes.
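As an illustration of our reading of the simpler strategy (Section 4.3 defines both; the function name and linear-order assumption here are hypothetical, not TFLMS code), “direct_order” can be sketched as picking the operation lb positions before the consuming operation in a topological order as the control trigger for the swapin.

```python
def direct_order_ctrl_op(topo_order, consumer, lb):
    """Sketch of a 'direct_order'-style strategy: choose the
    operation lb positions before the consumer in the topological
    order as the control dependency that triggers the swapin.

    topo_order: list of operation names in topological order."""
    idx = topo_order.index(consumer)
    return topo_order[max(0, idx - lb)]
```

Because this ignores which operations actually lie on the consumer's dependency chain, the chosen trigger may fire earlier or later than intended relative to real execution order, which is consistent with “direct_order” sometimes performing worse than “chain_rule”.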
Figure 9 shows the results for 3DUNet. The maximum batch size we were able to train with TFLMS was twice as large as that without TFLMS. The effect of the parameters n_tensors and lb for 3DUNet is similar to that for ResNet50: when we decreased n_tensors from all reachable tensors to fewer, we clearly saw better performance, but the maximum batch size decreased. We also measured the effect of swapping branches. When we enabled swapping branches with a threshold, the number of added operations increased while the number of swapped tensors stayed the same, and swapping branches allowed us to train 3DUNet with a larger maximum batch size. Finally, we tried to train 3DUNet with larger images. While without TFLMS we got out-of-memory errors, with TFLMS we were able to train 3DUNet by swapping all reachable tensors with swap_branches set to True and a tuned branch_threshold.
7. Conclusion
We have proposed a formal approach to deriving swapout and swapin operations for enabling large model support. We formally revised the concept of a computational graph and borrowed the theory of program transformations to derive new operations as well as to optimize the graph. Furthermore, we proposed two strategies to statically find control dependency operations for triggering swapin operations. The experimental results showed that our approach helped train very large models, with several times larger batch sizes for ResNet50 and 3DUNet. Though our definition of a computational graph is inspired by TensorFlow, it is general enough to be applied to other computational-graph-based frameworks. In the future, we plan to incorporate the recomputation technique by introducing new transformation rules. Investigating good heuristics for finding control dependency operations is an open problem.
Acknowledgements.
The authors would like to thank Samuel D. Matzek from the IBM Systems PowerAI team for helping refactor our source code for the pull request. The authors would also like to thank Geert Janssen and Minsik Cho from IBM Research for their fruitful discussions on our approach for large model support.

References
 Abadi et al. (2017) Martín Abadi, Michael Isard, and Derek G. Murray. 2017. A Computational Model for TensorFlow: An Introduction. In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2017). ACM, New York, NY, USA, 1–7.
 Chen et al. (2016) T. Chen, B. Xu, C. Zhang, and C. Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. ArXiv e-prints (April 2016). arXiv:1604.06174
 Choi et al. (2018) Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. 2018. Universal Deep Neural Network Compression. CoRR abs/1802.02271 (2018). http://arxiv.org/abs/1802.02271
 Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 2016. 3D UNet: Learning Dense Volumetric Segmentation from Sparse Annotation. CoRR abs/1606.06650 (2016). http://arxiv.org/abs/1606.06650
 Faraone et al. (2017) Julian Faraone, Nicholas J. Fraser, Giulio Gamberdella, Michaela Blott, and Philip Heng Wai Leong. 2017. Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks. CoRR abs/1709.06262 (2017). http://arxiv.org/abs/1709.06262
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. Springer International Publishing, 630–645.
 IBM (2016) IBM. 2016. IBM Power System S822LC for High Performance Computing. http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In International Conference on Neural Information Processing Systems. 1097–1105.
 Meng et al. (2017) Chen Meng, Minmin Sun, Jun Yang, Minghui Qiu, and Yang Gu. 2017. Training deeper models by GPU memory optimization on TensorFlow. In Proc. of ML Systems Workshop in NIPS.
 Rhu et al. (2016) M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. ArXiv e-prints (Feb. 2016). arXiv:1602.08124
 Sakharnykh (2017) Nikolay Sakharnykh. 2017. Unified memory on Pascal and Volta. (2017). http://on-demand.gputechconf.com/gtc/2017/presentation/s7285-nikolay-sakharnykh-unified-memory-on-pascal-and-volta.pdf GTC.
 Shirahata et al. (2016) K. Shirahata, Y. Tomita, and A. Ike. 2016. Memory reduction method for deep neural network training. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). 1–6.
 Wang et al. (2018) Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU Memory Management for Training Deep Neural Networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18). ACM, New York, NY, USA, 41–53.