TFLMS: Large Model Support in TensorFlow by Graph Rewriting

Tung D. Le (tung@jp.ibm.com), Haruki Imai (imaihal@jp.ibm.com), Yasushi Negishi (negishi@jp.ibm.com), and Kiyokuni Kawachiya (kawatiya@jp.ibm.com)
IBM Research - Tokyo, 19-21, Nihonbashi Hakozaki-cho, Chuo-ku, Tokyo, Japan 103-8510
Abstract.

While accelerators such as GPUs have limited memory, deep neural networks are becoming larger and no longer fit within the memory of accelerators for training. We propose an approach to tackle this problem by rewriting the computational graph of a neural network, in which swap-out and swap-in operations are inserted to temporarily store intermediate results in CPU memory. In particular, we first revise the concept of a computational graph by defining a concrete semantics for variables in a graph. We then formally show how to derive swap-out and swap-in operations from an existing graph and present rules to optimize the graph. To realize our approach, we developed a module in TensorFlow, named TFLMS. TFLMS is published as a pull request in the TensorFlow repository as a contribution to the TensorFlow community. With TFLMS, we were able to train ResNet-50 and 3DUnet with larger batch sizes. In particular, we were able to train 3DUnet for image segmentation on images that, without TFLMS, could be handled only by dividing them into smaller images, which affects the accuracy.

1. Introduction

Deep neural networks together with deep learning are effective for solving complex signal-processing problems such as those in computer vision, speech recognition, and natural language processing. However, training a neural network is time-consuming, often taking days to weeks. The training is mainly based on matrix multiplications; therefore, it is often accelerated using accelerators such as GPUs. In 2012, GPUs were used to train AlexNet (Krizhevsky et al., 2012), a deep convolutional neural network of 8 layers, which achieved outstanding image classification results in the ILSVRC-2012 competition (http://www.image-net.org/challenges/LSVRC/2012/) with a top-5 test error rate of 15.3%. Since then, GPUs have been popular for deep learning.

After the success of AlexNet in the ILSVRC-2012 competition, deep learning has evolved quickly for a broader spectrum of applications. Neural networks have become deeper (more layers) and larger; e.g., ResNet-1001 consists of 1001 layers and is much deeper than AlexNet (He et al., 2016). As a result, neural networks are sometimes too large to fit in GPU memory for training.

From the hardware viewpoint, GPUs could be designed with a larger physical memory, but increasing physical memory is expensive. From the software viewpoint, there are three main approaches to solving this problem. The first is reducing memory consumption by reusing memory regions for different computations (Shirahata et al., 2016), compressing a neural network (Choi et al., 2018), or using low precision (Faraone et al., 2017). The second is re-computing some of the computations from checkpoints (Chen et al., 2016). The third is using an external memory such as CPU memory for temporarily storing intermediate results during training (Rhu et al., 2016; Meng et al., 2017).

We pursued the third approach of using an external memory because it often helps with training a larger model compared to the other approaches and it can be generally applied to any neural networks. Different from the previous studies involving swapping data from GPU memory to an external memory, and vice versa, in an ad-hoc manner, we propose an approach based on formal rules for graph rewriting, which is provable. Our contributions in this paper are as follows:

  • We revised the concept of a computational graph of a neural network. Our definition of a computational graph is inspired by that in TensorFlow (Abadi et al., 2017), a popular framework for deep learning. Different from a computational graph in TensorFlow, variables in our computational graph are first-class citizens and consistent with the concept of operations in a computational graph.

  • We formally derived, from an existing graph, the swap-out and swap-in operations that are used to exchange intermediate results between GPUs and CPUs. The derivation is based on rules of program transformation with a correctness guarantee, which helps us understand the nature of swapping operations.

  • We presented two strategies for finding control operations that are used to control when data are swapped in from an external memory to GPU memory, which helps improve performance.

  • To realize our approach, we developed a module in TensorFlow, called TFLMS. TFLMS is published as a pull request in the TensorFlow repository as a contribution to the TensorFlow community. With TFLMS, we were able to train ResNet-50 (He et al., 2015) and 3DUnet (Çiçek et al., 2016) with larger batch sizes. In particular, we were able to train 3DUnet for image segmentation on images that, without TFLMS, could be handled only by dividing them into smaller images.

The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we discuss our proposed approach involving revising the concept of a computational graph and presenting the semantics of the graph. In Section 4, we discuss the rules to derive swap-out and swap-in operations and optimizations. In Section 5, we discuss our TFLMS module that implements our approach in TensorFlow. In Section 6, we present the experimental results. Section 7 summarizes the key points and discusses future work.

2. Related work

The most intuitive method for training large models is using Unified Memory (Sakharnykh, 2017), a single memory address space accessible from both CPUs and GPUs. Enabling Unified Memory is simple, but its performance is very poor compared to custom methods that manually offload and prefetch data. Shirahata et al. (Shirahata et al., 2016) proposed a reduction approach of reusing, during the backward phase, the memory regions allocated for the forward phase. Rhu et al. (Rhu et al., 2016) proposed a different approach of managing runtime memory by virtualizing the memory usage of neural networks against both GPU and CPU memory. During training, only the current layer is active and consumes GPU memory while the other layers’ data are swapped out to the CPU memory. This approach performed better than using Unified Memory. Meng et al. (Meng et al., 2017) took the same approach as (Rhu et al., 2016) for TensorFlow by swapping tensors from GPU memory to CPU memory and vice versa. However, the authors did not discuss how to derive swap-out and swap-in operations (Meng et al., 2017). Besides, we could not find their TensorFlow source code. We borrowed Meng et al.’s idea but formally defined transformation rules for graph rewriting so that the correctness of the transformed computational graph is provable. Apart from using CPU memory as a temporary memory for computation, Chen et al. (Chen et al., 2016) proposed an approach of gradient-checkpointing, in which checkpointing vertices in a computational graph are automatically defined using graph partition. Parts of the graph in between checkpointing vertices are re-computed during the backward phase. The forward phase is generally computed twice. Wang et al. (Wang et al., 2018) combined both swapping and recomputation in a single framework.

3. Computational graphs

Figure 1. Computational graph for an example expression. Vertices are operations and edges are tensors. Continuous arrows represent "read" edges and dotted arrows represent "update" edges. Double circles are parameterized operations including variables and constants.
Vertices

  • Normal operation: takes a list of inputs and returns an output.

  • Parameterized operation or variable.

Edges

  • "Read" edge: the destination operation reads the output of the source operation as input.

  • "Update" edge: the source operation produces an output that is used to update the destination variable. The output and the variable must have the same type.

  • "Control" edge: the destination operation cannot be executed unless the source operation has finished.

Table 1. Notations

Figure 2. Examples of graphs regarding variables: (a) a variable is updated with the output of an operation; (b) a variable is immediately updated with the output of an operation and is then read by another operation; (c) two operations access a variable at the same time, and a control edge is necessary to force one to be executed after the other. Integers above or below a vertex give the order of that vertex in the topological ordering.

A computational graph is a core concept in TensorFlow. Neural networks defined by users are represented by a computational graph of operations. TensorFlow then executes optimizations over the graph before invoking operations in the graph. In this section, we revise the concept of a computational graph in TensorFlow (Abadi et al., 2017) to make its semantics more consistent.

3.1. Definition

Definition 1 (Computational graph). Let $G = (V, E, \lambda_V, \lambda_E)$ be a vertex- and edge-labeled directed graph, where $V$ is the set of vertices in $G$, $E \subseteq V \times V$ is the set of edges in $G$, $\lambda_V$ is a function mapping each vertex to a tuple of an operation and a Boolean value indicating whether the operation is parameterized or not, and $\lambda_E$ is a function mapping each edge to a tuple of a value of some data type and an action in $\{\text{read}, \text{update}, \text{control}\}$.

Computational graphs are a way to express mathematical expressions in which each vertex is an operation whose inputs are its incoming edges and whose outputs are its outgoing edges. In deep learning, computational graphs are used to express computations in neural networks that consist of operations whose inputs and outputs are often multi-dimensional arrays, commonly called tensors. Tensors that store the internal states of a neural network, e.g., learnable weights and biases in hidden layers, are updated regularly. Hence, we classify operations into normal operations and parameterized operations, where parameterized operations have internal states that can be updated. A variable is a special parameterized operation that updates its internal state using the identity operation (the identity function accepts a value and returns the same value). A constant is a special case of a variable whose value is set once and is never updated. Each edge has a value indicating an action related to the tensor on the edge. There are three actions: "read", "update", and "control". Considering an edge from an operation $f$ to an operation $g$, actions "read" and "update" mean that $g$ reads and updates the tensor, respectively; action "control" means that $f$ triggers the execution of $g$, and $f$ is called a control dependency operation.
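The labeled graph described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Action`, `Vertex`, `Edge`, and `Graph`, and the example operations `x`, `f`, and `g`, are hypothetical and are not part of TensorFlow or TFLMS.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    READ = "read"
    UPDATE = "update"
    CONTROL = "control"


@dataclass(frozen=True)
class Vertex:
    name: str
    parameterized: bool = False  # True for variables and constants


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    action: Action


@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    edges: list = field(default_factory=list)

    def add_vertex(self, name, parameterized=False):
        self.vertices[name] = Vertex(name, parameterized)

    def add_edge(self, src, dst, action=Action.READ):
        # An "update" edge only makes sense when the destination has state.
        if action is Action.UPDATE and not self.vertices[dst].parameterized:
            raise ValueError("only parameterized operations can be updated")
        self.edges.append(Edge(src, dst, action))

    def consumers(self, name):
        """Operations that read the output of `name`."""
        return [e.dst for e in self.edges
                if e.src == name and e.action is Action.READ]


# Hypothetical graph: variable x feeds f, f feeds g, and g updates x.
graph = Graph()
graph.add_vertex("x", parameterized=True)
graph.add_vertex("f")
graph.add_vertex("g")
graph.add_edge("x", "f")
graph.add_edge("f", "g")
graph.add_edge("g", "x", Action.UPDATE)
```

Keeping the action on the edge, rather than on the vertex, mirrors the definition: the same operation can feed one consumer through a "read" edge and a variable through an "update" edge.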

Figure 1 shows a computational graph for an example expression with three variables. An outgoing edge emanating from a variable means reading the variable value, and an incoming edge to a variable means updating the variable with a tensor (denoted with a dotted arrow).

3.2. Notations and semantics

Table 1 lists the notations to represent different vertices and edges in a graph. Function composition is denoted as "$\circ$", and, from its definition, we have $(g \circ f)(x) = g(f(x))$. A projection function is used to take the $i$-th element of a tuple.

An operation in a computational graph is generally triggered to execute when all of its incoming edges have data. The operation generates data on its outgoing edges then other operations are repeatedly triggered in the same manner. This procedure ends when all of the reachable operations are executed and all of the reachable edges are filled with data. In other words, each of the reachable operations, except variables, is executed once.

However, there is no way to trigger the execution of a graph. At the beginning of computation, there is no way to set a value for an edge. Furthermore, computational graphs are acyclic graphs, and there are some operations with no incoming edges. These operations cannot be triggered. This problem is resolved using variables.

Variables in a computational graph are used to store learnable parameters, input and output data, and are used to trigger computation of the graph. Variables are special and make a computational graph for deep learning different from a general dependency graph. Because a variable has an internal state, defining its semantics is non-trivial in the context of the graph. At the beginning, variables are initialized with values input by users or random values generated by a distribution. During training, they are updated by a learning optimizer. This leads to a variable being visited more than once, and may introduce cycles if its semantics is ambiguous. The remainder of this section introduces a clear semantics for variables.

To describe the semantics of a computational graph containing variables, we first define a topological ordering over a computational graph.

Definition 2.

(Topological ordering) Given a computational graph , let be the number of vertices in the graph, topological ordering is a mapping of vertices to an integer, , satisfying

  • , and

  • .

In general, a topological ordering represents the order of execution of operations in a graph. Given two operations $f$ and $g$ with orders $\tau(f)$ and $\tau(g)$, if $\tau(f) < \tau(g)$, $f$ is executed before $g$. If $\tau(f) = \tau(g)$, $f$ and $g$ are executed in parallel. In this paper, variables always have an order of 0, which means variables are executed first, and incoming edges ("update" edges) to them do not change their order. Later executions of a variable depend on its incoming operations, and are independent of the variable's order. These executions alone do not trigger the variable's outgoing operations.

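Let us consider the example graph in Figure 2(a), in which a variable (order 0) feeds a chain of two operations (orders 1 and 2) and is updated with the output of the second one. First, the variable is initialized by the user and triggers the first operation. That operation is executed and triggers the second operation. Finally, the variable is updated with the output of the second operation, and the computation finishes. The update of the variable depends on the second operation only, and by itself cannot trigger the first operation again.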

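The example graph in Figure 2(b) adds an operation that reads the variable. The graph may have two possible execution orderings, depending on whether this operation runs before or after the variable is updated. The operation is triggered based on the availability of its input tensors, so it must be executed after the operations that produce them. However, the variable is executed multiple times, so it is important to know which output of the variable is used as input to the reading operation.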

To avoid ambiguity, we present the following convention regarding variables:

  • An operation is always using the latest value of a variable.

  • Variables always have the highest priority of execution among operations consuming the same tensor.

In Figure 2(b), this convention ensures that the reading operation is executed after the variable has been updated with the output of the preceding operation.

The execution order of an operation depends not only on data availability on incoming edges but also on control dependency edges. "Control" edges do not carry data; in other words, they are not inputs to the operation. They are used only to control the execution order of an operation. Adding a "control" edge to a graph alters the topological ordering of the graph. If there is a "control" edge from an operation $f$ to an operation $g$, then $g$ must be executed after $f$, and $\tau(f) < \tau(g)$, where $\tau$ is the topological ordering. By this definition, there is no control edge to a variable.

The example graph in Figure 2(c) has a new operation that consumes the output of an existing operation, executes its computation, and updates the variable. Without a control edge, once their common input is available, the new operation and the operation that reads the variable can be executed in parallel because they do not depend on each other. Because they both access the variable, i.e., one reads it and the other writes to it, a control edge is necessary to ensure that they access it in order. The "control" edge from the writing operation to the reading operation states that the reader will be executed only after the writer has finished and updated the variable.
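The effect of a control edge on the ordering can be sketched as follows. This is a minimal illustration, assuming that variables are pinned at order 0, that "update" edges are ignored when computing orders, and that every other operation gets one plus the largest order among its "read" and "control" predecessors. The function `topological_order` and the operation names `x`, `f`, `g`, `h`, and `k` are hypothetical.

```python
def topological_order(variables, edges):
    """Return {operation: order} for edges given as (src, dst, action)."""
    preds = {}
    for src, dst, action in edges:
        if action != "update":  # "update" edges do not constrain the order
            preds.setdefault(dst, set()).add(src)

    order = {v: 0 for v in variables}  # variables are executed first

    def visit(op):
        if op not in order:
            order[op] = 1 + max((visit(p) for p in preds.get(op, ())),
                                default=0)
        return order[op]

    for op in preds:
        visit(op)
    return order


# x is a variable; h reads x, while k consumes g's output and updates x.
edges = [
    ("x", "f", "read"), ("f", "g", "read"),
    ("g", "h", "read"), ("x", "h", "read"),
    ("g", "k", "read"), ("k", "x", "update"),
]
without_control = topological_order({"x"}, edges)

# A control edge from the writer k to the reader h serializes the two.
with_control = topological_order({"x"}, edges + [("k", "h", "control")])
```

Without the control edge, `h` and `k` share the same order and may run in parallel; with it, `h` is pushed one step later than `k`.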

3.3. Training using back-propagation

Training a neural network involves minimizing an objective function measuring the distance between a ground truth value and predicted value. The objective function is a composition of multiple functions with learnable parameters, and the gradient descent algorithm is often used to minimize the function. Optimization is an iterative procedure updating learnable parameters so that the objective function is minimized, in which each training iteration consists of three phases: forward phase to compute the objective function, backward phase to compute gradients of the objective function with respect to learnable parameters, and update phase to update learnable parameters using the gradients. Backward phase is done via back-propagation for efficiency, starting from the objective function and propagating back gradients through the functions. At the beginning of an iteration, tensors are cleaned up except variables for learnable parameters. Variables for input tensors are fed with new data and trigger the iteration. Because a training dataset is often very big, each iteration takes only a subset (batch) of examples extracted from the training dataset as its input tensor. The number of examples in a batch (or batch size) will affect the size of the input tensor and also other tensors in the computational graph. In general, increasing batch size will make a model larger.

Figure 3 shows how learnable parameters (represented by variables) are updated during training. In the forward phase, a variable is an input to a forward function, the outputs of that function are used by later functions, and finally a loss value is produced by the objective function. In the backward phase, we compute gradients of the objective function with respect to the learnable parameters. The gradient function that computes the gradient of the objective function with respect to the variable requires the forward function's output as one of its inputs. Finally, the variable is updated by an update function during the update phase.

Figure 3. How a variable is used and updated in training, across the forward, backward, and update phases.
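The three phases can be illustrated with a one-parameter model. This is a toy sketch, not the paper's setup: the model `w * x`, the squared-error loss, the learning rate, and the name `train_step` are all assumptions chosen for brevity. The point is that the backward phase reuses a value produced in the forward phase, which is why forward tensors tend to have long lifetimes.

```python
def train_step(w, x, y, lr=0.1):
    # Forward phase: the prediction must stay alive until the backward phase.
    pred = w * x
    loss = (pred - y) ** 2

    # Backward phase: the gradient with respect to w needs the forward output.
    grad_w = 2.0 * (pred - y) * x

    # Update phase: the learnable parameter is updated with the gradient.
    w = w - lr * grad_w
    return w, loss


w = 0.0
losses = []
for _ in range(20):
    w, loss = train_step(w, x=1.0, y=2.0)
    losses.append(loss)
```

In a deep network the same pattern repeats per layer, so the outputs of early layers are held the longest.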

3.4. Device placement

In TensorFlow, each operation in the computational graph is placed on a device such as a GPU, CPU, or FPGA. Communication between two devices occurs automatically if an operation on one device consumes a tensor produced by an operation on another device. In fact, TensorFlow adds a pair of operations, "send" and "receive", to the graph for exchanging a tensor. In this paper, we do not show these communication operations when drawing graphs.

3.5. Garbage collection

If a tensor is no longer used in TensorFlow, it is released by TensorFlow garbage collection. Every tensor is assigned a reference count, which is the number of operations that consume it. Each time a tensor is consumed by an operation, its reference count is decreased by one. If the reference count reaches zero, the tensor is available to be released. In other words, the lifetime of a tensor is from the operation generating it to the last operation consuming it. Let $t$ be a tensor produced by an operation $f$, and $g_1, \ldots, g_k$ be the operations consuming $t$. With $\tau$ denoting the topological ordering of Section 3.2, the lifetime of $t$ is computed as $\max_{1 \le i \le k} \tau(g_i) - \tau(f)$.
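As a small sketch of this lifetime computation, assuming the topological order of every operation is already known, the helper below takes the largest consumer order minus the producer order. The name `tensor_lifetimes` and the operation names are hypothetical; the orders 10, 11, 18, and 25 are used only for illustration.

```python
def tensor_lifetimes(order, consumers):
    """order: {op: topological order}; consumers: {producer: [consumers]}."""
    return {
        producer: max(order[c] for c in ops) - order[producer]
        for producer, ops in consumers.items()
        if ops  # a tensor without consumers has no lifetime to measure
    }


order = {"f": 10, "g1": 11, "g2": 18, "g3": 25}
consumers = {"f": ["g1", "g2", "g3"]}
lifetimes = tensor_lifetimes(order, consumers)
```

Here the output of `f` stays in memory until `g3` runs, even though `g1` consumes it immediately.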

4. Graph rewriting

Figure 4. Example of graph rewriting for supporting large models, in four steps: the original graph; swapping out tensors to CPU memory; introducing swap-in operations; and fusing swap-out operations. Thick edges in a left subgraph are rewritten to produce the right subgraph. Integers above or below a vertex give the order of that vertex in the topological ordering. In this example, the threshold that triggers graph rewriting is such that two outgoing edges of the producing operation are rewritten, and two control dependency operations trigger the executions of the two swap-in operations, respectively.

A computational graph, or a neural network model, is said to be too large to be trained within the memory limitation of GPUs if the tensors that must be kept in GPU memory at the same time consume more memory than is available. Hence, an out-of-memory error often happens when training such a large graph. This is essentially because there are many tensors with a long lifetime in a computational graph. In this section, we show how to rewrite a large graph so that training it is possible with limited GPU memory. In general, our idea is to temporarily send "long lifetime" tensors from the GPU to the CPU and send them back to the GPU when necessary.

4.1. Swapping out tensors to CPU memory

To move a tensor residing in GPU memory to CPU memory, we derive operations that automatically send the tensor to the CPU and send it back to the GPU. Let us consider an edge from an operation $f$ to an operation $g$, where $f$ and $g$ are both executed using a GPU. The computation for this edge is

$$g^{G} \circ f^{G} \qquad (1)$$

where the superscript $G$ stands for GPU. This computation can be rewritten into:

$$g^{G} \circ id^{C} \circ f^{G} \qquad (2)$$

where the superscript $C$ stands for CPU, and $id$ is the identity function, that is, $id(x) = x$.

Since $id^{C}$ is executed using a CPU, the output tensor of $f^{G}$ will be swapped out to the CPU memory for $id^{C}$ immediately after $f^{G}$ finishes, and GPU memory is released. The output tensor of $id^{C}$ will be swapped in to the GPU for $g^{G}$ when $g^{G}$ is triggered. We call function $id^{C}$ in Equation 2 a swap-out operation.

Using Equation 2, we are able to rewrite a graph so that GPU memory consumption is reduced. However, not all edges need to be rewritten. Let $\tau$ denote the topological ordering of Section 3.2. For edges where $\tau(g) - \tau(f) = 1$, $g$ is executed immediately after $f$. Hence, there is no need to swap the tensor on such edges. We can define a threshold, and graph rewriting for an edge is triggered only if the distance $\tau(g) - \tau(f)$ exceeds that threshold.
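The threshold rule can be sketched on a plain edge list. This is an illustration under assumptions, not the TFLMS implementation: the function name `insert_swap_outs`, the naming of the inserted vertices, the strict comparison against the threshold, and the threshold value 5 are all hypothetical choices.

```python
def insert_swap_outs(order, edges, threshold):
    """Rewrite long GPU-to-GPU edges so they pass through a CPU identity.

    order: {op: topological order}; edges: list of (producer, consumer).
    """
    rewritten = []
    for src, dst in edges:
        if order[dst] - order[src] > threshold:
            swap_out = f"swap_out({src}->{dst})"  # identity placed on the CPU
            rewritten.append((src, swap_out))     # GPU to CPU
            rewritten.append((swap_out, dst))     # CPU back to GPU
        else:
            rewritten.append((src, dst))          # short edge: keep as is
    return rewritten


order = {"f": 10, "g1": 11, "g2": 18, "g3": 25}
edges = [("f", "g1"), ("f", "g2"), ("f", "g3")]
new_edges = insert_swap_outs(order, edges, threshold=5)
```

Each rewritten edge gets its own swap-out vertex in this sketch, which is exactly the redundancy that the fusion rule in Section 4.2.2 removes.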

4.2. Optimization

Equation 2 is not optimized for two reasons. First, the output tensor of the producing operation is swapped in too late, so the consuming operation must wait for the tensor to be sent from CPU memory to GPU memory. Second, the tensor may be swapped out and swapped in multiple times, since there may be multiple operations other than the consuming operation that read it. In this section, we present three rules to optimize Equation 2. Figure 4 shows the computational graphs obtained by applying the optimization rules.

4.2.1. Introduce swap-in operations

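To swap a tensor in early, we need an additional operation. An identity function can be rewritten as the composition of a function $h$ and its inverse function $h^{-1}$, that is,

$$id = h^{-1} \circ h \qquad (3)$$

Equation 2 becomes:

$$g^{G} \circ (h^{-1})^{C} \circ h^{C} \circ f^{G} \qquad (4)$$

where, as in Equation 2, $f^{G}$ and $g^{G}$ are the producing and consuming operations executed on the GPU, and the superscript $C$ marks execution on the CPU.

Since $id$ also has an inverse function, i.e., $id^{-1} = id$, we choose $id$ for $h$ (if one would like to reduce the memory consumption on the CPU, a pair of encoding and decoding functions can be used for $h$ and $h^{-1}$ instead of $id$),

$$g^{G} \circ id^{C} \circ id^{C} \circ f^{G} \qquad (5)$$

In Equation 5, the $id^{C}$ adjacent to $g^{G}$ is used to swap a tensor in to a device, and we call it a swap-in operation. It is worth noting that we must explicitly control when the swap-in operation is triggered; otherwise, it is executed immediately after the swap-out operation. To do this, a control edge from an operation to the swap-in operation must be added. We present two strategies for choosing a control operation in Section 4.3.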
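That inserting identities (or any function paired with its inverse) leaves the computed value unchanged can be checked with ordinary function composition. This is a toy sketch: `compose`, the functions `f` and `g`, and the `encode`/`decode` pair are hypothetical stand-ins, and device placement is not modeled.

```python
from functools import reduce


def compose(*funcs):
    """compose(g, f)(x) == g(f(x)); functions apply right to left."""
    return reduce(lambda outer, inner: lambda x: outer(inner(x)), funcs)


def identity(x):
    return x


def f(x):          # stand-in for the producing operation
    return x + 1


def g(x):          # stand-in for the consuming operation
    return 2 * x


def encode(x):     # stand-in for h: wrap the value
    return (x,)


def decode(t):     # stand-in for the inverse of h: unwrap the value
    return t[0]


original = compose(g, f)
with_swap_out = compose(g, identity, f)
with_swap_in = compose(g, identity, identity, f)
with_codec = compose(g, decode, encode, f)
```

All four compositions agree on every input, which is the correctness argument behind the rewriting; only where and when the intermediate tensor lives changes.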

4.2.2. Fuse swap-out operations

A tensor produced by an operation is often used by multiple operations, and it is redundant if the tensor is swapped out to CPU memory multiple times. Hence, it is recommended to always fuse swap-out operations of the same tensor into a single swap-out operation.

4.2.3. Fuse swap-in operations

Consider a situation in which multiple swap-in operations swap a tensor in multiple times for multiple consuming operations. If the tensor is large and the consuming operations are close to each other, swapping the tensor in multiple times introduces more overhead. In this case, it is better to fuse the swap-in operations into one swap-in operation. The tensor is then swapped in only once and resides in GPU memory to be reused by the consuming operations. For example, in the right-most graph in Figure 4, if the two consuming operations are close and the swapped tensor is large, then we fuse the two swap-in operations into a single swap-in operation. To determine how close two operations are, we may define a threshold for the distance between them.
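One way to decide which swap-in operations to fuse is to group consumers whose orders are close. The sketch below is an assumption about how such grouping could work, not the TFLMS implementation: `group_swap_ins` and `fuse_distance` are hypothetical names, and closeness is measured against the previous consumer in sorted order.

```python
def group_swap_ins(consumer_orders, fuse_distance):
    """Group consumers of one swapped tensor; each group shares a swap-in.

    consumer_orders: {consumer: topological order}.
    """
    groups = []
    for op, op_order in sorted(consumer_orders.items(), key=lambda kv: kv[1]):
        if groups and op_order - groups[-1][-1][1] <= fuse_distance:
            groups[-1].append((op, op_order))  # close enough: reuse swap-in
        else:
            groups.append([(op, op_order)])    # too far: needs its own swap-in
    return [[op for op, _ in group] for group in groups]


consumers = {"g2": 18, "g3": 25}
fused = group_swap_ins(consumers, fuse_distance=10)
separate = group_swap_ins(consumers, fuse_distance=5)
```

A larger `fuse_distance` saves transfers at the price of keeping the tensor in GPU memory between the grouped consumers.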

4.3. Strategies to add control edges

Control edges to swap-in operations are added to a computational graph to control when swap-in operations are triggered. They are important for reducing the communication overhead of swapping tensors in. Considering Equation 5, a control operation for the swap-in operation must be chosen from the operations that are executed after the producing operation and before the consuming operation, to guarantee the correctness of the computational graph. Let $k$ be the distance, in the topological ordering, between the control operation and the consuming operation. If $k$ is too small, the tensor is swapped in too late, and the consuming operation has to wait for the tensor. If $k$ is too large, the tensor is swapped in too early, and it is kept in the device for a long time before actually being used by the consuming operation.

An ideal solution for choosing a control operation is to have a cost model for computational graphs and use the model to prioritize operations. However, in TensorFlow, the shapes of the input and output tensors of an operation are generally unknown until data are fed into the graph and trigger the operation. This means that, at the time a graph is rewritten, there is no information about the actual size of tensors, so operation costs cannot be computed statically.

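In the context of statically modifying a computational graph, we introduce two parameters, a lower-bound and an upper-bound, to handle choosing control operations. Let us assume that an edge from an operation $f$ to an operation $g$, both executed on the GPU, is rewritten using a swap-out operation $s_{out}$ and a swap-in operation $s_{in}$:

$$g^{G} \circ s_{in}^{C} \circ s_{out}^{C} \circ f^{G} \qquad (6)$$

We present two strategies to find a control operation for $s_{in}$.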

4.3.1. Direct-order strategy

The direct-order strategy involves directly using the topological ordering to obtain a set of candidates for the control operation, starting from the target operation and going back to the source operation. The lower-bound and upper-bound are relative to the target operation.

Algorithm 1 shows the algorithm of this strategy. Candidates are operations whose distance to the target operation is in the range from the lower-bound to the upper-bound (the for loop) and from which there exists a path to the target operation (the reachability check). The algorithm stops once it has found one operation satisfying these conditions.

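Input: source operation $f$, target operation $g$, lower-bound $lb$, upper-bound $ub$
Output: an operation
$o_{min} \leftarrow \tau(f)$    ▷ Lowest order
for $i = lb$ to $ub$ do
     $o \leftarrow \tau(g) - i$    ▷ Order of operations before $g$
     if $o \le o_{min}$ then    ▷ Out of range
         return null
     end if
     $S \leftarrow \{s \mid \tau(s) = o\}$    ▷ All operations of order $o$
     $S \leftarrow \{s \in S \mid g$ is reachable from $s\}$
     if $S$ is not empty then
         $c \leftarrow$ GET($S$)    ▷ Randomly get one item in $S$
         return $c$
     end if
end for
Algorithm 1 Direct-order strategy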
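A Python sketch of the direct-order strategy follows. It is an interpretation under assumptions rather than the TFLMS source: the function name `direct_order`, the dictionary and edge-list representation, the inclusive treatment of the upper-bound, and the comparison used for the out-of-range test are all choices made for this example.

```python
import random


def direct_order(order, edges, src, dst, lb, ub):
    """Pick a control operation at distance lb..ub before dst, or None.

    order: {op: topological order}; edges: list of (producer, consumer).
    """
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)

    def reachable(start, goal):
        stack, seen = [start], set()
        while stack:
            op = stack.pop()
            if op == goal:
                return True
            if op not in seen:
                seen.add(op)
                stack.extend(successors.get(op, []))
        return False

    lowest = order[src]
    for distance in range(lb, ub + 1):
        wanted = order[dst] - distance
        if wanted <= lowest:  # walked back past the source: out of range
            return None
        candidates = [op for op, o in order.items()
                      if o == wanted and reachable(op, dst)]
        if candidates:
            return random.choice(candidates)
    return None


# Hypothetical graph: the edge f -> g is the one being rewritten.
order = {"f": 0, "a": 1, "b": 2, "s": 2, "c": 3, "g": 4}
edges = [("f", "a"), ("a", "b"), ("a", "s"), ("b", "c"), ("c", "g"), ("f", "g")]
two_before = direct_order(order, edges, "f", "g", lb=2, ub=3)
one_before = direct_order(order, edges, "f", "g", lb=1, ub=3)
too_far = direct_order(order, edges, "f", "g", lb=4, ub=10)
```

Operation `s` has the right order but no path to `g`, so it is rejected; this is the reachability condition that keeps the added control edge from creating an invalid dependency.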

4.3.2. Chain-rule strategy

The chain-rule strategy involves starting from the source operation and going down along the forward phase to find corresponding backward operations as candidates for control operations. Breadth-first search is used to traverse operations in the forward phase, in which the lower-bound and upper-bound are used to limit the search space of forward operations. In other words, the lower-bound and upper-bound are relative to the source operation.

Algorithm 2 shows the algorithm of this strategy. For breadth-first search, we maintain two open sets, $O_1$ and $O_2$, and one closed set, $C$. $O_1$ contains the current forward operations, and $O_2$ contains the forward operations for the next level (including all outgoing operations of the operations in $O_1$). $C$ contains the visited operations. Starting from the source operation, once the algorithm is within the range from the lower-bound to the upper-bound, it obtains the outgoing backward operations of the current operation, then checks the validity of these backward operations. If there is one valid operation, it is a candidate and the algorithm returns it. Otherwise, the algorithm goes to the next level.

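Input: source operation $f$, target operation $g$, lower-bound $lb$, upper-bound $ub$
Output: an operation
$O_1 \leftarrow \{f\}$; $O_2 \leftarrow \emptyset$; $C \leftarrow \emptyset$
while $O_1$ is not empty do
     if $ub = 0$ or $lb > ub$ then
         return null
     end if
     $s \leftarrow$ GET($O_1$)    ▷ Get one item in $O_1$
     $S \leftarrow$ Out($s$)    ▷ Outgoing operations of $s$
     if $lb \le 0$ then    ▷ Inside the range
         $B \leftarrow$ backward operations in $S$
         $B \leftarrow \{b \in B \mid \tau(b) < \tau(g)\}$
         $B \leftarrow \{b \in B \mid g$ is reachable from $b\}$
         if $B$ is not empty then
              $c \leftarrow$ GET(B)    ▷ Randomly get one item in $B$
              return $c$
         end if
     end if
     $N \leftarrow$ forward operations in $S$
     for $n$ in $N$ do
         if $n \in C$ then    ▷ $n$ is visited
              continue
         end if
         if $n \notin O_2$ then
              $O_2 \leftarrow O_2 \cup \{n\}$
         end if
     end for
     $C \leftarrow C \cup \{s\}$    ▷ mark $s$ as visited
     if $O_1$ is empty then    ▷ go down one level
         $O_1 \leftarrow O_2$; $O_2 \leftarrow \emptyset$
         $lb \leftarrow lb - 1$; $ub \leftarrow ub - 1$
     end if
end while
Algorithm 2 Chain-rule strategy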
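A Python sketch of the chain-rule strategy follows. It is an interpretation under assumptions, not the TFLMS source: the function name `chain_rule`, the adjacency-dictionary representation, the validity test (ordered before the target and able to reach it), and the decrement of both bounds at each level are choices made for this example.

```python
import random


def chain_rule(order, out, backward_ops, src, dst, lb, ub):
    """Breadth-first search from src along forward operations.

    out: {op: [outgoing ops]}; backward_ops: set of backward operations.
    Returns a backward operation to use as control operation, or None.
    """
    def reachable(start, goal):
        stack, seen = [start], set()
        while stack:
            op = stack.pop()
            if op == goal:
                return True
            if op not in seen:
                seen.add(op)
                stack.extend(out.get(op, []))
        return False

    open1, open2, closed = [src], [], set()
    while open1:
        if ub == 0 or lb > ub:
            return None
        op = open1.pop(0)
        outgoing = out.get(op, [])
        if lb <= 0:  # inside the range
            candidates = [b for b in outgoing
                          if b in backward_ops
                          and order[b] < order[dst]
                          and reachable(b, dst)]
            if candidates:
                return random.choice(candidates)
        for nxt in outgoing:  # queue forward operations of the next level
            if nxt in backward_ops or nxt in closed:
                continue
            if nxt not in open2:
                open2.append(nxt)
        closed.add(op)
        if not open1:  # go down one level
            open1, open2 = open2, []
            lb, ub = lb - 1, ub - 1
    return None


# Hypothetical three-layer model: f1 -> f2 -> f3 -> loss, then b3 -> b2 -> b1.
out = {"f1": ["f2", "b1"], "f2": ["f3", "b2"], "f3": ["loss", "b3"],
       "loss": ["b3"], "b3": ["b2"], "b2": ["b1"]}
order = {"f1": 1, "f2": 2, "f3": 3, "loss": 4, "b3": 5, "b2": 6, "b1": 7}
backward = {"b1", "b2", "b3"}
one_level = chain_rule(order, out, backward, "f1", "b1", lb=1, ub=3)
two_levels = chain_rule(order, out, backward, "f1", "b1", lb=2, ub=3)
exhausted = chain_rule(order, out, backward, "f1", "b1", lb=1, ub=1)
```

Going one forward level down from `f1` selects the backward operation of `f2`, so the tensor is swapped in one backward step before `b1` needs it; a larger lower-bound moves the swap-in earlier.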

5. TFLMS module in TensorFlow

Figure 5. TFLMS module in TensorFlow: a user-defined model is passed to TFLMS (graph rewriting), and the rewritten graph is executed by TensorFlow's session.
| Parameter | Meaning | Default value |
| --- | --- | --- |
| graph | The graph we will modify for LMS. This should be the graph of the user-defined neural network. | required |
| optimizer_scopes | A set of scopes for the optimizers/solvers. | required |
| starting_scope | Tensors that are reachable from the operations in this scope will be swapped for LMS. Set this to the scope of the first layer if we would like to modify the whole graph. | None |
| starting_op_names | Tensors that are reachable from the operations with these names will be swapped for LMS. | None |
| excl_scopes | A set of scopes. Output tensors of operations in the scopes will not be swapped out to CPU memory. | empty |
| incl_scopes | A set of scopes. Output tensors of operations in the scopes will be swapped out to CPU memory. | empty |
| excl_types | A set of types. Output tensors of operations with these types will not be swapped out to CPU memory. | empty |
| incl_types | A set of types. Output tensors of operations with these types will be swapped out to CPU memory. | empty |
| n_tensors | The number of tensors for LMS, counting from the starting_scope. | -1 (all tensors) |
| lb | Lower-bound value for LMS. | 1 |
| ub | Upper-bound value for LMS. | 10000 |
| ctrld_strategy | Strategy to find control dependency operations for swap-in operations: chain_rule or direct_order. | chain_rule |
| fuse_swapins | Fuse "close" swap-in operations into one operation. | False |
| swap_branches | If True, LMS will swap tensors in branches in the forward phase. | False |
| branch_threshold | A threshold for swapping branches in the forward phase. | 0 |

Table 2. Parameters in TFLMS

We developed a TensorFlow module, named TFLMS, based on our proposed approach. The module allows users to quickly turn their large model into one that can be trained with limited GPU memory. In TensorFlow, users first define a neural network model. TensorFlow then automatically generates a computational graph from the model. Finally, users define a TensorFlow session to execute operations in the computational graph. Once a session is invoked, users cannot modify the computational graph. Hence, we implement TFLMS to statically modify the graph before a session starts.

Figure 5 shows how TFLMS is positioned in TensorFlow. TFLMS takes a computational graph and automatically modifies it using the transformation rules presented in Section 4. TFLMS uses APIs in the "graph editor" module (https://www.tensorflow.org/api_guides/python/contrib.graph_editor) in TensorFlow to modify the graph. The modified graph is then executed by a TensorFlow session as normal. TFLMS's source code is publicly available as a pull request in the TensorFlow repository (https://github.com/tensorflow/tensorflow/pull/19845).

    import tensorflow as tf

    # Define a scope for the optimizer/solver.
    # (cross_entropy, mnist, x, and y_ are assumed to be defined by the model code.)
    with tf.name_scope('adam_optimizer'):
        opt = tf.train.AdamOptimizer(1e-4)
        train_step = opt.minimize(cross_entropy)

    # Define an LMS instance and run it to rewrite the graph.
    from tensorflow.contrib.lms import LMS
    lms_obj = LMS({'adam_optimizer'})
    lms_obj.run(graph=tf.get_default_graph())

    # Train as usual; the session executes the rewritten graph.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        batch = mnist.train.next_batch(50)
        train_step.run(feed_dict={x: batch[0],
                                  y_: batch[1]})
Listing 1: Sample Python code to use TFLMS in TensorFlow.

Listing 1 shows a brief example of using TFLMS in TensorFlow. While defining a neural network, users must define a scope for their optimizer (the tf.name_scope block). Users then define an LMS instance for that scope and run the instance to modify the computational graph of the neural network (the LMS constructor and the call to run). After that, users create a TensorFlow session and train the network as usual.

5.1. Implementation

The important part of TFLMS is building a topological ordering. Given a graph, TFLMS uses the Python package "toposort" (https://pypi.org/project/toposort/) to build a topological order. The topological ordering is used to decide which tensors are swapped out and when they are swapped in, as shown in Section 4. To rewrite edges, TFLMS traverses the graph using the breadth-first search algorithm, starting from input variables. We do not rewrite incoming and outgoing edges of variables. In other words, learnable parameters are kept in GPU memory. Apart from the computational graph itself, TFLMS allows users to pass other parameters to flexibly control how the graph is modified. Table 2 lists the parameters in TFLMS.

By default, TFLMS always rewrites edges between a forward operation and a backward operation. To determine operations in the backward phase, users should pass the scope (in TensorFlow, a scope defines a name for a set of operations, similar to a folder in a file system) of the solvers or optimizers that are used to train the model, via the TFLMS parameter optimizer_scopes. Note that it is possible to automatically rewrite the whole graph without optimizer_scopes. Using optimizer_scopes reduces unnecessary operations that are not helpful for large model support, e.g. operations in the update phase.

If a model has many branches in the forward phase, users may want to use parameters swap_branches and branch_threshold to enable rewriting edges satisfying . branch_threshold is the threshold defined in Section 4.1. Swapping tensors in the forward phase may affect inference performance because it introduces the overhead of swapping the tensors out and in. However, if the neural network is still too large for inference, swapping those tensors is necessary. Without swap_branches enabled, our modification has no effect on inference performance because the swap-out and swap-in operations added between the forward and backward phases are not executed during inference.

Inclusion or exclusion of an operation can be done via the operation’s type or scope. Users can define a starting point for the breadth-first search by using the scope or name of operations via parameters starting_scope and starting_op_names. By default, TFLMS rewrites all reachable edges. However, users can define the number of tensors that are swapped via parameter n_tensors. Parameters lb and ub are the lower bound and upper bound, respectively, as defined in Section 4.3. A strategy for choosing control operations is set by parameter ctrld_strategy. Parameter fuse_swapins enables the optimization of fusing swap-in operations.
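As a rough illustration of how an optimizer scope can separate backward operations from forward ones, the following sketch classifies operations by name prefix and selects the edges that cross from the forward phase into the backward phase. It is a simplification under stated assumptions: the op names are invented, and the helpers in_scope and forward_to_backward_edges are hypothetical, not part of TFLMS or TensorFlow.

```python
def in_scope(op_name, scopes):
    """True if op_name is one of the scopes or lives under one of them."""
    return any(op_name == s or op_name.startswith(s + "/") for s in scopes)


def forward_to_backward_edges(edges, optimizer_scopes):
    """Keep edges whose source is a forward op and whose sink is a backward op."""
    return [
        (src, dst)
        for src, dst in edges
        if not in_scope(src, optimizer_scopes) and in_scope(dst, optimizer_scopes)
    ]


# Hypothetical op names; gradient ops are assumed to be created under the
# optimizer scope because minimize() was called inside it.
edges = [
    ("conv1/Conv2D", "conv1/Relu"),
    ("conv1/Relu", "adam_optimizer/gradients/conv1/Relu_grad"),
    ("adam_optimizer/gradients/conv1/Relu_grad",
     "adam_optimizer/gradients/conv1/Conv2D_grad"),
    ("conv1/Conv2D", "adam_optimizer/gradients/conv1/Conv2D_grad"),
]

candidates = forward_to_backward_edges(edges, {"adam_optimizer"})
for src, dst in candidates:
    print("swap candidate:", src, "->", dst)
```

Matching on the scope name followed by a slash, rather than on a bare prefix, avoids treating an unrelated scope with a longer name as part of the optimizer.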

5.2. Performance tuning

To get the maximum performance when using TFLMS, we need to find the combination of tuning parameters that provides the fastest training time for the model. The goal of performance tuning is to swap out enough tensors to allow training to run without out-of-memory errors, while not swapping so many that the extra communication overhead degrades performance.

The two tuning parameters we should focus on are n_tensors and lb. Since n_tensors controls the number of tensors that will be swapped, the higher it is set, the lower the peak GPU memory usage will be. The parameter lb controls how soon a tensor is swapped back in before use. A low value of lb can make training on the GPU pause and wait while the swap-in finishes, which degrades performance. A higher value of lb allows the swap-in to finish before the tensor is needed and allows training to run without pausing. The downside of swapping in too early is that more tensors will be in GPU memory at any point in time, resulting in higher peak GPU memory usage.
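The memory side of this tradeoff can be made concrete with a toy model. The sketch below is not a model of TFLMS itself and rests on several assumptions: a linear chain of layers, one unit-sized tensor per forward step, backward steps that consume tensors in reverse order, swap-out immediately after production, and lb interpreted as the number of steps before use at which the swap-in starts. The function peak_resident is hypothetical.

```python
def peak_resident(n_layers, n_swapped, lb):
    """Peak number of tensors resident on the GPU in a toy linear network.

    Forward step i (0 <= i < n_layers) produces tensor i. Backward steps
    follow, and tensor i is consumed at step 2 * n_layers - 1 - i.
    The first n_swapped tensors are swapped out right after production
    and swapped back in lb steps before they are consumed.
    """
    total_steps = 2 * n_layers
    resident = [0] * total_steps
    for i in range(n_layers):
        use = 2 * n_layers - 1 - i
        if i < n_swapped:
            # Resident when produced, then again from swap-in until use.
            intervals = [(i, i), (max(i + 1, use - lb), use)]
        else:
            # Never swapped: resident from production until use.
            intervals = [(i, use)]
        for start, end in intervals:
            for t in range(start, end + 1):
                resident[t] += 1
    return max(resident)


# Compare no swapping with swapping all tensors at increasing lb.
n = 4
print("no swapping:", peak_resident(n, 0, 1))
for lb in (1, 2, 3):
    print("swap all, lb =", lb, ":", peak_resident(n, n, lb))
```

In this model, swapping more tensors lowers the peak and raising lb raises it again, matching the qualitative behavior described above. The model says nothing about throughput, which depends on how well the transfers overlap with computation.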

Tuning thus becomes finding the balance between n_tensors and lb that provides the best performance for a given model. To start performance tuning, it is suggested that n_tensors be set to -1, which will swap all reachable tensors, e.g., tensors. The lb parameter should be set to the default of 1, which is the latest possible swap-in. It is useful to run with and then adjust it downward. If the model has branches similar to the 3DUNet model, it is likely useful to set swap_branches to True and tune the branch threshold.

6. Experiments

6.1. Experimental environment

Experiments were run on an IBM POWER8 NUMA-based machine (IBM, 2016) using one GPU. The machine has two 4GHz 10-core POWER8 processors, eight simultaneous multi-threads (SMTs) per core and 256 GB RAM per processor. There are four NVIDIA Tesla P100 GPUs (each with 16 GB memory). NVLinks are used for connections among GPUs and CPUs: one 80 GB/s duplex link between GPUs 0 and 1, one 80 GB/s duplex link between GPUs 2 and 3, two 80 GB/s duplex links from CPU 0 to GPUs 0 and 1, and two 80 GB/s duplex links from CPU 1 to GPUs 2 and 3. On the machine, we installed TensorFlow 1.8, CUDA Toolkit v9.0 and cuDNN 7.0.5.

We evaluated TFLMS using two popular neural networks: ResNet-50 for image recognition and 3DUNet for image segmentation. To make a model larger, we increased the batch size of each iteration. By default, we always fuse swap-out operations.

6.2. Maximum batch size

Model Image Without TFLMS With TFLMS Ratio
ResNet-50
3DUnet
3DUnet OOM
Table 3. Maximum batch size when swapping all reachable tensors. OOM stands for out-of-memory.

Table 3 shows the maximum batch size we were able to train using TFLMS. We let TFLMS swap all reachable tensors to reduce GPU memory consumption as much as possible. In total, TFLMS swapped all of tensors for ResNet-50, all of tensors for 3DUNet with images and tensors for 3DUNet with images (the 3DUnet architecture changes according to the image size). With TFLMS we were able to train ResNet-50 and 3DUNet with and times larger batch size, respectively. For 3DUnet, we were able to train on whole images of without resizing or splitting the images, which was impossible without TFLMS.

6.3. Training performance

Figure 6. Effectiveness of n_tensors and lb on training performance of ResNet-50. TFLMS(x, y) means running TFLMS with n_tensors=x and lb=y. “all” means swapping all tensors, in this case n_tensors=317.
Figure 7. Effectiveness of ctrld_strategy on training performance of ResNet-50. TFLMS(x, y, z) means running TFLMS with n_tensors=x, lb=y, ctrld_strategy=z. “all” means swapping all tensors, in this case n_tensors=317.
Figure 8. Effectiveness of fuse_swapins on training performance of ResNet-50. TFLMS(x, y, z) means running TFLMS with n_tensors=x, lb=y, fuse_swapins=z. “all” means swapping all tensors, in this case n_tensors=317.
Figure 9. Effectiveness of n_tensors, lb, swap_branches on training performance of 3DUnet. TFLMS(w, x, y, z) means running TFLMS with n_tensors=w, lb=x, swap_branches=y, branch_threshold=z. “all” means swapping all tensors, in this case n_tensors=. Input images are of size of .

Figure 6 shows the effect of parameters n_tensors and lb on the training performance of ResNet-50. We measured the number of images per second (images/sec) for each batch size. Without TFLMS, the maximum batch size we were able to train was . Performance for smaller batch sizes was poor because GPU utilization was low. With TFLMS, when we first swapped out all reachable tensors, i.e. tensors, and set lb to to swap in each tensor as late as possible, the maximum batch size we were able to train was , times larger than the one without TFLMS. However, performance was not good. We then increased lb from to to swap in tensors earlier so that there was more overlap between computation and communication. The higher lb was, the better the training performance, but the maximum batch size decreased because more tensors resided in GPU memory at a time. Similarly, we decreased the number of tensors being swapped out, from (all) to or , and also obtained better performance. n_tensors had a larger effect than lb on training performance, and lb had a smaller effect than n_tensors on the maximum batch size. Hence, there is a tradeoff between n_tensors and lb.

Figure 8 shows the effect of fusing swap-in operations. In both cases, we swapped out tensors in total, but the numbers of swapping operations added to the graph with fuse_swapins enabled and disabled were and , respectively. Fusing swap-in operations led to better performance but a smaller maximum batch size. This is because some tensors were kept in GPU memory for reuse, as mentioned in Section 4.2.3.

Figure 7 shows a comparison between the two strategies “chain_rule” and “direct_order” for finding control dependency operations. Though the strategy “direct_order” is simpler than “chain_rule”, it sometimes had poorer performance for training ResNet-50. In particular, “direct_order” was much slower with batch sizes , and .

Figure 9 shows results for 3DUnet. The maximum batch size we were able to train with TFLMS was twice as large as that without TFLMS. The effect of parameters n_tensors and lb for 3DUnet is similar to that for ResNet-50. In particular, when we decreased n_tensors from (all tensors) to , we clearly saw better performance, but the maximum batch size decreased from to . We also measured the effect of swapping branches. When we enabled swapping branches with threshold , the number of added operations increased from to while the number of swapped tensors stayed the same. By swapping branches, we were able to train 3DUnet with a maximum batch size of instead of . We also tried to train 3DUnet with large images, i.e. images of size . Without TFLMS we got out-of-memory errors, whereas with TFLMS we were able to train 3DUnet at images/sec (batch size=, n_tensors= (all), lb=, swap_branches=True, branch_threshold=).

7. Conclusion

We have proposed a formal approach to deriving swap-out and swap-in operations for enabling large model support. We formally revised the concept of a computational graph and borrowed the theory of program transformations to derive new operations as well as to optimize the graph. Furthermore, we have proposed two strategies to statically find control dependency operations for triggering swap-in operations. The experimental results showed that our approach helped train very large models, i.e. and times larger for ResNet-50 and 3DUnet, respectively. Though our definition of a computational graph is inspired by TensorFlow, it is still general enough to be applied to other frameworks based on computational graphs. In the future, we plan to incorporate the re-computation technique by introducing new transformation rules. Finding good heuristics for choosing control dependency operations remains an open problem.

Acknowledgements.
The authors would like to thank Samuel D. Matzek from the IBM Systems PowerAI team for helping refactor our source code for the pull request. The authors would also like to thank Geert Janssen and Minsik Cho from IBM Research for their fruitful discussions on our approach to large model support.

References

  • Abadi et al. (2017) Martín Abadi, Michael Isard, and Derek G. Murray. 2017. A Computational Model for TensorFlow: An Introduction. In Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2017). ACM, New York, NY, USA, 1–7.
  • Chen et al. (2016) T. Chen, B. Xu, C. Zhang, and C. Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. ArXiv e-prints (April 2016). arXiv:1604.06174
  • Choi et al. (2018) Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. 2018. Universal Deep Neural Network Compression. CoRR abs/1802.02271 (2018). http://arxiv.org/abs/1802.02271
  • Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 2016. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. CoRR abs/1606.06650 (2016). http://arxiv.org/abs/1606.06650
  • Faraone et al. (2017) Julian Faraone, Nicholas J. Fraser, Giulio Gamberdella, Michaela Blott, and Philip Heng Wai Leong. 2017. Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks. CoRR abs/1709.06262 (2017). http://arxiv.org/abs/1709.06262
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. Springer International Publishing, 630–645.
  • IBM (2016) IBM. 2016. IBM Power System S822LC for High Performance Computing. http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In International Conference on Neural Information Processing Systems. 1097–1105.
  • Meng et al. (2017) Chen Meng, Minmin Sun, Jun Yang, Minghui Qiu, and Yang Gu. 2017. Training deeper models by GPU memory optimization on TensorFlow. In Proc. of ML Systems Workshop in NIPS.
  • Rhu et al. (2016) M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. ArXiv e-prints (Feb. 2016). arXiv:1602.08124
  • Sakharnykh (2017) Nikolay Sakharnykh. 2017. Unified memory on Pascal and Volta. (2017). http://on-demand.gputechconf.com/gtc/2017/presentation/s7285-nikolay-sakharnykh-unified-memory-on-pascal-and-volta.pdf GTC.
  • Shirahata et al. (2016) K. Shirahata, Y. Tomita, and A. Ike. 2016. Memory reduction method for deep neural network training. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). 1–6.
  • Wang et al. (2018) Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU Memory Management for Training Deep Neural Networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18). ACM, New York, NY, USA, 41–53.