Performance Analysis of Distributed Training of Flood-Filling Networks

Scaling Distributed Training of Flood-Filling Networks on HPC Infrastructure for Brain Mapping

Wushi Dong (ORCID 0000-0001-6614-1680), The University of Chicago; Murat Keceli, Argonne National Laboratory; Rafael Vescovi, Argonne National Laboratory; Hanyu Li, The University of Chicago; Corey Adams, Argonne National Laboratory; Elise Jennings, Argonne National Laboratory; Samuel Flender, Argonne National Laboratory; Thomas Uram, Argonne National Laboratory; Venkatram Vishwanath, Argonne National Laboratory; Nicola Ferrier, Argonne National Laboratory; Narayanan Kasthuri, The University of Chicago and Argonne National Laboratory; and Peter Littlewood, The University of Chicago and Argonne National Laboratory

Mapping all the neurons in the brain requires automatic reconstruction of entire cells from volume electron microscopy data. The flood-filling network (FFN) architecture can achieve leading performance on this task. However, training the network is computationally very expensive. To reduce the training time, we implemented synchronous, data-parallel distributed training using the Horovod framework on top of the published FFN code. We demonstrated the scaling of FFN training up to 1024 Intel Knights Landing (KNL) nodes at the Argonne Leadership Computing Facility. We investigated the training accuracy with different optimizers, learning rates, and optional warm-up periods. We discovered that square root scaling of the learning rate works best beyond 16 nodes, in contrast to smaller node counts, where linear learning rate scaling with warm-up performs best. Our distributed training reaches 95% accuracy in approximately 4.5 hours on 1024 KNL nodes using the Adam optimizer.

Deep Learning, Distributed Training, HPC, Synchronous SGD, Learning Rate Scaling, Connectomics
Journal year: 2019. Copyright: ACM licensed. Conference: SC ’19; November 17–22, 2019; Denver, CO.

1. Introduction

Understanding how brains function is one of the great intellectual challenges of the 21st century. Full descriptions of neural connections and cellular compositions will reveal fundamental principles of organization that cannot be inferred in any other way. Brain mapping will lead to an understanding of: (1) the mechanics of neural computation (reconstructions of the wiring diagrams for populations of characterized cells will make it possible to discover how directional networks of connections produce the signals observed by recordings such as fMRI), (2) adaptation and learning (using brain mappings from multiple specimens with and without a skill may allow us to detect cell types and structures that are rewired to create specific capacities), and (3) variation in computation across different brains and species.

Comparative approaches across species and phylogeny require imaging technologies that are capable of multi-scale brain mapping at the nanometer scale required for tracing neuronal connections, fast enough to image many samples from many species, and amenable to algorithmic tracing of brain structures over the resulting large datasets (e.g. petabytes; 1 cm of electron microscopy (EM) data requires 1,000,000,000 GB). To accomplish this, advances in computational methods for brain imaging are required to achieve the scalability and resolution needed for these studies. Several facilities have the ability to collect the petabytes of image data required to map small volumes of brains (mm) using EM. To extend to cm volumes, several two-dimensional EM images must be stitched together to form a slice (plane), and these slices must be aligned to form a 3D volume. Segmentation of this volume then extracts the structures (neurons, blood vessels, etc.). Computational methods for extracting structure (segmentation) lag behind data collection abilities even for the mm volumes, and computational analyses of the large directed graphs produced by the mapping must still be developed (Bullmore and Bassett, 2011). Brain data volumes of this size require new types of computational approaches and large-scale infrastructure.

Automatic segmentation of brain images (i.e. algorithmic identification of anatomical structures) over large brain volumes remains a critical but rate-limiting step for producing large and reliable brain maps (Lichtman et al., 2014). Many algorithms exist for segmenting neurons and their connections in EM datasets. Recent successes use deep learning approaches (Ciresan et al., 2012; Huang and Jain, 2013; Maitin-Shepard et al., 2016; Kasthuri et al., 2015; Nunez-Iglesias et al., 2013; Jain et al., 2011; Lee et al., 2017; Januszewski et al., 2016; Bui et al., 2017; Chen et al., 2017; Arganda-Carreras et al., 2015), in which examples of correct labeling provided by humans (Lee et al., 2017) are used to train a neural network. Conventional machine learning segmentation is usually performed in two stages. First, a convolutional neural network uses the intensities of the voxels at and near an image location to infer the likelihood that the location is a boundary. Then a separate algorithm clusters all non-boundary voxels into distinct segments based on the boundary map. Examples of such algorithms include watershed, connected components, and graph cut. The recently proposed flood-filling network (FFN) demonstrated an order of magnitude better performance than previous methods on EM data (CREMI Challenge; Januszewski et al., 2018). FFN is an iterative voxel-classification process. It merges the two separate steps of previous machine learning methods by adding to the boundary classifier a second input channel for the predicted object map (POM). This results in a recurrent model that can remember voxels in its field of view (FOV) that were already classified with high certainty in previous iterations. Such a one-step approach can automatically incorporate implicit shape priors into the primary voxel classification process and lead to better performance. However, the iterative nature of FFN also substantially increases computational costs.

Training large networks can take days, months, or years, depending on the methods, the network design, the size of the network and data, and how much parameter tuning is required. Hyper-parameter optimization greatly increases the required computational resources. This makes it necessary to use distributed training that can efficiently take advantage of large numbers of computing units.

In this work, we demonstrate how we scaled the distributed training of FFN up to 1024 KNL nodes on the Theta supercomputer at Argonne National Laboratory. We used data-parallel training with synchronous stochastic gradient descent (SGD) as implemented in the Horovod framework (Sergeev and Balso, 2018). Using the optimal learning rate scaling found in our empirical study, we reached a training accuracy of around 95% in approximately 4.5 hours, which is about half the time of the previously reported best result, obtained with an asynchronous approach.

2. Overview

2.1. FFN Algorithm

The FFN comprises a stack of 3D convolutional layers with a total of 472,353 trainable weights. Figure 1 shows the network architecture. The input module contains a ReLU nonlinearity sandwiched between two 3D convolutions. This is followed by eight residual modules, each consisting of a ReLU nonlinearity, a 3D convolution, a ReLU nonlinearity, and a 3D convolution. The last layer combines input from all feature maps and performs a voxel-wise convolution. The input and output of the network have equal spatial sizes. The input is a two-channel image, with channel 1 containing the imaging data and channel 2 the current state of the POM in logits. The output of the network is the updated POM. The ground-truth masks are binarized within every subvolume by setting voxels belonging to the same object as the central voxel of the subvolume to 0.95, and the rest of the voxels to 0.05. These soft labels provide the initial object mask probability map. The positions of the subvolumes are chosen away from potential membranes, as identified using an edge detector. FFN is implemented in TensorFlow and trained with a voxel-wise cross-entropy loss. For more details, please refer to (Januszewski et al., 2018).
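The soft-label binarization described above can be illustrated with a short sketch. This is not taken from the FFN code base; the function name `soft_label_mask`, the nested-list layout `labels[z][y][x]`, and the toy subvolume are assumptions made only for illustration.

```python
def soft_label_mask(labels, inside=0.95, outside=0.05):
    """Binarize a 3D label subvolume relative to its central voxel.

    labels is a nested list indexed as labels[z][y][x] holding integer
    object IDs (a hypothetical layout chosen for this sketch).
    """
    nz, ny, nx = len(labels), len(labels[0]), len(labels[0][0])
    center_id = labels[nz // 2][ny // 2][nx // 2]
    # Voxels of the central object get the high soft label, all others the low one.
    return [[[inside if voxel == center_id else outside for voxel in row]
             for row in plane]
            for plane in labels]


# Toy 3x3x3 subvolume: object 7 fills the middle plane, object 2 the rest.
subvolume = [
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
    [[7, 7, 7], [7, 7, 7], [7, 7, 7]],
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
]
mask = soft_label_mask(subvolume)
```

Here every voxel of the middle plane receives 0.95 because it shares the object ID of the central voxel, and every other voxel receives 0.05.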

Figure 1. Architecture of the flood-filling network (Januszewski et al., 2016).

2.2. Distributed Training

2.2.1. Data parallelism vs. model parallelism

There are two main strategies for parallelizing deep learning algorithms: model parallelism and data parallelism. Model parallelism means splitting the model across multiple machines when it is too big to fit into a single machine. For example, individual layers can each be placed in the memory of a different machine. Forward and backward propagation then involve communication between the machines in a serial fashion. Model parallelism is used only if the model size exceeds the capacity of a single machine. Data parallelism, on the other hand, means that the data is distributed across multiple machines. In our work, we choose data parallelism because it not only suits the case where the data becomes too large to be stored on a single machine, but also helps achieve faster training.

2.2.2. Synchronous vs. asynchronous SGD

Data parallelism has two paradigms for combining gradients: asynchronous and synchronous stochastic gradient descent. In asynchronous SGD, the parameters of the model are distributed over multiple servers called parameter servers. Multiple computing units called workers process data in parallel and communicate with the parameter servers. During training, each worker fetches the most up-to-date parameters of the model from the parameter servers. It then computes the gradients of the loss with respect to these parameters based on its local data. Finally, the worker sends these gradients back to the parameter servers, which update the model accordingly. The traditional TensorFlow framework uses the parameter server model with asynchronous SGD for distributed training (Dean et al., 2012). However, this method can cause problems in large-scale training. For example, when there is a large number of workers, model updates can hardly keep pace with the computation of stochastic gradients. The resulting gradients are called stale gradients, since they are obtained with outdated parameters. At larger scales, more workers add to the number of updates that occur between corresponding read and update operations, making the problem of stale gradients even worse (Chen et al., 2016; You et al., 2018). As suggested by (Goyal et al., 2017), data-parallel synchronous SGD works better for large-scale distributed training. The idea of synchronous SGD is more straightforward. All the workers average their gradients after each training step and then update their weights using the same averaged gradient. This ensures that each update uses the stochastic gradients computed from the latest batch of data, with an effective batch size equal to the sum of the mini-batch sizes of all the workers. For these reasons, we choose synchronous SGD for implementing our distributed training of FFN.
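A minimal sketch of one synchronous SGD step may help make the averaging explicit. The workers are simulated in a single process, and the function name, gradients, and learning rate are hypothetical values chosen for illustration, not settings from our experiments.

```python
def synchronous_sgd_step(weights, worker_grads, lr):
    """Average the per-worker gradients, then apply one identical update."""
    n_workers = len(worker_grads)
    avg_grad = [sum(grad[i] for grad in worker_grads) / n_workers
                for i in range(len(weights))]
    # Every worker applies the same averaged gradient, so all replicas stay in sync.
    return [w - lr * g for w, g in zip(weights, avg_grad)]


weights = [1.0, -2.0]
# One gradient vector per simulated worker, each from its own mini-batch.
worker_grads = [[0.5, 1.0], [1.5, -1.0], [1.0, 0.0], [1.0, 4.0]]
new_weights = synchronous_sgd_step(weights, worker_grads, lr=0.1)
```

With four workers each processing one mini-batch, the update is equivalent to a single step on a batch four times as large, which is why the effective batch size grows with the number of workers.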

2.2.3. Optimizers

Vanilla SGD works by first computing the gradient of the loss for each mini-batch with respect to the model parameters. It then updates the model parameters in the direction of the negative gradient by a step whose size is set by the learning rate. There are several variants of the gradient descent algorithm. They all try to make use of the potentially valuable information contained in the gradients from previous time steps by adding different features, including momentum, adaptive learning rates, and conjugate gradients. Adam (Kingma and Ba, 2014), a first-order method based on adaptive estimates of the first and second moments of the gradient, has been shown to compare favorably with other stochastic optimization algorithms. Therefore, we choose the Adam optimizer for our training.
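For reference, the two update rules can be sketched as follows. The Adam step follows the published algorithm of Kingma and Ba with its default hyperparameters; the function names and the toy inputs are illustrative assumptions, and this is not the TensorFlow implementation used in our training.

```python
import math


def sgd_step(w, g, lr):
    """Vanilla SGD: step against the gradient, scaled by the learning rate."""
    return [wi - lr * gi for wi, gi in zip(w, g)]


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


w0 = [1.0, 1.0]
grad = [10.0, -0.01]
w_sgd = sgd_step(w0, grad, lr=0.001)
w_adam, m1, v1 = adam_step(w0, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
```

On the first step Adam moves each parameter by roughly the learning rate regardless of the gradient magnitude, whereas the SGD step is proportional to the gradient.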

3. Innovations, Contributions and Related Work

3.1. Synchronous training using ring-allreduce

We implemented data-parallel synchronous SGD and integrated it into the distributed training of FFN using the Horovod framework (Sergeev and Balso, 2018). Horovod uses a ring-allreduce algorithm (Patarasuk and Yuan, 2009) and the Message Passing Interface (MPI) to average gradients and communicate them to all nodes without the need for a parameter server. This algorithm can minimize idle time given a large enough buffer. In our experiments, we observed nearly ideal scaling performance with 1024 KNL nodes on Theta, which uses the Aries interconnect in a Dragonfly configuration. We are also able to scale our training to a larger number of nodes than is possible with parameter servers while still maintaining good training efficiency, as shown in Section 4.5.2.
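The ring-allreduce pattern can be illustrated with a single-process simulation. This is a sketch of the general algorithm (a scatter-reduce phase followed by an allgather phase), not Horovod's implementation; the function name and the toy vectors are assumptions for illustration.

```python
def ring_allreduce(worker_data):
    """Simulate a ring-allreduce (sum) over equally sized vectors, one per worker."""
    n = len(worker_data)
    length = len(worker_data[0])
    # Split the index range into n contiguous chunks (sizes differ by at most one).
    bounds = [(k * length) // n for k in range(n + 1)]
    chunks = [range(bounds[k], bounds[k + 1]) for k in range(n)]
    data = [list(vec) for vec in worker_data]

    # Phase 1, scatter-reduce: each rank sends one chunk to its right neighbor,
    # which adds it to its own copy. After n - 1 steps, rank r holds the full
    # sum of chunk (r + 1) % n.
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            c = (rank - step) % n
            messages.append((c, [data[rank][i] for i in chunks[c]]))
        for rank, (c, payload) in enumerate(messages):
            dest = (rank + 1) % n
            for i, value in zip(chunks[c], payload):
                data[dest][i] += value

    # Phase 2, allgather: the fully reduced chunks travel around the ring and
    # overwrite the partial sums held by the other ranks.
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            messages.append((c, [data[rank][i] for i in chunks[c]]))
        for rank, (c, payload) in enumerate(messages):
            dest = (rank + 1) % n
            for i, value in zip(chunks[c], payload):
                data[dest][i] = value
    return data


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
    [1000, 2000, 3000, 4000, 5000, 6000],
]
summed = ring_allreduce(grads)
averaged = [[x / len(grads) for x in vec] for vec in summed]
```

Each rank sends and receives only one chunk per step, so the data volume per rank stays roughly constant as the ring grows, which is the property that removes the parameter-server bottleneck.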

3.2. Learning rate scaling for large batch size

A major challenge for deep neural networks is to optimize the hyperparameters as the network training is scaled out. A large batch size is critical for training deep neural networks at large scale because it can significantly reduce training time through the high data throughput enabled by large numbers of computing units. However, the learning rate has to be adjusted in accordance with the increased batch size. This proves to be very tricky in practice and can often compromise model accuracy, as shown by (Krizhevsky, 2014; Li et al., 2014; Keskar et al., 2016). Krizhevsky (2014) reports that what worked best in experiments is a linear scaling policy, i.e. multiplying the base learning rate by the factor by which the batch size is increased. However, the author also claims that theory suggests a square root scaling policy, i.e. multiplying the base learning rate by the square root of that factor, without further explanation or comparison between the two scaling policies (Krizhevsky, 2014). Goyal et al. show that linear learning rate scaling with a warm-up scheme can lead to no loss of accuracy when training with large batch sizes (Goyal et al., 2017). The theoretical explanation (Smith et al., 2017) is that linear scaling of the learning rate keeps the gradient noise at an optimal level, while the warm-up scheme helps prevent divergence at the beginning of training. However, Goyal et al. also report that accuracy degrades rapidly beyond a certain batch size. In addition, there are many subtle pitfalls associated with applying this policy, making it difficult to use in practice. Hoffer et al. recommend the square root scaling policy and provide both theoretical and experimental support (Hoffer et al., 2017). They demonstrate that square root scaling keeps the covariance matrix of the parameter update step in the same range for any batch size. They also find that square root scaling works better than linear scaling on the CIFAR10 dataset. You et al. further confirm that linear scaling does not perform well on the ImageNet dataset and suggest using their Layer-wise Adaptive Rate Scaling instead (You et al., 2017). Most of the works cited above focus on image recognition tasks, e.g. ImageNet, so it is not clear whether their conclusions apply to other fields, including our 3D volume segmentation. More experiments are also needed to study the effects of different optimizers. Large-batch training therefore remains an open research question.
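The scaling policies discussed above can be written down compactly. In this sketch `k` denotes the factor by which the batch size is increased; the base learning rate of 0.001, the warm-up length, and the function names are hypothetical and are not the values tuned in our experiments.

```python
import math


def linear_scaling(base_lr, k):
    """Linear policy: multiply the base learning rate by the batch-size factor k."""
    return base_lr * k


def sqrt_scaling(base_lr, k):
    """Square root policy: multiply the base learning rate by sqrt(k)."""
    return base_lr * math.sqrt(k)


def warmup_lr(base_lr, target_lr, step, warmup_steps):
    """Gradual warm-up: ramp linearly from base_lr to target_lr, then hold."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


base_lr = 0.001  # hypothetical single-node learning rate
lr_linear_16 = linear_scaling(base_lr, 16)
lr_sqrt_1024 = sqrt_scaling(base_lr, 1024)
lr_mid_warmup = warmup_lr(base_lr, lr_linear_16, step=500, warmup_steps=1000)
```

The two policies diverge quickly: for a 1024-fold larger batch, linear scaling multiplies the learning rate by 1024, while square root scaling multiplies it by only 32.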

For our distributed training, we first tuned the learning rate for a single node. We then applied linear learning rate scaling with an optional warm-up scheme, but found that this scaling no longer held beyond an effective batch size of 16. We further tried different combinations of learning rate and number of nodes beyond 16, and found that the optimal learning rate gradually shifts to square root scaling. We then confirmed this trend by tuning the learning rate for larger numbers of nodes. With the optimal learning rate found for 1024 nodes, we achieved a training accuracy of 95% in approximately 4.5 hours, which significantly cuts the time cost compared to an asynchronous approach. Our recent tests indicate that this result could be further improved by employing a more recent version of TensorFlow. We did not use learning rate scheduling in our study because we chose to focus on learning rate scaling. Learning rate scheduling techniques, e.g. cyclic training, could further reduce the training time. The ability to perform these trainings in hours rather than days is a key enabler for further exploration of the hyperparameter and model architecture space.

3.3. Parallel data input pipeline

For data input, we use data sharding as implemented in TensorFlow to distribute training data across all workers. We split data equally among workers. Each worker computes gradients on its own shard of the data. The gradients are combined to update the model parameters by using synchronous SGD. This method would become more favorable in the future when we work with large datasets that cannot fit into the memory of one computing unit.
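As an illustration of equal sharding, the following sketch assigns examples to workers in round-robin order, which mirrors the semantics of TensorFlow's `tf.data.Dataset.shard(num_shards, index)`. Whether our pipeline uses that exact call is not stated here; the function name `shard` and the toy list of example IDs are assumptions for illustration.

```python
def shard(examples, num_shards, index):
    """Return the examples whose position modulo num_shards equals index."""
    return examples[index::num_shards]


examples = list(range(10))  # stand-ins for training subvolume coordinates
num_workers = 4
shards = [shard(examples, num_workers, rank) for rank in range(num_workers)]
```

The shards are disjoint and together cover the whole dataset, so each training example is assigned to exactly one worker.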

4. Experiments and results

4.1. Dataset and ground truth

In our experiments, we use the FIB-25 fly brain dataset acquired by electron microscopy (EM), as reported in (Takemura et al., 2015). We use two volumes of this data, one () for training and one () for inference. Each has an isotropic physical voxel size of 10 nm. We use this dataset as an example to demonstrate our training pipeline, which can be conveniently extended to train on much larger datasets. Figure 2 shows a corresponding pair of raw data and label data.

Figure 2. FIB-25 dataset. Left: Raw EM data; Right: Human-annotated ground truth labels.

4.2. Testing environment

We use the datascience modules tensorflow-1.12 and horovod-0.15.2 as available on the Theta supercomputer at the Argonne Leadership Computing Facility. These modules are based on Intel Python 3.5 version 2017.0.035. TensorFlow is compiled from source with the bazel-0.16.0 build system using gcc 7.3.0 and linked with MKL-DNN optimized for the KNL architecture. The following flags were used to optimize TensorFlow for the KNL architecture: --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mavx512f --copt=-mavx512pf --copt=-mavx512cd --copt=-mavx512er --copt='-mtune=knl' --copt="-DEIGEN_USE_VML". We are currently investigating the performance of FFN with a more recent version of TensorFlow, 1.13.1.

4.3. Evaluation metrics

We track the following metrics during our training: accuracy, precision, recall, and f1 score. Their calculations are based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN).


The F1 score is often preferred as a single summary metric because it combines contributions from both precision and recall.
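For reference, the standard definitions of these four metrics in terms of TP, FP, TN, and FN can be sketched as below. We assume the standard definitions apply here; the function name and the voxel counts in the example are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical voxel counts for one evaluation batch.
metrics = classification_metrics(tp=80, fp=20, tn=890, fn=10)
```

In this toy example the accuracy is 0.97 while the F1 score is about 0.84, which illustrates how accuracy can look high when negative voxels dominate.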

4.4. Single-node results

Single-node performance of TensorFlow on the KNL architecture has been analyzed by Intel and Google engineers, who provide a set of recommendations on a web page. We followed their recommendations and investigated the influence of different settings on single-node performance. We found that the most important parameter for improving throughput is the batch size. While larger batch sizes enhance the throughput significantly, we found that the training accuracy does not necessarily improve in our single-node experiments. Considering the poorer generalization associated with large batches (Keskar et al., 2016) and the limited amount of data available to each worker under sharding, we chose the smallest possible batch size of 1 for the distributed training calculations. Other important settings for a single-node calculation are related to thread performance. The most important one is the number of OpenMP threads. We found that 1 or 2 threads per core give similar performance, while 4 threads per core degrade performance significantly. We also use -cc depth to let Cray ALPS control thread affinity. Theta nodes have KNL 7230 SKU CPUs with 64 cores, and we use 64 threads for single-node calculations. For the inter-op and intra-op parameters of TensorFlow, we found that 2 and 64, respectively, give the best results. KMP_BLOCKTIME is another important environment variable; it sets the time in milliseconds that a thread should wait after completing a parallel region before sleeping. We found that setting this variable to 0 gives the best performance, while 'infinite' gives the worst results, by more than a factor of 10 for batch size 1. For larger batch sizes, the influence of this variable decreases, although 0 still gives the best results.
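The thread settings reported above can be collected in a small helper. This is a sketch rather than our launch script: the function name `knl_thread_settings` and its structure are hypothetical, while `OMP_NUM_THREADS`, `KMP_BLOCKTIME`, and the two TensorFlow 1.x session options are existing settings.

```python
import os


def knl_thread_settings(omp_threads=64, intra_op=64, inter_op=2, blocktime_ms=0):
    """Collect the single-node settings reported in the text (hypothetical helper)."""
    env = {
        "OMP_NUM_THREADS": str(omp_threads),  # one thread per KNL core
        "KMP_BLOCKTIME": str(blocktime_ms),   # 0 ms wait after a parallel region
    }
    session = {
        "intra_op_parallelism_threads": intra_op,
        "inter_op_parallelism_threads": inter_op,
    }
    return env, session


env, session = knl_thread_settings()
# Environment variables should be set before TensorFlow is imported.
os.environ.update(env)
# With TensorFlow 1.x, the session options could then be passed as
# tf.ConfigProto(**session) when creating the session.
```

Keeping the values in one place makes it easier to vary them systematically when repeating the single-node study on a different architecture.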

4.5. Multi-node Results

4.5.1. Scaling experiments

We perform strong scaling experiments on up to 1024 nodes of Theta. The results are shown in Figure 3. The decrease in efficiency can be attributed to two factors. First, a large number of nodes naturally increases the upper limit of the time it takes for all the workers to finish one step. Although this inefficiency is unavoidable in synchronous SGD training, one way to mitigate it could be the use of backup workers, as suggested by Chen et al. (2016). Second, more nodes bring an increased amount of data exchange and add to the network communication time. The Aries interconnect used by Theta can largely reduce this cost. As a result, the training achieves a parallel efficiency of about 71% on 1024 nodes, yielding a sustained throughput of about 523 FOVs/sec.
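The relation between throughput and parallel efficiency can be made explicit with a short sketch. We assume here that efficiency is defined as the measured throughput divided by the number of nodes times the single-node throughput; the single-node value computed below is derived from the two reported numbers under that assumption and is not a separately reported measurement.

```python
def parallel_efficiency(throughput_n, n_nodes, throughput_1):
    """Measured throughput relative to ideal (linear) scaling from one node."""
    return throughput_n / (n_nodes * throughput_1)


def implied_single_node_throughput(throughput_n, n_nodes, efficiency):
    """Single-node throughput implied by a reported efficiency (derived value)."""
    return throughput_n / (n_nodes * efficiency)


# Reported: about 523 FOVs/sec at about 71% efficiency on 1024 nodes.
single = implied_single_node_throughput(523.0, 1024, 0.71)
check = parallel_efficiency(523.0, 1024, single)
```

Under this definition the reported figures imply a single-node throughput of roughly 0.72 FOVs/sec.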

Figure 3. Strong scaling results in terms of FOVs/sec on Theta. The dashed line represents ideal scaling and markers show the performance of running our distributed training code.

4.5.2. Distributed training efficiency

To determine the optimal learning rate for our training, we compare the accuracy for every combination of number of nodes and learning rate. We plot the normalized accuracies as scatter points in Figure 4 and compare them with both the linear and the square root scaling policies. We first apply a smoothing factor of 0.9, as implemented in TensorBoard, to remove the large noise in the training curves. We then measure the smoothed accuracy at step 10K for all combinations of number of nodes and learning rate. To compare the optimal learning rate across different numbers of nodes, we normalize each accuracy by dividing it by the maximum accuracy reached with the same number of nodes. For the following study we focus on the accuracy metric, but the conclusions should apply to all metrics discussed in this paper, since they are highly correlated. We find that the optimal learning rate follows a linear scaling policy for smaller numbers of nodes and gradually shifts to a square root scaling policy as we further increase the number of nodes and, equivalently, the effective batch size. This divergence from linear scaling is consistent with the observation of degraded accuracy when using a linear policy beyond a certain batch size (Goyal et al., 2017).
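The smoothing and normalization steps can be sketched as follows. The smoothing is a plain exponential moving average with weight 0.9, which approximates TensorBoard's smoothing (its exact implementation may differ, for example through debiasing); the function names and the accuracy values are hypothetical.

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average over a non-empty training curve."""
    smoothed = []
    last = values[0]  # initialize with the first point
    for value in values:
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


def normalize_by_node_count(accuracy):
    """Divide each accuracy by the best accuracy reached with the same node count.

    accuracy maps (n_nodes, learning_rate) to a smoothed accuracy at a fixed step.
    """
    best = {}
    for (n_nodes, _), acc in accuracy.items():
        best[n_nodes] = max(best.get(n_nodes, 0.0), acc)
    return {key: acc / best[key[0]] for key, acc in accuracy.items()}


curve = ema_smooth([0.0, 1.0, 1.0])
# Hypothetical smoothed accuracies at a fixed training step.
raw = {(16, 0.01): 0.8, (16, 0.02): 0.4, (64, 0.01): 0.5}
normalized = normalize_by_node_count(raw)
```

After normalization the best learning rate for each node count has a value of 1, so the optimum can be compared across node counts even when the absolute accuracies differ.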

Figure 4. Training accuracy for different combinations of learning rate and number of nodes. The blue line shows linear scaling and the red line shows square root scaling. Accuracy is normalized within each number of nodes.

To examine the efficiency of our distributed training, we compare runs that use the optimal learning rate for each number of nodes. We define the training efficiency in terms of the relative wall time needed for each run to reach a given F1 score. We choose this F1 score to be 0.15 for runs with 32 nodes or fewer, and 0.7 for runs with 32 nodes or more. The training efficiency of a single node is set to 1, and the 32-node run, which is measured against both F1 targets, links the two groups of results.
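One way to combine the two groups of measurements is to chain the relative wall times through the 32-node run. The sketch below assumes that the plotted quantity is a speedup relative to a single node, which is one possible reading of the definition above; the function name and all wall times are hypothetical.

```python
def chained_speedup(time_low, time_high, bridge=32):
    """Speedup relative to one node, chained through a shared bridge run.

    time_low maps n_nodes to the wall time needed to reach the lower F1 target
    (must include 1 and bridge); time_high maps n_nodes to the wall time needed
    to reach the higher F1 target (must include bridge).
    """
    speedup = {n: time_low[1] / t for n, t in time_low.items()}
    for n, t in time_high.items():
        # Scale by the bridge run so both groups share the single-node reference.
        speedup[n] = speedup[bridge] * time_high[bridge] / t
    return speedup


# Hypothetical wall times in minutes.
time_to_low_f1 = {1: 320.0, 8: 40.0, 32: 10.0}
time_to_high_f1 = {32: 100.0, 64: 50.0, 128: 40.0}
speedup = chained_speedup(time_to_low_f1, time_to_high_f1)
```

In this toy example the 128-node run is 2.5 times faster than the 32-node run on the higher target, so its chained speedup is 80 rather than the ideal 128.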

Figure 5. Strong scaling results in terms of training efficiency. Blue line shows ideal scaling and red shows our results.

4.5.3. Training results

We show the training curves of all previously mentioned metrics for 1024 nodes in Figure 6. We averaged all curves over 3 repeated runs to filter out step-to-step fluctuations, and show the range of one standard deviation using shaded regions. The accuracy reaches a value of around 95% in approximately 4.5 hours. This is less than half the time of the previously reported best result of 12 hours. Based on our recent tests, this time could be further reduced by a factor of 2 using more up-to-date versions of TensorFlow.

Figure 6. Training curves for 1024 KNL nodes: (a) Accuracy, (b) f1, (c) precision, (d) recall. All curves are averaged over 3 runs. The shaded regions show the range of 1 standard deviation.

4.6. Visualization of segmentations

We further show a 3D volumetric visualization of several reconstructed neurons in Figure 7. The results were visualized using Neuroglancer, a WebGL-based viewer for volumetric data developed by Google.

Figure 7. Volumetric visualization of the inference result.

5. Implications

To date, neuroscience has been limited by the volume of brain data available and thus by the number of neurons mapped. Data acquisition technologies can now achieve rates suitable for large-scale studies. For example, Zeiss Inc. manufactures a 91-beam scanning electron microscope that can image brain sections and could sustain data rates allowing routine collection of nanometer-scale data over large volumes of brains (and entire smaller brains). The FFN presented here will be part of a scalable computational pipeline that will analyze the data and produce large-scale brain maps. Whole-brain mapping efforts across and within species will enable complete brain studies at much larger scales, which in turn will open the door to more sophisticated quantitative biological characterizations and to an understanding of how the brain changes during development, aging, and disease. Our work extends open-source software tools including TensorFlow and Horovod. The workflow and proposed innovations should also be applicable to generic deep learning problems at scale.

6. Conclusions

We have implemented a data-parallel synchronous SGD approach for the distributed training of FFN, motivated by the important problem of fully mapping the neural connections in brains. We discovered that, for our problem, the optimal learning rate scaling shifts from a linear policy to a square root policy as we increase the effective batch size by employing more nodes. With the optimal learning rates we found, we scaled the FFN training up to 1024 Intel Knights Landing (KNL) nodes. The training reached an accuracy of about 95% in approximately 4.5 hours, which is about half the time of the previously reported best results. Our work is an important step towards a complete computational pipeline for producing large-scale brain maps.

This work was partially supported by funding through the Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan.

