User-transparent Distributed TensorFlow
Deep Learning (DL) algorithms have become the de facto choice for data analysis. Several DL implementations – primarily limited to a single compute node – such as Caffe, TensorFlow, Theano and Torch have become readily available. Distributed DL implementations capable of execution on large scale systems are becoming important to address the computational needs of large data produced by scientific simulations and experiments. Yet, the adoption of distributed DL implementations faces significant impediments: 1) most implementations require DL analysts to modify their code significantly – which is a show-stopper, 2) several distributed DL implementations are geared towards cloud computing systems – which is inadequate for execution on massively parallel systems such as supercomputers.
This work addresses each of these problems. We provide a distributed memory DL implementation by incorporating required changes in the TensorFlow runtime itself. This dramatically reduces the entry barrier for using a distributed TensorFlow implementation. We use Message Passing Interface (MPI) – which provides performance portability, especially since MPI specific changes are abstracted from users. Lastly – and arguably most importantly – we make our implementation available for broader use, under the umbrella of Machine Learning Toolkit for Extreme Scale (MaTEx) at http://hpc.pnl.gov/matex. We refer to our implementation as MaTEx-TensorFlow.
Machine Learning and Data Mining (MLDM) algorithms are becoming essential for analyzing the large volumes of data produced by simulations, experiments and mobile devices [1, 2]. MLDM algorithms are generally divided into supervised (the input data set is labeled with the ground truth) and unsupervised (learning from unlabeled data) algorithms. Base supervised/unsupervised algorithms may be combined using ensemble methods. Several software packages that support supervised, unsupervised and ensemble algorithms have become available, including Weka, scikit-learn, libsvm, and Matlab.
Deep Learning (DL) algorithms are a class of MLDM algorithms that emulate the computational structure of a mammalian brain by using several layers of neurons interconnected with synapses, and that learn the weights of the synapses using gradient descent. DL algorithms can be divided into several classes: Multi-Layer Perceptrons (MLPs, typically used on tabular data sets), Convolutional Neural Networks (CNNs, typically used on images and other spatially related data) and Recurrent Neural Networks (RNNs, typically used on sequential and time-series data). Many researchers have applied DL algorithms to problems in their domains, often reporting better results than the published state-of-the-art models. These domains include high energy physics, computational biology and cyber security [9, 10, 11, 12]. Naturally, open source toolkits such as Theano [13, 14], Torch and Caffe, which use cuDNN, have become widely available.
In November 2015, Google released TensorFlow, an open source toolkit for developing MLDM algorithms that is primarily suited to implementing DL algorithms. It uses a dataflow model by specifying operations on tensors (multi-dimensional arrays). TensorFlow supports automatic differentiation, which simplifies the design and implementation of gradient descent methods for novel structures. This allows TensorFlow to readily support MLPs, CNNs and RNNs on multi-core and many-core (GPU) systems, along with algorithmic improvements such as AdaGrad, Adam and Momentum gradient descent and neuron dropout for regularization.
Distributed TensorFlow (starting with version 0.8.0) has become available for execution on multiple nodes. Each compute node may host multiple GPUs. Google’s distributed TensorFlow is based on Google’s RPC (gRPC), which is primarily geared towards cloud computing systems interconnected using Ethernet. This is inadequate for supercomputers, which typically use interconnects such as InfiniBand, Intel Omni-Path and Cray interconnects to leverage high bandwidth and Remote Direct Memory Access (RDMA) features. A few efforts, such as gRPC using Message Passing Interface (MPI) [23, 24], have attempted to address this limitation. Besides its limited applicability to HPC interconnects, gRPC is primarily geared towards parameter server based DL implementations, which diverge from the convergence properties of sequential batch/stochastic gradient descent (SGD). Recently, Baidu announced the availability of MPI with TensorFlow by introducing a novel all-to-all reduction technique and user-operations which may be added to existing TensorFlow scripts. While optimized for performance, Baidu’s contributions require several MPI-related changes in existing TensorFlow scripts.
At the same time, the majority of DL analysts tend to write a sequential TensorFlow program. This leads to our problem statement: Can we design a TensorFlow runtime capable of execution on multiple nodes without requiring any TensorFlow specific changes to existing scripts?
Specifically, we make the following contributions in this paper:
- We consider several design choices for implementing distributed TensorFlow, such as defining new user-operations and methods to synchronize replicas (since we focus on data parallelism).
- We evaluate our implementation on two platforms: 1) an Intel multi-core system connected with InfiniBand, and 2) an NVIDIA multi-GPU system connected with InfiniBand.
- We observe that MaTEx-TensorFlow scales well on multiple compute nodes using the ImageNet LSVRC12 dataset and the AlexNet, GoogLeNet, InceptionV3 and ResNet-50 neural network topologies.

Our primary contribution is the ability to leverage multi-node CPU systems and multi-node GPU systems without modifying any source code specific to TensorFlow. We recommend using our data readers, which provide a simple interface for reading data available in multiple formats.
The rest of the paper is organized as follows: In section II, we present the background of our work. In section III, we present a solution space for designing MaTEx-TensorFlow. We present an in-depth performance evaluation in section IV, followed by related work in section V and conclusions in section VII.
Google released TensorFlow in November 2015 as a platform for building and developing DL implementations. TensorFlow is capable of utilizing multiple threads, such that multi-core systems can be utilized effectively. It also provides implementations that leverage GPUs (using the NVIDIA CUDA Deep Neural Network library, cuDNN), such that one or more GPUs on a single node may be utilized effectively.
II-A1 TensorFlow Graph
The fundamental model of computation within TensorFlow is a computational graph. A graph contains vertices, representing operations, and edges, representing tensors (arrays of arbitrary dimension). Each operation can take multiple inputs and generate multiple outputs, with tensors created and passed from one operation to another. Edges also act as control-flow objects in the computational graph, enforcing the dependencies that naturally arise in DL implementations.
There are several special types of tensors in TensorFlow. An important one is the variable. Variables are persistent tensors that can be accessed outside the computational graph. In DL implementations, the weights and biases of a model are stored as variables and updated by operations when a computational graph is executed. Another special type of tensor is the placeholder. Placeholders are input points into a computational graph. Outside of placeholders, the computational graph is self-contained.
In TensorFlow, a session controls the graph. It stores the values of variables and is used to run the computations described by the graph. After the creation of a session, an initializer must be run to give values to the variables to be used within the session. Subsequent computations, such as the computation of gradients, must be managed through the session to ensure that the correct values of variables are used. The session makes use of a scheduler, which maintains a record of which operations have been completed and enqueues those whose dependencies are all satisfied to be executed.
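The scheduling behaviour described above can be illustrated with a toy dataflow executor. This is only a sketch in plain Python, not TensorFlow's scheduler: the names `run_graph`, `ops`, `deps` and `feeds` are hypothetical, and the `feeds` dictionary plays the role of placeholders and variable values.

```python
from collections import deque


def run_graph(ops, deps, feeds):
    """Execute a tiny dataflow graph in dependency order.

    ops:   operation name -> function of its input values (hypothetical format)
    deps:  operation name -> list of input names (the edges of the graph)
    feeds: name -> value for inputs supplied from outside the graph
    """
    values = dict(feeds)
    # Number of inputs each operation is still waiting for.
    pending = {name: sum(1 for d in deps[name] if d not in values) for name in ops}
    # Reverse edges: which operations consume a given value.
    consumers = {}
    for name, inputs in deps.items():
        for d in inputs:
            consumers.setdefault(d, []).append(name)

    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        values[name] = ops[name](*(values[d] for d in deps[name]))
        order.append(name)
        # Completing an operation may satisfy the last dependency of its consumers.
        for consumer in consumers.get(name, []):
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    return values, order


# loss = (w * x + b - t) ** 2, expressed as three operations.
ops = {
    "mul": lambda w, x: w * x,
    "add": lambda m, b: m + b,
    "loss": lambda y, t: (y - t) ** 2,
}
deps = {"mul": ["w", "x"], "add": ["mul", "b"], "loss": ["add", "t"]}
values, order = run_graph(ops, deps, {"w": 2.0, "x": 3.0, "b": 1.0, "t": 5.0})
print(order, values["loss"])
```

An operation is enqueued only once the count of its unsatisfied inputs reaches zero, which is the record keeping the paragraph attributes to the scheduler.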
II-A4 Device Scheduling
In addition to its use by the session to keep track of which operations are ready to execute, the TensorFlow scheduler also handles device scheduling when multiple devices are available. Before executing a graph as desired by the user, the scheduler runs a simulation of the graph to determine execution time and the order of the operations. It then uses this information to create the dependency lists that the session requires and to assign each operation to a device. These assignments depend first on whether there is an implementation of the operation for a given device (for instance, GPU implementations are sometimes unavailable) and then on expected execution speed, taking into account inter-device communication times for the relevant tensors.
II-B Message Passing Interface
Message Passing Interface (MPI) [23, 24] provides a rich set of abstractions for inter-process communication. It supports pair-wise communication (such as send and receive) and group communication (such as reduction and barrier). MPI has become the de facto communication interface for legacy scientific applications. The primary reason for MPI’s success is its wide availability. MPI is available on large scale supercomputers and cloud computing systems, and it can also be used for inter-process communication on a single compute node if other shared memory programming models are not available. Unlike other runtimes such as Spark and gRPC, MPI is able to take advantage of high performance interconnects such as InfiniBand, Intel Omni-Path and Cray interconnects effectively. For these performance reasons, we chose MPI as the primary communication interface over other communication subsystems.
In MaTEx-TensorFlow, we use several MPI routines for our large scale implementation. We use all-to-all reduction (MPI_Allreduce, a primitive which applies an operation such as sum to each process’s data and disseminates the final result among all the processes in a group) for averaging gradients, and point-to-point operations for data distribution. MPI has been criticized for its lack of support for fault tolerance. However, with recent advancements such as User-level Fault Mitigation (ULFM) and open source implementations, it is possible to design fault tolerant DL algorithms using MPI that retain performance and support “continued execution” in the presence of hardware faults. We expect that as ULFM (or its variants) becomes available in mainstream implementations, MPI will find wide acceptance in the DL community.
III MaTEx-TensorFlow Design Space
In this section, we present a detailed description of the MaTEx-TensorFlow design space.
III-A Data Parallelism/Model Parallelism
An important design consideration is the type of parallelism to be used for MaTEx-TensorFlow. In model parallelism, the layers in a DNN are split across multiple devices (such as GPUs and/or multiple compute nodes). Model parallelism is potentially effective in scale-out, since the scheduling on multiple devices enables the use of small batch sizes.
However, DNNs increasingly contain deeper convolutional layers, where the size of the activations is much larger than the overall model. Under model parallelism, these activations would need to be communicated across devices, which is prohibitive. Hence, it is worthwhile to consider data parallelism, where the model is replicated and the data is split across multiple compute devices. Similar observations have been made by Krizhevsky et al. We therefore use data parallelism for implementing MaTEx-TensorFlow.
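The size argument can be checked with a short calculation. The sketch below uses a hypothetical helper, `conv_layer_sizes`, and the commonly cited shape of AlexNet's first convolutional layer (3 input channels, 96 filters of 11x11, a 55x55 output map); those shapes and the batch size of 256 are assumptions for illustration, not measurements from this paper.

```python
def conv_layer_sizes(in_channels, out_channels, kernel, out_height, out_width, batch):
    """Return (parameter count, activation count) for one convolutional layer."""
    # One kernel x kernel x in_channels filter plus one bias per output channel.
    params = out_channels * (in_channels * kernel * kernel + 1)
    # One output value per filter, per spatial position, per sample in the batch.
    activations = batch * out_channels * out_height * out_width
    return params, activations


params, activations = conv_layer_sizes(3, 96, 11, 55, 55, batch=256)
print(f"parameters:  {params:,}")
print(f"activations: {activations:,}")
print(f"activations per parameter: {activations / params:.0f}")
```

For this layer the parameters number roughly 35 thousand while the activations for one batch number roughly 74 million. Data parallelism exchanges values proportional to the former, whereas splitting the layer across devices would exchange values proportional to the latter.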
III-B Programming Models
We considered several programming models/interfaces for implementing MaTEx-TensorFlow. Specifically, we considered Spark, Hadoop, gRPC and MPI. MapReduce frameworks such as Spark and Hadoop abstract the details of parallelism effectively. However, they are not suitable for large scale systems which are typically connected using high performance interconnects.
Another possibility is to use Google’s Remote Procedure Call (gRPC). The initial implementation uses sockets interface, which is not suitable for HPC interconnects. Recent implementations of gRPC using Remote Direct Memory Access (RDMA) alleviate this limitation. However, the primary gRPC primitives do not include all-to-all reduction based collective operations – which is problematic for scaling out SGD. gRPC is specifically targeted for parameter-server (PS) based implementation of SGD. However, PS based implementations suffer from slow convergence and communication bottlenecks.
An alternative choice is to use Message Passing Interface (MPI). It provides a rich set of communication primitives including point-to-point, collective and other operations. MPI is also widely available on large scale systems including supercomputers, and cloud computing systems. For these reasons, we use MPI as the communication interface for implementing MaTEx-TensorFlow. MPI has frequently been criticized due to lack of fault tolerance. While MaTEx-TensorFlow is not fault tolerant, we plan to handle fault tolerance for MPI using ULFM – which allows the MPI application to continue executing in the presence of faults. By using data parallelism the critical data structures are automatically replicated for fault tolerance. This approach would allow MPI to address the limitations of Spark while maintaining many of its advantages. However, fault tolerant TensorFlow is beyond the scope of this paper.
III-C Existing Approaches for Distributed Memory
Up to now, we have identified MPI for implementing distributed memory DL and data parallelism for scaling out the algorithms. It is equally important to consider the level of abstraction which should be provided to the user. There are several design choices.
III-C1 MPI-enabled TensorFlow Scripts
One possibility is to use MPI within TensorFlow scripts – visible to the end-user. This approach requires no changes to the TensorFlow runtime, which makes it an attractive choice. In the previous version of MaTEx-TensorFlow, this approach was used. The upside of this approach is that a user who does not want to write TensorFlow code may use these scripts to build DNNs. However, in many cases, users tend to write their customized TensorFlow scripts. Hence, they would be required to add MPI specific changes in their code – which is problematic for these users.
III-C2 Class Packages
Another possibility is to create a module of helper functions and classes. These functions and classes may then be used by TensorFlow users. Recently, Baidu has proposed work along these lines. Baidu’s extensions are integrated into TensorFlow. However, the user must still make Baidu-specific changes to their TensorFlow scripts to make use of these extensions for distributed memory execution.
III-D Proposed Approach for Distributed Memory
We have observed that – due to pre-existing, complex scripts – the distributed memory implementations are inadequate for most DL analysts. Hence, it is important to consider implementations which would provide distributed memory DL while abstracting the changes from the users completely. That is the focus of MaTEx-TensorFlow. In this section, we provide implementation details along these lines.
For achieving this objective, we leverage TensorFlow operators. These operators can be user-defined and inserted in the computational graph. As shown in Figure 3, MaTEx-TensorFlow provides two new TensorFlow operators: a Global Broadcast for TensorFlow model variables and an MPI_Allreduce operator for the model results (gradients) for the training phase. Both operators enhance the TensorFlow framework to provide support for synchronous, data parallel models on a distributed memory system.
III-D1 Broadcast Operator
MaTEx-TensorFlow ensures that each model replica is exactly the same at the start of the training phase. To ensure this, we use a broadcast operator in which the default MPI process (also referred to as rank zero in MPI terminology) broadcasts the model at the start of the training phase. A TensorFlow variable has two components: 1) a tensor with the actual value, and 2) an associated computational graph operation. For the broadcast operator, TensorFlow creates an unordered list of initializer graphs for each variable. Since the TensorFlow scheduler does not schedule variables in any fixed order, we add explicit data dependencies to ensure that the buffers for broadcast are matched correctly.
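Why explicit dependencies make the broadcast buffers line up can be shown with a small simulation. This is a sketch under assumptions: the chain is built over variable names sorted alphabetically, which is one possible deterministic order and not necessarily the one MaTEx-TensorFlow uses, and `chain_dependencies` and `schedule` are hypothetical helpers standing in for graph construction and for an unordered scheduler.

```python
import random


def chain_dependencies(var_names):
    """Make each broadcast depend on the previous one in a fixed order."""
    ordered = sorted(var_names)  # assumed ordering, identical on every rank
    return {
        name: ([ordered[i - 1]] if i > 0 else [])
        for i, name in enumerate(ordered)
    }


def schedule(var_names, deps, seed):
    """Mimic an unordered scheduler: run any ready broadcast, chosen at random."""
    rng = random.Random(seed)
    done, order = set(), []
    remaining = list(var_names)
    while remaining:
        ready = [n for n in remaining if all(d in done for d in deps.get(n, []))]
        pick = rng.choice(ready)
        remaining.remove(pick)
        done.add(pick)
        order.append(pick)
    return order


names = ["fc/weights", "conv1/bias", "fc/bias", "conv1/weights"]
deps = chain_dependencies(names)
# Each seed stands for a different rank making its own scheduling decisions.
for rank in range(4):
    print(rank, schedule(names, deps, seed=rank))
```

With the chain in place only one broadcast is ready at any time, so every simulated rank issues the broadcasts in the same order and the i-th broadcast on rank zero pairs with the i-th broadcast elsewhere. Passing an empty `deps` dictionary removes that guarantee.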
III-D2 MPI_Allreduce Operator
MaTEx-TensorFlow provides equivalence to the default SGD algorithm. Since it uses data parallelism, the replicas need to be synchronized after each batch. We use an MPI_Allreduce operator for achieving this objective. Since the gradients (model updates) are returned as data tokens to the framework, the MPI_Allreduce operator has a simpler structure. The current version of MaTEx-TensorFlow provides layer-wise all-to-all reduction. This sets up an ordered list of reduction operators and then sequentially synchronizes each layer across ranks, ensuring that the buffers are correctly ordered.
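The equivalence with sequential SGD rests on a simple identity: if a batch is split into equal shards, the average of the per-shard gradients equals the gradient over the whole batch. The sketch below checks this for a one-dimensional linear model. It runs in a single process; `allreduce_sum` is a stand-in for MPI_Allreduce with a sum operation, and all function names are hypothetical.

```python
def gradient(w, b, batch):
    """Gradient of the mean squared error of y = w * x + b over (x, t) pairs."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - t) * x for x, t in batch) / n
    grad_b = sum(2.0 * (w * x + b - t) for x, t in batch) / n
    return [grad_w, grad_b]


def allreduce_sum(per_rank_buffers):
    """Stand-in for an allreduce: every rank receives the element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank_buffers)]
    return [list(total) for _ in per_rank_buffers]


def data_parallel_gradient(w, b, batch, num_ranks):
    """Each rank computes a gradient on its shard, then all ranks average."""
    shard = len(batch) // num_ranks  # assumes the batch divides evenly
    local = [
        gradient(w, b, batch[r * shard:(r + 1) * shard]) for r in range(num_ranks)
    ]
    reduced = allreduce_sum(local)
    return [[g / num_ranks for g in buf] for buf in reduced]


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
print("sequential:", gradient(0.5, 0.1, batch))
print("2 ranks:   ", data_parallel_gradient(0.5, 0.1, batch, 2)[0])
```

The two results agree up to floating-point rounding, and every rank ends the step holding the same averaged gradient, so the replicas stay identical after the update.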
The use of the MPI_Allreduce function provides a communication complexity of $O(\log p)$, where $p$ is the number of nodes. As the work to compute the gradients is divided evenly among nodes when using strong scaling, this will provide approximately $O(W/p)$ work, where $W$ is the amount of computation necessary to compute the gradients for each batch on a single compute node.
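These two terms can be combined into a rough per-batch time model. The following is a sketch under assumptions rather than a result from the paper: $p$ is the number of nodes, $W$ is the single-node compute time per batch, and $c$ is a hypothetical constant for the cost of one reduction step, introduced here only for illustration.

```latex
T(p) \approx \frac{W}{p} + c \log_2 p,
\qquad
S(p) = \frac{T(1)}{T(p)} \approx \frac{W}{\dfrac{W}{p} + c \log_2 p}
```

Since $\log_2 1 = 0$, the model gives $T(1) = W$. When $W/c$ is large the first term dominates and $S(p)$ approaches $p$; when $W/c$ is small the logarithmic term limits the speedup. This is consistent with the observation in the evaluation that networks with a higher ratio of computation to communication scale better.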
III-D3 User-operations versus TensorFlow Runtime
We choose to modify the TensorFlow backend directly. Though this has an increased engineering requirement, it allows for delivering a seamless user experience. Very few changes are required for the user’s scripts in this schema, making this method the simplest for the end-user, with the only substantial changes being the use of parallel data readers rather than sequential ones.
III-E Synchronous versus Asynchronous Implementation
To enable efficient implementation of the backend modifications, we place certain constraints on how data is distributed across the system. The most significant constraints are that data parallelism is the only mode that will be used and that synchronous algorithms are the main vehicles of computation.
The choice of data over model parallelism is due to the trend towards more expensive computation and fewer parameters for state-of-the-art neural networks. Model parallelism distributes different pieces of the model across different nodes, and for a DL algorithm transmits the activations, which are large for convolutions and small for fully connected networks. Data parallelism, however, duplicates the model across nodes and divides up the processing of the dataset between them. For convolutions, this is far more efficient. Moreover, as we are requiring that our algorithms be synchronous, the advantages of model parallelism decrease further.
We implement synchronous models rather than asynchronous models to maintain numerical equivalence with the sequential algorithm (cf. Figure 8). Synchronous models maintain this equivalence, but at the cost of potentially having some devices idle at times. Asynchronous models prioritize full utilization of all devices at all times over equivalence to the sequential algorithm. One way in which asynchronous algorithms are used is under the parameter server paradigm, where a single node is responsible for maintaining the model and the remaining nodes are workers. Each worker independently computes updates, which are applied to the model as they are received. This paradigm may lead to stale updates, and in many cases requires a “warm start,” that is, for the model to be trained synchronously for a time before switching to a parameter server. At large scale, the server/worker model can also create a communication bottleneck, where the server(s) are overwhelmed with worker requests.
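The notion of a stale update can be made concrete with a toy example. The sketch below is a deliberately simplified single-process model, not a description of any real parameter server: each worker minimises 0.5 * (w - target)^2 for its own target, and the function names are hypothetical.

```python
def grad(w, target):
    """Gradient of 0.5 * (w - target) ** 2 with respect to w."""
    return w - target


def synchronous_step(w, targets, lr):
    """All workers use the same weights; their gradients are averaged once."""
    grads = [grad(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def parameter_server_round(w, targets, lr):
    """All workers read the same weights; the server applies updates one by one.

    Returns the final weight and, for each update, how far the server's weight
    had already moved from the snapshot the gradient was computed on.
    """
    snapshot = w
    staleness = []
    for t in targets:
        staleness.append(abs(w - snapshot))
        w = w - lr * grad(snapshot, t)  # gradient computed on the old snapshot
    return w, staleness


print(synchronous_step(4.0, [1.0, 3.0], lr=0.5))
print(parameter_server_round(4.0, [1.0, 3.0], lr=0.5))
```

In the asynchronous round the second worker's gradient was computed on weights that the server had already changed, which is what the staleness list records. The synchronous step has no such gap, at the price of waiting for every worker before updating.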
III-F I/O Considerations and Data Readers
Besides supporting user-transparent distributed memory execution, MaTEx provides interfaces for reading and automatically distributing datasets across multiple compute nodes. Currently, MaTEx supports parallel NetCDF format, CSV, MNIST and CIFAR dataset formats.
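One common way to distribute a dataset across processes is a contiguous block partition, sketched below. This illustrates the general idea only and is not the MaTEx data reader interface; `shard_bounds` is a hypothetical helper, and `rank` and `num_ranks` correspond to what an MPI program would obtain from its communicator.

```python
def shard_bounds(num_samples, rank, num_ranks):
    """Return the [start, stop) range of sample indices owned by `rank`.

    Samples are split into contiguous blocks; when the count does not divide
    evenly, the first `num_samples % num_ranks` ranks receive one extra sample.
    """
    base, extra = divmod(num_samples, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


for rank in range(4):
    print(rank, shard_bounds(10, rank, 4))
```

Every sample is assigned to exactly one rank and shard sizes differ by at most one, which keeps the per-rank work balanced under strong scaling.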
III-G Putting It All Together
In this section, we present the integration of the proposed MaTEx-TensorFlow design. Specifically, we have extended TensorFlow 1.0.0 for this purpose. The changes to the runtime are completely abstracted from the user. As shown in Figure 4, the only differences between the serial TensorFlow script and the multi-node script relate to the data readers. These readers are optional as well. The only requirement is to provide input numpy arrays.
IV Experimental Evaluation
In this section, we present a performance evaluation of MaTEx-TensorFlow. We compare the performance with serial TensorFlow. Table I provides a description of the architectures used for evaluation. Table II provides a description of the datasets and neural networks used for performance evaluation.
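| Name | CPU (cores per node) | GPU | Network | MPI | cuDNN | CUDA | Nodes | Total cores |
|---|---|---|---|---|---|---|---|---|
| K40 | Haswell (20) | K40 | IB | OpenMPI 1.8.3 | 4 | 7.5 | 8 | 160 |
| SP | Ivybridge (20) | N/A | IB | OpenMPI 1.8.4 | N/A | N/A | 20 | 400 |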
In Figures 5 and 6 we evaluate both the computation and communication costs of other neural networks relative to AlexNet, the oldest of these four models. These charts provide a graphical characterization of the scaling potential for each network. As the number of compute nodes increases, the communication cost increases logarithmically, but the aggregate compute cost is constant (under strong scaling). This indicates that models with a higher ratio of computation to communication, as shown in Figure 7, will scale better. Based on these figures, we see that the most difficult model to scale is AlexNet and the one with the best scaling properties is GoogLeNet. This is empirically confirmed by the strong scaling experiments in section IV-B.
IV-B Performance Comparisons
In this section, we present a performance evaluation of MaTEx-TensorFlow using several neural network models. We use SB and K40 architectures (please refer to Table I). Specifically, we present the speedup relative to 1 compute node/device (in the case of GPUs). We use strong scaling with a batch size of 256 for AlexNet and GoogLeNet, 128 for InceptionV3 and 64 for ResNet50. Figure 11 shows the relative speedup comparisons for CPU (SB architecture) and GPU (K40 architecture). We observe that AlexNet scales the worst of all, achieving less than 2x speedup on 4 GPUs and 11x speedup on 16 CPU nodes. The ratio of computation to communication dictates how well a network scales: computationally more expensive networks with fewer parameters, such as InceptionV3 and ResNet50, scale better than AlexNet, and GoogLeNet scales the best on 4 K40 GPUs. On CPUs, the tested models other than AlexNet scale well up to 16 CPU nodes, where GoogLeNet, InceptionV3 and ResNet50 achieve speedups of 14.7x, 14.5x and 15.3x, respectively.
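The CPU speedups reported above can be restated as parallel efficiency, that is, speedup divided by the number of nodes. The short calculation below uses only the 16-node figures quoted in this section; `parallel_efficiency` is a hypothetical helper name.

```python
def parallel_efficiency(speedup, num_units):
    """Fraction of ideal linear speedup achieved on `num_units` nodes or devices."""
    return speedup / num_units


# Speedups on 16 CPU nodes as reported in the text above.
reported = {
    "AlexNet": 11.0,
    "GoogLeNet": 14.7,
    "InceptionV3": 14.5,
    "ResNet50": 15.3,
}
for model, speedup in reported.items():
    efficiency = parallel_efficiency(speedup, 16)
    print(f"{model}: {100 * efficiency:.1f}% of linear speedup on 16 CPU nodes")
```

Expressed this way, the three networks with a higher computation-to-communication ratio stay above 90% efficiency on 16 nodes, while AlexNet falls below 70%.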
We also note that the addition of new user-operations, as described in Section III, introduces non-trivial overhead. We observe that the overhead is 12%. We intend to further reduce the overhead in upcoming releases of MaTEx-TensorFlow.
Figure 8 compares the loss curves of MaTEx-TensorFlow and sequential TensorFlow using AlexNet. The objective is to empirically prove the equivalence of MaTEx-TensorFlow in terms of loss in comparison to the sequential implementation. We train AlexNet with a version of the quick solver described in . As observed from the figure, the losses are identical – which validates our hypothesis.
V Related Work
Several researchers have conducted in-depth exploration of DL algorithms, including a few focusing on multi-core/many-core systems. Some of these researchers further considered execution on large scale systems. The most widely used DL implementations include Caffe, Warp-CTC, Theano [14, 13], Torch, Microsoft CNTK, Chainer and Google TensorFlow, all of which implement GPU support using the NVIDIA CUDA Deep Neural Network (cuDNN) library.
For large scale execution of machine learning models in general, several programming models have been proposed. MapReduce provides large scale parallel execution using Map and Reduce tasks. Although MapReduce as a model is generic, its implementations, such as Hadoop, have been widely criticized for performance reasons. Spark, a recently proposed programming model, supports in-memory iterative training of algorithms. DistBelief is an approach proposed by Dean et al. that uses a parameter server for model updates at a central server. Despite scaling well due to asynchronicity, it has poor convergence properties and the server becomes a bottleneck.
Message Passing Interface (MPI) [23, 24] has become the most common method of building large scale DL algorithms. It provides abstractions for both pair-wise and group communication and is capable of using high speed interconnects natively, making it particularly suitable to supercomputing environments. Among the toolkits that use MPI are Microsoft CNTK, the Machine Learning Toolkit for Extreme Scale (MaTEx) version of Caffe [37, 45, 46, 47, 48, 49, 50], and the multi-node version of Chainer.
TensorFlow itself provides abstractions for building DL algorithms, including computational graph structures and automatic differentiation. Furthermore, it provides methods for the user to define a parameter server style parallel training regimen using Google’s Remote Procedure Call, which is restricted to the sockets interface and static assignment of work to threads. To do so, the user must define a cluster containing a server and workers, divide communication tasks among them, specify that each device receives a copy of the model, enforce synchronization, and wrap important operators so that the parallel training can use them. Similarly, a recent release by Baidu, which uses MPI to train a model in parallel, requires that the user get MPI related variables from the environment and wrap the same important operators as TensorFlow requires (along with several additional ones). Earlier work included MPI outside of the TensorFlow runtime, explicitly inserting the MPI commands into the user script.
This research and development is supported by a grant from Advanced Scientific Computing Research (ASCR) on “Convergence of Machine Learning and Deep Learning for HPC Modeling and Simulation”, Analysis in Motion (AIM) Laboratory Directed Research and Development (LDRD) and US Government.
Deep Learning (DL) algorithms have become a popular choice for data analysis. Several DL implementations – primarily limited to a single compute node – such as Caffe, TensorFlow, Theano and Torch have become readily available. Distributed DL implementations capable of execution on large scale systems are becoming important to address the computational needs of large data produced by scientific simulations and experiments. Yet, the adoption of distributed DL faces significant impediments: 1) Most implementations require DL analysts to modify their code significantly – which is a show-stopper, 2) Several distributed DL implementations are geared towards cloud computing systems – which is inadequate for execution on massively parallel systems such as supercomputers.
This work addresses each of these problems. We provide a distributed memory DL implementation by incorporating required changes in the TensorFlow runtime itself. This dramatically reduces the entry barrier for using distributed TensorFlow implementation. We use Message Passing Interface (MPI) – which provides performance portability, especially since MPI specific changes are abstracted from users. Lastly – and arguably most importantly – we make our implementation available for broader use, under the umbrella of Machine Learning Toolkit for Extreme Scale (MaTEx) at http://hpc.pnl.gov/matex.
-  Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization, “Scientific Discovery at the Exascale,” 2011.
-  DOE ASCAC Subcommittee, “Synergistic Challenges in Data-Intensive Science and Exascale Computing,” 2013.
-  M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: An update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, Nov. 2009.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, Nov. 2011.
-  C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
-  MATLAB, version 7.10.0 (R2010a). Natick, Massachusetts: The MathWorks Inc., 2010.
-  P. Baldi, P. Sadowski, and D. Whiteson, “Searching for Exotic Particles in High-Energy Physics with Deep Learning,” Nature Commun., vol. 5, p. 4308, 2014.
-  A. Ben-Hur, C. S. Ong, S. Sonnenburg, B. Schölkopf, and G. Rätsch, “Support vector machines and kernels for computational biology,” PLoS Comput Biol, vol. 4, no. 10, p. e1000173, 2008.
-  A. L. Tarca, V. J. Carey, X.-w. Chen, R. Romero, and S. Drăghici, “Machine learning and its applications to biology,” PLoS Comput Biol, vol. 3, no. 6, p. e116, 06 2007.
-  A. Vossen, “Support vector machines in high-energy physics,” 2008.
-  P. Balaprakash, Y. Alexeev, S. A. Mickelson, S. Leyffer, R. L. Jacob, and A. P. Craig, “Machine learning based load-balancing for the cesm climate modeling package,” 2013.
-  Y. Liu, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, W. Collins et al., “Application of deep convolutional neural networks for detecting extreme weather in climate datasets,” arXiv preprint arXiv:1605.01156, 2016.
-  J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), Jun. 2010, oral presentation.
-  F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: new features and speed improvements,” Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
-  R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: A modular machine learning software library,” 2002.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-  J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2021068
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning.” ICML (3), vol. 28, pp. 1139–1147, 2013.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, 1996, pp. 789–828.
-  A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. L. Lusk, W. Saphir, T. Skjellum, and M. Snir, “MPI-2: Extending the message-passing interface,” in Euro-Par, Vol. I, 1996, pp. 128–135. [Online]. Available: citeseer.ist.psu.edu/geist96mpi.html
-  A. Vishnu, C. Siegel, and J. Daily, “Distributed TensorFlow with MPI,” 2016.
-  Machine Learning Toolkit for Extreme Scale, “MaTEx,” http://hpc.pnl.gov/matex.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
-  A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” CoRR, vol. abs/1404.5997, 2014. [Online]. Available: http://arxiv.org/abs/1404.5997
-  M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. HotCloud’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863103.1863113
-  T. White, Hadoop: The Definitive Guide, 1st ed. O’Reilly Media, Inc., 2009.
-  Baidu Research, “TensorFlow (0.12.1) with MPI,” 2017. [Online]. Available: https://github.com/baidu-research/tensorflow-allreduce
-  A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR 2015, 2015. [Online]. Available: http://arxiv.org/abs/1409.4842
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  C. Siegel, J. Daily, and A. Vishnu, “Adaptive neuron apoptosis for accelerating deep learning on large scale systems,” arXiv preprint arXiv:1610.00790, 2016.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  Baidu Research, “warp-ctc,” 2016. [Online]. Available: https://github.com/baidu-research/warp-ctc
-  A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Padmilac, H. Parthasarathi, B. Peng, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, Y. Wang, H. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig, “An introduction to computational networks and the computational network toolkit,” Tech. Rep. MSR-TR-2014-112, August 2014. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=226641
-  S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. [Online]. Available: http://learningsys.org/papers/LearningSys_2015_paper_33.pdf
-  J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
-  J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 1232–1240. [Online]. Available: http://books.nips.cc/papers/files/nips25/NIPS2012_0598.pdf
-  J. Chen, R. Monga, S. Bengio, and R. Józefowicz, “Revisiting distributed synchronous SGD,” CoRR, vol. abs/1604.00981, 2016. [Online]. Available: http://arxiv.org/abs/1604.00981
-  A. Vishnu and K. Agarwal, “Large scale frequent pattern mining using MPI one-sided model,” in 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015, Chicago, IL, USA, September 8-11, 2015, pp. 138–147. [Online]. Available: http://dx.doi.org/10.1109/CLUSTER.2015.30
-  A. Vishnu, J. Narasimhan, L. Holder, D. J. Kerbyson, and A. Hoisie, “Fast and accurate support vector machines on large scale systems,” in 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015, Chicago, IL, USA, September 8-11, 2015, pp. 110–119. [Online]. Available: http://dx.doi.org/10.1109/CLUSTER.2015.26
-  A. Vishnu, H. van Dam, N. R. Tallent, D. J. Kerbyson, and A. Hoisie, “Fault modeling of extreme scale applications using machine learning,” in 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016, pp. 222–231. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2016.111
-  S. Shohdy, A. Vishnu, and G. Agrawal, “Fault tolerant support vector machines,” in 45th International Conference on Parallel Processing, ICPP 2016, Philadelphia, PA, USA, August 16-19, 2016, pp. 598–607. [Online]. Available: http://dx.doi.org/10.1109/ICPP.2016.75
-  ——, “Fault tolerant frequent pattern mining,” CoRR, vol. abs/1610.05116, 2016. [Online]. Available: http://arxiv.org/abs/1610.05116
-  S. Zheng, A. Vishnu, and C. H. Q. Ding, “Accelerating deep learning with shrinkage and recall,” in 22nd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2016, Wuhan, China, December 13-16, 2016, pp. 963–970. [Online]. Available: http://dx.doi.org/10.1109/ICPADS.2016.0129