Distributed Learning of Deep Neural Networks
using Independent Subnet Training
Abstract
Stochastic gradient descent (SGD) is the method of choice for distributed machine learning, by virtue of its low per-iteration cost on compute nodes, which leads to almost linear speedups in theory. Nevertheless, such speedups are rarely observed in practice, due to high communication overheads during synchronization steps.
We alleviate this problem by introducing independent subnet training: a simple, jointly model-parallel and data-parallel approach to distributed training for fully connected, feedforward neural networks. During subnet training, neurons are stochastically partitioned without replacement, and each partition is sent to only a single worker. This reduces the overall synchronization overhead, as each worker receives only the weights associated with the subnetwork it has been assigned. Subnet training also reduces synchronization frequency: since workers train disjoint portions of the network, training can proceed for long periods of time before synchronization, similar to local SGD approaches. We empirically evaluate our approach on real-world speech recognition and product recommendation applications, where we observe that subnet training accelerates training compared to state-of-the-art distributed baselines, and often boosts test accuracy, as it implicitly combines dropout and batch normalization regularization during training.
1 Introduction
Deep neural networks (DNNs) have led to the recent success of machine learning in real-life applications Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Girshick (2015); Long et al. (2015); Goodfellow et al. (2016). However, despite the progress in modern hardware Jouppi et al. (2017), training DNNs can take an impractically long time on a single machine, and accelerating DNN training over a compute cluster is not easy. Indeed, it has become a fundamental challenge in modern computing systems Ratner et al. (2019).
Mitigating hardware inefficiency. Much research focuses on parallelized/distributed DNN training Dean et al. (2012); Chilimbi et al. (2014); Li et al. (2014); Hadjis et al. (2016). There, methods to accelerate training may roughly be categorized as model-parallel and data-parallel. In the former Hadjis et al. (2016); Dean et al. (2012), different compute nodes are responsible for different parts of a single neural network; in the latter Zhang et al. (1990); Farber and Asanovic (1997); Raina et al. (2009), each compute node updates a complete copy of the neural network's parameters on different portions of the data. In both cases, the obvious way to speed up learning is to use more compute hardware for the backpropagation/gradient step: the neural network is split across more CPUs/GPUs in the model-parallel setting, and fewer gradient descent operations are required per compute node in the data-parallel setting.
Due to its ease of implementation, distributed data-parallel training is most commonly used, and it is the method best supported by common deep learning software such as TensorFlow Abadi et al. (2016) and PyTorch Paszke et al. (2017). It is generally understood that, for the best data-parallel training accuracy on models where the gradients are not sparse, such as in a DNN, local results need to be synchronized globally once per iteration among compute nodes, in order to complete a step of backpropagation.¹ However, there are limitations preventing this approach from easily scaling out. In particular, the size of the mini-batches is a crucial hyperparameter for the neural network's convergence. Adding hardware while keeping the batch size constant means that each node performs backpropagation on its own local portion of the batch faster, but it leaves the synchronization step no faster. In fact, synchronization can even become slower as the number of workers increases. If synchronization time dominates, adding more machines can actually make training slower, in terms of wall-clock time.

¹Asynchronous training is possible for models that tend to have sparse updates; for a discussion of asynchrony, see Section 6.
In contrast, adding more machines while making the batch size larger, so that synchronization costs do not begin to dominate, is often cumbersome to tune properly: very large batch sizes often do not speed convergence compared to smaller ones, and they can hurt generalizability Goyal et al. (2017); Yadan et al. (2013); You et al. (2017); Smith et al. (2017); Codreanu et al. (2017); You et al. (2019b, a). Avoiding this requires excessive hyperparameter tuning of step and batch sizes, often making tuning the most time-consuming part of the process for practitioners. There is also an ongoing debate about the efficiency of such methods in practice Ma et al. (2019); Golmant et al. (2018).
Mitigating statistical inefficiency. Overfitting is an important problem often observed when training DNNs; it is particularly problematic when using large batch sizes to maintain hardware efficiency during training. Techniques such as dropout Srivastava et al. (2014), dropconnect Wan et al. (2013), and batch normalization Ioffe and Szegedy (2015) provide regularization during training to prevent overfitting; see Labach et al. (2019) for an overview of dropout methods. In dropout-based regularization, the key idea is to randomly drop units in the neural network during training, which seems to prevent them from co-adapting. This leads to models whose neurons tend to behave independently, so they are more robust to discrepancies between training and testing data. Batch normalization Ioffe and Szegedy (2015), on the other hand, appears to achieve regularization by limiting the way the distribution of activations between consecutive layers changes (known as internal covariate shift).
Our proposal: independent subnet training. Interestingly, batch normalization and dropout, when combined, can easily lead to worse generalization performance than either applied individually Li et al. (2018). The central idea of this paper, called independent subnet training, seems to contradict this finding. It calls for an extreme version of dropout to facilitate combined model- and data-parallel distributed training. The subnet training algorithm decomposes the neural network into a set of independent subnets for the same task, by applying dropout without replacement to partition the overall network. Each of those subnets is trained at a different compute site, using batch normalization, for one or more local SGD steps before a synchronization step. Because of the extreme dropout (each site in an m-machine cluster gets only a small, 1/m fraction of the neurons at each layer of the network), batch normalization is required to handle the shift in distribution observed between training and testing, when all of the subnets are recomposed.
This method has several advantages with respect to hardware efficiency. Because the subnets share no parameters, synchronization requires no aggregation, in contrast to the data-parallel model: it is just an exchange of parameters. Moreover, each subnet is a fully operational classifier by itself, and can be updated locally for a very large number of iterations before synchronizing. This radically reduces communication cost. Communication costs are also reduced because, in an m-machine cluster, each machine gets at most a 1/m fraction of the weights at each layer of the network; contrast this with data-parallel training, where each machine must receive all of the weights.
Independent subnets have advantages over classical model-parallel methods as well. During local updates, no synchronization pipelines between subnetworks are required, in contrast to the model-parallel setting. This reduces communication costs. Moreover, independent subnets retain many of the advantages of model-parallel methods. For example, each machine gets just a small fraction of the overall model, which reduces the local memory requirement. This can be a significant advantage when training very large models using GPUs, which tend to have limited memory. Another advantage is simplicity: independent subnets achieve model-parallelism "for free", without complicated distributed communication schemes for exchanging partial results.
Key attributes of independent subnet training. The key attributes of our proposed algorithm can be summarized as follows:
- Partitioning the original network into subnets implicitly prevents overfitting, due to its relation to dropout: our experimental findings support this statement, showing increased generalization accuracy compared to the simple data-parallel model.
- By sampling without replacement, the model is split into non-overlapping subnetworks, reducing the interdependence between compute nodes, so that aggressive local updates can be applied, in contrast to model-parallel approaches.
- When sharing the model among the cluster, only the assigned subnet needs to be transmitted to each node, instead of the whole model.
- Since each worker trains a "thin" model locally, it can perform more local iterations than in classical data-parallel training, and it only communicates this thin model back over the network.
- During the model update at the parameter server, instead of averaging gradients/models, we simply update the model parameters by copying the data sent from the corresponding partition, which requires no arithmetic operations.
We evaluate our method on real-life applications, speech command recognition and Amazon product recommendation, on Amazon Web Services (AWS) clusters. We also analyze its convergence behavior in order to validate the effectiveness and efficiency of our approach.
2 Preliminaries
DNN training. DNNs are instances of supervised learning: e.g., for data classification, we are interested in optimizing a loss function over a set of labeled data examples, such that, after training, the classifier maps an unseen sample approximately to its corresponding label. The loss $\ell$ encodes the neural network architecture, with parameters $w$. Formally, given a data source probability distribution $\mathcal{D}$, every (labeled) data sample drawn from $\mathcal{D}$ can be represented as a pair $(x, y)$, where $x$ represents the data sample, and $y$ is its corresponding label. Deep learning then aims at finding $w^\star$ that minimizes the empirical loss:

$$w^\star \in \arg\min_{w \in \mathcal{W}} \; \tfrac{1}{n} \sum_{i=1}^{n} \ell(w; x_i, y_i) \qquad (1)$$

where $\mathcal{W}$ represents the continuous hypothesis space of values of the parameters $w$. Henceforth, we denote datasets as $D = \{(x_i, y_i)\}_{i=1}^{n}$, and will assume they include both the samples $x_i$ and their corresponding labels $y_i$.
The minimization in (1) can be pursued with different approaches Wright and Nocedal (1999); Zeiler (2012); Kingma and Ba (2014); Duchi et al. (2011); Ruder (2016), but almost all neural network training is accomplished via some variation of SGD: we compute (stochastic) gradient directions that, in expectation, point towards paths that decrease the loss, and then set $w_{t+1} = w_t - \eta \, \nabla \ell(w_t; \mathcal{B}_t)$. Here, $\eta$ represents the learning rate, and $\mathcal{B}_t$ represents a single example or a mini-batch of examples, randomly selected from $D$.
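As a concrete illustration, the mini-batch SGD update above can be sketched in a few lines of NumPy. The least-squares loss, batch size, and learning rate here are hypothetical stand-ins for the neural-network setting, chosen only to show the update rule at work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss(w) = mean_i (x_i . w - y_i)^2.
# This is an illustrative stand-in for the neural-network loss in the text.
X = rng.normal(size=(256, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def minibatch_grad(w, batch_idx):
    """Stochastic gradient of the squared loss on a sampled mini-batch B_t."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(batch_idx)

w = np.zeros(10)
eta = 0.05  # learning rate (eta in the text)
for t in range(500):
    batch = rng.choice(len(X), size=32, replace=False)  # B_t, drawn from D
    w = w - eta * minibatch_grad(w, batch)              # w_{t+1} = w_t - eta * grad
```

Because the toy problem is noiseless, the iterates converge to the true parameters even with a constant step size.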
Why can classical distributed approaches be ineffective? It is generally understood that computing the gradient over the whole data set is wasteful Defazio and Bottou (2018). Instead, in each iteration, mini-batch SGD computes the gradient for a small subsample $\mathcal{B}$ of $D$. In a centralized system, we often use no more than a few hundred data items in $\mathcal{B}$, and few would ever advocate using more than a few thousand Goyal et al. (2017); Yadan et al. (2013); Smith et al. (2017). Past that, the system wastes cycles lowering an almost nonexistent error on the gradient estimate.
For parallel/distributed computation, this is problematic for two reasons. First, it makes it difficult to speed up the computation by adding more computational units: since the batch size is quite small, splitting the task across more than a few compute nodes is not beneficial, which motivates different training approaches for neural networks Berahas et al. (2017); Bottou et al. (2018); Kylasa et al. (2018); Xu et al. (2017); Berahas et al. (2019); Martens and Grosse (2015). Computing the gradient update on a single fast GPU may take a few milliseconds at most.
Second, gathering the updates in a distributed setting introduces a non-negligible time overhead in large clusters, which is often the main bottleneck towards efficient large-scale computing; see Section 4 for alternative solutions to this issue. This imbalance between communication and computation capacity can lead to significant increases in training time when a larger, more expensive cluster is used. As an example, consider performing the forward pass through two hidden layers of a neural network, expressed as $Z = \sigma(\sigma(X W_1) W_2)$. Here, $\sigma$ is an activation function such as ReLU, and $W_1$ and $W_2$ are disjoint parts of the vectorized full model $w$. The data-parallel version of this computation partitions the data matrix $X$ row-wise into $X_1, \dots, X_m$ and runs $\sigma(\sigma(X_j W_1) W_2)$ at compute site $j$.
However, even this simplified data-parallel learning example requires a large amount of communication to broadcast the model to each machine. In particular, imagine that the input data have $10^5$ features and our two hidden layers have 4096 neurons each, so that $W_1$ and $W_2$ are $10^5 \times 4096$ and $4096 \times 4096$ matrices, respectively. Running this computation using single-precision arithmetic on 8 machines, over a batch of 256 data points, requires broadcasting two matrices of size 1.63GB and 0.07GB, respectively, for 13.6GB of total communication to process each batch. Increasing the cluster size to 64 machines with a batch size of 1024 results in roughly 108GB of total communication. Even with super-fast GPUs, the time required for this transfer can dominate the overall computation time.
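The communication arithmetic can be checked directly. The layer shapes assumed below ($10^5$ input features, two 4096-neuron hidden layers) are our reconstruction from the stated matrix sizes, not given explicitly in the text:

```python
# Reproduce the communication-cost arithmetic from the text
# (assumed shapes: 1e5 input features, two hidden layers of 4096 neurons,
# single-precision floats).
BYTES_PER_FLOAT = 4
GB = 1e9

w1_bytes = 100_000 * 4096 * BYTES_PER_FLOAT   # first weight matrix, ~1.64 GB
w2_bytes = 4096 * 4096 * BYTES_PER_FLOAT      # second weight matrix, ~0.07 GB

per_machine_gb = (w1_bytes + w2_bytes) / GB

# Broadcasting the model to 8 machines for one 256-point batch:
total_8 = 8 * per_machine_gb
# Scaling to 64 machines (batch size 1024):
total_64 = 64 * per_machine_gb

print(round(total_8, 1), round(total_64, 1))
```

The totals land at roughly 13.6GB and 109GB, matching the figures quoted in the text up to rounding.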
3 Training via Independent Subnetworks
We now describe the idea of independent subnet training. It can be seen as a generalization of, and, more importantly, a distributed-computing alternative to, dropout: we retain generalization accuracy by training different subnetworks, just as in dropout. However, we also advocate an extreme version of dropping neurons, where neurons are partitioned across sites, so that with m machines, each machine has a small, 1/m fraction of the neurons from each layer. Since the neuron sets are mutually exclusive, they can be trained independently, with significantly less synchronization.
A prototype of our algorithm is described in Algorithm 1. On the left, we describe the "master" routine, which aggregates results from each compute node; on the right, we describe the "worker" routine, which runs local SGD on each subnet. Just like classical SGD, subnet training repeatedly updates the parameters in a series of training steps. A training step consists of three sub-steps:
(1) Partitioning the model. A master-routine training step begins by randomly partitioning the neurons in each of the hidden layers into m partitions; see Algorithm 1 (left panel). We denote the resulting subnetworks at the $t$-th iteration as $\mathcal{N}_t^{(j)}$, for $j = 1, \dots, m$. This process is visualized in Figure 1: on the left we show the original network, which is decomposed into three independent subnets, each containing a subset of the edges. More specifically, these partitions induce a set of submatrices of each of the network's weight matrices. These submatrices are then sent across the compute cluster, each to a specific worker.
In math, let $P_\ell^{(j)}$ refer to the $j$-th partition of neurons in layer $\ell$. If the weight matrix $W_\ell$ connects layers $\ell$ and $\ell+1$ in the network, then define the submatrix $W_\ell^{(j)}$ as the restriction of $W_\ell$ to the rows indexed by $P_\ell^{(j)}$ and the columns indexed by $P_{\ell+1}^{(j)}$. Intuitively, $W_\ell^{(j)}$ is all of the weights that connect the neurons in partition $P_\ell^{(j)}$ with the neurons in partition $P_{\ell+1}^{(j)}$. Since the neurons in the input layer and the neurons in the output layer are not partitioned (each site will use all of the input channels and produce all of the outputs), $P_0^{(j)}$ and $P_L^{(j)}$ contain all input and output neurons, respectively, for every $j$. Once the neurons have been partitioned and the various submatrices determined, the weight submatrices $W_\ell^{(j)}$, for all $\ell$, are sent to site $j$, for each $j \in \{1, \dots, m\}$. Each site has now effectively been assigned a subnetwork of the original neural network.
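The partitioning just described can be sketched in NumPy. The layer sizes and helper names below are hypothetical (the paper's implementation is in PyTorch); the sketch only illustrates sampling without replacement and submatrix extraction:

```python
import numpy as np

rng = np.random.default_rng(0)

def partition_neurons(layer_sizes, m, rng):
    """Randomly split each *hidden* layer's neuron indices into m disjoint
    partitions (sampling without replacement); input and output layers are
    replicated to every site, as in the text."""
    parts = []
    for i, n in enumerate(layer_sizes):
        if i == 0 or i == len(layer_sizes) - 1:
            parts.append([np.arange(n)] * m)       # not partitioned
        else:
            perm = rng.permutation(n)
            parts.append(np.array_split(perm, m))  # disjoint 1/m chunks
    return parts

def subnet_weights(weights, parts, j):
    """Extract site j's submatrices: rows from partition j of layer l,
    columns from partition j of layer l+1."""
    return [W[np.ix_(parts[l][j], parts[l + 1][j])]
            for l, W in enumerate(weights)]

# Hypothetical 3-layer network: 8 inputs, two hidden layers of 6, 4 outputs.
sizes = [8, 6, 6, 4]
weights = [rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(3)]
parts = partition_neurons(sizes, m=3, rng=rng)
subs = [subnet_weights(weights, parts, j) for j in range(3)]

# Each site's hidden-to-hidden submatrix is (6/3) x (6/3) = 2 x 2.
assert subs[0][1].shape == (2, 2)
# Hidden partitions are disjoint and jointly cover all neurons.
assert sorted(np.concatenate(parts[1]).tolist()) == list(range(6))
```

Note that the input-facing and output-facing submatrices keep all 8 input rows and all 4 output columns, matching the convention that the input and output layers are never partitioned.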
Observe that not all parameters are updated in each broadcast cycle: some parameters may be sent to no site at all, since the weight connecting two neurons in adjacent layers is sent to some site only if those two neurons happen to be assigned to the same site.
(2) Subnet local SGD. Once each site has received a subnet, one or more backpropagation steps are performed. Assuming each site $j$ maintains a local subset $D_j$ of the data $D$, the local gradient steps on the $j$-th node are performed using samples drawn from $D_j$. See Algorithm 1 (right panel).
(3) Synchronization. Finally, each site sends all of its submatrices back to the (possibly distributed) parameter server (Step 7 in Algorithm 1 (left panel)). Note that, unlike classical data-parallel training, no arithmetic is required during synchronization: we need only copy the submatrices back to the parameter server, in contrast to the model averaging of classical local SGD procedures Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang and Ré (2014); Zhang et al. (2016). Once synchronization completes, the next training step begins. Once training ends, the resulting neural network can be deployed without modification.
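The copy-only synchronization can be sketched as follows. The dimensions are toy values and `synchronize` is a hypothetical helper, not the paper's code; the point is that disjoint partitions make the update a plain write, with no averaging:

```python
import numpy as np

def synchronize(W, row_part, col_part, updated_subs):
    """Parameter-server update: write each worker's updated submatrix back
    into the global weight matrix. Since the partitions are disjoint, this
    is a plain copy -- no aggregation arithmetic is needed."""
    for j, sub in enumerate(updated_subs):
        W[np.ix_(row_part[j], col_part[j])] = sub
    return W

rng = np.random.default_rng(1)
W = np.zeros((6, 6))
row_part = np.array_split(rng.permutation(6), 3)   # 3 disjoint row blocks
col_part = np.array_split(rng.permutation(6), 3)   # 3 disjoint column blocks
updated = [np.full((2, 2), j + 1.0) for j in range(3)]  # workers' results

W = synchronize(W, row_part, col_part, updated)

# Every worker's 2x2 block landed; 3 * 4 = 12 of the 36 entries were touched,
# illustrating that some parameters receive no update in a given cycle.
assert np.count_nonzero(W) == 12
```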
Some care is required during the synchronization phase to make this work well in practice. Because of the extreme dropout during training (each machine receives just a tiny, at most 1/m, fraction of the weights at each layer), we observe that it is necessary to use a technique such as batch normalization Ioffe and Szegedy (2015) to ensure that the distribution of inputs to each neuron is the same during training and during deployment. The intuition is as follows: since the set of neurons from the previous layer providing input to each neuron is much smaller during training than during deployment, the input to the neuron is of lower magnitude during training, as compared to when all of the neurons are used. One can counter this by scaling up the input to each neuron during training (or scaling down the input during deployment); this is what is done when dropout training is used.
Unfortunately, naive scaling leads to problems during subnet training. Since only a very small number of neurons is used by each site, each neuron sees a much higher variance during training. In other words, the neuron's input distribution differs considerably between training and deployment. With a nonlinear activation function this is a problem, as it fundamentally changes the activation pattern of the neuron, leading to worse generalization performance. Batch normalization addresses this by normalizing each neuron's pre-activation to mean zero and variance one, so that the learned parameters can be used consistently at test time. To the best of our knowledge, this work is the first to combine dropout and batch normalization in favor of better generalization accuracy, in contrast to common wisdom Li et al. (2018).
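A small sketch of the normalization at work. The fan-in values below are illustrative, not from the paper; they mimic a neuron that receives input from few upstream neurons during subnet training but many at deployment:

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each neuron's pre-activation to mean 0, variance 1 over the
    batch (the learned scale/shift parameters are omitted for brevity)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# During subnet training a neuron receives input from far fewer upstream
# neurons than at deployment; mimic this with fan-in 4 vs fan-in 32.
x_train = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 1))    # training
x_deploy = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 1)) # deployment

# Raw pre-activations have very different scales in the two regimes,
# but after normalization both present the same distribution to the
# nonlinearity -- the property the text relies on.
assert abs(batch_norm(x_train).std() - 1.0) < 1e-2
assert abs(batch_norm(x_deploy).std() - 1.0) < 1e-2
```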
Finally, subnet training is related not only to dropout techniques but also to ensemble methods Dietterich (2000). Subnet training treats a single neural network as a set of smaller networks, training them independently for a period before they are synchronized and repartitioned. During deployment, all subnets are combined and used as a single network to make decisions.
4 Related work
Here, we highlight the works most closely related to our idea. These include information quantization during training, sparsification, large-batch training, and local SGD approaches.
Data parallelism often suffers from the high bandwidth costs of communicating gradient updates between workers. To address this issue, quantized SGD Alistarh et al. (2017); Courbariaux et al. (2015); Seide et al. (2014) and sparsified SGD Aji and Heafield (2017) have been proposed. Quantized SGD applies lossy compression heuristics to quantize the gradients and decrease network traffic, in some extreme cases down to only three numerical levels Wen et al. (2017), or via more fine-grained multilevel quantizations Dettmers (2015); Gupta et al. (2015); Hubara et al. (2017); Alistarh et al. (2017). Sparsified SGD, on the other hand, reduces the exchange overhead by transmitting only the gradients with maximal magnitude (e.g., the top 1%). Although such methods are relevant to our approach, there is a fundamental difference: those techniques post-process the full-model gradients from the forward and backward propagation of the raw model, which is computationally demanding, while our approach only computes the gradients of a thin, partitioned model, so that computation cost is also reduced.
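The top-k sparsification idea described above can be sketched as follows. This is a generic illustration, not our method and not any specific paper's implementation; the threshold selection via `np.partition` is one common way to realize it:

```python
import numpy as np

def sparsify_top_k(grad, frac=0.01):
    """Keep only the largest-magnitude fraction of gradient entries
    (e.g., the top 1%), zeroing the rest."""
    flat = grad.ravel()
    k = max(1, int(frac * flat.size))
    thresh = np.partition(np.abs(flat), -k)[-k]  # k-th largest |entry|
    mask = np.abs(grad) >= thresh
    return np.where(mask, grad, 0.0)

rng = np.random.default_rng(0)
g = rng.normal(size=(100, 100))      # a dense "gradient" of 10,000 entries
sg = sparsify_top_k(g, frac=0.01)

# Only ~1% of the entries survive, so only those need to be transmitted.
assert np.count_nonzero(sg) == 100
```

In practice such schemes also accumulate the discarded residual locally; that bookkeeping is omitted here for brevity.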
Recently, the push for efficiency in large-scale scenarios has led to changes in the so-far traditional DNN training. E.g., there has been a series of papers on the general topic of using parallelism to "solve the learning problem in x minutes", for ever-decreasing values of x Goyal et al. (2017); Yadan et al. (2013); You et al. (2017); Smith et al. (2017); Codreanu et al. (2017); You et al. (2019b, a). However, increasing the number of compute nodes does not necessarily alleviate any communication bottlenecks: synchronization rounds are less frequent, but a massive amount of information still needs to be transferred over the network. Further, it is generally known, though still debated Dinh et al. (2017), that large-batch training converges to "sharp minima", hurting generalization Keskar et al. (2016); Yao et al. (2018); Defazio and Bottou (2018). Finally, more often than not, achieving such results seems to require teams of PhDs utilizing special-purpose hardware: there is no easy "plug-n-play" approach that generalizes well without extensive experimental trial-and-error. At the present time, such speedups are not achieved by small teams using off-the-shelf frameworks in conjunction with cloud services.
One way to mitigate some of these issues is distributed local SGD Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang and Ré (2014); Zhang et al. (2016): the idea is to update the parameters, through averaging, only after several local steps have been performed per compute node. In contrast to classical mini-batch SGD, which has higher statistical efficiency, local SGD "spends" more time at each worker, performing several iterations that lead to different local models; these models are then transferred over the network and averaged to form the model for the next distributed iteration, yielding higher hardware efficiency Zhang et al. (2016). In extreme settings, the averaging is performed only once, at the very end of local SGD training Zinkevich et al. (2010). This strategy reduces the communication cost, as fewer synchronization steps are performed, as in large-batch training, decreasing the overall training time without accuracy loss. However, in each synchronization round, each local compute node shares a dense model over the network, which results in a communication bottleneck. Recent approaches Lin et al. (2018) propose less frequent synchronization towards the end of training, but they cannot avoid it at the beginning. More importantly, as the experiments in Lin et al. (2018) reveal, local updates of the whole model lead to interdependence among the updates, which limits the local update period to about 16 iterations, while subnet training enables a much more aggressive local update period, since the partitioned models share no overlapping parameters.
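For contrast with the copy-only synchronization of subnet training, local SGD with model averaging can be sketched as follows. The toy least-squares problem, the number of local steps H, and the step size are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd_round(w0, shards, H, eta):
    """One communication round of local SGD: each worker runs H one-sample
    gradient steps on its own data shard, then the models are averaged."""
    local_models = []
    for X, y in shards:
        w = w0.copy()
        for _ in range(H):
            i = rng.integers(len(X))
            w -= eta * 2.0 * (X[i] @ w - y[i]) * X[i]  # one-sample gradient
        local_models.append(w)
    return np.mean(local_models, axis=0)               # dense-model averaging

d, m = 5, 4
w_true = rng.normal(size=d)
shards = []
for _ in range(m):
    X = rng.normal(size=(200, d))
    shards.append((X, X @ w_true))   # each worker holds its own shard

w = np.zeros(d)
for _ in range(30):                  # 30 communication rounds
    w = local_sgd_round(w, shards, H=50, eta=0.01)
```

Each round still transmits every worker's full dense model, which is exactly the communication cost subnet training avoids by exchanging disjoint submatrices instead.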
Finally, a way to work around the high synchronization costs in distributed SGD is the idea of gradient staleness in computations Recht et al. (2011); Dean et al. (2012); Paine et al. (2013); Zhang et al. (2013). Asynchrony has been used in distributed-memory systems, such as the DistBelief system by Google Dean et al. (2012) (also known as Downpour SGD) and Project Adam by Microsoft Chilimbi et al. (2014). While such systems asymptotically show nice convergence rate guarantees, improving with both the number of compute nodes and the number of iterations, there seems to be growing agreement that unconstrained asynchrony does not work well in practice for complex neural network models, and it seems to be losing favor. A recent paper Chen et al. (2016) strongly argued that asynchronous updates can be harmful beyond a few dozen compute nodes. Our approach naturally creates non-overlapping subnets that eliminate overwrites at the parameter server; we leave this direction for future work.
5 Experiments
We empirically demonstrate subnet training’s effectiveness and efficiency on two learning tasks.
Google Speech Commands Warden (2018): We learn two- and three-layer networks over this data set, which includes the waveforms of spoken words and is designed to help train and evaluate keyword-spotting systems. We preprocess the data to represent each waveform as a 4096-dimensional feature vector via a mel-scaled spectrogram Stevens et al. (1937). 35 labeled keywords are included for the classification task. We train two neural network structures: a two-layer fully connected network with two weight matrices of size 4096 × 4096 and 4096 × 35; and a three-layer fully connected network with three weight matrices of size 4096 × 4096, 4096 × 4096, and 4096 × 35.
Amazon-670k Bhatia et al.: We learn a larger model over a product recommendation data set with 670,000 labels. Each sample represents a product, with labels corresponding to other products users might be interested in purchasing. This is one of the standard large-scale benchmarks for extreme classification research. We train a two-layer, fully connected neural network with a hidden layer of size 512, where the two weight matrices are of size 135,909 × 512 and 512 × 670,091. The KL divergence between the network's output and the corresponding target is used as the loss function for SGD.
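A minimal NumPy sketch of the KL-divergence loss used for this model. The label count and logits below are hypothetical stand-ins (the real model has 670,091 labels); the helper name is ours:

```python
import numpy as np

def kl_div_loss(log_q, p, eps=1e-12):
    """KL(p || q) between target distribution p and the network output q
    (given as log-probabilities), averaged over the batch."""
    return np.mean(np.sum(p * (np.log(p + eps) - log_q), axis=1))

# Hypothetical batch: 2 samples over 4 labels.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 3.0, 0.0]])
log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
p = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])   # one-hot targets

loss = kl_div_loss(log_q, p)
assert loss > 0.0
```

With one-hot targets this reduces to the cross-entropy of the true label; with soft multi-label targets (as in recommendation), the full KL form matters.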
Implementation. We implement distributed subnet training in PyTorch. As a baseline, we use the classical data-parallel training implemented in PyTorch. We train the Google speech command recognition networks on three AWS clusters, with 2, 4, and 8 machine instances (m5.2xlarge CPU machines). We train the Amazon-670k extreme classification network on an AWS GPU cluster with 8 GPU machines (p3.2xlarge instances). We use a batch size of 128 waveforms/data points, and we process 75 batches before synchronization for subnet training with the Google data set and 40 batches before synchronization with the Amazon-670k data set. A separate node is used as a parameter server (an m5.2xlarge CPU machine).
We use the standard SGD method, with the step size tuned over an exponentially-spaced set, separately for the Google and Amazon data sets. We considered only fixed step sizes, with no decay schedule. The neural networks were initialized with the default PyTorch initialization, and we report single-run experiments.
Table 1: Time to reach a given test accuracy for the two-layer Google Speech classifier.

             Data Parallel              SubNet
Accuracy   2 Node  4 Node  8 Node    2 Node  4 Node  8 Node
65%           264     161     142        39      18       6
67%           352     225     213        51      21       8
69%           440     291     309        64      32      12
71%           705     421     428        76      40      17
73%           970     615     572       113      65      28
75%          1499     713     834       158     102      48
77%          2208    1297    1550       259     168      98
Table 2: Time to reach a given test accuracy for the three-layer Google Speech classifier.

             Data Parallel              SubNet
Accuracy   2 Node  4 Node  8 Node    2 Node  4 Node  8 Node
65%           359     201     289        46      22      15
67%           540     269     385        69      27      21
69%           722     336     530        91      43      30
71%           905     404     772       113      64      45
73%          1088     472    1545       137      86      64
75%          1270     540    1932       182     129     104
77%          1450     947    2222       251     214     185
Results. Results for the Google Speech classifiers are given in Tables 1 and 2. For each cluster tested, and for both subnet training and data-parallel training, we show the time required to reach different accuracies. A similar set of results is shown for the 8-node GPU cluster used to train the Amazon-670k classifier in Table 3. We also summarize the best accuracy for the two- and three-layer Google Speech classifiers, and the best P@1 for the Amazon-670k classifier, together with the associated training time at the largest cluster scale, in Table 4.
Discussion. We observe that data-parallel training is mostly ineffective at realizing any sort of speedup from distributed training. Adding more machines can actually increase the training time. This is because the interconnect between machines, provided by AWS, is slow enough that communication times dominate. The interconnect operates at 20Gbps, meaning that, e.g., transmitting a 4096 × 4096 matrix between two machines (Google data set) takes a nontrivial fraction of a second, and transmitting a 512 × 670,091 matrix (Amazon data set) takes more than a second. Adding more machines to decrease the wall-clock time of backpropagation does not decrease the overall time, because it tends to increase the communication time. This is especially the case when a single parameter server is used, which is one of the most common configurations for distributed training. We conjecture that adding additional parameter servers may alleviate this problem, but it likely would not make data-parallel training anywhere near as fast as distributed subnet training, which is more than 20 times faster than data-parallel training on the larger clusters. Subnet training was also handicapped by the single parameter server, as the AWS interconnect was slow enough that communication to the various workers was still a bottleneck, even for subnet training.
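Lower bounds on these transfer times can be reproduced with back-of-the-envelope arithmetic, assuming the full 20 Gbps line rate and no protocol overhead (real effective throughput is lower, so actual transfers take longer):

```python
# Best-case transfer times on the 20 Gbps AWS interconnect referenced above
# (single-precision weights, full line rate, no protocol overhead).
LINK_BYTES_PER_SEC = 20e9 / 8          # 20 Gbps -> 2.5 GB/s

google_matrix = 4096 * 4096 * 4        # ~67 MB
amazon_matrix = 512 * 670_091 * 4      # ~1.37 GB

t_google = google_matrix / LINK_BYTES_PER_SEC   # at least tens of milliseconds
t_amazon = amazon_matrix / LINK_BYTES_PER_SEC   # at least ~0.55 seconds
```

Even these optimistic lower bounds show why per-iteration broadcasts of full weight matrices dominate wall-clock time on such clusters.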
Given how slow the AWS interconnect is, there were two main reasons that subnet training scaled well compared to data-parallel training. One is the reduced communication to each worker node during synchronization. The other is the reduced frequency of communication: since we run 75 independent batches on each worker in the case of Google (40 in the case of Amazon), there are far fewer communication steps. While we likely lose some computational efficiency (the lack of communication may require more computing time to reach a given accuracy), this is more than compensated for by the increased communication efficiency.
Would a faster interconnect between machines have changed the results? This is a question for further investigation, due to PyTorch limitations for GPU-to-GPU communication. However, even with a faster InfiniBand interconnect (typically 3 to 5 times faster than the AWS interconnect), the difference between the training times is significant enough that data-parallel training would be unlikely to outperform subnet training. Further, most deep learning "consumers" would be limited to an interconnect like AWS's, as cloud services are the most common platform for training neural networks.
Table 3: Time to reach a given precision for the Amazon-670k classifier (8-node GPU cluster).

Accuracy        26%    28%    30%    32%    34%    36%    38%
Data Parallel  23540  26254  28953  31159  33895  39312  41572
SubNet          3288   3688   4252   5019   5791   6926   8872
Table 4: Best accuracy/precision and associated training time at the largest cluster scale.

                            Data Parallel          SubNet
                           Acc./Prec.   Time    Acc./Prec.   Time
2-layer speech classifier    77.15%     2386      78.26%      192
3-layer speech classifier    77.68%     3512      79.51%      557
Amazon-670k classifier       39.05%    49720      40.11%    12328
6 Conclusion
In this work, we propose independent subnet training for distributed optimization of fully connected neural networks. By stochastically partitioning the model into non-overlapping subnetworks, subnet training reduces both the communication overhead of model synchronization and the computation overhead of forward-backward propagation, since each worker trains a thinner model. Inheriting the regularization effect of dropout, the same neural network architecture generalizes better when optimized with subnets, compared to the classical data-parallel approach.
There are interesting open questions remaining. This paper focuses on synchronized updates in a homogeneous environment; deploying our proposal in heterogeneous distributed environments, with asynchrony and adaptive local tuning, is an interesting extension. Further, the current prototype applies only to fully connected neural networks trained with SGD; extending the approach to other architectures, such as CNNs, and to other training algorithms, such as Adam, raises interesting open questions: e.g., how should momentum and second-order information be split across subnets? Lastly, we plan to combine our method with techniques for post-processing the gradients, such as sparsification and quantization.
References
 [1] (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
 [2] (2017) Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021.
 [3] (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720.
 [4] (2017) An investigation of Newton-sketch and subsampled Newton methods. arXiv preprint arXiv:1705.06211.
 [5] (2019) Quasi-Newton methods for deep learning: forget the past, just sample. arXiv preprint arXiv:1901.09997.
 [6] The extreme classification repository: multi-label datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html
 [7] (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311.
 [8] (2016) Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981.
 [9] (2014) Project Adam: building an efficient and scalable deep learning training system. In OSDI, Vol. 14, pp. 571–582.
 [10] (2017) Scale out for large minibatch SGD: residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291.
 [11] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
 [12] (2012) Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231.
 [13] (2018) On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529.
 [14] (2015) 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561.
 [15] (2000) Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15.
 [16] (2017) Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1019–1028.
 [17] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
 [18] (1997) Parallel neural network training on Multi-Spert. In Algorithms and Architectures for Parallel Processing (ICAPP 97), 3rd International Conference on, pp. 659–666.
 [19] (2015) Fast R-CNN. arXiv preprint arXiv:1504.08083.
 [20] (2018) On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941.
 [21] (2016) Deep learning. Vol. 1, MIT Press, Cambridge.
 [22] (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
 [23] (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
 [24] (2016) Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487.
 [25] (2017) Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18 (187), pp. 1–30.
 [26] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
 [27] (2017) In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12.
 [28] (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
 [29] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [30] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
 [31] (2018) GPU accelerated subsampled Newton's method. arXiv preprint arXiv:1802.09113.
 [32] (2019) Survey of dropout methods for deep neural networks. arXiv preprint arXiv:1904.13310.
 [33] (2014) Scaling distributed machine learning with the parameter server. In OSDI, Vol. 14, pp. 583–598.
 [34] (2018) Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134.
 [35] (2018) Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217.
 [36] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
 [37] (2019) Inefficiency of K-FAC for large batch size training. arXiv preprint arXiv:1903.06237.
 [38] (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417.
 [39] (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pp. 1231–1239.
 [40] (2013) GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186.
 [41] (2017) Automatic differentiation in PyTorch.
 [42] (2009) Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 873–880.
 [43] (2019) SysML: the new frontier of machine learning systems. arXiv preprint arXiv:1904.03257.
 [44] (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701.
 [45] (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
 [46] (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.
 [47] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [48] (2017) Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.
 [49] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
 [50] (1937) A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8 (3), pp. 185–190.
 [51] (2013) Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066.
 [52] (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
 [53] (2017) TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1509–1519.
 [54] (1999) Numerical optimization. Springer Science 35 (67-68), pp. 7.
 [55] (2017) Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164.
 [56] (2013) Multi-GPU training of ConvNets. arXiv preprint arXiv:1312.5853.
 [57] (2018) Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pp. 4949–4959.
 [58] (2017) Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888.
 [59] (2019) Large-batch training for LSTM and beyond. arXiv preprint arXiv:1901.08256.
 [60] (2019) Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962.
 [61] (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
 [62] (2014) DimmWitted: a study of main-memory statistical analytics. Proceedings of the VLDB Endowment 7 (12), pp. 1283–1294.
 [63] (2016) Parallel SGD: when does averaging help? arXiv preprint arXiv:1606.07365.
 [64] (2013) Asynchronous stochastic gradient descent for DNN training. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6660–6663.
 [65] (1990) An efficient implementation of the backpropagation algorithm on the Connection Machine CM-2. In Advances in Neural Information Processing Systems, pp. 801–809.
 [66] (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603.