# Distributed Learning of Neural Networks using Independent Subnet Training

## Abstract

We alleviate costly communication/computation overhead in classical distributed learning by introducing independent subnet training: a novel, simple, jointly model-parallel and data-parallel approach to distributed neural network training. Our main idea is that, per iteration, the model’s neurons can be randomly divided, without replacement, into smaller surrogate models (dubbed subnetworks or subnets), and each subnet is sent for training to a single worker only. This way, our algorithm broadcasts the whole set of model parameters into the distributed network only once per synchronization cycle. This reduces not only the overall communication overhead but also the computation workload: each worker only receives the weights associated with the subnet it has been assigned. Further, subnet generation and training reduce synchronization frequency: since workers train disjoint portions of the network as if they were independent models, training continues for longer periods of time before synchronization, similar to local SGD approaches. We test our approach on speech recognition and product recommendation applications, as well as image classification tasks. Subnet training: i) leads to accelerated training, as compared to state-of-the-art distributed models, and ii) often boosts testing accuracy, as it implicitly leverages dropout regularization during training.

## 1 Introduction

Accelerating neural network (NN) distributed training over a compute cluster has become a fundamental challenge in modern computing systems Ratner et al. (2019); Dean et al. (2012); Chilimbi et al. (2014); Li et al. (2014); Hadjis et al. (2016). Distributed training algorithms may be roughly categorized into model parallel and data parallel. In the former Hadjis et al. (2016); Dean et al. (2012), different compute nodes are responsible for different parts of a NN. In the latter Zhang et al. (1990); Farber and Asanovic (1997); Raina et al. (2009), each compute node updates a complete copy of the NN’s parameters on different portions of the data. In both cases, the obvious way to speed up learning is to add more nodes. With more hardware, the model is split across more CPUs/GPUs in the model parallel setting, or gradients are computed using fewer data objects per compute node in the data parallel setting.

Due to its ease of implementation, data parallel training is most commonly used, and it is best supported by common deep learning software, such as TensorFlow Abadi et al. (2016) and PyTorch Paszke et al. (2017).
However, there are limitations preventing data parallelism from easily scaling out.
Adding nodes means that each node can perform forward and backward propagation more quickly on its own local data, but it leaves the synchronization step no faster.
In fact, if synchronization time dominates, adding more machines could actually make training even slower, as the number of bytes transferred to broadcast an updated model grows linearly with cluster size. This is particularly problematic in public clouds such as Amazon EC2.^{1}

Independent subnet training. The central idea in this paper, called independent subnet training (or IST), facilitates combined model- and data-parallel distributed training. IST utilizes ideas from dropout Srivastava et al. (2014) and approximate matrix multiplication Drineas et al. (2006). IST decomposes the layers of a NN into a set of independent subnets for the same task, by partitioning the neurons across different sites. Each of those networks is trained for one or more local stochastic gradient descent (SGD) iterations, before a synchronization step.

Since subnets share no parameters in the distributed setting, synchronization requires no aggregation on these parameters, in contrast to the data-parallel model: it is just an exchange of parameters. Moreover, because subnets are sampled without replacement, the interdependence among them is minimized, which allows local SGD updates for a very large number of iterations before synchronizing. This reduces communication frequency. Communication costs per synchronization step are also reduced because in an $n$-machine cluster, each machine gets between $1/n^2$ and $1/n$ of the weights; contrast this to data parallel training, where each machine must receive all of the weights.

IST has advantages over model-parallel approaches. Since each of the subnets is a fully-operational model by itself during local updates, no synchronization between subnetworks is required, in contrast to the model-parallel setting. On the other hand, IST inherits the advantages of model-parallel methods. Since each machine gets just a small fraction of the overall model, the local memory footprint is reduced. This can be an advantage when training large models using GPUs, which tend to have limited memory.

Experimental findings. We evaluate our method on several applications, including speech recognition, image classification, and a large-scale Amazon product recommendation task. We find that IST leads up to a speedup in time-to-convergence, compared to a state-of-the-art data-parallel implementation using bandwidth-optimal ring all-reduce (Xu, 2018), as well as the “vanilla” local SGD method Lin et al. (2018). Finally, because it allows for efficient model-parallel training, we show that IST is able to solve an “extreme” Amazon product recommendation task with better generalization than state-of-the-art embedding based approaches.

## 2 Preliminaries

NN training. We are interested in optimizing a loss function $\ell(\cdot, \cdot)$ over a set of labeled examples; the loss $\ell(w, \cdot)$ encodes the NN architecture, with parameters $w$. Given a data probability distribution $\mathcal{D}$, samples from $\mathcal{D}$ are denoted as $z_i = (x_i, y_i) \sim \mathcal{D}$, where $x_i$ represents an example and $y_i$ its corresponding label. Deep learning then aims to find $w^\star$ that minimizes the empirical loss over $M$ samples:

$$w^\star \in \arg\min_{w \in \mathcal{H}} \frac{1}{M} \sum_{i=1}^{M} \ell(w, z_i), \tag{1}$$

where $\mathcal{H}$ is the continuous hypothesis space of values of $w$.

The minimization can be achieved by using different approaches Wright and Nocedal (1999); Zeiler (2012); Kingma and Ba (2014); Duchi et al. (2011); Ruder (2016), but almost all NN training is accomplished via some variation on SGD: we compute (stochastic) gradient directions $\nabla \ell(w_t, z_{i_t})$ that, in expectation, decrease the loss, and then set $w_{t+1} \leftarrow w_t - \eta \nabla \ell(w_t, z_{i_t})$. Here, $\eta$ is the learning rate, and $i_t$ indexes a single example or a mini-batch of examples, randomly selected from $\mathcal{D}$.
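As a concrete illustration of the SGD update described above, the following is a minimal sketch of mini-batch SGD on a toy least-squares problem. The data set, the model $y \approx wx + b$, and every name in the code (`sgd_linear_regression`, `lr`, `batch_size`) are illustrative assumptions and not part of the method in this paper.

```python
import random


def sgd_linear_regression(data, lr=0.05, steps=2000, batch_size=4, seed=0):
    """Mini-batch SGD on the mean squared error of the model y ~ w*x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Draw a small random mini-batch instead of using the whole data set.
        batch = rng.sample(data, batch_size)
        # Gradient of the mini-batch mean squared error with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / batch_size
        # SGD step: parameters move against the stochastic gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1 (an assumption for the demo).
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_regression(data)
print(w, b)
```

Each step touches only `batch_size` examples, which is the property that makes the per-step work hard to divide among many machines.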

Why can classical distributed approaches be ineffective? Computing the gradient over the whole of $\mathcal{D}$ is wasteful Defazio and Bottou (2018). Instead, mini-batch SGD computes it for a small subsample $i_t$ of $\mathcal{D}$. In a centralized system, we often use no more than a few hundred data items in $i_t$, and few would advocate using more than a few tens of thousands Goyal et al. (2017); Yadan et al. (2013); Smith et al. (2017).

For distributed computation, this is problematic for two reasons. First, it makes it difficult to speed up the computation by adding more computing hardware. Since the batch size is small, splitting the task across more than a few compute nodes is not beneficial, which motivates different training approaches for NNs Berahas et al. (2017); Bottou et al. (2018); Kylasa et al. (2018); Xu et al. (2017); Berahas et al. (2019); Martens and Grosse (2015).

Second, gathering the updates in a distributed setting introduces a non-negligible time overhead in large clusters, which is often the main bottleneck towards efficient large-scale computing; see Section 7 for alternative solutions. This imbalance between communication and computation capacity may lead to significant increases in training time when a larger cluster is used.

## 3 Training via Independent Subnetworks

### 3.1 IST: Overview

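Assume $n$ sites in a distributed system. For simplicity, we assume all layers of the NN utilize the same activation function $f(\cdot)$. Let $a^l$ denote the vector of activations at layer $l$. $a^L$ denotes the set of activations at the final or “top” layer of the network, and $a^0$ denotes the feature vector that is input into the network. Assume that the number of neurons at layer $l$ is $N_l$.

IST is a randomized, distributed training regime that utilizes a set of membership indicators:

$$\{ m^l_{s,i} \}.$$

Here, $s$ ranges over the sites in the distributed system, and $i$ ranges over the neurons in layer $l$. Each $m^l_{s,i} \in \{0, 1\}$ is randomly selected, with marginal probability $\Pr[m^l_{s,i} = 1] = 1/n$. Further, for each layer $l$ and activation $i$, we constrain $\sum_s m^l_{s,i}$ to be $1$, and the covariance of $m^l_{s,i}$ and $m^{l-1}_{s,j}$ must be zero, so that $E[m^l_{s,i}\, m^{l-1}_{s,j}] = 1/n^2$.

Then, we define the recurrence at the heart of IST:

$$\hat{a}^l = f\Big(n \sum_s m^l_s \circ \big(W^l (m^{l-1}_s \circ \hat{a}^{l-1})\big)\Big). \tag{2}$$

Here, $W^l$ is the weight matrix connecting layer $l-1$ in the network with layer $l$, $m^l_s$ is the vector with entries $m^l_{s,i}$, and $\circ$ denotes the Hadamard product of two vectors.

This recurrence is useful for two key reasons. First, it is easy to argue that if $\hat{a}^{l-1}$ is an unbiased estimator for $a^{l-1}$, then

$$n \sum_s m^l_s \circ \big(W^l (m^{l-1}_s \circ \hat{a}^{l-1})\big)$$

is an unbiased estimator for $W^l a^{l-1}$. To show this, we note that the $i$th entry in the vector is computed as $n \sum_s \sum_j m^l_{s,i}\, W^l_{i,j}\, m^{l-1}_{s,j}\, \hat{a}^{l-1}_j$, and hence (treating the indicators as independent of $\hat{a}^{l-1}$) its expectation is:

$$n \sum_s \sum_j E[m^l_{s,i}\, m^{l-1}_{s,j}]\, W^l_{i,j}\, E[\hat{a}^{l-1}_j] = n \cdot n \cdot \frac{1}{n^2} \sum_j W^l_{i,j}\, a^{l-1}_j = \sum_j W^l_{i,j}\, a^{l-1}_j,$$

which is precisely the $i$th entry in $W^l a^{l-1}$.

This unbiasedness suggests that this recurrence can be computed in place of the standard recurrence implemented by a NN, $a^l = f(W^l a^{l-1})$. A feature vector can be pushed through the resulting “approximate” NN, and the final vector $\hat{a}^L$ can be used as an approximation for $a^L$.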
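The unbiasedness argument can be checked numerically. The sketch below is a Monte Carlo experiment under stated assumptions: each neuron is assigned to one of `n_sites` sites independently and uniformly at random, the masked sum is rescaled by the number of sites, and the previous layer's activations are held fixed. The function names and the toy matrix are hypothetical.

```python
import random


def ist_preactivation(W, a, n_sites, rng):
    """One random draw of the rescaled, site-masked product of W and a."""
    # Assign each neuron in both layers to exactly one site, uniformly at
    # random, so every indicator has marginal probability 1 / n_sites.
    site_prev = [rng.randrange(n_sites) for _ in a]
    site_cur = [rng.randrange(n_sites) for _ in W]
    estimate = []
    for i, row in enumerate(W):
        # Only inputs owned by the same site as output neuron i survive.
        partial = sum(w * x for w, x, s in zip(row, a, site_prev)
                      if s == site_cur[i])
        estimate.append(n_sites * partial)
    return estimate


def monte_carlo_mean(W, a, n_sites, trials, seed=0):
    """Average many independent draws of the masked pre-activation."""
    rng = random.Random(seed)
    totals = [0.0] * len(W)
    for _ in range(trials):
        for i, value in enumerate(ist_preactivation(W, a, n_sites, rng)):
            totals[i] += value
    return [t / trials for t in totals]


W = [[1.0, -2.0, 0.5, 1.5], [0.5, 1.0, -1.0, 2.0], [-1.0, 0.25, 2.0, 1.0]]
a = [1.0, 2.0, -1.0, 0.5]
exact = [sum(w * x for w, x in zip(row, a)) for row in W]
approx = monte_carlo_mean(W, a, n_sites=4, trials=20000)
print(exact, approx)
```

The averaged estimate should approach the exact product as the number of trials grows, while any single draw can be far from it, which is the variance issue taken up in Section 5.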

### 3.2 Distributing Independent Subnets

The second reason the recurrence is useful is that it is much easier to distribute the computation of $\hat{a}^l$ (and its backpropagation) than it is to distribute the computation of $a^l$. When randomly generating the membership indicators, we require that $\sum_s m^l_{s,i}$ be $1$. Two important aspects of the computation of $\hat{a}^l$ follow directly from this requirement. First, in the summation of Equation 2, only one “site” can contribute to the $i$th entry in the vector $\hat{a}^l$; this is due to the Hadamard product with $m^l_s$, which implies that all other sites’ contributions will be zeroed out. Second, only the entries in $\hat{a}^{l-1}$ that were themselves associated with the same site value for $s$ can contribute to the $i$th entry, again due to the Hadamard product with $m^{l-1}_s$.

This implies that we can co-locate at site $s$ the computation of all entries in $\hat{a}^{l-1}$ where $m^{l-1}_{s,j}$ is $1$, and all entries in $\hat{a}^l$ where $m^l_{s,i}$ is $1$, and then no cross-site communication is required to compute the activations in layer $l$ from the activations in layer $l-1$. Further, since only the entries in the weight matrix $W^l$ for which $m^l_{s,i}\, m^{l-1}_{s,j} = 1$ are used at site $s$ (and in expectation, only $1/n^2$ of the weights in $W^l$ will be used), this implies that during an iteration of distributed backpropagation, each site need only receive (and communicate gradients for) a fraction of the weights in each weight matrix.
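To make the co-location argument concrete, the sketch below extracts the block of a weight matrix that a single site needs. It assumes, for illustration only, an equal-sized random partition obtained by shuffling neuron indices; `partition_layer` and `subnet_block` are hypothetical helper names.

```python
import random


def partition_layer(num_neurons, n_sites, rng):
    """Shuffle neuron indices and deal them out so each has one owner."""
    idx = list(range(num_neurons))
    rng.shuffle(idx)
    return [sorted(idx[s::n_sites]) for s in range(n_sites)]


def subnet_block(W, rows, cols):
    """Weights linking a site's neurons in one layer to its neurons in the next."""
    return [[W[i][j] for j in cols] for i in rows]


rng = random.Random(0)
n_sites, n_prev, n_cur = 4, 8, 12
# Toy weight matrix with distinct entries so blocks are easy to tell apart.
W = [[float(i * n_prev + j) for j in range(n_prev)] for i in range(n_cur)]
prev_parts = partition_layer(n_prev, n_sites, rng)
cur_parts = partition_layer(n_cur, n_sites, rng)
blocks = [subnet_block(W, cur_parts[s], prev_parts[s]) for s in range(n_sites)]
per_site = [sum(len(row) for row in block) for block in blocks]
print(per_site, n_prev * n_cur)
```

With 4 sites, each block holds 6 of the 96 weights, i.e. one sixteenth of the matrix, and the blocks of different sites never overlap.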

The distributed implementation of the recurrence across three sites for a NN with three hidden layers is depicted in Figure 1. The neurons in each layer are partitioned randomly across the sites, except for the input layer, which is fully utilized at all sites:

$$\hat{a}^1 = f\Big(\sum_s m^1_s \circ (W^1 a^0)\Big),$$

and the output layer, which is computed using all of the activations at layer $L-1$:

$$\hat{a}^L = f\Big(\sum_s W^L (m^{L-1}_s \circ \hat{a}^{L-1})\Big).$$

## 4 Distributed IST In-Depth

### 4.1 Distributed Training Algorithm

All of this suggests an algorithm for distributed learning, given as Algorithm 1 and Algorithm 2. Algorithm 1 repeatedly samples a set of membership indicators, and then partitions the model weights across the set of compute nodes, as dictated by the indicators. Since the weights are fully partitioned, the independent subnets at each node can be trained separately on local data for a number of iterations (Algorithm 2), before the indicators are re-sampled, and the weights are re-shuffled across the nodes.
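Algorithms 1 and 2 are not reproduced in this text, so the following single-process simulation follows the prose description only: resample a partition, run local SGD on each disjoint subnet, and repeat. It is a rough sketch under several assumptions: one hidden layer with tanh units, a scalar output, squared loss, subnet outputs rescaled by the number of sites, and sites simulated sequentially. All function names and hyperparameters are hypothetical.

```python
import math
import random


def partition(num_neurons, n_sites, rng):
    """Randomly split hidden-neuron indices into disjoint groups, one per site."""
    idx = list(range(num_neurons))
    rng.shuffle(idx)
    return [idx[s::n_sites] for s in range(n_sites)]


def subnet_forward(W1, W2, owned, x, scale):
    """Forward pass that uses only the hidden neurons listed in `owned`."""
    h = {k: math.tanh(sum(w * xj for w, xj in zip(W1[k], x))) for k in owned}
    out = scale * sum(W2[k] * h[k] for k in owned)
    return h, out


def local_sgd(W1, W2, owned, batch, lr, scale):
    """Stand-in for the local training loop: per-sample SGD on one subnet."""
    for x, y in batch:
        h, out = subnet_forward(W1, W2, owned, x, scale)
        err = out - y  # derivative of 0.5 * (out - y)^2 with respect to out
        for k in owned:
            g_pre = err * scale * W2[k] * (1.0 - h[k] ** 2)
            W2[k] -= lr * err * scale * h[k]
            for j, xj in enumerate(x):
                W1[k][j] -= lr * g_pre * xj


def ist_train(W1, W2, data, n_sites, rounds, local_steps, lr, seed=0):
    """Stand-in for the outer loop: resample the partition, train, repeat."""
    rng = random.Random(seed)
    for _ in range(rounds):
        groups = partition(len(W1), n_sites, rng)
        for owned in groups:  # a real system would run the sites in parallel
            batch = [rng.choice(data) for _ in range(local_steps)]
            # Sites own disjoint neurons, so in-place updates never collide.
            local_sgd(W1, W2, owned, batch, lr, scale=n_sites)


def full_mse(W1, W2, data):
    """Mean squared error of the full, unpartitioned network."""
    units = range(len(W1))
    errs = [(subnet_forward(W1, W2, units, x, 1.0)[1] - y) ** 2 for x, y in data]
    return sum(errs) / len(errs)


data_rng = random.Random(1)
data = []
for _ in range(200):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    data.append((x, x[0] - x[1]))  # toy regression target
W1 = [[data_rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(8)]
W2 = [data_rng.uniform(-0.5, 0.5) for _ in range(8)]
initial = full_mse(W1, W2, data)
ist_train(W1, W2, data, n_sites=2, rounds=400, local_steps=5, lr=0.02)
final = full_mse(W1, W2, data)
print(initial, final)
```

Because each site updates only the rows of `W1` and entries of `W2` that it owns, "returning" the weights needs no averaging; the shared lists simply end up holding every site's updates.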

Periodic resampling of the indicators (followed by reshuffling) is necessary due to the possible accumulation of random effects. While the recurrence of Equation (2) yields an unbiased estimate for the input to a neuron, after backpropagation, the expected input to a neuron will change. Since each subnet is being trained using samples from the same data distribution, this shift may be inconsistent across sites. Resampling guards against this.

### 4.2 Why Is This Fast?

Answer: Due to the subsampling forced by the membership indicators. This reduces both network traffic and compute workload. In addition, IST allows for periods of local updates with no communication, again reducing network traffic.

Local SGD factor. For a feed-forward NN, at each round of “classical” data parallel training, the entire set of parameters must be broadcast to each site. Measuring the inflow to each site, the total network traffic per gradient step is (in floating point numbers transferred):

In contrast, during IST, each site receives the current parameters only once every $J$ gradient steps, where $J$ denotes the number of local SGD steps between synchronizations.

Weight matrices subsampling. Subsampling reduces this cost even further; the matrices attached to the input and output layers are partitioned across nodes (not broadcast), and only a fraction of the weights in each of the other matrices are sent to any node. The total network traffic per gradient step is:

Fewer computations per site. Computational resource utilization is reduced similarly. Considering the FLOPs required by matrix multiplications during forward and backward steps, during “classical” data parallel training, the number of FLOPs required per gradient step is:

In contrast, the number of FLOPs per IST gradient step is:

Overall. In Figure 2 we plot the average cost of each gradient step as a function of the number of machines in a cluster, assuming a feed-forward NN with three hidden layers of 4,000 neurons, an input feature vector of 1,000 features, a batch size of 512 data objects, and 200 output labels, and assuming that the number of subnet local SGD steps, $J$, is 10. There is a radical decrease in both network traffic and FLOPs using IST. In particular, using IST, both of these quantities decrease with the addition of more machines in the cluster.
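The displayed traffic formulas are not reproduced in this text, so the sketch below encodes one reading of the prose as an explicit assumption: data parallel training delivers every weight to every site at every step, whereas IST delivers the input and output matrices once in total (partitioned across sites), delivers a $1/n^2$ share of each remaining matrix to each of the $n$ sites, and does so only once per block of local steps. Function names are hypothetical, and FLOPs are deliberately left out because their constant factors cannot be recovered from the text.

```python
def layer_products(sizes):
    """Number of weights in each matrix connecting consecutive layers."""
    return [a * b for a, b in zip(sizes[:-1], sizes[1:])]


def data_parallel_traffic(sizes, n_sites):
    """Floats received per gradient step when every site gets every weight."""
    return n_sites * sum(layer_products(sizes))


def ist_traffic(sizes, n_sites, local_steps):
    """Floats received per gradient step under the assumed IST model."""
    products = layer_products(sizes)
    first, last, hidden = products[0], products[-1], products[1:-1]
    # First and last matrices are partitioned: all sites together get one copy.
    # Each site gets 1/n^2 of a hidden matrix, so n sites receive 1/n of it.
    per_sync = first + last + sum(hidden) / n_sites
    # Parameters are exchanged only once per block of local gradient steps.
    return per_sync / local_steps


# Configuration described for Figure 2: 1,000 inputs, three hidden layers of
# 4,000 neurons, 200 outputs, and 10 local steps between synchronizations.
sizes = [1000, 4000, 4000, 4000, 200]
for n in (2, 4, 8):
    print(n, data_parallel_traffic(sizes, n), ist_traffic(sizes, n, local_steps=10))
```

Under this model the data parallel cost grows linearly with the number of sites, while the IST cost shrinks toward the fixed cost of the input and output matrices.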

Note that this plot does not tell the whole story, as IST may have lower (or higher) statistical efficiency. The fact that IST partitions the network and runs local updates may decrease efficiency, whereas the fact that each “batch” processed during IST actually consists of independent samples of size (compared to a single global sample in classical data parallel training) may tend to increase efficiency. This will be examined experimentally.

### 4.3 IST for Non-Fully Connected Architectures

As described, IST currently applies to fully-connected layers.
Extending the method to other common neural constructs, such as convolutional layers, is beyond the scope of this work.
However, the idea as described here can still be applied to the fully-connected layer(s) that are part of nearly every modern architecture.^{2}

### 4.4 Distributed Parameter Server

To support the IST algorithm, a carefully designed distributed system is required. Algorithm 1 implies that there is a coordinator, but in practice there can be no actual coordinator—a coordinator will inevitably become a bottleneck during learning. In our implementation of IST, we shard each weight matrix across all worker nodes. To run each invocation of subnet local SGD, each worker obtains a portion of each weight matrix from each of the other workers, runs subnet local SGD, and then returns the updated portions to their owners.

This requires an algorithm for distributed generation of membership indicators. Imagine a site $s$ is assigned a set of neurons $S^{l-1}$ at layer $l-1$ and $S^{l}$ at layer $l$. Site $s$ will need all weights connecting any pairs of neurons in $S^{l-1}$ and $S^{l}$. Site $t$ and site $t'$ may both have relevant weights, but for $t$ to send those weights to $s$, both will need to agree on $S^{l-1}$ and $S^{l}$, ideally without incurring the cost of communicating indicators (which may be as high as sending the weights).

We use the simple idea of using a common pseudo-random number generator for all sites. A seed is broadcast, and that seed is used to produce identical pseudo-random sequences (and hence identical assignments) at all sites. Then, when site $t$ sends weights to site $s$, the latter need not specify which weights to send, nor receive any meta-data.
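The shared-seed idea can be sketched as follows. The sketch assumes that the seed and a round counter are the only values broadcast, and that each weight matrix is sharded by row index modulo the number of sites; that sharding rule and all names here are hypothetical, since the actual layout is not specified in the text.

```python
import random


def sample_assignment(seed, round_id, layer_sizes, n_sites):
    """Owning site of every neuron, derived only from shared integers."""
    rng = random.Random(seed * 1_000_003 + round_id)
    return [[rng.randrange(n_sites) for _ in range(size)] for size in layer_sizes]


def weights_to_send(assignment, layer, owner, dest, n_sites):
    """Entries (i, j) of the layer's matrix that `owner` stores and `dest` needs."""
    rows, cols = assignment[layer], assignment[layer - 1]
    return [(i, j)
            for i, site_i in enumerate(rows)
            if site_i == dest and i % n_sites == owner  # assumed row sharding
            for j, site_j in enumerate(cols)
            if site_j == dest]


# Sender and receiver each rebuild the assignment locally from the same seed.
sender_view = sample_assignment(seed=42, round_id=3, layer_sizes=[6, 8, 4], n_sites=2)
receiver_view = sample_assignment(seed=42, round_id=3, layer_sizes=[6, 8, 4], n_sites=2)
outgoing = weights_to_send(sender_view, layer=1, owner=0, dest=1, n_sites=2)
expected = weights_to_send(receiver_view, layer=1, owner=0, dest=1, n_sites=2)
print(outgoing == expected)
```

Since both sides compute the same list in the same order, the sender can ship raw weight values with no accompanying index metadata.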

## 5 Correcting Distributional Shift

There is, however, a significant problem with the above formulation. Specifically, when justifying the use of the recurrence of Equation (2), we argued that since the argument of $f(\cdot)$ in Equation (2) is an unbiased estimator for $W^l a^{l-1}$, it holds that $\hat{a}^l$ is a reasonable estimator for $a^l$. In doing so, we are guilty of applying a form of the classical statistical fallacy that for a random variable $X$, if $E[X] = b$, then $E[f(X)] = f(b)$.

This fallacy is dangerous when the function is non-linear, which is the case with the standard activation functions used in modern NNs. Because the membership indicators force subsampling the inputs to each neuron (and a scale factor of is then applied to the resulting quantity to unbias it), we end up increasing the standard deviation of the input to each neuron by a factor of during training, compared to the standard deviation that will be observed when applying the network to perform actual predictions without the use of membership indicators. This increased variance means that we are more likely to observe extreme inputs to each neuron during training than during actual deployment. The network learns to expect such extreme values and avoid saturation during deployment, and adapts accordingly. However, the learned network fails when it is deployed.

To force the training and deployment distributions to match, we could apply an analytic approach. Instead, we simply remove the correction and, during training, for a given neuron, compute the mean $\mu$ and standard deviation $\sigma$ of the inputs to the neuron, and use a modified activation function $f'(x) = f((x - \mu)/\sigma)$. Before deployment of the full network, we can compute $\mu$ and $\sigma$ for each neuron over a small subset of the training data using the full network, and use those values during deployment.

Note that this is equivalent to batch normalization Ioffe and Szegedy (2015)—we can learn a scale and shift as well, if desired—though our motivation for its use is somewhat different. Classically, the motivation for using batch normalization has been to keep the input in the non-saturated range of the activation function during training. This tends to speed convergence and increase generalization capabilities. In contrast, IST will simply not work without some sort of normalization, due to the distributional shift that will be encountered when deploying the whole network.
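The effect of standardizing a neuron's inputs before the activation can be sketched in a few lines. The example assumes tanh as the activation and models the difference between subnet and full-network inputs as a pure rescaling, which is a simplification; `normalized_activation` and the sample values are hypothetical.

```python
import math
import statistics


def normalized_activation(pre_activations, f=math.tanh, eps=1e-8):
    """Apply f((x - mu) / sigma), with mu and sigma taken from the inputs."""
    mu = statistics.fmean(pre_activations)
    sigma = statistics.pstdev(pre_activations)
    return [f((x - mu) / (sigma + eps)) for x in pre_activations]


# Inputs to one neuron over a small batch, as seen by the full network.
full_inputs = [0.4, -1.2, 2.5, 0.1, -0.7, 1.9]
# A subnet sums over fewer inputs, so its pre-activations sit on another scale.
subnet_inputs = [x / 4.0 for x in full_inputs]

full_out = normalized_activation(full_inputs)
subnet_out = normalized_activation(subnet_inputs)
print(max(abs(p - q) for p, q in zip(full_out, subnet_out)))
```

Because the statistics are recomputed for whichever network is being run, the activation sees inputs on the same scale during subnet training and full-network deployment.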

## 6 Empirical Evaluation

Learning tasks and environment. (1) Google Speech Commands Warden (2018): We learn a 2-layer network of neurons and a 3-layer network of neurons to recognize 35 labeled keywords from audio waveforms (compared to the 12 keywords in prior work Warden (2018)). We represent each waveform as a -dimensional feature vector Stevens et al. (1937). (2) VGG11 on CIFAR100 Simonyan and Zisserman (2014): We train the VGG11 model (with batch normalization) over the CIFAR100 image classification dataset (see Section 4.3 for a discussion of IST and non-fully connected architectures). (3) Amazon-670k Bhatia et al. (): We train a 2-layer, fully-connected neural network, which accepts a -dimensional input feature, and generates a prediction over output labels.

On all three data sets, we use a fixed batch size of objects per machine. Google speech uses a fixed learning rate of . The VGG11 on CIFAR100 is trained with an initial learning rate and decayed once by the factor of after epochs.

We train the Google speech networks on three AWS CPU clusters, with 2, 4, and 8 CPU instances (m5.2xlarge). We train the VGG11 on CIFAR 100 and Amazon-670k extreme classification network on three AWS GPU clusters, with 2, 4, and 8 GPU machines (p3.2xlarge). Our choice of AWS was deliberate, as it is a very common learning platform, and illustrates the challenge faced by many consumers of machine learning: distributed learning without a super-fast interconnect.

Distributed learning frameworks. We implement IST in PyTorch. We compare IST to the PyTorch implementation of data parallel learning. We also adapt the PyTorch data parallel learning implementation to realize local SGD Lin et al. (2018) where learning occurs locally for a number of iterations before synchronizing.

For the CPU experiments, we use PyTorch’s gloo backend. For the GPU experiments, data parallel learning and local SGD use PyTorch’s nccl backend, which leverages the most advanced Nvidia collective communication library. nccl is a set of high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs. nccl implements ring-based all-reduce Xu (2018), which is used in well-known distributed learning systems such as Horovod Sergeev and Del Balso (2018).

Unfortunately, IST cannot use the nccl backend because the latter does not support the scatter operator required to implement IST. This is likely because the deep learning community has focused on data parallel learning. Our work also constitutes a suggestion to the systems + ML community to look into variants of the standard data and model parallel paradigms, in order to achieve the best performance.

As a result, IST must use the gloo backend (meant for CPU-based learning). This is a serious handicap for IST, though we emphasize that it is not the result of any intrinsic flaw of the method; it is merely a lack of support for required operations in the high-performance GPU library. To give the reader an idea of the magnitude of this handicap, data parallel CIFAR100 VGG11 learning realizes a 3.1× speedup when switching from the gloo backend to the nccl backend.

### 6.1 Experimental results

Scalability. We first investigate the relative scaling of IST compared to the alternatives, with an increasing number of EC2 workers. For various configurations, we time how long each of the distributed learning frameworks takes to complete one training epoch. Figure 3 summarizes our findings on the scaling comparison of data parallel, local SGD and IST with various local update iterations. The speedup is calculated relative to the time that 1-worker SGD takes to train for one epoch. It is clear that, across various hyper-parameter choices, IST provides significant speedups compared to a 1-worker SGD strategy; speedups that range from to .

Convergence speed. While IST can process data quickly, there are questions regarding its statistical efficiency vis-a-vis the other methods, and how this affects convergence. Figure 4 plots the hold-out test accuracy for selected benchmarks as a function of time. Table 1 shows the training time required for the various methods to reach specified levels of hold-out test accuracy. It is clear from the results that IST gets to the same targeted accuracy much faster than the compared methods, and IST achieves even better final performance in most cases. The latter is explained by the fact that IST implicitly constitutes a distributed version of dropout, regularizing towards a more generalizable solution. See also below.

Table 1: Training time required to reach each level of hold-out test accuracy. DP = Data Parallel, LSGD = Local SGD, SubNet = IST; column suffixes give the number of nodes.

| Model | Accuracy | DP, 2 | DP, 4 | DP, 8 | LSGD, 2 | LSGD, 4 | LSGD, 8 | SubNet, 2 | SubNet, 4 | SubNet, 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2-Layer Google Speech | 0.63 | 118 | 269 | 450 | 68 | 130 | 235 | 35 | 28 | 24 |
| 2-Layer Google Speech | 0.69 | 294 | 584 | 923 | 171 | 278 | 441 | 87 | 55 | 59 |
| 2-Layer Google Speech | 0.75 | 759 | 1708 | 2417 | 444 | 742 | 1110 | 231 | 167 | 192 |
| 3-Layer Google Speech | 0.63 | 376 | 1228 | 1922 | 182 | 586 | 1115 | 76 | 141 | 300 |
| 3-Layer Google Speech | 0.69 | 1503 | 2939 | 4823 | 740 | 1270 | 2589 | 298 | 283 | 541 |
| 3-Layer Google Speech | 0.75 | 4534 | 9340 | 14886 | 2032 | 4107 | 6539 | 812 | 664 | 1161 |
| CIFAR100 | 0.36 | 108 | 275 | 730 | 23 | 67 | 133 | 17 | 39 | 212 |
| CIFAR100 | 0.42 | 325 | 736 | 1357 | 58 | 121 | 236 | 43 | 54 | 314 |
| CIFAR100 | 0.48 | 542 | 1472 | 3342 | 104 | 215 | 473 | 68 | 85 | 466 |

Table 2: Hold-out precision @1, @3, and @5 by embedding dimension.

| Embedding Dim. | prec @1 | prec @3 | prec @5 |
|---|---|---|---|
| 512 (Data parallel) | 0.3861 | 0.3454 | 0.3164 |
| 512 (IST) | 0.3962 | 0.3604 | 0.3313 |
| 1024 (IST) | 0.4089 | 0.3685 | 0.3392 |

Table 3: Final accuracy of each method, trained on a 2-node cluster.

| Model | Data Parallel | Local SGD | SubNet |
|---|---|---|---|
| Speech 2 layer | 0.7938 | 0.7998 | 0.8153 |
| Speech 3 layer | 0.7950 | 0.7992 | 0.8327 |
| CIFAR100 vgg | 0.5787 | 0.5878 | 0.6228 |

Trained model accuracy. Because IST is inherently a model-parallel training method, it has certain advantages, including the ability to scale to large models. Using IST, we train a model with a 1024-neuron embedding using an 8-instance IST GPU cluster, and a 512-neuron embedding using a 4-instance IST GPU cluster and a 4-instance data parallel GPU cluster, and evaluate the hold-out test performance. The precision @1, @3, and @5 are reported in Table 2. In Table 3 we give the final accuracy of each method, trained on a 2-node cluster.

### 6.2 Discussion

There are significant advantages to IST in terms of being able to process data quickly. Figure 3 shows that IST is able to process far more data in a short amount of time than the other distributed training frameworks. Interestingly, we find that the IST speedups in CPU clusters are more significant than those in GPU clusters. There are two reasons for this. First, for GPU clusters, IST suffers from its use of PyTorch’s gloo backend, compared to the all-reduce operator provided by nccl. Second, it appears that since the GPU provides a very high level of computation, there is less benefit to be realized from the reduction in FLOPs per gradient step using IST.

Figure 4 and Table 1 generally show that IST is much faster for achieving high levels of accuracy on a hold-out test set. For example, IST exhibits a speedup compared to local SGD, and speedup compared to classical data parallel for the 2-layer Google speech model to reach . IST exhibits speedup compared to local SGD, and a speedup comparing to data parallel for the 3-layer model to reach the accuracy of . In every case, some variant of IST was the fastest to reach each particular level of hold-out accuracy. We note that this was observed even though IST was handicapped by its use of gloo for its GPU implementation.
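Relative speedups can be recomputed directly from Table 1. The sketch below copies the 0.75-accuracy rows for the two Google Speech models and compares each method at its own best cluster size; that comparison rule is an assumption, so the resulting ratios need not coincide with the speedup figures referred to in the text.

```python
# Times copied from Table 1 for the 0.75 accuracy target, keyed by node count.
TABLE_1 = {
    "2-layer speech": {
        "data_parallel": {2: 759, 4: 1708, 8: 2417},
        "local_sgd": {2: 444, 4: 742, 8: 1110},
        "ist": {2: 231, 4: 167, 8: 192},
    },
    "3-layer speech": {
        "data_parallel": {2: 4534, 4: 9340, 8: 14886},
        "local_sgd": {2: 2032, 4: 4107, 8: 6539},
        "ist": {2: 812, 4: 664, 8: 1161},
    },
}


def best_time(times_by_nodes):
    """Fastest time a method achieves over the cluster sizes tested."""
    return min(times_by_nodes.values())


def speedup_over(model, baseline):
    """Ratio of the baseline's best time to IST's best time for one model."""
    rows = TABLE_1[model]
    return best_time(rows[baseline]) / best_time(rows["ist"])


for model in TABLE_1:
    for baseline in ("local_sgd", "data_parallel"):
        print(model, baseline, round(speedup_over(model, baseline), 2))
```

Note that the baselines are fastest on 2 nodes in both rows, whereas IST is fastest on 4 nodes, so comparing at a fixed cluster size would give different ratios.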

Another key advantage of IST is illustrated by Table 2: because it is a model-parallel framework and distributes the model to multiple machines rather than broadcasting it, IST is able to scale to virtually unlimited model sizes. Hence it can compute a 1024-dimensional embedding (and realize the associated, additional accuracy), whereas the other frameworks are unable to do this.

Perhaps the most interesting result of all is the fact that most of the frameworks actually do worse, in terms of time-to-high-accuracy, with additional machines. This illustrates a significant problem with distributed learning. Unless a super-fast interconnect is used (and such interconnects are not available from typical cloud providers), it can actually be detrimental to add machines, as the added cost of transferring data can result in slower running times. We see this clearly in Table 1, where the state-of-the-art PyTorch data parallel implementation (and the local SGD variant) does significantly worse with multiple machines. In fact, IST is the only one of the three frameworks to show the ability to utilize additional machines without actually becoming slower to reach high accuracy. That said, even IST struggled to scale beyond two machines in the case of CIFAR100 (handicapped by the fact that the current realization of IST does not decompose the convolutional layers into subnets). Still, IST showed the best potential to scale.

Finally, various compression techniques could be used to increase the effective bandwidth of the interconnect (including gradient sparsification Aji and Heafield (2017), quantization Alistarh et al. (2017), sketching Ivkin et al. (2019), and low-rank compression Vogels et al. (2019)). However, these methods could be used along with any framework. While compression may allow effective scaling to larger compute clusters than observed here, it would not affect the relative efficacy of IST.

## 7 Related work

Data parallelism often suffers from the high bandwidth costs to communicate gradient updates between workers. Quantized SGD Alistarh et al. (2017); Courbariaux et al. (2015); Seide et al. (2014); Dettmers (2015); Gupta et al. (2015); Hubara et al. (2017); Wen et al. (2017) and sparsified SGD Aji and Heafield (2017) both address this. Quantized SGD uses lossy compression to quantize the gradients. Sparsified SGD reduces the exchange overhead by transmitting the gradients with maximal magnitude. Such methods are orthogonal to IST, and could be used in combination with it.

Recently, there has been a series of papers on using parallelism to “Solve the learning problem in minutes”, for ever-decreasing values of Goyal et al. (2017); Yadan et al. (2013); You et al. (2017); Smith et al. (2017); Codreanu et al. (2017); You et al. (2019b, a). Often these methods employ large batches. It is generally accepted—though still debated Dinh et al. (2017)—that large batch training converges to “sharp minima”, hurting generalization Keskar et al. (2016); Yao et al. (2018); Defazio and Bottou (2018). Further, achieving such results seems to require teams of PhDs utilizing special-purpose hardware: there is no approach that generalizes well without extensive trial-and-error.

Distributed local SGD Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang and Ré (2014); Zhang et al. (2016) updates the parameters, through averaging, only after several local steps are performed per compute node. This reduces synchronization and thus allows for higher hardware efficiency Zhang et al. (2016). IST uses a similar approach but makes the local SGD and each synchronization round less expensive. Recent approaches Lin et al. (2018) propose less frequent synchronization towards the end of the training, but they cannot avoid it at the beginning.

Finally, asynchrony avoids SGD synchronization cost Recht et al. (2011); Dean et al. (2012); Paine et al. (2013); Zhang et al. (2013). It has been used in distributed-memory systems, such as DistBelief Dean et al. (2012) and Project Adam Chilimbi et al. (2014). While such systems, asymptotically, show nice convergence rate guarantees, there seems to be growing agreement that unconstrained asynchrony does not always work well Chen et al. (2016), and it seems to be losing favor in practice.

## 8 Conclusion

In this work, we propose independent subnet training for distributed optimization of fully connected neural networks. By stochastically partitioning the model into non-overlapping subnetworks, IST reduces the communication overhead for model synchronization, and the computation workload of forward-backward propagation for a thinner model on each worker. Because IST inherits the regularization effect of dropout, the same neural network architecture generalizes better when optimized with IST than with the classic data parallel approach.

### Footnotes

1. 89% of cloud-based deep learning projects are executed on EC2, according to Amazon’s marketing materials.
2. During training, the rest of the network is broadcast to every site, whereas the fully connected layers are decomposed into subnets.

### References

- Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1.
- Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021. Cited by: §6.2, §7.
- QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720. Cited by: §6.2, §7.
- An investigation of Newton-sketch and subsampled Newton methods. arXiv preprint arXiv:1705.06211. Cited by: §2.
- Quasi-Newton methods for deep learning: forget the past, just sample. arXiv preprint arXiv:1901.09997. Cited by: §2.
- The extreme classification repository: multi-label datasets and code. Note: http://manikvarma.org/downloads/XC/XMLRepository.html. Cited by: §6.
- Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. Cited by: §2.
- Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981. Cited by: §7.
- Project Adam: building an efficient and scalable deep learning training system. In OSDI, Vol. 14, pp. 571–582. Cited by: §1.
- Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291. Cited by: §1, §7.
- BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131. Cited by: §7.
- Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231. Cited by: §1, §7.
- On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529. Cited by: §2, §7.
- 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561. Cited by: §7.
- Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1019–1028. Cited by: §7.
- Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM Journal on Computing 36 (1), pp. 132–157. Cited by: §1.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §2.
- Parallel neural network training on Multi-Spert. In 3rd International Conference on Algorithms and Architectures for Parallel Processing (ICAPP 97), pp. 659–666. Cited by: §1.
- On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941. Cited by: §1.
- Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1, §2, §7.
- Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §7.
- Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487. Cited by: §1.
- Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18 (187), pp. 1–30. Cited by: §7.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §5.
- Communication-efficient distributed SGD with sketching. In Advances in Neural Information Processing Systems, pp. 13144–13154. Cited by: §6.2.
- On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §7.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2, §7.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §4.3.
- GPU accelerated sub-sampled Newton’s method. arXiv preprint arXiv:1802.09113. Cited by: §2.
- Scaling distributed machine learning with the parameter server. In OSDI, Vol. 14, pp. 583–598. Cited by: §1.
- Don’t use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217. Cited by: §1, §6, §7.
- Inefficiency of K-FAC for large batch size training. arXiv preprint arXiv:1903.06237. Cited by: §1.
- Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417. Cited by: §2.
- Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pp. 1231–1239. Cited by: §7.
- GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186. Cited by: §7.
- Automatic differentiation in PyTorch. Cited by: §1.
- Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 873–880. Cited by: §1.
- SysML: the new frontier of machine learning systems. arXiv preprint arXiv:1904.03257. Cited by: §1.
- Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701. Cited by: §7.
- An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §2.
- 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association. Cited by: §7.
- Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. Cited by: §6.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §6.
- Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489. Cited by: §1, §2, §7.
- Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.
- A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8 (3), pp. 185–190. Cited by: §6.
- PowerSGD: practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, pp. 14236–14245. Cited by: §6.2.
- Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §6.
- TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1509–1519. Cited by: §7.
- Numerical optimization. Springer Science 35 (67-68), pp. 7. Cited by: §2.
- NCCL based multi-GPU training. Note: http://on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8209.pdf. Accessed: 2020-02-06. Cited by: §1, §6.
- Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164. Cited by: §2.
- Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853. Cited by: §1, §2, §7.
- Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pp. 4949–4959. Cited by: §7.
- Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888. Cited by: §1, §7.
- Large-batch training for LSTM and beyond. arXiv preprint arXiv:1901.08256. Cited by: §1, §7.
- Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962. Cited by: §1, §7.
- ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §2.
- DimmWitted: a study of main-memory statistical analytics. Proceedings of the VLDB Endowment 7 (12), pp. 1283–1294. Cited by: §7.
- Parallel SGD: when does averaging help? arXiv preprint arXiv:1606.07365. Cited by: §7.
- Asynchronous stochastic gradient descent for DNN training. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6660–6663. Cited by: §7.
- An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in Neural Information Processing Systems, pp. 801–809. Cited by: §1.
- Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603. Cited by: §7.