Distributed Learning of Deep Neural Networks using Independent Subnet Training

Binhang Yuan
Rice University
Houston, TX 77005
by8@rice.edu
Anastasios Kyrillidis
Rice University
Houston, TX 77005
anastasios@rice.edu
Christopher M. Jermaine
Rice University
Houston, TX 77005
cmj4@rice.edu
Abstract

Stochastic gradient descent (SGD) is the method of choice for distributed machine learning, by virtue of its light complexity per iteration on compute nodes, leading to almost linear speedups in theory. Nevertheless, such speedups are rarely observed in practice, due to high communication overheads during synchronization steps.

We alleviate this problem by introducing independent subnet training: a simple approach to distributed training for fully connected, feed-forward neural networks that is jointly model-parallel and data-parallel. During subnet training, neurons are stochastically partitioned without replacement, and each partition is sent to a single worker. This reduces the overall synchronization overhead, as each worker receives only the weights associated with the subnetwork it has been assigned. Subnet training also reduces synchronization frequency: since workers train disjoint portions of the network, training can proceed for long periods of time before synchronization, similar to local SGD approaches. We empirically evaluate our approach on real-world speech recognition and product recommendation applications, where we observe that subnet training results in accelerated training times compared to state-of-the-art distributed models, and often boosts testing accuracy, as it implicitly combines dropout and batch normalization regularization during training.

1 Introduction

Deep neural networks (DNNs) have led to the recent success of machine learning in real-life applications Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Girshick (2015); Long et al. (2015); Goodfellow et al. (2016). Despite the progress in modern hardware Jouppi et al. (2017), training DNNs can take an impractically long time on a single machine, and accelerating DNN training over a compute cluster is not easy. Indeed, it has become a fundamental challenge in modern computing systems Ratner et al. (2019).

Mitigating hardware inefficiency. Much research focuses on parallelized/distributed DNN training Dean et al. (2012); Chilimbi et al. (2014); Li et al. (2014); Hadjis et al. (2016). There, methods to accelerate training may roughly be categorized as model-parallel and data-parallel. In the former Hadjis et al. (2016); Dean et al. (2012), different compute nodes are responsible for different parts of a single neural network; in the latter Zhang et al. (1990); Farber and Asanovic (1997); Raina et al. (2009), each compute node updates a complete copy of the neural network's parameters on different portions of the data. In both cases, the obvious way to speed up learning is to use more compute hardware for the backpropagation/gradient step: the neural network is split across more CPUs/GPUs in the model-parallel setting, and fewer gradient descent operations are required per compute node in the data-parallel setting.

Due to its ease of implementation, distributed, data-parallel training is most commonly used, and it is the method best supported by common deep learning software such as TensorFlow Abadi et al. (2016) and PyTorch Paszke et al. (2017). It is generally understood that, for the best data-parallel training accuracy on models where the gradients are not sparse, such as in a DNN, local results need to be synchronized globally once per iteration among compute nodes, in order to complete a step of backpropagation. (Asynchronous training is possible for models that tend to have sparse updates; for a discussion on asynchrony, see Section 4.) However, there are limitations preventing this approach from easily scaling out. In particular, the size of the mini-batches is a crucial hyperparameter for the neural network's convergence. Adding hardware while keeping the batch size constant means that each node performs backpropagation on its own local portion of the batch faster, but it leaves the synchronization step no faster; in fact, synchronization can become slower as the number of workers increases. If synchronization time dominates, adding more machines can actually make training slower, in terms of wall-clock time.

In contrast, adding more machines while making the batch size larger, so that synchronization costs do not begin to dominate, is often cumbersome to tune properly: very large batch sizes often do not speed up convergence compared to smaller ones, and they can hurt generalization Goyal et al. (2017); Yadan et al. (2013); You et al. (2017); Smith et al. (2017); Codreanu et al. (2017); You et al. (2019b, a). Avoiding this requires excessive hyper-parameter tuning of step and batch sizes, which often becomes the most time-consuming part for practitioners. There is also an ongoing debate about the efficiency of such methods in practice Ma et al. (2019); Golmant et al. (2018).

Mitigating statistical inefficiency. Overfitting is an important problem that is often observed when training DNNs; it is particularly problematic when using large batch sizes to maintain hardware efficiency during training. Techniques such as dropout Srivastava et al. (2014), dropconnect Wan et al. (2013) and batch normalization Ioffe and Szegedy (2015) provide regularization during training to prevent overfitting; see Labach et al. (2019) for an overview of dropout methods. In dropout-based regularization, the key idea is to randomly drop units in the neural network during training, which seems to prevent them from co-adapting. This leads to models where neurons tend to behave independently, so they are more robust to discrepancies between training and testing data. Batch normalization Ioffe and Szegedy (2015), on the other hand, appears to achieve regularization by limiting the way the distribution of activations between consecutive layers changes (known as internal covariate shift).

Our proposal: Independent subnet training. Interestingly, batch normalization and dropout, when combined, can easily lead to worse generalization performance than either applied individually Li et al. (2018). The central idea in this paper, called independent subnet training, seems to contradict this finding. It calls for an extreme form of dropout to facilitate combined model- and data-parallel distributed training. The subnet training algorithm decomposes the neural network into a set of independent subnets for the same task, by applying dropout without replacement to decompose the overall network. Each of those subnets is trained at a different compute site using batch normalization for one or more local SGD steps, before a synchronization step. Because of the extreme dropout (each site in an $m$-machine cluster gets only a small, $1/m$ fraction of the neurons at each layer in the network), batch normalization is required to handle the shift in distribution observed between training and testing, when all of the subnets are re-composed.

There are several advantages of this method with respect to hardware efficiency. Because the subnets share no parameters, synchronization requires no aggregation, in contrast to the data-parallel model: it is just an exchange of parameters. Moreover, each subnet is a fully-operational classifier by itself, and can be updated locally for a very large number of iterations before synchronizing. This radically reduces communication cost. Communication costs are also reduced because, in an $m$-machine cluster, each machine gets at most a $1/m$ fraction of the weights at each layer in the network; contrast this to data parallel training, where each machine must receive all of the weights.

Independent subnets have advantages over classical, model-parallel methods as well. During local updates, no synchronization pipelines between subnetworks are required, in contrast to the model-parallel setting. This reduces communication costs. Moreover, independent subnets have many of the advantages of model-parallel methods. For example, each machine gets just a small fraction of the overall model, which reduces the local memory requirement. This can be a significant advantage when training very large models using GPUs, which tend to have limited memory. Another advantage is simplicity—independent subnets achieve model-parallelism “for free”, without complicated distributed communication schemes for communicating partial results.

Key attributes of independent subnet training. The key attributes of our proposed algorithm can be summarized as follows:


  • Partitioning the original network into subnets implicitly prevents overfitting, due to its relation to dropout: our experimental findings support this statement, showing increased generalization accuracy compared to the simple data-parallel approach.

  • By sampling without replacement, the model can be split into non-overlapping networks, reducing the interdependence between compute nodes, so that aggressive local updates can be applied, in contrast to model-parallel approaches.

  • When sharing the model among the cluster, only the partitioned subnet needs to be transmitted to each node instead of the whole model.

  • Since each worker trains a “thin” model locally, it can perform more local iterations compared to classical, data-parallel training, and it only communicates this thin model back over the network.

  • Then, during the model update at the parameter server, instead of averaging gradients/models, we just update the model parameters by copying the data sent from the corresponding partition, which requires no arithmetic operations.

We evaluate our method on real-life speech command recognition and Amazon product recommendation applications, on Amazon Web Services (AWS) clusters. We also analyze the convergence behavior in order to validate the effectiveness and efficiency of our approach.

2 Preliminaries

DNN training. DNNs are instances of supervised learning: e.g., for data classification, we are interested in optimizing a loss function over a set of labeled data examples, such that, after training and given an unseen sample, the classifier will map it approximately to its corresponding label. The loss $\ell(w; x, y)$ encodes the neural network architecture, with parameters $w$. Formally, given a data source probability distribution $\mathcal{D}$, every (labeled) data sample drawn from $\mathcal{D}$ can be represented as $(x, y)$, where $x$ represents the data sample, and $y$ is its corresponding label. Then, deep learning aims at finding $w^\star$ that minimizes the empirical loss:

$$w^\star \in \operatorname*{arg\,min}_{w \in \mathcal{W}} \; \frac{1}{|\mathcal{S}|} \sum_{(x_i, y_i) \in \mathcal{S}} \ell(w; x_i, y_i), \qquad (1)$$

where $\mathcal{W}$ represents the continuous hypothesis space of values of the parameters $w$. Henceforth, we denote datasets as $\mathcal{S}$, and will assume they include both the samples $x_i$ and their corresponding labels $y_i$.

The minimization in (1) can be achieved using different approaches Wright and Nocedal (1999); Zeiler (2012); Kingma and Ba (2014); Duchi et al. (2011); Ruder (2016), but almost all neural network training is accomplished via some variation of SGD: we compute (stochastic) gradient directions that, in expectation, point towards paths that decrease the loss, and then set $w_{t+1} = w_t - \eta \nabla \ell(w_t; \mathcal{B}_t)$. Here, $\eta$ represents the learning rate, and $\mathcal{B}_t$ represents a single example or a mini-batch of examples, randomly selected from $\mathcal{S}$.
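For concreteness, a minimal sketch of this update in PyTorch; the toy two-layer model, its sizes, and the loss are illustrative assumptions, and only the update rule itself matters here.

```python
import torch

# Hypothetical toy model and loss; only the SGD update rule matters.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
loss_fn = torch.nn.CrossEntropyLoss()
eta = 0.1  # learning rate

def sgd_step(x_batch, y_batch):
    """One mini-batch SGD step: w <- w - eta * grad of the loss on the batch."""
    loss = loss_fn(model(x_batch), y_batch)
    model.zero_grad()
    loss.backward()                      # stochastic gradient on the mini-batch
    with torch.no_grad():
        for w in model.parameters():
            w -= eta * w.grad            # the update rule described above
    return loss.item()

# Example: one step on a random batch of 32 samples.
print(sgd_step(torch.randn(32, 20), torch.randint(0, 2, (32,))))
```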

Why can classical distributed approaches be ineffective? It is generally understood that computing the gradient over the whole data set is wasteful Defazio and Bottou (2018). Instead, in each iteration, mini-batch SGD computes $\nabla \ell(w_t; \mathcal{B}_t)$ for a small subsample $\mathcal{B}_t$ of $\mathcal{S}$. In a centralized system, we often use no more than a few hundred data items in $\mathcal{B}_t$, and few would ever advocate using more than a few thousand Goyal et al. (2017); Yadan et al. (2013); Smith et al. (2017). Past that, the system wastes cycles lowering an almost non-existent error in the estimate of the full gradient.

For parallel/distributed computation, this is problematic for two reasons. First, it makes it difficult to speed up the computation by adding more computational units: since the batch size is quite small, splitting the task across more than a few compute nodes is not beneficial, which motivates different training approaches for neural networks Berahas et al. (2017); Bottou et al. (2018); Kylasa et al. (2018); Xu et al. (2017); Berahas et al. (2019); Martens and Grosse (2015). Computing the gradient update for such a batch on a single fast GPU may take a few milliseconds at most.

Second, gathering the updates in a distributed setting introduces a non-negligible time overhead in large clusters, which is often the main bottleneck towards efficient large-scale computing; see Section 4 for alternative solutions to this issue. This imbalance between communication and computation capacity can lead to significant increases in training time when a larger, more expensive cluster is used. As an example, consider performing the forward pass through two hidden layers in a neural network, expressed as $\sigma(\sigma(X W_1) W_2)$. Here, $\sigma$ is an activation function such as ReLU, and $W_1$ and $W_2$ are disjoint parts of the vectorized full model $w$. The data parallel version of this computation partitions the data matrix $X$ row-wise into sub-matrices $X_i$ and runs $\sigma(\sigma(X_i W_1) W_2)$ at each compute site.

However, even this simplified data parallel learning example requires a large amount of communication to broadcast the model to each machine. In particular, imagine that the input data have 100,000 features and our two hidden layers have 4,096 neurons each, so that $W_1$ and $W_2$ are $100{,}000 \times 4{,}096$ and $4{,}096 \times 4{,}096$ matrices, respectively. Running this computation using single-precision arithmetic on 8 machines, over a batch of 256 data points, requires broadcasting two matrices of size 1.63GB and 0.07GB, respectively, for 13.6GB of total communication to process each batch. Increasing the cluster size to 64 machines with a 1024 batch size results in roughly 108GB of total communication. Even using super-fast GPUs, the time required for this transfer can dominate the overall computation time.
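These figures follow from simple byte counting; the sketch below reproduces them under the dimensions assumed in this example (single precision, one full-model broadcast per machine per batch), up to rounding.

```python
BYTES_PER_FLOAT = 4              # single precision
d, h = 100_000, 4096             # input features and hidden width assumed above

w1_gb = d * h * BYTES_PER_FLOAT / 1e9   # ~1.64 GB for W1
w2_gb = h * h * BYTES_PER_FLOAT / 1e9   # ~0.07 GB for W2

for machines in (8, 64):
    total_gb = machines * (w1_gb + w2_gb)   # every worker receives both matrices
    print(f"{machines} machines: {total_gb:.1f} GB broadcast per batch")
# Prints roughly 13.6 GB and 109 GB, matching the totals quoted above up to rounding.
```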

Figure 1: Decomposing a neural network with three hidden layers into three subnets.

3 Training via Independent Subnetworks

We now describe the idea of independent subnet training. It can be seen as a generalization of, and, more importantly, a distributed-computing alternative to, dropout: we retain generalization accuracy by training different sub-networks, just as in dropout. However, we also advocate an extreme version of dropping neurons, where neurons are partitioned over sites, so that with $m$ machines, each machine has a small, $1/m$ fraction of the neurons from each layer. Since the neuron sets are mutually exclusive, they can be trained independently, with significantly less synchronization.

Subnet training($w_0$, $T$, $m$, $\eta$)
Input: initialization $w_0$, # of global iterations $T$, # of machines $m$, step size $\eta$.
1: Initialize the full network with $w_0$.
2: for $t = 0, \ldots, T-1$ do
3:    Divide the current model $w_t$ into $m$ disjoint subnets.
4:    Set the $i$-th model partition as $w_t^{(i)}$ for the $i$-th subnet.
5:    Broadcast $w_t^{(i)}$ to compute node $i$.
6:    Run Subnet local SGD($w_t^{(i)}$, $\ell$, $L$, $\eta$) on each node.
7:    Aggregate the local models into $w_{t+1}$.
8: end for (or until a stopping criterion is met)

Subnet local SGD($w^{(i)}$, $\ell$, $L$, $\eta$)
Input: $i$-th subnet model $w^{(i)}$, loss $\ell$, # of local iterations $L$, step size $\eta$, mini-batch size $b$.
1: Initialize the $i$-th subnet with $w_0^{(i)} = w^{(i)}$.
2: Draw mini-batches $\{\mathcal{B}_k\}$ of size $b$ from the local data.
3: for $k = 0, \ldots, L-1$ do
4:    $w_{k+1}^{(i)} = w_k^{(i)} - \eta \nabla \ell(w_k^{(i)}; \mathcal{B}_k)$.
5: end for
Algorithm 1 Independent subnet training.

A prototype of our algorithm is described in Algorithm 1. The “master” routine aggregates results from each compute node; the “worker” routine runs local SGD on each subnet. Just like classical SGD, subnet training repeatedly updates the parameters in a series of training steps. A training step consists of three sub-steps:

(1) Partitioning the model. A master-routine training step begins by randomly partitioning the neurons in each of the hidden layers into $m$ partitions; see the master routine in Algorithm 1. We denote the resulting subnetworks at the $t$-th iteration as $w_t^{(i)}$, for $i = 1, \ldots, m$. To visually represent this process, see Figure 1: on the left we show the original network, which is decomposed into three independent subnets, each containing a subset of the edges, denoted as $w_t^{(1)}$, $w_t^{(2)}$, and $w_t^{(3)}$. More specifically, these partitions induce a set of sub-matrices of each of the network's weight matrices. In sequence, these sub-matrices are broadcast around the compute cluster, to specific workers.

In math, let $P_l^{(i)}$ refer to the $i$-th partition of neurons in layer $l$. If the weight matrix $W_l$ connects layers $l$ and $l+1$ in the network, then define the sub-matrix $W_l^{(i)}$ as the restriction of $W_l$ to the rows indexed by $P_l^{(i)}$ and the columns indexed by $P_{l+1}^{(i)}$. Intuitively, $W_l^{(i)}$ contains all of the weights that connect the neurons in partition $P_l^{(i)}$ with the neurons in partition $P_{l+1}^{(i)}$. Since the neurons in the input layer and the neurons in the output layer are not partitioned (each site will use all of the input channels and produce all of the outputs), we have $P_0^{(i)} = \{\text{all input neurons}\}$ and $P_L^{(i)} = \{\text{all output neurons}\}$ for every $i$. Once the neurons have been partitioned and the various sub-matrices determined, then for each $i \in \{1, \ldots, m\}$, the weight sub-matrices $W_l^{(i)}$ are sent to site $i$. Each site has now effectively been assigned a subnetwork of the original neural network.

Observe that not all parameters are updated in each broadcast cycle: some parameters are sent to no site at all, as the weight connecting two neurons in adjacent layers is only sent to a site if those two neurons happen to be assigned to the same site.
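To make the sub-matrix construction concrete, the snippet below builds the per-site sub-matrices of one hidden-to-hidden weight matrix by indexing with the neuron partitions. The function names and sizes are illustrative, not the authors' exact implementation; it is one way to realize the construction above.

```python
import torch

def partition_layer(num_neurons, m, gen=None):
    """Randomly split `num_neurons` neuron indices into m disjoint partitions."""
    perm = torch.randperm(num_neurons, generator=gen)
    return [perm[i::m] for i in range(m)]   # roughly equal-sized, non-overlapping

def extract_submatrices(W, parts_in, parts_out):
    """W connects layer l (rows) to layer l+1 (columns); the i-th sub-matrix keeps
    only the rows in parts_in[i] and the columns in parts_out[i], i.e., the weights
    between the i-th partitions of the two layers. Weights whose endpoints fall in
    different partitions belong to no sub-matrix and are sent to no site."""
    return [W[p_in][:, p_out] for p_in, p_out in zip(parts_in, parts_out)]

m = 8
W = torch.randn(4096, 4096)                      # a hidden-to-hidden weight matrix
subs = extract_submatrices(W, partition_layer(4096, m), partition_layer(4096, m))
print(subs[0].shape)   # torch.Size([512, 512]): a 1/m^2 fraction of this layer's weights
```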

(2) Subnet local SGD. Once each site has received a subnet, one or more backpropagation steps are performed. Assuming each site maintains a subset $\mathcal{S}_i$ of the data $\mathcal{S}$, the local gradient steps on the $i$-th node are performed using samples drawn from $\mathcal{S}_i$. See the worker routine in Algorithm 1.

(3) Synchronization. Finally, each site sends all of its sub-matrices back to the (possibly distributed) parameter server (Step 7 of the master routine in Algorithm 1). Note that, unlike classical data-parallel training and classical local SGD procedures Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang and Ré (2014); Zhang et al. (2016), no arithmetic is required during synchronization: we need only copy the sub-matrices back to the parameter server. Once synchronization completes, the next training step begins. Once training ends, the resulting neural network can be deployed without modification.
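Putting the three sub-steps together, the following self-contained sketch simulates Algorithm 1 for a single-hidden-layer network in one process. Workers run sequentially here rather than on separate machines, the cross-entropy loss and all sizes are illustrative assumptions, and bias terms and batch normalization are omitted for brevity.

```python
import torch

def partition(n, m, gen=None):
    """Randomly split n neuron indices into m disjoint, roughly equal partitions."""
    perm = torch.randperm(n, generator=gen)
    return [perm[i::m] for i in range(m)]

def subnet_training(W1, W2, X, Y, T=10, m=4, L=75, eta=0.1):
    """Sketch of Algorithm 1 for one hidden layer: W1 is d x h, W2 is h x c.
    Only the hidden neurons are partitioned; input and output layers stay whole."""
    for _ in range(T):                                   # master loop
        parts = partition(W1.shape[1], m)                # step 3: disjoint subnets
        for i in range(m):                               # steps 5-6: workers (simulated)
            p = parts[i]
            w1 = W1[:, p].clone().requires_grad_()
            w2 = W2[p].clone().requires_grad_()
            for _ in range(L):                           # worker: local SGD steps
                idx = torch.randint(0, X.shape[0], (128,))
                out = torch.relu(X[idx] @ w1) @ w2
                loss = torch.nn.functional.cross_entropy(out, Y[idx])
                g1, g2 = torch.autograd.grad(loss, (w1, w2))
                with torch.no_grad():
                    w1 -= eta * g1
                    w2 -= eta * g2
            with torch.no_grad():                        # step 7: copy back, no averaging
                W1[:, p] = w1
                W2[p] = w2
    return W1, W2
```

In an actual deployment, the body of the worker loop runs in parallel on separate machines against local data shards, and only the sub-matrices `W1[:, p]` and `W2[p]` travel over the network.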

Some care is required during the synchronization phase to make this work well in practice. Because of the extreme dropout during training (each machine receives just a tiny fraction of the weights at each layer), we observe that it is necessary to use a technique such as batch normalization Ioffe and Szegedy (2015) to ensure that the distribution of inputs to each neuron is the same during training and during deployment. The intuition is as follows: since the set of neurons from the previous layer providing input to each neuron is much smaller during training than during deployment, the input to the neuron is of lower magnitude during training, as compared to when all of the neurons are used. One can counter this by scaling up the input to each neuron during training (or scaling down the input during deployment); this is what is done when dropout training is used.

Unfortunately, naive scaling causes problems during subnet training. Only a very small number of neurons is used by each site, so each neuron sees a much higher variance during training: the neuron's input distribution differs considerably between training and deployment. With a non-linear activation function this is a problem, as it fundamentally changes the activation pattern of the neuron, leading to worse generalization performance. Batch normalization addresses this by normalizing each neuron's input to mean zero and variance one, so that the learned parameters can be used consistently during the test phase. To the best of our knowledge, this work is the first to combine dropout and batch normalization in favor of better generalization accuracy, in contrast to common wisdom Li et al. (2018).
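A minimal sketch of the per-layer structure this implies, assuming standard PyTorch modules (an illustrative layout, not the authors' released code): each hidden linear layer is followed by batch normalization, so a neuron's pre-activation is normalized to zero mean and unit variance whether it is fed by its small training-time partition or by the full layer at deployment.

```python
import torch.nn as nn

hidden = 4096
block = nn.Sequential(
    nn.Linear(hidden, hidden),   # during subnet training a worker holds only a slice of this
    nn.BatchNorm1d(hidden),      # normalizes each neuron's pre-activation to mean 0, variance 1
    nn.ReLU(),
)
# Without the BatchNorm1d, a neuron fed by only a fraction of its inputs during training
# would see lower-magnitude, higher-variance pre-activations than at deployment, changing
# its activation pattern under the ReLU nonlinearity.
```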

Finally, not only is subnet training related to dropout techniques, but the idea is also related to ensemble methods Dietterich (2000). Subnet training treats a single neural network as a set of smaller networks, training them independently for a period, before they are synchronized and repartitioned. During deployment, all subnets are combined and used as a single network, to make decisions.

4 Related work

Here, we highlight the works most related to our idea. These include information quantization during training, sparsification, large-batch training, and local SGD approaches.

Data parallelism often suffers from the high bandwidth costs of communicating gradient updates between workers. To address this issue, quantized SGD Alistarh et al. (2017); Courbariaux et al. (2015); Seide et al. (2014) and sparsified SGD Aji and Heafield (2017) have been proposed. Quantized SGD applies lossy compression heuristics to quantize the gradients and decrease network traffic, using, in some extreme cases, only three numerical levels Wen et al. (2017), or more fine-grained multi-level quantization Dettmers (2015); Gupta et al. (2015); Hubara et al. (2017); Alistarh et al. (2017). Sparsified SGD, on the other hand, reduces the exchange overhead by only transmitting the gradients with maximal magnitude (e.g., the top 1%). Although such methods are relevant to our approach, there is a fundamental difference: those techniques post-process the full-model gradients from the forward and backward propagation of the full model, which is computationally demanding, while our approach only computes the gradients of a thin, partitioned model, so that computation cost is also reduced.
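For reference, a sketch of the top-k sparsification step that these baselines apply to a gradient tensor (function and parameter names are illustrative). It makes the contrast explicit: the dense full-model gradient must still be computed before it can be compressed, which is precisely the computation that subnet training avoids.

```python
import torch

def sparsify_topk(grad, keep_ratio=0.01):
    """Keep only the top 1% of gradient entries by magnitude; zero out the rest.
    The dense gradient must still be computed before it is compressed."""
    flat = grad.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)   # only k values and their indices need to be transmitted
```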

Recently, there has been a push for efficiency in large-scale scenarios, which has led to changes in the so-far traditional DNN training. E.g., there has been a series of papers on the general topic of using parallelism to solve a given learning problem in $X$ minutes, for ever-decreasing values of $X$ Goyal et al. (2017); Yadan et al. (2013); You et al. (2017); Smith et al. (2017); Codreanu et al. (2017); You et al. (2019b, a). However, increasing the number of compute nodes does not necessarily alleviate any communication bottlenecks: synchronization rounds are less frequent, but a massive amount of information still needs to be transferred over the network. Further, it is generally known (though still debated Dinh et al. (2017)) that large batch training converges to “sharp minima”, hurting generalization Keskar et al. (2016); Yao et al. (2018); Defazio and Bottou (2018). Finally, achieving such results more often than not seems to require teams of PhDs utilizing special-purpose hardware: there is no easy “plug-and-play” approach that generalizes well without extensive experimental trial-and-error. At the present time, such speedups are not achieved by small teams using off-the-shelf frameworks in conjunction with cloud services.

One way to address some of these issues is through distributed local SGD Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang and Ré (2014); Zhang et al. (2016): the idea is to update the parameters, through averaging, only after several local steps have been performed on each compute node. In contrast to classical mini-batch SGD, which has higher statistical efficiency, local SGD “spends” more time locally at each worker, performing several iterations that lead to different local models; these models are then transferred over the network and averaged to produce the next distributed iteration's model, yielding higher hardware efficiency Zhang et al. (2016). In extreme settings, the averaging is performed only once, at the very end of local SGD training Zinkevich et al. (2010). This strategy reduces the communication cost, as fewer synchronization steps are performed, as in large-batch training, decreasing the overall training time without accuracy loss. However, in each synchronization round, every compute node shares a dense model over the network, which results in a communication bottleneck. Recent approaches Lin et al. (2018) propose less frequent synchronization towards the end of training, but they cannot avoid it at the beginning. More importantly, as the experiments in Lin et al. (2018) reveal, local updates of the whole model lead to interdependence among updates, which limits local updates to roughly 16 iterations, while subnet training enables a much more aggressive local update period, since the partitioned models share no overlapping parameters.
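For contrast with the copy-based synchronization of subnet training, a sketch of the model-averaging step that local SGD performs after each round of local updates; each worker still ships a dense full model over the network. The function name and data layout are illustrative.

```python
import torch

def average_models(local_models):
    """local_models: one list of parameter tensors per worker, all full (dense) models.
    Local SGD synchronization averages them element-wise; subnet training instead
    copies back disjoint sub-matrices and needs no arithmetic at all."""
    return [torch.stack(params).mean(dim=0)        # average corresponding tensors
            for params in zip(*local_models)]
```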

Finally, a way to work around the high synchronization costs in distributed SGD is through gradient staleness, i.e., asynchronous computation Recht et al. (2011); Dean et al. (2012); Paine et al. (2013); Zhang et al. (2013). Asynchrony has been used in distributed-memory systems, such as the DistBelief system by Google Dean et al. (2012) (also known as Downpour SGD) and Project Adam by Microsoft Chilimbi et al. (2014). While such systems show nice asymptotic convergence rate guarantees, scaling with the number of compute nodes and the number of iterations, there seems to be growing agreement that unconstrained asynchrony does not work well in practice for complex neural network models, and it seems to be losing favor. A recent paper Chen et al. (2016) strongly argued that asynchronous updates can be harmful beyond a few dozen compute nodes. Our approach naturally creates non-overlapping subnets that eliminate overwrites at the parameter server; we leave this direction for future work.

5 Experiments

We empirically demonstrate subnet training’s effectiveness and efficiency on two learning tasks.

Google Speech Commands Warden (2018): We learn a two- and a three-layer network over this data set, which includes the waveforms of spoken words and is designed to help train and evaluate keyword spotting systems. We pre-process the data to represent each waveform as a 4096-dimensional feature vector via a mel-scaled spectrogram Stevens et al. (1937). 35 labeled keywords are included for the classification task. We train two neural network structures: a two-layer fully connected network with weight matrices of size $4096 \times 4096$ and $4096 \times 35$, and a three-layer fully connected network with weight matrices of size $4096 \times 4096$, $4096 \times 4096$, and $4096 \times 35$.

Amazon-670k Bhatia et al.: We learn a larger model over a product recommendation dataset with 670,000 labels. Each sample represents a product, with labels corresponding to other products users might be interested in purchasing. This is one of the standard large-scale benchmarks for extreme classification research. We train a two-layer, fully-connected neural network with a hidden layer of size 512, so that the two weight matrices are of size $d \times 512$ and $512 \times 670{,}091$, where $d$ is the input feature dimension. The KL-divergence between the network's output and the corresponding target is used as the loss function for SGD.
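As a sketch of this loss setup in PyTorch (layer sizes are shrunk for illustration, and the exact target preprocessing is an assumption): nn.KLDivLoss expects log-probabilities from the network and a probability distribution as the target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; the actual network uses a 512-unit hidden layer and 670,091 outputs.
d, hidden, labels, batch = 1000, 512, 4096, 32

net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, labels))
kl = nn.KLDivLoss(reduction="batchmean")   # input: log-probabilities, target: probabilities

x = torch.randn(batch, d)
target = torch.rand(batch, labels)
target = target / target.sum(dim=1, keepdim=True)      # normalize targets to sum to one

loss = kl(F.log_softmax(net(x), dim=1), target)         # KL(target || predicted distribution)
loss.backward()
```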

Implementation. We implement distributed subnet training in PyTorch. As a baseline, we use the classical data parallel training implemented in PyTorch. We train the Google speech command recognition networks on three AWS clusters, with 2, 4, and 8 machine instances (m5.2xlarge CPU machines). We train the Amazon-670k extreme classification network on an AWS GPU cluster with 8 GPU machines (p3.2xlarge instances). We use a batch size of 128 waveforms/data points, and we process 75 batches before synchronization for subnet training on the Google data set, and 40 batches before synchronization on the Amazon-670k data set. A separate node is used as a parameter server (an m5.2xlarge CPU machine).

We use the standard SGD method, with the step size tuned over an exponentially-spaced set of candidate values for each of the Google and Amazon data sets, and the best-performing fixed value selected for each. We considered only fixed step size selections, with no decay schedule. The neural networks were initialized with PyTorch's default initialization; we report single-run experiments.

                 Data Parallel               SubNet
Accuracy    2 Node  4 Node  8 Node    2 Node  4 Node  8 Node
65%            264     161     142        39      18       6
67%            352     225     213        51      21       8
69%            440     291     309        64      32      12
71%            705     421     428        76      40      17
73%            970     615     572       113      65      28
75%           1499     713     834       158     102      48
77%           2208    1297    1550       259     168      98
Table 1: The time (in seconds) for the two-layer Google Speech classifier to reach various accuracies.
                 Data Parallel               SubNet
Accuracy    2 Node  4 Node  8 Node    2 Node  4 Node  8 Node
65%            359     201     289        46      22      15
67%            540     269     385        69      27      21
69%            722     336     530        91      43      30
71%            905     404     772       113      64      45
73%           1088     472    1545       137      86      64
75%           1270     540    1932       182     129     104
77%           1450     947    2222       251     214     185
Table 2: The time (in seconds) for the three-layer Google Speech classifier to reach various accuracies.

Results. Results for the Google Speech classifiers are given in Tables 1 and 2. For each of the clusters tested, and for both subnet training and data parallel training, we show the time required to reach different accuracies. A similar set of results is shown for the 8-node GPU cluster used to train the Amazon-670k classifier in Table 3. We also summarize the best accuracy for the two- and three-layer Google speech classifiers, and the best precision@1 for the Amazon-670k classifier, together with the corresponding training time for the largest cluster scale, in Table 4.

Discussion. We observe that data parallel training is mostly ineffective at realizing any sort of speedup from distributed training; adding more machines can actually increase the training time. This is because the interconnect between machines provided by AWS is slow enough that communication times dominate. The interconnect operates at 20Gbps, meaning that, e.g., transmitting a 4096 × 4096 single-precision matrix between two machines (Google data set) takes tens of milliseconds, and transmitting a 512 × 670,091 matrix (Amazon data set) takes more than a second. Adding more machines to decrease the wall-clock time of backpropagation does not decrease the overall time, because it tends to increase the communication time. This is especially the case when a single parameter server is used, which is one of the most common configurations for distributed training. We conjecture that adding additional parameter servers may alleviate this problem, but it would likely not make data parallel training anywhere near as fast as distributed subnet training, which is more than 20 times faster than data parallel training for the larger clusters. Subnet training was also handicapped by the single parameter server, as the AWS interconnect was slow enough that communication to the various workers was still a bottleneck, even for subnet training.
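These transfer-time estimates follow from the matrix sizes and the 20 Gbps link; the back-of-the-envelope sketch below assumes the full nominal bandwidth is achieved and ignores protocol overhead, so real transfers are slower than these lower bounds.

```python
def transfer_seconds(rows, cols, gbps=20, bytes_per_entry=4):
    """Idealized time to push one dense single-precision matrix over the link."""
    bits = rows * cols * bytes_per_entry * 8
    return bits / (gbps * 1e9)

print(transfer_seconds(4096, 4096))     # ~0.027 s for the Google hidden-layer matrix
print(transfer_seconds(512, 670_091))   # ~0.55 s for the Amazon output matrix; effective
                                        # throughput is usually well below line rate, so
                                        # observed transfers exceed these bounds.
```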

Given how slow the AWS interconnect is, there were two main reasons that subnet training scaled well compared to data parallel training. One is the reduced communication to each worker node during synchronization. The second is the reduced frequency of communication: since we run 70 independent batches on each worker in the case of Google (40 in the case of Amazon), there are far fewer communication steps. While we likely lose some computational efficiency (the lack of communication may require more computing time to reach a given accuracy), this is more than compensated for by the increased communication efficiency.

Would a faster interconnect between machines have changed the results? This is a case for further investigation, due to PyTorch limitations for GPU-to-GPU communication. However, even with a faster InfiniBand connection (typically 3-5× faster than the AWS interconnect), the gap in training times is large enough that data parallel training would be unlikely to outperform subnet training. Further, most deep learning “consumers” are limited to interconnects like the AWS one, as cloud services are the most common platform for training neural networks.

Precision@1      26%     28%     30%     32%     34%     36%     38%
Data Parallel  23540   26254   28953   31159   33895   39312   41572
SubNet          3288    3688    4252    5019    5791    6926    8872
Table 3: The time (in seconds) for the Amazon-670k classifier to reach various precision@1.
                                Data Parallel             SubNet
                             Accuracy/P@1   Time    Accuracy/P@1    Time
2-layer speech classifier         77.15%    2386         78.26%      192
3-layer speech classifier         77.68%    3512         79.51%      557
Amazon-670k classifier            39.05%   49720         40.11%    12328
Table 4: The final accuracy for the two- and three-layer Google speech classifiers and the final precision@1 for the Amazon-670k classifier, with the associated training time (in seconds) on the 8-node clusters.

6 Conclusion

In this work, we propose independent subnet training for distributed optimization of fully connected neural networks. By stochastically partitioning the model into non-overlapping subnetworks, subnet training reduces both the communication overhead of model synchronization and the computation overhead of forward-backward propagation, since each worker trains a thinner model. Thanks to the regularization effect inherited from dropout, the same neural network architecture also generalizes better when optimized with subnets than with the classic data parallel approach.

There are interesting open questions remaining. This paper focuses on synchronized updates in a homogeneous environment; deploying our proposal on heterogeneous distributed environments, with asynchrony and adaptive local tuning, is an interesting extension. Further, the current prototype applies only to fully connected neural networks trained with SGD; extending the approach to other architectures, such as CNNs, and to other training algorithms, such as Adam, raises interesting open questions: e.g., how should momentum and second-order information be split under subnet training? Lastly, we plan to combine subnet training with techniques that post-process the gradients, such as sparsification and quantization.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1.
  • [2] A. F. Aji and K. Heafield (2017) Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021. Cited by: §4.
  • [3] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720. Cited by: §4.
  • [4] A. S. Berahas, R. Bollapragada, and J. Nocedal (2017) An investigation of Newton-sketch and subsampled Newton methods. arXiv preprint arXiv:1705.06211. Cited by: §2.
  • [5] A. S. Berahas, M. Jahani, and M. Takáč (2019) Quasi-Newton methods for deep learning: forget the past, just sample. arXiv preprint arXiv:1901.09997. Cited by: §2.
  • [6] K. Bhatia, K. Dahiya, H. Jain, Y. Prabhu, and M. Varma The extreme classification repository: multi-label datasets and code. Note: http://manikvarma.org/downloads/XC/XMLRepository.html Cited by: §5.
  • [7] L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §2.
  • [8] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz (2016) Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981. Cited by: §4.
  • [9] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman (2014) Project Adam: building an efficient and scalable deep learning training system.. In OSDI, Vol. 14, pp. 571–582. Cited by: §1.
  • [10] V. Codreanu, D. Podareanu, and V. Saletore (2017) Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291. Cited by: §1, §4.
  • [11] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §4.
  • [12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. (2012) Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231. Cited by: §1, §4.
  • [13] A. Defazio and L. Bottou (2018) On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529. Cited by: §2, §4.
  • [14] T. Dettmers (2015) 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561. Cited by: §4.
  • [15] T. G. Dietterich (2000) Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: §3.
  • [16] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1019–1028. Cited by: §4.
  • [17] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §2.
  • [18] P. Farber and K. Asanovic (1997) Parallel neural network training on multi-spert. In Algorithms and Architectures for Parallel Processing, 1997. ICAPP 97., 1997 3rd International Conference on, pp. 659–666. Cited by: §1.
  • [19] R. Girshick (2015) Fast R-CNN. arXiv preprint arXiv:1504.08083. Cited by: §1.
  • [20] N. Golmant, N. Vemuri, Z. Yao, V. Feinberg, A. Gholami, K. Rothauge, M. W. Mahoney, and J. Gonzalez (2018) On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941. Cited by: §1.
  • [21] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §1.
  • [22] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1, §2, §4.
  • [23] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §4.
  • [24] S. Hadjis, C. Zhang, I. Mitliagkas, D. Iter, and C. Ré (2016) Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487. Cited by: §1.
  • [25] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations.. Journal of Machine Learning Research 18 (187), pp. 1–30. Cited by: §4.
  • [26] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §1, §3.
  • [27] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. Cited by: §1.
  • [28] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §4.
  • [29] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2, §4.
  • [30] A. Krizhevsky, I. Sutskever, and G. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [31] S. B. Kylasa, F. Roosta-Khorasani, M. W. Mahoney, and A. Grama (2018) GPU accelerated sub-sampled Newton’s method. arXiv preprint arXiv:1802.09113. Cited by: §2.
  • [32] A. Labach, H. Salehinejad, and S. Valaee (2019) Survey of dropout methods for deep neural networks. arXiv preprint arXiv:1904.13310. Cited by: §1.
  • [33] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server.. In OSDI, Vol. 14, pp. 583–598. Cited by: §1.
  • [34] X. Li, S. Chen, X. Hu, and J. Yang (2018) Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134. Cited by: §1, §3.
  • [35] T. Lin, S. U. Stich, and M. Jaggi (2018) Don’t use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217. Cited by: §4.
  • [36] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • [37] L. Ma, G. Montague, J. Ye, Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney (2019) Inefficiency of K-FAC for large batch size training. arXiv preprint arXiv:1903.06237. Cited by: §1.
  • [38] J. Martens and R. Grosse (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning, pp. 2408–2417. Cited by: §2.
  • [39] R. Mcdonald, M. Mohri, N. Silberman, D. Walker, and G. S. Mann (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pp. 1231–1239. Cited by: §3, §4.
  • [40] T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang (2013) GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186. Cited by: §4.
  • [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §1.
  • [42] R. Raina, A. Madhavan, and A. Y. Ng (2009) Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning, pp. 873–880. Cited by: §1.
  • [43] A. Ratner, D. Alistarh, G. Alonso, P. Bailis, S. Bird, N. Carlini, B. Catanzaro, E. Chung, B. Dally, J. Dean, et al. (2019) SysML: the new frontier of machine learning systems. arXiv preprint arXiv:1904.03257. Cited by: §1.
  • [44] B. Recht, C. Re, S. Wright, and F. Niu (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pp. 693–701. Cited by: §4.
  • [45] S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §2.
  • [46] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §4.
  • [47] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [48] S. L. Smith, P. Kindermans, C. Ying, and Q. V. Le (2017) Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489. Cited by: §1, §2, §4.
  • [49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.
  • [50] S. S. Stevens, J. Volkmann, and E. B. Newman (1937) A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8 (3), pp. 185–190. Cited by: §5.
  • [51] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using Dropconnect. In International Conference on Machine Learning, pp. 1058–1066. Cited by: §1.
  • [52] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §5.
  • [53] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li (2017) Terngrad: ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pp. 1509–1519. Cited by: §4.
  • [54] S. Wright and J. Nocedal (1999) Numerical optimization. Springer Science 35 (67-68), pp. 7. Cited by: §2.
  • [55] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney (2017) Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164. Cited by: §2.
  • [56] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato (2013) Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853. Cited by: §1, §2, §4.
  • [57] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney (2018) Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pp. 4949–4959. Cited by: §4.
  • [58] Y. You, I. Gitman, and B. Ginsburg (2017) Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888. Cited by: §1, §4.
  • [59] Y. You, J. Hseu, C. Ying, J. Demmel, K. Keutzer, and C. Hsieh (2019) Large-batch training for LSTM and beyond. arXiv preprint arXiv:1901.08256. Cited by: §1, §4.
  • [60] Y. You, J. Li, J. Hseu, X. Song, J. Demmel, and C. Hsieh (2019) Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962. Cited by: §1, §4.
  • [61] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §2.
  • [62] C. Zhang and C. Ré (2014) Dimmwitted: a study of main-memory statistical analytics. Proceedings of the VLDB Endowment 7 (12), pp. 1283–1294. Cited by: §3, §4.
  • [63] J. Zhang, C. De Sa, I. Mitliagkas, and C. Ré (2016) Parallel SGD: when does averaging help?. arXiv preprint arXiv:1606.07365. Cited by: §3, §4.
  • [64] S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu (2013) Asynchronous stochastic gradient descent for DNN training. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6660–6663. Cited by: §4.
  • [65] X. Zhang, M. Mckenna, J. P. Mesirov, and D. L. Waltz (1990) An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in neural information processing systems, pp. 801–809. Cited by: §1.
  • [66] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola (2010) Parallelized stochastic gradient descent. In Advances in neural information processing systems, pp. 2595–2603. Cited by: §3, §4.