Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication
Abstract
Currently, progressively larger deep neural networks are trained on ever-growing data corpora. As this trend is only going to increase in the future, distributed training schemes are becoming increasingly relevant. A major issue in distributed training is the limited communication bandwidth between contributing nodes, or prohibitive communication cost in general. These challenges become even more pressing as the number of computation nodes increases. To counteract this development we propose sparse binary compression (SBC), a compression framework that allows for a drastic reduction of communication cost for distributed training. SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight-update encoding to push compression gains to new limits. By doing so, our method also allows us to smoothly trade off gradient sparsity and temporal sparsity to adapt to the requirements of the learning task. Our experiments show that SBC can reduce the upstream communication on a variety of convolutional and recurrent neural network architectures by more than four orders of magnitude without significantly harming the convergence speed in terms of forward-backward passes. For instance, we can train ResNet50 on ImageNet to the baseline accuracy in the same number of iterations using far fewer bits, or train it to a slightly lower accuracy using fewer bits still. In the latter case, the total upstream communication required is cut from 125 terabytes to 3.35 gigabytes for every participating client.
I Introduction
Distributed Stochastic Gradient Descent (DSGD) is a training setting in which a number of clients jointly train a deep learning model using stochastic gradient descent [dean2012large][recht2011hogwild][moritz2015sparknet]. Every client holds an individual subset of the training data, used to improve the current master model. The improvement is obtained by investing computational resources to perform iterations of stochastic gradient descent (SGD). This local training produces a weight-update in every participating client, which in regular or irregular intervals (communication rounds) is exchanged to produce a new master model. This exchange of weight-updates can be performed indirectly via a centralized server or directly in an all-reduce operation. In both cases, all clients share the same master model after every communication round (Fig. 1). In vanilla DSGD the clients have to communicate a full gradient update during every iteration. Every such update is of the same size as the full model, which can be in the range of gigabytes for modern architectures with millions of parameters [he2016deep][huang2017densely]. Over the course of several hundred thousand training iterations on big datasets, the total communication for every client can easily grow to more than a petabyte. Consequently, if communication bandwidth is limited or communication is costly, distributed deep learning can become unproductive or even infeasible.
DSGD is a very popular training setting with many applications. On one end of the spectrum, DSGD can be used to greatly reduce the training time of large-scale deep learning models by introducing device-level data parallelism [chilimbi2014project][zinkevich2010parallelized][xing2015petuum][li2014communication], making use of the fact that the computation of a mini-batch gradient is perfectly parallelizable. In this setting, the clients are usually embodied by hardwired high-performance computation units (i.e. GPUs in a cluster) and every client performs one iteration of SGD per communication round. Since communication is very frequent in this setting, bandwidth can be a significant bottleneck. On the other end of the spectrum, DSGD can also be used to enable privacy-preserving deep learning [shokri2015privacy][mcmahan2016communication]. Since the clients only ever share weight-updates, DSGD makes it possible to train a model from the combined data of all clients without any individual client having to reveal their local training data to a centralized server. In this setting the clients typically are embedded or mobile devices with low network bandwidth, intermittent network connections, and an expensive mobile data plan.
In both scenarios, the communication cost between the individual training nodes is a limiting factor for the performance of the whole learning system. As a result of this, substantial research has gone into the effort of reducing the amount of communication necessary between the clients via lossy compression schemes [li2014communication][wen2017terngrad][alistarh2017qsgd][seide20141][bernstein2018signsgd][strom2015scalable][aji2017sparse][lin2017deep][mcmahan2016communication][konevcny2016federated]. For the synchronous distributed training scheme described above, the total amount of bits communicated by every client during training is given by
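$$b_{total} \in \mathcal{O}\big(N_{iter} \times f \times |\Delta W_{\neq 0}| \times (\bar{b}_{pos} + \bar{b}_{val}) \times K\big) \tag{1}$$
where $N_{iter}$ is the total number of training iterations (forward-backward passes) every client performs, $f$ is the communication frequency, $|\Delta W_{\neq 0}|$ is the sparsity of the weight-update (its number of non-zero elements), $\bar{b}_{pos}$, $\bar{b}_{val}$ are the average number of bits required to communicate the position and the value of the non-zero elements respectively, and $K$ is the number of receiving nodes (if $\Delta W$ is dense, the positions of all weights are predetermined and no position bits are required).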
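To make the multiplicative structure of equation (1) concrete, the following minimal sketch multiplies the factors for two hypothetical configurations. The function name, the model size and all parameter values are illustrative assumptions, not settings taken from the experiments, and constant factors hidden by the asymptotic notation are ignored.

```python
def total_upstream_bits(n_iter, freq, n_nonzero, bits_pos, bits_val, n_receivers):
    """Bits sent by one client over training, as the product of the factors in equation (1)."""
    n_rounds = n_iter * freq                      # number of communication rounds
    bits_per_update = n_nonzero * (bits_pos + bits_val)
    return n_rounds * bits_per_update * n_receivers


# Hypothetical model with one million parameters, trained for 10000 iterations.
n_params = 1_000_000

# Dense baseline: communicate every iteration, 32 value bits, no position bits.
dense = total_upstream_bits(10_000, 1.0, n_params, 0, 32, 1)

# Compressed: communicate every 100th iteration, 1% non-zero entries,
# 16 position bits and 0 value bits per non-zero entry.
sparse = total_upstream_bits(10_000, 0.01, n_params // 100, 16, 0, 1)

compression_rate = dense / sparse
print(f"compression rate: x{compression_rate:.0f}")
```

Because the factors multiply, reducing several of them at once compounds: in this toy setting the delay contributes a factor of 100, the sparsity another 100, and the bit-width a factor of 2.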
Existing compression schemes focus on reducing only one or two of the multiplicative components that contribute to the total bit count $b_{total}$. Following the structure of equation (1), we can sort these prior approaches into three groups:
Sparsification methods restrict weight-updates to modifying only a small subset of the parameters, thus reducing the number of non-zero elements $|\Delta W_{\neq 0}|$. Strom [strom2015scalable] presents an approach in which only gradients with a magnitude greater than a certain predefined threshold are sent to the server. All other gradients are aggregated into a residual. This method achieves compression rates of up to 3 orders of magnitude on an acoustic modeling task. In practice, however, it is hard to choose appropriate values for the threshold, as it may vary a lot for different architectures and even different layers. Instead of using a fixed threshold to decide which gradient entries to send, Aji et al. [aji2017sparse] use a fixed sparsity rate. They only communicate the fraction $p$ of gradient entries with the biggest magnitude, while also collecting all other gradients in a residual. At the sparsity rate they report, their method slightly degrades the convergence speed and final accuracy of the trained model. Lin et al. [lin2017deep] present modifications to the work of Aji et al. which close this performance gap. These modifications include using a curriculum to slowly increase the amount of sparsity in the first couple of communication rounds and applying momentum factor masking to overcome the problem of gradient staleness. They report extensive results for many modern convolutional and recurrent neural network architectures on big datasets. Using a naive encoding of the sparse weight-updates, they achieve compression rates ranging from ×270 to ×600 on different architectures, without any slowdown in convergence speed or degradation of final accuracy.
Communication delay methods try to reduce the communication frequency $f$. McMahan et al. [mcmahan2016communication] propose Federated Averaging to reduce the number of communication rounds. In Federated Averaging, instead of communicating after every iteration, every client performs multiple iterations of SGD to compute a weight-update. The authors observe that this delay of communication does not significantly harm the convergence speed in terms of local iterations and report a reduction in the number of necessary communication rounds by a factor of 10 to 100 on different convolutional and recurrent neural network architectures. In a follow-up work, Konečný et al. [konevcny2016federated] combine this communication delay with random sparsification and probabilistic quantization. They restrict the clients to learn random sparse weight-updates or force random sparsity on them afterwards ("structured" vs "sketched" updates) and combine this sparsification with probabilistic quantization. While their method also combines communication delay with (random) sparsification and quantization, and achieves good compression gains for one particular CNN and LSTM model, it also causes a major drop in convergence speed and final accuracy.
Dense quantization methods try to reduce the number of value bits $\bar{b}_{val}$. Wen et al. propose TernGrad [wen2017terngrad], a method to stochastically quantize gradients to ternary values. This achieves a moderate compression rate of ×16, while accuracy drops noticeably on big modern architectures. The authors prove the convergence of their method under the assumption of bounded gradients. Alistarh et al. [alistarh2017qsgd] explore the trade-off between model accuracy and gradient precision. They prove information-theoretic bounds on the compression rate achievable by dense quantization and propose QSGD, a family of compression schemes with convergence guarantees. Other authors experiment with 1-bit quantization schemes: Seide et al. [seide20141] show empirically that it is possible to quantize the weight-updates to 1 bit without harming convergence speed, if the quantization errors are accumulated. Bernstein et al. [bernstein2018signsgd] propose signSGD, a distributed training scheme in which every client quantizes the gradients to binary signs and the server aggregates the gradients by means of a majority vote. In general, of course, dense quantization of 32-bit values can only achieve a maximum compression rate of ×32.
II Sparse Binary Compression
We propose Sparse Binary Compression (cf. Figure 2) to drastically reduce the number of communicated bits in distributed training. SBC makes use of multiple techniques simultaneously to reduce all multiplicative components of equation (1). (Footnote 1: To clarify, we have put our contributions in emphasis.)
In the following, $\mathcal{W}$ will refer to the entirety of neural network parameters, while $W \in \mathcal{W}$ will refer to one specific tensor of weights. Arithmetic operations on $\mathcal{W}$ are to be understood componentwise.
Communication Delay, Fig. 2 (a): We use communication delay, proposed by [mcmahan2016communication], to introduce temporal sparsity into DSGD. Instead of communicating gradients after every local iteration, we allow the clients to compute more informative updates by performing multiple iterations of SGD. These generalized weight-updates are given by
$$\Delta \mathcal{W}_i = \mathrm{SGD}_n(\mathcal{W}, D_i) - \mathcal{W}$$
where $\mathrm{SGD}_n(\mathcal{W}, D_i)$ refers to the set of weights obtained by performing $n$ iterations of stochastic gradient descent on $\mathcal{W}$, while sampling mini-batches from the $i$-th client's training data $D_i$. Empirical analysis by [mcmahan2016communication] suggests that communication can be delayed drastically, with only marginal degradation of accuracy. For $n = 1$ we obtain regular DSGD.
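The generalized weight-update can be illustrated with a minimal sketch on a toy least-squares problem. The model, the synthetic data, the learning rate and the helper name `local_weight_update` are assumptions made for illustration only; the server step is plain averaging of the clients' updates.

```python
import numpy as np


def local_weight_update(w, data_x, data_y, n_local_iters, lr=0.1, batch_size=8, seed=0):
    """Return the difference between the weights after n local SGD steps and the start weights."""
    rng = np.random.default_rng(seed)
    w_local = w.copy()
    for _ in range(n_local_iters):
        idx = rng.choice(len(data_x), size=batch_size, replace=False)
        x, y = data_x[idx], data_y[idx]
        grad = 2.0 * x.T @ (x @ w_local - y) / batch_size  # gradient of the mean squared error
        w_local -= lr * grad
    return w_local - w


rng = np.random.default_rng(42)
true_w = np.array([1.0, -2.0, 0.5])

# Four clients, each holding its own shard of synthetic, noise-free data.
shards = []
for _ in range(4):
    x = rng.normal(size=(64, 3))
    shards.append((x, x @ true_w))

w_master = np.zeros(3)
for round_idx in range(20):
    # Every client runs 10 local iterations before communicating once.
    updates = [local_weight_update(w_master, x, y, n_local_iters=10, seed=round_idx)
               for x, y in shards]
    w_master = w_master + np.mean(updates, axis=0)  # server averages the weight-updates
```

Setting `n_local_iters=1` recovers the regular scheme in which one gradient step is exchanged per communication round.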
Sparse Binarization, Fig. 2 (b), (c): Following the works of [lin2017deep][strom2015scalable][shokri2015privacy][aji2017sparse], we use the magnitude of an individual weight within a weight-update as a heuristic for its importance. First, we set all but the fraction $p$ biggest and fraction $p$ smallest weight-updates to zero. Next, we compute the mean of all remaining positive and all remaining negative weight-updates independently. If the positive mean $\mu^+$ is bigger than the absolute negative mean $\mu^-$, we set all negative values to zero and all positive values to the positive mean, and vice versa. The method is illustrated in Figure 2 and formalized in Algorithm 2. Finding the fraction $p$ smallest and biggest values in a vector $W$ requires $\mathcal{O}(|W|)$ operations, where $|W|$ refers to the number of elements in $W$. As suggested in [lin2017deep], we can reduce the computational cost of this selection by randomly subsampling from $W$. However, this comes at the cost of introducing (unbiased) noise in the amount of sparsity. Luckily, in our approach communication rounds (and thus compressions) are relatively infrequent, which helps to amortize the overhead of the sparsification. Quantizing the non-zero elements of the sparsified weight-update to the mean reduces the required value bits from 32 to 0. With a naive encoding that spends 16 bits on every position, this translates to a reduction in communication cost by a factor of around $(16 + 32)/16 = 3$. We can get away with averaging out the non-zero weight-updates because they are relatively homogeneous in value and because we accumulate our compression errors as described in the next paragraph.
Although other methods, like TernGrad [wen2017terngrad], also combine sparsification and quantization of the weight-updates, none of these methods work with sparsity rates as high as ours (see Table I).
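The sparse binarization step described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's Algorithm 2: the function name `sparse_binarize` is an assumption, and the means are taken over the selected top and bottom entries, which are assumed to be positive and negative respectively.

```python
import numpy as np


def sparse_binarize(delta_w, p):
    """Keep either the top-p or the bottom-p fraction of entries, all set to their mean."""
    flat = delta_w.ravel()
    k = max(1, int(p * flat.size))
    # Partial selection of the k largest and k smallest entries (no full sort needed).
    top_idx = np.argpartition(flat, -k)[-k:]
    bottom_idx = np.argpartition(flat, k - 1)[:k]
    mu_pos = flat[top_idx].mean()        # mean of the largest entries
    mu_neg = -flat[bottom_idx].mean()    # absolute mean of the smallest entries
    out = np.zeros_like(flat)
    if mu_pos >= mu_neg:
        out[top_idx] = mu_pos            # keep only the positive side
    else:
        out[bottom_idx] = -mu_neg        # keep only the negative side
    return out.reshape(delta_w.shape)


delta = np.array([0.5, -0.1, 0.05, 0.7, -0.8, 0.0, 0.2, -0.3, 0.1, -0.05])
binarized = sparse_binarize(delta, p=0.2)
```

The result contains a single distinct non-zero value per tensor, so only the positions of the non-zero entries and one mean have to be communicated.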
Residual Accumulation, Fig. 2 (d): It is well established (see [lin2017deep][strom2015scalable][aji2017sparse][seide20141]) that the convergence in sparsified DSGD can be greatly accelerated by accumulating the error that arises from only sending sparse approximations of the weight-updates. After every communication round, the residual is updated via
$$R_\tau = \sum_{t=1}^{\tau} \big(\Delta W_t - \Delta \tilde{W}_t\big) = R_{\tau-1} + \Delta W_\tau - \Delta \tilde{W}_\tau \tag{2}$$
Error accumulation has the great benefit that no gradient information is lost (it may only become outdated or "stale"). In the context of pure sparsification residual accumulation can be interpreted to be equivalent to increasing the batch size for individual parameters [lin2017deep]. Moreover, we can show:
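Theorem II.1.
Let $\Delta W_1, \dots, \Delta W_T \in \mathbb{R}^n$ be (flattened) weight-updates, computed by one client in the first $T$ communication rounds. Let $\Delta \tilde{W}_1, \dots, \Delta \tilde{W}_{T-1} \in S$ be the actual weight-updates, transferred in the previous $T-1$ rounds (restricted to some subspace $S \subseteq \mathbb{R}^n$) and $R_{T-1}$ be the content of the residual at time $T-1$ as in (2). Then the orthogonal projection
$$\Delta \tilde{W}_T = \mathrm{Proj}_S\big(R_{T-1} + \Delta W_T\big) \tag{3}$$
uniquely minimizes the accumulated error
$$\mathrm{err}(\Delta \tilde{W}_T) = \Big\| \sum_{t=1}^{T} \big(\Delta W_t - \Delta \tilde{W}_t\big) \Big\| \tag{4}$$
in $S$. (Proof in Supplement.)
This means that residual accumulation keeps the compressed optimization path as close as possible to the optimization path taken with non-compressed weight-updates.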
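The bookkeeping behind residual accumulation can be checked numerically. The sketch below is an illustration under stated assumptions: random vectors stand in for weight-updates, and a simple top-k sparsifier (`top_k_sparsify`, a hypothetical helper) stands in for the compression step. Each round compresses the new update plus the residual and then stores whatever was not sent.

```python
import numpy as np


def top_k_sparsify(x, k):
    """Keep the k entries of largest magnitude and zero the rest (stand-in compressor)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out


rng = np.random.default_rng(0)
residual = np.zeros(100)
sum_full = np.zeros(100)   # sum of all uncompressed weight-updates
sum_sent = np.zeros(100)   # sum of all weight-updates actually transferred

for _ in range(50):
    delta_w = rng.normal(size=100)                  # uncompressed update of this round
    sent = top_k_sparsify(residual + delta_w, k=5)  # compress update plus residual
    residual = residual + delta_w - sent            # residual update rule
    sum_full += delta_w
    sum_sent += sent

# Nothing is lost: whatever was not sent is still held in the residual.
information_preserved = np.allclose(sum_full, sum_sent + residual)
```

The final check expresses the statement that error accumulation loses no gradient information; entries that are not transferred are only delayed.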
Optimal Position Encoding, Fig. 2 (e): To communicate a set of sparse binary tensors produced by SBC, we only need to transfer the positions of the non-zero elements in the flattened tensors, along with one mean value ($\mu^+$ or $\mu^-$) per tensor. Instead of communicating the absolute non-zero positions, it is favorable to communicate only the distances between consecutive non-zero elements. Under the simplifying assumption that the sparsity pattern is random for every weight-update, it is easy to show that these distances are geometrically distributed with success probability $p$ equal to the sparsity rate. Therefore, as previously done by [strom2015scalable], we can optimally encode the distances using the Golomb code [golomb1966run]. Golomb encoding reduces the average number of position bits to
$$\bar{b}_{pos} = b^* + \frac{1}{1 - (1-p)^{2^{b^*}}} \tag{5}$$
with $b^* = 1 + \left\lfloor \log_2\left( \frac{\log(\phi - 1)}{\log(1 - p)} \right) \right\rfloor$ and $\phi = \frac{\sqrt{5}+1}{2}$ being the golden ratio. At the sparsity rates we consider, this is noticeably fewer bits than a naive distance encoding with 16 fixed bits. While the overhead for encoding and decoding makes it unproductive to use Golomb encoding in the situation of [strom2015scalable], this overhead becomes negligible in our situation due to the infrequency of weight-update exchange resulting from communication delay. The encoding scheme is given in Algorithm 3, while the decoding scheme can be found in the supplement.
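As a hedged illustration of Golomb position encoding, the sketch below encodes the gaps between non-zero positions with a power-of-two Golomb (Rice) code. The bit layout (a unary quotient of ones terminated by a zero, followed by a fixed-width binary remainder), the convention that a gap counts the zeros skipped, and the golden-ratio rule for the bit-width are assumptions of this sketch and need not match Algorithm 3 in every detail. All function names are hypothetical.

```python
import math


def golomb_bit_width(p):
    """Bit-width parameter for sparsity rate p, using a golden-ratio rule (assumed form)."""
    phi = (math.sqrt(5) + 1) / 2
    return 1 + math.floor(math.log2(math.log(phi - 1) / math.log(1 - p)))


def expected_position_bits(p):
    """Average code length per non-zero element for geometrically distributed gaps."""
    b = golomb_bit_width(p)
    return b + 1 / (1 - (1 - p) ** (2 ** b))


def golomb_encode(nonzero_positions, b):
    """Encode sorted non-zero positions as a string of '0'/'1' characters."""
    bits = []
    prev = -1
    for pos in nonzero_positions:
        gap = pos - prev - 1                 # number of zeros skipped before this element
        q, r = divmod(gap, 2 ** b)
        bits.append("1" * q + "0")           # unary quotient, terminated by a zero
        if b > 0:
            bits.append(format(r, f"0{b}b"))  # remainder as b binary digits
        prev = pos
    return "".join(bits)


b_star = golomb_bit_width(0.01)              # intended for small sparsity rates
code = golomb_encode([3, 4, 130], b_star)
avg_bits = expected_position_bits(0.01)
```

Short gaps cost only one unary bit plus the remainder, while long gaps grow by one bit for every additional block of `2 ** b` skipped zeros, which is what makes the code efficient for geometrically distributed distances.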
Momentum Correction, Warmup Training and Momentum Masking: Lin et al. [lin2017deep] introduce multiple minor modifications to the vanilla Gradient Dropping method, to improve the convergence speed. We adopt momentum masking, while momentum correction is implicit to our approach. For more details on this we refer to the supplement.
Our proposed method is described in Algorithms 1, 2 and 3. Algorithm 1 describes how compression and residual accumulation can be introduced into DSGD. Algorithm 2 describes our compression method. Algorithm 3 describes the Golomb encoding. Table I compares theoretical asymptotic compression rates of different popular compression methods.

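[Table I: theoretical asymptotic compression rates of the baseline and of several popular compression methods, broken down into temporal sparsity, gradient sparsity, value bits, position bits and overall compression rate. Entries that can still be read, in row order: temporal sparsity 100%, 100%, 100%, 0.1%, 10%, 0.1%, 10%; gradient sparsity 100%, 100%, 0.1%, 100%, 0.1%, 10%; value bits 1, 8.]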
III Temporal vs Gradient Sparsity
Communication constraints can vary heavily between learning tasks and may not even be consistent throughout one distributed training session. Take, for example, a set of mobile devices jointly training a model with privacy-preserving DSGD [mcmahan2016communication][shokri2015privacy]. Part of the day, the devices might be connected to Wi-Fi, enabling them to frequently exchange weight-updates (still with as small a bit size as possible), while at other times an expensive or limited mobile plan might force the devices to delay their communication. Practical methods should be able to adapt to these fluctuations in communication constraints. We take a holistic view towards communication-efficient distributed deep learning by observing that communication delay and weight-update compression can be viewed as two types of sparsity that both affect the total number of parameters updated throughout training in a multiplicative way (cf. Fig. 3). While compression techniques sparsify individual gradients, communication delay sparsifies the gradient information in time.
Our method is unique in the sense that it allows us to smoothly trade off these two types of sparsity against one another. Figure 3 shows validation errors for ResNet32 trained on CIFAR (model specification in section IV-A) for 60000 iterations at different levels of temporal and gradient sparsity. Along the off-diagonals of the matrix, the total sparsity, defined as the product of temporal and gradient sparsity, remains constant. We observe multiple things: 1.) The validation error remains more or less constant along the off-diagonals of the matrix. 2.) Federated Averaging (purple) and Gradient Dropping/DGC (yellow) are just lines in the two-dimensional space of possible compression methods. 3.) There exists a roughly triangular area of approximately constant error; optimal compression methods lie along the hypotenuse of this triangle. We find this behavior consistently across different model architectures; more examples can be found in the supplement. These results indicate that there exists a fixed communication budget in DSGD, necessary to achieve a certain accuracy. Figure 4 shows validation errors for the same ResNet32 model trained on CIFAR at different levels of total sparsity and different numbers of training iterations. We observe two distinct phases during training: In the beginning (iterations 0 to 30000), when training is performed using a high learning rate, sparsified methods consistently achieve the lowest error and temporally sparsified DSGD tends to outperform purely gradient-sparsified DSGD at all sparsity levels. After the learning rate is decreased by a factor of 10 in iteration 30000, this behavior is reversed and gradient sparsification methods start to perform better than temporally sparsified methods. These results highlight that an optimal compression strategy needs to be able to adapt temporal and gradient sparsity, not only based on the learning task, but also on the current learning stage.
Such an adaptive sparsity can be integrated seamlessly into our SBC framework.
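The notion of total sparsity as a product can be made concrete with a small sketch that enumerates all configurations on a grid sharing the same total sparsity. The grid levels and the function name are hypothetical choices for illustration; they are not the exact grid used in Figure 3.

```python
import itertools

# Hypothetical grid levels, each expressed as a fraction between 0 and 1.
temporal_levels = [1.0, 0.1, 0.01, 0.001]   # fraction of iterations followed by communication
gradient_levels = [1.0, 0.1, 0.01, 0.001]   # fraction of non-zero entries per weight-update


def configurations_with_total_sparsity(target, rel_tol=1e-9):
    """All (temporal, gradient) pairs whose product matches the target total sparsity."""
    return [(t, g)
            for t, g in itertools.product(temporal_levels, gradient_levels)
            if abs(t * g - target) <= rel_tol * target]


same_budget = configurations_with_total_sparsity(0.001)
```

The pairs returned for one target all spend the same communication budget; pure communication delay and pure gradient sparsification are the two end points, and mixed configurations lie in between.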
IV Experiments
IV-A Networks and Datasets
We evaluate our method on commonly used convolutional and recurrent neural networks with millions of parameters, which we train on well-studied data sets that contain up to multiple millions of samples. Throughout all of our experiments, we fix the number of clients to 4 and split the training data among the clients in a balanced way (the number of training samples and their distribution is homogeneous among the clients).
Image Classification: We run experiments for LeNet5-Caffe (footnote 2: a modified version of LeNet5 from [lecun1998gradient], see supplement) on MNIST [lecun1998mnist], ResNet32 [he2016deep] on CIFAR-10 [krizhevsky2014cifar] and ResNet50 on ILSVRC2012 (ImageNet) [deng2009imagenet]. We split the training data randomly into 4 shards of equal size, and assign one shard to every one of the 4 clients. The MNIST model is trained using the Adam optimizer [kingma2014adam], while the other models are trained using momentum SGD. Learning rate, weight initialization and data augmentation are as in the respective papers.
Language Modeling: We experiment with multilayer sequence-to-sequence LSTM models as described in [zaremba2014recurrent] on the Penn Treebank corpus (PTB) [marcus1993building] and the Shakespeare dataset for next-word and next-character prediction. The PTB dataset consists of a sequence of 923000 training and 82000 validation words. We restrict the vocabulary to the most common 10000 words, plus an additional token for all words that are less frequent, and train a two-layer LSTM model with 650 hidden units ("WordLSTM"). The Shakespeare dataset consists of the complete works of William Shakespeare [shakespeare2014complete] concatenated to a sequence of 5072443 training and 105675 test characters. The number of different characters in the dataset is 98. We train the two-layer "CharLSTM" with 200 hidden units. For both datasets, we split the sequences of training symbols each into four subsequences of equal length and assign every client one of these subsequences.
While the models we use in our experiments do not fully achieve state-of-the-art results on the respective tasks and datasets, they are still sufficient for the purpose of evaluating our compression method and demonstrate that our method works well with common regularization techniques such as batch normalization [ioffe2015batch] and dropout [srivastava2014dropout]. A complete description of models and hyperparameters can be found in the supplement.
IV-B Results
We experiment with three configurations of our method: SBC (1) uses no communication delay and a gradient sparsity of 0.1%, SBC (2) uses 10 iterations of communication delay and 1% gradient sparsity, and SBC (3) uses 100 iterations of communication delay and 1% gradient sparsity. Our decision for these points on the 2D grid of possible configurations is somewhat arbitrary. The experiments with SBC (1) serve the purpose of enabling us to directly compare our 0-value-bit quantization to the 32-value-bit Gradient Dropping (with momentum correction and momentum factor masking [lin2017deep]).
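| Method | Baseline | Gradient Dropping | Federated Averaging | SBC (1) | SBC (2) | SBC (3) |
|---|---|---|---|---|---|---|
| LeNet5-Caffe @ MNIST | | | | | | |
| Accuracy | 0.9946 | 0.994 | 0.994 | 0.994 | 0.994 | 0.991 |
| Compression | | | | | | |
| ResNet32 @ CIFAR-10 | | | | | | |
| Accuracy | 0.926 | 0.927 | 0.919 | 0.923 | 0.919 | 0.922 |
| Compression | | | | | | |
| ResNet50 @ ImageNet | | | | | | |
| Accuracy | 0.737 | 0.739 | 0.724 | 0.735 | 0.737 | 0.728 |
| Compression | | | | | | |
| WordLSTM @ PTB | | | | | | |
| Perplexity | 89.16 | 89.39 | 88.59 | 89.32 | 88.47 | 89.31 |
| Compression | | | | | | |
| CharLSTM @ Shakespeare | | | | | | |
| Perplexity | 3.635 | 3.639 | 3.904 | 3.671 | 3.782 | 4.072 |
| Compression | | | | | | |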
Table II lists compression rates and final validation accuracies achieved by different compression methods when applied to the training of neural networks on 5 different datasets. The number of iterations (forward-backward passes) is held constant for all methods. On all benchmarks, our methods perform comparably to the baseline while communicating significantly fewer bits.
Figure 5 shows convergence speed in terms of iterations (left) and communicated bits (right) respectively for ResNet50 trained on ImageNet. The convergence speed is only marginally affected by the different compression methods. In the first 30 epochs SBC (3) even achieves the highest accuracy while using far fewer bits than the baseline. In total, SBC (3) reduces the upstream communication on this benchmark from 125 terabytes to 3.35 gigabytes for every participating client. After the learning rate is lowered in epochs 30 and 60, progress slows down for SBC (3) relative to the methods which do not use communication delay. In direct comparison, SBC (1) performs very similarly to Gradient Dropping while using fewer bits.
Figure 6 shows convergence speed in terms of iterations (left) and communicated bits (right) respectively for WordLSTM trained on PTB. While Federated Averaging and SBC (3) initially slow down convergence in terms of iterations, all models converge to approximately the same perplexity after around 60 epochs. Throughout all experiments, SBC (2) performs very similarly to SBC (1) in terms of convergence speed and final accuracy, while maintaining a compression edge over it.
V Conclusion
The gradient information for training deep neural networks with SGD is highly redundant (see e.g. [lin2017deep]). We exploit this fact to the extreme by combining three powerful compression strategies and are able to achieve compression gains of up to four orders of magnitude with only a slight decrease in accuracy. We show through experiments that communication delay and gradient sparsity can be viewed as two independent types of sparsity that have similar effects on the convergence speed when introduced into distributed SGD. We would like to highlight that in no case did we modify the hyperparameters of the respective baseline models to accommodate our method. This demonstrates that our method is easily applicable. Note, however, that an extensive hyperparameter search could further improve the results. Furthermore, our findings in section III indicate that even higher compression rates are possible if we adapt both the kind of sparsity and the sparsity rate to the current training phase. It remains an interesting direction of further research to identify heuristics and theoretical insights that can exploit these fluctuations in the training statistics to guide sparsity towards optimality.
VI Acknowledgements
This work was supported by the Fraunhofer Society through the MPI-FhG collaboration project "Theory & Practice for Reduced Learning Machines". This work was also supported by the German Ministry for Education and Research as Berlin Big Data Center under Grant 01IS14013A.
References
VII Supplement
VII-A Momentum Correction, Warm-up Training and Momentum Masking
Lin et al. [lin2017deep] introduce multiple minor modifications to the vanilla Gradient Dropping method. With these modifications they achieve up to around 1% higher accuracy compared to Gradient Dropping on a variety of benchmarks. Those modifications include:
Momentum correction: Instead of adding the raw gradient to the residual, the momentum-corrected gradient is added. This is used implicitly in our approach, as our weight-updates are already momentum-corrected.
Warm-up Training: The fraction of transmitted gradient entries is decreased exponentially from 25% to 0.1% in the first epochs. We find that warm-up training can indeed speed up convergence in the beginning of training, but ultimately has no effect on the final accuracy of the model. We therefore omit warm-up training in our experiments, as it adds an additional hyperparameter to the method without any real benefit.
Momentum Masking: To avoid stale momentum from carrying the optimization into a wrong direction after a weight-update is performed, Lin et al. suggest setting the momentum to zero for updated weights. We adopt momentum masking in our method.
VII-B Golomb Position Decoding
Algorithm 4 describes the decoding of a binary sequence produced by Golomb Position Encoding (see main paper). Since the shapes of all weight tensors are known to both the server and all clients, we can omit the shape information in both encoding and decoding.
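A decoder for such a bit sequence can be sketched as follows. This is an illustrative assumption about the bit layout, not a transcription of Algorithm 4: each non-zero element is taken to be coded as a unary quotient (ones terminated by a zero) followed by a `b`-bit binary remainder, and the decoded gap counts the zeros skipped since the previous non-zero element. The tensor size and the mean value `mu` are assumed to be known to the receiver.

```python
import numpy as np


def golomb_decode(bitstring, b, size, mu):
    """Rebuild a flat sparse binary tensor from Golomb-coded gaps between non-zero positions."""
    out = np.zeros(size)
    pos, i = -1, 0
    while i < len(bitstring):
        q = 0
        while bitstring[i] == "1":    # unary part: count ones up to the terminating zero
            q += 1
            i += 1
        i += 1                        # skip the terminating '0'
        r = int(bitstring[i:i + b], 2) if b > 0 else 0
        i += b
        pos += q * 2 ** b + r + 1     # advance by the decoded gap plus one
        out[pos] = mu
    return out


# Hand-built example with b = 6: gaps of 3, 0 and 125 zeros.
message = "0000011" + "0000000" + "10111101"
tensor = golomb_decode(message, b=6, size=200, mu=0.6)
decoded_positions = np.flatnonzero(tensor).tolist()
```

Because the flattened length is passed in separately, the message itself carries no shape information, in line with the remark above.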
VII-C Model Specification
Below, we describe the neural network models used in our experiments. Table III lists the training hyperparameters that were used.
| Model | Iterations | Optimizer | Batch size | LR | LR Decay |
|---|---|---|---|---|---|
| LeNet5-Caffe | 2000 | Adam [kingma2014adam] | 128 | 0.001 | |
| ResNet32 | 60000 | Momentum @ 0.9 | 128 | 0.01 | 0.1 @ 30000, 50000 |
| ResNet50 | 700000 | Momentum @ 0.9 | 32 | 0.1 | 0.1 @ 300000, 600000 |
| WordLSTM | 60000 | Gradient Descent | 5 | 1.0 | 0.8 @ |
| CharLSTM | 16000 | Gradient Descent | 5 | 1.0 | 0.8 @ |

LeNet5-Caffe: The model specification can be downloaded from the Caffe MNIST tutorial page: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt. (Features convolutional layers, fully connected layers, pooling.)
ResNet32, ResNet50: We use the implementation from the official TensorFlow repository: https://github.com/tensorflow/models/tree/master/research/resnet. (Features skip-connections, batch normalization.)
WordLSTM: We use the implementation from the official TensorFlow repository (configuration "medium"): https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb. (Features trainable word-embeddings, multilayer LSTM cells, dropout.)
CharLSTM: We adapt the implementation of WordLSTM to use a smaller vocabulary of 98 symbols by decreasing the size of the embedding. (Features trainable word-embeddings, multilayer LSTM cells, dropout.)
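VII-D Proof of Theorem II.1.
Proof.
It holds that
$$\mathrm{err}(\Delta \tilde{W}_T) = \Big\| \sum_{t=1}^{T} \big(\Delta W_t - \Delta \tilde{W}_t\big) \Big\| = \big\| R_{T-1} + \Delta W_T - \Delta \tilde{W}_T \big\| \tag{6}$$
Since $S$ is a linear subspace of $\mathbb{R}^n$, the orthogonal projection
$$\Delta \tilde{W}_T = \mathrm{Proj}_S\big(R_{T-1} + \Delta W_T\big) \tag{7}$$
uniquely solves the minimization problem in $S$. ∎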
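The two steps of the proof can be spelled out in slightly more detail. The notation below ($\Delta W_t$ for the computed updates, $\Delta \tilde{W}_t$ for the transferred updates, $R_{T-1}$ for the residual, $S$ for the subspace, and the shorthand $v$) is an assumed reconstruction, and the second part uses the standard Pythagorean argument for orthogonal projections onto a linear subspace with the Euclidean norm.

```latex
\begin{align*}
\Big\| \sum_{t=1}^{T} \big(\Delta W_t - \Delta\tilde{W}_t\big) \Big\|
  &= \Big\| \underbrace{\sum_{t=1}^{T-1} \big(\Delta W_t - \Delta\tilde{W}_t\big)}_{=\,R_{T-1}}
     + \Delta W_T - \Delta\tilde{W}_T \Big\|
   = \big\| v - \Delta\tilde{W}_T \big\|,
   \qquad v := R_{T-1} + \Delta W_T . \\[4pt]
\text{For every } s \in S:\quad
\| v - s \|^2
  &= \| v - \mathrm{Proj}_S(v) \|^2 + \| \mathrm{Proj}_S(v) - s \|^2
  \;\geq\; \| v - \mathrm{Proj}_S(v) \|^2 ,
\end{align*}
```

Equality holds only for $s = \mathrm{Proj}_S(v)$, because $v - \mathrm{Proj}_S(v)$ is orthogonal to every element of $S$; this is the uniqueness claimed in the theorem.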
VII-E Additional Results
Figure 7 shows convergence speed in terms of iterations (left) and communicated bits (right) respectively for ResNet32 trained on CIFAR. Sparse Binary Compression can train the model to approximately baseline accuracy using significantly fewer bits. SBC (3) trains the model to almost baseline accuracy (0.4% degradation) while using far fewer bits (cf. Table II).
Figure 8 shows convergence speed in terms of iterations (left) and communicated bits (right) respectively for CharLSTM trained on Shakespeare. Sparse Binary Compression can train the model to approximately baseline accuracy while using significantly fewer bits. SBC (1) even achieves the highest accuracy using fewer bits than the baseline (cf. Table II). SBC (3), however, shows non-monotonic convergence behavior on this benchmark. It might be that SBC (3) falls below the total communication budget necessary for this learning task (cf. section III).
Figure 9 shows validation error for WordLSTM trained on PTB at different levels of gradient sparsity and temporal sparsity. The total sparsity, defined as the product of temporal and gradient sparsity, remains constant along the diagonals of the matrix. We observe that different forms of sparsity perform best during different stages of training. Phrased differently, this means that there is not one optimal sparsity setup; rather, sparsity needs to be adapted to the current training phase to achieve optimal compression.