signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant
Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning—signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. We model adversaries as those workers who may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework—with the parameter server housed entirely on one machine—led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.
Jeremy Bernstein†, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar
Caltech, Nanjing University of Aeronautics and Astronautics, UC Irvine
†JB was primary contributor for theory. JZ was primary contributor for large-scale experiments.
1 Introduction
The most powerful supercomputer in the world is currently a cluster of over 27,000 GPUs at Oak Ridge National Labs (ornl). Distributed algorithms designed for such large-scale systems typically involve both computation and communication: worker nodes compute intermediate results locally, before sharing them with their peers. When devising new machine learning algorithms for distribution over networks of thousands of workers, we posit the following desiderata:
fast algorithmic convergence (D1);
good generalisation performance (D2);
communication efficiency (D3);
robustness to network faults (D4).
When seeking an algorithm that satisfies all four desiderata D1–4, inevitably some tradeoff must be made. Stochastic gradient descent (SGD) naturally satisfies D1–2, and this has buoyed recent advances in deep learning. Yet when it comes to large neural network models with hundreds of millions of parameters, distributed SGD can suffer large communication overheads. To make matters worse, any faulty SGD worker can corrupt the entire model at any time by sending an infinite gradient, meaning that SGD without modification is not robust.
A simple algorithm with aspirations towards all desiderata D1–4 is as follows: workers send the sign of their gradient up to the parameter server, which aggregates the signs and sends back only the majority decision. We refer to this algorithm as signSGD with majority vote. All communication to and from the parameter server is compressed to one bit, so the algorithm certainly gives us D3. What’s more, in deep learning folklore sign based methods are known to perform well, indeed inspiring the popular RMSprop and Adam optimisers (balles2018dissecting), giving hope for D1. As far as robustness goes, aggregating gradients by a majority vote denies any individual worker too much power, suggesting it may be a natural way to achieve D4.
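The voting scheme just described can be sketched in a few lines. This is a minimal NumPy illustration of the update rule only (function names and the tie-breaking convention at zero are our choices, not the paper's implementation):

```python
import numpy as np

def sign(x):
    # map each coordinate to +1 or -1 (ties at zero broken towards +1)
    return np.where(x >= 0, 1, -1)

def majority_vote_step(worker_grads, params, lr):
    """One step of signSGD with majority vote.

    Each worker transmits only sign(grad); the server sums the sign
    vectors and broadcasts the sign of the sum, i.e. the coordinate-wise
    majority decision, which every worker then applies.
    """
    votes = sum(sign(g) for g in worker_grads)  # elementwise vote tally
    decision = sign(votes)                      # majority per coordinate
    return params - lr * decision

# toy usage: three workers; two of three agree on each coordinate's sign
grads = [np.array([0.5, -2.0]), np.array([1.5, -0.1]), np.array([-0.2, 3.0])]
params = np.array([0.0, 0.0])
new_params = majority_vote_step(grads, params, lr=0.1)  # → [-0.1, 0.1]
```

Note that both legs of communication carry one bit per coordinate: workers send sign vectors up, and the server sends a single sign vector back.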
In this work, we make the above aspirations rigorous. Whilst D3 is immediate, we provide the first convergence guarantees for signSGD in the mini-batch setting, giving theoretical grounds for D1. We show theoretically how the behaviour of signSGD changes as gradients move from high to low signal-to-noise ratio. We also extend the theory of majority vote to show that it achieves Byzantine fault tolerance assuming that adversaries cannot cooperate. A distributed algorithm is Byzantine fault tolerant (krum) if its convergence is robust when up to 50% of workers behave adversarially. This is a relatively strong property that often entails desirable weaker properties, such as robustness to a corrupted worker sending random bits, or an outdated worker sending stale gradients. Byzantine fault tolerance is therefore not just a security property, but also confers robustness to a wide variety of plausible network faults, giving us D4. Assuming non-cooperative adversaries is an interesting failure model, though not the most general one.
Next, we embark on a large-scale empirical validation of our theory. We implement majority vote in the Pytorch deep learning framework, using CUDA kernels to bit-pack sign tensors down to one bit per component. Our results provide experimental evidence for D1–D4. Comparing our framework to NCCL (the state-of-the-art communications library), we were able to speed up Imagenet training by 25% when distributing over 7 to 15 AWS p3.2xlarge machines, albeit at a slight loss in generalisation.
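The bit-packing idea is straightforward to illustrate on the CPU with NumPy (the actual system uses CUDA kernels; this sketch, with our own function names, only shows the encoding): eight sign bits fit in each transmitted byte, a 32x saving over float32 gradients.

```python
import numpy as np

def pack_signs(grad):
    # encode one sign bit per coordinate: 1 for non-negative, 0 for negative
    bits = (grad >= 0).astype(np.uint8)
    return np.packbits(bits), grad.size  # 8 signs per byte, plus true length

def unpack_signs(packed, n):
    bits = np.unpackbits(packed)[:n]     # drop the zero-padding bits
    return np.where(bits == 1, 1, -1).astype(np.int8)

g = np.array([0.3, -1.2, 4.0, -0.5, -0.1, 2.2, 0.0, -7.0, 1.0])
packed, n = pack_signs(g)
restored = unpack_signs(packed, n)
# 9 signs occupy 2 bytes here, versus 36 bytes for the float32 gradient
```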
Finally, in an interesting twist, the theoretical tools we develop may be brought to bear on a seemingly unrelated problem in the machine learning literature. adam-non-converge proved that the extremely popular Adam optimiser in general does not converge in the mini-batch setting. This result belies the success of the algorithm in a wide variety of practical applications. signSGD is equivalent to a special case of Adam, and we establish the convergence rate of mini-batch signSGD for a large class of practically realistic objectives. Therefore, we expect that these tools should carry over to help understand the success modes of Adam. Our insight is that gradient noise distributions in practical problems are often unimodal and symmetric because of the Central Limit Theorem, yet adam-non-converge’s construction relies on bimodal noise distributions.
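To make the equivalence concrete: setting Adam's hyperparameters $\beta_1 = \beta_2 = \epsilon = 0$ collapses the moment estimates to $m_t = g_t$ and $v_t = g_t^2$, so the update reduces coordinate-wise to the sign of the gradient (a standard observation, sketched here for the reader):

```latex
% Adam update with \beta_1 = \beta_2 = \epsilon = 0 reduces to signSGD
x_{t+1} = x_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}
        = x_t - \eta \, \frac{g_t}{\sqrt{g_t^{\,2}}}
        = x_t - \eta \, \mathrm{sign}(g_t)
```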
2 Related Work
For decades, neural network researchers have adapted biologically inspired algorithms for efficient hardware implementation. Hopfield2554, for example, considered taking the sign of the synaptic weights of his memory network for readier adaptation into integrated circuits. This past decade, neural network research has focused on training feedforward networks by gradient descent (mafia). It is natural to ask what practical efficiency may accompany simply taking the sign of the backpropagated gradient. In this section, we explore related work pertaining to this question.
Deep learning: whilst stochastic gradient descent (SGD) is the workhorse of machine learning (robbins1951), algorithms like RMSprop (tieleman_rmsprop_2012) and Adam (kingma_adam:_2015) are also extremely popular neural net optimisers. These algorithms have their roots in the Rprop optimiser (riedmiller_direct_1993), which is a sign-based method similar to signSGD except for a component-wise adaptive learning rate.
Non-convex optimisation: parallel to (and oftentimes in isolation from) advances in deep learning practice, a sophisticated optimisation literature has developed. nesterov_cubic_2006 proposed cubic regularisation as an algorithm that can escape saddle points and provide guaranteed convergence to local minima of non-convex functions. This has been followed up by more recent works such as Natasha (allen-zhu_natasha_2017) that use other theoretical tricks to escape saddle points. The relevance of these works to deep learning is still unclear, since it is not known to what extent saddle points are an obstacle in practical problems. We avoid this issue altogether and satisfy ourselves with establishing convergence to critical points.
Gradient compression: prior work on gradient compression generally falls into two camps. In the first camp, algorithms like QSGD (QSGD), TernGrad (wen2017terngrad) and Atomo (atomo) use stochastic quantisation schemes to ensure that the compressed stochastic gradient remains an unbiased approximation to the true gradient. These works are therefore able to bootstrap existing SGD convergence theory. In the second camp, more heuristic algorithms like 1BitSGD (seide_1-bit_2014) and deep gradient compression (dgc) pay less attention to theoretical guarantees and focus more on practical performance. These algorithms track quantisation errors and feed them back into subsequent updates. The commonality between the two camps is an effort to, one way or another, correct for bias in the compression.
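The first camp's unbiasedness can be illustrated with a minimal sketch of one-level stochastic sign quantisation in the spirit of QSGD (the function name and simplification to a single quantisation level are ours): each coordinate is kept with probability proportional to its magnitude, so the compressed gradient equals the true gradient in expectation.

```python
import numpy as np

def stochastic_quantise(g, rng):
    """One-level stochastic quantisation (QSGD-style sketch).

    Coordinate i becomes ||g|| * sign(g_i) with probability |g_i| / ||g||,
    and 0 otherwise, making the output unbiased: E[q(g)] = g.
    """
    norm = np.linalg.norm(g)
    if norm == 0:
        return np.zeros_like(g)
    p = np.abs(g) / norm                  # keep-probability per coordinate
    keep = rng.random(g.shape) < p
    return norm * np.sign(g) * keep

# empirical check of unbiasedness: the sample mean approaches g
rng = np.random.default_rng(0)
g = np.array([0.6, -0.8])
samples = np.mean([stochastic_quantise(g, rng) for _ in range(20000)], axis=0)
```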
signSGD with majority vote takes a different approach to these two existing camps. In directly employing the sign of the stochastic gradient, the algorithm unabashedly uses a biased approximation of the stochastic gradient. ssd and bernstein provide theoretical and empirical evidence that signed gradient schemes can converge well in spite of their biased nature. Their theory only applies in the large batch setting, however, meaning the theoretical results are less relevant to deep learning practice. Still, bernstein showed promising experimental results in the small batch setting. An appealing feature of majority vote is that it naturally compresses both directions of communication between workers and parameter server. As far as we are aware, all existing gradient compression schemes lose compression before scattering results back to workers.
Byzantine fault tolerant optimisation: the problem of modifying SGD to make it Byzantine fault tolerant has recently attracted interest in the literature. For example, krum proposed Krum, which operates by detecting and excluding outliers in the gradient aggregation. BSGD proposed ByzantineSGD, which instead focuses on detecting and eliminating adversaries. Both strategies clearly incur overheads, and eliminating adversaries precludes the possibility that they might reform. Majority vote is a simple algorithm that avoids these problems.
3 Theory
3.1 Assumptions
We aim to develop an optimisation theory that is relevant for real problems in deep learning. For this reason, we are careful about the assumptions we make. For example, we do not assume convexity because neural network loss functions are typically not convex. Though we allow our objective function to be non-convex, we insist on a lower bound to enable meaningful convergence results.
Assumption 1 (Lower bound).
For all $x$ and some constant $f^*$, we have objective value $f(x) \geq f^*$.
Our next two assumptions of Lipschitz smoothness and bounded variance are standard in the stochastic optimisation literature (allen-zhu_natasha_2017). That said, we give them in a component-wise form. This allows our convergence results to encode information not just about the total noise level and overall smoothness, but also about how these quantities are distributed across dimension.
Assumption 2 (Smooth).
Let $g(x)$ denote the gradient of the objective $f$ evaluated at point $x$. Then for all $x, y$ we require that, for some vector of non-negative constants $\vec{L} = [L_1, \ldots, L_d]$,
$$\left| f(y) - \left[ f(x) + g(x)^T (y - x) \right] \right| \leq \frac{1}{2} \sum_i L_i (y_i - x_i)^2.$$
Assumption 3 (Variance bound).
Upon receiving query $x \in \mathbb{R}^d$, the stochastic gradient oracle gives us an independent, unbiased estimate $\tilde{g}(x)$ that has coordinate bounded variance:
$$\mathbb{E}\left[\tilde{g}(x)\right] = g(x), \qquad \mathbb{E}\left[ \left( \tilde{g}(x)_i - g(x)_i \right)^2 \right] \leq \sigma_i^2,$$
for a vector of non-negative constants $\vec{\sigma} = [\sigma_1, \ldots, \sigma_d]$.
Our final assumption is non-standard. We assume that the gradient noise is unimodal and symmetric. Gaussian noise is clearly a special case. Note that even for a moderate mini-batch size, we expect the central limit theorem to kick in, rendering typical gradient noise distributions close to Gaussian. See Figure 2 for noise distributions measured whilst training resnet18 on Cifar-10.
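The central-limit intuition is easy to check numerically: even when per-example gradient noise is bimodal, averaging over a mini-batch yields noise that is nearly Gaussian. A small illustrative simulation (the batch size and noise distribution are our choices, not measurements from the paper):

```python
import numpy as np

# per-example "gradient noise" drawn from a sharply bimodal distribution
rng = np.random.default_rng(0)
per_example = rng.choice([-1.0, 1.0], size=(100000, 64))  # batch size 64

# the mini-batch gradient averages 64 per-example gradients, so by the
# central limit theorem its noise is approximately Gaussian
minibatch_noise = per_example.mean(axis=1)

# near-zero skewness and excess kurtosis are Gaussian signatures
z = (minibatch_noise - minibatch_noise.mean()) / minibatch_noise.std()
skew = np.mean(z**3)
kurt = np.mean(z**4) - 3.0
```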
Assumption 4 (Unimodal, symmetric gradient noise).
At any given point $x$, each component of the stochastic gradient vector $\tilde{g}(x)$ has a unimodal distribution that is also symmetric about the mean.
Showing how to work with this assumption is a key theoretical contribution of this work. Combining Assumption 4 with an old tail bound of gauss yields Lemma LABEL:lem:symm, which will be crucial for guaranteeing mini-batch convergence of signSGD. As will be explained in Section LABEL:sec:challegnge, this result also constitutes a convergence proof for a parameter regime of Adam. This suggests that Assumption 4 may more generally be a theoretical fix for adam-non-converge’s non-convergence proof of mini-batch Adam, a fix which does not involve modifying the Adam algorithm itself.
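For reference, the classical tail bound of gauss for unimodal random variables takes the following form, where $\nu$ is the mode of $X$ and $\tau^2 = \mathbb{E}[(X - \nu)^2]$ (we state the inequality as it appears in the classical literature; the lemma specialises it to the gradient noise):

```latex
% Gauss's inequality for a unimodal random variable X with mode \nu
P\left( |X - \nu| > k \right) \;\leq\;
\begin{cases}
\dfrac{4\tau^2}{9k^2} & \text{if } k \geq \dfrac{2\tau}{\sqrt{3}}, \\[8pt]
1 - \dfrac{k}{\sqrt{3}\,\tau} & \text{otherwise.}
\end{cases}
```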
3.2 Mini-Batch Convergence of signSGD
With our assumptions in place, we move on to presenting our theoretical results, which are all proved in Appendix LABEL:app:proofs. Our first result establishes the mini-batch convergence behaviour of signSGD. We will first state the result and make some remarks. We provide intuition for the proof in Section LABEL:sec:challegnge.