Zeno++: Robust Fully Asynchronous SGD
Abstract
We propose Zeno++, a new robust asynchronous Stochastic Gradient Descent (SGD) procedure which tolerates Byzantine failures of the workers. In contrast to previous work, Zeno++ removes some unrealistic restrictions on worker-server communications, allowing for fully asynchronous updates from anonymous workers, arbitrarily stale worker updates, and the possibility of an unbounded number of Byzantine workers. The key idea is to estimate the descent of the loss value after the candidate gradient is applied, where large descent values indicate that the update results in optimization progress. We prove the convergence of Zeno++ for non-convex problems under Byzantine failures. Experimental results show that Zeno++ outperforms existing approaches.
1 Introduction
Synchronous training and asynchronous training are the two most common paradigms of distributed machine learning. Synchronous training periodically blocks the global update at the server until all the workers respond. In contrast, in asynchronous training, the server updates the global model immediately after a worker responds. Theoretical and experimental analysis (Dutta et al., 2018) suggests that synchronous training is more stable and less noisy, but can also be slowed down by the global barrier across all the workers. Asynchronous training is generally faster, but needs to address the instability and noise caused by staleness. In this paper, we focus on asynchronous training.
We study the security of distributed asynchronous Stochastic Gradient Descent (SGD) in a centralized worker-server architecture, also known as the Parameter Server (PS) architecture. In the PS architecture, there are server nodes and worker nodes. When combined with asynchronous SGD, each worker pulls the global model from the servers, estimates the gradients using the local portion of the training data, then sends the gradient estimates to the servers. The servers update the model as soon as a new gradient is received from any worker.
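To make the communication pattern concrete, the following minimal sketch (our own illustration in plain Python, with threads standing in for worker processes and a toy quadratic loss; all names are hypothetical) shows a server that applies an update as soon as any worker pushes a gradient, without waiting for the others:

```python
import queue
import threading
import time

import numpy as np

grad_queue = queue.Queue()  # gradients pushed by workers, consumed by the server
x = np.zeros(10)            # global model parameters held by the server
gamma = 0.1                 # learning rate

def worker(local_data, stop):
    # Each worker repeatedly pulls the (possibly stale) global model, estimates a
    # gradient on its local data, and pushes it without waiting for other workers.
    while not stop.is_set():
        x_local = x.copy()                             # pull the current global model
        g = 2.0 * (x_local - local_data.mean(axis=0))  # toy gradient of a quadratic loss
        grad_queue.put(g)                              # push the gradient estimate
        time.sleep(0.001)

def server(num_updates):
    # The server updates the model immediately whenever any gradient arrives.
    global x
    for _ in range(num_updates):
        g = grad_queue.get()
        x = x - gamma * g

stop = threading.Event()
workers = [threading.Thread(target=worker, args=(np.random.randn(32, 10), stop))
           for _ in range(4)]
for w in workers:
    w.start()
server(num_updates=100)
stop.set()
for w in workers:
    w.join()
```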
The security of machine learning has gained increasing attention in recent years. In particular, tolerance to Byzantine failures (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Feng et al., 2014; Su and Vaidya, 2016a, b; Xie et al., 2018b; Alistarh et al., 2018; Cao and Lai, 2018) has become an important topic in the distributed machine learning literature. Byzantine failures are well-studied in distributed systems (Lamport et al., 1982a). However, in distributed machine learning, Byzantine failures have unique properties. In brief, the goal of Byzantine workers is to prevent convergence of the model training. By construction, Byzantine failures (Lamport et al., 1982b) assume the worst case, i.e., the Byzantine workers can behave arbitrarily. Such failures may be caused by a variety of reasons including but not limited to: hardware/software bugs, vulnerable communication channels, poisoned datasets, or malicious attackers. To make things worse, groups of Byzantine workers can collude, potentially resulting in more harmful attacks. It is also clear that, as the worst case, Byzantine failures generalize benign failures such as hardware or software errors.
Unlike previous work (Damaskinos et al., 2018), we tackle Byzantine tolerance in a more general scenario. The Byzantine tolerance of asynchronous SGD is challenging in this case because of:

Asynchrony. The lack of synchrony incurs additional noise for the stochastic gradients. Such noise makes it more difficult to distinguish the Byzantine gradients from the benign ones, especially as Byzantine behavior may exacerbate staleness.

Unpredictable successive updates. The lack of synchronous scheduling makes it possible for the server to receive updates from Byzantine workers successively. Thus, even in the standard scenario, where less than half of the workers are Byzantine, the server can be suffocated by successive Byzantine gradients.

Unbounded number of Byzantine workers. For Byzantine tolerance in fully asynchronous training, the assumption of a bounded number of Byzantine workers is unrealistic. In Byzantine-tolerant synchronous training (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Xie et al., 2018b, a), the servers can compare the candidate gradients with each other, and utilize the majority assumption to filter out the harmful gradients, or use robust aggregation to bound the error. However, such strategies are infeasible in asynchronous training, since there is nothing to compare to or aggregate. Aggregating the successive gradients is also meaningless since the successive gradients could all be pushed by the same Byzantine worker. Furthermore, although most of the previous work (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Feng et al., 2014; Su and Vaidya, 2016a, b; Alistarh et al., 2018; Cao and Lai, 2018) assumes a majority of honest workers, this requirement is not guaranteed to be satisfied in practice.
The key idea of our approach is to estimate the descent of the loss value after the candidate gradient is applied to the model parameters, based on the Byzantine-tolerant synchronous SGD algorithm Zeno (Xie et al., 2018b). Intuitively, if the loss value decreases, the candidate gradient is likely to result in optimization progress. For computational efficiency, we also propose a lazy update.
To the best of our knowledge, this paper is the first to theoretically and empirically study Byzantine-tolerant fully asynchronous SGD with anonymous workers, and potentially an unbounded number of Byzantine workers. In summary, our contributions are:

We propose Zeno++, a new approach for Byzantine-tolerant fully asynchronous SGD with anonymous workers.

We show that Zeno++ tolerates Byzantine workers without any limit on either the staleness or the number of Byzantine workers.

We prove the convergence of Zeno++ for non-convex problems.

Experimental results validate that 1) existing algorithms may fail in practical scenarios, and 2) Zeno++ gracefully handles such cases.
2 Related work
Most of the existing Byzantine-tolerant SGD algorithms focus on synchronous training. Chen et al. (2017); Su and Vaidya (2016a, b); Yin et al. (2018); Xie et al. (2018a) use robust statistics (Huber, 2011) including the geometric median, coordinate-wise median, and trimmed mean as Byzantine-tolerant aggregation rules. Blanchard et al. (2017); Mhamdi et al. (2018) propose Krum and its variants, which select the candidates with minimal local sum of Euclidean distances. Alistarh et al. (2018) utilize historical information to identify harmful gradients. Chen et al. (2018) use coding theory and majority voting to recover correct gradients. Most of these synchronous algorithms assume that most of the workers are non-Byzantine. However, in practice, there are no guarantees that the number of Byzantine workers can be controlled. Xie et al. (2018b); Cao and Lai (2018) propose synchronous SGD algorithms for an unbounded number of Byzantine workers.
Recent years have witnessed an increasing number of large-scale machine learning algorithms, including asynchronous SGD (Zinkevich et al., 2009; Lian et al., 2018; Zheng et al., 2017; Zhou et al., 2018). Damaskinos et al. (2018) proposed Kardam, which to our knowledge is the only prior work to address Byzantine-tolerant asynchronous training. Kardam utilizes the Lipschitzness of the gradients to filter out outliers. However, Kardam assumes a threat model much weaker than ours. The major differences in the threat model are listed as follows:

Verification of worker identity. Unlike Kardam, we do not require verifying the identities of the workers when the server receives gradients. Kardam uses the so-called empirical Lipschitz coefficient to test the benignity of the gradient sent by a specific worker. Such a mechanism keeps a record of the empirical Lipschitz coefficient of each worker. Thus, whenever a gradient is received, the Kardam server must be able to identify the index of the worker that sent it. However, since Byzantine workers can behave arbitrarily, they can fake their identities/indices when sending gradients to the servers. Thus, Kardam assumes a threat model much weaker than the traditional Byzantine failure/threat model. Note that for synchronous training, the server can partially counter such index-spoofing attacks by simply filtering out all the gradients with duplicated indices. However, such an approach is infeasible for asynchronous training.

Bounded staleness of workers/limit of successive gradients. Unlike Kardam, we do not require bounded staleness of the workers. Kardam requires that the number of gradients successively received from a single worker is bounded above. To be more specific, on the server, any sufficiently long sequence of successively received gradients must contain a minimum number of gradients from honest workers. However, in real-world asynchronous training, such an assumption is very difficult to satisfy.

A majority of honest workers. Unlike Kardam, we do not require a majority of honest workers. Kardam requires that the number of Byzantine workers is less than one-third of the total number of workers, which is a much stronger restriction than the standard setting that allows the number of Byzantine workers to be up to 50% of the total. Zeno++ goes further and tolerates not only 50%, but even a majority of Byzantine workers.
3 Model
We consider the following optimization problem: $\min_x F(x)$, where $F(x) = \frac{1}{n} \sum_{i \in [n]} \mathbb{E}_{z^i \sim S^i} f(x; z^i)$, and $z^i$, for $i \in [n]$, is sampled from the local data $S^i$ on the $i$-th device.
We solve this problem in a distributed manner with $n$ workers. Each worker trains the model on its local data. In each iteration, the $i$-th worker samples $n_w$ independent data points from its dataset $S^i$ and computes the gradient of the local empirical loss, $g^i = \frac{1}{n_w} \sum_{j \in [n_w]} \nabla f(x; z^{i,j})$, where $z^{i,j}$ is the $j$-th sampled data point on the $i$-th worker. When there are no Byzantine failures, the servers update the model whenever a new gradient $g$ is received: $x_{t+1} = x_t - \gamma g$.
When there are Byzantine failures, the received gradient $g$ can be replaced by an arbitrary value (Damaskinos et al., 2018). Formally, we define the threat model as follows.
Definition 1.
(Threat Model). When the server receives a gradient estimator $\tilde{g}$, it is either correct or Byzantine. If sent by an honest worker, $\tilde{g}$ is a correct stochastic gradient of the loss, computed on the worker's local data at a (possibly stale) version of the global model. If sent by a Byzantine worker, $\tilde{g}$ can be assigned an arbitrary value. Thus, we have
$$\tilde{g} = \begin{cases} \text{arbitrary value}, & \text{if sent by a Byzantine worker}, \\ \nabla f(x_{t-\tau}; z), & \text{if sent by an honest worker}, \end{cases}$$
where $\tau \geq 0$ is the staleness of the worker's model copy.
We assume that $q$ out of the $n$ workers are Byzantine, where $q \leq n$. Furthermore, the indices of the Byzantine workers can change across different iterations.
Notation and description:
$n$, $[n]$, $q$: number of workers; the set of integers $\{1, \ldots, n\}$; number of Byzantine workers.
$S^i$, $S_r$: $S^i$ is the training dataset on the $i$-th worker; $S_r$ is the validation dataset on the Zeno++ server.
$n_w$, $n_r$: mini-batch size of the workers; mini-batch size of the Zeno++ server.
$T$, $t$, $\gamma$: number of global iterations; index of the current global iteration; learning rate.
$\rho$, $\epsilon$, $k$: hyperparameters of Zeno++; $k$ is the maximum delay of the validation gradient $v$, also called the "server delay".
$k_w$: maximum delay of the workers, also called the "worker delay" (different from the "server delay" $k$).
$\|\cdot\|$: all norms in this paper are $\ell_2$ norms.
4 Methodology
In this section, we introduce Zeno++, a Byzantine-tolerant asynchronous SGD algorithm based on inner-product validation. Zeno++ is a computationally efficient version of its prototype, Zeno+.
4.1 Zeno+
Like Zeno (Xie et al., 2018b), we compute a score for each candidate gradient estimator using a stochastic zero-order oracle. However, in contrast to existing synchronous SGD methods with majority-based aggregation, we need a hard threshold to decide whether a gradient is accepted, since sorting candidates against each other is not meaningful in asynchronous settings. This descent score is described next.
Definition 2.
(Stochastic Descent Score (Xie et al., 2018b)) Denote $f_r(x) = \frac{1}{n_r} \sum_{j \in [n_r]} f(x; z_j)$, where the $z_j$'s are i.i.d. samples drawn from the validation dataset $S_r$, and $n_r$ is the batch size of $f_r$. For any gradient estimator (correct or Byzantine) $g$, model parameter $x$, learning rate $\gamma$, and a constant weight $\rho > 0$, we define its stochastic descent score as follows:
$$\mathrm{Score}_{\gamma,\rho}(g, x) = f_r(x) - f_r(x - \gamma g) - \rho \|g\|^2.$$
Remark 1.
Note that we assume that the dataset $S_r$ used for computing $f_r$ is different from the training dataset, e.g., $S_r$ can be a separate validation dataset. In other words, $S_r$ is disjoint from the training data on the workers.
The score defined in Definition 2 is composed of two parts: the estimated descent of the loss function, and the magnitude of the update. The score increases when the estimated descent of the loss function, $f_r(x) - f_r(x - \gamma g)$, gets larger. We penalize the score by $\rho \|g\|^2$, so that the change of the model parameters will not be too large. A large descent suggests faster convergence. Observe that even when a gradient is Byzantine, a small magnitude indicates that it will be less harmful to the model.
Using the stochastic descent score, we can set a hard threshold, parameterized by $\epsilon$, to filter out candidate gradients with relatively small scores. The detailed algorithm is outlined in Algorithm 1, and a rough sketch of the filtering rule is given below.
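The sketch below is our own illustration, not the authors' implementation; in particular, the exact form of the acceptance threshold, written here as $-\gamma\epsilon$, is our assumption. It scores a candidate gradient with the zero-order oracle and drops it if the score falls below the hard threshold:

```python
import numpy as np

def zeno_plus_accept(g, x, f_r, gamma, rho, eps):
    """Zeno+-style test: score the candidate gradient g with the validation loss f_r
    (a zero-order oracle on a validation mini-batch) and compare it to a hard threshold."""
    score = f_r(x) - f_r(x - gamma * g) - rho * np.dot(g, g)
    return score >= -gamma * eps  # assumed form of the hard threshold

def server_step(x, g, f_r, gamma, rho, eps):
    # Apply the update only if the candidate passes; otherwise drop the gradient.
    if zeno_plus_accept(g, x, f_r, gamma, rho, eps):
        return x - gamma * g
    return x
```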
4.2 Zeno++
Calculating the stochastic descent score for every candidate gradient can be computationally expensive, since it requires evaluating the validation loss at $x - \gamma g$ for every candidate. To reduce the computation overhead, we approximate it by its first-order Taylor expansion.
Definition 3.
(Approximated Stochastic Descent Score) Denote $v = \nabla f_r(x) = \frac{1}{n_r} \sum_{j \in [n_r]} \nabla f(x; z_j)$, where the $z_j$'s are i.i.d. samples drawn from the validation dataset $S_r$, and $n_r$ is the batch size of $f_r$. For any gradient estimator (correct or Byzantine) $g$, model parameter $x$, learning rate $\gamma$, and a constant weight $\rho > 0$, we approximate its stochastic descent score as follows:
$$\mathrm{Score}_{\gamma,\rho}(g, x) \approx \gamma \langle v, g \rangle - \rho \|g\|^2.$$
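The approximation comes from a first-order Taylor expansion of $f_r$ around $x$ (a short derivation under the definitions above):
$$f_r(x - \gamma g) \approx f_r(x) - \gamma \langle \nabla f_r(x), g \rangle
\quad\Longrightarrow\quad
f_r(x) - f_r(x - \gamma g) - \rho \|g\|^2 \approx \gamma \langle v, g \rangle - \rho \|g\|^2,$$
where $v = \nabla f_r(x)$. This replaces the two loss evaluations of Zeno+ with a single inner product against the validation gradient $v$.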
In brief, Zeno++ is a computationally efficient version of Zeno+ which uses this approximated stochastic descent score, combined with lazy updates of the validation gradient $v$. The detailed algorithm is shown in Algorithm 2. Compared to Zeno, we highlight several new techniques in Zeno++ (Algorithm 2), specially designed for asynchronous training: 1) rescaling the candidate gradient (Line 6); 2) first-order Taylor expansion (Line 7); 3) a hard threshold instead of comparison with other candidates (Line 7); 4) lazy update of $v$ to reduce the computation overhead (Line 9).
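Putting the pieces together, a minimal sketch of the server-side test is shown below (plain Python/NumPy; the function names are ours, and the rescaling rule and acceptance threshold reflect our reading of Algorithm 2 rather than reproducing it verbatim):

```python
import numpy as np

def zeno_pp_accept(g, v, gamma, rho, eps):
    """Zeno++-style test with the approximated descent score.

    g: candidate gradient from an anonymous (possibly Byzantine) worker
    v: lazily updated validation gradient held by the server
    """
    # 1) Rescale the candidate so its magnitude matches the validation gradient.
    g = g * (np.linalg.norm(v) / (np.linalg.norm(g) + 1e-12))
    # 2) First-order approximation of the descent score.
    score = gamma * np.dot(v, g) - rho * np.dot(g, g)
    # 3) Hard threshold instead of comparison against other candidates
    #    (the -gamma*eps form of the threshold is our assumption).
    return score >= -gamma * eps, g

# The validation gradient v is only refreshed occasionally (lazy update), so each
# candidate costs one inner product to validate instead of a full loss evaluation.
```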
Before moving forward, we wish to highlight several practical remarks for Zeno++:

Preparing the validation dataset for Zeno++: The dataset used for calculating $v$ (the validation gradient of Zeno++) can be collected in many ways. It can be a separate validation dataset provided by a trusted third party. Another reasonable choice is for a group of trusted workers to upload local data perturbed by additional noise (to help protect the users' privacy). Typically, the validation dataset is small and different from the training dataset; thus it can only be used to validate the gradients, and cannot be directly used for training, as shown in Section 6.

Scheduling the update of $v$: $v$ is updated in the background, and is only recomputed when the global model has been updated and the server is idle. Another scheduling strategy is to trigger the recomputation of $v$ after every $k$ global iterations; thus, $k$ is the upper bound of the delay of $v$. A reasonable choice is $k = n$, so that ideally $v$ is updated after all the workers respond.

Computational efficiency: We can reduce the computation overhead on the Zeno++ server by decreasing the mini-batch size $n_r$, or by recomputing $v$ less frequently. However, doing so potentially incurs larger noise in $v$, which creates a trade-off; a sketch of such lazy scheduling is shown after this list.
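A minimal sketch of the lazy scheduling, using the refresh-every-$k$-iterations strategy mentioned above (class and method names are ours), is:

```python
class LazyValidationGradient:
    """Caches the validation gradient v and refreshes it at most every k global steps,
    so the staleness of v (the "server delay") is bounded by k."""

    def __init__(self, grad_fn, k):
        self.grad_fn = grad_fn  # computes a mini-batch gradient on the validation set
        self.k = k              # maximum server delay of v
        self.v = None
        self.age = 0

    def get(self, x):
        # Recompute v only when it is missing or has become too stale;
        # otherwise reuse the cached (slightly stale) validation gradient.
        if self.v is None or self.age >= self.k:
            self.v = self.grad_fn(x)
            self.age = 0
        self.age += 1
        return self.v
```

Increasing $k$ or decreasing the validation mini-batch size reduces the server's workload but makes $v$ staler or noisier, which is exactly the trade-off noted above.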
5 Theoretical guarantees
In this section, we prove the convergence of Zeno++ (Algorithm 2) under Byzantine failures. We start with definitions used in the convergence analysis.
Definition 4.
(Smoothness) A differentiable function $f$ is $L$-smooth if there exists $L > 0$ such that, for any $x, y$, $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$.
Definition 5.
(Polyak-Łojasiewicz (PL) inequality) A differentiable function $f$ satisfies the PL inequality (Polyak, 1963) if there exists $\mu > 0$ such that, for any $x$, $\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu \left( f(x) - \inf_y f(y) \right)$.
5.1 Convergence guarantees
We prove the convergence of Algorithm 2 for non-convex problems with the following assumption.
Assumption 1.
(Bounded server delay) For Zeno++, we assume that the delay of the validation gradient $v$ is upper-bounded. Without loss of generality, suppose the current global model is $x_t$, and $v$ was computed at an earlier model $x_{t'}$ with $t' \leq t$. We assume that $t - t' \leq k$, where $k$ is the maximum server delay.
Remark 2.
Zeno++ does not require bounded delay for the workers. The bounded delay requirement in Assumption 1 is only for the validation gradient on the server, not for the workers.
We first analyze the convergence of functions that satisfy the PL inequality.
Theorem 1.
Assume that $F$ and $f_r$ are smooth and satisfy the PL inequality. Assume that, for all $t$, the correct gradients and validation gradients are upper-bounded, and the validation gradients are always non-zero and lower-bounded. Furthermore, we assume that the validation set is close to the training set, which implies bounded variance. With appropriate choices of $\gamma$ and $\rho$, after $T$ global updates, Algorithm 2 converges to a global optimum:
Remark 3.
The assumption of the lower-bounded validation gradient is necessary: we need $v \neq 0$ so that the normalization in Line 6 and the inner product in Line 7 of Algorithm 2 are feasible. In practice, if a mini-batch yields a zero gradient on the server, we can simply draw additional samples and add them to the mini-batch until the gradient is non-zero.
For general smooth but nonconvex functions, we have the following convergence guarantee.
Theorem 2.
Assume that $F$ and $f_r$ are smooth and potentially non-convex. Assume that, for all $t$, the true gradients and validation gradients are upper-bounded, and the validation gradients are always non-zero and lower-bounded. Furthermore, we assume that the validation set is close to the training set, which implies bounded variance. With appropriate choices of $\gamma$ and $\rho$, after $T$ global updates, Algorithm 2 converges to a critical point:
Furthermore, with a suitable choice of the learning rate, we have
Remark 4.
The hyperparameters $\rho$ and $\epsilon$ control the trade-off between the acceptance ratio and the convergence rate. A large positive value makes the convergence faster, but fewer candidate gradients pass the test of Zeno++. A small positive value increases the acceptance ratio, but may also potentially slow down the convergence or incur larger variance. A larger value improves the convergence rate, but also enlarges the variance. Using a non-zero $\epsilon$ potentially results in negative thresholds, which enlarges the acceptance ratio, but also increases the false negative ratio (the ratio of Byzantine gradients that are not filtered out by Zeno++).
6 Experiments
In this section, we evaluate the proposed algorithm, Zeno++. Note that we do not evaluate the prototype algorithm Zeno+, since its computation overhead is too large for practical settings. Due to space limitations, zoomed figures and additional experiments (including evaluation on an additional label-flipping attack and tests of the sensitivity to hyperparameters) are presented in the appendix.
6.1 Datasets and evaluation metrics
We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky and Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 2 fully connected layers. The detailed network architecture can be found in our submitted source code (to be released upon publication). Of the 50k training images, we randomly extract 2.5k as the validation set for Zeno++; the remaining images are randomly partitioned across all the workers. In each experiment, we launch 10 worker processes. We repeat each experiment 10 times and take the average. Each experiment is composed of 200 epochs, where each epoch is a full pass over the training dataset. We simulate asynchrony by drawing a random delay for each worker from a uniform distribution between zero and the maximum worker delay (which is different from the maximum server delay of Zeno++).
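The delay simulation can be sketched as follows (a hypothetical helper of ours; it only illustrates drawing a per-gradient staleness uniformly between zero and the maximum worker delay):

```python
import random

def stale_model_index(t, max_worker_delay, rng):
    """Index of the (possibly stale) model snapshot used by a worker at global step t.
    The delay is drawn uniformly from {0, ..., max_worker_delay}."""
    delay = rng.randint(0, max_worker_delay)
    return max(t - delay, 0)

rng = random.Random(0)
# Example: at step 20 with maximum worker delay 10, the worker computes its
# gradient on one of the snapshots x[10], ..., x[20].
print(stale_model_index(20, 10, rng))
```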
We use top-1 accuracy on the testing set and the cross-entropy loss on the training set as the evaluation metrics. We also report the false positive rate (FP), which is the ratio of correct gradients that are recognized as Byzantine and filtered out by Zeno++ or the Kardam baseline.
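For clarity, the false positive rate we report can be computed as in the small helper below (our own sketch; the flags are booleans, one per received gradient):

```python
def false_positive_rate(filtered, byzantine):
    """Fraction of correct (non-Byzantine) gradients that were filtered out.

    filtered[i]: True if the i-th received gradient was rejected by the filter
    byzantine[i]: True if the i-th received gradient came from a Byzantine worker
    """
    correct_total = sum(1 for b in byzantine if not b)
    correct_rejected = sum(1 for f, b in zip(filtered, byzantine) if f and not b)
    return correct_rejected / max(correct_total, 1)
```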
6.1.1 Baselines
We use the asynchronous SGD without failures/attacks as the gold standard, which is referred to as AsyncSGD without attack. Since Kardam is the only previous work on Byzantine-tolerant asynchronous SGD, we use it as the baseline.
One may conjecture that Zeno++ is analogous to training on the validation data. To explore this, we consider training only on the validation dataset, which is assumed to be clean data on the server, i.e., we update the model using only gradients computed on the server, without using any workers. We call this baseline Server-only. We fine-tune the learning rate and report the best results for Server-only.
6.2 No attack
We first test the convergence when there are no attacks. In all the experiments, we take the same learning rate, mini-batch size, and hyperparameters of Zeno++. For Kardam, we set its Byzantine-worker parameter so that Kardam assumes there are 2 Byzantine workers. The result is shown in Figure 1. Zeno++ converges a little more slowly than AsyncSGD, but faster than Kardam, especially when the worker delay is large; with the largest tested worker delay, Zeno++ converges much faster than Kardam. Server-only performs badly on both the training and testing data.
6.3 Signflipping attack
We test the Byzantine tolerance to the "sign-flipping" attack, which was proposed in Damaskinos et al. (2018). In such attacks, the Byzantine workers send a negated (sign-flipped) version of the correct gradient to the server. In all the experiments, we take the same learning rate, mini-batch size, and hyperparameters of Zeno++ as above. The results are shown in Figure 2 for different numbers of Byzantine workers. When the number of Byzantine workers is small, Zeno++ converges slightly more slowly than AsyncSGD without attacks, and much faster than Kardam. In fact, we observe that Kardam fails to make progress when the worker delay is large. When the number of Byzantine workers gets larger, the convergence of Zeno++ gets slower, but it still makes reasonable progress, while AsyncSGD and Kardam fail. Note that Kardam performs even worse than Server-only, which means that Kardam is not even as good as training on a single honest worker. Thus, when there are Byzantine workers, distributed training with Kardam is meaningless.
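The attack itself is straightforward to express (a sketch; any additional scaling applied to the flipped gradient in Damaskinos et al. (2018) is left as a free parameter rather than reproduced here):

```python
import numpy as np

def sign_flipping_attack(correct_grad, scale=1.0):
    """Byzantine worker behavior under the sign-flipping attack: push the negated
    (optionally rescaled) gradient instead of the correct one."""
    return -scale * np.asarray(correct_grad)
```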
6.4 Discussion
Kardam performs surprisingly badly in our experiments. The experiments in Damaskinos et al. (2018) focus on dampening staleness when there are no Byzantine failures. For Byzantine tolerance, Damaskinos et al. (2018) only report that Kardam filters out 100% of the Byzantine gradients, which matches the results in our experiments. However, we observe that in addition to filtering out 100% of the Byzantine gradients, Kardam also filters out nearly 100% of the correct gradients. In Figure 2, we report that the false positive rate of Kardam is nearly 99%, which makes the convergence extremely slow. To make things worse, Kardam does not even perform as well as Server-only, which makes distributed training with Kardam meaningless. One reason why Kardam performs badly is that we use a more general threat model in this paper, which does not guarantee an important assumption of Kardam, namely that any sufficiently long sequence of successively received gradients must contain a minimum number of gradients from honest workers. It is clear that this assumption is quite strong: in an asynchronous setting, Byzantine workers can easily send long sequences of erroneous responses. Our approach does not depend on such a strong assumption.
In all the experiments, Zeno++ converges faster than the baselines when there are Byzantine failures. Although the convergence of Zeno++ is slower than AsyncSGD when there are no attacks, we find that it provides a reasonable trade-off between security and convergence speed. In general, larger worker delays and more Byzantine workers add more error and noise to the gradients, which slows down the convergence, because there are fewer valid gradients for the server to use. Zeno++ can filter out most of the harmful gradients at a modest cost.
Note that Server-only is an extreme case that uses only the server and the validation dataset to train the model in a non-distributed manner, and thus is not affected by Byzantine workers. However, the validation data alone are not enough for training, as shown in Figure 1. Similarly, in practice, we can use a small dataset separate from the training data for cross-validation, but we would never directly train the model only on such a validation dataset. Furthermore, as shown in Figure 2, Zeno++ performs much better than Server-only. Thus, we can draw the conclusion that Zeno++ efficiently trains the model on the honest workers in a distributed manner, which is not equivalent to training on the validation dataset only.
On average, the server computes only a small fraction of a mini-batch gradient per iteration, since the validation gradient of Zeno++ is recomputed only once every several iterations. Thus, the workload on the server is much smaller than that of a worker. Furthermore, since the workload on the server and the workers can be parallelized, the computation overhead of the validation gradient can be hidden, so that Zeno++ can benefit from distributed training.
7 Conclusion
We propose a novel Byzantine-tolerant fully asynchronous SGD algorithm: Zeno++. The algorithm provably converges. Our empirical results show good performance compared to previous work. In future work, we will explore variations of our approach for other settings such as federated learning.
References
 Alistarh et al. (2018) D. Alistarh, Z. AllenZhu, and J. Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
 Blanchard et al. (2017) P. Blanchard, R. Guerraoui, J. Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 118–128, 2017.
 Cao and Lai (2018) X. Cao and L. Lai. Robust distributed gradient descent with arbitrary number of byzantine attackers. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6373–6377, 2018.
 Chen et al. (2018) L. Chen, H. Wang, Z. B. Charles, and D. S. Papailiopoulos. Draco: Byzantine-resilient distributed training via redundant gradients. In ICML, 2018.
 Chen et al. (2017) Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. POMACS, 1:44:1–44:25, 2017.
 Damaskinos et al. (2018) G. Damaskinos, E. M. E. Mhamdi, R. Guerraoui, R. Patra, and M. Taziki. Asynchronous byzantine machine learning. arXiv preprint arXiv:1802.07928, 2018.
 Dutta et al. (2018) S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar. Slow and stale gradients can win the race: Errorruntime tradeoffs in distributed sgd. In AISTATS, 2018.
 Feng et al. (2014) J. Feng, H. Xu, and S. Mannor. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.
 Huber (2011) P. J. Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
 Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 Lamport et al. (1982a) L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982a.
 Lamport et al. (1982b) L. Lamport, R. E. Shostak, and M. C. Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4:382–401, 1982b.
 Lian et al. (2018) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 2018.
 Mhamdi et al. (2018) E. M. E. Mhamdi, R. Guerraoui, and S. Rouault. The hidden vulnerability of distributed learning in byzantium. arXiv preprint arXiv:1802.07927, 2018.
 Polyak (1963) B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
 Su and Vaidya (2016a) L. Su and N. H. Vaidya. Faulttolerant multiagent optimization: Optimal iterative distributed algorithms. In PODC, 2016a.
 Su and Vaidya (2016b) L. Su and N. H. Vaidya. Defending non-Bayesian learning against adversarial attacks. arXiv preprint arXiv:1606.08883, 2016b.
 Xie et al. (2018a) C. Xie, O. Koyejo, and I. Gupta. Phocas: Dimensional Byzantine-resilient stochastic gradient descent. arXiv preprint arXiv:1805.09682, 2018a.
 Xie et al. (2018b) C. Xie, O. O. Koyejo, and I. Gupta. Zeno: Byzantine-suspicious stochastic gradient descent. CoRR, abs/1805.10032, 2018b.
 Yin et al. (2018) D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
 Zheng et al. (2017) S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 4120–4129. JMLR.org, 2017.
 Zhou et al. (2018) Z. Zhou, P. Mertikopoulos, N. Bambos, P. W. Glynn, Y. Ye, L.J. Li, and L. FeiFei. Distributed asynchronous optimization with unbounded delays: How slow can you go? In ICML, 2018.
 Zinkevich et al. (2009) M. Zinkevich, J. Langford, and A. J. Smola. Slow learners are fast. In Advances in neural information processing systems, pages 2331–2339, 2009.
Appendix
Appendix A Proofs
A.1 Zeno++
We first analyze the convergence of the functions whose gradients grow as a quadratic function of suboptimality.
Theorem 1.
Assume that $F$ and $f_r$ are smooth and satisfy the PL inequality (potentially non-convex). Assume that, for all $t$, the true gradients and the stochastic gradients are upper-bounded, and the stochastic gradients used for the Zeno++ test are always non-zero and lower-bounded. Furthermore, we assume that the validation set is close to the training set, which implies bounded variance. With appropriate choices of $\gamma$ and $\rho$, after $T$ global updates, Algorithm 2 converges to a global optimum:
Proof.
If any gradient estimator passes the test of Zeno++, then we have
where .
Thus, we have
This term can be upper-bounded using smoothness and the bounded delay:
Again, using smoothness, taking , we have
By telescoping and taking total expectation, after global updates, we have
∎
For general smooth but nonconvex functions, we have the following convergence guarantee.
Theorem 2.
Assume that $F$ and $f_r$ are smooth and potentially non-convex. Assume that, for all $t$, the true gradients and the stochastic gradients are upper-bounded, and the stochastic gradients used for the Zeno++ test are always non-zero and lower-bounded. Furthermore, we assume that the validation set is close to the training set, which implies bounded variance. With appropriate choices of $\gamma$ and $\rho$, after $T$ global updates, Algorithm 2 converges to a critical point:
Furthermore, with a suitable choice of the learning rate, we have
Proof.
Similar to Theorem 1, we have
Using smoothness, taking , we have
Thus, we have
By telescoping and taking total expectation, after global updates, we have
Furthermore, if we take , then we have
∎
A.2 Zeno+
Theorem 3.
Assume that $F$ and $f_r$ are smooth and strongly convex. Assume that, for all $t$, the true gradients and the stochastic gradients are upper-bounded, and the stochastic gradients used for the Zeno test are always non-zero and lower-bounded. Furthermore, we assume that the validation set is close to the training set, which implies bounded variance. With appropriate choices of $\gamma$ and $\rho$, after $T$ global updates, Algorithm 1 converges to a global optimum:
Proof.
Using strong convexity, we have
Note that . Thus, we have
Then, we have
Taking the expectation on both sides, we have
Using smoothness, conditional on , we have
Again, using strong convexity, we have
Thus, we have
Taking , and , we have
Take , we have
By telescoping and taking total expectation, after global updates, we have
∎
Appendix B Additional experiments
B.1 No attack
We first test the convergence when there are no attacks. In all the experiments, we take the same learning rate, mini-batch size, and hyperparameters of Zeno++ as in the main text; for Kardam, we set its Byzantine-worker parameter so that Kardam pretends that there are 2 Byzantine workers. The result is shown in Figure 3. We can see that Zeno++ converges a little more slowly than AsyncSGD, but faster than Kardam, especially when the worker delay is large; with the largest tested worker delay, Zeno++ converges much faster than Kardam.
B.2 Sign-flipping attack
We test the Byzantine tolerance to the "sign-flipping" attack, which was proposed in Damaskinos et al. (2018). In such attacks, the Byzantine workers send a negated (sign-flipped) version of the correct gradient to the server. In all the experiments, we take the same learning rate, mini-batch size, and hyperparameters of Zeno++ as in the main text. The results are shown in Figures 4 and 5, with different numbers of Byzantine workers.
B.3 Label-flipping attack
We test the Byzantine tolerance to label-flipping attacks. Under such attacks, the Byzantine workers compute gradients on training data with "flipped" labels, i.e., each label is replaced by a wrong label according to a fixed flipping rule. Such attacks can be caused by data poisoning or software failures. In all the experiments, we take the same learning rate, mini-batch size, and hyperparameters of Zeno++ as in the main text. The results are shown in Figures 6 and 7, with different numbers of Byzantine workers.
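A sketch of the corresponding Byzantine worker behavior is shown below (the exact flipping rule, mapping label y to 9 - y for the 10 CIFAR-10 classes, is a common instantiation and is our assumption here):

```python
def flip_labels(labels, num_classes=10):
    """Label-flipping attack on the local training data: each label y is replaced by
    (num_classes - 1 - y), e.g., 0 -> 9 and 9 -> 0 for CIFAR-10 (assumed flipping rule).
    The Byzantine workers then compute gradients on these corrupted labels."""
    return [num_classes - 1 - y for y in labels]
```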