Detox: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
Abstract
To improve the resilience of distributed training to worst-case, or Byzantine, node failures, several recent approaches have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and offer only limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can only tolerate a limited number of Byzantine failures. In this work, we present Detox, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. Detox operates in two steps: a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, and has a per-iteration runtime that can be nearly linear in the number of compute nodes. We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that Detox leads to orders of magnitude accuracy and speedup improvements over many state-of-the-art Byzantine-resilient approaches.
Shashank Rajput* (University of Wisconsin-Madison, rajput3@wisc.edu), Hongyi Wang* (University of Wisconsin-Madison, hongyiwang@cs.wisc.edu), Zachary Charles (University of Wisconsin-Madison, zcharles@wisc.edu), Dimitris Papailiopoulos (University of Wisconsin-Madison, dimitris@papail.io)
* Authors contributed equally to this paper and are listed alphabetically.
1 Introduction
To scale the training of machine learning models, gradient computations can often be distributed across multiple compute nodes. After computing these local gradients, a parameter server then averages them, and updates a global model. As the scale of data and available compute power grows, so does the probability that some compute nodes output unreliable gradients. This can be due to power outages, faulty hardware, or communication failures, or due to security issues, such as the presence of an adversary governing the output of a compute node.
Due to the difficulty in quantifying these different types of errors separately, we often model them as Byzantine failures. Such failures are assumed to be able to result in any output, adversarial or otherwise. Unfortunately, the presence of a single Byzantine compute node can result in arbitrarily bad global models when aggregating gradients via their average [5].
In the context of distributed training, there have generally been two distinct approaches to improve Byzantine robustness. The first replaces the gradient averaging step at the parameter server with a robust aggregation step, such as the geometric median and variants thereof [5, 8, 23, 10, 27, 26]. The second approach instead assigns each node redundant gradients, and uses this redundancy to eliminate the effect of Byzantine failures [7, 12, 28].
Both of the above approaches have their own limitations. For the first, robust aggregators are typically expensive to compute and scale super-linearly (in many cases quadratically [19, 10]) with the number of compute nodes. Moreover, such methods often come with limited theoretical guarantees of Byzantine robustness (e.g., only establishing convergence in the limit, or only guaranteeing that the output of the aggregator has positive inner product with the true gradient [5, 19]) and often require strong assumptions, such as bounds on the dimension of the model being trained. On the other hand, redundancy or coding-theoretic based approaches offer strong guarantees of perfect recovery for the aggregated gradients. However, such approaches, in the worst case, require each node to compute $\Omega(q)$ times more gradients, where $q$ is the number of Byzantine machines [7]. This overhead is prohibitive in settings with a large number of Byzantine machines.
Our contributions. In this work, we present Detox, a Byzantine-resilient distributed training framework that first uses computational redundancy to filter out almost all Byzantine gradients, and then performs a hierarchical robust aggregation method. Detox is scalable, flexible, and is designed to be used on top of any robust aggregation method to obtain improved robustness and efficiency. A high-level description of the hierarchical nature of Detox is given in Fig. 2.
Detox proceeds in three steps. First, the parameter server (PS) arranges the compute nodes in groups of $r$, and all nodes in a group compute the same gradients. While this step requires redundant computation at the node level, it will eventually allow for much faster computation at the PS level, as well as improved robustness. After all compute nodes send their gradients to the PS, the PS takes the majority vote of each group of gradients. We show that by setting $r$ to be logarithmic in the number of compute nodes, after the majority vote step only a constant number of Byzantine gradients are still present, even if the number of Byzantine nodes is a constant fraction of the total number of compute nodes. Detox then performs hierarchical robust aggregation in two steps: first, it partitions the filtered gradients into a small number of groups, and aggregates them using simple techniques such as averaging. Second, it applies any robust aggregator (e.g., geometric median [8, 26], Bulyan [19], Multi-Krum [10], etc.) to the averaged gradients to further minimize the effect of any remaining traces of the original Byzantine gradients.
We prove that Detox can obtain orders of magnitude improved robustness guarantees compared to its competitors, and can achieve this at a nearly linear complexity in the number of compute nodes $p$, unlike methods like Bulyan [19] that require runtime that is quadratic in $p$. We extensively test our method in real distributed setups and large-scale settings, showing that by combining Detox with previously proposed Byzantine-robust methods, such as Multi-Krum, Bulyan, and coordinate-wise median, we increase the robustness and reduce the overall runtime of the algorithm. Moreover, we show that under strong Byzantine attacks, Detox can lead to almost a 40% increase in accuracy over vanilla implementations of Byzantine-robust aggregation. A brief performance comparison with some of the current state-of-the-art aggregators is shown in Fig. 2.
Related work.
The topic of Byzantine fault tolerance has been extensively studied since the early 80s by Lamport et al. [16], and deals with worst-case and/or adversarial failures, e.g., system crashes, power outages, software bugs, and adversarial agents that exploit security flaws. In the context of distributed optimization, these failures are manifested through a subset of compute nodes returning flawed or adversarial updates to the master. It is now well understood that first-order methods, such as gradient descent or mini-batch SGD, are not robust to Byzantine errors; even a single erroneous update can introduce arbitrary errors to the optimization variables.
Byzantine-tolerant ML has been extensively studied in recent years [13, 24, 25, 14, 4, 8], establishing that while average-based gradient methods are susceptible to adversarial nodes, median-based update methods can in some cases achieve better convergence, while being robust to some attacks. Although theoretical guarantees are provided in many works, the proposed algorithms in many cases only ensure a weak form of resilience against Byzantine failures, and often fail against strong Byzantine attacks [19]. A stronger form of Byzantine resilience is desirable for most distributed machine learning applications. To the best of our knowledge, Draco [7] and Bulyan [19] are the only proposed methods that guarantee strong Byzantine resilience. However, as mentioned above, Draco requires heavy redundant computation from the compute nodes, while Bulyan requires heavy computation overhead on the parameter server end.
We note that [1] presents an alternative approach that does not fit easily under either category, but requires convexity of the underlying loss function. Finally, [3] examines the robustness of signSGD with a majority vote aggregation, but studies a restricted Byzantine failure setup that only allows for a blind multiplicative adversary.
2 Problem Setup
Our goal is to solve the following empirical risk minimization problem:
$$\min_{w} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$
where $w \in \mathbb{R}^d$ denotes the parameters of a model, and $f_i$ is the loss function on the $i$-th training sample. To approximately solve this problem, we often use mini-batch SGD. First, we initialize at some $w_0$. At iteration $t$, we sample $S_t$ uniformly at random from $\{1, \ldots, n\}$, and then update via
$$w_{t+1} = w_t - \frac{\eta_t}{|S_t|} \sum_{i \in S_t} \nabla f_i(w_t), \qquad (1)$$
where $S_t$ is a randomly selected subset of the $n$ data points and $\eta_t$ is the step size. To perform mini-batch SGD in a distributed manner, the global model $w_t$ is stored at a parameter server (PS) and updated according to (1), i.e., by using the mean of gradients that are evaluated at the compute nodes.
Let $p$ denote the total number of compute nodes. At each iteration $t$, during distributed mini-batch SGD, the PS broadcasts $w_t$ to each compute node. Each compute node $i \in \{1, \ldots, p\}$ is assigned $S_t^i \subseteq S_t$, and then evaluates the sum of gradients
$$g_i = \sum_{j \in S_t^i} \nabla f_j(w_t).$$
The PS then updates the global model via
$$w_{t+1} = w_t - \frac{\eta_t}{|S_t|} \sum_{i=1}^{p} g_i.$$
We note that in our setup we assume that the parameter server is the owner of the data, and has access to the entire data set of size $n$.
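The distributed update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the squared loss, the helper names (`sample_gradient`, `node_gradient_sum`, `ps_update`), and the toy data are all assumptions made for the example.

```python
def sample_gradient(w, x, y):
    """Gradient of the (assumed) squared loss 0.5 * (<w, x> - y)^2 w.r.t. w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def node_gradient_sum(w, samples):
    """Sum of per-sample gradients over the samples assigned to one node."""
    grads = [sample_gradient(w, x, y) for x, y in samples]
    return [sum(col) for col in zip(*grads)]

def ps_update(w, node_sums, batch_size, lr):
    """Add up the node sums, divide by the mini-batch size, take one SGD step."""
    total = [sum(col) for col in zip(*node_sums)]
    return [wi - lr * gi / batch_size for wi, gi in zip(w, total)]

# Hypothetical toy run: two honest nodes, one sample each.
w = [0.0]
assignments = [[([1.0], 1.0)], [([1.0], 1.0)]]
node_sums = [node_gradient_sum(w, s) for s in assignments]
w_next = ps_update(w, node_sums, batch_size=2, lr=0.5)
```

Because the PS only ever sees the per-node sums, any single corrupted sum enters the update directly, which is the weakness the Byzantine setting below formalizes.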
Distributed training with Byzantine nodes. We assume that a fixed subset $Q$ of size $q$ of the $p$ compute nodes are Byzantine. Let $\hat{g}_i$ be the output of node $i$. If $i$ is not Byzantine ($i \notin Q$), we say it is “honest”, in which case its output is $\hat{g}_i = g_i$, where $g_i$ is the true sum of gradients assigned to node $i$. If $i$ is Byzantine ($i \in Q$), its output $\hat{g}_i$ can be any $d$-dimensional vector. The PS receives $\{\hat{g}_i\}_{i=1}^{p}$, and can then process these vectors to produce some approximation to the true gradient update in (1).
We make no assumptions on the Byzantine outputs. In particular, we allow adversaries with full information about the objective $F$ and the model $w_t$, and we allow the Byzantine compute nodes to collude. Let $\epsilon = q/p$ be the fraction of Byzantine nodes. We will assume $\epsilon < 1/2$ throughout.
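To see why plain averaging is fragile in this model, consider the following small sketch. The gradient values and the helper names are made up for illustration; coordinate-wise median stands in for a generic robust aggregator.

```python
from statistics import median

def mean_aggregate(grads):
    """Coordinate-wise average of equal-length gradient vectors."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

def coordinate_median(grads):
    """Coordinate-wise median of equal-length gradient vectors."""
    return [median(col) for col in zip(*grads)]

# Four honest outputs near (1, -2) and one arbitrary Byzantine output.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
byzantine = [[1e6, 1e6]]
received = honest + byzantine

avg = mean_aggregate(received)      # dominated by the Byzantine vector
med = coordinate_median(received)   # stays close to the honest values
```

A single Byzantine vector moves the average arbitrarily far, while the median stays with the honest majority in each coordinate.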
3 Detox: A Redundancy Framework to Filter most Byzantine Gradients
We now describe Detox, a framework for Byzantine-resilient mini-batch SGD with $p$ nodes, $q$ of which are Byzantine. Let $b$ be the desired batch size, and let $r$ be an odd integer. We refer to $r$ as the redundancy ratio. For simplicity, we will assume $r$ divides $p$ and that $p$ divides $b$. Detox can be directly extended to the setting where this does not hold.
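Detox first computes a random partition of $[p] = \{1, \ldots, p\}$ into $p/r$ node groups $A_1, \ldots, A_{p/r}$, each of size $r$. This will be fixed throughout. We then initialize at some $w_0$. For $t \geq 0$, we wish to compute some approximation to the gradient update in (1). To do so, we need a Byzantine-robust estimate of the true gradient. Fix $t$, and let us suppress the notation when possible. As in mini-batch SGD, let $S_t$ be a subset of $[n]$ of size $b$, with each element sampled uniformly at random from $[n]$. We then partition $S_t$ into $p/r$ groups $S_t^1, \ldots, S_t^{p/r}$ of size $br/p$. For each $i \in A_j$, the PS assigns node $i$ the task of computing
$$g_j := \frac{1}{|S_t^j|} \sum_{k \in S_t^j} \nabla f_k(w_t) = \frac{p}{rb} \sum_{k \in S_t^j} \nabla f_k(w_t). \qquad (2)$$
If $i$ is an honest node, then its output is $\hat{g}_i = g_j$, while if $i$ is Byzantine, it outputs some $d$-dimensional $\hat{g}_i$. The $\hat{g}_i$ are then sent to the PS. The PS then computes
$$z_j := \mathrm{maj}\left(\{\hat{g}_i : i \in A_j\}\right),$$
where $\mathrm{maj}$ denotes the majority vote. If there is no majority, we set $z_j = 0$. We will refer to $z_j$ as the “vote” of group $A_j$.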
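The majority-vote filter can be sketched as follows. This is an illustrative sketch, assuming honest nodes in a group return bit-identical vectors (so exact equality is a valid comparison); the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the vector sent by a strict majority of the group.

    `outputs` is the list of vectors received from one node group.
    If no vector has a strict majority, return the zero vector.
    """
    counts = Counter(tuple(v) for v in outputs)
    vec, cnt = counts.most_common(1)[0]
    if cnt > len(outputs) // 2:  # strict majority
        return list(vec)
    return [0.0] * len(outputs[0])

# Group of r = 3 nodes: two honest copies outvote one Byzantine output.
vote = majority_vote([[1.0, 2.0], [1.0, 2.0], [9.0, 9.0]])
# Three pairwise-different outputs: no majority, so the vote is zero.
no_majority = majority_vote([[1.0], [2.0], [3.0]])
```

A group therefore produces a Byzantine vote only when Byzantine nodes hold the majority inside it, which is what makes small odd group sizes effective.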
Since some of these votes are still Byzantine, we must do some robust aggregation of the votes. We employ a hierarchical robust aggregation process HierAggr, which uses two user-specified aggregation methods $\mathcal{A}_0$ and $\mathcal{A}_1$. First, the votes are partitioned into $k$ groups. Let $\hat{z}_1, \ldots, \hat{z}_k$ denote the output of $\mathcal{A}_0$ on each group. The PS then computes $\hat{G} = \mathcal{A}_1(\hat{z}_1, \ldots, \hat{z}_k)$ and updates the model via $w_{t+1} = w_t - \eta_t \hat{G}$. This hierarchical aggregation resembles a median-of-means approach on the votes [20], and has the benefit of improved robustness and efficiency. We discuss this in further detail in Section 4.
A description of Detox is given in Algorithm 1.
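As a rough companion to the description of HierAggr, the sketch below uses the mean as the inner aggregator and the coordinate-wise median as the outer one. The contiguous grouping, the assumption that the number of groups divides the number of votes, and all function names are choices made for this example, not details taken from Algorithm 1.

```python
from statistics import median

def mean(vectors):
    """Coordinate-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coordinate_median(vectors):
    """Coordinate-wise median of equal-length vectors."""
    return [median(col) for col in zip(*vectors)]

def hier_aggr(votes, k, a0=mean, a1=coordinate_median):
    """Split votes into k contiguous groups, apply a0 per group, then a1.

    Assumes k divides len(votes); a real implementation may partition
    the votes differently.
    """
    size = len(votes) // k
    groups = [votes[i * size:(i + 1) * size] for i in range(k)]
    return a1([a0(g) for g in groups])

# Six one-dimensional votes, the last one Byzantine.
votes = [[1.0], [1.0], [1.0], [1.0], [1.0], [100.0]]
estimate = hier_aggr(votes, k=3)  # group means 1.0, 1.0, 50.5
```

The Byzantine vote corrupts only the one group mean it falls into, and the outer median discards that group.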
3.1 Filtering out Almost Every Byzantine Node
We now show that Detox filters out the vast majority of Byzantine gradients. Fix the iteration $t$. Recall that all honest nodes in a node group $A_j$ send $\hat{g}_i = g_j$ as in (2) to the PS. If $A_j$ has more honest nodes than Byzantine nodes, then $z_j = g_j$ and we say $z_j$ is honest. If not, then $z_j$ may not equal $g_j$, in which case $z_j$ is a Byzantine vote. Let $X_j$ be the indicator variable for whether block $A_j$ has more Byzantine nodes than honest nodes, and let $\hat{q} = \sum_j X_j$. This is the number of Byzantine votes. By filtering, Detox goes from a Byzantine compute node ratio of $\epsilon = q/p$ to a Byzantine vote ratio of $\hat{\epsilon}$, where $\hat{\epsilon} = \hat{q} r / p$.
We first show that $\mathbb{E}[\hat{q}]$ decreases exponentially with $r$, while the number of votes $p/r$ only decreases linearly with $r$. That is, by incurring a constant factor loss in compute resources, we gain an exponential improvement in the reduction of Byzantine nodes. Thus, even small $r$ can drastically reduce the Byzantine ratio of votes. This observation will allow us to instead use robust aggregation methods on the $z_j$, i.e., the votes, greatly improving our Byzantine robustness. We have the following theorem about $\mathbb{E}[\hat{q}]$. All proofs can be found in the appendix. Note that throughout, we did not focus on optimizing constants.
Theorem 1.
There is a universal constant such that if the fraction of Byzantine nodes is , then the effective number of Byzantine votes after filtering becomes
We now wish to use this to derive high-probability bounds on $\hat{q}$. While the variables $X_j$ are not independent, they are negatively correlated. By using a version of Hoeffding’s inequality for weakly dependent variables, we can show that if the redundancy is logarithmic, i.e., $r \approx \log q$, then with high probability the number of effective Byzantine votes drops to a constant, i.e., $\hat{q} = O(1)$.
Corollary 2.
There is a constant such that if and and then for any , with probability at least , we have that .
In the next section, we exploit this dramatic reduction of Byzantine votes to derive strong robustness guarantees for Detox.
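The effect of the redundancy ratio on the expected number of Byzantine votes can be checked numerically. The sketch below is not the paper's proof: it uses the observation that, under a uniformly random partition, the number of Byzantine nodes in any one group follows a hypergeometric distribution, and then applies linearity of expectation across groups. The function names and the example sizes (45 nodes, 5 Byzantine) are hypothetical.

```python
from math import comb

def prob_byzantine_majority(p, q, r):
    """P(a uniformly random group of r out of p nodes has a Byzantine majority),
    when q of the p nodes are Byzantine (hypergeometric tail, r odd)."""
    total = comb(p, r)
    bad = sum(
        comb(q, j) * comb(p - q, r - j)
        for j in range((r + 1) // 2, min(q, r) + 1)
    )
    return bad / total

def expected_byzantine_votes(p, q, r):
    """Expected number of Byzantine votes among the p/r groups.
    Linearity of expectation holds even though groups are dependent."""
    return (p // r) * prob_byzantine_majority(p, q, r)

# Hypothetical setting: 45 nodes, 5 Byzantine, growing redundancy.
e3 = expected_byzantine_votes(45, 5, 3)
e5 = expected_byzantine_votes(45, 5, 5)
e9 = expected_byzantine_votes(45, 5, 9)
```

In this toy setting the expectation shrinks quickly as the group size grows, in line with the qualitative behavior stated above.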
4 Detox Improves the Speed and Robustness of Robust Estimators
Using the results of the previous section, if we set the redundancy ratio to $r \approx \log q$, the filtering stage of Detox reduces the number of Byzantine votes to roughly a constant. While we could apply some robust aggregator directly to the output votes of the filtering stage, such methods often scale poorly with the number of votes $p/r$. By instead applying HierAggr, we greatly improve efficiency and robustness. Recall that in HierAggr, we partition the votes into $k$ “vote groups”, apply some $\mathcal{A}_0$ to each group, and apply some $\mathcal{A}_1$ to the $k$ outputs of $\mathcal{A}_0$. We analyze the case where $k$ is roughly constant, $\mathcal{A}_0$ computes the mean of its inputs, and $\mathcal{A}_1$ is a robust aggregator. In this case, HierAggr is analogous to the Median of Means (MoM) method from robust statistics [20].
Improved speed.
Suppose that without redundancy, the time required for the compute nodes to finish is $T$. Applying Krum [5], Multi-Krum [10], and Bulyan [19] to their $p$ outputs requires $O(p^2 d)$ operations, so their overall runtime is $O(T + p^2 d)$. In Detox, the compute nodes require $r$ times more computation to evaluate redundant gradients. If $r \approx \log q$, this can be done in $O(\log(q)\, T)$. With HierAggr as above, Detox performs three major operations: (1) majority voting, (2) mean computation of the $k$ vote groups, and (3) robust aggregation of these $k$ means using $\mathcal{A}_1$. (1) and (2) require $O(pd)$ time. For practical $\mathcal{A}_1$ aggregators, including Multi-Krum and Bulyan, (3) requires $O(k^2 d)$ time. Since $k \ll p$, Detox has runtime $O(\log(q)\, T + pd)$. If $T = O(d)$ (which generally holds for gradient computations), Krum, Multi-Krum, and Bulyan require $O(p^2 d)$ time, but Detox only requires $O((p + \log q)\, d)$ time. Thus, Detox can lead to significant speedups, especially when the number of workers is large.
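A back-of-the-envelope operation count makes the gap concrete. The sketch assumes a pairwise-distance aggregator that costs on the order of (number of inputs)² × dimension, which is the commonly cited cost for Krum-style methods, and counts one pass over the received vectors for voting. Constants are ignored, and the sizes used are hypothetical.

```python
def krum_style_cost(p, d):
    """Rough operation count for a pairwise-distance aggregator on p vectors."""
    return p * p * d

def detox_ps_cost(p, d, r, k):
    """Rough PS-side operation count for Detox with HierAggr.

    vote:   one pass over all p received d-dimensional vectors
    means:  averaging the p/r votes inside the k vote groups
    robust: pairwise-distance aggregator applied to only k means
    """
    vote = p * d
    means = (p // r) * d
    robust = k * k * d
    return vote + means + robust

# Hypothetical sizes: 100 nodes, 10 coordinates, r = 5, k = 4.
baseline = krum_style_cost(100, 10)
detox = detox_ps_cost(100, 10, r=5, k=4)
```

The count covers only the parameter-server side; the compute nodes still pay the factor-of-redundancy overhead in gradient evaluation.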
Improved robustness.
To analyze robustness, we first need some distributional assumptions. At any given iteration, let $G$ denote the full gradient of $F$. Throughout this section, we assume that the gradient of each sample is drawn from a distribution $\mathcal{D}$ on $\mathbb{R}^d$ with mean $G$ and variance $\sigma^2$. In Detox, the “honest” votes $z_j$ will also have mean $G$, but their variance will be $\sigma^2 p / (rb)$. This is because each honest compute node gets a sample of size $rb/p$, so its variance is reduced by a factor of $rb/p$.
Suppose $\hat{G}$ is some approximation to the true gradient $G$. We say that $\hat{G}$ is a $\Delta$-inexact gradient oracle for $G$ if $\|\hat{G} - G\| \leq \Delta$. [27] shows that access to a $\Delta$-inexact gradient oracle is sufficient to upper bound the error of a model produced by performing gradient updates with $\hat{G}$. To bound the robustness of an aggregator, it suffices to bound $\Delta$. Under the distributional assumptions above, we will derive bounds on $\Delta$ for the hierarchical aggregator with different base aggregators $\mathcal{A}_1$.
We will analyze Detox when $\mathcal{A}_0$ computes the mean of the vote groups, and $\mathcal{A}_1$ is geometric median, coordinate-wise median, or trimmed mean [26]. We will denote the approximation $\hat{G}$ to $G$ computed by Detox in these three instances by $\hat{G}_1$, $\hat{G}_2$, and $\hat{G}_3$, respectively. Using the proof techniques in [20], we get the following.
Theorem 3.
Assume and where is the constant from Corollary 2. There are constants such that for all , with probability at least :

If , then is a inexact gradient oracle.

If , then is a inexact gradient oracle.

If and , then is a inexact gradient oracle.
The above theorem has three important implications. First, we can derive robustness guarantees for Detox that are virtually independent of the Byzantine ratio $\epsilon$. Second, even when there are no Byzantine machines, it is known that no aggregator can achieve $\Delta = o(\sigma/\sqrt{b})$ [18], and because we achieve $\Delta = \tilde{O}(\sigma/\sqrt{b})$, we cannot expect to get an order of magnitude better robustness from any other aggregator. Third, other than a logarithmic dependence on $q$, there is no dependence on the number of nodes $p$. Even as $p$ and $q$ increase, we still maintain roughly the same robustness guarantees.
By comparison, the robustness guarantees of Krum and Geometric Median applied directly to the compute nodes worsen as $p$ increases [4, 23]. Similarly, [26] show that if we apply coordinate-wise median to $p$ nodes, each of which is assigned $b/p$ gradients, we get a $\Delta$-inexact gradient oracle where . If $\epsilon$ is constant and $p$ is comparable to $b$, then this is roughly , whereas Detox can produce a $\Delta$-inexact gradient oracle for . Thus, the robustness of Detox can scale much better with the number of nodes than naive robust aggregation of gradients.
5 Experiments
In this section we present an experimental study on pairing Detox with a set of previously proposed robust aggregation methods, including Multi-krum [4], Bulyan [19], and coordinate-wise median [27]. We also incorporate Detox with a recently proposed Byzantine-resilient distributed training method, signSGD with majority vote [3]. We conduct extensive experiments on the scalability and robustness of these Byzantine-resilient methods, and the improvements gained when pairing them with Detox. All our experiments are deployed on real distributed clusters under various Byzantine attack models. Our implementation is publicly available for reproducibility at https://github.com/hwang595/DETOX.
The main findings are as follows: 1) applying Detox leads to significant speedups, e.g., up to an order of magnitude end-to-end training speedup is observed; 2) in defending against state-of-the-art Byzantine attacks, Detox leads to significant Byzantine-resilience, e.g., applying Bulyan on top of Detox improves the test-set prediction accuracy from 11% to 60% when training VGG13-BN on CIFAR-100 under the “a little is enough” (ALIE) [2] Byzantine attack. Moreover, incorporating signSGD with Detox improves the test-set prediction accuracy from to when defending against a constant Byzantine attack for ResNet-18 trained on CIFAR-10.
5.1 Experimental Setup
We implemented vanilla versions of the aforementioned Byzantine-resilient methods, as well as versions of these methods paired with Detox, in PyTorch [21] with MPI4py [9]. Our experimental comparisons are deployed on a cluster of m5.2xlarge instances on Amazon EC2, where 1 node serves as the PS and the remaining nodes are compute nodes. In all following experiments, we set the number of Byzantine nodes to be .
In each iteration of the vanilla Byzantine resilient methods, each compute node evaluates gradients sampled from its partition of data while in Detox each compute node evaluates times more gradients where , so . The average of these locally computed gradients is then sent back to the PS. After receiving all gradient summations from the compute nodes, the PS applies either vanilla Byzantine resilient methods or their Detox paired variants.
5.2 Implementation of Detox
We emphasize that Detox is not simply a new robust aggregation technique. It is instead a general Byzantine-resilient distributed training framework, and any robust aggregation method can be immediately implemented on top of it to increase its Byzantine-resilience and scalability. Note that after the majority voting stage on the PS one has a wide range of choices for $\mathcal{A}_0$ and $\mathcal{A}_1$. In our implementations, we had the following setups: 1) $\mathcal{A}_0$ = Mean, $\mathcal{A}_1$ = Coordinate-wise Median; 2) $\mathcal{A}_0$ = Multi-krum, $\mathcal{A}_1$ = Mean; 3) $\mathcal{A}_0$ = Bulyan, $\mathcal{A}_1$ = Mean; and 4) $\mathcal{A}_0$ = coordinate-wise majority vote, $\mathcal{A}_1$ = coordinate-wise majority vote (designed specifically for pairing Detox with signSGD). We tried $\mathcal{A}_0$ = Mean and $\mathcal{A}_1$ = Multi-krum/Bulyan, but we found that setups 2) and 3) had better resilience than these choices. More details on the implementation and system-level optimizations that we performed can be found in Appendix B.1.
Byzantine attack models
We consider two Byzantine attack models for pairing Multi-krum, Bulyan, and coordinate-wise median with Detox. First, we consider the “reversed gradient” attack, where adversarial nodes that were supposed to send $g$ to the PS instead send $-c\,g$, for some $c > 0$.
The second Byzantine attack model we study is the recently proposed ALIE [2] attack, where the Byzantine compute nodes collude and use their locally calculated gradients to estimate the mean and standard deviation of the entire set of gradients among all other compute nodes. The Byzantine nodes then use the estimated mean and standard deviation to manipulate the gradient they send back to the PS. To be more specific, Byzantine nodes will send $\hat{\mu} + z \cdot \hat{\sigma}$, where $\hat{\mu}$ and $\hat{\sigma}$ are the mean and standard deviation estimated by the Byzantine nodes and $z$ is a hyperparameter which was tuned empirically in [2].
Then, to compare the resilience of vanilla signSGD and the version paired with Detox, we consider a simple attack, the constant Byzantine attack. In the constant Byzantine attack, Byzantine compute nodes simply send a constant gradient matrix with dimension equal to that of the true gradient, where all elements are equal to . Under this attack, and specifically for signSGD, the Byzantine gradients will mislead model updates towards wrong directions and corrupt the final model trained via signSGD.
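The three attack models can be sketched as simple vector transformations. Everything numeric here is an assumption for illustration: the scale `c`, the ALIE parameter `z` (including its sign convention), and the constant `value` are hypothetical defaults, not the settings used in the experiments.

```python
from statistics import mean, pstdev

def reversed_gradient(g, c=100.0):
    """Send the true gradient scaled by -c (c > 0 is a hypothetical choice)."""
    return [-c * x for x in g]

def alie(colluder_grads, z=1.0):
    """ALIE-style vector: per-coordinate mean plus z times the per-coordinate
    standard deviation, both estimated from the colluders' own gradients."""
    cols = list(zip(*colluder_grads))
    return [mean(col) + z * pstdev(col) for col in cols]

def constant_attack(dim, value=-1.0):
    """Send a vector whose entries all equal one fixed (hypothetical) value."""
    return [value] * dim

rev = reversed_gradient([1.0, -2.0], c=2.0)
shifted = alie([[1.0, 0.0], [3.0, 0.0]], z=1.0)
const = constant_attack(3)
```

The ALIE vector stays within a few standard deviations of the honest mean, which is why it is harder for distance-based aggregators to reject than the other two attacks.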
Datasets and models
We conducted our experiments over ResNet-18 [15] on CIFAR-10 and VGG13-BN [22] on CIFAR-100. For each dataset, we use data augmentation (random crops and flips) and normalize each individual image. Moreover, we tune the learning rate schedule and use a constant momentum of in all experiments. The details of parameter tuning and dataset normalization are reported in Appendix B.2.
5.3 Results
Scalability
We report a per-iteration runtime analysis of the aforementioned robust aggregations and their Detox-paired variants on both CIFAR-10 over ResNet-18 and CIFAR-100 over VGG13-BN. The results on ResNet-18 and VGG13-BN are shown in Figures 2 and 3, respectively.
We observe that although Detox requires slightly more compute time per iteration, due to its algorithmic redundancy, it largely reduces the PS computation cost during the aggregation stage, which matches our theoretical analysis. Surprisingly, we observe that by applying Detox, the communication costs decrease. This is because the variance of computation time among compute nodes increases with heavier computational redundancy. Therefore, after applying Detox, compute nodes tend not to send their gradients to the PS at the same time, which mitigates potential network bandwidth congestion. In a nutshell, applying Detox can lead to up to a 3× per-iteration speedup.
Byzantine-resilience under various attacks
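Table 1: Test-set prediction accuracy of each method and its Detox-paired variant (prefix D-) under the ALIE attack [2].
| Methods | ResNet-18 | VGG13-BN |
|---|---|---|
| D-Multi-krum | 80.3% | 42.98% |
| D-Bulyan | 76.8% | 46.82% |
| D-Med. | 86.21% | 59.51% |
| Multi-krum | 45.24% | 17.18% |
| Bulyan | 42.56% | 11.06% |
| Med. | 43.7% | 8.64% |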
We first study the Byzantine-resilience of all methods and baselines under the ALIE attack, which is, to the best of our knowledge, the strongest Byzantine attack known. The results on ResNet-18 and VGG13-BN are shown in Figures 2 and 3, respectively. Applying Detox leads to significant improvement in Byzantine-resilience compared to vanilla Multi-krum, Bulyan, and coordinate-wise median on both datasets, as shown in Table 1.
We then consider the reverse gradient attack; the results are shown in Figure 4. Since reverse gradient is a much weaker attack, all vanilla robust aggregation methods and their Detox-paired variants defend well.
Moreover, applying Detox leads to significant end-to-end speedups. In particular, combining the coordinate-wise median with Detox led to a speedup gain in the amount of time needed to reach 90% test-set prediction accuracy for ResNet-18 trained on CIFAR-10. The speedup results are shown in Figure 5. For the experiment where VGG13-BN was trained on CIFAR-100, up to an order of magnitude end-to-end speedup can be observed with coordinate-wise median applied on top of Detox.
For completeness, we also compare versions of Detox with Draco [7]. This is not the focus of this work, as we are primarily interested in showing that Detox improves the robustness of traditional robust aggregators. However, the comparisons with Draco can be found in Appendix B.4.
Comparison between Detox and signSGD
We compare Detox-paired signSGD with vanilla signSGD, where only the sign information of each gradient element is sent to the PS. The PS, on receiving the sign information of the gradients, takes coordinate-wise majority votes to get the model update. As is argued in [3], the gradient distribution for many modern deep networks can be close to unimodal and symmetric, hence a random sign flip attack is weak since it will not hurt the gradient distribution. We thus consider the stronger constant Byzantine attack introduced in Section 5.2. To pair Detox with signSGD, after the majority voting stage of Detox, we set both $\mathcal{A}_0$ and $\mathcal{A}_1$ to be the coordinate-wise majority vote described in Algorithm 1 of [3]. For hyperparameter tuning, we follow the suggestion in [3] and set the initial learning rate at . However, in defending against our proposed constant Byzantine attack, we observe that constant learning rates lead to model divergence. Thus, we tune the learning rate schedule and use for both signSGD and Detox-paired signSGD.
The results for both ResNet-18 trained on CIFAR-10 and VGG13-BN trained on CIFAR-100 are shown in Figure 6, where we observe that Detox-paired signSGD improves the Byzantine resilience of signSGD significantly. For ResNet-18 trained on CIFAR-10, Detox improves the test-set prediction accuracy of vanilla signSGD from to . For VGG13-BN trained on CIFAR-100, Detox improves the test-set prediction accuracy (top-1) of vanilla signSGD from to .
Mean estimation on synthetic data
To verify our theoretical analysis, we finally conduct an experiment for a simple mean estimation task. The results of our synthetic mean experiment are shown in Figure 7. In the synthetic mean experiment, we set , and for dimension , we generate samples iid from . The Byzantine nodes instead send a constant vector of the same dimension with norm 100. The robustness of an estimator is reflected in the norm of its mean estimate. Our experimental results show that Detox increases the robustness of geometric median and coordinate-wise median, and decreases the dependence of the error on .
6 Conclusion
In this paper, we present Detox, a new framework for Byzantine-resilient distributed training. Notably, any robust aggregator can be immediately used with Detox to increase its robustness and efficiency. We demonstrate these improvements theoretically and empirically. In the future, we would like to devise a privacy-preserving version of Detox, as it currently requires the PS to be the owner of the data and to partition data among compute nodes. This means that the current version of Detox is not privacy preserving. Overcoming this limitation would allow us to develop variants of Detox for federated learning.
References
 [1] (2018) Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 4618–4628.
 [2] (2019) A little is enough: circumventing defenses for distributed learning. arXiv preprint arXiv:1902.06156.
 [3] (2018) SignSGD with majority vote is communication efficient and fault tolerant. arXiv.
 [4] (2017) Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 119–129.
 [5] (2017) Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, pp. 118–128.
 [6] (2017) Hoeffding’s inequality for sums of weakly dependent random variables. Mediterranean Journal of Mathematics.
 [7] (2018) DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pp. 902–911.
 [8] (2017) Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1 (2), pp. 44.
 [9] (2011) Parallel distributed computing using Python. Advances in Water Resources 34 (9), pp. 1124–1139.
 [10] (2019) AGGREGATHOR: Byzantine machine learning via robust gradient aggregation. Conference on Systems and Machine Learning.
 [11] (2019) AggregaThor: Byzantine machine learning via robust gradient aggregation. In SysML.
 [12] (2018) Data encoding for Byzantine-resilient distributed gradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 863–870.
 [13] (2019) SGD: decentralized Byzantine resilience. arXiv preprint arXiv:1905.03853.
 [14] (2019) Fast and secure distributed learning in high dimension. arXiv preprint arXiv:1905.04374.
 [15] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [16] (1982) The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS) 4 (3), pp. 382–401.
 [17] (2014) Chernoff’s inequality - a very elementary proof. arXiv preprint arXiv:1403.7739.
 [18] (2019) Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics 47 (2), pp. 783–794.
 [19] (2018) The hidden vulnerability of distributed learning in Byzantium. arXiv preprint arXiv:1802.07927.
 [20] (2015) Geometric median and robust estimation in Banach spaces. Bernoulli 21 (4), pp. 2308–2335.
 [21] (2017) Automatic differentiation in PyTorch.
 [22] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [23] (2018) Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116.
 [24] (2018) Zeno: Byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032.
 [25] (2019) Fall of empires: breaking Byzantine-tolerant SGD by inner product manipulation. arXiv preprint arXiv:1903.03936.
 [26] (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In International Conference on Machine Learning, pp. 5636–5645.
 [27] (2018) Defending against saddle point attack in Byzantine-robust distributed learning. CoRR abs/1806.05358.
 [28] (2018) Lagrange coded computing: optimal design for resiliency, security and privacy. arXiv preprint arXiv:1806.00939.
Appendix A Proofs
A.1 Proof of Theorem 1
The following is a more precise statement of the theorem.
Theorem.
If , and then falls as which is exponential in r.
Proof.
By direct computation,
Note that is the coefficient of in the binomial expansion of . Therefore, setting , we find that . Therefore,
Note that since and is odd, we have . Therefore,
∎
For , we have the following lemma.
Lemma 4.
If , then when .
Proof.
∎
A.2 Proof of Corollary 2
From Theorem 1 we see that . Now, straightforward analysis implies that if and then . We will then use the following Lemma:
Lemma 5.
For all ,
Now, using Lemma 5 and assuming ,
where we used the fact that in the first implication and the assumption that in the second. Setting , we get the probability bound. Finally, setting makes , which completes the proof.
A.3 Proof of Lemma 5
We will prove the following:
Proof.
Theorem (Linial [17]).
Let be Bernoulli random variables. Let be such that is a positive integer and let be any positive integer such that . Then
Let . Now, . We will show that
where of size . To see this, note that for any , . The conditional probability of some other being given that is would only reduce. Formally, for ,
Note that for to be , the Byzantine machines in the th block must be in the majority. Hence, conditioning on this event removes more Byzantine machines than honest machines from the leftover pool. Since the total number of Byzantine machines is less than the number of honest machines, the probability that Byzantine machines form a majority in block decreases. Therefore,
Letting , we then have
∎
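The event analyzed in this appendix, a vote group in which Byzantine nodes hold the majority, can also be explored empirically. The following is a minimal Monte Carlo sketch, not the authors' code: the function names and the parameters `p` (total nodes), `q` (Byzantine nodes), and `r` (odd group size) are illustrative assumptions, and nodes are assumed to be assigned to groups uniformly at random.

```python
import random


def count_byzantine_majority_groups(is_byzantine, r):
    """Count vote groups of size r in which Byzantine nodes hold a strict majority.

    is_byzantine: list of booleans, one per node, already in group order
    (nodes 0..r-1 form group 0, and so on). r is assumed to be odd.
    """
    assert len(is_byzantine) % r == 0, "number of nodes must be a multiple of r"
    count = 0
    for start in range(0, len(is_byzantine), r):
        group = is_byzantine[start:start + r]
        if sum(group) > r // 2:  # strict majority of Byzantine nodes
            count += 1
    return count


def estimate_corrupted_groups(p, q, r, trials=2000, seed=0):
    """Estimate the expected number of Byzantine-majority groups
    under a uniformly random assignment of p nodes (q of them Byzantine)."""
    rng = random.Random(seed)
    nodes = [True] * q + [False] * (p - q)
    total = 0
    for _ in range(trials):
        rng.shuffle(nodes)  # random assignment of nodes to groups
        total += count_byzantine_majority_groups(nodes, r)
    return total / trials
```

Calling `estimate_corrupted_groups(45, 5, r)` for increasing odd `r` that divides 45 gives a rough empirical view of how the number of corrupted groups shrinks as the redundancy grows.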
A.4 Proof of Theorem 3
We will adapt the techniques of Theorem 3.1 in [20].
Lemma 6 ([20], Lemma 2).
Let be some Hilbert space, and for , let be their geometric median. Fix and suppose that satisfies , where
and . Then there exists with such that for all , .
Note that for a general Hilbert or Banach space , the geometric median is defined as:
where is the norm on . This coincides with the notion of geometric median in under the norm. Note that Coordinatewise Median is the Geometric Median in the real space with the norm, which forms a Banach space.
Firstly, we use Corollary 2 to see that with probability , . Now, we assume that is true. We will show the remainder of the theorem holds with probability at least , as then a union bound will give us the desired result.
(1): Let us assume that the number of clusters is for some ; note also that . Now, choose . Choose . Assume that the geometric median is more than distance away from the true mean. Then, by the previous lemma, at least fraction of the empirical means of the clusters must lie at least distance away from the true mean. Because we assume the number of clusters is more than , at least fraction of the empirical means of uncorrupted clusters must also lie at least distance away from the true mean.
Recall that the variance of the mean of an “honest” vote group is given by
By applying Chebyshev’s inequality to the uncorrupted vote group , we find that its empirical mean satisfies
Now, we define a Bernoulli event that is 1 if the empirical mean of an uncorrupted vote group is at distance larger than to the true mean, and 0 otherwise. By the computation above, the probability of this event is less than . Thus, its mean is less than , and we want to upper bound the probability that the empirical mean is more than . Taking the number of events to be , we find that this holds with probability at least . Here and in part (3) of this proof, we use the following version of Hoeffding’s inequality. For Bernoulli events with mean , empirical mean , number of events and deviation :
To finish the proof, plug in the values of given in Lemma 2.1 of [20] (stated above), where for the geometric median.
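For reference, the standard one-sided Hoeffding bound for independent Bernoulli variables can be written as follows. The symbols $\mu$ (mean), $\hat{\mu}$ (empirical mean), $n$ (number of events), and $\theta$ (deviation) are assumed notation chosen to match the quantities named above, and the exact form used in the proof may differ in constants or presentation.

```latex
\Pr\left( \hat{\mu} - \mu \ge \theta \right) \le \exp\left( -2 n \theta^{2} \right)
```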
(2): For the coordinate-wise median, we set . We then apply the result proved in the previous part to each dimension of . This gives that, with probability at least ,
where is the coordinate of , is the coordinate of and is the diagonal entry of . Applying a union bound, we get that with probability at least
(3): Define
where is the diagonal entry of . Now, for each uncorrupted vote group, using Chebyshev’s inequality:
Now, the coordinate of the trimmed mean lies away from if at least of the coordinates of the vote group empirical means lie away from . Note that because of the assumption of the Proposition . Because of these can be corrupted, at least of the true empirical means have coordinates that lie away from . This means that fraction of the true empirical means have coordinates that lie away from . Define a Bernoulli variable for a vote group as being 1 if the coordinate of the empirical mean of that vote group lies more than away from , and 0 otherwise.
The mean of therefore satisfies
Set
Again, using Hoeffding’s inequality in a manner analogous to part (1) of the proof, we get that the probability of the coordinate of the trimmed mean being more than away from is less than .
Taking a union bound over all coordinates, we find that the probability of the trimmed mean being more than
away from is less than . Hence we have proved that if
and , then with probability at least , . Now, set and . One can easily see that is satisfied and we get that with probability at least , for some constant ,
Appendix B Extra Experimental Details
B.1 Implementation and system-level optimization details
We describe the details of combining Bulyan, Multi-krum, and coordinate-wise median with Detox.

Bulyan: according to [19], Bulyan requires . In Detox, after the first majority voting level, the corresponding requirement in Bulyan becomes . Thus, we assign all “winning” gradients into one cluster, i.e., Bulyan is conducted across 15 gradients.

Multi-krum: according to [5], Multi-krum requires . Therefore, for a similar reason, we assign the 15 “winning” gradients to two groups of uneven sizes, 7 and 8 respectively.

Coordinate-wise median: for this baseline we follow the theoretical analysis in Section 3.1, i.e., the 15 “winning” gradients are evenly assigned to 5 clusters of size 3 for the reverse gradient Byzantine attack. For the ALIE attack, we assign those 15 gradients evenly to 3 clusters of size 5. We chose these configurations simply because we observed that they perform better in our experiments. The mean of the gradients is then calculated in each cluster. Finally, we take the coordinate-wise median across the means of all clusters.
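The coordinate-wise median configuration described in the last item can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' implementation: gradients are represented as lists of floats, and the function names are hypothetical.

```python
from statistics import median


def cluster_means(gradients, num_clusters):
    """Split the gradients evenly into consecutive clusters and average each cluster."""
    size = len(gradients) // num_clusters
    assert size * num_clusters == len(gradients), "clusters must be of equal size"
    means = []
    for c in range(num_clusters):
        cluster = gradients[c * size:(c + 1) * size]
        # zip(*cluster) iterates over coordinates across the cluster's gradients
        means.append([sum(coord) / size for coord in zip(*cluster)])
    return means


def coordinatewise_median(vectors):
    """Take the median of each coordinate across the given vectors."""
    return [median(coord) for coord in zip(*vectors)]


def hierarchical_median(gradients, num_clusters):
    """Mean within each cluster, then coordinate-wise median across cluster means."""
    return coordinatewise_median(cluster_means(gradients, num_clusters))
```

With 15 “winning” gradients, `hierarchical_median(gradients, 5)` corresponds to the 5 clusters of size 3 used for the reverse gradient attack, and `hierarchical_median(gradients, 3)` to the 3 clusters of size 5 used for the ALIE attack.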
One important point is that we applied system-level optimizations when implementing Multi-krum and Bulyan, e.g., parallelizing the computationally heavy parts, in order to make the comparisons fairer, following [11]. The main ideas of our system-level optimization are twofold: i) the gradients of all layers of a neural network are first vectorized and concatenated into a high-dimensional vector, and robust aggregation is then applied to those high-dimensional gradient vectors from all compute nodes; ii) several methods contain computationally heavy parts, e.g., calculating medians in the second stage of Bulyan. To optimize such parts, we chunk the high-dimensional gradient vectors evenly into pieces and parallelize the median calculations across the pieces. Our system-level optimization leads to 24 speedup in the robust aggregation stage.
B.2 Hyperparameter tuning
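The two ideas above (flatten all layer gradients into one vector, then chunk the coordinates and compute medians per chunk in parallel) can be sketched as follows. This is an illustrative sketch only: the helper names and the default `num_chunks` are assumptions, and the thread pool merely shows the structure of the computation; a real implementation would rely on vectorized tensor operations or process-level parallelism to obtain actual speedups.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def flatten_layers(layer_grads):
    """Concatenate per-layer gradients (lists of floats) into one flat vector."""
    return [x for layer in layer_grads for x in layer]


def chunk_bounds(dim, num_chunks):
    """Split range(dim) into num_chunks contiguous, nearly even (start, end) pieces."""
    base, extra = divmod(dim, num_chunks)
    bounds, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def median_of_chunk(vectors, start, end):
    """Coordinate-wise median restricted to coordinates start..end-1."""
    return [median(v[j] for v in vectors) for j in range(start, end)]


def parallel_coordinatewise_median(vectors, num_chunks=4):
    """Coordinate-wise median computed chunk by chunk in a thread pool.

    Assumes all vectors share the same dimension, which is at least 1.
    """
    dim = len(vectors[0])
    bounds = chunk_bounds(dim, min(num_chunks, dim))
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        # pool.map preserves the order of the chunks
        parts = pool.map(lambda b: median_of_chunk(vectors, *b), bounds)
        return [x for part in parts for x in part]
```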
Experiments  CIFAR10 on ResNet18  CIFAR100 on VGG13BN 
DMultikrum  0.1  0.1 
DBulyan  0.1  0.1 
DMed.  
Multikrum  0.03125  0.03125 
Bulyan  0.1  0.1 
Med.  0.1 
Experiments  CIFAR10 on ResNet18  CIFAR100 on VGG13BN 

DMultikrum  
DBulyan  
DMed.  
Multikrum  
Bulyan  
Med. 
B.3 Data augmentation and normalization details
In preprocessing the images in the CIFAR10/100 datasets, we follow the standard data augmentation and normalization process. For data augmentation, random cropping and random horizontal flipping are used. Each color channel is normalized with mean and standard deviation given by , . Each pixel in a channel is normalized by subtracting the mean value of that color channel and then dividing by the standard deviation of that color channel.
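The per-channel normalization and the horizontal flip can be sketched in plain Python as follows. This is a hedged illustration, not the preprocessing code used in the experiments: an image is assumed to be a list of channels, each a list of rows of pixel values, the function names are hypothetical, and the `mean` and `std` arguments must be supplied by the caller.

```python
import random


def normalize_image(image, mean, std):
    """Subtract the channel mean from every pixel and divide by the channel std."""
    return [
        [[(pixel - m) / s for pixel in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]


def random_horizontal_flip(image, rng, prob=0.5):
    """Mirror every row of every channel with probability prob."""
    if rng.random() < prob:
        return [[row[::-1] for row in channel] for channel in image]
    return image


# Placeholder statistics for illustration only; these are NOT the values used in the paper.
example_mean = (0.5, 0.25, 0.0)
example_std = (0.25, 0.5, 1.0)
example_image = [[[0.5, 1.0]], [[0.25, 0.75]], [[0.0, 1.0]]]
normalized = normalize_image(example_image, example_mean, example_std)
```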
B.4 Comparison between Detox and Draco
We provide experimental results comparing Detox with Draco.